Crawl a provided set of websites for email addresses
$250-750 USD
Completed
Posted about 11 years ago
Paid on delivery
You will receive a large CSV file (approx. 1.2 million rows) of names of professors at American universities. For each professor, the URL of the university is listed as well.
Your job will be to write software that can crawl each website and look for pages on which the professor's name appears, and extract email addresses from there. The goal is to obtain one or more email addresses for each professor.
Since it's impossible to determine simply from the name and the URL which email address corresponds to the professor, one potential approach is to retrieve multiple pages on which the name appears and on which at least one email address appears as well (using a regex). Then, rank the email addresses based on how frequently they appear. The address that appears most often is likely to be the correct one. Example:
page 1: John Smith, [login to view URL](at)[login to view URL]
page 2: John Smith, [login to view URL](at)[login to view URL]
page 3: John Smith, [login to view URL](at)[login to view URL]
page 4: John Smith, [login to view URL](at)[login to view URL]
From this example it is pretty clear that the address appearing on the most pages is likely to be the correct one.
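The frequency-ranking idea above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `rank_emails`, the simplified email regex, the `(url, text)` page format, and the `example.edu` addresses are all hypothetical, and the name check is a plain case-insensitive substring match.

```python
import re
from collections import Counter

# Simplified email pattern (an assumption; real-world addresses can be messier).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def rank_emails(pages, name_parts):
    """Rank addresses by the number of pages that mention the name.

    pages: iterable of (url, text) pairs already fetched by the crawler.
    name_parts: e.g. ("John", "Smith"); every part must appear on the page.
    Returns a list of (rank, email, page_count, first_url_seen).
    """
    counts = Counter()
    first_seen = {}
    for url, text in pages:
        lowered = text.lower()
        if not all(part.lower() in lowered for part in name_parts):
            continue  # skip pages that do not mention the professor
        # Count each address once per page; sort for a deterministic order.
        for email in sorted({m.lower() for m in EMAIL_RE.findall(text)}):
            counts[email] += 1
            first_seen.setdefault(email, url)
    return [
        (rank, email, n, first_seen[email])
        for rank, (email, n) in enumerate(counts.most_common(), start=1)
    ]

pages = [
    ("http://example.edu/a", "John Smith, jsmith@example.edu"),
    ("http://example.edu/b", "John Smith jsmith@example.edu webmaster@example.edu"),
    ("http://example.edu/c", "Jane Doe, jdoe@example.edu"),
]
for row in rank_emails(pages, ("John", "Smith")):
    print(row)
```

Counting each address once per page keeps a footer that repeats the same address many times from dominating the ranking.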
The output of your software, provided in CSV or other database-readable format, should contain the professor ID (from the input file) and one or more email addresses, each with a rank. Each row should also contain the URL of the page where the address was found.
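As a rough illustration of that output, the sketch below writes one row per candidate address. The column names (`professor_id`, `rank`, `email`, `source_url`) and the function name `write_results` are assumptions, not something specified in this posting.

```python
import csv
import io

def write_results(rows, out):
    """Write (professor_id, rank, email, source_url) tuples as CSV.

    The header names are hypothetical; adjust them to the agreed format.
    """
    writer = csv.writer(out)
    writer.writerow(["professor_id", "rank", "email", "source_url"])
    for professor_id, rank, email, source_url in rows:
        writer.writerow([professor_id, rank, email, source_url])

# Example with a made-up address on a placeholder domain.
buf = io.StringIO()
write_results([(2, 1, "skhuri@example.edu", "http://example.edu/people")], buf)
print(buf.getvalue())
```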
Here are a few sample rows from the input file:
ID  Name                Department        InstitutionID  InstitutionName     State  Location          URL
1   Obaid, Evelyn       Computer Science  881            Obaid, Evelyn       CA     San Jose, CA      [login to view URL]
2   Khuri, Sami         Computer Science  881            Khuri, Sami         CA     San Jose, CA      [login to view URL]
3   Beeson, Michael     Computer Science  881            Beeson, Michael     CA     San Jose, CA      [login to view URL]
15  Kubelka, Richard    Mathematics       881            Kubelka, Richard    CA     San Jose, CA      [login to view URL]
18  Lin, Ty             Computer Science  881            Lin, Ty             CA     San Jose, CA      [login to view URL]
29  Key, Scott          Philosophy        145            Key, Scott          CA     Riverside, CA     [login to view URL]
45  Lash, Jamie         Foundations       1230           Lash, Jamie         TX     Dallas, TX        [login to view URL]
47  Swain, John         Physics           696            Swain, John         MA     Boston, MA        [login to view URL]
48  Signorielli, Nancy  Communication     1094           Signorielli, Nancy  DE     Newark, DE        [login to view URL]
57  Frederick, Joan     English           457            Frederick, Joan     VA     Harrisonburg, VA  [login to view URL]
To save you time, one possibility is to query Google using their API for pages that contain the name of each professor and are on the domain provided. Example (this is from the second row above):
Query: "Khuri, Sami site:[login to view URL]"
[login to view URL]
As you can see the first result in this case is actually a very good page to collect the email from:
[login to view URL]
Generally speaking, the first 10-20 results are very likely to contain the correct address.
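A minimal sketch of building such a query string from an input row is shown below. It only constructs the text of the query and does not call any search API; the function name `build_site_query` and the `example.edu` URL are hypothetical.

```python
from urllib.parse import urlsplit

def build_site_query(name, url):
    """Return a 'Name site:host' query string for one input row.

    Accepts URLs with or without a scheme (e.g. 'www.example.edu').
    """
    # urlsplit only recognizes the host if a scheme or leading '//' is present.
    host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    return f"{name} site:{host}"

print(build_site_query("Khuri, Sami", "http://www.example.edu/cs/"))
```

Restricting the query to the host rather than the full URL lets results from any page on the university's site match, which fits the goal of finding several pages per professor.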
Once again, the deliverable of this project is a text (CSV or TSV) file containing one or more email addresses for each professor, ranked by probability of being correct.
The project must be delivered in at most 1 month.
I've done many similar projects, and I already have a module to start with. It crawls every university website from the CSV, looking for the name and for anything matching an email pattern. It then looks at the left side of each email address; in most cases the person's name appears in part of this string, since most universities, if not all, build professors' email addresses from their names.
We can make a small AI engine that checks all the patterns of the name and compares them to the left side of each address found on the same page. Here regex is of paramount importance.
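One way the name-pattern check might look is sketched below. This is not the bidder's actual module: the pattern list, the function names `name_patterns` and `local_part_score`, and the 0/1/2 scoring are all assumptions chosen for illustration.

```python
import re

def name_patterns(last, first):
    """Common ways a local part is built from a name (an assumed, partial list)."""
    f = re.sub(r"[^a-z]", "", first.lower())
    l = re.sub(r"[^a-z]", "", last.lower())
    if not f or not l:
        return set()
    return {
        f + l, f + "." + l, f + "_" + l,   # samikhuri, sami.khuri, sami_khuri
        l + "." + f, l,                    # khuri.sami, khuri
        f[0] + l, f[0] + "." + l,          # skhuri, s.khuri
        l + f[0], f + l[0],                # khuris, samik
    }

def local_part_score(email, last, first):
    """2 = local part equals a pattern, 1 = contains one, 0 = no match."""
    local = email.split("@", 1)[0].lower()
    patterns = name_patterns(last, first)
    if local in patterns:
        return 2
    # Ignore very short patterns to limit accidental substring hits.
    if any(p in local for p in patterns if len(p) >= 3):
        return 1
    return 0

print(local_part_score("skhuri@example.edu", "Khuri", "Sami"))
```

A score like this could be combined with the page-frequency rank, though short surnames will still produce some accidental substring matches.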
The Google API will be a way to confirm that the result is correct.
Anyway, I'm confident I can complete this project to your fullest satisfaction.
Hope to work for you soon.
I've read your project specs fully and carefully. They are very well written. I can definitely code this scraper for you; it's my specialty ;). I will send you a message with my proposed approach. Also, my bid is very negotiable. I wouldn't charge over $200 for this. Regards
Hi,
Good Day!!!
Having read the project description, I am willing to work on this. I have extensive experience in web crawling in any language.
Thanks