Web Data Extraction

Scope:

Develop a system using Apache Nutch, Apache Hadoop and Apache Solr to crawl up to 100 pages (configurable) per given website on a round-robin basis and store them automatically on Hadoop in a folder named after each website.
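
A minimal sketch of the per-site storage step, assuming the Hadoop FileSystem Java API, a /crawl root folder and URL-encoded file names (the path layout and file naming are assumptions, not part of this brief):

    import java.net.URL;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SiteFolderStore {

        // Assumed HDFS root under which one sub-folder per website is created.
        private static final String CRAWL_ROOT = "/crawl";

        /** Stores one fetched page under /crawl/<website-host>/<encoded-url>. */
        public static void store(String pageUrl, String html) throws Exception {
            String host = new URL(pageUrl).getHost();                 // folder name = website name
            String fileName = URLEncoder.encode(pageUrl, "UTF-8");    // URL kept recoverable in the name

            FileSystem fs = FileSystem.get(new Configuration());
            Path siteDir = new Path(CRAWL_ROOT, host);
            fs.mkdirs(siteDir);                                       // one folder per website

            try (FSDataOutputStream out = fs.create(new Path(siteDir, fileName), true)) {
                out.write(html.getBytes(StandardCharsets.UTF_8));
            }
        }
    }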

Some websites require authentication, i.e. a user ID and password. The system should therefore be able to supply the user ID and password dynamically at runtime by reading them from a text file or an XML configuration file. It should be able to store multiple user credentials and supply them on a round-robin basis.
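
A sketch of such a round-robin credential provider, assuming the credentials sit in a Java XML properties file with keys user.1/pass.1, user.2/pass.2, ... (the file format and key names are assumptions):

    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.atomic.AtomicInteger;

    public class CredentialProvider {

        public static final class Credential {
            public final String user;
            public final String password;
            Credential(String user, String password) { this.user = user; this.password = password; }
        }

        private final List<Credential> credentials = new ArrayList<>();
        private final AtomicInteger counter = new AtomicInteger();

        /** Loads credentials from an XML properties file (user.1/pass.1, user.2/pass.2, ...). */
        public CredentialProvider(String xmlFile) throws Exception {
            Properties props = new Properties();
            try (FileInputStream in = new FileInputStream(xmlFile)) {
                props.loadFromXML(in);
            }
            for (int i = 1; props.containsKey("user." + i); i++) {
                credentials.add(new Credential(props.getProperty("user." + i),
                                               props.getProperty("pass." + i)));
            }
        }

        /** Returns the next credential in round-robin order. */
        public Credential next() {
            int i = Math.floorMod(counter.getAndIncrement(), credentials.size());
            return credentials.get(i);
        }
    }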

Crawled pages will be stored in the respective site folders on Apache Hadoop.

Crawled page contents and metadata will be stored and indexed in Solr with the fields listed under "Solr Fields" below (an indexing sketch follows that field list).

All documents (PDF, video, audio, DOC, DOCX, JPEG, PNG, etc.) will be stored in folders with clear identification, i.e. keyed by URL, so that the web page can be reconstructed from the stored content.
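
One simple way to keep the URL recoverable from the stored files is to use the encoded URL itself as the file name, matching the storage sketch above; the naming scheme is an assumption, not specified by the project:

    import java.net.URLDecoder;
    import java.net.URLEncoder;

    public class UrlFileNames {

        /** URL -> file name used on Hadoop (assumed naming scheme). */
        public static String toFileName(String url) throws Exception {
            return URLEncoder.encode(url, "UTF-8");
        }

        /** File name -> original URL, so a stored document can be tied back to its page. */
        public static String toUrl(String fileName) throws Exception {
            return URLDecoder.decode(fileName, "UTF-8");
        }
    }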

The crawl will be a focused crawl: the metadata is extracted first and passed to an API which either accepts or rejects the page. If accepted, the whole page content is extracted and processed further. The API will be provided as part of the project.
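
The pass/fail API itself will be supplied with the project; the shape below is only a hypothetical illustration of where it sits in the flow (the interface name and method signature are assumptions):

    import java.util.Map;

    /** Hypothetical stand-in for the pass/fail API supplied with the project. */
    public interface PageFilter {

        /**
         * @param url      the URL of the fetched page
         * @param metadata metadata extracted from the page (e.g. keywords, description)
         * @return true if the full page content should be extracted and processed further
         */
        boolean accept(String url, Map<String, String> metadata);
    }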

Solr Fields:

• Site

• Title

• Host

• Segment

• Boost

• Digest

• Time Stamp

• Url

• Site Content (Text)

• Site Content (HTML)

• Metadata (Keywords, Content)

• Metadata (Description, Content)
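
A sketch of indexing one crawled page with the fields above, assuming a recent SolrJ client and a Solr core named "crawl" (the core name and the exact schema field names are assumptions):

    import java.util.Date;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexer {

        public static void index(String site, String title, String host, String segment,
                                 float boost, String digest, Date timestamp, String url,
                                 String contentText, String contentHtml,
                                 String metaKeywords, String metaDescription) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/crawl").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", url);                    // unique key, assumed to be the URL
                doc.addField("site", site);
                doc.addField("title", title);
                doc.addField("host", host);
                doc.addField("segment", segment);
                doc.addField("boost", boost);
                doc.addField("digest", digest);
                doc.addField("tstamp", timestamp);
                doc.addField("url", url);
                doc.addField("content_text", contentText);
                doc.addField("content_html", contentHtml);
                doc.addField("meta_keywords", metaKeywords);
                doc.addField("meta_description", metaDescription);
                solr.add(doc);
                solr.commit();
            }
        }
    }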

Input:

[url removed, login to view] (URLs)

Typical Steps:

1. The first step is to load the URL State database with an initial set of URLs. These can be a broad set of top-level domains such as the 1.7 million web sites with the highest US-based traffic, or the results from selective searches against another index, or manually selected URLs that point to specific, high quality pages.

2. Once the URL State database has been loaded with some initial URLs, the first loop in the focused crawl can begin. The first step in each loop is to extract all of the unprocessed URLs, and sort them by their link score.

3. Next comes one of the two critical steps in the workflow. A decision is made about how many of the top-scoring URLs to process in this loop.

4. Once the set of accepted URLs has been created, the standard fetch process begins. This includes all of the usual steps required for polite & efficient fetching, such as [url removed, login to view] processing. Pages that are successfully fetched can then be parsed.

5. Typically, fetched pages are also saved into the Fetched Pages database.

6. The decision on whether a page is to be crawled is made by the given object: the metadata is passed to the object, and if the object returns true the page is crawled; otherwise it is discarded.

7. Page rank computation: calculate the importance of the page based on the scoring algorithm provided by Nutch/Solr.

8. Once the page has been scored, each outlink found in the parse is extracted.

9. The score for the page is divided among all of the outlinks.

10. Finally, the URL State database is updated with the results of the fetch attempts (succeeded, failed), all newly discovered URLs are added, and existing URLs have their link score increased by all matching outlinks extracted during this loop (a score-update sketch follows this list).
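
A sketch of steps 9 and 10, dividing a page's score evenly among its outlinks and folding the result back into a simplified, in-memory URL state map; the real URL State database would live on Hadoop, and the data structures here are assumptions:

    import java.util.List;
    import java.util.Map;

    public class UrlStateUpdate {

        /** Simplified per-URL state; the real URL State database would be stored on Hadoop. */
        public static final class UrlState {
            public double linkScore;
            public String status = "unfetched";   // unfetched | succeeded | failed
        }

        /** Steps 9 and 10: divide the page score among outlinks and update the state map. */
        public static void update(Map<String, UrlState> urlState, String pageUrl,
                                  String fetchStatus, double pageScore, List<String> outlinks) {
            // Record the result of the fetch attempt for this page.
            urlState.computeIfAbsent(pageUrl, u -> new UrlState()).status = fetchStatus;

            if (outlinks.isEmpty()) {
                return;
            }
            double share = pageScore / outlinks.size();   // step 9: score divided among the outlinks
            for (String outlink : outlinks) {
                // Step 10: add newly discovered URLs and raise the link score of known ones.
                urlState.computeIfAbsent(outlink, u -> new UrlState()).linkScore += share;
            }
        }
    }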

Part II.

Classification of extracted pages

1. Run the pages through the classification API.

2. Depending on the classification returned, store the page in the corresponding folder along with its relevance score.
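
A sketch of this classification step, assuming the classification API returns a label plus a relevance score; the ClassificationResult shape, the /classified root path and the ".score" side file are all assumptions:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClassifiedStore {

        /** Hypothetical result returned by the classification API. */
        public static final class ClassificationResult {
            public final String label;        // e.g. "news", "product", ...
            public final double relevance;    // relevance score for the page
            public ClassificationResult(String label, double relevance) {
                this.label = label;
                this.relevance = relevance;
            }
        }

        /** Stores the page under /classified/<label>/ together with its relevance score. */
        public static void store(String encodedUrlFileName, String html,
                                 ClassificationResult result) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/classified", result.label);
            fs.mkdirs(dir);

            try (FSDataOutputStream out = fs.create(new Path(dir, encodedUrlFileName), true)) {
                out.write(html.getBytes(StandardCharsets.UTF_8));
            }
            // Keep the relevance score next to the page content.
            try (FSDataOutputStream out = fs.create(new Path(dir, encodedUrlFileName + ".score"), true)) {
                out.write(Double.toString(result.relevance).getBytes(StandardCharsets.UTF_8));
            }
        }
    }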

Output:

Crawled pages will be stored in the respective site folders on Apache Hadoop.

Crawled page contents and metadata will be stored and indexed in Solr.

Tools and Techniques:

Apache Nutch, Solr, Apache Hadoop

Local system

Test Cases:

1. Check the crawled data and XML files in the respective folders.

2. Search for query parameters in the XML and text files.
