**PLEASE READ CAREFULLY BEFORE BIDDING** - We need to extract data and images from several catalogs on the Internet. Each catalog contains hundreds of item records, each with fields such as *Title & Description*, *Price*, and an *Image*. These must be tabulated and delivered in one folder per catalog, containing:
1) An Excel file (xls format) with 3 columns: Title & Description, Price, and Image Name, and
2) A separate file containing the images, each named exactly as indicated in the corresponding Excel file under "Image Name"
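To make the per-catalog deliverable concrete, here is a minimal sketch of how one catalog folder could be assembled. It is only an illustration under stated assumptions: the function name `write_catalog`, the `images` subfolder, the zero-padded image names, and the file name `catalog.csv` are all hypothetical, and CSV is used purely as a stand-in so the sketch runs with the standard library. The real deliverable must be an xls file, which would need a spreadsheet library.

```python
import csv
from pathlib import Path


def write_catalog(folder, records):
    """Write one catalog folder: a 3-column table plus the image files.

    `records` is an iterable of (title_and_description, price, image_bytes, ext)
    tuples. This input shape is an assumption made for the sketch.
    """
    folder = Path(folder)
    images_dir = folder / "images"  # hypothetical location for the images
    images_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for index, (text, price, image_bytes, ext) in enumerate(records, start=1):
        # The image name written to the table is exactly the name used on disk.
        image_name = f"{index:05d}{ext}"
        (images_dir / image_name).write_bytes(image_bytes)
        rows.append([text, price, image_name])

    # CSV stand-in for the required xls file (assumption, see lead-in).
    with open(folder / "catalog.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title & Description", "Price", "Image Name"])
        writer.writerows(rows)
    return rows
```

The point of the sketch is the invariant, not the file format: every value in the "Image Name" column corresponds to exactly one image file with that name.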
Examples of URLs and more detailed instructions can be provided upon request, ***so please ask if you are not sure before bidding on this project.***
**I have a total of about 65 - 70 separate URLs that need to be extracted. NOTE THAT each URL contains 20 - 30 catalogs of 100 - 1200 records with images to be extracted. Please bid for the entire project for *ALL* URLs (*NOT* per URL). This comes to at least several hundred thousand records in total, and possibly close to a million.**
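As a rough check on the scale, the ranges quoted in this brief can be multiplied out. The helper name `record_range` is just for illustration, and the second call uses the typical catalog size of 300 - 600 records mentioned under Deliverables.

```python
def record_range(urls, catalogs_per_url, records_per_catalog):
    """Return (low, high) total record counts from three (min, max) ranges."""
    low = urls[0] * catalogs_per_url[0] * records_per_catalog[0]
    high = urls[1] * catalogs_per_url[1] * records_per_catalog[1]
    return low, high


# Absolute bounds: 100 - 1200 records per catalog.
print(record_range((65, 70), (20, 30), (100, 1200)))  # (130000, 2520000)

# Typical catalogs: 300 - 600 records per catalog.
print(record_range((65, 70), (20, 30), (300, 600)))   # (390000, 1260000)
```

These are counts before any filtering, so the number of records actually delivered may be lower.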
**IMPORTANT** - You must have experience with extraction patterns and with databases or MySQL/PHP. **This is *NOT* a Copy & Paste or Data Entry project**. Although most catalogs are fairly easy, some are challenging and require deep scanning to open images or to flip through pages. Writing the extraction pattern code should normally take 5-10 minutes. The actual data extraction takes several hours, but it should run automatically and can be done on a separate computer or server. **Must have a *FAST* Internet connection.**
Please ask any questions...
## Deliverables
So that you can estimate the work involved ***before bidding***, we can provide sample URLs where the catalogs are posted. Each URL contains at least 20 - 30 separate catalogs, and each catalog contains anywhere from 100 - 1200 records (a typical catalog is about 300 - 600 records long).
All catalogs under a given URL follow the same pattern, so you only need to determine the extraction pattern once per URL and then apply it to every catalog under that URL. However, because this extraction also involves images, please note that it can be time consuming and ***it is advised that you have a fairly fast Internet connection and a separate computer on which to run the extractions, so that they do not interfere with your other work.***
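The "one pattern per URL" idea can be sketched as follows. Everything about the markup here is hypothetical (the `item` and `price` class names, the tag order), since the real catalog pages are not shown in this brief. The sketch only illustrates defining a pattern once and reusing it across all catalogs of a site.

```python
import re

# Hypothetical markup for one site; a real site needs its own pattern.
ITEM_PATTERN = re.compile(
    r'<div class="item">\s*'
    r'<h3>(?P<title>.*?)</h3>\s*'
    r'<span class="price">(?P<price>.*?)</span>\s*'
    r'(?:<img src="(?P<image>[^"]+)"\s*/?>)?',  # the image is optional
    re.DOTALL,
)


def extract_records(page_html):
    """Apply the site's single pattern to one catalog page."""
    return [m.groupdict() for m in ITEM_PATTERN.finditer(page_html)]


def extract_site(catalog_pages):
    """Reuse the same pattern for every catalog under one URL.

    `catalog_pages` maps a catalog name to its already-downloaded HTML.
    """
    return {name: extract_records(html) for name, html in catalog_pages.items()}
```

Downloading the pages and images, and paging through multi-page catalogs, are left out. Those are the slow steps that are best run unattended on a separate machine.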
Some filters will apply, which will shorten the extraction. For example, only items that have images should be extracted; records with no image can be skipped. Also, only items with a price greater than 0 (i.e. items that have actually sold) should be extracted.
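The two filters named above could be expressed roughly like this. The record shape (a dict with `image` and `price` keys) and the price format (an optional `$` and thousands separators) are assumptions for the sketch. Unparseable prices are treated as 0 and therefore skipped.

```python
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Turn a price string such as "$1,200.00" into a Decimal; 0 if unparseable."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return Decimal(0)


def keep_record(record):
    """Keep only records that have an image and a price greater than 0."""
    has_image = bool(record.get("image"))
    return has_image and parse_price(record.get("price") or "") > 0


records = [
    {"title": "Vase", "image": "/img/1.jpg", "price": "$120.00"},
    {"title": "Lamp", "image": None, "price": "$45.00"},       # no image: skipped
    {"title": "Clock", "image": "/img/3.jpg", "price": "$0"},  # unsold: skipped
]
kept = [r for r in records if keep_record(r)]
print([r["title"] for r in kept])  # ['Vase']
```

Applying the filters before downloading images would avoid fetching files for records that are going to be skipped anyway.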