Web Scraper

$30-100 USD

Cancelled

Posted over 15 years ago

Paid on delivery

We are looking for a web scraper to be built in PHP (not a WebSite Project) with a MySQL database. Rather than using regular expressions, the scraper should parse the pages using HTML Agility Pack. [login to view URL] The project needs to be started right away and be finished in about (email me first).

## Deliverables

The basic program flow of the scraper should be as follows:

- Read a list of starting URLs from the database.
- Scan the page for product information. Info to scrape: product name, description, retail price, sale price, brand, product URL, image URL, in stock, sizes, colors, SKU, scrape date, expiration date (if applicable).
- Insert the information into a database table.
- Go to subsequent pages, scanning and inserting into the database until they are all scraped.

The web-based program needs to have the following features:

- Be able to scrape just one product on a given page (product detail page) or scrape a series of products on a page and then all subsequent pages.
  - Example of one product: [login to view URL]
  - Example of many products to scrape, with other pages to drill down to and scrape: [login to view URL]
- The program needs to be able to run on a schedule and also on demand.
- Insert gathered data into an MS SQL database. We will provide the table schemas.
- The scraper should not insert duplicate items, but if the price, size, or color has changed it should add the item as a new entry while keeping a reference to the original item it duplicates. These updates should be flagged somehow so we know they are new changes.
- The scraper should be able to detect "bad" data or page layout changes so we know to update the scraper.
- The scraper needs to be an asynchronous, multithreaded application. Since many sites and pages are being scraped, we need to be able to see the progress as it is running. And since many page hits will be required, it needs to be multithreaded.
- The scraper should be able to run behind a proxy server if necessary.
- Every site we scrape will need to have its own "template" which lets the scraper know how to find the data to extract. This is where HTML Agility Pack will be used. If it's easier to do this using regular expressions, then those can be used.
- We should be able to easily create new "templates" for other pages we want to scrape in the future. The scraper should be smart enough to know when a template doesn't match the site it's scraping.
- Along with the scraping templates, we need a way to specify how the scraper gets to the next page and all following pages until they are all scraped. We must be able to specify this for each website.
- Provide a function with the following signature that will be able to figure out the domain being scraped, pick the appropriate "template" to use, and also know how to get to subsequent pages. This assumes we have a predefined list of templates to use when the project is finished.

## Platform

PHP and MySQL Database
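To make two of the deliverables above more concrete (picking a template by domain, and skipping duplicates while flagging price/size/color changes), here is a rough sketch. It is written in Python purely for illustration, since the posting asks for PHP, and it is not the requested function signature, which is not given here. Everything named in it is hypothetical: the `Template` fields, the `TEMPLATES` registry, the XPath strings, the `original_id` and `is_update` columns, and the assumption that the caller matches items by SKU.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Template:
    """Hypothetical per-site template: where the data and the next page are."""
    domain: str
    product_selector: str    # assumed: XPath matching each product block
    next_page_selector: str  # assumed: XPath matching the "next page" link


# Hypothetical registry; in practice it would probably be loaded from the database.
TEMPLATES = {
    "example.com": Template(
        "example.com", "//div[@class='product']", "//a[@rel='next']"
    ),
}


def pick_template(url: str) -> Optional[Template]:
    """Return the template registered for the URL's domain, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return TEMPLATES.get(host)


# Fields whose change turns a duplicate into a new, flagged entry.
TRACKED_FIELDS = ("retail_price", "sale_price", "sizes", "colors")


def plan_insert(existing: Optional[dict], scraped: dict) -> Optional[dict]:
    """Decide what to insert for one scraped item.

    existing is the most recent stored row for the same item (assumed to be
    looked up by SKU by the caller), or None if the item has not been seen.
    Returns the row to insert, or None for an unchanged duplicate.
    """
    if existing is None:
        return {**scraped, "original_id": None, "is_update": False}
    if all(existing.get(f) == scraped.get(f) for f in TRACKED_FIELDS):
        return None  # nothing tracked has changed, so skip the insert
    # Point at the first version of the item, not at an intermediate update.
    original_id = existing.get("original_id")
    if original_id is None:
        original_id = existing["id"]
    return {**scraped, "original_id": original_id, "is_update": True}
```

A `None` result from `pick_template` only covers the case where no template exists for the domain. Detecting a layout change on a known site would need an additional check, for example treating a page where the product selector matches nothing as "bad" data.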

Odd Jobs

Project ID: 3424987