SOLR search engine from OCR'ed, indexed PDFs

The attached Word document is ESSENTIAL to understanding this project, as it contains very important images. I will ask whether you have read the attached brief before I accept your bid. This is a short description of the project; please read the attached document for the whole story.

We need a SOLR search engine built from old, multi-page PDFs. All of the indexed documents will be PDFs, and many will need to go through OCR first. We will probably use something like Foxit to do the image-to-text conversion. We know the output will be messy, but the text will only be used in the indexing process. When a user does a search, s/he will access the PDF directly.

Note: All of our work is in Java. This will be running on a large Linux server.

This project is not that simple though. Let’s take a look at this example > [url removed, login to view]

We will want to index this 30-page document, but it contains more than one form (unique section). State Oil & Gas sites will often put an entire wellbore’s files in a single PDF, so 20 years of paperwork can be sitting in one file. If we index it as-is and return results with a 30 to 100-page PDF attached, the user will never be able to find the single mention of their search string after opening the very long PDF file.

For this reason, we need to break the 30+ page PDF into individual pages, OCR each one, and index each page separately. When doing a search, the user is actually searching individual pages. We tell the user we found the queried text on page 19 of the PDF. S/he clicks to get the full 30 pages, but knows to go to page 19. We may even load the PDF in a frame and keep a header at the top that reminds the user to look on page 19. There may also be multiple mentions of the search query in a single PDF file.
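The page-level indexing described above could be modeled roughly as follows. This is only a sketch under assumptions: the field names (`id`, `pdf_id`, `page`, `ocr_text`), the `_p` id suffix, and the file name are hypothetical and not part of the brief, and the actual PDF splitting, OCR, and Solr calls are left out. It shows one index document per page, plus how page-level hits could be grouped back into "this PDF, these pages" for display.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class PageIndexSketch {

    /** Builds a unique id for one page of one PDF (hypothetical scheme). */
    static String pageDocId(String pdfId, int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("page numbers are 1-based");
        }
        return pdfId + "_p" + pageNumber;
    }

    /** Field map for one page-level index document (field names are assumptions). */
    static Map<String, Object> toFields(String pdfId, int pageNumber, String ocrText) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", pageDocId(pdfId, pageNumber));
        fields.put("pdf_id", pdfId);      // lets results point back to the original PDF
        fields.put("page", pageNumber);   // the page the user should jump to
        fields.put("ocr_text", ocrText);  // searched, but never shown to the user
        return fields;
    }

    /** Groups page-level hits by original PDF, with page numbers in ascending order. */
    static Map<String, TreeSet<Integer>> groupHitsByPdf(List<Map<String, Object>> hits) {
        Map<String, TreeSet<Integer>> byPdf = new LinkedHashMap<>();
        for (Map<String, Object> hit : hits) {
            String pdfId = (String) hit.get("pdf_id");
            Integer page = (Integer) hit.get("page");
            byPdf.computeIfAbsent(pdfId, k -> new TreeSet<>()).add(page);
        }
        return byPdf;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> hits = new ArrayList<>();
        hits.add(toFields("wellbore-files.pdf", 19, "(messy OCR text)"));
        hits.add(toFields("wellbore-files.pdf", 3, "(messy OCR text)"));
        System.out.println(groupHitsByPdf(hits));
    }
}
```

In a real build, each field map would be sent to Solr through a client such as SolrJ, and `hits` would come from a Solr query response rather than being constructed by hand.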

A lot of it will be nasty-looking. The documentation goes back 50+ years, to the days of typewriters.

If this all seems pretty impossible, you are right. In fact, we believe the OCR will be so incomplete in places that we cannot even show a snippet (10-20 words) of text on the search results page, because it would be nonsensical. But this is OK. If we can OCR 70% of the data from these PDFs, that’s 70% we didn’t have yesterday. And no one will ever see the OCR text to complain about how incomplete it is…

Why are we going to all this effort? We plan on using SOLR to build a metadata engine around these documents. We are less interested in the content of each page and more interested in the page type, for example the fact that a particular wellbore even has a C-144 form. We'd like to get as much data as we can, but we realize we won't be able to get it all.

The end user will probably do very little “free text” searching of SOLR. Instead, we will process 10,000 of our own search phrases (tokenization and algorithms), e.g. “Tank Closure” or “C-144”, and build a table of all the document types that are inside the PDFs for each wellbore. We may tell a user that wellbore [Removed by Admin - please see Section 13 of our Terms and Conditions]
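To illustrate the phrase-driven classification described above, here is a minimal sketch. It is a stand-in, not the intended design: the brief has the phrases run against SOLR, whereas this version does plain in-memory matching on one page of OCR text. The document type labels (`TANK_CLOSURE`, `FORM_C144`) and the normalization rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PhraseTaggerSketch {

    /**
     * Lower-cases the text and turns every run of non-alphanumeric characters
     * into a single space, then pads with spaces so matches fall on word
     * boundaries ("C-144" becomes " c 144 ").
     */
    static String normalize(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
        return " " + s + " ";
    }

    /** Returns the document types whose phrase occurs in this page's OCR text. */
    static List<String> tagPage(String ocrText, Map<String, String> phraseToType) {
        String page = normalize(ocrText);
        List<String> types = new ArrayList<>();
        for (Map.Entry<String, String> entry : phraseToType.entrySet()) {
            String phrase = normalize(entry.getKey());
            if (page.contains(phrase) && !types.contains(entry.getValue())) {
                types.add(entry.getValue());
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Hypothetical phrase table; the real one would hold ~10,000 entries.
        Map<String, String> phrases = new LinkedHashMap<>();
        phrases.put("Tank Closure", "TANK_CLOSURE");
        phrases.put("C-144", "FORM_C144");

        System.out.println(tagPage("FORM C-144:  Tank\nClosure plan", phrases));
    }
}
```

Exact matching like this will miss phrases that the OCR has mangled, so a production version would likely need fuzzy matching on top; that is outside the scope of this sketch.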

Now it starts to make sense why we are breaking apart all the PDFs for OCR and indexing. We may store pages 1, 2, 3, 4 and 5 in a database row for wellbore [Removed by Admin - please see Section 13 of our Terms and Conditions]
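One possible reading of the page grouping mentioned above is that consecutive pages belonging to the same form are stored together. The sentence is cut off in this posting, so the actual row layout is unknown; the sketch below only assumes that a set of matching page numbers gets collapsed into contiguous ranges (pages 1 to 5 become one range) before being written to a row. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

final class PageRangeSketch {

    /** Collapses page numbers into sorted, contiguous {first, last} ranges. */
    static List<int[]> toRanges(Collection<Integer> pages) {
        List<int[]> ranges = new ArrayList<>();
        // TreeSet sorts the pages and drops duplicates.
        for (int page : new TreeSet<Integer>(pages)) {
            int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && page == last[1] + 1) {
                last[1] = page;                       // extend the current range
            } else {
                ranges.add(new int[] {page, page});   // start a new range
            }
        }
        return ranges;
    }

    /** Formats ranges for display or storage, e.g. "1-5, 9". */
    static String format(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(r[0]);
            if (r[1] > r[0]) {
                sb.append('-').append(r[1]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(toRanges(Arrays.asList(3, 1, 2, 5, 4, 9))));
    }
}
```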

We cannot stress this enough: the user never sees the OCR text or the broken-apart PDFs. That would be way too confusing. Instead, we will direct the user to open the original PDF and go to page 6 or page 1 or page 27 to read further about a tank disclosure for this particular wellbore.
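Sending the user to a specific page of the original PDF could be sketched as below. Many browser PDF viewers honor a `#page=N` fragment on a PDF URL, but support varies by viewer, so treat that as an assumption to verify rather than a guarantee. The example URL and the reminder wording are hypothetical.

```java
final class PdfPageLinkSketch {

    /** Builds a link to the original PDF that asks the viewer to open at a page. */
    static String pageLink(String pdfUrl, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page numbers are 1-based");
        }
        // Drop any existing fragment so only one "#page=" is present.
        int hash = pdfUrl.indexOf('#');
        String base = hash >= 0 ? pdfUrl.substring(0, hash) : pdfUrl;
        return base + "#page=" + page;
    }

    /** Text for a header shown above the framed PDF (wording is an assumption). */
    static String reminder(String query, int page) {
        return "\"" + query + "\" was found on page " + page + " of this PDF.";
    }

    public static void main(String[] args) {
        System.out.println(pageLink("https://example.com/docs/wellbore.pdf", 19));
        System.out.println(reminder("Tank Closure", 19));
    }
}
```

Keeping the page number in a visible header as well as in the link means the user still knows where to look if the viewer ignores the fragment.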

Expect 10-15 million PDFs. If this work is good, we have many more follow-on projects that we will LOVE for you to work on.

OK! That should be enough to communicate the main purpose of this project. Please read the attached document which has more detailed information about the entire project.

Skills: Apache Solr, Data Extraction, Java, OCR, PDF


About the Employer:
( 9 reviews ) Oklahoma City, United States

Project ID: #15634361

8 freelancers are bidding on average $2259 for this job


I have worked with lucene search with java so I understand fully about solr search. I also worked on OCR tech like Tesseract, ephesoft , Asprise etc. I understand how OCRs work. I can really help you. Relevant Skills More

$2222 USD in 20 days
(17 Reviews)

Hi, I would like to discuss about how we have converted many ideas into successfully running businesses. Please let me know when you are available for discussion. Relevant Skills and Experience We have developed 300+ More

$2500 USD in 30 days
(5 Reviews)

Hi I review the word file and understand the requirements. I propose to use C# to provide the index file and a simple GUI that you can put your queries. Relevant Skills and Experience Algorithm Proposed Milestones $3 More

$3000 USD in 20 days
(12 Reviews)

You can see my last projects based on Apache Solr, Data Extraction, Java, OCR, PDF and I can complete your project perfectly. Relevant Skills and Experience We have 10+ years More

$2000 USD in 30 days
(23 Reviews)

Hi Sir/Mam, It is being my pleasure to introduce you to me. I am expert in write handwritten text and any conversion (Format Example: [login to view URL], .txt, .pdf, .[login to view URL] & .jpeg etc ). Relevant Skills and Experience excel, More

$1850 USD in 10 days
(20 Reviews)

We have already worked on something of this sort. We are team of Scientists and Developers having rich experience with Artificial Intelligence and Machine Learning Techniques like Neural and NLP Relevant Skills and More

$2500 USD in 30 days
(2 Reviews)
$2500 USD in 30 days
(1 Review)
$1500 USD in 14 days
(1 Review)