Reusing OCR results

Posted: Tue Jun 04, 2013 12:43 am
by Gemini64
Hi,
I have a process that should make a PDF searchable and should also verify the OCR results against business logic.

Is there a way to reuse the OCR results of OCR_MakeSearchable for the business logic process, or
the OCR results of OCRp_Page for OCR_MakeSearchable, so I don’t have to OCR a 40-page document twice?

Thanks a lot
Robert

Re: Reusing OCR results

Posted: Tue Jun 04, 2013 4:27 pm
by Walter-Tracker Supp
Yes, you can re-use results; the function OCRp_Page() will return a pointer to page information that remains valid until explicitly freed with OCRp_FreePage().

So, in pseudocode:

Code:

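// One result slot per page of the 40-page document
PXO_Page pages[40];

// OCR each page once and keep the result
for (int nPage = 0; nPage < 40; ++nPage)
    OCRp_Page(doc, nPage, options, &pages[nPage], &settings);

// Reuse the same results wherever they are needed (e.g. business logic)
DoStuff(pages);

// Release pages and free memory
for (int nPage = 0; nPage < 40; ++nPage)
    OCRp_FreePage(&pages[nPage]);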
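
To illustrate the ownership pattern outside the SDK, here is a minimal C sketch. Everything in it is a hypothetical stand-in (PageResult, ocr_page_stub, free_page_stub, process_once); none of it is the OCR SDK's real API. It only shows results being produced once per page, read by two independent consumers, and freed once at the end.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-page OCR result; NOT the real PXO_Page. */
typedef struct {
    char *text;               /* recognized text, owned by this struct */
} PageResult;

static int g_ocr_calls = 0;   /* counts how often the expensive step runs */

/* Hypothetical stand-in for OCRp_Page(): "recognizes" one page. */
static int ocr_page_stub(int page_index, PageResult *out)
{
    static const char sample[] = "INVOICE 2013-0042";
    (void)page_index;
    out->text = malloc(sizeof sample);
    if (out->text == NULL)
        return -1;
    memcpy(out->text, sample, sizeof sample);
    ++g_ocr_calls;
    return 0;
}

/* Hypothetical stand-in for OCRp_FreePage(). */
static void free_page_stub(PageResult *page)
{
    free(page->text);
    page->text = NULL;
}

/* Consumer 1: a business-logic style check on the cached results. */
static int count_pages_containing(const PageResult *pages, int n, const char *needle)
{
    int hits = 0;
    for (int i = 0; i < n; ++i)
        if (pages[i].text != NULL && strstr(pages[i].text, needle) != NULL)
            ++hits;
    return hits;
}

/* Consumer 2: total number of characters across all cached pages. */
static size_t total_text_length(const PageResult *pages, int n)
{
    size_t total = 0;
    for (int i = 0; i < n; ++i)
        if (pages[i].text != NULL)
            total += strlen(pages[i].text);
    return total;
}

/* Runs the stub OCR once per page, feeds both consumers, frees everything.
   Returns the number of OCR calls made, or -1 if a page could not be stored. */
static int process_once(int n_pages, int *hits, size_t *length)
{
    assert(n_pages > 0 && hits != NULL && length != NULL);
    PageResult *pages = calloc((size_t)n_pages, sizeof *pages);
    if (pages == NULL)
        return -1;

    g_ocr_calls = 0;
    int status = 0;
    for (int i = 0; i < n_pages && status == 0; ++i)
        status = ocr_page_stub(i, &pages[i]);

    if (status == 0) {
        *hits = count_pages_containing(pages, n_pages, "INVOICE");
        *length = total_text_length(pages, n_pages);
    }

    for (int i = 0; i < n_pages; ++i)
        free_page_stub(&pages[i]);   /* freeing an unfilled slot is harmless */
    free(pages);
    return status == 0 ? g_ocr_calls : -1;
}
```

Calling process_once(40, &hits, &length) runs the stub OCR step 40 times in total, even though two consumers read the results.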


Re: Reusing OCR results

Posted: Tue Jun 04, 2013 5:43 pm
by Walter-Tracker Supp
Also, the results of OCR_MakeSearchable() remain valid until the document is freed with OCR_Delete(). You can work with multiple documents at once by creating one input document per file with OCR_Init() and OCR_Load()/OCR_LoadW().

In pseudocode:

Code:

// ocr first document
PXODocument doc1;
OCR_Init(..., doc1, ...);
OCR_MakeSearchable(..., doc1, ...);

// ocr second document
PXODocument doc2;
OCR_Init(..., doc2, ...);
OCR_MakeSearchable(..., doc2, ...);

// call your function on the documents
DoStuffWithBoth(doc1, doc2);

// Free memory / invalidate documents
OCR_Delete(doc1);
OCR_Delete(doc2);

Re: Reusing OCR results

Posted: Fri Jun 07, 2013 9:06 am
by Gemini64
Cool, thanks a lot :)
Regards Robert

Re: Reusing OCR results

Posted: Fri Jun 07, 2013 9:11 am
by Tracker Supp-Stefan
:)

Re: Reusing OCR results

Posted: Fri Jun 07, 2013 11:36 am
by Gemini64
Is there any functionality that re-reads what OCR_MakeSearchable wrote to the PDF, i.e. that reconstructs the output area?
Sometimes PDF documents already arrive searchable, so such functionality would avoid the need to OCR them again.
It would also be useful for running independent services: one that OCRs documents, and business services that use the OCR results.

Regards
Robert

Re: Reusing OCR results

Posted: Fri Jun 07, 2013 4:25 pm
by Walter-Tracker Supp
The OCR SDK does not support this; OCR results persist after an OCR job is performed and until you free the OCR document object with OCR_Delete(), but they cannot be directly recovered (e.g. into a PXO_Page object) from an already OCR'd document on disk.

You can extract text from documents using the PDF Tools SDK, but there is no function that informs you whether it is OCR text or not. You would have to come up with your own logic to analyze page contents, and decide whether or not the text was OCR text or regular document text. In particular you might have to check the properties of extracted text and see that it is invisible.
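
As a rough sketch of the invisibility check: in the PDF format, text drawn with text rendering mode 3 (set by the Tr operator) is neither filled nor stroked, which is how OCR text layers are commonly written. The TextRun struct, the function names and the 90% threshold below are all hypothetical; how you obtain the rendering mode of each extracted run from the PDF Tools SDK is not shown here and would need to be checked against its manual.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one extracted text run; the PDF Tools SDK's
   real structures are not shown in this thread and will differ. */
typedef struct {
    int    render_mode;   /* PDF text rendering mode (Tr operator), 0..7 */
    size_t char_count;    /* number of characters in the run */
} TextRun;

/* PDF text rendering mode 3: neither fill nor stroke, i.e. invisible text. */
enum { PDF_TEXT_RENDER_INVISIBLE = 3 };

/* Returns the fraction of characters drawn invisibly (0.0 if there is no text). */
static double invisible_fraction(const TextRun *runs, size_t n)
{
    size_t total = 0, invisible = 0;
    assert(runs != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        total += runs[i].char_count;
        if (runs[i].render_mode == PDF_TEXT_RENDER_INVISIBLE)
            invisible += runs[i].char_count;
    }
    return total == 0 ? 0.0 : (double)invisible / (double)total;
}

/* Heuristic (assumed threshold): treat the page as carrying an OCR text
   layer when at least 90% of its characters are invisible. */
static int looks_like_ocr_layer(const TextRun *runs, size_t n)
{
    return invisible_fraction(runs, n) >= 0.9;
}
```

A page with only visible text, or with no text at all, is reported as not having an OCR layer; tune the threshold to your own documents.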

However, if you know that the documents you are processing are one of the following types:

A: Plain scan without OCR text
B: Scan that has been OCR'd and contains an invisible text layer

You might write a routine to try to extract text (using PDF Tools SDK), and if it succeeds in finding text you will know it is a type "B" document, and if it fails you can assume it is a type "A" document and can then run OCR. Writing the logic for this would be up to you, though the PDF Tools SDK manual contains information on text extraction, and usage examples, in section 3.1.10.
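
The type A / type B decision could look roughly like the following C sketch. It assumes you have already extracted the document's text into a plain C string by some other means (for example with the PDF Tools SDK); the extraction call itself is deliberately left out, and DocKind, classify_document, needs_ocr and the min_chars threshold are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { DOC_PLAIN_SCAN, DOC_ALREADY_SEARCHABLE } DocKind;

/* Counts non-whitespace characters, so stray spaces or line breaks
   in the extracted text do not count as real content. */
static size_t count_non_space(const char *text)
{
    size_t n = 0;
    if (text == NULL)
        return 0;
    for (; *text != '\0'; ++text)
        if (!isspace((unsigned char)*text))
            ++n;
    return n;
}

/* Type "B" if extraction yielded at least min_chars non-space characters,
   otherwise type "A". min_chars is an assumed tuning knob. */
static DocKind classify_document(const char *extracted_text, size_t min_chars)
{
    assert(min_chars > 0);
    return count_non_space(extracted_text) >= min_chars
               ? DOC_ALREADY_SEARCHABLE
               : DOC_PLAIN_SCAN;
}

/* Convenience wrapper: nonzero when the document should be sent to OCR. */
static int needs_ocr(const char *extracted_text, size_t min_chars)
{
    return classify_document(extracted_text, min_chars) == DOC_PLAIN_SCAN;
}
```

This only works under the assumption stated above, that every input is either a plain scan or an already OCR'd scan; a document with ordinary visible text would also be classified as type "B".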

Re: Reusing OCR results

Posted: Sun Jun 09, 2013 7:24 pm
by Gemini64
Thanks a lot, Walter,
I had already guessed something like that,
but before I reinvent the wheel, I prefer to ask whether someone has done it before me :-)

Regards
Robert

Re: Reusing OCR results

Posted: Mon Jun 10, 2013 10:55 am
by Tracker Supp-Stefan
:)