Reusing OCR results

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and Imaging Tools and to provide the developer with an expanding set of professional tools for Optical Character Recognition tasks.

Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan

Gemini64
User
Posts: 13
Joined: Thu May 02, 2013 10:21 am

Reusing OCR results

Post by Gemini64 »

Hi,
I have a process that should make a PDF searchable and should also verify the OCR results against business logic.

Is there a way to reuse the OCR results of OCR_MakeSearchable for the business logic process, or
the OCR results of OCRp_Page for OCR_MakeSearchable, so I don't have to OCR a 40-page document twice?

Thanks a lot
Robert
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Reusing OCR results

Post by Walter-Tracker Supp »

Yes, you can re-use results; the function OCRp_Page() will return a pointer to page information that remains valid until explicitly freed with OCRp_FreePage().

So, in pseudocode:

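PXO_Page pages[40];

// OCR each page once; each result stays valid until OCRp_FreePage()
for (int nPage = 0; nPage < 40; nPage++)
    OCRp_Page(doc, nPage, options, &pages[nPage], &settings);

// Reuse the stored results, e.g. for your business logic
DoStuff(pages);

// Release pages and free memory
for (int nPage = 0; nPage < 40; nPage++)
    OCRp_FreePage(&pages[nPage]);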
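
If you would rather not OCR every page up front, a variation on the same idea is a small cache that recognizes a page the first time it is requested and hands back the stored result afterwards. The following is only a sketch: PageResult, RecognizeFn, ReleaseFn and the page_cache_* functions are hypothetical names, not SDK types or calls, and the two callbacks are wrappers you would write yourself around OCRp_Page() and OCRp_FreePage(). The demo_* stubs exist only so the sketch compiles without the SDK.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrappers: implement these around OCRp_Page() and
   OCRp_FreePage(). These signatures are NOT the SDK's. */
typedef void *PageResult;
typedef PageResult (*RecognizeFn)(void *doc, int page_index);
typedef void (*ReleaseFn)(PageResult page);

typedef struct {
    void *doc;
    int page_count;
    PageResult *pages;      /* NULL entry = page not recognized yet */
    RecognizeFn recognize;
    ReleaseFn release;
} PageCache;

/* Returns 0 on success, -1 if memory could not be allocated. */
int page_cache_init(PageCache *c, void *doc, int page_count,
                    RecognizeFn recognize, ReleaseFn release)
{
    assert(c != NULL && page_count > 0);
    assert(recognize != NULL && release != NULL);
    c->pages = malloc((size_t)page_count * sizeof *c->pages);
    if (c->pages == NULL)
        return -1;
    for (int i = 0; i < page_count; i++)
        c->pages[i] = NULL;
    c->doc = doc;
    c->page_count = page_count;
    c->recognize = recognize;
    c->release = release;
    return 0;
}

/* Recognizes the page on first request, then returns the stored result. */
PageResult page_cache_get(PageCache *c, int page_index)
{
    assert(page_index >= 0 && page_index < c->page_count);
    if (c->pages[page_index] == NULL)
        c->pages[page_index] = c->recognize(c->doc, page_index);
    return c->pages[page_index];
}

/* Releases every stored result; results are invalid after this call. */
void page_cache_free(PageCache *c)
{
    for (int i = 0; i < c->page_count; i++)
        if (c->pages[i] != NULL)
            c->release(c->pages[i]);
    free(c->pages);
    c->pages = NULL;
    c->page_count = 0;
}

/* Stand-in stubs so the sketch compiles without the SDK. */
int demo_recognize_calls = 0;
int demo_release_calls = 0;

PageResult demo_recognize(void *doc, int page_index)
{
    (void)doc;
    int *result = malloc(sizeof *result);
    if (result != NULL)
        *result = page_index;
    demo_recognize_calls++;
    return result;
}

void demo_release(PageResult page)
{
    free(page);
    demo_release_calls++;
}
```

Both the searchable-PDF step and the business logic would then call page_cache_get() for the pages they need, and page_cache_free() would be called once, after the last consumer has finished. If a recognize callback returns NULL on failure, the next request for that page simply tries again.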

Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Reusing OCR results

Post by Walter-Tracker Supp »

Also, the results of OCR_MakeSearchable() remain valid until the document is freed with OCR_Delete(). You can work with multiple documents by creating multiple input documents with OCR_Init() and OCR_Load()/OCR_LoadW().

In pseudocode:

// ocr first document
PXODocument doc1;
OCR_Init(..., doc1, ...);
OCR_MakeSearchable(..., doc1, ...);

// ocr second document
PXODocument doc2;
OCR_Init(..., doc2, ...);
OCR_MakeSearchable(..., doc2, ...);

// call your function on the documents
DoStuffWithBoth(doc1, doc2);

// Free memory / invalidate documents
OCR_Delete(doc1);
OCR_Delete(doc2);
Gemini64
User
Posts: 13
Joined: Thu May 02, 2013 10:21 am

Re: Reusing OCR results

Post by Gemini64 »

Cool, thanks a lot :)
Regards Robert
Tracker Supp-Stefan
Site Admin
Posts: 17823
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Reusing OCR results

Post by Tracker Supp-Stefan »

:)
Gemini64
User
Posts: 13
Joined: Thu May 02, 2013 10:21 am

Re: Reusing OCR results

Post by Gemini64 »

Is there any functionality that reads back what OCR_MakeSearchable wrote to the PDF, that is, reconstructs the output area?
Sometimes PDF documents already arrive searchable, so such functionality would avoid the need to OCR them again.
It would also be useful for keeping services independent: one that OCRs documents, and business services that use the OCR results.

Regards
Robert
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Reusing OCR results

Post by Walter-Tracker Supp »

The OCR SDK does not support this; OCR results persist after an OCR job is performed and until you free the OCR document object with OCR_Delete(), but they cannot be directly recovered (e.g. into a PXO_Page object) from an already OCR'd document on disk.

You can extract text from documents using the PDF Tools SDK, but there is no function that informs you whether it is OCR text or not. You would have to come up with your own logic to analyze page contents, and decide whether or not the text was OCR text or regular document text. In particular you might have to check the properties of extracted text and see that it is invisible.

However, if you know that the documents you are processing are one of the following types:

A: Plain scan without OCR text
B: Scan that has been OCR'd and contains an invisible text layer

You might write a routine to try to extract text (using PDF Tools SDK), and if it succeeds in finding text you will know it is a type "B" document, and if it fails you can assume it is a type "A" document and can then run OCR. Writing the logic for this would be up to you, though the PDF Tools SDK manual contains information on text extraction, and usage examples, in section 3.1.10.
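
The decision step of such a routine might look roughly like the sketch below. It assumes you have already extracted the text runs yourself and recorded each run's PDF text rendering mode (mode 3 in the PDF specification means the text is neither filled nor stroked, i.e. invisible). TextRun, DocKind and classify_runs are hypothetical names, not PDF Tools SDK types or functions, and whether a given OCR layer really uses rendering mode 3 is an assumption you would need to verify against your own documents.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one extracted text run; fill it from whatever
   your text extraction code returns. Not a PDF Tools SDK type. */
typedef struct {
    const char *text;   /* extracted characters, NUL-terminated */
    int render_mode;    /* PDF text rendering mode (Tr operator) */
} TextRun;

enum { RENDER_MODE_INVISIBLE = 3 };  /* neither filled nor stroked */

typedef enum {
    DOC_PLAIN_SCAN,     /* type A: no text found, run OCR */
    DOC_OCR_LAYER,      /* type B: only invisible text, skip OCR */
    DOC_VISIBLE_TEXT    /* some visible text: regular document text */
} DocKind;

/* True if the string is NULL, empty, or whitespace only. */
bool is_blank(const char *s)
{
    if (s == NULL)
        return true;
    for (; *s != '\0'; s++)
        if (!isspace((unsigned char)*s))
            return false;
    return true;
}

/* Classifies a document from its extracted text runs. */
DocKind classify_runs(const TextRun *runs, size_t count)
{
    assert(runs != NULL || count == 0);
    bool found_text = false;
    for (size_t i = 0; i < count; i++) {
        if (is_blank(runs[i].text))
            continue;                      /* ignore empty runs */
        if (runs[i].render_mode != RENDER_MODE_INVISIBLE)
            return DOC_VISIBLE_TEXT;       /* visible text present */
        found_text = true;
    }
    return found_text ? DOC_OCR_LAYER : DOC_PLAIN_SCAN;
}
```

Only a DOC_PLAIN_SCAN result would be sent on to OCR. The third value is there because documents outside types A and B will turn up sooner or later, and treating visible text as an OCR layer would be misleading for the business logic.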
Gemini64
User
Posts: 13
Joined: Thu May 02, 2013 10:21 am

Re: Reusing OCR results

Post by Gemini64 »

Thanks a lot, Walter,
I already guessed something like that,
but before I reinvent the wheel, I prefer to ask whether someone has done it before me :-)

Regards
Robert
Tracker Supp-Stefan
Site Admin
Posts: 17823
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Reusing OCR results

Post by Tracker Supp-Stefan »

:)