Reusing OCR results
Hi,
I have a process that should make a PDF searchable and also verify the OCR results against business logic.
Is there a way to reuse the OCR results of OCR_MakeSearchable for the business logic process, or
the OCR results of OCRp_Page for OCR_MakeSearchable, so I don't have to OCR a 40-page document twice?
Thanks a lot
Robert
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Reusing OCR results
Yes, you can re-use results; the function OCRp_Page() will return a pointer to page information that remains valid until explicitly freed with OCRp_FreePage().
So, in pseudocode:
// OCR each page once and keep the results
PXO_Page pages[40];
for (nPage in range(40))
    OCRp_Page(doc, nPage, options, &pages[nPage], &settings);
// Reuse the stored results, e.g. for the business logic
DoStuff(pages);
// Release pages and free memory
for (nPage in range(40))
    OCRp_FreePage(&pages[nPage]);
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Reusing OCR results
Also, the results of OCR_MakeSearchable() remain valid until the document is freed with OCR_Delete(). You can work with several documents at once by creating a separate input document for each with OCR_Init() and OCR_Load()/OCR_LoadW().
In pseudocode:
// ocr first document
PXODocument doc1;
OCR_Init(..., doc1, ...);
OCR_MakeSearchable(..., doc1, ...);
// ocr second document
PXODocument doc2;
OCR_Init(..., doc2, ...);
OCR_MakeSearchable(..., doc2, ...);
// call your function on the documents
DoStuffWithBoth(doc1, doc2);
// Free memory / invalidate documents
OCR_Delete(doc1);
OCR_Delete(doc2);
Re: Reusing OCR results
Cool, thanks a lot
Regards Robert
- Tracker Supp-Stefan
- Site Admin
- Posts: 17948
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: Reusing OCR results
Is there any functionality that rereads what OCR_MakeSearchable wrote to the PDF, that is, one that reconstructs the output area?
Sometimes PDF documents already arrive searchable, so such functionality would avoid the need to OCR them again.
It would also be useful for running independent services: one that OCRs documents, and business services that use the OCR results.
Regards
Robert
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Reusing OCR results
The OCR SDK does not support this; OCR results persist after an OCR job is performed and until you free the OCR document object with OCR_Delete(), but they cannot be directly recovered (e.g. into a PXO_Page object) from an already OCR'd document on disk.
You can extract text from documents using the PDF Tools SDK, but there is no function that informs you whether it is OCR text or not. You would have to come up with your own logic to analyze page contents, and decide whether or not the text was OCR text or regular document text. In particular you might have to check the properties of extracted text and see that it is invisible.
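As a rough sketch of the invisibility check: in the PDF specification, text rendering mode 3 (the operand of the `Tr` operator) means the text is neither filled nor stroked, and OCR text layers are commonly written that way. The snippet below assumes your extraction step can report the rendering mode of each text run. `TextRun` and `page_text_is_invisible` are hypothetical names for illustration, not PDF Tools SDK types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one extracted text run; not an SDK type. */
typedef struct {
    const char *text;  /* extracted characters */
    int render_mode;   /* PDF text rendering mode (operand of Tr) */
} TextRun;

/* PDF text rendering mode 3: neither fill nor stroke, i.e. invisible. */
enum { PDF_RENDER_INVISIBLE = 3 };

/* Returns true when the page has at least one run and every run is invisible. */
bool page_text_is_invisible(const TextRun *runs, size_t count)
{
    assert(runs != NULL || count == 0);
    if (count == 0)
        return false;  /* no text at all: nothing to classify as OCR text */
    for (size_t i = 0; i < count; ++i) {
        if (runs[i].render_mode != PDF_RENDER_INVISIBLE)
            return false;  /* visible text is probably regular document text */
    }
    return true;
}
```

Treat this as a heuristic only: some producers hide OCR text in other ways, for example by drawing it underneath the scanned image, so a page can carry OCR text without using mode 3.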
However, if you know that the documents you are processing are one of the following types:
A: Plain scan without OCR text
B: Scan that has been OCR'd and contains an invisible text layer
You might write a routine to try to extract text (using PDF Tools SDK), and if it succeeds in finding text you will know it is a type "B" document, and if it fails you can assume it is a type "A" document and can then run OCR. Writing the logic for this would be up to you, though the PDF Tools SDK manual contains information on text extraction, and usage examples, in section 3.1.10.
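A minimal sketch of that type A / type B decision, assuming the text of each page has already been extracted into plain C strings by whatever means you use. The extraction call itself is not shown, and `DocType`, `classify_document` and the `min_chars` threshold are illustrative names, not part of either SDK.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { DOC_PLAIN_SCAN, DOC_HAS_TEXT_LAYER } DocType;

/* Counts the non-whitespace characters in a NUL-terminated string. */
static size_t count_non_space(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (!isspace((unsigned char)*s))
            ++n;
    }
    return n;
}

/*
 * page_texts: extracted text per page; a NULL entry means extraction
 *             returned nothing for that page.
 * min_chars:  how many non-whitespace characters count as "found text".
 */
DocType classify_document(const char *const *page_texts, size_t page_count,
                          size_t min_chars)
{
    assert(page_texts != NULL || page_count == 0);
    size_t total = 0;
    for (size_t i = 0; i < page_count; ++i) {
        if (page_texts[i] != NULL)
            total += count_non_space(page_texts[i]);
    }
    return total >= min_chars ? DOC_HAS_TEXT_LAYER : DOC_PLAIN_SCAN;
}
```

A caller would run OCR only when the result is `DOC_PLAIN_SCAN`. Using a threshold above 1 helps ignore stray characters such as a stamped page number, and a `min_chars` of 0 would classify every document as type "B", so it should be at least 1.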
Re: Reusing OCR results
Thanks a lot, Walter,
I had already guessed something like that,
but before I reinvent the wheel, I prefer to ask whether someone has done it before me.
Regards
Robert
- Tracker Supp-Stefan
- Site Admin
- Posts: 17948
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London