How to determine if a PDF is searchable

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to give developers an expanding set of professional tools for Optical Character Recognition tasks.

Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan

arno.engelbrecht
User
Posts: 4
Joined: Tue Aug 12, 2014 7:37 am

How to determine if a PDF is searchable

Post by arno.engelbrecht »

I have the PDF X-Change PRO SDK, which includes the OCR module. I can OCR documents, but I have a large number of documents: some are image-based and need to be OCR'ed, while others are already searchable and do not. Is there a way with the SDK to determine whether a document is already searchable?
Paul - Tracker Supp
Site Admin
Posts: 6835
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: How to determine if a PDF is searchable

Post by Paul - Tracker Supp »

Hi Arno,

thanks for the post,

I moved it from the End User OCR to the SDK OCR forum.

I am not personally sure how to do this, so I will have one of the development team advise when they have a spare moment.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: How to determine if a PDF is searchable

Post by Vasyl-Tracker Dev Team »

Hi, arno.engelbrecht.

One possible way is to check whether any page contains text, like this:

Code:

PDFDocument hDoc;

// open document...

DWORD pagesNum = 0;
PXCp_GetPagesCount(hDoc, &pagesNum);

// prepare the document for text extraction

PXCp_ET_Prepare(hDoc);

bool isSearchable = false;

for (DWORD i = 0; i < pagesNum; i++)
{
     // analyze page i, then count the text elements found on it
     PXCp_ET_AnalyzePageContent(hDoc, i);
     DWORD textElementsNum = 0;
     PXCp_ET_GetElementCount(hDoc, &textElementsNum);
     if (textElementsNum != 0)
     {
        // at least one text element: treat the document as searchable
        isSearchable = true;
        break;
     }
}

// release the text-extraction resources
PXCp_ET_Finish(hDoc);
HTH
Vasyl Yaremyn
Tracker Software Products
Project Developer

arno.engelbrecht
User
Posts: 4
Joined: Tue Aug 12, 2014 7:37 am

Re: How to determine if a PDF is searchable

Post by arno.engelbrecht »

Hi

Thanks a lot. If I find any text at all, can I assume the document is already searchable, or should I look for a minimum amount of text? Basically, I want to make sure that a few random characters in a file that isn't actually searchable don't cause it to be skipped.
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: How to determine if a PDF is searchable

Post by John - Tracker Supp »

Well, that would be down to you: analyse what's returned and decide whether it's usable or not.
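
As a rough illustration of that kind of analysis, here is a minimal sketch of one possible heuristic. It assumes the text of each page has already been extracted into a string by whatever means you use; the function names, the 50-character threshold and the 50% page fraction are all hypothetical and not part of the SDK, so tune them against your own documents.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Count the letters and digits in the text extracted from one page.
// Punctuation and whitespace are ignored so stray marks do not count.
std::size_t countAlnum(const std::wstring& text)
{
    std::size_t n = 0;
    for (wchar_t ch : text)
    {
        if (std::iswalnum(static_cast<std::wint_t>(ch)))
            ++n;
    }
    return n;
}

// Hypothetical heuristic (not an SDK function): treat a document as
// searchable when at least minPageFraction of its pages each contain
// at least minCharsPerPage letters or digits. Both defaults are
// arbitrary starting points.
bool looksSearchable(const std::vector<std::wstring>& pageTexts,
                     std::size_t minCharsPerPage = 50,
                     double minPageFraction = 0.5)
{
    assert(minPageFraction >= 0.0 && minPageFraction <= 1.0);

    if (pageTexts.empty())
        return false;

    std::size_t pagesWithText = 0;
    for (const std::wstring& text : pageTexts)
    {
        if (countAlnum(text) >= minCharsPerPage)
            ++pagesWithText;
    }

    return static_cast<double>(pagesWithText) >=
           minPageFraction * static_cast<double>(pageTexts.size());
}
```

With the defaults, a two-page document where one page holds real text and the other is blank would pass, while a document whose only text is a couple of stray characters would not. Documents with many legitimately sparse pages (covers, drawings) may need a lower fraction.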

Best regards
Tracker Support
http://www.tracker-software.com