SDK to extract text, then search

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.


Archie
User
Posts: 4
Joined: Tue Dec 03, 2013 9:48 pm

SDK to extract text, then search

Post by Archie »

We write software for the Logistics industry.
We have an "Edoc" system that keeps shipping documents as PDFs.
Our customers are asking for the ability to search the recently received PDFs (say 3,000) for things like a company name.
I am thinking about using the PDF-XChange Viewer SDK to do that automatically in the background with no user input as the "Edoc" pdf is created.
I expect to create, for each Edoc PDF, a pdf.txt file containing the text extracted from the .pdf.

After that I need to write a program that searches the pdf.txt files for the desired string, e.g. a company name.
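A minimal sketch of that search step, written in Python purely for illustration. It assumes the extracted text files sit in one folder and follow the "<document>.pdf.txt" naming described above; the function name `search_text_files` and the case-insensitive substring match are assumptions and not part of any Tracker SDK.

```python
from pathlib import Path


def search_text_files(folder, needle, pattern="*.pdf.txt"):
    """Return (file name, line number, line) for every line containing needle.

    Matching is a case-insensitive substring test. The folder layout and the
    "*.pdf.txt" naming are assumptions taken from the post above.
    """
    needle = needle.casefold()
    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        # OCR output can contain odd bytes, so replace undecodable characters
        # instead of failing on them.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if needle in line.casefold():
                    hits.append((path.name, line_number, line.strip()))
    return hits
```

Calling `search_text_files("C:/edocs/text", "Acme Freight")` (a hypothetical folder and company name) would return one tuple per matching line, so the only user input is the search string. A linear scan like this is usually adequate for a few thousand small text files; a larger archive would probably want a proper full-text index.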

Question:
Does anyone know of an ActiveX module that I can use to do the search with as little user input as possible, just the string to search for?

Thanks
Archie
Paul - Tracker Supp
Site Admin
Posts: 6831
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: SDK to extract text, then search

Post by Paul - Tracker Supp »

Hi Archie,

Thanks for the post and welcome to the Tracker Forums.

I think you should look at the PDF-Tools SDK: https://www.pdf-xchange.com/product/pdf-tools-sdk
Assuming your PDFs are being created as text-based PDFs and not image-based ones, you should be able to search for strings directly in the PDF, without the txt file in between.

If they are image-based, you would need to OCR the PDFs first and then search the text.

All the SDKs are fully functional, even in 'Trial Mode', and you can test every aspect of your program before committing to a purchase. The caveat is that, until licensed, anything you do with the SDK will result in watermarked PDFs. Once you are happy that you have the right solution, simply purchase a license, insert the serial keys and dev code we give you into your source code, recompile and go.

I hope that helps. Do be sure to let us know if you have further questions.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Archie
User
Posts: 4
Joined: Tue Dec 03, 2013 9:48 pm

Re: SDK to extract text, then search

Post by Archie »

Hi Paul

Thanks for the quick reply.
Some PDFs come from forms that contain text, but many come from faxes attached to emails, and those are image-based.
For those I will need the OCR stuff.

I presume your OCR stuff will allow me to create a pdf.txt file with the OCR produced text and that you have ActiveX modules that my programs can call to do it.

The question for which I am looking for some guidance is related to the next step where I build an application program that asks the user for a search string and it goes off and searches all the pdf.txt files for the desired string.

Question:
Does anyone know of an ActiveX module that I can use to do the search with as little user input as possible, just the string to search for?

Thanks
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: SDK to extract text, then search

Post by John - Tracker Supp »

Hi Archie,

Topic moved to the correct forum ...

Walter (our OCR specialist) will reply shortly regarding your question. As for licensing, you will need a PDF-XChange PRO SDK (not the PDF-Tools SDK, as advised by Paul) to gain access to the Live OCR SDK functions.

HTH
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: SDK to extract text, then search

Post by Walter-Tracker Supp »

You can do this with the Pro Tools SDK, but it is not an ActiveX component; it is a native C++ DLL with a flat C-style API. We have functions to extract existing text, and an OCR component that lets you perform OCR and either create a searchable PDF as output or extract the text, which you can save to a text file if you wish.

We have wrappers for .NET and a few other languages, so you aren't restricted to C++, but it is not an ActiveX component.

We do have an ActiveX viewer component, but this is typically used to provide customized viewing (and annotating, etc.) capabilities within a custom application. You can't use it to automate text extraction.

-Walter
Archie
User
Posts: 4
Joined: Tue Dec 03, 2013 9:48 pm

Re: SDK to extract text, then search

Post by Archie »

We develop in a language called Visual DataFlex (VDF), which easily supports using ActiveX controls.

I went to the VDF forum and asked if your stuff would be usable.
The best reply I got was the following:
"Ask them how their exported functions are declared. If they use __STDCALL then you're all good. Each of their functions becomes an External_function statement in VDF."

My question, then, is the above. Are the exported functions declared using __STDCALL?

Thanks
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: SDK to extract text, then search

Post by Walter-Tracker Supp »

Yes, we use the __stdcall calling convention. You do not need to purchase the product to try it; there are some limitations (e.g. watermarks if you create documents, limits on the number of pages you can OCR, etc.), but you can try out every feature without purchasing a license.

-Walter
Archie
User
Posts: 4
Joined: Tue Dec 03, 2013 9:48 pm

Re: SDK to extract text, then search

Post by Archie »

Good news Walter.
Thanks for the info and the quick response.
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: SDK to extract text, then search

Post by John - Tracker Supp »

Thanks Archie - do come back if you need any further info.

Best regards
Tracker Support
http://www.tracker-software.com
Peter2
User
Posts: 946
Joined: Mon Sep 13, 2010 10:09 am
Location: Switzerland

Re: SDK to extract text, then search

Post by Peter2 »

This posting is nearly 2 years old, but I'm looking for a similar solution. So my question is:

Has anything important changed since 2013? New features, new out-of-the-box tools, new SDKs?

Peter
PDF-X-Change Pro German
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: SDK to extract text, then search

Post by Will - Tracker Supp »

Hi Peter,

Lots has changed since 2013, but nothing too dramatic in terms of the OCR's overall functionality. What, specifically, are you looking for?

Thanks,

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
Peter2
User
Posts: 946
Joined: Mon Sep 13, 2010 10:09 am
Location: Switzerland

Re: SDK to extract text, then search

Post by Peter2 »

Hi Will

This is the main thread, with the side discussion 'Where is menu "text Props"?':
https://forum.pdf-xchange.com/ ... 62&t=24215

The (half-baked) idea behind my question is that scanned drawings need:
- OCR
- finding the position (coordinates) of the recognised strings
- "transforming" (in a way that still needs to be found ...) the content and the position of the strings into a vector drawing.

This is why I'm thinking about "which text is where"?
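
One possible shape for that last step, as a rough Python sketch. It assumes the OCR step has already produced a list of strings with positions in PDF page coordinates (points, origin at the bottom-left corner of the page). The `TextItem` type and the `to_svg` function are hypothetical and not part of any Tracker SDK, and SVG is only used here as an example of a vector format.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class TextItem:
    text: str    # recognised string
    x: float     # left end of the baseline, in PDF points
    y: float     # baseline height above the bottom of the page, in PDF points
    size: float  # font size in points


def to_svg(items, page_width, page_height):
    """Render positioned strings as SVG <text> elements."""
    lines = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{page_width}pt" height="{page_height}pt" '
        f'viewBox="0 0 {page_width} {page_height}">'
    ]
    for item in items:
        # PDF measures y upwards from the bottom of the page,
        # SVG measures downwards from the top, so flip the axis.
        svg_y = page_height - item.y
        lines.append(
            f'  <text x="{item.x:g}" y="{svg_y:g}" font-size="{item.size:g}">'
            f"{escape(item.text)}</text>"
        )
    lines.append("</svg>")
    return "\n".join(lines)
```

The axis flip is the only real transformation here; rotated text or a target format such as DXF would need more work.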
PDF-X-Change Pro German
Ivan - Tracker Software
Site Admin
Posts: 3549
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada

Re: SDK to extract text, then search

Post by Ivan - Tracker Software »

You can use the Editor SDK or the Core API SDK to retrieve the text from a page and search in it.

Please take a look at the IPXC_PageText interface: https://sdkhelp.pdf-xchange.com/vie ... C_PageText
Tracker Software (Project Director)
