How to separate pdfs with images from "normal" pdfs

Post by **michipapa** » Fri Aug 31, 2018 8:44 am

Hi,
We run the Tracker OCR as a service on a server to process all incoming PDFs of our document management system.
How can I separate the PDFs that contain images from the "normal" PDFs, to reduce the run time?

And what happens to "normal" PDFs if I put these files through the OCR process?

regards Michael

Post by **TrackerSupp-Daniel** » Fri Aug 31, 2018 4:07 pm

Hello Michipapa,
Currently there is no method to separate PDFs based on their content.
There is a checkbox in the OCR function, "Skip pages that already contain text content items", which may help in your situation. Note, however, that this option will also skip pages that have both images and base content text on them, so it may not be a catch-all solution.

For "normal" PDFs that are processed with OCR: if the aforementioned checkbox is checked, they will not be affected and will add minimal time to the process queue. If the checkbox is not checked, you may find that you have a duplicate layer of invisible text on the document.
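
To make the behaviour described above concrete, here is a minimal Python sketch of the per-page decision. It is not the OCR engine's code; the `Page` type, its `has_text` / `has_image` flags, and the `ocr_action` function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical flags; a real tool would derive them from the page content.
    has_text: bool
    has_image: bool


def ocr_action(page: Page, skip_pages_with_text: bool) -> str:
    """Return what happens to a page for a given state of the checkbox."""
    if page.has_text and skip_pages_with_text:
        # Mixed pages (image + text) also end up here: the caveat noted above.
        return "skipped"
    if page.has_text:
        # OCR runs anyway and may add a duplicate invisible text layer.
        return "ocr (possible duplicate text layer)"
    # Pages without any text content are always sent to OCR.
    return "ocr"
```

With the checkbox checked, a scanned page that also carries a line of real text is reported as skipped, so its image is never recognised.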

I hope this helps!

Edit:
I have just brought this to the Dev team, and we have decided to undertake the challenge. I cannot make any promises about a timeline for the function, but if you are ever looking for updates on the progress, please ask any member of our support staff about the ticket number below, and we will be able to assist.

RT #4474

Post by **michipapa** » Fri Aug 31, 2018 7:00 pm

Hi Daniel,

If you write
>There is a checkbox in the OCR function to "Skip pages that already contain text content items"

which function or parameter of your OCR SDK do you mean? I see this in the GUI of the PDF Editor but not in the OCR option list...

regards Michael

Post by **TrackerSupp-Daniel** » Fri Aug 31, 2018 8:06 pm

Hello michipapa,

My sincerest apologies, I jumped on this a bit quickly and did not notice that it was an SDK issue.
While this option is available in the end-user GUI, I do not believe that it is available in the OCR SDK. With that being said, I've created another feature request for you, this time to add these options to the SDK products.

#4475: FR: OCR SDK Add more scan options

Hopefully we can add these soon, but until then I do not have an interim solution for you. I've asked the dev team for more information on this, so should anything come up, or if they find a workaround to help you implement it, I am sure they will let you know.

Sat Sep 01, 2018 5:48 am

Hello Michael,

If you want fine-grained control over the OCR logic, I recommend using the OCR SDK together with the Core API SDK. From what I see on this page, you should have it in the PRO SDK bundle:
https://www.pdf-xchange.com/produc ... ge-pro-sdk
Though I do not know exactly which license you have.
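
As a rough sketch of what such a pre-filter could look like once you can inspect pages yourself, the Python below sorts files into an OCR queue and a pass-through queue. The `page_has_text` callback is a placeholder for whatever page inspection your PDF library offers; it is not a Core API SDK call, and all names here are assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def split_for_ocr(
    files: Iterable[Tuple[str, int]],
    page_has_text: Callable[[str, int], bool],
) -> Tuple[List[str], List[str]]:
    """Split (path, page_count) pairs into (needs_ocr, already_text).

    A file needs OCR if at least one of its pages has no text content.
    """
    needs_ocr: List[str] = []
    already_text: List[str] = []
    for path, page_count in files:
        # page_has_text is a hypothetical hook: (path, page_index) -> bool.
        if all(page_has_text(path, i) for i in range(page_count)):
            already_text.append(path)
        else:
            needs_ocr.append(path)
    return needs_ocr, already_text
```

Only the files in the first list would then be handed to the OCR service. Note that a file is treated as "normal" only when every page has text, so a document with a single scanned page still goes to OCR.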

Cheers,
Alex
