Very large file size after the text recognition

Posted: Mon Dec 17, 2012 2:25 pm
by Dorwol
Hi!

I like this library very much. However, most of the support requests we receive are about the incredibly large file size after text recognition is performed. Often a 1MB PDF becomes 6MB or more (the same PDF run through Adobe Acrobat's text recognition is still 1MB, or sometimes less). That annoys many of our customers.

Will this problem be fixed in the next major version, and when will the next major version be released?

Thank you!!!

Re: Very large file size after the text recognition

Posted: Mon Dec 17, 2012 2:40 pm
by John - Tracker Supp
Hi,

Thanks, but this is not what we would expect, though some bloat is anticipated. Could you please zip an example and send it to us at support@pdf-xchange.com? I can confirm we do have improvements coming.

thanks

Re: Very large file size after the text recognition

Posted: Mon Dec 17, 2012 3:06 pm
by Dorwol
I have attached a very small example for you.

The size of "test.pdf" is 114KB. After OCR with Acrobat the size is 124KB (only 10KB more).

But after OCR with the OCR DLL the file is more than 7 times larger than the original: 817KB.

*I hope this small example already helps, because our other documents are too big and/or contain somewhat confidential content.

Re: Very large file size after the text recognition

Posted: Mon Dec 17, 2012 3:28 pm
by Dorwol
another example...

original file size: 528KB
after OCR with Acrobat: 825KB (300KB more than original PDF)
after OCR with DLL: 1.23 MB (more than double the size of the original PDF)
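
For reference, the growth factors in the two examples above can be worked out with a few lines of Python. This is only an arithmetic sketch: the sizes are the ones quoted in this thread, and the 1.23 MB figure is converted assuming 1 MB = 1024 KB, which is an assumption on my part.

```python
def growth_factor(original_kb: float, output_kb: float) -> float:
    """Return how many times larger the output file is than the original."""
    if original_kb <= 0:
        raise ValueError("original size must be positive")
    return output_kb / original_kb


# Sizes in KB as quoted in this thread.
# Assumption: 1.23 MB is converted with 1 MB = 1024 KB.
samples = {
    "test.pdf": {"original": 114, "acrobat": 124, "ocr_dll": 817},
    "second example": {"original": 528, "acrobat": 825, "ocr_dll": 1.23 * 1024},
}

for name, sizes in samples.items():
    acrobat = growth_factor(sizes["original"], sizes["acrobat"])
    dll = growth_factor(sizes["original"], sizes["ocr_dll"])
    print(f"{name}: Acrobat x{acrobat:.2f}, OCR DLL x{dll:.2f}")
```

The first file grows by roughly 9% with Acrobat but by more than a factor of 7 with the DLL; the second more than doubles with the DLL.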

Re: Very large file size after the text recognition

Posted: Mon Dec 17, 2012 5:31 pm
by John - Tracker Supp
Thanks - our OCR project manager (Walter) will review ASAP and come back.

Re: Very large file size after the text recognition

Posted: Mon Dec 17, 2012 6:23 pm
by Walter-Tracker Supp
It depends mainly on the resolution used for OCRing; if it is higher than the resolution of the source documents, you'll get increased file size. The next version will have an option (much like the viewer currently has) to output only a text layer to the existing file or a copy of it (rather than creating a whole new file), and to downsample the images for output. Ensuring output file sizes are small is a very high priority.
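
To give a rough feel for why the OCR resolution matters, here is a small Python sketch of how the raw pixel count of a page image scales with DPI. It assumes the page image is re-rendered at the OCR resolution and looks only at uncompressed pixel counts; the real output size also depends on the compression used, so treat the numbers as an illustration rather than a description of what the SDK does internally.

```python
def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a page image of the given physical size at `dpi`."""
    return round(width_in * dpi) * round(height_in * dpi)


def pixel_scale(source_dpi: int, ocr_dpi: int) -> float:
    """Factor by which the pixel count changes when re-rendering at `ocr_dpi`.

    Both width and height scale linearly with DPI, so the pixel count
    scales with the square of the DPI ratio.
    """
    return (ocr_dpi / source_dpi) ** 2


# A US Letter page (8.5 x 11 inches) scanned at 150 DPI, then OCR'd at 300 DPI.
source = pixel_count(8.5, 11, 150)
rendered = pixel_count(8.5, 11, 300)
print(source, rendered, pixel_scale(150, 300))
```

Doubling the resolution quadruples the number of pixels, which is consistent with files growing when the OCR resolution exceeds the scan resolution, and with the 300DPI-in, 300DPI-out test below staying about the same size.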

You'll have to ask sales or support when the new SDK is scheduled for release. I cannot promise any new features in the current SDK as our development priorities are very heavily weighted towards the new releases, although this doesn't mean it's not possible.

I will have a meeting the first week of January to talk about getting a file size mitigating feature added if we can.

There is a flag for outputting only "invisible" (non-rendered) text to a PDF: "OCR_Image_SuppressOutput". You could do this, then merge pages (with the input PDF) afterwards.

On a test document scanned at 300DPI, then OCR'd at 300DPI, we get the following results:

input: littlehouse-300DPI.pdf 3,505KB
output: littlehouse-standardOCRtest-300DPI.pdf 3,320KB

Re: Very large file size after the text recognition

Posted: Mon Dec 17, 2012 6:30 pm
by Walter-Tracker Supp
By the way, the information in this thread might be useful if you decide to merge pages yourself:

https://forum.pdf-xchange.com/ ... put#p60520

The short version is "see page 340 of the PDF-ToolsV4SDK.pdf manual", which has an example of using the function PXCp_PlaceContents() to merge pages.

Re: Very large file size after the text recognition

Posted: Tue Dec 18, 2012 8:44 am
by Dorwol
> The next version will have an option to output only
> a text layer to the existing file or a copy of it (rather than creating a whole new file),
> and to downsample the images for output.
OK, I think I understand. Then the existing bookmarks won't be deleted either. Correct? :idea:
see > https://forum.pdf-xchange.com/ ... 42&t=13937

Re: Very large file size after the text recognition

Posted: Wed Dec 19, 2012 1:18 pm
by John - Tracker Supp
Correct.

Re: Very large file size after the text recognition

Posted: Mon May 06, 2013 10:08 pm
by kman
It seems there was a release of the OCR SDK on April 3, 2013.
Did the option to output only a text layer to the existing file get incorporated?
If so, how do you call this function?

If not, is there another way of adding the OCR layer seamlessly via PDF-XChange Viewer Pro?

Re: Very large file size after the text recognition

Posted: Mon May 06, 2013 10:12 pm
by Walter-Tracker Supp
The option to output to an existing PDF is a feature of the Viewer but not directly available in the OCR SDK. The last release was primarily a bug fix.

You can, however, access the text and position results and place them yourself in an existing PDF if you wish.
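
If you place the text yourself, the main piece of arithmetic is converting each result's position into PDF units. The Python sketch below assumes the OCR result is a bounding box in pixels with the origin at the top-left of the page image; that format is an assumption for illustration, not the SDK's documented output. It also assumes an unrotated page whose origin is at the lower-left corner. What is fixed is that default PDF user space uses points (1/72 inch) with the y axis pointing up.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def pixel_box_to_pdf_points(x, y, w, h, dpi, page_height_pt):
    """Convert a pixel box (top-left origin) to PDF points (bottom-left origin).

    Assumption: (x, y) is the top-left corner of the box in pixels at `dpi`.
    Returns (x_pt, y_pt, w_pt, h_pt), where (x_pt, y_pt) is the lower-left
    corner of the box in PDF user space.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    x_pt = x * POINTS_PER_INCH / dpi
    w_pt = w * POINTS_PER_INCH / dpi
    h_pt = h * POINTS_PER_INCH / dpi
    # Flip the y axis: the box's bottom edge is (y + h) pixels from the top.
    y_pt = page_height_pt - (y + h) * POINTS_PER_INCH / dpi
    return x_pt, y_pt, w_pt, h_pt


# A word found 1 inch from the left and top of an 11-inch-tall page at 300 DPI.
print(pixel_box_to_pdf_points(300, 300, 600, 150, dpi=300, page_height_pt=792))
```

Using the wrong DPI or forgetting the y flip puts the invisible text in the wrong place, so text selection and search highlighting will not line up with the scanned image.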

Looking back over the thread I guess I've already mentioned this, but we also have an option to suppress image output (leaving visibly blank pages, containing invisible OCR text). To do this, pass the flag "OCR_Image_SuppressOutput" to the OCR_MakeSearchable() function, then save the PDF with OCR_Save(). The resultant PDF can then be merged with your input document to effectively provide the same outcome as the Viewer's option. Make sure you do not specify deskew (auto-rotate) when doing this, or there's a possibility the output positions will not match the input.

https://forum.pdf-xchange.com/ ... put#p60520

Re: Very large file size after the text recognition

Posted: Thu Oct 03, 2013 10:39 am
by Dorwol
Walter-Tracker Supp wrote:
> The next version will have an option to output only
> a text layer to the existing file or a copy of it (rather than creating a whole new file),
> and to downsample the images for output.
Any news about "the next version"? We need these features very much!

Re: Very large file size after the text recognition

Posted: Thu Oct 03, 2013 11:18 am
by Tracker Supp-Stefan
Hi Dorwol,

I will check with Walter when he comes to work later today and we will post an update here shortly after.

Regards,
Stefan

Re: Very large file size after the text recognition

Posted: Wed Oct 16, 2013 11:17 am
by Dorwol
? :?

Re: Very large file size after the text recognition

Posted: Wed Oct 16, 2013 11:31 am
by Tracker Supp-Stefan
Hi Dorwol,

Apologies, it seems I failed to follow up on this one properly. I will speak with Walter again today and we will post back here.

Regards,
Stefan

Re: Very large file size after the text recognition

Posted: Wed Oct 16, 2013 11:35 am
by Dorwol
:P No Prob!

Re: Very large file size after the text recognition

Posted: Wed Oct 16, 2013 4:30 pm
by Walter-Tracker Supp
We are providing this kind of functionality in the new SDK which will be out after the editor is finalized.

-Walter

Re: Very large file size after the text recognition

Posted: Mon Dec 23, 2013 10:39 am
by mr_steph
Hello,

I have the same problem with the demo version.
- My pdf 38 pages 2.4MB
- The generated file: 2 pages 1.1 MB

The options used are:
- OCR_Image_FastAutorotate (ImageFlags)
- OCR_Auto (RegionMode)
- 300 dpi

Is this related to the demo?
163549.zip
Thank you for your reply

Stéphane

Re: Very large file size after the text recognition

Posted: Mon Dec 23, 2013 10:43 am
by Tracker Supp-Stefan
Thanks for the post Stephane,

I will ask Walter to check this further.

Merry Christmas and
Happy New Year
Stefan

Re: Very large file size after the text recognition

Posted: Fri Jan 10, 2014 10:51 am
by mr_steph
Hello,

Do you have an answer?

Stephane

Re: Very large file size after the text recognition

Posted: Fri Jan 10, 2014 3:09 pm
by Tracker Supp-Stefan
Hi Stephane,

I've once again requested an update from Walter.

Regards,
Stefan

Re: Very large file size after the text recognition

Posted: Fri Jan 10, 2014 5:19 pm
by Walter-Tracker Supp
As I mentioned, we'll be providing better compression in the next major release. There are some constraints on the compression methods we use in the current release that can result in larger files in some cases.

Meanwhile I am investigating the file provided most recently in this thread - there may be some issues with the structure of this PDF (malformed) but we need to investigate it for a day or two. Either way I will report back shortly.

Re: Very large file size after the text recognition

Posted: Tue Feb 04, 2014 9:18 am
by joost
Hi guys,

I'm facing the same problem: a single-page PDF becomes three times its original size. An update to the SDK would be much appreciated. Is the updated SDK still in the pipeline?

Re: Very large file size after the text recognition

Posted: Tue Feb 04, 2014 12:15 pm
by Tracker Supp-Stefan
Hi joost,

Yes this is still being worked on, but Walter should be able to provide further details on when this new major release is expected.

Regards,
Stefan