Very large file size after the text recognition

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan

Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Very large file size after the text recognition

Post by Dorwol »

Hi!

I like this library very much. However, most of the support requests we receive are about the incredibly large file size after text recognition is performed. Often a 1MB PDF becomes 6MB or more (the same PDF run through Adobe Acrobat's recognition stays at 1MB or is sometimes even smaller). That annoys many of our customers.

Will this problem be fixed in the next major version, and when will that version be released?

Thank you!!!
Last edited by Dorwol on Tue Dec 18, 2012 8:50 am, edited 2 times in total.
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: Very large file size after the text recognition

Post by John - Tracker Supp »

Hi,

Thanks, but this is not what we would expect, though some bloat is anticipated. Could you please zip an example and send it to us at support@pdf-xchange.com? I can confirm we do have improvements coming.

thanks
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: Very large file size after the text recognition

Post by Dorwol »

I have attached a very small example for you.

Size of "test.pdf" is 114KB. After OCR with Acrobat the size is 124KB (only 10KB more).

But with the OCR DLL the result is 817KB, more than 7 times larger than the original file.

*I hope this little example already helps you, because the other documents are too big and/or contain somewhat confidential content.
Attachments
sample.zip
(988.14 KiB) Downloaded 241 times
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: Very large file size after the text recognition

Post by Dorwol »

Another example...

original file size: 528KB
after OCR with Acrobat: 825KB (300KB more than the original PDF)
after OCR with DLL: 1.23 MB (more than double the size of the original PDF)
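
To put the two samples side by side, the reported sizes can be turned into growth factors. This is only arithmetic on the numbers quoted in this thread: the helper name `growth` is made up for illustration, and "1,23 MB" is assumed here to mean 1.23 × 1024 KB.

```python
def growth(original_kb: float, after_kb: float) -> float:
    """Return how many times larger the output is than the input."""
    if original_kb <= 0:
        raise ValueError("original size must be positive")
    return after_kb / original_kb


# Sizes in KB as reported in this thread.
samples = {
    "sample 1": {"original": 114, "acrobat": 124, "ocr_dll": 817},
    # Assumption: "1,23 MB" is read as 1.23 * 1024 KB.
    "sample 2": {"original": 528, "acrobat": 825, "ocr_dll": 1.23 * 1024},
}

for name, s in samples.items():
    print(
        f"{name}: Acrobat x{growth(s['original'], s['acrobat']):.2f}, "
        f"OCR DLL x{growth(s['original'], s['ocr_dll']):.2f}"
    )
```

The gap between the two tools is much wider for the small file than for the larger one, so a fixed overhead or a different image encoding may both be involved; the numbers alone cannot tell which.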
Attachments
test2.zip
(2.5 MiB) Downloaded 248 times
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: Very large file size after the text recognition

Post by John - Tracker Supp »

Thanks - our OCR project manager (Walter) will review ASAP and come back.
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Very large file size after the text recognition

Post by Walter-Tracker Supp »

It depends mainly on the resolution used for OCRing; if it is higher than the resolution of the source documents, you'll get increased file size. The next version will have an option (much like the viewer currently has) to output only a text layer to the existing file or a copy of it (rather than creating a whole new file), and to downsample the images for output. Ensuring output file sizes are small is a very high priority.
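
As a rough illustration of why the OCR resolution matters, the sketch below estimates how the raw pixel count of a page image changes when it is re-rendered at a different DPI. It assumes the page images are resampled at the OCR resolution, and it ignores compression, so it says nothing exact about the final file size; `pixel_scale` is a hypothetical helper, not part of the SDK.

```python
def pixel_scale(source_dpi: float, ocr_dpi: float) -> float:
    """Factor by which the raw pixel count grows when a page image
    scanned at source_dpi is re-rendered at ocr_dpi.

    Both dimensions scale linearly with DPI, so the pixel count
    scales with the square of the ratio.
    """
    if source_dpi <= 0 or ocr_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return (ocr_dpi / source_dpi) ** 2


# Compare a few source resolutions against OCR at 300 DPI.
for source_dpi in (150, 200, 300):
    factor = pixel_scale(source_dpi, 300)
    print(f"{source_dpi} DPI source, 300 DPI OCR: x{factor:.2f} raw pixels")
```

When the source and OCR resolutions match, the factor is 1, which fits the 300DPI test further down in this post where the output is not larger than the input.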

You'll have to ask sales or support when the new SDK is scheduled for release. I cannot promise any new features in the current SDK as our development priorities are very heavily weighted towards the new releases, although this doesn't mean it's not possible.

I will have a meeting the first week of January to talk about getting a file size mitigating feature added if we can.

There is a flag for outputting only "invisible" (non-rendered) text to a PDF: "OCR_Image_SuppressOutput". You could do this, then merge pages (with the input PDF) afterwards.

On a test document scanned at 300DPI, then OCR'd at 300DPI, we get the following results:

input: littlehouse-300DPI.pdf 3,505KB
output: littlehouse-standardOCRtest-300DPI.pdf 3,320KB
Attachments
littlehouse-test.7z
(6.67 MiB) Downloaded 263 times
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Very large file size after the text recognition

Post by Walter-Tracker Supp »

By the way, the information in this thread might be useful if you decide to merge pages yourself:

https://forum.pdf-xchange.com/ ... put#p60520

The short version is "see page 340 of the PDF-ToolsV4SDK.pdf manual", which has an example of using the function PXCp_PlaceContents() to merge pages.
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: Very large file size after the text recognition

Post by Dorwol »

> The next version will have an option to output only
> a text layer to the existing file or a copy of it (rather than creating a whole new file),
> and to downsample the images for output.
OK, I think I understand. Then the existing bookmarks won't be deleted either. Correct? :idea:
see > https://forum.pdf-xchange.com/ ... 42&t=13937
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: Very large file size after the text recognition

Post by John - Tracker Supp »

Correct.
kman
User
Posts: 14
Joined: Thu May 08, 2008 8:15 pm

Re: Very large file size after the text recognition

Post by kman »

It seems there was a release of the OCR SDK on April 3, 2013.
Was the option to output only a text layer to the existing file incorporated?
If so, how do you call this function?

If not, is there another way of adding the OCR layer seamlessly via the PDF-XChange Viewer Pro?
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Very large file size after the text recognition

Post by Walter-Tracker Supp »

The option to output to an existing PDF is a feature of the Viewer but not directly available in the OCR SDK. The last release was primarily a bug fix.

You can, however, access the text and position results and place them yourself in an existing PDF if you wish.

Looking back over the thread I guess I've already mentioned this, but we also have an option to suppress image output (leaving visibly blank pages, containing invisible OCR text). To do this, pass the flag "OCR_Image_SuppressOutput" to the OCR_MakeSearchable() function, then save the PDF with OCR_Save(). The resultant PDF can then be merged with your input document to effectively provide the same outcome as the Viewer's option. Make sure you do not specify deskew (auto-rotate) when doing this, or there's a possibility the output positions will not match the input.

https://forum.pdf-xchange.com/ ... put#p60520
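
The two-step workaround described above can be sketched as follows. This is not SDK code: `run_ocr` and `merge_pages` are placeholder callables you would implement yourself around OCR_MakeSearchable()/OCR_Save() and PXCp_PlaceContents(), and their signatures here are invented for illustration. Only the flag name "OCR_Image_SuppressOutput" comes from this thread.

```python
from typing import Callable, Iterable

SUPPRESS_OUTPUT = "OCR_Image_SuppressOutput"  # flag name quoted in this thread


def add_text_layer(
    src_pdf: str,
    out_pdf: str,
    run_ocr: Callable[[str, str, frozenset], None],
    merge_pages: Callable[[str, str, str], None],
    extra_flags: Iterable[str] = (),
    deskew_flags: Iterable[str] = (),
) -> str:
    """Create a text-only OCR PDF, then merge it with the original.

    run_ocr(src, dst, flags) and merge_pages(original, text_layer, dst)
    are hypothetical wrappers supplied by the caller.
    Returns the path of the intermediate text-only PDF.
    """
    flags = frozenset(extra_flags) | {SUPPRESS_OUTPUT}

    # Deskew / auto-rotate may shift the text positions relative to the
    # original pages, so refuse any flag the caller marks as such.
    clash = flags & frozenset(deskew_flags)
    if clash:
        raise ValueError(f"deskew flags not allowed here: {sorted(clash)}")

    text_only_pdf = out_pdf + ".textlayer.pdf"
    run_ocr(src_pdf, text_only_pdf, flags)         # step 1: invisible text only
    merge_pages(src_pdf, text_only_pdf, out_pdf)   # step 2: overlay on original
    return text_only_pdf
```

Passing the two steps in as callables keeps the sketch independent of any particular language binding; the exact flag that counts as "deskew" in your setup has to be supplied by you via `deskew_flags`.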
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: Very large file size after the text recognition

Post by Dorwol »

Walter-Tracker Supp wrote:> The next version will have an option to output only
> a text layer to the existing file or a copy of it (rather than creating a whole new file),
> and to downsample the images for output.
Any news about "the next version"? I need these features very much!
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Very large file size after the text recognition

Post by Tracker Supp-Stefan »

Hi Dorwol,

I will check with Walter when he comes to work later today and we will post an update here shortly after.

Regards,
Stefan
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: Very large file size after the text recognition

Post by Dorwol »

? :?
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Very large file size after the text recognition

Post by Tracker Supp-Stefan »

Hi Dorwol,

Apologies - it seems I failed to follow up on this one properly. I will speak with Walter again today and we will post back here.

Regards,
Stefan
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: Very large file size after the text recognition

Post by Dorwol »

:P No Prob!
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Very large file size after the text recognition

Post by Walter-Tracker Supp »

We are providing this kind of functionality in the new SDK which will be out after the editor is finalized.

-Walter
mr_steph
User
Posts: 52
Joined: Tue Jul 22, 2008 8:18 am

Re: Very large file size after the text recognition

Post by mr_steph »

Hello,

I have the same problem with the demo version.
- My PDF: 38 pages, 2.4 MB
- The generated file: 2 pages, 1.1 MB

The options used are:
- OCR_Image_FastAutorotate (ImageFlags)
- OCR_Auto (RegionMode)
- 300 dpi

Is this related to the demo version?
163549.zip
My test file
(2.02 MiB) Downloaded 224 times
Thank you for your reply

Stéphane
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Very large file size after the text recognition

Post by Tracker Supp-Stefan »

Thanks for the post Stephane,

I will ask Walter to check this further.

Merry Christmas and
Happy New Year
Stefan
mr_steph
User
Posts: 52
Joined: Tue Jul 22, 2008 8:18 am

Re: Very large file size after the text recognition

Post by mr_steph »

Hello,

Do you have an answer?

Stephane
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Very large file size after the text recognition

Post by Tracker Supp-Stefan »

Hi Stephane,

I've once again requested an update from Walter.

Regards,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Very large file size after the text recognition

Post by Walter-Tracker Supp »

As I mentioned, we'll be providing better compression in the next major release. There are some constraints on the compression methods we use in the current release that can result in larger files in some cases.

Meanwhile I am investigating the file provided most recently in this thread - there may be some issues with the structure of this PDF (malformed) but we need to investigate it for a day or two. Either way I will report back shortly.
joost
User
Posts: 7
Joined: Mon Dec 05, 2011 9:16 pm

Re: Very large file size after the text recognition

Post by joost »

Hi guys,

I'm facing the same problem: a single-page PDF becomes three times its original size. An update to the SDK would be much appreciated. Is the updated SDK still in the pipeline?
Tracker Supp-Stefan
Site Admin
Posts: 17820
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Very large file size after the text recognition

Post by Tracker Supp-Stefan »

Hi joost,

Yes this is still being worked on, but Walter should be able to provide further details on when this new major release is expected.

Regards,
Stefan