PDF XChange Forum

Posted: **Mon Dec 17, 2012 2:25 pm**

Hi!

I like this library very much, but most of the support requests we receive are about the incredibly large file size after text recognition is performed. Often a 1MB PDF becomes 6MB or more (the same PDF run through Adobe Acrobat's text recognition stays at 1MB or sometimes less). That annoys many of our customers.

Will this problem be fixed in the next major version, and when will the next major version be released?

Thank you!!!

Posted: **Mon Dec 17, 2012 2:40 pm**

Hi,

Thanks, but this is not what we would expect, though some bloat is anticipated. Could you please zip an example and send it to us at support@pdf-xchange.com? I can confirm we do have improvements coming ...

thanks

Posted: **Mon Dec 17, 2012 3:06 pm**

I have attached a very small example for you.

Size of "test.pdf" is 114KB. After OCR with Acrobat the size is 124KB (only 10KB more).

But with the OCR DLL the result is 817KB, more than 7 times larger than the original file.

*I hope this small example already helps, because our other documents are too big and/or contain somewhat confidential content.

Posted: **Mon Dec 17, 2012 3:28 pm**

Another example:

- original file size: 528KB
- after OCR with Acrobat: 825KB (about 300KB more than the original PDF)
- after OCR with DLL: 1.23MB (more than double the size of the original PDF)
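
For readers comparing the two examples, here is a minimal sketch that turns the sizes quoted above into growth factors. The dictionary layout and the `growth_factor` helper are made up for illustration, and the MB-to-KB conversion is an assumption (1 MB = 1024 KB).

```python
# Sizes in KB as quoted in the two examples above.
# Assumption: the 1.23 MB figure is converted with 1 MB = 1024 KB.
EXAMPLES = {
    "test.pdf": {"original": 114, "acrobat": 124, "ocr_dll": 817},
    "second example": {"original": 528, "acrobat": 825, "ocr_dll": 1.23 * 1024},
}


def growth_factor(original_kb, output_kb):
    """Return how many times larger the output is than the original."""
    return output_kb / original_kb


for name, sizes in EXAMPLES.items():
    for tool in ("acrobat", "ocr_dll"):
        factor = growth_factor(sizes["original"], sizes[tool])
        print(f"{name}: {tool} output is {factor:.2f}x the original")
```

The factors only describe these two files; they say nothing about which settings produced them.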

Posted: **Mon Dec 17, 2012 5:31 pm**

Thanks - our OCR project manager (Walter) will review ASAP and come back.

Posted: **Mon Dec 17, 2012 6:23 pm**

It depends mainly on the resolution used for OCR; if it is higher than the resolution of the source documents, you'll get an increased file size. The next version will have an option (much like the Viewer currently has) to output only a text layer to the existing file or a copy of it (rather than creating a whole new file), and to downsample the images for output. Ensuring output file sizes are small is a very high priority.
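
To illustrate why the OCR resolution matters, here is a rough sketch of how the raw pixel count of a page grows with DPI. It assumes the page image is re-rasterized at the OCR resolution, and it ignores compression entirely, so it shows a trend rather than a prediction of file size. The function names and the US Letter page size are illustrative choices, not part of the SDK.

```python
def pixel_count(width_in, height_in, dpi):
    """Number of pixels in a page image of the given size (inches) and DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


def raster_growth(source_dpi, ocr_dpi):
    """Factor by which the pixel count changes when going from source_dpi to ocr_dpi."""
    return (ocr_dpi / source_dpi) ** 2


# Illustrative case: a US Letter page (8.5 x 11 in) scanned at 150 DPI
# and processed at 300 DPI.
source = pixel_count(8.5, 11, 150)
target = pixel_count(8.5, 11, 300)
print(source, target, raster_growth(150, 300))
```

Doubling the resolution quadruples the number of pixels, which is why matching the OCR resolution to the source scan tends to keep the output closer to the input size.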

You'll have to ask sales or support when the new SDK is scheduled for release. I cannot promise any new features in the current SDK as our development priorities are very heavily weighted towards the new releases, although this doesn't mean it's not possible.

I will have a meeting the first week of January to talk about getting a file size mitigating feature added if we can.

There is a flag for outputting only "invisible" (non-rendered) text to a PDF: "OCR_Image_SuppressOutput". You could do this, then merge pages (with the input PDF) afterwards.

On a test document scanned at 300DPI, then OCR'd at 300DPI, we get the following results:

input: littlehouse-300DPI.pdf 3,505KB
output: littlehouse-standardOCRtest-300DPI.pdf 3,320KB

Posted: **Mon Dec 17, 2012 6:30 pm**

By the way, the information in this thread might be useful if you decide to merge pages yourself:

https://forum.pdf-xchange.com/ ... put#p60520

The short version is "see page 340 of the PDF-ToolsV4SDK.pdf manual", which has an example of using the function PXCp_PlaceContents() to merge pages.

Posted: **Tue Dec 18, 2012 8:44 am**

> The next version will have an option to output only
> a text layer to the existing file or a copy of it (rather than creating a whole new file),
> and to downsample the images for output.

OK, I think I understand. Then the existing bookmarks will not be deleted either. Correct?

See https://forum.pdf-xchange.com/ ... 42&t=13937

Posted: **Wed Dec 19, 2012 1:18 pm**

Correct.

Posted: **Mon May 06, 2013 10:08 pm**

It seems there was a release of the OCR SDK on April 3, 2013.
Did the option to output only a text layer to the existing file get incorporated?
If so, how do you call this function?

If not, is there another way of adding the OCR layer seamlessly via PDF-XChange Viewer Pro?

Posted: **Mon May 06, 2013 10:12 pm**

The option to output to an existing PDF is a feature of the Viewer but not directly available in the OCR SDK. The last release was primarily a bug fix.

You can, however, access the text and position results and place them yourself in an existing PDF if you wish.

Looking back over the thread I guess I've already mentioned this, but we also have an option to suppress image output (leaving visibly blank pages, containing invisible OCR text). To do this, pass the flag "OCR_Image_SuppressOutput" to the OCR_MakeSearchable() function, then save the PDF with OCR_Save(). The resultant PDF can then be merged with your input document to effectively provide the same outcome as the Viewer's option. Make sure you do not specify deskew (auto-rotate) when doing this, or there's a possibility the output positions will not match the input.
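
As a small illustration of the deskew warning above, here is a sketch of a pre-flight check that could run before calling the SDK. It works on flag names as plain strings, not on the SDK's real constants, and the rule that any flag whose name contains "autorotate" or "deskew" counts as a rotation option is an assumption made for this example.

```python
SUPPRESS_OUTPUT = "OCR_Image_SuppressOutput"
# Assumption: rotation-related flags can be recognized by these name fragments.
ROTATION_HINTS = ("autorotate", "deskew")


def conflicting_flags(flag_names):
    """Return rotation-related flags that are combined with suppressed image output.

    An empty list means the combination is safe for the text-layer merge approach.
    """
    names = set(flag_names)
    if SUPPRESS_OUTPUT not in names:
        return []
    return sorted(
        name
        for name in names
        if any(hint in name.lower() for hint in ROTATION_HINTS)
    )


requested = ["OCR_Image_SuppressOutput", "OCR_Image_FastAutorotate"]
problems = conflicting_flags(requested)
if problems:
    print("Remove these flags before merging the text layer:", problems)
```

The idea is only to catch the combination early, since text positions from a rotated page may not line up with the pages of the untouched input PDF.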

https://forum.pdf-xchange.com/ ... put#p60520

Posted: **Thu Oct 03, 2013 10:39 am**

Walter-Tracker Supp wrote:
> The next version will have an option to output only
> a text layer to the existing file or a copy of it (rather than creating a whole new file),
> and to downsample the images for output.

Any news about "the next version"? We need these features very much!

Posted: **Thu Oct 03, 2013 11:18 am**

Hi Dorwol,

I will check with Walter when he comes to work later today and we will post an update here shortly after.

Regards,
Stefan

Posted: **Wed Oct 16, 2013 11:17 am**

Posted: **Wed Oct 16, 2013 11:31 am**

Hi Dorwol,

Apologies, it seems I failed to follow up on this one properly. I will speak with Walter again today and we will post back here.

Regards,
Stefan

Posted: **Wed Oct 16, 2013 11:35 am**

No Prob!

Posted: **Wed Oct 16, 2013 4:30 pm**

We are providing this kind of functionality in the new SDK which will be out after the editor is finalized.

-Walter

Posted: **Mon Dec 23, 2013 10:39 am**

Hello,

I have the same problem with the demo version.
- My PDF: 38 pages, 2.4MB
- The generated file: 2 pages, 1.1MB

The options used are:
- OCR_Image_FastAutorotate (ImageFlags)
- OCR_Auto (RegionMode)
- 300 dpi

Is this related to the demo version?

163549.zip

Thank you for your reply

Stéphane

Posted: **Mon Dec 23, 2013 10:43 am**

Thanks for the post Stephane,

I will ask Walter to check this further.

Merry Christmas and
Happy New Year
Stefan

Posted: **Fri Jan 10, 2014 10:51 am**

Hello,

Do you have an answer?

Stephane

Posted: **Fri Jan 10, 2014 3:09 pm**

Hi Stephane,

I've once again requested an update from Walter.

Regards,
Stefan

Posted: **Fri Jan 10, 2014 5:19 pm**

As I mentioned, we'll be providing better compression in the next major release. There are some constraints on the compression methods we use in the current release that can result in larger files in some cases.

Meanwhile I am investigating the file provided most recently in this thread - there may be some issues with the structure of this PDF (malformed) but we need to investigate it for a day or two. Either way I will report back shortly.

Posted: **Tue Feb 04, 2014 9:18 am**

Hi guys,

I'm facing the same problem: a single-page PDF becomes three times its original size. An update to the SDK would be much appreciated. Is the updated SDK still in the pipeline?

Posted: **Tue Feb 04, 2014 12:15 pm**

Hi joost,

Yes, this is still being worked on, but Walter should be able to provide further details on when this new major release is expected.

Regards,
Stefan

Very large file size after the text recognition
