PDF library crash

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and Imaging Tools and to provide the developer with an expanding set of professional tools for Optical Character Recognition tasks.


scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

PDF library crash

Post by scdawson »

Hello,

I'm getting a crash of the PDF library on a particular document (which I'll send over email due to privacy concerns).

In this case, the PDF in question already has a text layer (it looks like it's been OCRed already). Could that have something to do with it?

Although I would like to figure out the cause of the crash and fix, I'm not sure that in practice, I actually want to be running OCR on documents that already have text layers like this. What do you think is the best way to detect this situation?
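One possible heuristic, offered only as a sketch and not as anything the OCR SDK provides: a page very likely already has a text layer if its decoded content stream contains one of the text-showing operators defined by the PDF specification (`Tj`, `TJ`, `'` or `"`). The helper below assumes the content stream has already been decompressed with whatever PDF library is in use; `hasTextShowOperator` is a hypothetical name. It skips string literals, hex strings, names and comments so that text inside operands is not mistaken for an operator.

```cpp
#include <cctype>
#include <string>

// Heuristic sketch (hypothetical helper, not part of the OCR SDK):
// returns true if a decoded PDF content stream contains a text-showing
// operator (Tj, TJ, ' or ").
bool hasTextShowOperator(const std::string& s) {
    const std::string delims = "()<>[]{}/%";
    auto isWhite = [](char ch) {
        return ch == '\0' || std::isspace(static_cast<unsigned char>(ch)) != 0;
    };
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const char c = s[i];
        if (isWhite(c)) { ++i; continue; }
        if (c == '%') {                          // comment runs to end of line
            while (i < n && s[i] != '\n' && s[i] != '\r') ++i;
            continue;
        }
        if (c == '(') {                          // literal string, parentheses may nest
            int depth = 0;
            while (i < n) {
                if (s[i] == '\\') { i += 2; continue; }   // skip escaped character
                if (s[i] == '(') ++depth;
                else if (s[i] == ')' && --depth == 0) { ++i; break; }
                ++i;
            }
            continue;
        }
        if (c == '<' && i + 1 < n && s[i + 1] == '<') {   // dictionary start <<
            i += 2;
            continue;
        }
        if (c == '<') {                          // hex string <...>
            while (i < n && s[i] != '>') ++i;
            if (i < n) ++i;                      // consume the closing '>'
            continue;
        }
        if (c == '/') {                          // name object such as /F1
            ++i;
            while (i < n && !isWhite(s[i]) && delims.find(s[i]) == std::string::npos) ++i;
            continue;
        }
        if (delims.find(c) != std::string::npos) { ++i; continue; }  // [ ] > { } or stray )
        const std::size_t start = i;             // operator or number token
        while (i < n && !isWhite(s[i]) && delims.find(s[i]) == std::string::npos) ++i;
        const std::string tok = s.substr(start, i - start);
        if (tok == "Tj" || tok == "TJ" || tok == "'" || tok == "\"") return true;
    }
    return false;
}
```

Known gaps in this sketch: binary inline image data (between `ID` and `EI`) is not skipped, and text drawn inside Form XObjects is only found if those streams are scanned as well. A page flagged by this check could then be skipped instead of being passed to OCR_MakeSearchable().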

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: PDF library crash

Post by Tracker Supp-Stefan »

Hello Shaun,

Thanks for the file - we've got it and I've passed it to our devs. We will post here as soon as we have any news or comments.

Best,
Stefan
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: PDF library crash

Post by Lzcat - Tracker Supp »

Sorry, we cannot reproduce the crash with the latest build (199 at the moment).
Please try updating, and if the problem remains, please provide step-by-step instructions for reproducing it (ideally a test project with sources).
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

I have updated to the latest version of PDFSDK, and the problem still occurs.

I have emailed you the project I'm using (though it's just the command line test program, with some slight modifications). Here is what I did to reproduce:

* Installed PDFSCDPRO over my existing installation
* Restarted when asked
* Opened TEST_OCR.sln in VS2010
* Clean, rebuild
* Confirmed that the correct file is at C:\temp\ocr_test.pdf
* Started debugging (F5)
* Got the following output on the console:

Code:

Page 0, Rasterizing: 3.57%
Page 0, Auto-rotating: 7.14%
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Page 0, Running OCR: 10.71%
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Cancelled repeat of length 0 due to Joined
Page 0, Placing Image: 14.29%
Got an access violation at this line:

Code:

	hr = OCR_MakeSearchable(Doc, &Options, NULL);
The exact wording of the error I get is:

Code:

Unhandled exception at 0x771d15ee in TEST_OCR.exe: 0xC0000005: Access violation writing location 0x042ed000.
Thanks!

Shaun
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Any news on this?

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: PDF library crash

Post by Tracker Supp-Stefan »

Hello Shaun,

Seems like it was passed on to the wrong developer. We are now awaiting one of our OCR guys to check this topic and will advise hopefully a bit later today.

Best,
Stefan
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

:(. Thanks, Stefan.

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Hi Shaun,

Could you send the offending pdf to support@pdf-xchange.com again? We received a file from you which contained all the object files and executable code for the example, but I did not see the pdf.

We will look into this as soon as possible.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

We have reproduced the issue with your file, but have not been able to reproduce it with another test input file of similar page dimensions. We will continue to investigate and post updates.
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Interesting. This problem has occurred with other files also. If need be, I can probably get more examples. That's not trivial to do, but if it will help, let me know.

Thanks!

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

I don't think it will be necessary for now; are the issues the same? Same error message, etc? And similar input type (CAD drawings with text)?
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

I'm pretty sure the answer to both of those questions is yes.

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Okay, we will work on it and keep you updated.

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Thanks, Walter! If there's anything I can do to help, please let me know. We're doing a go-live starting the 18th, and we have to determine if we're going to back out this capability or not.
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

It looks like there are two unrelated issues here. The first is the message you report seeing, which does not indicate an error; it is just a rarely occurring debug message sent to the console that should be suppressed (and will be in the next release).

The second issue is the crash; we are now working on a fix for this but it is a separate bug related to large page sizes.

I will have an update on progress for you by end of day tomorrow.

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Interesting.

Thank you for the update, and I'll talk to you tomorrow!

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

The access violation results from resource exhaustion that was not properly trapped: a memory allocation fails because of the large page size, which needs more than 500 MB for the internal image representation at 300 DPI (and more than that during some stages of processing). A fix that makes OCR_MakeSearchable() return OCR_ERR_MEMALLOC as expected, instead of crashing, is in the next build. However, this does not solve the bigger problem of OCRing large document pages. There are a couple of possible ways to accomplish this, which we will work on, since you have brought to light an important (if not necessarily common) usage scenario, but it will likely take more than a couple of days to implement.
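The rough arithmetic behind a figure of that size can be sketched as follows. An uncompressed raster needs (width in inches × DPI) × (height in inches × DPI) × bytes per pixel. The page dimensions and the bytes-per-pixel value used here are assumptions for illustration; the thread states neither, and the SDK's internal pixel format is not documented here. `rasterBytes` is a hypothetical helper name.

```cpp
#include <cmath>
#include <cstdint>

// Estimated size in bytes of an uncompressed raster for a page of the given
// physical size, rendered at `dpi` with `bytesPerPixel` bytes per pixel.
// Hypothetical helper for back-of-the-envelope checks, not SDK API.
std::uint64_t rasterBytes(double widthIn, double heightIn, int dpi, int bytesPerPixel) {
    // Round partial pixels up so the estimate never undershoots.
    const auto w = static_cast<std::uint64_t>(std::ceil(widthIn * dpi));
    const auto h = static_cast<std::uint64_t>(std::ceil(heightIn * dpi));
    return w * h * static_cast<std::uint64_t>(bytesPerPixel);
}
```

For example, an assumed 36 × 48 inch drawing at 300 DPI with 3 bytes per pixel comes to 10800 × 14400 × 3 = 466,560,000 bytes for a single copy of the image, before any working buffers are counted.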

Meanwhile, there is a potential workaround, at least for extracting text: you could use the Fields capability to OCR smaller subsets of the page and extract plain text. Unfortunately, you will not be able to make a searchable PDF until we have added the capability to deal with very large pages.
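One way to choose those smaller subsets is to cut the page into a grid of overlapping tiles, so that a word straddling a tile boundary is still fully contained in at least one tile. This is a minimal sketch of the tiling step only; `Rect` and `tilePage` are hypothetical names, and how each rectangle is handed to the Fields capability depends on the SDK documentation, which is not reproduced here.

```cpp
#include <algorithm>
#include <vector>

// Pixel rectangle; right and bottom are exclusive.
struct Rect { int left, top, right, bottom; };

// Splits a width x height page into tiles of at most tileSize x tileSize
// pixels, with neighbouring tiles sharing `overlap` pixels.
// Returns an empty list for invalid arguments.
std::vector<Rect> tilePage(int width, int height, int tileSize, int overlap) {
    std::vector<Rect> tiles;
    if (width <= 0 || height <= 0 || tileSize <= 0 || overlap < 0 || overlap >= tileSize)
        return tiles;
    const int step = tileSize - overlap;
    for (int y = 0; y < height; y += step) {
        for (int x = 0; x < width; x += step) {
            tiles.push_back({x, y,
                             std::min(x + tileSize, width),
                             std::min(y + tileSize, height)});
            if (x + tileSize >= width) break;   // this tile already reaches the right edge
        }
        if (y + tileSize >= height) break;      // this row already reaches the bottom edge
    }
    return tiles;
}
```

Text recognized in the overlap strips will appear in two tiles, so the extracted plain text needs de-duplication afterwards.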

Hope this helps.

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Thanks for the update, Walter.

Bummer :(. Are there other workarounds that I might be able to employ in the meantime? For example, can I set the DPI of the internal image to 150 and be less likely to hit that limit? Can the issue be mitigated by increasing the available memory on the machine?

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: PDF library crash

Post by Tracker Supp-Stefan »

Hello Shaun,

Halving the DPI will certainly have a significant effect on the memory needed, and increasing the physical memory should also have its impact, so both are viable solutions if you can use them.

Best,
Stefan
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Thank you, Stefan.

Do you know when the new build (that fails gracefully rather than crashes in this situation) will be available?

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: PDF library crash

Post by Tracker Supp-Stefan »

Hi Shaun,

Walter would be able to reply better to this question, but looking at his earlier replies - probably a few days.

Will ask him to update the topic when he comes to work (it's still 8:00 AM in the Vancouver Island office).

Best,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

That fix will be available later today or tomorrow.

Reducing the input resolution will mitigate this issue to some extent; even going to 200 DPI will use less than half the memory of 300 DPI.
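The scaling behind that statement: raster memory grows with the square of the DPI, so 200 DPI needs (200/300)² ≈ 44% of the memory of 300 DPI, and 150 DPI needs 25%. The sketch below turns this into two small helpers with hypothetical names; it assumes memory is dominated by the raster and ignores fixed overheads.

```cpp
#include <cmath>

// Ratio of raster memory at newDpi relative to oldDpi (pixel count scales
// with the square of the resolution).
double memoryRatio(int newDpi, int oldDpi) {
    const double r = static_cast<double>(newDpi) / oldDpi;
    return r * r;
}

// Largest whole DPI whose estimated raster fits in budgetBytes, given the
// raster size measured or estimated at refDpi. Returns 0 for invalid input.
int maxDpiForBudget(double bytesAtRefDpi, int refDpi, double budgetBytes) {
    if (bytesAtRefDpi <= 0.0 || budgetBytes <= 0.0 || refDpi <= 0) return 0;
    return static_cast<int>(std::floor(refDpi * std::sqrt(budgetBytes / bytesAtRefDpi)));
}
```

As an illustration with an assumed budget: if a page needs 500 MB at 300 DPI and only 250 MB can be spared, `maxDpiForBudget(500e6, 300, 250e6)` gives 212, so 200 DPI would fit.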

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Are we still on track for this?

Thanks!

Shaun
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: PDF library crash

Post by Tracker Supp-Stefan »

Hello Shaun,

We just uploaded a fresh build of our SDK products less than two hours ago.
And Walter confirmed that the fix for your issue should be included in it.

Best,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Tracker Supp-Stefan wrote:Hello Shaun,

We just uploaded a fresh build of our SDK products less than two hours ago.
Not sure whether this one got in it - but Walter will confirm/deny in about 1/2 an hour.

Best,
Stefan
Yes, the fix is in this build: version 1.0.4 of ocrtools.dll (which you can check with "Properties" in Windows Explorer, etc.).

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Thank you for a quick turnaround on this. We have gotten the fix in place, and the OCR capability is back in test. We go live with the capability starting next week.

One further question. The performance for small page sizes is pretty good, but once we get into the bigger page sizes, the performance slows significantly (it can take minutes to OCR one file). Is there anything that can be done about that?

Thanks!

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

There are a number of improvements we plan to make in the next major release. Along with improved support for images that are too large for memory, we will likely see an improvement in performance as we tweak things like page segmentation and image auto-rotation analysis.

If you are certain that image auto-rotation is not necessary (i.e., no skew in the input documents), I would recommend turning that option off when OCRing particularly large images, as it incurs an unnecessary penalty in those cases where it is not needed.
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Great, thanks for the update, Walter.

Good point about the image rotation. I'll look into that and see what we can do about that.

Any idea on timeframe of the next major release?

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

At the moment I can't provide an exact timeframe for this as we have a number of competing development priorities, not the least of which is the new end-user tools which will also support OCR capabilities. However, I will say that one of our top OCR feature priorities (for both end-users and developers) coincides nicely with the handling of larger images, so this will not be a back burner issue by any means.

We really appreciate the work you have done to test the OCR SDK and report issues to us.

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Thanks, Walter!

My pleasure. The functionality has been getting a great deal of use in our testing environment, and it will get much more in production, so it will be a good trial by fire, I think.

Beyond the ability to process large files, the next thing that we really care about is training the system to improve results, so when we have a file that doesn't quite OCR correctly, we can have a human address the problems and drop it into a hopper which will improve the results over time. That's also an opportunity for us to increase performance, since if we know we are OCRing, say, CAD files, we can use a specific data directory for those types of files, which speeds things up quite a bit.

We've done quite a bit of research on our end on how to do that, and are looking into developing tools to facilitate that.

The reason I mention this is that if you ever wind up developing or looking at training tools for this platform, we might want to talk.

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

You bring up a good issue, and providing end-user and developer access to training is also one of our top priorities for feature enhancement. I can't give an exact timeline on this either, but we will be including it at some point in the near future.

Thanks again for the valuable feedback.

-Walter
NBachus
User
Posts: 31
Joined: Tue Oct 26, 2010 2:40 pm

Re: PDF library crash

Post by NBachus »

We are trying to process another .pdf that does not have OCR enabled, but keep having the OCR library crash on conversion. I am e-mailing the file to support@pdf-xchange.com and will reference this forum post in the e-mail.

Thanks,
Nathan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Thank you, we will look at it shortly.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

I have looked at your sample file but was unable to reproduce the problem. I have sent you an email with the output and requested further information to help diagnose this.
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Hello, support,

After a great deal of troubleshooting, we finally discovered that the proximate cause of the crash was some invalid error handling on our side. There is a deeper issue, though: there seems to be a memory leak in the OCR library, which eventually causes a conversion to fail (which would sometimes trigger our crash).

We've fixed our crash, but need you to check out the memory leak issue.

I will send an archive of the project that I'm using to reproduce the memory leak, along with the test file.

Thanks!

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

If you mean memory consumption during OCR with OCR_MakeSearchable(), this is because the created PDF is stored in memory as it is produced, so there is a limit at some point on the size of document you can process. We will be working on resolving this in the future (e.g. by writing to disk) but for the meantime it remains a limitation.

However, we also received your project in the support inbox and will investigate it shortly.

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Hello, Walter,

That does make sense, but it's not quite what I'm talking about.

It looks like even after OCR_Load or OCR_Delete is called, there is memory that is not being cleaned up. So, if you write a program to continuously process files, it will eventually fill up all of the available memory and fail. In the program that I sent you, on our test machine, 1 loop will use up about 200 MB of memory, which seems about right (100 for the in-memory bitmap for OCR, and 100 for the searchable PDF). After 5 iterations, we're using 1.5 GB.
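A simple way to quantify this kind of growth is to record the process memory after each loop iteration and compute the average increase per iteration once the first iteration has warmed things up. The sketch below works on recorded samples only; the helper names and the tolerance are hypothetical, and on Windows the samples could come from something like `GetProcessMemoryInfo`.

```cpp
#include <vector>

// Average growth in MB per iteration, measured from the sample taken after
// the first iteration to the sample taken after the last one.
double growthPerIteration(const std::vector<double>& usedMb) {
    if (usedMb.size() < 2) return 0.0;
    return (usedMb.back() - usedMb.front()) / static_cast<double>(usedMb.size() - 1);
}

// Flags a probable leak when memory keeps growing by more than toleranceMb
// per iteration even though every iteration releases its resources.
bool looksLikeLeak(const std::vector<double>& usedMb, double toleranceMb) {
    return growthPerIteration(usedMb) > toleranceMb;
}
```

With the two figures reported above (about 200 MB after one loop, about 1.5 GB after five), the growth is (1500 - 200) / 4 = 325 MB per iteration, far beyond any reasonable tolerance. The intermediate samples are not given in the thread and would have to be measured.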

Thanks!

Shaun
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

It obviously shouldn't be doing this; I will look into your example tomorrow morning (or tonight if I can). We have tested this functionality already and it should not leak memory when doing this, but I will have to examine the source.
NBachus
User
Posts: 31
Joined: Tue Oct 26, 2010 2:40 pm

Re: PDF library crash

Post by NBachus »

Walter,

I was just checking in to see if there was an update or an ETA on this.

Thanks,
Nathan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Hi Nathan,

It is unclear whether you are talking about the library crash issue scdawson reported, or your own file. I could not reproduce the issue with the file you provided - it worked fine. Without more information there's not much I can do.

The source of scdawson's reported memory leak has been discovered - it was the result of an uncommon internal state, and it will be resolved in a fix to be made available shortly (likely tomorrow morning). I will post an update here once it is available.

-Walter
NBachus
User
Posts: 31
Joined: Tue Oct 26, 2010 2:40 pm

Re: PDF library crash

Post by NBachus »

Thanks Walter.

The two are actually part of the same problem. We discovered the problem by using the file that I had submitted in the previous e-mail. We'll look forward to the update!

Thanks again,
Nathan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Oh, got it. Sorry, my misunderstanding ;)
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Any news?
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: PDF library crash

Post by Tracker Supp-Stefan »

Hello all,

Walter is coming to work in about 1 hour - and I will ask him to follow up on this topic when he does.

Best,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

Yes,

The new build will be ready in a couple of hours and will go online once our installer can be updated (most likely on Monday). The fix pertaining to this particular issue is complete and tested; however, there were a couple of other small changes that we want to finish testing before releasing. Nothing major, just an additional method of dealing with OCR templates that we decided was useful.

I will contact you via email once it is complete.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PDF library crash

Post by Walter-Tracker Supp »

I'm pleased to say that the new build (DLL Version 1.0.5) is ready and will be available on our website shortly. It dramatically improves memory handling and also fixes some other small issues (mostly relevant for Dutch users).

-Walter
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

Great news!

I just tried downloading the "live" version of the .dll, and it looks like that is still the 1.0.4 version. Am I doing something wrong?

Thanks!

Shaun
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: PDF library crash

Post by John - Tracker Supp »

Not up yet, Shaun ...

Please see the file date for the build. Once that is updated, you will know it's ready. I suspect that will be Monday sometime...
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com
scdawson
User
Posts: 43
Joined: Thu Oct 20, 2011 3:40 pm

Re: PDF library crash

Post by scdawson »

OK, thanks, John.

Shaun
John - Tracker Supp
Site Admin
Posts: 5219
Joined: Tue Jun 29, 2004 10:34 am
Location: United Kingdom

Re: PDF library crash

Post by John - Tracker Supp »

Pleasure Shaun :-)