I desire the best possible OCR result. And this post (https://forum.pdf-xchange.com/viewtopic.php?f=63&t=35943) seems to suggest that using an Accuracy of Auto is better than using High.
Could you please confirm if that is correct?
Thank you,
Patrick
OCR Accuracy: Auto vs High [Paperless Office]
Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Ivan - Tracker Software, Tracker Supp-Stefan
-
- User
- Posts: 25
- Joined: Wed Oct 13, 2021 5:43 am
- Location: Los Angeles, CA
Last edited by patrickm on Mon Jan 23, 2023 10:29 pm, edited 1 time in total.
-
- Site Admin
- Posts: 5219
- Joined: Tue Jun 29, 2004 10:34 am
- Location: United Kingdom
Re: OCR Accuracy: Auto vs High
Yes, that is correct. Some pre-analysis of the file is done in 'Auto' mode, which is not done when you simply select High. Although it may seem to defy logic, High will NOT always produce the best results, as many factors affect this.
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Best regards
Tracker Support
http://www.tracker-software.com
-
- User
- Posts: 25
- Joined: Wed Oct 13, 2021 5:43 am
- Location: Los Angeles, CA
Re: OCR Accuracy: Auto vs High
Got it. My intuitive concern with Auto was that it might prioritize speed over accuracy.
Maybe renaming it to "Prioritize Speed" and "Prioritize Accuracy" would be more accurate?
-
- Site Admin
- Posts: 1797
- Joined: Mon Jan 15, 2018 9:01 am
Re: OCR Accuracy: Auto vs High
Hi,
Thank you for your suggestion.
I will forward it to our team of developers for consideration.
Regards.
-
- User
- Posts: 252
- Joined: Fri Jun 23, 2017 1:47 am
Re: OCR Accuracy: Auto vs High
Without rehashing what I've already said here
https://forum.pdf-xchange.com/viewtopic.php?f=63&t=37455&p=158413#p158413
yes, I think rewording is required.
And I'm glad to see I'm not the only one who feels that the existing phrasing might seem to "defy logic"
Now, with regard to the "Auto" setting, I still don't know for sure what it does, but I think it is neither prioritising speed nor prioritising output quality (which is nominally always set to 'maximum', according to the above-linked thread).
Per the thread linked above, I am guessing that Auto will be a multi-step procedure:
- analyse input document to determine resolution, font sizes, imperfections (such as blur or speckle);
- categorise the quality of images in the input document;
- run the OCR with the so-called "Accuracy" of images in the input document set to the above-selected category.
Also, the output should be better than that of a user who manually chooses an inappropriate category, but should be either as good as or worse than the output for a user who manually chooses the most appropriate category.
Please confirm or correct.
And I furthermore suggest adding a brief (non-technical) description to the help page https://help.pdf-xchange.com/pdfxe9/ocr-pages_ed.html — currently "Auto" is not mentioned at all.
—DIV
-
- Site Admin
- Posts: 6903
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: OCR Accuracy: Auto vs High
Seeing that there is a lot of discussion on the forums about this, we will be having a conversation about it internally.
It's not a super high priority so will likely be brought up at a regular development meeting. I am mentioning that here because I don't expect a decision today.
We'll have to see how this discussion pans out in the next few weeks.
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
-
- Site Admin
- Posts: 2353
- Joined: Thu Jun 30, 2005 4:11 pm
- Location: Canada
Re: OCR Accuracy: Auto vs High
Hi DIV.
DIV wrote:
"Per the thread linked above, I am guessing that Auto will be a multi-step procedure:
- analyse input document to determine resolution, font sizes, imperfections (such as blur or speckle);
- categorise the quality of images in the input document;
- run the OCR with the so-called "Accuracy" of images in the input document set to the above-selected category.
So this couldn't be faster than specifying the input image quality yourself."
Not exactly. The OCR works with raster images, and only with raster images, while a PDF page may contain multiple raster images, text, and graphics. So the application must 'convert' the PDF page to a corresponding raster image, recognize it, and then apply the OCR result back to the existing PDF content on the page.
At the moment, Accuracy=Auto means that the application has permission to try to OCR the existing images on the pages directly, provided that all of the following hold:
1. The PDF page contains only a single raster image. This is the typical situation with 'scanned PDFs', which result from simply scanning paper documents.
2. That single image has enough resolution: at least 300 dpi (a value often used for scanning).
3. That single image isn't distorted too much by an advanced geometrical transformation: only rotation and scaling are allowed, but not sloping, for example.
Otherwise, when:
Accuracy≠Auto
or
any condition above isn't met, the application might decide to rasterize the whole PDF page and use the resulting image in the recognition process. This additional rasterization might (and in most cases will) reduce the performance of the recognition process, in general terms.
And when:
Accuracy=High - it forces the application to ensure 600 (±50) dpi for each image that will be processed by OCR
Accuracy=Medium - it forces the application to ensure 400 (±50) dpi for each image that will be processed by OCR
Accuracy=Low - it forces the application to ensure 150 (±50) dpi for each image that will be processed by OCR
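For readers who find code easier to follow, the rules above can be sketched roughly as below. This is only an illustration of the description in this post, not Tracker's actual implementation: the names PageInfo and plan_ocr_input, the field names, and the returned labels are all hypothetical, and the post does not say which resolution is used when Auto falls back to rasterizing the page, so the sketch leaves that as None.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Target resolutions quoted in the post (each given as +/- 50 dpi).
TARGET_DPI = {"High": 600, "Medium": 400, "Low": 150}
# Minimum resolution the post gives for OCR-ing a scanned image directly.
AUTO_MIN_DPI = 300


@dataclass
class PageInfo:
    """Hypothetical summary of one PDF page (not a PDF-XChange type)."""
    single_image_only: bool        # condition 1: page holds one raster image and nothing else
    image_dpi: float               # condition 2: resolution of that image
    only_rotated_or_scaled: bool   # condition 3: no sloping or other advanced transform


def plan_ocr_input(accuracy: str, page: PageInfo) -> Tuple[str, Optional[int]]:
    """Return (action, target_dpi) following the rules described in the post."""
    if accuracy == "Auto":
        if (page.single_image_only
                and page.image_dpi >= AUTO_MIN_DPI
                and page.only_rotated_or_scaled):
            # All three conditions hold: the existing image may be used as-is.
            return ("use_existing_image", None)
        # A condition failed: the page might be rasterized.
        # The post does not state the resolution used in this case.
        return ("may_rasterize_page", None)
    # High / Medium / Low: the application ensures the quoted resolution.
    return ("may_rasterize_page", TARGET_DPI[accuracy])


if __name__ == "__main__":
    scanned = PageInfo(single_image_only=True, image_dpi=300, only_rotated_or_scaled=True)
    print(plan_ocr_input("Auto", scanned))
    print(plan_ocr_input("High", scanned))
```

The point of the sketch is that Auto is not a fourth quality level: it is a permission to skip rasterization when the page is already a clean scan.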
Cheers.
Vasyl Yaremyn
Tracker Software Products
Project Developer
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
-
- User
- Posts: 252
- Joined: Fri Jun 23, 2017 1:47 am
Re: OCR Accuracy: Auto vs High
Thanks, Vasyl, for a detailed technical insight into what the various configurations would yield!
Just to confirm, when so-called Accuracy is not set to Auto, then does it mean that images (maybe of various resolutions) on the page that are below the implied threshold resolutions (e.g. ~400 dpi for "Medium") would be resampled to increase their resolution to the specified level?
It kind of seems to me now that these settings are still inherently about setting the OCR analysis.
Reviewing the phrasing in the GUI is still worthwhile.
Following my current understanding, other options for the dialogue box phrasing (besides my previous suggestions) could therefore be something like:
- No upsampling [replaces "Auto"]
- Upsample to 150 dpi minimum [replaces "Low"]
- Upsample to 400 dpi minimum [replaces "Medium"]
- Upsample to 600 dpi minimum [replaces "High"]
Note: I'm assuming above that downsampling doesn't occur.
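As a rough illustration of the labels proposed above, the arithmetic could look like the sketch below. It rests entirely on the unconfirmed assumptions in this post: that images below the threshold are upsampled to it, that images already above it are left alone, and that Auto does no resampling. The name upsample_factor is hypothetical and is not part of PDF-XChange.

```python
# Minimum resolutions implied by the proposed labels (values taken from the thread).
MIN_DPI = {"Low": 150, "Medium": 400, "High": 600}


def upsample_factor(current_dpi: float, setting: str) -> float:
    """Scale factor for an image, assuming upsampling only (never downsampling)."""
    if setting == "Auto":
        # Proposed label: "No upsampling".
        return 1.0
    # Scale up to the minimum if needed; leave higher-resolution images unchanged.
    return max(1.0, MIN_DPI[setting] / current_dpi)


if __name__ == "__main__":
    print(upsample_factor(200, "Medium"))  # 200 dpi image, 400 dpi minimum
    print(upsample_factor(720, "High"))    # already above the 600 dpi minimum
```

Under these assumptions, a 200 dpi image with the "Medium" setting would be scaled by a factor of 2, while a 720 dpi image with "High" would be left unchanged.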
—DIV
-
- Site Admin
- Posts: 17960
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: OCR Accuracy: Auto vs High
Hello DIV,
Thanks for getting back to us!
Indeed we are still continuing discussion on this internally in our team (but off the forums). If a decision is made for any changes here - we might include those in a future build.
Kind regards,
Stefan