Different outputs OCRp_PageText and OCRp_GetSymbolFromRegion
Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan
-
- User
- Posts: 13
- Joined: Thu May 02, 2013 10:21 am
Different outputs OCRp_PageText and OCRp_GetSymbolFromRegion
Hi,
I'm facing some problems using OCRp_GetSymbolFromRegion (from C#).
E.g. OCR detected a minus symbol as code 97 instead of code 2D within the OCRp_PageText result.
OK, that was a fax TIFF, which usually delivers a series of detection errors.
But I got code 14 (Pi) as the result from OCRp_GetSymbolFromRegion for the same symbol.
Since the second of the two characters returned by the API is always \0 for the end
of string, there's no real chance of a correction via Encoding.
Do you have a tip for this issue, please?
-
- Site Admin
- Posts: 17960
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Hello Gemini64,
Welcome to our forums. I've passed your question to our OCR lead developer and he will reply back here shortly.
Best,
Stefan
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Does OCRp_GetSymbolFromRegion() work in other cases for you, but not this one?
Can you provide a piece of sample code to reproduce this issue, as well as the input PDF?
-
- User
- Posts: 13
- Joined: Thu May 02, 2013 10:21 am
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Hi Walter,
can you send me an email address where I can send the attachments?
These are confidential docs, so I can't post them in public.
It seems as if the problem changes when changing the language.
Now I changed the language to PXO_Language.PXO_German,
so the former problems are gone, but similar others appeared.
But first the process:
hResult = PDFXOCR_Funcs.OCR_Init ( out pdf, PDFXOCR_Funcs.key, PDFXOCR_Funcs.code );
hResult = PDFXOCR_Funcs.OCR_LoadW ( pdf, m_SourceFilename );
hResult = PDFXOCR_Funcs.OCR_GetNumInputPages ( pdf, out m_PagesCount );
for each page I call
    hResult = PDFXOCR_Funcs.OCRp_Page ( pdf, page, ref Options, out pxoPage, out pxoRasterSettings );
    hResult = PDFXOCR_Funcs.OCRp_PageText ( pxoPage, out sPageText ); // => that goes to a dump file
    hResult = PDFXOCR_Funcs.OCRp_RegionCountFromPage ( pxoPage, out nRegionCount );
    for each region I call
        hResult = PDFXOCR_Funcs.OCRp_GetRegionFromPage ( pxoPage, region, out pxoRegion );
        hResult = PDFXOCR_Funcs.OCRp_SymbolCountFromRegion ( pxoRegion, out nSymbolCount );
        for each symbol I call
            hResult = PDFXOCR_Funcs.OCRp_GetSymbolFromRegion ( pxoRegion, symbol, out pxoSymbol );
            Now I put it into a dump file
    OCRp_FreePage ( pxoPage )
OCR_Delete ( out pdf )
So far, so good, and it works fine.
The enclosed dump file contains:
line 1 - 122, the output of OCRp_PageText
line 127 - 1971, dump of regions with their symbols
line 1976 - end, dump of wordwise regions (that's the goal in this first step)
My problems are:
1. mismatches between fulltext from OCRp_PageText and the corresponding region
a) fulltext shows a question mark for "I can't show this" e.g. dump file line 21 and corresponding line 265 back to 235 (region dump)
b) fulltext shows completely different characters,
- e.g. dump file line 22 and corresponding line 305 back to 266 (region dump)
- e.g. dump file line 30 and corresponding line 455 back to 434 (region dump)
2. Fulltext shows additional characters, that are not part of the region
e.g. line 115 shows a double blank in full text, that is not defined by the region (line 1778 to 1808)
There is neither a trailing blank behind the left char, nor a leading blank before the following char.
Since the OCR regions are not organized wordwise and the symbol rects inside a region sometimes overlap,
I need to find a way to get wordwise regions when I can't calculate them, e.g. from the gaps between the symbols.
So I try to match the page text against the regions and their symbols, which come in the same order.
Since I can't depend on a full match between fulltext chars and symbol chars, I tried it via the character count.
That's the reason why 2. now becomes a problem.
Now it's your turn
With kind regards and thanks
Robert
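The count-based matching described in this post could be sketched roughly as follows. This is a minimal illustration in Python, not SDK code: `group_symbols_into_words`, `page_text` and `symbols` are hypothetical names, and it assumes the symbol list contains no blanks while the page text does. Runs of blanks are treated as a single separator, so a double blank in the full text does not shift the count.

```python
def group_symbols_into_words(page_text, symbols):
    """Group symbol indices into words by walking the page text in parallel.

    page_text: full page text, including blanks and line breaks (hypothetical input).
    symbols:   recognised characters in the same order, assumed to contain no blanks.
    """
    words, current = [], []
    i = 0  # index of the next symbol to consume
    for ch in page_text:
        if ch.isspace():
            # Any run of whitespace closes the current word exactly once.
            if current:
                words.append(current)
                current = []
            continue
        if i >= len(symbols):
            break  # page text is longer than the symbol list
        # Match by position only; the characters themselves may differ.
        current.append(i)
        i += 1
    if current:
        words.append(current)
    return words


page_text = "Amtsgericht  Berlin\n- 12"      # note the double blank
symbols = list("AmtsgerichtBerlin-12")        # same characters, no blanks
words = group_symbols_into_words(page_text, symbols)
print(["".join(symbols[k] for k in w) for w in words])
# ['Amtsgericht', 'Berlin', '-', '12']
```

Each inner list holds the indices of the symbols that make up one word, so a word rectangle could be computed as the union of those symbols' rects.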
-
- Site Admin
- Posts: 17960
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Hi Robert,
Walter is in our Vancouver Island office (Pacific time zone), so he will follow up here when he gets to work, but in the meantime you can send the sample file to support@pdf-xchange.com and I will pass it to him.
Regards,
Stefan
-
- User
- Posts: 13
- Joined: Thu May 02, 2013 10:21 am
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Thank you, Stefan,
the documents are on the way.
Here there are no times. The days have 24 hours, in summertime they're longer
Regards
Robert
-
- Site Admin
- Posts: 17960
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Hi Robert
Got the files and passed them to Walter.
Indeed - we are living in the "global village" nowadays
Cheers,
Stefan
-
- User
- Posts: 13
- Joined: Thu May 02, 2013 10:21 am
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Thanks a lot,
so let's wait, what comes out of the byte cracker kitchen
Regards
Robert
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Thanks, am looking at them now.
-Walter
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
My suspicion is that this involves the encoding of the text (Unicode, UTF-8, ANSI, etc.). The text you receive from those functions is Unicode text, and you must make sure to use Unicode functions or do the correct conversion (e.g. to UTF-8). If you are outputting with ANSI text functions, these characters may be incorrectly represented (e.g. ligatures: OCR has incorrectly determined that the "ri" in Amtsgericht is a ligature for "fi", and this ligature character "ﬁ" (U+FB01) is part of Unicode and not represented in typical ANSI code pages).
How do you output the text to the dump files? Do you perform any conversions, or do you make sure to use string output functions that can handle Unicode?
Meanwhile I'm working on reproducing it on my end by implementing a quick version of your algorithm.
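As a small illustration of the effect described above (a Python sketch, not the SDK or the poster's dump code): writing the misrecognised word through an ANSI code page drops the ligature, while UTF-8 keeps it. Windows-1252 is assumed here as the ANSI code page.

```python
text = "Amtsge\ufb01cht"  # "ri" misread as the single ligature character U+FB01

# Windows-1252 has no slot for U+FB01, so the encoder substitutes "?".
ansi_bytes = text.encode("cp1252", errors="replace")
print(ansi_bytes)  # b'Amtsge?cht'

# UTF-8 can represent every Unicode character and round-trips cleanly.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("utf-8") == text)  # True
```

A question mark in the dump file where the region dump shows a different character would be consistent with this kind of substitution.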
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Just a note: a good way to work around this problem, if you don't want to tweak your code to deal with Unicode or UTF-8 handling, would be to apply specific whitelists that only contain ANSI / ASCII characters (e.g. "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()"), assuming this is compatible with the kinds of documents you are OCRing.
You can set whitelists in the PXO_Options structure you pass to OCRp_Page (do not set a blacklist - set it to an empty string or NULL).
You can use it explicitly as a string literal (as above), or you might want to write a loop to fill it with visible characters from the ANSI code page you use on your system.
For example, for the standard Latin code page used in North America (not sure if this is identical to what you would use in Germany, perhaps not), visible characters include the numeric ranges:
0x20 to 0x7E (decimal 32 to 126) and 0x80 to 0xFF (decimal 128 to 255). You might write a loop to create a whitelist containing only these characters like so (pseudocode):
Code:
string whitelist
for i in ranges (32-126 and 128-255)
    whitelist.append(i)
PXO_Options opts;
opts.whitelist = whitelist;
OCRp_Page(pdf, opts, ...)
See here:
http://www.unicodetools.com/unicode/codepages.php
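A runnable version of that loop might look like the sketch below. It is written in Python purely to show how the character set is built; `build_whitelist` is a hypothetical helper, and Windows-1252 is assumed as the code page (a few byte values in 0x80 to 0x9F are unassigned there and are skipped).

```python
def build_whitelist(codepage="cp1252"):
    """Collect the printable characters of a single-byte code page."""
    chars = []
    # Byte ranges from the post: 0x20-0x7E and 0x80-0xFF.
    for byte in list(range(0x20, 0x7F)) + list(range(0x80, 0x100)):
        try:
            ch = bytes([byte]).decode(codepage)
        except UnicodeDecodeError:
            continue  # byte value is unassigned in this code page
        if ch.isprintable():
            chars.append(ch)
    return "".join(chars)


whitelist = build_whitelist()
print(len(whitelist), "characters, e.g.", whitelist[:10])
```

The resulting string would then be passed to the SDK through PXO_Options; the exact member name is not shown in this thread, so check the SDK header before relying on `opts.whitelist` from the pseudocode.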
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
I have checked with my version of your sample code here and indeed characters do match up between that from OCRp_PageText() and that taken from OCRp_GetSymbolFromRegion(). I suspect, as stated, that this all relates to how you handle the unicode strings and characters returned by these functions.
See the attached screenshot.
"Text Visualizer" contains output from OCRp_PageText with the ligature "fi" highlighted, and the red arrow points to the current symbol from OCRp_GetSymbolFromRegion (your region 10, symbol 3, in your attached dump file). Both are the "fi" ligature.
Note that while this is an example of incorrect identification (unavoidable - OCR is not 100% accurate), the root of the problem is not due to incorrect recognition but the fact that the output is unicode and there may be cases where extended unicode characters are legitimately identified.
-
- User
- Posts: 13
- Joined: Thu May 02, 2013 10:21 am
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Hi Walter, thanks a lot for your investigations.
Now I did the following:
In PDFOCC_Funcs.cs, which comes with the SDK, I changed the [DllImport("ocrtools")] to [DllImport("ocrtools", CharSet = CharSet.Unicode)].
After the call hResult = PDFXOCR_Funcs.OCRp_PageText ( pxoPage, out sPageText );
I call
sPageText = DecodeUnicodeString(sPageText);
and after the call
hResult = PDFXOCR_Funcs.OCRp_GetSymbolFromRegion ( pxoRegion, symbol, out pxoSymbol );
I call
pxoSymbol.wcSymbol = DecodeUnicodeString ( pxoSymbol.wcSymbol );
Where DecodeUnicodeString is defined as:
public String DecodeUnicodeString(String unicodeString)
{
    Encoding def = Encoding.Default;
    Encoding unicode = Encoding.Unicode;
    // Convert the string into a byte[].
    byte[] unicodeBytes = unicode.GetBytes ( unicodeString );
    // Perform the conversion from one encoding to the other.
    byte[] defBytes = Encoding.Convert ( unicode, def, unicodeBytes );
    // Convert the new byte[] into a char[] and then into a string.
    // This is a slightly different approach to converting to illustrate
    // the use of GetCharCount/GetChars.
    char[] defChars = new char[def.GetCharCount ( defBytes, 0, defBytes.Length )];
    def.GetChars ( defBytes, 0, defBytes.Length, defChars, 0 );
    string defString = new string ( defChars );
    return defString;
}
There are no changes in the results.
Usually Unicode (as UTF-16) uses two bytes to represent a char.
The struct OCR_SymbolBox in PDFOCC_Funcs.cs, which comes with the SDK,
defines
[MarshalAsAttribute ( UnmanagedType.ByValTStr, SizeConst = 2 )]
public string wcSymbol;
But the second byte is always \0, so I think there's nothing to convert from.
I guess you are testing within the C++ environment.
Is it possible that information gets lost while marshalling a Unicode character into a const char 2 struct,
where the last byte always needs to be \0?
That would explain the different results of PDFXOCR_Funcs.OCRp_PageText and OCRp_GetSymbolFromRegion in C#
Regards
Robert
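One reading that would fit the values reported at the start of the thread (code 97 from OCRp_PageText, code 14 from OCRp_GetSymbolFromRegion) is that both calls recognised the same character, U+2014 (EM DASH), and that only the first byte of its wide-character form survives in wcSymbol. The Python sketch below only shows the byte arithmetic; that this is what actually happens during marshalling is an assumption, not something confirmed in this thread. In .NET, the character width of a ByValTStr field follows the CharSet of the StructLayout attribute on the containing struct rather than the CharSet on DllImport, so the declaration of OCR_SymbolBox may be worth checking.

```python
symbol = "\u2014"  # EM DASH, a plausible OCR result for a minus sign on a fax

# The symbol as a 2-byte wide character (UTF-16, little-endian).
wide = symbol.encode("utf-16-le")
print(wide.hex())     # 1420

# Hypothetical misread: the wide-char buffer is treated as single-byte text
# and only the first byte is kept.
misread = wide[:1]
print(misread.hex())  # 14

# The same symbol converted properly to the Windows-1252 ANSI code page.
ansi = symbol.encode("cp1252")
print(ansi.hex())     # 97
```

For a plain ASCII character such as "-" (U+002D) the wide form is 2D 00, so the first byte is already the right character and the second is \0, which would match the observation that the second byte is always \0.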
-
- User
- Posts: 13
- Joined: Thu May 02, 2013 10:21 am
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
Hi Walter,
could you find any evidence for my last guesses?
"
I guess you are testing within the C++ environment.
Is it possible that information gets lost while marshalling a Unicode character into a const char 2 struct,
where the last byte always needs to be \0?
That would explain the different results of PDFXOCR_Funcs.OCRp_PageText and OCRp_GetSymbolFromRegion in C#
"
Regards
Robert
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: Different outputs OCRp_PageText and OCRp_GetSymbolFromRe
The encoded text is in UTF-8, which is a variable-width encoding (1 byte "backwards compatibility" mode for ASCII chars, 2 to 4 bytes for non-ASCII Unicode characters). I'd make sure you're working with UTF-8 during your conversions.
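To make the variable width concrete, here is a short Python sketch (illustrative only, independent of the SDK) that prints the UTF-8 length of a few characters: an ASCII hyphen, a German umlaut, the ligature discussed earlier, and a character outside the Basic Multilingual Plane.

```python
samples = ["-", "\u00e4", "\ufb01", "\U0001F600"]

# Map each sample character to the number of bytes UTF-8 needs for it.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in widths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex()}")
```

A fixed 2-byte buffer is therefore not enough to hold every character as UTF-8; the ligature U+FB01 alone needs 3 bytes.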