Different outputs OCRp_PageText and OCRp_GetSymbolFromRegion

Post by **Gemini64** » Thu May 02, 2013 10:37 am

Hi,
I'm facing some problems using OCRp_GetSymbolFromRegion (from C#).

For example, in the OCRp_PageText result, OCR detected a minus symbol as code 97 instead of code 2D.
OK, that was a fax TIFF, which usually produces a series of detection errors.

But for the same symbol I got code 14 (Pi) as the result from OCRp_GetSymbolFromRegion.
Since the second of the two characters returned by the API is always \0 (the end-of-string marker),
there's no real chance of correcting this via Encoding.

Do you have a tip for this issue, please?

Post by **Tracker Supp-Stefan** » Thu May 02, 2013 11:37 am

Hello Gemini64,

Welcome to our forums. I've passed your question to our OCR lead developer and he will reply back here shortly.

Best,
Stefan

Post by **Walter-Tracker Supp** » Thu May 02, 2013 4:12 pm

Does OCRp_GetSymbolFromRegion() work in other cases for you, but not this one?

Can you provide a piece of sample code to reproduce this issue, as well as the input PDF?

Post by **Gemini64** » Fri May 03, 2013 4:26 am

Hi Walter,

Can you give me an email address where I can send the attachments?
These are confidential documents, so I can't post them publicly.

It seems as if the problem changes when the language changes.
I have now changed the language to PXO_Language.PXO_German,
so the former problems are gone, but similar ones have appeared.

But first, the process:

hResult = PDFXOCR_Funcs.OCR_Init ( out pdf, PDFXOCR_Funcs.key, PDFXOCR_Funcs.code );
hResult = PDFXOCR_Funcs.OCR_LoadW ( pdf, m_SourceFilename );
hResult = PDFXOCR_Funcs.OCR_GetNumInputPages ( pdf, out m_PagesCount );

For each page I call:
  hResult = PDFXOCR_Funcs.OCRp_Page ( pdf, page, ref Options, out pxoPage, out pxoRasterSettings );
  hResult = PDFXOCR_Funcs.OCRp_PageText ( pxoPage, out sPageText ); // => that goes to a dump file
  hResult = PDFXOCR_Funcs.OCRp_RegionCountFromPage ( pxoPage, out nRegionCount );
  For each region I call:
    hResult = PDFXOCR_Funcs.OCRp_GetRegionFromPage ( pxoPage, region, out pxoRegion );
    hResult = PDFXOCR_Funcs.OCRp_SymbolCountFromRegion ( pxoRegion, out nSymbolCount );
    For each symbol I call:
      hResult = PDFXOCR_Funcs.OCRp_GetSymbolFromRegion ( pxoRegion, symbol, out pxoSymbol );
      The symbol then goes into the dump file.
  OCRp_FreePage ( pxoPage )

OCR_Delete ( out pdf )

So far, so good, and it works fine.
The enclosed dump file contains:
lines 1 - 122: the output of OCRp_PageText
lines 127 - 1971: a dump of the regions with their symbols
lines 1976 - end: a dump of word-wise regions (that's the goal of this first step)

My problems are:
1. Mismatches between the full text from OCRp_PageText and the corresponding region:
a) The full text shows a question mark, meaning "I can't show this", e.g. dump file line 21 and the corresponding lines 265 back to 235 (region dump).
b) The full text shows completely different characters,
- e.g. dump file line 22 and the corresponding lines 305 back to 266 (region dump)
- e.g. dump file line 30 and the corresponding lines 455 back to 434 (region dump)

2. The full text shows additional characters that are not part of the region,
e.g. line 115 shows a double blank in the full text that is not defined by the region (lines 1778 to 1808).
There is neither a trailing blank after the preceding character nor a leading blank before the following one.

Since the OCR regions are not organized word by word, and the symbol rects inside a region sometimes overlap,
I need a way to get word-wise regions when I can't calculate them, e.g. from the gaps between the symbols.
So I try to match the page text against the regions and their symbols, which come in the same order.
Since I can't depend on a full match between full-text characters and symbol characters, I tried matching by character count.
That's why problem 2 now becomes an issue.
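
A minimal sketch of that count-based matching, under assumptions not confirmed in this thread: whitespace appears only in the page text and never as a symbol, and every non-whitespace character corresponds to exactly one symbol, in the same order. `WordAligner` and `GroupSymbolsIntoWords` are hypothetical names, not part of the SDK. Whitespace closes the current word without consuming a symbol, so a double blank does not shift the alignment.

```csharp
using System;
using System.Collections.Generic;

public static class WordAligner
{
    // Returns, for each word in pageText, the indices of the symbols forming it.
    // Assumption: one symbol per non-whitespace character, in the same order.
    public static List<List<int>> GroupSymbolsIntoWords(string pageText, int symbolCount)
    {
        var words = new List<List<int>>();
        List<int> current = null;
        int symbolIndex = 0;

        foreach (char c in pageText)
        {
            if (char.IsWhiteSpace(c))
            {
                // Whitespace ends the current word but consumes no symbol.
                current = null;
                continue;
            }
            if (symbolIndex >= symbolCount)
                break; // text is longer than the symbol stream: stop rather than guess

            if (current == null)
            {
                current = new List<int>();
                words.Add(current);
            }
            current.Add(symbolIndex++);
        }
        return words;
    }
}
```

The bounding box of a word would then be the union of the rects of its symbols, which does not depend on the gaps between overlapping symbol rects.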

Now it's your turn.

With kind regards and thanks
Robert

Post by **Tracker Supp-Stefan** » Fri May 03, 2013 8:38 am

Hi Robert,

Walter is in our Vancouver Island office (Pacific time zone), so he will follow up here when he gets to work, but in the meantime you can send the sample file to support@pdf-xchange.com and I will pass it to him.

Regards,
Stefan

Post by **Gemini64** » Fri May 03, 2013 11:30 am

Thank you, Stefan,
the documents are on the way.

Here there are no times. The days have 24 hours, in summertime they're longer

Regards
Robert

Post by **Tracker Supp-Stefan** » Fri May 03, 2013 1:14 pm

Hi Robert

Got the files and passed them to Walter.

Indeed - we are living in the "global village" nowadays

Cheers,
Stefan

Post by **Gemini64** » Fri May 03, 2013 1:30 pm

Thanks a lot,
so let's wait and see what comes out of the byte cracker kitchen

Regards
Robert

Post by **Walter-Tracker Supp** » Fri May 03, 2013 10:19 pm

Thanks, am looking at them now.

-Walter

Post by **Walter-Tracker Supp** » Fri May 03, 2013 10:44 pm

My suspicion is that this involves the encoding of the text (Unicode, UTF-8, ANSI, etc.). The text you receive from those functions is Unicode text, and you must make sure to use Unicode functions or do the correct conversion (e.g. to UTF-8). If you are outputting with ANSI text functions, these characters may be incorrectly represented (e.g. ligatures: OCR has incorrectly determined that the "ri" in Amtsgericht is the ligature "ﬁ", and this ligature character is part of Unicode but is not represented in typical ANSI code pages).

How do you output the text to the dump files? Do you perform any conversions, or do you make sure to use string output functions that can handle Unicode?
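
To illustrate the kind of loss described above, here is a small sketch that pushes a string through a single-byte code page and back. It uses ISO-8859-1 as a stand-in for an ANSI code page (an assumption made so the example runs without registering extra code page providers), and `AnsiLossDemo` is a hypothetical helper, not part of the SDK. Characters outside the code page, such as the "ﬁ" ligature (U+FB01), come back as "?", while German umlauts survive.

```csharp
using System;
using System.Text;

public static class AnsiLossDemo
{
    // Encodes text into a single-byte code page and decodes it again,
    // substituting '?' for anything the code page cannot represent.
    public static string RoundTripThroughLatin1(string text)
    {
        Encoding latin1 = Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        return latin1.GetString(latin1.GetBytes(text));
    }
}
```

For example, `RoundTripThroughLatin1("Amtsge\uFB01cht")` yields `"Amtsge?cht"`, which may be where a question mark in a dump file comes from.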

Meanwhile I'm working on reproducing it on my end by implementing a quick version of your algorithm.

Post by **Walter-Tracker Supp** » Fri May 03, 2013 11:02 pm

Just a note: a good way to work around this problem, if you don't want to tweak your code to deal with Unicode or UTF-8 handling, would be to apply specific whitelists that only contain ANSI / ASCII characters (e.g. "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()"), assuming this is compatible with the kinds of documents you are OCRing.

You can set whitelists in the PXO_Options structure you pass to OCRp_Page (do not set a blacklist - set it to an empty string or NULL).

You can use it explicitly as a string literal (as above), or you might want to write a loop to fill it with the visible characters from the ANSI code page you use on your system.

For example, for the standard Latin code page used in North America (not sure if this is identical to what you would use in Germany, perhaps not), visible characters include the numeric ranges:

0x20 to 0x7E (decimal 32 to 126) and 0x80 to 0xFF (decimal 128 to 255). You might write a loop to create a whitelist containing only these characters like so (pseudocode):

Code:

  string whitelist
  // append the character with code i for every code in both ranges
  for i in ranges (32-126 and 128-255)
    whitelist.append(i)


  PXO_Options opts;
  opts.whitelist = whitelist;

  OCRp_Page(pdf, opts, ...)

See here:

http://www.unicodetools.com/unicode/codepages.php
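
A hedged C# sketch of such a loop. It deviates from the ranges above in one respect, which is an assumption of this sketch: because .NET strings hold Unicode characters rather than code page bytes, it uses the ISO-8859-1 layout, where byte value and Unicode code point coincide, and therefore skips 0x80 to 0xA0 (control characters and the non-breaking space in that layout) and the soft hyphen 0xAD. `WhitelistBuilder` is a hypothetical helper; how the resulting string is assigned to PXO_Options depends on the SDK's C# declarations, which are not shown in this thread.

```csharp
using System.Text;

public static class WhitelistBuilder
{
    // Builds a whitelist of visible characters from the ISO-8859-1 range.
    public static string BuildLatin1Whitelist()
    {
        var sb = new StringBuilder();

        // Printable ASCII, including the space character.
        for (int code = 0x20; code <= 0x7E; code++)
            sb.Append((char)code);

        // Printable upper half of ISO-8859-1.
        for (int code = 0xA1; code <= 0xFF; code++)
        {
            if (code == 0xAD) continue; // soft hyphen is not a visible character
            sb.Append((char)code);
        }
        return sb.ToString();
    }
}
```

The result contains the German umlauts and "ß" but no ligatures or typographic dashes, so the OCR engine would have to pick a plain character in their place.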

Post by **Walter-Tracker Supp** » Fri May 03, 2013 11:23 pm

I have checked with my version of your sample code here, and the characters from OCRp_PageText() do indeed match those taken from OCRp_GetSymbolFromRegion(). I suspect, as stated, that this all relates to how you handle the Unicode strings and characters returned by these functions.

See the attached screenshot.

"Text Visualizer" contains the output from OCRp_PageText with the ligature "ﬁ" highlighted, and the red arrow points to the current symbol from OCRp_GetSymbolFromRegion (your region 10, symbol 3, in your attached dump file). Both are the "ﬁ" ligature.

Note that while this is an example of incorrect identification (unavoidable, since OCR is not 100% accurate), the root of the problem is not incorrect recognition but the fact that the output is Unicode, and there may be cases where extended Unicode characters are legitimately identified.
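
One way to make such a comparison independent of fonts and output encodings is to dump the numeric value of every character instead of the character itself. A small sketch (the helper name `CodePointDump` is hypothetical); it lists UTF-16 code units, which is what a .NET `string` stores:

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats each UTF-16 code unit of the text as U+XXXX, separated by spaces.
    public static string ToCodeUnits(string text) =>
        string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
}
```

Writing `CodePointDump.ToCodeUnits(...)` for both the page text and each symbol into the dump file shows directly whether the two functions return the same values, for instance `U+FB01` for the ligature.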

Post by **Gemini64** » Sat May 04, 2013 11:19 am

Hi Walter, thanks a lot for your investigations.

I have now done the following:
In PDFOCC_Funcs.cs, which comes with the SDK, I changed [DllImport("ocrtools")] to [DllImport("ocrtools", CharSet = CharSet.Unicode)].

After the call
hResult = PDFXOCR_Funcs.OCRp_PageText ( pxoPage, out sPageText );
I call
sPageText = DecodeUnicodeString(sPageText);

and after the call
hResult = PDFXOCR_Funcs.OCRp_GetSymbolFromRegion ( pxoRegion, symbol, out pxoSymbol );
I call
pxoSymbol.wcSymbol = DecodeUnicodeString ( pxoSymbol.wcSymbol );

Where DecodeUnicodeString is defined as:

  public String DecodeUnicodeString(String unicodeString)
  {
    Encoding def = Encoding.Default;
    Encoding unicode = Encoding.Unicode;

    // Convert the string into a byte[].
    byte[] unicodeBytes = unicode.GetBytes ( unicodeString );

    // Perform the conversion from one encoding to the other.
    byte[] defBytes = Encoding.Convert ( unicode, def, unicodeBytes );

    // Convert the new byte[] into a char[] and then into a string.
    // This is a slightly different approach to converting to illustrate
    // the use of GetCharCount/GetChars.
    char[] defChars = new char[def.GetCharCount ( defBytes, 0, defBytes.Length )];
    def.GetChars ( defBytes, 0, defBytes.Length, defChars, 0 );
    string defString = new string ( defChars );

    return defString;
  }

There are no changes in the results.
Usually Unicode uses two bytes to represent a char.

The struct OCR_SymbolBox in PDFOCC_Funcs.cs, which comes with the SDK, defines
[MarshalAsAttribute ( UnmanagedType.ByValTStr, SizeConst = 2 )]
public string wcSymbol;

But the second byte is always \0, so I think there's nothing to convert from.

I guess you are testing in a C++ environment.
Is it possible that information gets lost while marshalling a Unicode character into a const char[2] struct field,
where the last byte always needs to be \0?
That would explain the different results of PDFXOCR_Funcs.OCRp_PageText and OCRp_GetSymbolFromRegion in C#.
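
A sketch of how that guess could be checked, under clearly labeled assumptions. In .NET, the character type of a `ByValTStr` field is taken from the `CharSet` of the `StructLayout` attribute on the containing struct, not from the `CharSet` on `DllImport`. The struct below is a hypothetical stand-in for OCR_SymbolBox that models only the symbol field, and it assumes the native side stores the symbol as two 16-bit characters (symbol plus terminator); neither is confirmed in this thread. If the codes quoted in the first post are hexadecimal, the numbers would fit this reading: 97 is the em dash in Windows-1252, the em dash is U+2014, and the low byte of its UTF-16 form is 14.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical stand-in for the SDK struct; only the symbol field is modelled.
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct SymbolBoxSketch
{
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 2)]
    public string wcSymbol;
}

public static class MarshalDemo
{
    // Lays out native memory as two 16-bit characters { symbol, 0 }
    // and reads it back through the marshaller.
    public static string ReadSymbol(char symbol)
    {
        IntPtr mem = Marshal.AllocHGlobal(Marshal.SizeOf<SymbolBoxSketch>());
        try
        {
            Marshal.WriteInt16(mem, 0, symbol);   // the character itself
            Marshal.WriteInt16(mem, 2, (short)0); // terminating null
            return Marshal.PtrToStructure<SymbolBoxSketch>(mem).wcSymbol;
        }
        finally
        {
            Marshal.FreeHGlobal(mem);
        }
    }

    // What a reader expecting one byte per character would see first:
    // the low byte of the little-endian UTF-16 code unit.
    public static byte LowByte(char symbol) =>
        Encoding.Unicode.GetBytes(new[] { symbol })[0];
}
```

With the Unicode layout, `ReadSymbol('\u2014')` returns the em dash intact, while `LowByte('\u2014')` is `0x14`, the value a one-byte-per-character reading of the same memory would start with.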

Regards
Robert

Post by **Gemini64** » Tue Jun 04, 2013 12:49 am

Hi Walter,
could you find any evidence for my last guesses?

"
I guess you are testing in a C++ environment.
Is it possible that information gets lost while marshalling a Unicode character into a const char[2] struct field,
where the last byte always needs to be \0?
That would explain the different results of PDFXOCR_Funcs.OCRp_PageText and OCRp_GetSymbolFromRegion in C#.
"

Regards
Robert

Post by **Walter-Tracker Supp** » Tue Jun 04, 2013 4:13 pm

The encoded text is in UTF-8, which is a variable-width encoding (1 byte, the "backwards compatibility" mode, for ASCII chars; 2 to 4 bytes for non-ASCII Unicode characters). I'd make sure you're working with UTF-8 during your conversions.
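
A short sketch of what the variable width means in practice, using only the standard `System.Text.Encoding.UTF8` class (`Utf8Demo` is a hypothetical helper name). Note that non-ASCII characters can take more than two bytes: the "ﬁ" ligature (U+FB01) and the em dash (U+2014) take three each.

```csharp
using System.Text;

public static class Utf8Demo
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Length(string text) => Encoding.UTF8.GetByteCount(text);

    // Decodes a UTF-8 byte sequence into a .NET (UTF-16) string.
    public static string FromUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);
}
```

For example, `Utf8Length("-")` is 1, `Utf8Length("\u00E4")` (ä) is 2, and `Utf8Length("\uFB01")` is 3, so a fixed buffer sized for one or two bytes cannot hold every symbol the OCR engine may return.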
