OCR of pdf and pictures

Post by **crimsonlogic** » Sat Jan 16, 2016 1:51 am

We bought Pro SDK license under CrimsonLogic Pte Ltd.

I have 3 problems now while doing OCR in my WPF application.

1) I am not able to OCR pdf with 17 pages and above.

2) I notice that some successfully OCRed files have text overlaid as in attached screenshot. How can I fix it?

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));

Please help to advise. Thank you very much.

Post by **John - Tracker Supp** » Mon Jan 18, 2016 2:11 pm

Hi,

Can we please keep all OCR-related questions in one forum, or in email? You are posting in multiple forums and also sending emails, which is not helpful: it divides the effort to assist you, because we first have to check whether an item has already been answered in email or in another forum.

I will move this one, and any others, to the OCR forum so we can address them all logically. Thank you.

Post by **John - Tracker Supp** » Mon Jan 18, 2016 2:17 pm

RE: Questions;

1) I am not able to OCR pdf with 17 pages and above.

Please advise which version of our products is being used and the spec of the hardware (processor, drive space, RAM, OS). Also, please provide an example of the PDF being OCR'd. Could it be that you are running out of resources? Perhaps try breaking the job into 'chunks'.

2) I notice that some successfully OCRed files have text overlaid as in attached screenshot. How can I fix it?

Please supply before/after PDF files for us to analyse along with a snippet of the code you are using for this specific task.

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));

I have asked a colleague to help and advise on this specifically...

Post by **crimsonlogic** » Tue Jan 19, 2016 2:14 am

Hi John,

1) I am not able to OCR pdf with 17 pages and above.
>> We bought the license of PDF Xchange PRO SDK
>> On your website it shows
**NEW OCR Module Included** - Now includes PDF-X OCR SDK Module for converting image based PDF files to fully text searchable PDF files at no charge. For more information on this exciting new module and usage requirements for the free new add-on please visit our PDF-X OCR SDK Module page
>> We are using this PDF-X OCR SDK.
>> machine : 8 GB ram, I7, 64Bit OS.
>> Attached the pdf of 17 pages where you can try to OCR and update us on the outcome.
>> (please note that this 17 pages PDF was converted from word doc as your forum does not allow upload)
>> (let us know if you need the word copy to email to you.)
>> please see the code below.

2) I notice that some successfully OCRed files have text overlaid as in attached screenshot. How can I fix it?
>> Attached the pdf for your investigation. Please go through the pdf to see the issue.
>> ( Provide the program file on the OCR code)

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));
>> This we will wait for your feedback.

>> The code for OCR pdf.
private string ConvertPDFToOCR(string m_SourceFilename, string m_DestFilename, string language)
{
    string result = "OK";
    IntPtr pdf;
    int hResult;
    string OCRretcode;
    int m_DPI;
    string m_Datapath = Path.GetDirectoryName(Assembly.GetExecutingAssembly().GetName().CodeBase).Replace("file:\\", "") + @"\OCRLanguages\";

    PDFXOCR_Funcs.PXO_Language m_Language = (PDFXOCR_Funcs.PXO_Language)Array.IndexOf(PDFXOCR_Funcs.OCR_LangFullArrayW, language); //GetOCRLanguage(language);

    string langinit = PDFXOCR_Funcs.OCR_LangArrayW[Array.IndexOf(PDFXOCR_Funcs.OCR_LangFullArrayW, language)];

    // Check if language file exists
    string langfile = m_Datapath + @"ocrdats\" + langinit + "_pxvocr.dat";// m_Datapath + @"ocrdats\eng_pxvocr.dat"; //OCR Language file

    // string err = string.Empty;

    try
    {
        if (!System.IO.File.Exists(langfile))
        {
            result += "Language File Missing";
        }
        m_DPI = 200; //quality of OCR

        string regkey = "XXXXXXXXXXXXXXXXXXXXXXX";
        string devcode = "XXXXXXXXXXXXXXXXXXXXXXX";

        //string key = "YOUR PRODUCT KEY";
        //string code = "YOUR DEVELOPER CODE";
        hResult = PDFXOCR_Funcs.OCR_Init(out pdf, regkey, devcode);

        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "OCR Initialization failure.";
        }

        hResult = PDFXOCR_Funcs.OCR_SetCallback(pdf, thecallback, 0);

        hResult = PDFXOCR_Funcs.OCR_LoadW(pdf, m_SourceFilename);
        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "Error loading file: \n" + m_SourceFilename + "OCR Library Error";
        }

        PDFXOCR_Funcs.PXO_Options Options = new PDFXOCR_Funcs.PXO_Options();
        Options.blacklist = string.Empty;
        Options.whitelist = string.Empty;
        Options.raster_dpi = m_DPI;
        Options.ImageFlags = (uint)PDFXOCR_Funcs.OCR_ImageProcessingFlags.OCR_Image_FastAutorotate;
        Options.DataPath = m_Datapath;
        Options.lang = m_Language;
        Options.RegionMode = PDFXOCR_Funcs.OCR_RegionMode.OCR_Auto;
        Options.reserved = 0;

        IntPtr pxoPagelist = IntPtr.Zero; // null pointer passed to OCR_MakeSearchable() will result in all pages being OCRd.

        hResult = PDFXOCR_Funcs.OCR_MakeSearchable(pdf, ref Options, pxoPagelist);

        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "Error running searchable.\nError code: " + hResult.ToString();
        }
        else
        {
            OCRretcode = hResult.ToString();
        }

        hResult = PDFXOCR_Funcs.OCR_SaveW(pdf, m_DestFilename);
        if (PDFXOCR_Funcs.IS_DS_FAILED(hResult))
        {
            result += "Error saving output PDF file.\nError code: " + hResult.ToString();
        }
        PDFXOCR_Funcs.OCR_Delete(out pdf);
    }
    catch (Exception ex)
    {
        //throw ex;
        result += "[EXCEPTION]" + ex.GetType();
        result += "[EXCEPTION]" + ex.Message;
        result += "[EXCEPTION]" + ex.StackTrace;
        //Dispose();
        //result += "Disposed OCRHelper class";
    }
    return result;
}

>> The code of Convert Word to PDF
private bool ConvertToPDF(string pdfpath, string inputfile)
{

    bool isDone = false;
    PXCComLib5.CPXCPrinter PDFPrinter;
    PXCComLib5.CPXCControlEx prnFactory = new PXCComLib5.CPXCControlEx();
    string regkey = "XXXXXXXXXXXX";
    string devcode = "XXXXXXXXXXXX";
    PDFPrinter = (PXCComLib5.CPXCPrinter)prnFactory.get_Printer("", "PDF-XChange Printer 2012", regkey, devcode);
    PDFPrinter.Option["Save.ShowSaveDialog"] = false;
    PDFPrinter.Option["Save.RunApp"] = false;
    PDFPrinter.Option["Save.Path"] = pdfpath;
    PDFPrinter.Option["Save.WhenExists"] = 1; //overwrite

    PDFPrinter.SetAsDefaultPrinter();

    System.Diagnostics.Process printJob = new System.Diagnostics.Process();
    printJob.StartInfo.FileName = inputfile;
    printJob.StartInfo.UseShellExecute = true;
    printJob.StartInfo.Verb = "print";
    printJob.StartInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
    printJob.Start();
    printJob.WaitForExit();
    isDone = true;
    return isDone;
}

Tue Jan 19, 2016 7:41 am

Hi.

3) When I convert image to pdf, the image size is quite small compared to original image. Where can I change the image size?
I’ve played around with the last 2 values in below line but I couldn’t manage to make the image bigger in pdf file.
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(3), Common.I2L(2));

If you read the help for the PXC_PlaceImage function, you can see that the last two parameters specify the width and height of the image in points (1/72 inch). I cannot see the code of your I2L function, so I cannot say why you are getting such small images: either there is an error in I2L, or the values 3 and 2 are simply too small.
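
As a rough illustration of the unit conversion only (a sketch, not SDK code): if you know the pixel dimensions and the resolution of the source image, the size in points follows directly. `PixelsToPoints` is a hypothetical helper name, and the 200 dpi figure is just an example value.

```csharp
using System;

static class ImageSizeSketch
{
    // Hypothetical helper, not part of the PDF-XChange SDK:
    // converts a pixel length to PDF points (1 point = 1/72 inch).
    public static double PixelsToPoints(int pixels, double dpi)
    {
        if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));
        return pixels / dpi * 72.0;
    }

    static void Main()
    {
        // Example: a 1700 x 2200 pixel scan at 200 dpi is 8.5 x 11 inches,
        // i.e. 612 x 792 points.
        double widthPt = PixelsToPoints(1700, 200.0);
        double heightPt = PixelsToPoints(2200, 200.0);
        Console.WriteLine(widthPt + " x " + heightPt + " points");
    }
}
```

Values computed this way would be the candidates for the last two arguments of PXC_PlaceImage, provided your page is large enough to hold them.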
HTH.

Tue Jan 19, 2016 7:51 am

Hello crimsonlogic,

As for the error code: it is OCR_ERR_INVALID_DICT_PATH, meaning that you gave the wrong path to the dictionary folder.

Use these functions to investigate problems in future:

Code:

OCRCORE_API LONG OCR_API OCRE_Err_FormatSeverity(HRESULT errorcode, LPSTR buf, LONG maxlen);
OCRCORE_API LONG OCR_API OCRE_Err_FormatFacility(HRESULT errorcode, LPSTR buf, LONG maxlen);
OCRCORE_API LONG OCR_API OCRE_Err_FormatErrorCode(HRESULT errorcode, LPSTR buf, LONG maxlen);

HTH,
Alex

Post by **crimsonlogic** » Tue Jan 19, 2016 9:14 am

Hi Sasha,

Sorry, I don't quite understand. Which error code are you referring to?

Thanks

Tue Jan 19, 2016 9:32 am

Hello crimsonlogic,

It's the error code that you've asked about: -2113263855 (0x820A2711).
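
For reference, the decimal and hexadecimal forms are the same 32-bit value. A minimal C# sketch of the conversion is below; `ErrorCodeSketch` and its methods are made-up names for illustration, not part of the SDK, and treating a negative value as a failure assumes the usual Windows HRESULT convention. The SDK's own formatting functions remain the reliable way to get the actual message.

```csharp
using System;

static class ErrorCodeSketch
{
    // Formats a signed 32-bit result code in the hex form used in this thread.
    public static string ToHex(int code)
    {
        return "0x" + unchecked((uint)code).ToString("X8");
    }

    // Assumption: standard HRESULT convention, where the top bit marks a failure,
    // so every failure code is negative when stored in a C# int.
    public static bool LooksLikeFailure(int code)
    {
        return code < 0;
    }

    static void Main()
    {
        Console.WriteLine(ToHex(-2113263855));            // 0x820A2711
        Console.WriteLine(LooksLikeFailure(-2113263855)); // True
    }
}
```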

HTH

Tue Jan 19, 2016 9:04 pm

By the way, it would be better if you could provide a small sample project (with your DLLs included) in which the problems occur, along with a guide on how to reproduce them. Then we could help you more efficiently, because right now we have many questions that a working project would answer.

Post by **crimsonlogic** » Wed Jan 20, 2016 9:23 am

Hi Sasha,

We will email a sample program and documents for you to try out to support@pdf-xchange.com, because of the file size limit on attachments in this forum. We will send them in 2 separate emails. Thanks for your help.

Post by **crimsonlogic** » Wed Jan 20, 2016 9:30 am

Hi Sasha,

We tried to send you the programs and sample files via email, but sending failed because of the file size. Is there another way to deliver our files? Thanks.

Post by **John - Tracker Supp** » Wed Jan 20, 2016 9:44 am

How big are the attachments ?

Post by **crimsonlogic** » Wed Jan 20, 2016 9:51 am

Program file is about 25MB and sample files are about 4MB after zipping

Wed Jan 20, 2016 9:58 am

Please post them to google drive or dropbox and give us a link.

Cheers,
Alex

Post by **crimsonlogic** » Wed Jan 20, 2016 10:32 am

Hi Sasha,

Our client is a government agency, and they prohibit us from uploading their code to the cloud due to security concerns.

Please provide a secure repository where we can upload the files. Thank you very much.

Wed Jan 20, 2016 10:40 am

Hello crimsonlogic,

Maybe you can upload the files to our ftp server?
You can find the details for it here:
https://www.pdf-xchange.com/knowledgebase/321
However, as the FTP is open to anyone, we recommend that you password-protect the uploaded files and then send us the password, e.g. via e-mail to support@pdf-xchange.com

Regards,
Stefan

Post by **crimsonlogic** » Thu Jan 21, 2016 3:30 am

Hi Stefan,

Thank you for your reply. We have uploaded the files and sent the password by email.

Thu Jan 21, 2016 7:30 am

Hello crimsonlogic,

Thanks for the sample - we'll look at it.

Post by **crimsonlogic** » Fri Jan 22, 2016 1:36 am

Hi Sasha,

Any updates??

Thanks

Fri Jan 22, 2016 11:12 am

Hello crimsonlogic,

Looking at the files in your media.zip, this is what we have found so far:
DWC.pdf had already been OCR'd by an external converter (libtiff / tiff2pdf - 2.3.606.0), which added an overlay of invisible text.

When this file is OCR'd, that text becomes visible, and the background image plus this text goes through our OCR engine. You therefore end up with the visible text (top-aligned in your example) and the OCR'd background image with invisible text on top of it. This new text will of course be corrupted wherever the image was overlaid with the previously invisible text.

HTH,
Alex

Post by **crimsonlogic** » Mon Jan 25, 2016 8:21 am

Hi Sasha,

Is it possible to tell whether a file has already been OCR'd when it passes through the PDF-XChange SDK?

Any updates on the other issue?

Thanks
fya

Mon Jan 25, 2016 8:28 am

Hello crimsonlogic,

Maybe it would be better to look at the PDF generator and its options, so that it does not generate any text?

By the other problem, do you mean the 17-page problem?

Cheers,
Alex

Post by **crimsonlogic** » Tue Jan 26, 2016 2:45 am

HI Sasha,

yes, we need the solution of the 17 pages error.

Thanks
fya

Post by **crimsonlogic** » Tue Jan 26, 2016 2:52 am

HI Sasha,

I don't understand your statement:

Maybe it's better to look at the pdf generator and it's options so that it won't generate any text?

The program we sent performs the OCR that causes the overlay. What do you mean by the PDF generator?

The other issue concerns a Word file that is converted to PDF and then OCR'd.
The conversion to PDF has no issue, whereas the OCR process throws an error.
Please try the program, as we took the effort to build it to demonstrate the issue.
Please ask a developer to look at the code if you are not able to do so.

We need a solution ASAP, as we reported these issues over a week ago and there has been no progress.

thanks
fya

Tue Jan 26, 2016 7:54 am

yes, we need the solution of the 17 pages error.

As we already mentioned, the problem is that your process is 32-bit.
32-bit processes have a limited address space, and, most importantly, in modern OSes Address Space Layout Randomization (https://en.wikipedia.org/wiki/Address_s ... domization) leaves this address space highly fragmented, so an application often cannot allocate a large contiguous memory buffer (for example, rasterizing one Letter page at 300 dpi requires about 32 MB of memory).
The only solutions I can recommend here:
1. Create a separate .exe that OCRs the document and turn off ASLR for that .exe (I am not sure whether .NET allows that).
2. Convert your app to 64-bit.
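
To make the 32 MB figure concrete, here is a back-of-the-envelope sketch. It assumes an uncompressed raster at 4 bytes per pixel, which is my assumption for illustration and not a statement about how the OCR engine stores pages; `RasterBytes` is a hypothetical helper. It also prints whether the current process is 64-bit, using the standard .NET `Environment.Is64BitProcess` property.

```csharp
using System;

static class RasterMemorySketch
{
    // Estimated size in bytes of one uncompressed page raster.
    // bytesPerPixel = 4 is an assumption (32-bit pixels), not an SDK fact.
    public static long RasterBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    static void Main()
    {
        // Letter page (8.5 x 11 in) at 300 dpi: 2550 x 3300 pixels.
        long bytes = RasterBytes(8.5, 11.0, 300, 4);
        Console.WriteLine(bytes + " bytes, about " + (bytes / (1024 * 1024)) + " MB");

        // A 32-bit process must find this much contiguous address space per page.
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
    }
}
```

A single buffer of this size is small compared with the total 32-bit address space, but it has to be contiguous, which is where fragmentation hurts.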

HTH

Post by **crimsonlogic** » Tue Feb 02, 2016 4:49 am

Hi,

As Alex said above, the overlaid text appears because the PDF we use has already been OCR'd. How can we tell whether a PDF has already been OCR'd?

We have another problem, with converting a Word file to PDF. The steps are as follows:

First, we open one Word document (doc1.docx). Then we launch our application and upload another Word document (doc2.docx), which runs the code below to convert it to PDF. The default printer is set to a physical printer.

The code below still uses the physical printer instead of the PDF-XChange printer: doc2.docx is printed on the physical printer instead of being converted to PDF. Please advise us ASAP, as this issue is stopping business flows on our live system.

PDFPrinter = (PXCComLib5.CPXCPrinter)prnFactory.get_Printer("", "PDF-XChange Printer 2012", regkey, devcode);
PDFPrinter.Option["Save.ShowSaveDialog"] = false;
PDFPrinter.Option["Save.RunApp"] = false;
PDFPrinter.Option["Save.Path"] = pdfpath;
PDFPrinter.Option["Save.WhenExists"] = 1; //overwrite

PDFPrinter.SetAsDefaultPrinter();

System.Diagnostics.Process printJob = new System.Diagnostics.Process();
printJob.StartInfo.FileName = inputfile;
printJob.StartInfo.UseShellExecute = true;
printJob.StartInfo.Verb = "print";
printJob.StartInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
printJob.Start();
printJob.WaitForExit(60000);

PDFPrinter.RestoreDefaultPrinter();

Tue Feb 02, 2016 8:39 am

Hello crimsonlogic,

We suspect that this is a Windows 10 issue.
Do try this - we've just tested this code and it worked for us:

Code:

            PXCComLib5.CPXCPrinter PDFPrinter;
            PXCComLib5.CPXCControlEx prnFactory = new PXCComLib5.CPXCControlEx();

            PDFPrinter = (PXCComLib5.CPXCPrinter)prnFactory.get_Printer("", "PDF-XChange Printer 2012", regkey, devcode);
            PDFPrinter.Option["Save.ShowSaveDialog"] = false;
            PDFPrinter.Option["Save.RunApp"] = false;
            PDFPrinter.Option["Save.Path"] = ocrfile;
            PDFPrinter.Option["Save.WhenExists"] = 1; //overwrite

            System.Diagnostics.Process printJob = new System.Diagnostics.Process();
            printJob.StartInfo.FileName = inputfile;
            printJob.StartInfo.UseShellExecute = true;
            printJob.StartInfo.Verb = "printto";
            printJob.StartInfo.Arguments = "\"" + PDFPrinter.Name + "\"";
            printJob.StartInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
            printJob.Start();
            printJob.WaitForExit(60000);

            return "ok";

HTH

Post by **crimsonlogic** » Wed Feb 17, 2016 10:25 am

Hi Support,

I converted my application to 64bit according to Tracker's advice.
I am not able to convert image files to pdf. I've replaced all dlls from Bin.64 folders from Tracker Software\PDF-XChange PRO 5 SDK\Examples
Our code is as follows:
if (Common.IS_DS_FAILED(PDFXC_Funcs.PXC_NewDocument(out pdf, regkey, devcode)))
    resultstr += "ConvertOthersToOCR: IS_DS_FAILED";
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Author, "Tracker Software");
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Title, "PDF-XChange 4.0 Examples");
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Creator, "PDF-XChange 4.0");
PDFXC_Funcs.PXC_SetDocumentInfoA(pdf, PDFXC_Funcs.PXC_StdInfoField.InfoField_Keywords, "PDF-XChange; Examples; 4.0; C#");
PDFXC_Funcs.PXC_EnableLinkAnalyzer(pdf, true);
PDFXC_Funcs.PXC_SetCompression(pdf, false, false, PDFXC_Funcs.PXC_CompressionType.ComprType_C_Auto,
    75, PDFXC_Funcs.PXC_CompressionType.ComprType_I_Auto, PDFXC_Funcs.PXC_CompressionType.ComprType_M_Auto);

int res = PDFXC_Funcs.PXC_AddPage(pdf, Common.PW, Common.PH, out page);
if (Common.IS_DS_FAILED(res))
    resultstr += "ConvertOthersToOCR: " + res;
cpage = page;

double iw, ih;
res = PDFXC_Funcs.PXC_AddImageA(pdf, inputfile, out p);
if (Common.IS_DS_FAILED(res))
    resultstr += "ConvertOthersToOCR: " + res;
PDFXC_Funcs.PXC_GetImageDimension(pdf, p, out iw, out ih);
PDFXC_Funcs.PXC_PlaceImage(cpage, p, Common.I2L(1), Common.PH - Common.I2L(1), Common.I2L(7), Common.I2L(8));

PDFXC_Funcs.PXC_WriteDocumentExA(pdf, extractfile, extractfile.Length, fl, "");
PDFXC_Funcs.PXC_ReleaseDocument(pdf);

I am getting error code -2113667071 from the line below, and no PDF is generated.

res = PDFXC_Funcs.PXC_AddImageA(pdf, inputfile, out p);

Please advise.

Thank you very much.

Wed Feb 17, 2016 10:58 am

Hello crimsonlogic,

Please do not post bare error codes; use the PXC_Err_FormatErrorCode method to get the message as well.
The error code that you've provided means Invalid Argument.
The code sample does not contain enough information to tell which argument of that method is wrong.
Please provide samples with FULL problem data.

Post by **crimsonlogic** » Wed Feb 17, 2016 11:30 am

Hi Sasha,

We are uploading sample project (TestPDFXChangeORG.zip) to Tracker's FTP . Please unzip with the password sent in a separate email to 'support@pdf-xchange.com'

The sample data file (CL.TIF) is in Temp.zip.

Please advise how we can use PXC_Err_FormatErrorCode in our program too.

Thank you very much.

Wed Feb 17, 2016 3:18 pm

How to use FormatErrorCode method:

Code:

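					// Buffer that receives the ASCII error text from the SDK.
					byte[] bytes = new byte[128 * sizeof(char)];
					PDFXC_Funcs.PXC_Err_FormatErrorCode(-2113667071, bytes, bytes.Length);
					// The unused part of the buffer is zero-filled, so trim the trailing '\0' characters.
					string str = System.Text.Encoding.ASCII.GetString(bytes).TrimEnd('\0');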

Please post the error message with the error code itself when you need to include it in your message.

Cheers,
Alex

Thu Feb 18, 2016 12:42 pm

Hello crimsonlogic,

I've updated the zip archive ClassLibrary1.zip with the same password that you've specified.
The problem was the int type: in C#, int is always a 32-bit value, so when you switched to x64 the pointers stored in int variables were truncated and became corrupted. I changed them to IntPtr and everything worked properly.
The archive contains the files that I modified.
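
To show the effect in isolation, here is a small sketch that does not touch the SDK at all. The address is a made-up value and `RoundTripThroughInt` is a hypothetical helper; it only simulates what happens when a 64-bit handle is kept in an int.

```csharp
using System;

static class HandleTruncationSketch
{
    // Simulates storing a native handle in a C# int, which is always 32 bits,
    // and reading it back: the upper 32 bits are lost.
    public static long RoundTripThroughInt(long handleValue)
    {
        int stored = unchecked((int)handleValue);
        return stored;
    }

    static void Main()
    {
        long handle64 = 0x00007FF612345678L; // made-up 64-bit address for illustration

        // The value no longer matches after passing through an int.
        Console.WriteLine(RoundTripThroughInt(handle64) == handle64); // False

        // IntPtr matches the pointer size of the process: 4 bytes in x86, 8 in x64.
        Console.WriteLine("IntPtr.Size = " + IntPtr.Size);
    }
}
```

Small values happen to survive the round trip, which is why the same code worked in the 32-bit build.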

HTH,
Alex
