PDF-XChange - Tracker PDF Viewer - TIFF-XChange - Image-XChange - XMF-XChange - Raster-XChange - Support


 
antonio111
User
Topic Author
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Creating a new language for OCR

Tue Oct 10, 2017 7:39 pm

Dear Recipients,

I was wondering whether it is possible to install a new language (or new languages) for the OCR function in PDF-XChange Editor.

The language I would most like to use is Pāli, an early language in which the Buddha's teachings are preserved. Pāli is written in many scripts, among them an extended Roman alphabet with characters such as ā, ī, ū, ṅ, ṇ, ñ, ṭ, ḍ, ṃ (or ṁ), and ḷ. Do we need a particular language pack for this, or just one that supports Latin Extended-A and Latin Extended Additional, which would also work for Vietnamese, for example?
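
As a quick check of the Unicode-block question, the sketch below maps each of the romanized Pāli letters listed above to its Unicode block. The three block ranges come from the Unicode standard; the helper name `block_of` is only an illustrative choice, and the letters are written as escapes so the example does not depend on how this page is encoded.

```python
# Code-point ranges of the Unicode blocks relevant to romanized Pali.
BLOCKS = [
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]

# Precomposed forms of: a-macron, i-macron, u-macron, n-dot-above,
# n-dot-below, n-tilde, t-dot-below, d-dot-below, m-dot-below,
# m-dot-above, l-dot-below.
PALI_LETTERS = (
    "\u0101\u012b\u016b\u1e45\u1e47\u00f1"
    "\u1e6d\u1e0d\u1e43\u1e41\u1e37"
)


def block_of(ch):
    """Return the name of the Unicode block containing a single character."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"


if __name__ == "__main__":
    for ch in PALI_LETTERS:
        print(f"U+{ord(ch):04X} {ch} {block_of(ch)}")
```

Note that ñ sits in Latin-1 Supplement rather than in either of the two extended blocks, so the full set of letters spans three blocks.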

Thank you for your attention.

Kind regards,
Antonio
 
Bhikkhu Pesala
User
Posts: 1757
Joined: Tue May 29, 2007 9:29 am
Location: East London

Re: Creating a new language for OCR

Wed Oct 11, 2017 9:44 am

I see that there is an Additional Language Pack for Vietnamese.

Try that, and let us know your results.
Windows 10 64-bit • AMD A10-6800K, 8 Gbyte RAM
Review: http://www.softerviews.org/PDF-XChange.html
 
Tracker Supp-Stefan
Site Admin
Posts: 11746
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Creating a new language for OCR

Wed Oct 11, 2017 10:38 am

Thanks for the help Bhikkhu!

Let us know how it went with the Vietnamese language file, Antonio!

Regards,
Stefan
 
antonio111
User
Topic Author
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Re: Creating a new language for OCR

Wed Oct 11, 2017 9:07 pm

Thank you for your replies Bhikkhu Pesala and Stefan. I will try and let you know how it goes.
 
Tracker Supp-Stefan
Site Admin
Posts: 11746
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Creating a new language for OCR

Thu Oct 12, 2017 10:39 am

Looking forward to your feedback antonio111!

Cheers,
Stefan
 
antonio111
User
Topic Author
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Re: Creating a new language for OCR

Thu Oct 12, 2017 7:01 pm

Dear Stefan and Bhikkhu Pesala,

I have used the OCR function on a Pāli text, setting Vietnamese as the recognition language. Recognition is good where there are no diacritical marks, but letters such as ā, ī, ū, ṇ, ṅ, ñ, ṃ, ṭ, ḍ seem to be completely misread by the program.

I then compared the result with an earlier recognition of the same text done with a different recognition language. I did that some months ago and don't remember which language I used, probably English. The result with that language was very similar, except that it could recognize ñ.

I remember that some time ago I tried to recognize another Pāli text (with PDF-XChange) using many languages, in order to test which one fit Pāli best. Among these was Latvian, which seems to have some of the same diacritical marks as Pāli. However, I remember that the results were not good in that case either. This is why I have been asking myself and others whether we could get better results with a Pāli language pack. What do you think?

Kind regards,
Antonio
 
Paul - Tracker Supp
User
Posts: 4609
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Creating a new language for OCR

Fri Oct 13, 2017 6:29 pm

Hi antonio111,

Unfortunately, we do not maintain the OCR libraries. This is the only component of PDF-XChange that is not 100% our own code. We use the Tesseract libraries for OCR: https://en.wikipedia.org/wiki/Tesseract_(software)

As such, we do not manage the available languages, and you would be best to get in touch with the Tesseract project about adding new language support.

One suggestion, since you are seeing relatively good results with some of the existing languages, is to enable more than one language in the Editor. That may well give you the best of several different languages.

It is also possible to add any of the languages currently listed here: https://github.com/tesseract-ocr/tessdata

To make any of those language files usable in the Editor, rename the desired file from XXX.traineddata to XXX_pxvocr.dat and place it in "C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages".

You will also need to create a corresponding XML file named XXX_pxvocr.lng with the following content:
<?xml version="1.0" encoding="utf-8"?>
<language name="UI Name of the Language" prefix="XXX" version="1.00"/>


That will give you access to any language Tesseract supports.
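
The two manual steps above (rename the data file, write the matching .lng file) can be sketched as a small Python script. This is only an illustration under stated assumptions: the function name `install_ocr_language` and its parameters are made up for the example, and the file naming and XML content are taken directly from the instructions above rather than from any documented Editor API.

```python
from pathlib import Path
import shutil
from xml.sax.saxutils import quoteattr


def install_ocr_language(traineddata, target_dir, ui_name):
    """Copy XXX.traineddata to XXX_pxvocr.dat and write XXX_pxvocr.lng.

    `traineddata` is a file downloaded from the tessdata repository,
    `target_dir` is the Editor's OCRLanguages folder, and `ui_name` is
    the label to show in the language list. All three names are
    hypothetical parameters chosen for this sketch.
    """
    traineddata = Path(traineddata)
    target_dir = Path(target_dir)
    if traineddata.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {traineddata.name}")

    prefix = traineddata.stem  # e.g. "lav" for lav.traineddata
    dat_path = target_dir / f"{prefix}_pxvocr.dat"
    lng_path = target_dir / f"{prefix}_pxvocr.lng"

    target_dir.mkdir(parents=True, exist_ok=True)
    # Copy rather than move, so the downloaded file is left untouched.
    shutil.copyfile(traineddata, dat_path)

    # quoteattr adds the surrounding quotes and escapes special characters.
    lng_path.write_text(
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f"<language name={quoteattr(ui_name)} "
        f'prefix={quoteattr(prefix)} version="1.00"/>\n',
        encoding="utf-8",
    )
    return dat_path, lng_path
```

For example, `install_ocr_language("lav.traineddata", r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages", "Latvian")` would produce lav_pxvocr.dat and lav_pxvocr.lng in that folder. Writing under Program Files will most likely require running the script from an elevated prompt.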

I hope that helps.
_________________
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
 
antonio111
User
Topic Author
Posts: 4
Joined: Tue Oct 10, 2017 7:19 pm

Re: Creating a new language for OCR

Fri Oct 13, 2017 8:24 pm

Dear Paul,

Thank you very much for your support. I appreciate it!

Kind regards,
Antonio
 
Paul - Tracker Supp
User
Posts: 4609
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: Creating a new language for OCR

Fri Oct 13, 2017 8:35 pm

:D

My pleasure Antonio.
