Hello,
We're using pxclib40.dll (4.0.201.0) to generate PDF documents. Some time ago we had a problem with non-searchable Cyrillic text in the generated PDFs. I found a solution on this forum: calling
PXC_SetEmbeddingOptions(mydoc, TRUE, TRUE, TRUE); should help, and indeed it did.
Unfortunately, I have now received a report that it doesn't work with some fonts (e.g. the GOST_A font).
Is there any additional setting that I should use?
I've attached a generated PDF containing two texts exported with exactly the same settings (one with the MS Arial Unicode font, the other with the GOST Type A font), along with the GOST_A font file.
This is roughly what our text-generation code looks like (we have a C# wrapper for the pxclib40 library):
int eCode = PDFWrapper.PXC_SetEmbeddingOptions(this.pdfPtr, true, true, true);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}
eCode = PDFWrapper.PXC_SetFontEmbeddW(this.pdfPtr, font.TTFFileKey.FamilyName, PDFWrapper.PXC_EmbeddType.EmbeddType_ForceEmbedd);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}
eCode = PDFWrapper.PXC_AddFontW(this.pdfPtr, tm.tmWeight, font.TTFFileKey.IsItalic, font.TTFFileKey.FamilyName, out fntID);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}
// Read the current text options, then override font, baseline position and size.
PDFWrapper.PXC_TextOptions newTextOpt;
eCode = PDFWrapper.PXC_GetTextOptions(this.pdfPage, out newTextOpt);
newTextOpt.fontID = fntID;
newTextOpt.nTextPosition = PDFWrapper.PXC_TextPosition.TextPosition_Baseline;
newTextOpt.fontSize = PdfHelper.MM2PsP(lenX);
PDFWrapper.PXC_SetTextOptions(this.pdfPage, ref newTextOpt);
PDFWrapper.PXC_TextOutW(this.pdfPage, ref origin, charTxt, -1);
Cyryllic text not searchable with some fonts in generated PDF
-
- User
- Posts: 1
- Joined: Wed Dec 08, 2021 1:49 pm
-
- Site Admin
- Posts: 8624
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Cyryllic text not searchable with some fonts in generated PDF
Hello, jacekP
Thank you for the report. I am afraid this topic goes beyond my personal knowledge, but I have asked our Dev team to take a look. Someone should post here today or tomorrow to help with this.
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- Site Admin
- Posts: 3556
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
Re: Cyryllic text not searchable with some fonts in generated PDF
I'm afraid that the problem is with the font, not with the library.
For some reason, this font maps two ranges of codes to Cyrillic characters (see the attached screenshot), and a non-Unicode range was used when the text was rendered.
For example, when you copy text from your PDF file and paste it into Notepad with the Arial font selected, the text is not readable. But once you change the font in Notepad to "GOST type A", you will see readable text (both cases are shown in the attached screenshots).
At the moment I'm not ready to say why the incorrect code range was used, and I'm afraid there is no simple solution for that.
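The Notepad observation above can be illustrated with a small sketch. It rests on an assumption that the thread does not confirm: that the font's non-Unicode range follows the Windows-1251 layout, where bytes 0xC0..0xFF hold the letters U+0410..U+044F. The only pair actually named in this thread is U+00C1 and U+0411, which fits that layout. The class and method names are hypothetical and are not part of pxclib40.

```csharp
using System;
using System.Text;

public static class LegacyCyrillicRemap
{
    // ASSUMPTION: the legacy range mirrors Windows-1251, so U+00C0..U+00FF
    // stand for U+0410..U+044F. The thread confirms only U+00C1 <-> U+0411.
    private const int Offset = 0x0410 - 0x00C0; // 0x0350

    // Shifts characters U+00C0..U+00FF into the basic Cyrillic block and
    // leaves every other character untouched. Letters outside that block
    // (for example U+0401) are not handled by this sketch.
    public static string ToCyrillic(string extracted)
    {
        var sb = new StringBuilder(extracted.Length);
        foreach (char c in extracted)
        {
            bool inLegacyRange = c >= '\u00C0' && c <= '\u00FF';
            sb.Append(inLegacyRange ? (char)(c + Offset) : c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // U+00C1 U+00C0 U+00C7 U+00C0 becomes U+0411 U+0410 U+0417 U+0410.
        string fixedText = ToCyrillic("\u00C1\u00C0\u00C7\u00C0");
        foreach (char c in fixedText)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

This only post-processes text that has already been copied out of the PDF. It does not change what is stored in the file, so it would not by itself make the PDF searchable.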
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
-
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: Cyryllic text not searchable with some fonts in generated PDF
Hi, jacekP
The problem is that this font has an incorrect character mapping. As the attached screenshot shows, the font has 3 cmap subtables with different mappings. Windows and most programs use subtable 3.1 when possible, so we look at it first. This subtable has an unusual mapping: in many cases two characters are mapped to the same glyph. For example, the Cyrillic letter 'Б' (U+0411) is mapped to glyph 120 (0x78), but the character Aacute 'Á' (U+00C1) is mapped to the same glyph. So whichever of them you type, you will see the Cyrillic letter 'Б'.
To display an embedded font properly, a program only has to embed the correct glyphs. To make the text searchable, it may also embed additional information that maps the embedded glyphs to the correct Unicode characters. Normally this is not a problem, but with this font some glyphs are mapped from two Unicode characters, and the program has to choose one of them. In such cases our software uses the first mapping, which here is Aacute 'Á' (U+00C1). I'm afraid this cannot be changed, because there are other fonts where two or more characters are mapped to the same glyph, and in many cases selecting the first mapping is correct.
Even if we tried to find a workaround for your font, there is another problem: all the other cmap subtables, and even the post table, state that glyph 120 (0x78) corresponds to the character Aacute 'Á' (U+00C1). So this font has three tables with the incorrect mapping and one tricky table with a double mapping, which allows the 'correct' mapping to work too.
Kind regards,
Lzcat - Tracker Supp
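The "first mapping wins" behaviour described in this post can be sketched as follows. This is an illustration only, not the library's actual code: the class and method names are hypothetical, and the entries are assumed to be visited in ascending code point order, so U+00C1 is seen before U+0411.

```csharp
using System;
using System.Collections.Generic;

public static class ReverseCmapDemo
{
    // Builds a glyph -> Unicode table from (code point, glyph) pairs,
    // keeping the first code point seen for each glyph.
    public static Dictionary<int, int> BuildGlyphToUnicode(
        IEnumerable<KeyValuePair<int, int>> cmapEntries)
    {
        var glyphToUnicode = new Dictionary<int, int>();
        foreach (var entry in cmapEntries)
        {
            int codePoint = entry.Key;
            int glyphId = entry.Value;
            if (!glyphToUnicode.ContainsKey(glyphId))
            {
                glyphToUnicode[glyphId] = codePoint; // first mapping wins
            }
        }
        return glyphToUnicode;
    }

    public static void Main()
    {
        // The two entries from the post: both code points select glyph 120.
        var cmap = new List<KeyValuePair<int, int>>
        {
            new KeyValuePair<int, int>(0x00C1, 120), // Aacute
            new KeyValuePair<int, int>(0x0411, 120), // Cyrillic Be, same glyph
        };
        Dictionary<int, int> reverse = BuildGlyphToUnicode(cmap);
        Console.WriteLine("glyph 120 -> U+{0:X4}", reverse[120]); // prints "glyph 120 -> U+00C1"
    }
}
```

Because the reverse table can hold only one code point per glyph, text drawn with glyph 120 is reported as U+00C1 when it is searched or copied, even though it displays as the Cyrillic letter.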
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.