Knowledgebase

I have a problem with extracting Simplified Chinese text into a text file.

Problem:

I am extracting all of the text from a PDF file into a text file. The text is in Simplified Chinese, but the resulting content is not correct.

Resolution:

Sometimes the problem is in the PDF file itself. This one uses an encoding that is not supported by xcpro40.dll, and it does not provide the alternate information needed to decode text from multi-byte codes to Unicode (the optional ToUnicode table is missing).

The underlying issue is that text in PDF files may be represented with many different encodings (single-byte, two-byte, and multi-byte), and any of them can be custom (non-standard). Not all of these encodings can be translated to Unicode without additional information, and that information is optional. So, in general, you cannot extract text from PDF files with 100% accuracy; there are many files from which text cannot be extracted without OCR-like mechanisms.

Returning to your file: the text uses a multi-byte encoding based on the Adobe GB1 character set, and xcpro40 does not support this encoding. You can get the original bytes (the multi-byte sequences that represent character codes in this encoding) by calling the PXCp_ET_AnalyzePageContent function with the GTEF_OriginalCodes flag. Note that you will then get one WCHAR for each BYTE, so you should use only the low byte of each WCHAR.
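The low-byte step can be sketched as follows. This is a minimal illustration in Python, not a call into the SDK: it assumes the WCHAR values returned with GTEF_OriginalCodes have already been copied into a list of integers. The helper name low_bytes and the sample values are hypothetical.

```python
def low_bytes(wchars):
    """Collapse 16-bit WCHAR values into the original byte sequence.

    Assumption: each WCHAR carries exactly one original byte in its
    low 8 bits, as described for GTEF_OriginalCodes above.
    """
    return bytes(w & 0xFF for w in wchars)


# Hypothetical sample: four WCHARs carrying the bytes D6 D0 CE C4.
sample = [0x00D6, 0x00D0, 0x00CE, 0x00C4]
raw = low_bytes(sample)
print(raw.hex())  # d6d0cec4
```

Masking with 0xFF discards the high byte of each value, so the result is the same whatever the SDK leaves there.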

After retrieving the original multi-byte sequence, you can translate it to Unicode yourself. Please note that other files may contain fonts with a different encoding (even a single file may contain many fonts with different encodings), so you will need to do some additional analysis to determine which encoding is used. You can use the PXCp_ET_GetFontInfo function to determine whether xcpro40 can translate text in the corresponding font to Unicode (see the Quality field in the PXP_TEFontInfo structure; the values TEFQ_ToUnicode and TEFQ_Encoding indicate that it can). To get additional information about fonts that is not listed in the PXP_TEFontInfo structure, you can use the PXCp_ET_GetFontObj function and then the Low-Level API. The Low-Level API gives you many more possibilities, but you will need to know much more about the PDF format to use it.
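As a rough sketch of the translation step, the snippet below decodes recovered bytes with Python's built-in gbk codec. It only applies under an assumption that must be checked per font: that the font's CMap is one of the GB-based byte encodings (for example a predefined GB-EUC or GBK CMap). If the font uses an identity CMap, the codes are CIDs and need an Adobe GB1 CID-to-Unicode table instead, which is not shown here. The helper name decode_gb_bytes is hypothetical.

```python
def decode_gb_bytes(raw, encoding="gbk"):
    """Decode original multi-byte codes, assuming a GB-based byte encoding.

    Returns None when the bytes are not valid in the assumed encoding,
    which suggests the font uses a different or custom encoding.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


text = decode_gb_bytes(b"\xd6\xd0\xce\xc4")
print(text)  # 中文

# A truncated two-byte sequence is rejected instead of being guessed at.
print(decode_gb_bytes(b"\xd6"))  # None
```

Returning None rather than substituting replacement characters makes it easy to spot fonts that need different handling before anything is written to the output text file.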

An alternative way to extract text from such PDFs is to use the Viewer ActiveX SDK; it has a more advanced text extraction engine and can handle most cases.

Note:

Though the Low-Level API functions are available in the PDF-Tools 4 SDK in demo mode, you will require a PRO SDK license to use them in licensed mode. We do not provide support for the Low-Level API, as it requires advanced knowledge of the PDF format and misuse can lead to disastrous results. Be very careful.
