How do I detect blanks between text characters using PXCp_ET.. methods?
Using the PXCp_ET_... methods I'm able to extract all text elements that are positioned inside a given area. Now I want to copy the found text to the clipboard, preserving as much formatting as possible (line breaks and blanks). While I can do this by simply merging all the characters of the found text elements into a string, this procedure will lose any information about blanks between characters. Now I'm looking for a way to detect whether there is a blank space between two characters of a text element.
My idea is to compare the two characters' offsets and, if the distance between them is larger than the 'size' of a blank, insert a blank between them in my result string. But how do I get the size of a blank space? Is there a better way than starting with the text elements' font info and using the low-level API to process all the font objects to get the width?
Or in general: is there a good algorithm to detect if there is a blank between two characters when only their position is known?
I have already tried computing the average distance between two text element characters (for each different font) and inserting a blank whenever the distance between two characters is larger than this average distance multiplied by a factor. But this produces too many false blanks and also misses some.
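For reference, the averaging approach described above can be sketched roughly as follows. This is only an illustration in Python, not the PXCp API: the `(char, x_origin, font_name)` tuples, the function name `join_with_average_gap`, and the default `factor` are all hypothetical, and the characters are assumed to lie on a single line, sorted left to right.

```python
from collections import defaultdict


def join_with_average_gap(glyphs, factor=1.5):
    """Join characters, inserting a blank where the origin-to-origin
    distance exceeds the per-font average distance times `factor`.

    glyphs: list of (char, x_origin, font_name) tuples (hypothetical
    layout), all on one line and sorted left to right.
    """
    # Collect origin-to-origin distances, grouped by the font of the
    # left-hand character of each pair.
    distances = defaultdict(list)
    for (_, x0, font), (_, x1, _) in zip(glyphs, glyphs[1:]):
        distances[font].append(x1 - x0)
    average = {font: sum(d) / len(d) for font, d in distances.items()}

    out = []
    for i, (ch, x, _font) in enumerate(glyphs):
        if i > 0:
            _, prev_x, prev_font = glyphs[i - 1]
            if x - prev_x > average[prev_font] * factor:
                out.append(" ")
        out.append(ch)
    return "".join(out)


sample = [("a", 0, "F1"), ("b", 5, "F1"), ("c", 10, "F1"),
          ("d", 20, "F1"), ("e", 25, "F1"), ("f", 30, "F1")]
print(join_with_average_gap(sample))
```

Because the origin-to-origin distance also includes the width of the left-hand character, a wide glyph followed by a narrow one can exceed the threshold without any blank being present, which is one likely source of the false blanks.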
In general, fonts in a PDF may not contain any information about the width of the space character, because some PDF creators do not use the space character at all and therefore may not include any information about it. I'm afraid there is no common algorithm for detecting spaces, only approximations. The PXCp_ET_... functions cannot provide all the information you need, but you may try to collect it using the low-level API, where that is possible. You will need to read Section 5 (Text) in the PDF Reference, especially subsections 5.5 and 5.6. But do not expect the solution to be easy or complete.
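As one possible approximation, here is a minimal sketch in Python that assumes you have already managed to collect, for each character, its position, its advance width, and its font size (for example from the font's width information via the low-level API). The `Glyph` record, the function `rebuild_text`, and the threshold ratios `space_ratio` and `line_ratio` are hypothetical names and values chosen for illustration; they are not part of the PXCp API and would need tuning per document.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float          # left edge (origin) of the character
    y: float          # baseline position
    width: float      # advance width, in the same units as x
    font_size: float  # font size, in the same units as x


def rebuild_text(glyphs, space_ratio=0.2, line_ratio=0.5):
    """Rebuild a string from glyphs given in reading order.

    A blank is inserted when the gap between the end of the previous
    character and the start of the next exceeds space_ratio * font size.
    A line break is inserted when the baseline moves by more than
    line_ratio * font size. Both ratios are assumed values.
    """
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            if abs(g.y - prev.y) > line_ratio * prev.font_size:
                out.append("\n")
            elif g.x - (prev.x + prev.width) > space_ratio * prev.font_size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


glyphs = [
    Glyph("H", 0.0, 100.0, 7.0, 10.0),
    Glyph("i", 7.0, 100.0, 3.0, 10.0),
    Glyph("t", 13.5, 100.0, 4.0, 10.0),
    Glyph("o", 17.5, 100.0, 5.0, 10.0),
    Glyph("x", 0.0, 88.0, 5.0, 10.0),
]
print(repr(rebuild_text(glyphs)))
```

Measuring the gap from the end of the previous character rather than from its origin removes the dependence on glyph width, but it only works when width information is actually available, and it still remains an approximation for justified or heavily kerned text.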