Export PDF to Plain Text has problems with german umlauts, like ü, ä, ö

SiliasPD
User
Posts: 3
Joined: Wed May 11, 2022 1:23 pm

Post by SiliasPD »

Hi,
I want to extract text from a PDF, but wherever there are German umlaut letters like ü, ä, ö I get odd results like:

---------
The letter ü results in:


¨

u
---------
The letter ä results in:
¨a
---------
The letter ö results in:
¨o
---------

Is there a way to make PDF-Tools recognize these umlaut letters?

Thanks in advance,
Silias
TrackerSupp-Daniel
Site Admin
Posts: 8624
Joined: Wed Jan 03, 2018 6:52 pm

Post by TrackerSupp-Daniel »

Hello, SiliasPD

It depends on how the characters were entered in the document. First, I would advise that you open the preferences (Ctrl+K) and enable the "preserve original ligatures" option under the Page Text category.
If the characters are still not copied properly after that, the issue may lie with the content formatting, order, or placement. To look into it further, I would need a copy of the document you are using, so that I can share it with the Dev team and see what we can do.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

SiliasPD
User
Posts: 3
Joined: Wed May 11, 2022 1:23 pm

Post by SiliasPD »

Hi Daniel,
thanks for your advice!

The "preserve original ligatures" option was already enabled, so that could not be the problem.
Thanks for the offer to share the original document with the Dev team. Could you please provide an email address to which I can send the document?

Thanks in advance,
Silias
Tracker Supp-Stefan
Site Admin
Posts: 17960
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Post by Tracker Supp-Stefan »

Hello SiliasPD,

Please send the sample file to support@pdf-xchange.com with a link back to this topic, and we will pass the file on to our devs for further checking.

Kind regards,
Stefan
Jordan - Tracker Supp
Site Admin
Posts: 91
Joined: Mon Jul 03, 2023 3:10 pm

Post by Jordan - Tracker Supp »

Hello SiliasPD,

We have received your file and I have answered you in an email. Please check your inbox.

Edit:

Our development team has investigated the file that SiliasPD has provided and here are their results:

There are actually no German umlaut characters in the file. Each apparent umlaut is composed of two characters placed one over the other. If you look at the Content pane at any place where an umlaut is supposed to appear, you will see the two separate characters (screenshot: image.png). The end result is shown in the attached animation (2023-10-02_11-50-29.gif).
Another part of the issue is that the original file uses embedded fonts with built-in encoding and no ToUnicode tables, so copying text from this file may be problematic in some software regardless of the “umlauts” issue.

Our developers suggested three ways to solve this:
  • Recreate the file with other software that uses real umlaut characters instead of simulating them.
  • OCR the file using the correct language.
  • Post-process the exported plain text: replace each pair such as “¨u” with the proper “ü” character, and likewise for the other umlauts.
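
For the third option, a minimal Python sketch of such a replacement pass might look like the following. It is not part of PDF-Tools; the function name restore_umlauts is made up for illustration, and it assumes the exported text contains a spacing diaeresis (U+00A8) followed, possibly after whitespace or line breaks as in the ü example above, by the base vowel.

```python
import re
import unicodedata

# Spacing diaeresis (U+00A8), optional whitespace or line breaks,
# then the base vowel it was drawn over. The whitespace tolerance is
# an assumption based on the output shown in this thread.
_SPLIT_UMLAUT = re.compile("\u00a8\\s*([aouAOU])")


def restore_umlauts(text: str) -> str:
    """Replace pairs like "¨u" with the precomposed letter "ü"."""

    def compose(match: re.Match) -> str:
        # Base letter + combining diaeresis (U+0308), composed via NFC
        # into a single precomposed character such as U+00FC.
        return unicodedata.normalize("NFC", match.group(1) + "\u0308")

    return _SPLIT_UMLAUT.sub(compose, text)


if __name__ == "__main__":
    print(restore_umlauts("M\u00a8adchen"))  # Mädchen
```

Note that a genuine standalone "¨" that happens to precede one of these vowels would also be merged, so it is worth spot-checking the result on real exports.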
Kind regards,

Jordan