Removing hyphenations and extra spaces when saving a document in text formats

Forum for the PDF-XChange Editor - Free and Licensed Versions

Jensen Head
User
Posts: 430
Joined: Mon Sep 13, 2021 8:12 am

Removing hyphenations and extra spaces when saving a document in text formats

Post by Jensen Head »

In the current version of PDF-XChange Editor, saving a PDF document as .txt or .docx simply glues consecutive lines together, leaving end-of-line hyphenation and extra spaces in place, so the exported text needs additional manual cleanup. Please consider processing the text automatically, the way Abbyy Screenshot Reader does, for example.
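
Until something like this exists in the editor, the cleanup can be done on the exported .txt outside the program. Below is a minimal Python sketch of that post-processing. It is not a PDF-XChange Editor feature or API; the function name `clean_exported_text` and its heuristics are assumptions for illustration only. It assumes paragraphs are separated by blank lines, and that a line-ending hyphen followed by a lowercase letter marks a split word.

```python
import re

# Hyphen-like characters that may end a wrapped line:
# ASCII hyphen, soft hyphen (U+00AD), Unicode hyphen (U+2010).
_HYPHENS = "-\u00ad\u2010"
_SPLIT_WORD = re.compile(rf"(\w)[{_HYPHENS}][ \t]*\n[ \t]*(\w)")


def _rejoin(match: re.Match) -> str:
    """Decide how to merge a word that was hyphenated at a line break."""
    before, after = match.group(1), match.group(2)
    if after.islower():
        # "docu-\nment" -> "document": treat the hyphen as a line-wrap artifact.
        return before + after
    # "Saint-\nPetersburg" -> "Saint-Petersburg": keep the hyphen, drop the break.
    return before + "-" + after


def clean_exported_text(text: str) -> str:
    """Rejoin hyphenated words, unwrap lines, and collapse extra spaces."""
    # Normalize Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _SPLIT_WORD.sub(_rejoin, text)
    # Blank lines are assumed to separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, any run of whitespace (including single line
    # breaks and double spaces) becomes one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "This is a docu-\nment  with   extra spaces\nand wrapped lines.\n\nSecond para-\ngraph."
    print(clean_exported_text(sample))
```

One known limitation of this heuristic: a genuine compound word that happens to break at its hyphen (for example "well-" at the end of one line and "known" at the start of the next) is joined without the hyphen. Telling the two cases apart reliably needs a dictionary check.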

(The screenshots show this document — https://solzhenitsyn.ru/proizvedeniya/publizistika/stati_i_rechi/v_sovetskom_soyuze/obrazovanshzina.pdf)
Tracker Supp-Stefan
Site Admin
Posts: 17960
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Removing hyphenations and extra spaces when saving a document in text formats

Post by Tracker Supp-Stefan »

Hello Jensen Head,

That tool uses OCR and tries to guess where line breaks should be added and where they should not.
The original PDF file may well store each line as a separate text object, or hold a block of continuous text without 'proper' line breaks or end-of-line characters.

Cheers,
Stefan
Jensen Head
User
Posts: 430
Joined: Mon Sep 13, 2021 8:12 am

Re: Removing hyphenations and extra spaces when saving a document in text formats

Post by Jensen Head »

Stefan, thank you for your answer! Yes, I understand that Abbyy's tool uses OCR, as does, for example, Text Extractor from the Microsoft PowerToys utility package.

I have attached to this message a screenshot of a .txt document saved from the same PDF document using Abbyy FineReader PDF 16.0. As you can see, there are no double spaces or hyphenated words. Compare this with the result of saving the same document from PDF-XChange Editor (screenshot above). This is exactly the result I was talking about.

Should we understand from your answer that Tracker Software does not plan to implement any preprocessing of the document when saving a PDF as .txt in PDF-XChange Editor?
TrackerSupp-Daniel
Site Admin
Posts: 8624
Joined: Wed Jan 03, 2018 6:52 pm

Re: Removing hyphenations and extra spaces when saving a document in text formats

Post by TrackerSupp-Daniel »

Hello, Jensen Head

We are currently using ABBYY's V12 engine, which lacks many of the newer features and capabilities of their latest releases. We are looking into upgrading the version in the future. To my understanding, we offer nearly everything that is available in this version of the engine. We have the ability to tweak the precise use of some functions, but some specific actions are less feasible in the current version.
I cannot say we do not plan to implement any of this, just that doing so may require an engine upgrade, and thus would be a heavy investment (in both development time and money) when/if we decide to do it.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
