OCR and removing path from Content view for graphics

Forum for the PDF-XChange Editor - Free and Licensed Versions

DWC121
User
Posts: 66
Joined: Thu Jul 30, 2015 5:18 am

OCR and removing path from Content view for graphics

Post by DWC121 »

Greetings,

I use OCR on my PDFs so I can search them using Windows Explorer. When I open the Content pane after applying OCR, the text is listed. Some of the entries also say "path". The "path" entries are usually graphics or box borders in the PDF. Since graphics and box borders cannot be searched for in Windows, I deleted the "path" entries. Unfortunately, the box borders then disappear from the PDF.

Is there a way to OCR what looks only like text and not graphics?

Sometimes I reuse a pdf that has been OCR'd, then I'll re-OCR the document. The documents are forms. Before I re-OCR the document, I'll remove the text entries in the Content Pane (otherwise, the original OCR data remains even if the data is not in the new pdf). This often removes form text (background text) in the document too.

Is there a way to remove only the OCR data from the fields and not the form text in the background?

Thanks - David
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR and removing path from Content view for graphics

Post by Tracker Supp-Stefan »

Hello DWC121,

The Content pane shows all the contents of your PDF file, including the text you just added with the OCR process as well as any content that was there before.
It is expected that when you remove an element from the Content pane, it also disappears from the corresponding page.

I am afraid that there's no way to distinguish between the text elements via code - they are all the same type.

Regards,
Stefan
DWC121
User
Posts: 66
Joined: Thu Jul 30, 2015 5:18 am

Re: OCR and removing path from Content view for graphics

Post by DWC121 »

Stefan,

In other words, the Content pane shows more than just the results of OCR; it shows all the content of the document whether it be text or graphics.

(Re-reading your reply, the following is probably not possible, but I'll ask again...) Is there a way to remove the OCR data without using the Content pane? When OCR is applied, the results get added to previously applied OCR data. If text in the PDF has disappeared (maybe because the text in a field was changed), the OCR data for that original field text remains in the document. In the Content pane, deleting the original OCR'd data for text that has disappeared can be time-consuming when going line by line or group by group. (Maybe this part of my question should be a separate posting.)
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR and removing path from Content view for graphics

Post by Tracker Supp-Stefan »

Hello DWC121,

Unfortunately, that is indeed not possible. Once added, the OCR text is indistinguishable from other text content in the file, so there is no automated way to separate it and remove only the previous OCR 'layer' (it is not really a layer, hence the quotes).
The best I can think of, if I have understood the use case correctly, would be to keep a copy of the original file before the OCR - so that you can easily do new OCRs on that.

Regards,
Stefan
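
The "keep a copy before OCR" workflow suggested above could be scripted outside the Editor. The following is only a minimal sketch: the function name backup_before_ocr, the "originals" folder, and the ".pre-ocr" file naming are assumptions made for illustration, and nothing here calls PDF-XChange Editor itself.

```python
import shutil
from pathlib import Path


def backup_before_ocr(pdf_path, backup_dir):
    """Copy pdf_path into backup_dir and return the path of the copy.

    The copy is named <name>.pre-ocr.pdf (a hypothetical convention) so the
    un-OCR'd original is easy to find when a form has to be redone.
    """
    src = Path(pdf_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.stem}.pre-ocr{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest
```

Running this on each filled-in form before OCR keeps the originals in one place, so a corrected field only requires re-running OCR on the saved copy.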
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR and removing path from Content view for graphics

Post by Timur Born »

Maybe reprint the file to a new PDF and thus create a single layer of text again?!
DWC121
User
Posts: 66
Joined: Thu Jul 30, 2015 5:18 am

Re: OCR and removing path from Content view for graphics

Post by DWC121 »

Timur,

I think that is what Stefan was suggesting. I've been doing that, although if I find a mistake in text I've entered into fields, I have to start all over. Sometimes that means redoing several PDFs. The only way around it is to save an un-OCR'd PDF after I fill in the fields. For me that means keeping track of several hundred extra PDFs each year. The PDFs are records of items and services donated to our Historical Society.

David
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR and removing path from Content view for graphics

Post by Timur Born »

OCR data seems to be added at the end of the content list. So if your document includes some distinguishable object right at its own end, before the OCR text is added, then it might be possible to automate the removal process via scripts?! I am not script savvy, though, so others would have to provide hints on that.

For example: if the very last content object of your document is a path, then all OCR text is added after that last path object. A script might then be able to automatically delete all text objects from the end of the document until it hits the first path object. Even if the last object of your pages is text, it might be possible to check the text content and stop the script once it hits a certain text. This would be a less tedious way of removing the OCR text. If you include a command to start a new OCR at the end of the script, then this might even be a one-button solution.

But again, I don't know if scripts are able to do this.
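
The scan-from-the-end idea can be sketched on a plain list, purely as a model of the logic. The (type, value) tuples, the function name strip_trailing_text, and the stop_text parameter are all assumptions for illustration; this does not use any PDF-XChange Editor scripting interface, and it relies on the observation above that OCR text is appended at the end of the content list.

```python
def strip_trailing_text(content, stop_text=None):
    """Return content without the run of text objects at its end.

    content is a list of (type, value) tuples in content-list order.
    Scanning starts at the last entry and stops at the first entry that is
    not text (for example a path), or at a text entry equal to stop_text.
    """
    end = len(content)
    while end > 0:
        kind, value = content[end - 1]
        if kind != "text" or (stop_text is not None and value == stop_text):
            break
        end -= 1
    return content[:end]


# Hypothetical page: form text, a box border, then appended OCR text.
page = [
    ("text", "Donor name:"),
    ("path", "box border"),
    ("text", "ocr: Donor name:"),
    ("text", "ocr: J. Smith"),
]
cleaned = strip_trailing_text(page)
```

Stopping at any non-text object rather than only at a path is a deliberately cautious choice: it never deletes past something the scan does not recognize.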
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR and removing path from Content view for graphics

Post by Tracker Supp-Stefan »

Hello Timur,

That's a nice idea, but scripts do not have access to the base content of the file, so the removal will still need to be done manually. At least your suggestion gives DWC121 a good indication of where to start deleting content.

Regards,
Stefan
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR and removing path from Content view for graphics

Post by Timur Born »

Feature suggestion: Editor could put "border" content objects around its own OCR content, like putting a specific container, path or similar before and after the OCR text content. That way OCR could identify older OCR versions and offer a new option to first remove old (Editor based) OCR text when new OCR text is added.
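
As a rough model of this suggestion, the sketch below wraps each OCR run in a pair of marker objects and removes any earlier marked run before adding a new one. The marker tuples and the function names are hypothetical. The PDF format does have marked-content operators that could play a similar role, but whether the Editor would use them is not stated in this thread.

```python
OCR_BEGIN = ("marker", "ocr-begin")  # hypothetical marker objects
OCR_END = ("marker", "ocr-end")


def remove_marked_ocr(content):
    """Drop every run enclosed by OCR_BEGIN ... OCR_END, markers included."""
    kept = []
    inside = False
    for obj in content:
        if obj == OCR_BEGIN:
            inside = True
        elif obj == OCR_END:
            inside = False
        elif not inside:
            kept.append(obj)
    return kept


def apply_ocr(content, recognized):
    """Replace any earlier marked OCR run with a freshly marked one."""
    fresh = [("text", t) for t in recognized]
    return remove_marked_ocr(content) + [OCR_BEGIN, *fresh, OCR_END]
```

With this scheme, form text and paths that sit outside the markers are never touched, which is the distinction the Content pane cannot make today.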
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR and removing path from Content view for graphics

Post by Tracker Supp-Stefan »

Thanks for the suggestion Timur,

I will pass it along for consideration.

Regards,
Stefan
DWC121
User
Posts: 66
Joined: Thu Jul 30, 2015 5:18 am

Re: OCR and removing path from Content view for graphics

Post by DWC121 »

Maybe instead of just "Text:" or "Path:" in the Content pane, include a time stamp such as "Text-20171012185343:" (that is, yyyymmddhhmmss). Just a quick thought off the top of my head. Perhaps later incorporate something to delete content entries based on the time stamp.
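
To show how such labels could be read and filtered, here is a small sketch. The label layout ("Text-20171012185343:") is taken from the suggestion above; the function names parse_label and drop_stamped are assumptions, and the Content pane does not currently produce labels like this.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # yyyymmddhhmmss


def parse_label(label):
    """Split 'Text-20171012185343:' into ('Text', datetime).

    Labels without a stamp, such as 'Path:', yield (kind, None).
    """
    kind, sep, stamp = label.rstrip(":").partition("-")
    if not sep:
        return kind, None
    return kind, datetime.strptime(stamp, STAMP_FORMAT)


def drop_stamped(labels, stamp):
    """Keep every label except those carrying exactly the given time stamp."""
    return [lab for lab in labels if parse_label(lab)[1] != stamp]
```

Deleting by exact stamp would remove one OCR pass as a unit while leaving unstamped form text and paths alone.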
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR and removing path from Content view for graphics

Post by Tracker Supp-Stefan »

Hello DWC121,

Thanks for the suggestion. It seems nice, but I think the Content pane is supposed to show only the types of the elements and not time stamps.
In any case, I've passed this discussion to our devs for consideration, and we will think about what can be done to improve this area of the Editor.

Regards,
Stefan