Search Index for large # of PDF's  SOLVED


4mc
User
Posts: 42
Joined: Tue Apr 27, 2021 12:42 am

Search Index for large # of PDF's

Post by 4mc »

I had planned to put a large (200+) collection of magazines online via a site that specializes in search and builds PDFs from individual magazine pages. Unfortunately, the publisher and original owner has taken legal steps to stop that.

I'm left with a large number of PDFs, most of which have 50 pages; I also have about a dozen books of 150-400 pages. PDF-XChange is great at searching across these PDFs to find names, terms, etc. However, it's getting slower and slower.

Does anyone know of a tool or app that could improve the performance?

I can put the PDFs on a shared NAS and run a server, but would prefer not to break up the PDFs now. Ideas?

++Mark.
https://ctproduced.com
TrackerSupp-Daniel
Site Admin
Posts: 8624
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's

Post by TrackerSupp-Daniel »

Hello, 4mc

If you are searching a very large number of files and finding that it is beginning to take a very long time, my first step would be to check the files themselves. Perhaps you can "save as optimized" the files to reduce the excess data, so that the search has less extraneous content to go through and runs faster.

Another possibility is that, even with indexing, this process is still heavy on your storage drives and local processor. It may be time to consider some hardware upgrades to speed up actions like this.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
4mc
User
Posts: 42
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

Saving as optimized is possible, but would require a duplicate set of PDFs. As per this discussion, the quality is also important:
viewtopic.php?t=40365

My current processor is an 11th Gen Intel 8-core i7 with 32GB RAM, so that's not the issue.

I was hoping someone had tackled this problem before. There is some discussion about it on the Adobe forums, but the solutions are only relevant to Adobe products. Adobe also supports catalogs and a unified index across a catalog, which would be very interesting.

I'd rather find a non-Adobe solution, and I kept my request somewhat open-ended to solicit ideas from other users of PDF-XChange. PDFMiner seems like a starting point, and there are lots of suggestions on Stack Overflow: https://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library/8325135#8325135

I was hoping for something more ready to go. I've tried numerous online solutions, but they are no use as I can't put the PDFs online (see https://worldradiohistory.com/Archive-All-Music/Down-Beat.htm).
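
For anyone wanting to try the PDFMiner route mentioned above, here is a minimal sketch of a build-once, search-many index in Python. It is an illustration under assumptions, not a tested tool: it assumes pdfminer.six is installed, that every PDF already carries an OCR text layer, and the names `tokenize`, `build_index`, `search` and `pages_from_pdf` are made up for this example. The slow step (text extraction) would run once per file; searches afterwards only touch the in-memory index.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(pages):
    """Build an inverted index from (filename, page_number, text) tuples.

    Each term maps to the set of (filename, page_number) pairs containing it.
    """
    index = defaultdict(set)
    for filename, page_number, text in pages:
        for term in tokenize(text):
            index[term].add((filename, page_number))
    return index


def search(index, query):
    """Return sorted (filename, page_number) pairs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


def pages_from_pdf(path):
    """Yield (path, page_number, text) for one PDF.

    Assumption: pdfminer.six is installed and the PDF has a text (OCR) layer.
    The import is local so the index functions work without pdfminer.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, layout in enumerate(extract_pages(path), start=1):
        text = "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )
        yield path, page_number, text
```

Feeding `pages_from_pdf(...)` for each file into `build_index` and saving the result (for example with `pickle`) would mean the collection is only parsed again when files change. This matches whole words on a page only, so phrase search would need extra work.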
4mc
User
Posts: 42
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

I would say that on Windows 11 the file manager search seems noticeably faster than PDF-XChange. It returns lists of magazines, but with no context or adjacent text; the only option is to select all, open them, and then use Find.

Running a PDF-XChange search against \Book scans\ takes 7 minutes and 6 seconds to find 195 documents and 871 results. Windows File Manager search over exactly the same folder finishes in less than 5 seconds.
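
The context that the Windows results lack could in principle be rebuilt from extracted page text. A minimal sketch, assuming the text of a page is already available as a Python string; the function name `context_snippets` and the `width` parameter are illustrative only:

```python
import re


def context_snippets(text, term, width=40):
    """Return every case-insensitive match of `term` with surrounding text.

    Each snippet holds up to `width` characters before and after the match,
    with line breaks and repeated spaces collapsed to single spaces.
    """
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(" ".join(text[start:end].split()))
    return snippets
```

Run over only the pages that a fast search has already flagged, this would give a result list with adjacent text without opening every document.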
TrackerSupp-Daniel
Site Admin
Posts: 8624
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's  SOLVED

Post by TrackerSupp-Daniel »

Hello, 4mc

Thank you for clarifying. At the moment we do not have indexing, and we search through a great deal more data than the Windows search is capable of, so it is not unexpected that around 200 documents take a notable time to search through. Reducing the search breadth (such as disabling the search of bookmarks) may give a notable improvement in search speed, at the cost of not including those items in the results.

Beyond that, I should note that we are beginning work on indexing functions. While I cannot offer a timeline, it does look like something that will eventually be coming down the pipeline.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

4mc
User
Posts: 42
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

Thanks Daniel.

Since in most cases the books and magazines are scanned and then searched, they typically only have the OCR text and the image data, at least from my perspective. I don't add bookmarks or anything else. I don't even add to the properties.

While I'd be interested in an all-encompassing search index, my primary need is an index of the OCR data created by PDF-XChange; the other fields and information would be secondary or tertiary.

For now, the Windows File Explorer does a good job on the OCR text. That said, it comes with a cost when NOT wanting the search to include .pdf OCR data. This is primarily why I was hoping for another solution, even if it wasn't part of PDF-XChange, let alone part of the core.
TrackerSupp-Daniel
Site Admin
Posts: 8624
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's

Post by TrackerSupp-Daniel »

Hello, 4mc

If you want to prevent the page content from being searched (effectively only searching for titles), one option would be to disable the internal search terms and enable the option to only search document info (which should include the file name and title):
image.png
Another would be to use our shell utility to disable our IFilter extension. This article details how to re-enable all extensions, but if you use the GUI option, you can disable just a single one of them: https://www.pdf-xchange.com/knowle ... extensions

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
