Lossy compression for content streams

Please post any ideas or requests for new features here for the end-user version of PDF-XChange (printer drivers)


Mathew
User
Posts: 239
Joined: Thu Jun 19, 2014 7:30 pm

Lossy compression for content streams

Post by Mathew »

I often run into extremely bloated PDFs that won't optimize because all of their size is in content streams. It's usually because someone generated the PDF with linework that is overly complex or detailed for the resolution needed in the document:
image.png
Many methods have been developed to reduce the complexity of .gpx files by reducing the number of stored points or by smoothing. It would be great to have a similar option for content streams in PDFs.

A rule of thumb, in my mind, to see whether a content stream is disproportionately large: if the page were converted to an image at the resolution acceptable for images in the document, and that saved space, the content stream may be excessively detailed. I'm not suggesting this rule of thumb be built into PDF-XChange, but I frequently run into PDF files like this. I've attached one example.
OPM-0393-13.pdf
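
For illustration only, here is roughly how that rule of thumb could be scripted. PyMuPDF and the 150 dpi target are my own assumptions for the sketch, not anything PDF-XChange does:

import fitz  # PyMuPDF

doc = fitz.open("OPM-0393-13.pdf")  # the attached sample
for page in doc:
    # Stored (already compressed) size of the page's content streams.
    stream_size = sum(len(doc.xref_stream_raw(x)) for x in page.get_contents())
    # The same page rendered as a plain 150 dpi PNG instead.
    image_size = len(page.get_pixmap(dpi=150).tobytes("png"))
    if image_size < stream_size:
        print(f"page {page.number + 1}: {stream_size} B of vector content, "
              f"{image_size} B as an image -> possibly over-detailed")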
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Lossy compression for content streams

Post by Lzcat - Tracker Supp »

Hi, Mathew
Unfortunately, lossy compression for content is not possible even in theory: the drawing operators must be preserved, otherwise the content cannot be parsed and displayed at all.
As far as I can see, your file contains a lot of path items (vector drawings), and that is the problem. The file was generated by printing from AutoCAD to Adobe Distiller. Each page in the file has a "transparent background logo" in the middle. I'm not sure what it was BEFORE printing, but afterwards it has no transparency and consists of many (really many!) figures with different fill colors, and this is what causes the file size to grow. Other pages have additional objects transformed in a similar way, which also increases the file size.
So you have several options for reducing the file size:
1. If you have no access to the original files, you can try to rasterize the pages - all of them, or only some. For example, most of the size in the attached file is used by pages 3 (~1 MB) and 7 (~4 MB). Of course, rasterization may also increase the file size when the page content is not very complex. You can check a page's approximate size in bytes by extracting it to a separate file, and then decide what to do (see the sketch after this list).
2. If you have the original files, you can try to recreate the PDFs. You can change the printing options in AutoCAD or in the printer driver, or try other virtual printers - this will change the resulting files' quality and size. There may also be ways to convert the source files to PDF other than printing - in most cases, creating PDFs with a virtual printer is not the best solution in terms of quality and size.
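
As an illustration of option 1, here is a minimal sketch that rasterizes only the heaviest pages. PyMuPDF, the page indices, and the 150 dpi target are assumptions for the example, not something our products do internally:

import fitz  # PyMuPDF

src = fitz.open("OPM-0393-13.pdf")
out = fitz.open()
HEAVY = {2, 6}  # 0-based indices of pages 3 and 7 mentioned above

for i, page in enumerate(src):
    if i in HEAVY:
        # Replace the vector-heavy page with a 150 dpi bitmap of itself.
        pix = page.get_pixmap(dpi=150)
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        new_page.insert_image(new_page.rect, pixmap=pix)
    else:
        out.insert_pdf(src, from_page=i, to_page=i)  # keep the rest as-is

out.save("OPM-0393-13-rasterized.pdf", deflate=True)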
Kind regards,
Lzcat - Tracker Supp
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Mathew
User
Posts: 239
Joined: Thu Jun 19, 2014 7:30 pm

Re: Lossy compression for content streams

Post by Mathew »

Lzcat - Tracker Supp wrote: Tue Nov 28, 2023 9:41 am Unfortunately, lossy compression for content is not possible even in theory
Maybe you are misunderstanding what I'm asking for: this would be an option within the optimization process. The compression options for images are also lossy. I would not expect the fidelity of the resulting vectors to be the same, but PDFs frequently contain vast amounts of data that would not be visible at the resolution at which the document is printed or viewed, so it makes sense to have a way to reduce the stored information. We do exactly the same thing for raster images.

Numerous algorithms have been developed for reducing the complexity of vector data. I mentioned .gpx files because I know this is commonly done for them:
https://en.wikipedia.org/wiki/Vector_optimization
https://kb.geoczech.org/knowledge-base/how-to-reduce-size-of-vector-data/
https://gpx.studio/

People have developed relatively small scripts within AutoCAD that look at the offset and angle between adjacent control points and reduce them based on entered criteria. For a programmer who can read AutoLISP, this simple tool demonstrates something that could apply to vector data within PDFs (I've sketched the idea in Python after the link):
https://forums.autodesk.com/t5/autocad-forum/reduce-polyline-points/m-p/8520506#M963759
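
To make the idea concrete, here is a rough Python analogue of what that AutoLISP routine does - drop a vertex when it sits within a tolerance of the line joining its neighbours. The tolerance value is made up; this is a sketch of the technique, not a proposed implementation:

import math

def simplify(points, tol=0.5):
    """Greedy polyline simplification by perpendicular offset (pt units)."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (ax, ay), (px, py), (bx, by) = kept[-1], points[i], points[i + 1]
        seg = math.hypot(bx - ax, by - ay)
        if seg == 0:
            continue  # coincident neighbours: always drop
        # Perpendicular distance of P from the line A-B.
        dist = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / seg
        if dist > tol:
            kept.append(points[i])  # the vertex carries visible detail
    kept.append(points[-1])
    return kept

# A wobbly line whose detail is far below print resolution:
line = [(x / 10.0, 0.01 * ((x % 2) * 2 - 1)) for x in range(1000)]
print(len(line), "->", len(simplify(line)))  # 1000 -> 2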

I understand that the ideal option is not to create bloated PDFs in the first place; the same could be said for including images at too high a resolution in PDFs.
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Lossy compression for content streams

Post by Lzcat - Tracker Supp »

Hi, Mathew,

Well, it looks like you may not have a full picture of the generation/optimization problem in PDF files specifically. In most cases vector optimization can give you some benefit, but in PDF we have a lot of limitations that seriously reduce what optimization can achieve.
First, paths in PDF content are represented in text form (as all other content items are), contain no unnecessary data that could be omitted, and this text is usually compressed (in your example too: without compression the file size would be about 43 MB, not 9.5 MB). So even if we could shrink the text representation of path items at the cost of precision, or with some cleverer encoding (which is all we can do without recreating the drawings), the text would still contain the same number of commands and coordinates, so it is almost impossible to expect more than a 2x size reduction (in most cases we would be looking at about 10-20%). In the case of your sample file, as far as I can see, the vector drawings do not use any non-optimal text representation, so optimization would be even less effective than you might expect.
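
A quick illustration of that first point - path text is already squeezed hard by Flate, so shortening the text itself gains little. The fake tessellated path below is an assumption purely for the demonstration:

import random, zlib

random.seed(1)
# Fake a tessellated outline: 50,000 tiny line segments as "l" operators.
ops = "0 0 m " + " ".join(
    f"{random.uniform(0, 612):.2f} {random.uniform(0, 792):.2f} l"
    for _ in range(50_000)
)
raw = ops.encode("ascii")
# Flate roughly halves even this random data; real drawings with repeated
# patterns compress far better (43 MB -> 9.5 MB for the sample file).
print(len(raw), "->", len(zlib.compress(raw, 9)))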

The main problem with this file (and similar files created from AutoCAD by printing) is that it does not use transparency and prefers to split complex or overlapping figures into a bunch of smaller ones. You can open the content pane and examine the file yourself - you will see that almost every intersection of the "transparent" logo with text or a line is a different, often complex, figure. If this file had been created using transparency, it would contain 3-5 times fewer figures (or even more!). Two other known problems when printing from AutoCAD are tessellation and filling figures using strokes (because printing from AutoCAD is plotter-oriented). Both techniques produce good results on paper, but with virtual printing they cause huge files. For example, take the EASE logo on page 3. It looks like just four letters over five circles, but in fact these are very complex figures using a lot of space. The smallest circle (around the letter A) is not a circle but a very complex polygon (many hundreds of commands) taking about 55 KB of space (8 KB compressed), while a normal circle in PDF is represented by 4 Bezier curves, which use at most 200-300 bytes (uncompressed!). And the problem is that we cannot reconstruct a circle from such a complex polygon. All the other "circles" and "letters" in that logo are constructed in a similar way, which bloats the file size. Parts of them are also rasterized as images and placed on top (I do not understand why). Also, most of the text on page 3 is not text: each letter is represented by a complex figure describing its shape, which also bloats the file many times over (by a factor of several hundred or even several thousand, depending on the font used). Of course, compression helps reduce the resulting file, but using text instead of curves is much more effective.
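
For reference, a sketch of what a true circle costs in a content stream, using the standard four-Bezier approximation (the Python wrapper is only there to build and measure the operators):

KAPPA = 0.5522847  # standard constant for a quarter-circle Bezier arc

def circle_path(cx, cy, r):
    """Emit PDF path operators for a circle: one moveto, four curveto."""
    k = KAPPA * r
    return (
        f"{cx + r} {cy} m "
        f"{cx + r} {cy + k} {cx + k} {cy + r} {cx} {cy + r} c "
        f"{cx - k} {cy + r} {cx - r} {cy + k} {cx - r} {cy} c "
        f"{cx - r} {cy - k} {cx - k} {cy - r} {cx} {cy - r} c "
        f"{cx + k} {cy - r} {cx + r} {cy - k} {cx + r} {cy} c h S"
    )

path = circle_path(306, 396, 100)
print(path)
print(len(path.encode("ascii")), "bytes uncompressed")  # a few hundred bytes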

In conclusion: you have some bloated files (because they were printed from AutoCAD) and are asking us to reduce their size, which in this case is not a feasible task. A better solution would be to find a way to recreate these files with a better conversion, which may give you more adequate results.
None of the vector compression techniques we could possibly apply will compress a letter shape down to 1 or 2 bytes (yes, to display a letter as text in PDF we need only one or two bytes, not tens or hundreds of drawing commands and coordinates describing the corresponding figure), so I hope you can see what the better/ideal case is. Perhaps you cannot recreate the files (you have no source files, or cannot find better conversion tools). I'm afraid that in this case you will not find any good solution. Sometimes rasterization will do (perhaps combined with OCR), but it will not fix the situation in general. We have no plans to implement drawing optimizations in the near future (especially lossy ones) because of their complexity and because the expected file size reduction is, in general, not that great. So I'm afraid you should look for another solution rather than wait for "lossy compression for content streams" and expect it to help in such cases - it would not, even if it were implemented.

Kind regards,
Lzcat - Tracker Supp
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Mathew
User
Posts: 239
Joined: Thu Jun 19, 2014 7:30 pm

Re: Lossy compression for content streams

Post by Mathew »

Lzcat,

Thank you so much for your detailed thoughts on this. Yes, clearly I don't understand the complexities involved, and fundamentally, yes, the issue is people generating excessively large PDFs in the first place and not realizing or caring. And the example I posted is not even a good example of what I was envisioning being able to "compress" - I don't have anything I can post publicly right now. Monstrously large PDFs generated in AutoCAD that clog my inbox are a cause of exasperation for me - and usually it's just a logo repeated on every page, or a complex hatch pattern at a very fine scale that prints as a solid color. There's no easy solution to this.

Thanks again for your attention,
Mathew.
Tracker Supp-Stefan
Site Admin
Posts: 17960
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Lossy compression for content streams

Post by Tracker Supp-Stefan »

Hello Mathew,

Thanks for the understanding!
It seems you were hoping for an optimisation that would replace shapes with much simpler variants (e.g. the circles in the logo on page 3, which consist of many triangles, with a circle shape that, as Victor said, takes only 4 Bezier curves to describe):
image.png
However, there aren't really any tools we can use on our end to do this with consistent results. OCR engines attempt something similar - but they compare the shapes (pixels, in their case) against a predetermined set of character shapes they have at hand, while a tool that could optimize any arbitrary shape would likely be much more complex!

Kind regards,
Stefan
Mathew
User
Posts: 239
Joined: Thu Jun 19, 2014 7:30 pm

Re: Lossy compression for content streams

Post by Mathew »

I've pasted a screen capture of an example from one of my projects. It's part of a logo that's on every page; I'm just a consultant, so I am required to include it on my drawings too, but the bulk of the drawings are generated by others. This is a corner of it, zoomed to 6400%:
image.png
Just this logo, which measures only 68 pt x 68 pt (less than 1 inch) on the printed 30"x42" page, doubles the size of the PDF.

The lossy optimization I was thinking of: these lines are so close to one another that if they were replaced by a single line between them, the result would still look the same when printed. At its most simplistic, this would require an algorithm that looks at the proximity of points to decide whether their locations can be merged; subsequently, if two objects end up with the same end points, one of them can be removed (a toy sketch follows below).
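
A toy sketch of those two steps (the grid size and the data are made up; it only illustrates the merge-then-deduplicate idea):

def dedupe_segments(segments, grid=0.25):
    """Snap endpoints to a coarse grid, then drop collapsed/duplicate segments."""
    def snap(p):
        return (round(p[0] / grid) * grid, round(p[1] / grid) * grid)

    seen, kept = set(), []
    for a, b in segments:
        a, b = snap(a), snap(b)
        if a == b:
            continue  # segment collapsed to a point: remove it
        key = (min(a, b), max(a, b))  # direction-insensitive duplicate test
        if key not in seen:
            seen.add(key)
            kept.append((a, b))
    return kept

segs = [((0, 0), (10, 0)), ((0.05, 0.04), (10.02, 0.03)), ((5, 5), (5, 5.01))]
print(dedupe_segments(segs))  # -> [((0.0, 0.0), (10.0, 0.0))]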

The previous image is a fill pattern, but if I generate the PDF with the fill pattern turned off, there's a similar issue with the outline that's hidden behind it (again zoomed to 6400%):
image(2).png
Each line is made up of individual control points, sometimes spaced hundreds to a single point (i.e. well over 5000 dpi). Again, an algorithm that merged control points within a line based on spacing and angle would reduce the file size. This is what gpx optimization and the AutoLISP routine I linked to above do - and as a test I ran it on this logo in my own drawings, reducing the generated file size by almost 40%. But again, I cannot get the client to change their logo file, so my only current workaround is to apply a redaction over their logo on every page of the drawing set when I receive it.

Thankfully, PDF-XChange has tools that make that somewhat easy (duplicating a redaction onto every page - although some consultants generate their pages differently, so the redaction ends up in the wrong place).
TrackerSupp-Daniel
Site Admin
Posts: 8624
Joined: Wed Jan 03, 2018 6:52 pm

Re: Lossy compression for content streams

Post by TrackerSupp-Daniel »

Hello, Mathew

Thank you for the example. The dev team has seen this and may look further into it. Unfortunately, I cannot offer any guarantees of changes or new features at this time.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our website domain and email address have changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com