Differences

Here you can see the differences between the selected revision and the current version of this page.

--- blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2025/05/07 22:15] – Róbert Toth
+++ blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/04/30 10:56] (current) – [1. Minimize PDF size] Róbert Toth
@@ Line 6: / Line 6: @@
   * …
-TODO
+Due to the nature of the topic, this post is (and probably will remain) a work-in-progress.
 ===== General: Overview of PDF-processing and manipulation tools =====
+==== Coherent PDF (cpdf) ====
+  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo). The link on that page downloads all binaries; to get only the binary you need, visit the [[https://github.com/coherentgraphics/cpdf-binaries/|GitHub page]] and go to the relevant folder.
+  * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1
+  * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html
 ==== MuPDF (mutool) ====
   * **Download:**    no binary, install via ''sudo port install mupdf''
-  * **Changelog:**   https://mupdf.com/releases/history
+  * **Changelog:**
-  * **Manual:**      https://mupdf.readthedocs.io/en/latest/
+    * https://mupdf.com/releases/history
+    * https://github.com/ArtifexSoftware/mupdf/blob/master/CHANGES
+  * **Manual:**      https://mupdf.readthedocs.io/en/latest/tools/mutool.html
+==== pdfcpu ====
+  * **Download:**    https://github.com/pdfcpu/pdfcpu
+  * **Changelog:**   https://pdfcpu.io/changelog/
+  * **Manual:**      https://pdfcpu.io/about/command_set
 ==== PDFtk server (pdftk) ====
@@ Line 20: / Line 33: @@
   * **Manual:**      https://www.pdflabs.com/docs/pdftk-man-page/ ([[https://www.pdflabs.com/docs/pdftk-cli-examples/|examples here]])
-==== Coherent PDF (cpdf) ====
+==== QPDF ====
-  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo)
+  * **Download:**    https://github.com/qpdf/qpdf/releases (on macOS, use the [[https://ports.macports.org/port/qpdf/details/|MacPorts QPDF port]])
-  * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1
+  * **Changelog:**   https://qpdf.readthedocs.io/en/stable/release-notes.html
-  * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html
+  * **Manual:**      https://qpdf.readthedocs.io/en/stable/cli.html
-==== pdfcpu ====
-  * **Download:**    https://github.com/pdfcpu/pdfcpu
-  * **Changelog:**   https://pdfcpu.io/changelog.html
-  * **Manual:**      https://pdfcpu.io/about/command_set
 <html>
@@ Line 39: / Line 47: @@
 </html>
-==== Other (non-tested) tools ====
+==== Other (non-/not-yet-tested) tools ====
-  * **QPDF:**        https://qpdf.sourceforge.io/
+TODO
-  * …
 <html>
 <!--
@@ Line 48: / Line 55: @@
 </html>
-===== Case #000: Minimize PDF size =====
+===== - Minimize PDF size =====
 **Example use-case:** Obvious.
-==== cpdf ====
+==== TL;DR: Summary first ====
+  * **Effectiveness of the tools** (ordered by file size):
+    * <color green>**acrobat optimize**</color> ≪ mutool clean ≪ cpdf squeeze < pdfcpu optimize ≪ <color red>**pdftk compress**</color>
+    * ❶ Acrobat wins (but is paid), ❷ mutool is second & free (but is hard to configure), ❸ cpdf is third but is easiest to use
+    * I have also tested all possible combinations between these tools, and the results are best when Acrobat Optimizer is followed by //yet another// tool. So…
+  * **Effectiveness of the //combined// tools** (ordered by file size):
+    * <color green>**acrobat optimize** + **cpdf squeeze**</color> < acrobat optimize + pdfcpu optimize ≪ <color red>**acrobat optimize** + **mutool clean**</color>
+<color blue/lightgrey>**Conclusion:** Always optimize PDF in Acrobat PDF Optimizer, then squeeze it with cpdf squeeze!</color>
+==== Coherent PDF (cpdf) ====
 [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-660005.3|cpdf -squeeze]]:
 <code>
-cpdf -squeeze "src.pdf" [-squeeze-no-recompress] -o "dst.pdf"
+cpdf -squeeze "in.pdf" [-squeeze-no-recompress] -o "out.pdf"
 </code>
@@ Line 60: / Line 77: @@
 [[https://pdfcpu.io/core/optimize.html|pdfcpu optimize]]:
 <code>
-pdfcpu optimize "src.pdf" "dst.pdf"
+pdfcpu optimize "in.pdf" "out.pdf"
 </code>
+==== MuPDF (mutool) ====
+[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
+<code>
+mutool clean -gggg -l -d -z -s "in.pdf" "out.pdf"
+</code>
+''mutool clean'' has many options and it takes some experimentation to see what actually shrinks the PDF size:
+  * **''-gggg'' (Garbage collect unused objects / compact xref table / merge duplicate objects / check streams for duplication):** first three ''g'''s do not affect the file size for me, and the fourth makes it sometimes a bit larger and sometimes a bit smaller. I usually use the whole ''-gggg'' parameter.
+  * **''-l'' (Linearize PDF):** this makes the PDF ready for "Fast web view", but makes it slightly larger. Note that MuPDF 1.26.0 [[https://artifex.com/blog/mupdf-removes-linearisation|removes linearisation support]], so it does not really make sense to use it.
+  * **''-d'' (Decompress streams):** this decompresses the whole file, which makes it larger – but when combined with ''-z'' (Deflate uncompressed streams), it allows better PDF compression
+  * **''-z'' (Deflate uncompressed streams):** this is the most important switch, and when combined with ''-d'' (Decompress streams) it lowers PDF size even more
+  * **''-f'' (Compress font streams):** no effect for me
+  * **''-i'' (Compress image streams):** no effect for me
+  * **''-c'' (Clean content streams):** no effect for me
+  * **''-s'' (Sanitize content streams):** this actually lowers file size for me
+  * **''-AA'' (Recreate appearance streams for annotations):** no effect for me (and not sure what it really does – at least while there are no annotations, it does not affect PDF file size at all)
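Since the effect of each switch differs from file to file, the experimentation can be scripted. The following is only a minimal sketch, assuming a POSIX shell and ''mutool'' on the PATH; the helper names ''file_size'' and ''try_clean_flags'' are made up for this example and are not part of MuPDF.

```shell
#!/bin/sh
# Print the size of a file in bytes (wc -c may pad with spaces on macOS, so strip them).
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Hypothetical helper: run `mutool clean` once per flag set and print
# "<bytes> <flags>" for every successful run, smallest result first.
# Usage: try_clean_flags in.pdf "-z" "-d -z" "-gggg -d -z -s"
try_clean_flags() {
    src=$1
    shift
    tmp=$(mktemp -d) || return 1
    for flags in "$@"; do
        out="$tmp/out.pdf"
        # $flags is intentionally unquoted so that "-d -z" splits into two options
        if mutool clean $flags "$src" "$out" >/dev/null 2>&1; then
            printf '%s %s\n' "$(file_size "$out")" "$flags"
        fi
    done | sort -n
    rm -rf "$tmp"
}
```

Flag sets that fail (or a missing ''mutool'') simply produce no line, so the output lists only combinations that actually wrote a file.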
-===== Case #001: Split each page of PDF into several pages (posterisation) =====
+==== PDFtk server (pdftk) ====
+[[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk compress]] is not suitable for the task of really shrinking the PDF size to minimum, as the developer himself claims:
+<code>
+pdftk "in.pdf" output "out.pdf" compress
+</code>
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#optimizing-file-size|Optimizing file size with qpdf]] suggests these flags to optimize file size:
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-compress-streams|--compress-streams]]=y'':** This is the default, so it is not in fact needed.
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-decode-level|--decode-level]]=generalized'':** Again, this is the default, so it is not in fact needed.
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-recompress-flate|--recompress-flate]]'':** This forces recompression of already flate-compressed streams. This only makes sense when combined with ''--compression-level=9'' to achieve maximum compression.
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-compression-level|--compression-level]]=9'':** Use maximum compression when flate-compressing PDF streams (the default is only ''6'').
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-object-streams|--object-streams]]=generate'':** This will try to produce new object streams to lower file size as much as possible (the default is only ''preserve'').
+So the resulting command is:
+<code>
+qpdf "in.pdf" --recompress-flate --compression-level=9 --object-streams=generate "out.pdf"
+</code>
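Whether the recompressed file is actually smaller than the input is worth checking before replacing anything. Below is a small sketch of such a check, assuming a POSIX shell; ''smaller_of'' is a hypothetical helper name, not a qpdf feature.

```shell
#!/bin/sh
# smaller_of ORIGINAL CANDIDATE
# Print the path of whichever file is smaller; ORIGINAL wins ties.
smaller_of() {
    a=$(wc -c < "$1" | tr -d ' ')
    b=$(wc -c < "$2" | tr -d ' ')
    if [ "$b" -lt "$a" ]; then
        printf '%s\n' "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Hypothetical usage (requires qpdf):
#   qpdf "in.pdf" --recompress-flate --compression-level=9 --object-streams=generate "out.pdf"
#   best=$(smaller_of "in.pdf" "out.pdf")
```

The same helper works for the output of any other tool on this page, since it only compares byte counts.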
+==== Adobe Acrobat ====
+Acrobat [[https://helpx.adobe.com/acrobat/using/optimizing-pdfs-acrobat-pro.html#pdf_optimizer_options_acrobat_pro|PDF Optimizer]] has tons of options. After thorough testing, the best settings usually seem to be these:
+  ; <color blue>Images</color> tab            : ⚠️ This is a per-PDF setting – it depends on your particular use-case and file. My default is to have it turned off.
+  ; <color blue>Fonts</color> tab             : ✅ always on, with font subsetting turned on
+  ; <color blue>Transparency</color> tab      : 🚫 This causes optimisation to take extremely long time in some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).
+  ; <color blue>Discard Objects</color> tab   : ✅ Everything on, except **Discard bookmarks** (you never want that) and also **Convert smooth lines to curves** and **Detect and merge image fragments** (these are not needed in 99% of cases and they cause optimisation to take significantly longer, while they usually do not lower the PDF size at all). You might also want to **Discard all Javascript actions**, but if there are any internal links (e.g. a table of contents, references or an index pointing to specific places in the PDF), this breaks them.
+  ; <color blue>Discard User Data</color> tab : ✅ Everything on.
+  ; <color blue>Clean Up</color> tab          : ✅ Everything on.
+===== - Split each page of PDF into several pages (posterisation) =====
 **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two.
+==== TL;DR: Summary first ====
+<color blue/lightgray>**Conclusion:** Use cpdf -chop. MuPDF has some problems and combining pdftk with manual cropping in Adobe Acrobat is tedious.</color>
 ==== MuPDF (mutool) ====
 [[https://mupdf.readthedocs.io/en/latest/mutool-poster.html|mutool poster]]:
 <code>
-mutool poster -x 2 "src.pdf" "dst.pdf"
+mutool poster -x 2 "in.pdf" "out.pdf"
 </code>
+Note that ''mutool poster'' sometimes sets //both// mediabox and cropbox – the latter seems unnecessary. More importantly, ''mutool poster'' causes problems when ''mutool trim'' is used afterwards – for some reason, it leaves the PDF completely empty.  This does not happen when ''mutool trim'' is used after ''cpdf -chop''.
 ==== Coherent PDF (cpdf) ====
 [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch9.html#x13-940009.4|cpdf -chop]]:
 <code>
-cpdf -chop "1 2" "src.pdf" -o "dst.pdf"
+cpdf -chop "2 1" "in.pdf" -o "out.pdf"
+</code>
+Unlike MuPDF, ''cpdf -chop'' only sets mediabox, which is enough.
+Moreover, ''cpdf'' can also remove page labels, if needed:
+<code>
+cpdf -chop "2 1" "in.pdf" AND -remove-page-labels -o "dstNoLabels.pdf"
 </code>
@@ Line 82: / Line 152: @@
 [[https://www.pdflabs.com/docs/pdftk-man-page/#dest-op-shuffle|pdftk shuffle]] && [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch3.html#x7-570003.6|cpdf -mediabox]]:
 <code>
-pdftk A=src.pdf shuffle A A output dst.pdf
+pdftk A=in.pdf shuffle A A output "shuffled.pdf"
-cpdf -mediabox "0mm 0mm a5landscape" "src.pdf" odd -o "srcOdd.pdf"
+cpdf -mediabox "0mm 0mm a5landscape" "shuffled.pdf" odd -o "srcOdd.pdf"
-cpdf -mediabox "148.5mm 0mm a5landscape" "srcOdd.pdf" even -o "dst.pdf"
+cpdf -mediabox "148.5mm 0mm a5landscape" "srcOdd.pdf" even -o "out.pdf"
 </code>
 The second step might also be done in Adobe Acrobat "Crop pages" function.
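The ''148.5mm'' offset used for the even pages is half the long side of an A4 sheet (297 mm), i.e. the x-coordinate where the right-hand half of an A4 landscape page starts. For other page sizes the offset has to be recomputed; the sketch below does the arithmetic with ''awk''. The function names ''half'' and ''mm_to_pt'' are made up for this example.

```shell
#!/bin/sh
# Half of a page dimension, in whatever unit was passed in (e.g. mm).
half() {
    awk -v w="$1" 'BEGIN { printf "%g\n", w / 2 }'
}

# Convert millimetres to PDF points (1 pt = 1/72 inch, 1 inch = 25.4 mm).
# Useful when comparing against raw /MediaBox values, which are in points.
mm_to_pt() {
    awk -v mm="$1" 'BEGIN { printf "%g\n", mm * 72 / 25.4 }'
}

# Hypothetical usage:
#   half 297          # x-offset of the right-hand half of an A4 landscape page, in mm
#   mm_to_pt 148.5    # the same offset expressed in points
```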
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Case #002: Crop pages of PDF =====
+===== - Crop pages of PDF =====
 See MuPDF documentation on [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html#mutool-trim-defined-boxes|different PDF page boxes]] ([media|crop|art|trim|bleed]box).
+Mediabox is a "physical" size of the page, while other boxes are in a way only "virtual": they specify which content should be visible at what point and in which cases, but they do not alter real PDF page size.
 ==== Coherent PDF (cpdf) ====
@@ Line 97: / Line 172: @@
 cpdf -mediabox "0mm 0mm a4portrait" "srcA3.pdf" -o "dstA4.pdf"
 </code>
+Note that adjusting page sizes by cropping mediabox is (to my knowledge) the only way to do it without altering the PDF content in any way (as explained [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch3.html#x7-570003.6|here in Coherent PDF manual]]). Cropping other boxes might lead to PDF structure being changed.
 ==== Adobe Acrobat ====
-Note that Acrobat won't let you crop MediaBox.
+Note that Acrobat won't let you crop MediaBox, only other boxes (duh!).
   - Go to "Edit PDF" and then "Crop pages" function.
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Case #003: Remove cropped content from PDF =====
+===== - Remove cropped content from PDF =====
 **Example use-case:** You have cropped some pages but you want to actually remove the content, since otherwise it is only hidden but remains in PDF – this can be seen when you inspect the PDF in Adobe Acrobat via "Edit PDF" and zoom out the page – the cropped content will be selectable, although not visible, since it is out of the page margins.
-mutool trim -o "derotateSplit trim.pdf" -b cropbox "derotateSplit.pdf"
+According to its author, it seems that [[https://github.com/johnwhitington/cpdf-source/issues/116|cpdf is not going to support this feature]].
 ==== MuPDF (mutool) ====
 [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html|mutool trim]]:
 <code>
-mutool trim -o "dst.pdf" -b cropbox "src.pdf"
+mutool trim -o "out.pdf" -b cropbox "in.pdf"
 </code>
+<color red>**Warning:**</color> In some PDFs, I have seen ''mutool trim'' also cut out content which was actually visible on the page (that is, inside mediabox & cropbox). I don't know whether the PDF itself was malformed or whether it is a problem with mutool, but just in case, better avoid it and use Acrobat (below).
 ==== Adobe Acrobat ====
@@ Line 119: / Line 200: @@
   - Run preflight and the script
+==== QPDF ====
+N/A (not tested yet TODO)
+===== - Text in Calibre-created PDF files cannot be copied/searched in Preview =====
+Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search or copy text from PDFs generated by Calibre. This is because a PDF created this way uses CID fonts, which Apple's PDFKit (on which Preview is based) does not support.
+This leads to a situation where **such PDF files can be properly shown and visually read, but trying to select (and copy) or search the text inside the PDF fails**. See these sources for reference:
+  * [[https://superuser.com/q/884060/601266]]
+  * [[https://www.mobileread.com/forums/showthread.php?t=220576]]
+==== Fixing the problem with GhostScript ====
+This can be fixed by converting/running the PDF file through the following GhostScript command:
+<code>
+gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH \
+-dPrinted=false -dQUIET -sOutputFile="out.pdf" "in.pdf"
+</code>
+This was mentioned [[https://discourse.devontechnologies.com/t/pdfs-created-in-calibre-have-no-text-in-dt3/47874|here]].
+GhostScript can be installed e.g. via MacPorts:
+<code>
+sudo port install ghostscript
+</code>
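When more than one file is affected, the GhostScript call can be wrapped in a small function. This is only a sketch, assuming a POSIX shell and ''gs'' on the PATH; the ''-fixed'' suffix and the names ''fixed_name'' and ''fix_pdf'' are arbitrary choices for this example.

```shell
#!/bin/sh
# Derive the output name: "dir/book.pdf" -> "dir/book-fixed.pdf".
# The "-fixed" suffix is an arbitrary choice for this sketch.
fixed_name() {
    printf '%s\n' "${1%.pdf}-fixed.pdf"
}

# Re-write one PDF through GhostScript with the same options as in the command above.
fix_pdf() {
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dBATCH -dPrinted=false -dQUIET \
       -sOutputFile="$(fixed_name "$1")" "$1"
}

# Hypothetical usage: fix every PDF in the current directory.
#   for f in ./*.pdf; do fix_pdf "$f"; done
```

Note that running the loop a second time would also pick up the already converted ''-fixed'' files, so it is best run once per folder.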
+===== - Decompress the whole PDF for editing in text editor =====
+**Example use-case:** There are some cases when you need to see or edit the actual text contents of the PDF. For example, there are some metadata at the level of individual pages (like ) which no existing program will actually clean (Acrobat "Find hidden information" lists them, but keeps them there when you select to remove them).
+So directly editing text contents of the PDF might come in handy.
+==== TL;DR: Summary first ====
+<color blue/lightgrey>**Conclusion:** TODO</color>
+==== Coherent PDF (cpdf) ====
+[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-640005.1|cpdf -decompress]]:
+<code>
+cpdf -decompress [-no-preserve-objstm] "in.pdf" -o "out.pdf"
+</code>
+As the manual mentions, ''[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch1.html#x5-360001.12|-no-preserve-objstm]]'' will remove data from separate object streams and put it back into the normal flow of the PDF, which should make the PDF easier to edit directly.
+<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-no-preserve-objstm'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
+==== MuPDF (mutool) ====
+[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
+<code>
+mutool clean -d "in.pdf" "out.pdf"
+</code>
+The developer specifically describes ''mutool clean'' as designed to "make a PDF file human editable" and to "expand compressed streams", so this is the command to use.
+There is one additional switch which deals with PDF decompression: **''-a''**, which ASCII-hex-encodes binary streams. This safely encodes binary streams so that there should be no problems when editing the PDF in a text editor. However, this forces almost //**all**// streams in the PDF to be encoded, which makes the whole PDF human-unreadable, so it is not really helpful.
+Of all tested tools, ''mutool clean'' is the only one which maintains original object reference numbers (e.g. ''/Metadata 22 0 R'' referencing object #22).
+<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-a'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
+==== pdfcpu ====
+pdfcpu does not seem to support decompressing PDFs.
+==== PDFtk server (pdftk) ====
+[[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk uncompress]]:
+<code>
+pdftk "in.pdf" output "out.pdf" uncompress
+</code>
+PDFtk separates individual PDF dictionary elements by newlines (''0A''). E.g. a page definition looks like this:
+<code>
+<<
+/pdftk_PageNum 7
+/Metadata 23 0 R
+/Rotate 0
+/Resources 24 0 R
+/Type /Page
+/Parent 25 0 R
+/Contents 26 0 R
+/MediaBox [0 0 370.158 591.26]
+/CropBox [0 0 370.158 591.26]
+>>
+</code>
+It also adds its own elements (e.g. ''/pdftk_PageNum'').
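Because pdftk writes each dictionary key on its own line, plain ''grep'' is enough to locate keys in the uncompressed file before opening it in an editor. A minimal sketch, assuming a ''grep'' that supports ''-a'' (both GNU and BSD grep do); ''count_key'' and ''show_key'' are hypothetical helper names.

```shell
#!/bin/sh
# count_key FILE KEY
# Print how many lines of FILE contain the literal string KEY.
# -a treats the (partly binary) PDF as text, -F disables regex matching.
count_key() {
    grep -a -c -F -- "$2" "$1"
}

# show_key FILE KEY
# Print the matching lines prefixed with their line numbers.
show_key() {
    grep -a -n -F -- "$2" "$1"
}

# Hypothetical usage:
#   count_key "out.pdf" "/pdftk_PageNum"   # number of pdftk-added page markers
#   show_key  "out.pdf" "/MediaBox"        # where each page's MediaBox is defined
```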
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#option-stream-data|qpdf --stream-data=uncompress]]:
+<code>
+qpdf "in.pdf" --stream-data=uncompress "out_qpdf.pdf"
+</code>
+This is effectively equivalent to using both [[https://qpdf.readthedocs.io/en/stable/cli.html#option-compress-streams|--compress-streams=n]] and [[https://qpdf.readthedocs.io/en/stable/cli.html#option-decode-level|--decode-level=generalized]].
+There is also another switch, ''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-qdf|--qdf]]'', which TODO
+QPDF currently [[https://github.com/qpdf/qpdf/issues/339|does not support]] maintaining the original object ID.
+===== New cases to come… =====
 <html>
-<!--
+<!-- ———————————————————————————————————————————————————————————————————————————————————————————————
-===== Case #TODO: TODO =====
+———————————————————————————————————————————————————————————————————————————————————————————————  -->
+<!-- New Case Template BEGIN
+===== - New Case TODO =====
 **Example use-case:** TODO.
-==== TODO ====
+==== TL;DR: Summary first ====
+<color blue/lightgrey>**Conclusion:** TODO</color>
+==== Coherent PDF (cpdf) ====
+[[https://www.coherentpdf.com/cpdfmanual/TODO|cpdf -TODO]]:
 <code>
-TODO
+cpdf -TODO "in.pdf" -o "out.pdf"
 </code>
--->
+==== MuPDF (mutool) ====
+[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool TODO]]:
+<code>
+mutool TODO "in.pdf" "out.pdf"
+</code>
+==== pdfcpu ====
+[[https://pdfcpu.io/core/TODO|pdfcpu TODO]]:
+<code>
+pdfcpu TODO "in.pdf" "out.pdf"
+</code>
+==== PDFtk server (pdftk) ====
+[[https://www.pdflabs.com/docs/pdftk-man-page/#TODO|pdftk TODO]]:
+<code>
+pdftk "in.pdf" output "out.pdf" TODO
+</code>
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#TODO|qpdf --TODO]]:
+<code>
+qpdf "in.pdf" --TODO "out.pdf"
+</code>
+--><!-- New Case Template END -->
 </html>