Differences

Here you can see the differences between the selected revision and the current version of this page.

--- blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/01/19 11:39] – Róbert Toth
+++ blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/04/30 10:56] (current) – [1. Minimize PDF size] Róbert Toth
@@ Line 6: / Line 6: @@
   * …
-TODO
+Due to the nature of the topic, this post is (and probably will remain) a work-in-progress.
 ===== General: Overview of PDF-processing and manipulation tools =====
 ==== Coherent PDF (cpdf) ====
-  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo)
+  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo) – the link on that page downloads all binaries; to get only the binary you need, visit the [[https://github.com/coherentgraphics/cpdf-binaries/|GitHub page]] and go to the relevant folder.
   * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1
   * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html
@@ Line 24: / Line 25: @@
 ==== pdfcpu ====
   * **Download:**    https://github.com/pdfcpu/pdfcpu
-  * **Changelog:**   https://pdfcpu.io/changelog.html
+  * **Changelog:**   https://pdfcpu.io/changelog/
   * **Manual:**      https://pdfcpu.io/about/command_set
@@ Line 31: / Line 32: @@
   * **Changelog:**   https://www.pdflabs.com/docs/pdftk-version-history/
   * **Manual:**      https://www.pdflabs.com/docs/pdftk-man-page/ ([[https://www.pdflabs.com/docs/pdftk-cli-examples/|examples here]])
+==== QPDF ====
+  * **Download:**    https://github.com/qpdf/qpdf/releases (on macOS, install via the [[https://ports.macports.org/port/qpdf/details/|MacPorts QPDF port]])
+  * **Changelog:**   https://qpdf.readthedocs.io/en/stable/release-notes.html
+  * **Manual:**      https://qpdf.readthedocs.io/en/stable/cli.html
 <html>
@@ Line 41: / Line 47: @@
 </html>
-==== Other (non-tested) tools ====
+==== Other (non-/not-yet-tested) tools ====
-  * **QPDF:**        https://qpdf.sourceforge.io/
+TODO
-  * …
 <html>
 <!--
@@ Line 50: / Line 55: @@
 </html>
-===== Minimize PDF size =====
+===== - Minimize PDF size =====
 **Example use-case:** Obvious.
@@ Line 94: / Line 100: @@
 <code>
 pdftk "in.pdf" output "out.pdf" compress
+</code>
+==== QPDF ====
+The qpdf manual section [[https://qpdf.readthedocs.io/en/stable/cli.html#optimizing-file-size|Optimizing file size]] suggests these flags:
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-compress-streams|--compress-streams]]=y'':** This is the default, so it is not in fact needed.
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-decode-level|--decode-level]]=generalized'':** Again, this is the default, so it is not in fact needed.
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-recompress-flate|--recompress-flate]]'':** This forces recompression of already flate-compressed streams. This only makes sense when combined with ''--compression-level=9'' to achieve maximum compression.
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-compression-level|--compression-level]]=9'':** Use maximum compression when flate-compressing PDF streams (the default is only ''6'').
+  * **''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-object-streams|--object-streams]]=generate'':** This will try to produce new object streams to lower file size as much as possible (the default is only ''preserve'').
+So the resulting command is:
+<code>
+qpdf "in.pdf" --recompress-flate --compression-level=9 --object-streams=generate "out.pdf"
 </code>
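The qpdf command above can be wrapped so that the recompressed file is kept only when it is actually smaller. This is a minimal sketch, not a tested workflow: ''shrink_pdf'' is a hypothetical helper name, and it assumes that qpdf's exit status 3 means "warnings only, output still written".

```shell
# shrink_pdf SRC DST: recompress SRC with the qpdf flags listed above and
# keep the result only if it is smaller than SRC.
# shrink_pdf is a hypothetical helper, not part of qpdf.
shrink_pdf() {
  src=$1
  dst=$2
  qpdf "$src" --recompress-flate --compression-level=9 --object-streams=generate "$dst"
  rc=$?
  # Assumption: exit status 3 means qpdf only issued warnings and still wrote DST.
  [ "$rc" -eq 0 ] || [ "$rc" -eq 3 ] || return "$rc"
  # $(( ... )) strips the padding some wc implementations add.
  old=$(( $(wc -c < "$src") ))
  new=$(( $(wc -c < "$dst") ))
  if [ "$new" -ge "$old" ]; then
    cp "$src" "$dst"   # no gain: fall back to the original bytes
    new=$old
  fi
  echo "$old -> $new bytes"
}
```

Usage: ''shrink_pdf "in.pdf" "out.pdf"'' prints the size before and after.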
@@ Line 101: / Line 119: @@
   ; <color blue>Fonts</color> tab             : ✅ always on, with font subsetting turned on
   ; <color blue>Transparency</color> tab      : 🚫 This causes optimisation to take an extremely long time on some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).
-  ; <color blue>Discard Objects</color> tab   : ✅ Everything on, except **Discard bookmarks** (you never want that) and also **Convert smooth lines to curves** and **Detect and merge image fragments** (these are not needed in 99% of cases and they cause optimisation to take significantly longer, while they usually do not lower the PDF size at all).
+  ; <color blue>Discard Objects</color> tab   : ✅ Everything on, except **Discard bookmarks** (you never want that) and also **Convert smooth lines to curves** and **Detect and merge image fragments** (these are not needed in 99% of cases and they cause optimisation to take significantly longer, while they usually do not lower the PDF size at all). You might also want to **Discard all JavaScript actions**, but if there are any internal links (e.g. a table of contents, references or an index pointing to specific places in the PDF), this breaks them.
   ; <color blue>Discard User Data</color> tab : ✅ Everything on.
   ; <color blue>Clean Up</color> tab          : ✅ Everything on.
-===== Split each page of PDF into several pages (posterisation) =====
+===== - Split each page of PDF into several pages (posterisation) =====
 **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two.
@@ Line 139: / Line 158: @@
 The second step might also be done in Adobe Acrobat "Crop pages" function.
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Crop pages of PDF =====
+===== - Crop pages of PDF =====
 See the MuPDF documentation on [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html#mutool-trim-defined-boxes|different PDF page boxes]] (MediaBox, CropBox, ArtBox, TrimBox, BleedBox).
@@ Line 155: / Line 177: @@
 Note that Acrobat won't let you crop MediaBox, only other boxes (duh!).
   - Go to "Edit PDF" and then "Crop pages" function.
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Remove cropped content from PDF =====
+===== - Remove cropped content from PDF =====
 **Example use-case:** You have cropped some pages but you want to actually remove the content, since otherwise it is only hidden but remains in PDF – this can be seen when you inspect the PDF in Adobe Acrobat via "Edit PDF" and zoom out the page – the cropped content will be selectable, although not visible, since it is out of the page margins.
@@ Line 175: / Line 200: @@
   - Run preflight and the script
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Text in Calibri-created PDF files cannot be copied/searched in Preview =====
+===== - Text in Calibri-created PDF files cannot be copied/searched in Preview =====
 The Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search/copy text from PDFs generated by Calibri. This is because a PDF created this way uses CID fonts, which Preview, being based on Apple's PDFKit, does not support.
@@ Line 197: / Line 225: @@
-===== Decompress the whole PDF for editing in text editor =====
+===== - Decompress the whole PDF for editing in text editor =====
 **Example use-case:** There are some cases when you need to see or edit the actual text contents of the PDF. For example, there are some metadata at the level of individual pages (like ) which no existing program will actually clean (Acrobat "Find hidden information" lists them, but keeps them there when you select to remove them).
@@ Line 211: / Line 239: @@
 </code>
 As the manual mentions, ''[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch1.html#x5-360001.12|-no-preserve-objstm]]'' will remove data from separate object streams and put them back into the normal flow of the PDF, which should make the PDF easier to edit directly.
+<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-no-preserve-objstm'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
 ==== MuPDF (mutool) ====
 [[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
 <code>
-mutool clean -d [-a] "in.pdf" "out.pdf"
+mutool clean -d "in.pdf" "out.pdf"
 </code>
-''mutool clean'' is specifically stated by developer as designated to "make a PDF file human editable" and to "expand compressed streams", so this is the command to go. There are two switches which deal with PDF decompression:
+The developer explicitly describes ''mutool clean'' as designed to "make a PDF file human editable" and to "expand compressed streams", so this is the command to use.
-  * **''-d'' (Decompress streams):** the basic switch for decompressing the whole file,
-  * **''-a'' (ASCII hex encode binary streams):** this safely encodes binary streams so that there should be no problems when editing the PDF in text editor.
+There is one additional switch related to PDF decompression: **''-a''**, which ASCII-hex-encodes binary streams so that the PDF can be edited safely in a text editor. However, it forces almost //**all**// streams in the PDF to be encoded, which makes the whole PDF unreadable for humans, so it is not really helpful.
+Of all tested tools, ''mutool clean'' is the only one which maintains original object reference numbers (e.g. ''/Metadata 22 0 R'' referencing object #22).
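Since the object numbers are preserved, a reference such as ''/Metadata 22 0 R'' can be followed by printing object 22 straight from the decompressed file. This is a rough sketch: ''show_obj'' is a hypothetical helper (not part of mutool), and it assumes that the ''22 0 obj'' header and ''endobj'' each sit on their own line and that your awk copes with binary data.

```shell
# show_obj FILE NUM: print indirect object NUM (generation 0) from an
# uncompressed PDF, from its "NUM 0 obj" header down to "endobj".
# show_obj is a hypothetical helper, not part of mutool.
show_obj() {
  # LC_ALL=C makes awk treat the input as plain bytes.
  LC_ALL=C awk -v start="$2 0 obj" '
    $0 == start                { printing = 1 }  # object header, e.g. "22 0 obj"
    printing                   { print }
    printing && $0 == "endobj" { exit }          # stop after the first match
  ' "$1"
}
```

Usage: ''show_obj "out.pdf" 22''.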
+<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-a'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
 ==== pdfcpu ====
-[[https://pdfcpu.io/core/TODO|pdfcpu TODO]]:
+pdfcpu does not seem to support decompressing PDFs.
-<code>
-pdfcpu TODO "in.pdf" "out.pdf"
-</code>
 ==== PDFtk server (pdftk) ====
@@ Line 232: / Line 263: @@
 pdftk "in.pdf" output "out.pdf" uncompress
 </code>
+PDFtk separates individual PDF dictionary elements by newlines (''0A''). E.g. a page definition looks like this:
+<code>
+<<
+/pdftk_PageNum 7
+/Metadata 23 0 R
+/Rotate 0
+/Resources 24 0 R
+/Type /Page
+/Parent 25 0 R
+/Contents 26 0 R
+/MediaBox [0 0 370.158 591.26]
+/CropBox [0 0 370.158 591.26]
+>>
+</code>
+It also adds its own elements (e.g. ''/pdftk_PageNum'').
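To see which PDFtk-specific elements ended up in the uncompressed file, the keys can be counted with grep. This is a small sketch: ''list_pdftk_keys'' is a hypothetical helper name, and it assumes that every such key starts with ''/pdftk_'', as ''/pdftk_PageNum'' does.

```shell
# list_pdftk_keys FILE: count every /pdftk_* dictionary key found in FILE.
# list_pdftk_keys is a hypothetical helper, not part of PDFtk.
list_pdftk_keys() {
  # -a: treat binary data as text; -o: print only the matching part
  LC_ALL=C grep -a -o '/pdftk_[A-Za-z0-9_]*' "$1" | sort | uniq -c
}
```

Usage: ''list_pdftk_keys "out.pdf"'' prints one line per distinct key, preceded by its count.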
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#option-stream-data|qpdf --stream-data=uncompress]]:
+<code>
+qpdf "in.pdf" --stream-data=uncompress "out_qpdf.pdf"
+</code>
+This is effectively equivalent to using both [[https://qpdf.readthedocs.io/en/stable/cli.html#option-compress-streams|--compress-streams=n]] and [[https://qpdf.readthedocs.io/en/stable/cli.html#option-decode-level|--decode-level=generalized]].
+There is also another switch, ''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-qdf|--qdf]]'', which TODO
+QPDF currently [[https://github.com/qpdf/qpdf/issues/339|does not support]] maintaining the original object ID.
+===== New cases to come… =====
 <html>
+<!-- ———————————————————————————————————————————————————————————————————————————————————————————————
+———————————————————————————————————————————————————————————————————————————————————————————————  -->
 <!-- New Case Template BEGIN
-===== New Case TODO =====
+===== - New Case TODO =====
 **Example use-case:** TODO.
@@ Line 264: / Line 325: @@
 <code>
 pdftk "in.pdf" output "out.pdf" TODO
+</code>
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#TODO|qpdf --TODO]]:
+<code>
+qpdf "in.pdf" --TODO "out.pdf"
 </code>
 --><!-- New Case Template END -->