Differences

This shows the differences between the selected revision and the current version of the page.

--- blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2025/05/08 14:43] – [Case #000: Minimize PDF size] Róbert Toth
+++ blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/01/20 17:42] (current) – [6. Decompress the whole PDF for editing in text editor] Róbert Toth
@@ Line 6: / Line 6: @@
   * …
-TODO
+Due to the nature of the topic, this post is (and probably will remain) a work-in-progress.
 ===== General: Overview of PDF-processing and manipulation tools =====
+==== Coherent PDF (cpdf) ====
+  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo)
+  * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1
+  * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html
 ==== MuPDF (mutool) ====
@@ Line 15: / Line 21: @@
     * https://mupdf.com/releases/history
     * https://github.com/ArtifexSoftware/mupdf/blob/master/CHANGES
-  * **Manual:**      https://mupdf.readthedocs.io/en/latest/
+  * **Manual:**      https://mupdf.readthedocs.io/en/latest/tools/mutool.html
+==== pdfcpu ====
+  * **Download:**    https://github.com/pdfcpu/pdfcpu
+  * **Changelog:**   https://pdfcpu.io/changelog.html
+  * **Manual:**      https://pdfcpu.io/about/command_set
 ==== PDFtk server (pdftk) ====
@@ Line 22: / Line 33: @@
   * **Manual:**      https://www.pdflabs.com/docs/pdftk-man-page/ ([[https://www.pdflabs.com/docs/pdftk-cli-examples/|examples here]])
-==== Coherent PDF (cpdf) ====
+==== QPDF ====
-  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo)
+  * **Download:**    https://github.com/qpdf/qpdf/releases (for macOS using [[https://ports.macports.org/port/qpdf/details/|MacPorts QPDF port]])
-  * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1
+  * **Changelog:**   https://qpdf.readthedocs.io/en/stable/release-notes.html
-  * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html
+  * **Manual:**      https://qpdf.readthedocs.io/en/stable/cli.html
-==== pdfcpu ====
-  * **Download:**    https://github.com/pdfcpu/pdfcpu
-  * **Changelog:**   https://pdfcpu.io/changelog.html
-  * **Manual:**      https://pdfcpu.io/about/command_set
 <html>
@@ Line 41: / Line 47: @@
 </html>
-==== Other (non-tested) tools ====
+==== Other (non-/not-yet-tested) tools ====
-  * **QPDF:**        https://qpdf.sourceforge.io/
+TODO
-  * …
 <html>
 <!--
@@ Line 50: / Line 55: @@
 </html>
-===== Case #000: Minimize PDF size =====
+===== - Minimize PDF size =====
 **Example use-case:** Obvious.
-==== cpdf ====
+==== TL;DR: Summary first ====
+  * **Effectiveness of the tools** (ordered by file size):
+    * <color green>**acrobat optimize**</color> ≪ mutool clean ≪ cpdf squeeze < pdfcpu optimize ≪ <color red>**pdftk compress**</color>
+    * ❶ Acrobat wins (but is paid), ❷ mutool is second & free (but is hard to configure), ❸ cpdf is third but is easiest to use
+    * I have also tested all possible combinations between these tools, and the results are best when Acrobat Optimizer is followed by //yet another// tool. So…
+  * **Effectiveness of the //combined// tools** (ordered by file size):
+    * <color green>**acrobat optimize** + **cpdf squeeze**</color> < acrobat optimize + pdfcpu optimize ≪ <color red>**acrobat optimize** + **mutool clean**</color>
+<color blue/lightgrey>**Conclusion:** Always optimize PDF in Acrobat PDF Optimizer, then squeeze it with cpdf squeeze!</color>
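To check this ranking on your own file, a minimal sketch along these lines may help. It assumes the four command-line tools are on the PATH (missing ones are simply skipped) and reuses the commands from the per-tool sections below. The output names (''out_*.pdf'') and the helpers ''sizes'' and ''try'' are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: run each size-reducing command from this post on one input file
# and list the results from smallest to largest.
# Assumption: output names (out_*.pdf) and helper names are hypothetical.

# Print "<bytes> <file>" for every argument that is an existing file.
sizes() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
        fi
    done
}

# Run a command only if its tool is installed; otherwise report and continue.
try() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipping: $1 not found" >&2
    fi
}

in="${1:-in.pdf}"
if [ -f "$in" ]; then
    try cpdf -squeeze "$in" -o "out_cpdf.pdf"
    try pdfcpu optimize "$in" "out_pdfcpu.pdf"
    try mutool clean -gggg -l -d -z -s "$in" "out_mutool.pdf"
    try pdftk "$in" output "out_pdftk.pdf" compress
    sizes "$in" out_cpdf.pdf out_pdfcpu.pdf out_mutool.pdf out_pdftk.pdf | sort -n
fi
```

The Acrobat step cannot be scripted this way, so to test the combined variants you would feed the Acrobat-optimized file in as the input.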
+==== Coherent PDF (cpdf) ====
 [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-660005.3|cpdf -squeeze]]:
 <code>
-cpdf -squeeze "src.pdf" [-squeeze-no-recompress] -o "dst.pdf"
+cpdf -squeeze "in.pdf" [-squeeze-no-recompress] -o "out.pdf"
 </code>
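In case a squeezed file does not come out smaller than its source, a small guard can pick the better of the two. This is only a sketch: ''keep_if_smaller'' is a hypothetical helper and not part of cpdf, and the file names are the ones used in the command above.

```shell
#!/bin/sh
# Sketch: pick whichever of two files is smaller.
# Assumption: keep_if_smaller is a hypothetical helper, not part of cpdf.

# keep_if_smaller <original> <candidate>
# Prints the path of the smaller file; ties go to the original.
keep_if_smaller() {
    orig_size=$(wc -c < "$1" | tr -d ' ')
    new_size=$(wc -c < "$2" | tr -d ' ')
    if [ "$new_size" -lt "$orig_size" ]; then
        printf '%s\n' "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Usage, only when both files from the command above exist:
if [ -f "in.pdf" ] && [ -f "out.pdf" ]; then
    echo "smaller file: $(keep_if_smaller "in.pdf" "out.pdf")"
fi
```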
@@ Line 62: / Line 77: @@
 [[https://pdfcpu.io/core/optimize.html|pdfcpu optimize]]:
 <code>
-pdfcpu optimize "src.pdf" "dst.pdf"
+pdfcpu optimize "in.pdf" "out.pdf"
 </code>
@@ Line 68: / Line 83: @@
 [[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
 <code>
-mutool clean -gggg -l -d -z -s "src.pdf" "dst.pdf"
+mutool clean -gggg -l -d -z -s "in.pdf" "out.pdf"
 </code>
 ''mutool clean'' has many options and it takes some experimentation to see what actually shrinks the PDF size:
@@ Line 81: / Line 96: @@
   * **''-AA'' (Recreate appearance streams for annotations):** no effect for me (and not sure what it really does – at least while there are no annotations, it does not affect PDF file size at all)
+==== PDFtk server (pdftk) ====
+[[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk compress]] is not suitable for the task of really shrinking the PDF size to minimum, as the developer himself claims:
+<code>
+pdftk "in.pdf" output "out.pdf" compress
+</code>
+==== QPDF ====
+N/A (not tested yet TODO)
+==== Adobe Acrobat ====
+Acrobat [[https://helpx.adobe.com/acrobat/using/optimizing-pdfs-acrobat-pro.html#pdf_optimizer_options_acrobat_pro|PDF Optimizer]] has tons of options. After thorough testing, the best settings usually seem to be these:
+  ; <color blue>Images</color> tab            : ⚠️ This is a per-PDF setting – it depends on your particular use-case and file. My default is to have it turned off.
+  ; <color blue>Fonts</color> tab             : ✅ always on, with font subsetting turned on
+  ; <color blue>Transparency</color> tab      : 🚫 This causes optimisation to take an extremely long time in some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).
+  ; <color blue>Discard Objects</color> tab   : ✅ Everything on, except **Discard bookmarks** (you never want that), **Convert smooth lines to curves** and **Detect and merge image fragments** (these two are not needed in 99% of cases and make optimisation take significantly longer, while they usually do not lower the PDF size at all). You might also want to **Discard all JavaScript actions**, but if there are any internal links (e.g. a table of contents, references or an index pointing to specific places in the PDF), this breaks them.
+  ; <color blue>Discard User Data</color> tab : ✅ Everything on.
+  ; <color blue>Clean Up</color> tab          : ✅ Everything on.
-===== Case #001: Split each page of PDF into several pages (posterisation) =====
+===== - Split each page of PDF into several pages (posterisation) =====
 **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two.
+==== TL;DR: Summary first ====
+<color blue/lightgray>**Conclusion:** Use cpdf -chop. MuPDF has some problems and combining pdftk with manual cropping in Adobe Acrobat is tedious.</color>
 ==== MuPDF (mutool) ====
 [[https://mupdf.readthedocs.io/en/latest/mutool-poster.html|mutool poster]]:
 <code>
-mutool poster -x 2 "src.pdf" "dst.pdf"
+mutool poster -x 2 "in.pdf" "out.pdf"
 </code>
+Note that ''mutool poster'' sometimes sets //both// mediabox and cropbox – the latter seems unnecessary. More importantly, ''mutool poster'' causes problems when ''mutool trim'' is used afterwards – for some reason, it leaves the PDF completely empty.  This does not happen when ''mutool trim'' is used after ''cpdf -chop''.
 ==== Coherent PDF (cpdf) ====
 [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch9.html#x13-940009.4|cpdf -chop]]:
 <code>
-cpdf -chop "1 2" "src.pdf" -o "dst.pdf"
+cpdf -chop "2 1" "in.pdf" -o "out.pdf"
+</code>
+Unlike MuPDF, ''cpdf -chop'' only sets mediabox, which is enough.
+Moreover, ''cpdf'' can also remove page labels, if needed:
+<code>
+cpdf -chop "2 1" "in.pdf" AND -remove-page-labels -o "outNoLabels.pdf"
 </code>
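For a whole folder of scans, the same command can be wrapped in a loop. This is a minimal sketch: the function name ''chop_all'' and the directory names are hypothetical, and the ''"2 1"'' grid is simply the one used above.

```shell
#!/bin/sh
# Sketch: apply the cpdf -chop command above to every PDF in a directory.
# Assumption: chop_all and the directory layout are hypothetical.

# chop_all <source-dir> <target-dir>
chop_all() {
    mkdir -p "$2" || return 1
    for f in "$1"/*.pdf; do
        [ -f "$f" ] || continue          # no match: the glob stays literal
        cpdf -chop "2 1" "$f" -o "$2/$(basename "$f")"
    done
}

# Usage: sh chop_all.sh scans chopped
if [ "$#" -eq 2 ]; then
    chop_all "$1" "$2"
fi
```

Writing into a separate target directory keeps the original scans untouched, which makes it easy to rerun with a different grid.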
@@ Line 103: / Line 143: @@
 [[https://www.pdflabs.com/docs/pdftk-man-page/#dest-op-shuffle|pdftk shuffle]] && [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch3.html#x7-570003.6|cpdf -mediabox]]:
 <code>
-pdftk A=src.pdf shuffle A A output dst.pdf
+pdftk A=in.pdf shuffle A A output shuffled.pdf
-cpdf -mediabox "0mm 0mm a5landscape" "src.pdf" odd -o "srcOdd.pdf"
+cpdf -mediabox "0mm 0mm a5landscape" "shuffled.pdf" odd -o "shuffledOdd.pdf"
-cpdf -mediabox "148.5mm 0mm a5landscape" "srcOdd.pdf" even -o "dst.pdf"
+cpdf -mediabox "148.5mm 0mm a5landscape" "shuffledOdd.pdf" even -o "out.pdf"
 </code>
 The second step might also be done in Adobe Acrobat "Crop pages" function.
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Case #002: Crop pages of PDF =====
+===== - Crop pages of PDF =====
 See MuPDF documentation on [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html#mutool-trim-defined-boxes|different PDF page boxes]] ([media|crop|art|trim|bleed]box).
@@ Line 126: / Line 169: @@
   - Go to "Edit PDF" and then "Crop pages" function.
+==== QPDF ====
+N/A (not tested yet TODO)
-===== Case #003: Remove cropped content from PDF =====
+===== - Remove cropped content from PDF =====
 **Example use-case:** You have cropped some pages but you want to actually remove the content, since otherwise it is only hidden but remains in PDF – this can be seen when you inspect the PDF in Adobe Acrobat via "Edit PDF" and zoom out the page – the cropped content will be selectable, although not visible, since it is out of the page margins.
-According to author, it seems that [[https://github.com/johnwhitington/cpdf-source/issues/116|cpdf is not going to support this feature]].
+According to its author, it seems that [[https://github.com/johnwhitington/cpdf-source/issues/116|cpdf is not going to support this feature]].
 ==== MuPDF (mutool) ====
 [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html|mutool trim]]:
 <code>
-mutool trim -o "dst.pdf" -b cropbox "src.pdf"
+mutool trim -o "out.pdf" -b cropbox "in.pdf"
 </code>
+<color red>**Warning:**</color> In some PDFs, ''mutool trim'' also cut out content that was actually visible on the page (that is, inside both mediabox and cropbox). I don't know whether the PDF itself was malformed or whether this is a problem in mutool, but to be safe, it is better to avoid it and use the Acrobat method below.
 ==== Adobe Acrobat ====
@@ Line 143: / Line 191: @@
   - Run preflight and the script
+==== QPDF ====
+N/A (not tested yet TODO)
+===== - Text in Calibre-created PDF files cannot be copied/searched in Preview =====
+The Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search or copy text from PDFs generated by Calibre. This is because PDFs created this way use CID fonts, which Apple's PDFKit (and therefore Preview) does not support.
+This leads to a situation where **such PDF files can be properly shown and visually read, but trying to select (and copy) or search the text inside the PDF fails**. See these sources for reference:
+  * [[https://superuser.com/q/884060/601266]]
+  * [[https://www.mobileread.com/forums/showthread.php?t=220576]]
+==== Fixing the problem with GhostScript ====
+This can be fixed by converting/running the PDF file through the following GhostScript command:
+<code>
+gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH \
+-dPrinted=false -dQUIET -sOutputFile="out.pdf" "in.pdf"
+</code>
+This was mentioned [[https://discourse.devontechnologies.com/t/pdfs-created-in-calibre-have-no-text-in-dt3/47874|here]].
+GhostScript can be installed e.g. via MacPorts:
+<code>
+sudo port install ghostscript
+</code>
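If several files need the same treatment, the GhostScript command above can be wrapped in a function. This is a sketch under assumptions: the name ''fix_pdf_text'' and the ''_fixed'' suffix are hypothetical, while the flags are exactly the ones shown above.

```shell
#!/bin/sh
# Sketch: wrap the GhostScript command above in a reusable function.
# Assumption: the name fix_pdf_text and the "_fixed" suffix are hypothetical.

# fix_pdf_text <in.pdf> <out.pdf>
fix_pdf_text() {
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
       -dNOPAUSE -dBATCH -dPrinted=false -dQUIET \
       -sOutputFile="$2" "$1"
}

# Convert every file given on the command line: book.pdf -> book_fixed.pdf
for f in "$@"; do
    fix_pdf_text "$f" "${f%.pdf}_fixed.pdf"
done
```

Keep in mind that the ''/ebook'' preset also downsamples images, so it is worth comparing the result with the original before discarding it.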
+===== - Decompress the whole PDF for editing in text editor =====
+**Example use-case:** Sometimes you need to see or edit the actual text contents of a PDF. For example, there is some metadata at the level of individual pages (like ) which no existing program will actually clean (Acrobat's "Find hidden information" lists it, but keeps it there when you choose to remove it).
+So directly editing the text contents of the PDF might come in handy.
+==== TL;DR: Summary first ====
+<color blue/lightgrey>**Conclusion:** TODO</color>
+==== Coherent PDF (cpdf) ====
+[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-640005.1|cpdf -decompress]]:
+<code>
+cpdf -decompress [-no-preserve-objstm] "in.pdf" -o "out.pdf"
+</code>
+As the manual mentions, ''[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch1.html#x5-360001.12|-no-preserve-objstm]]'' removes data from separate object streams and puts it back into the normal flow of the PDF, which should make the PDF easier to edit directly.
+<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-no-preserve-objstm'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
+==== MuPDF (mutool) ====
+[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
+<code>
+mutool clean -d "in.pdf" "out.pdf"
+</code>
+The developer specifically describes ''mutool clean'' as intended to "make a PDF file human editable" and to "expand compressed streams", so this is the command to use.
+There is one additional switch related to PDF decompression: **''-a''**, which ASCII-hex-encodes binary streams. This safely encodes binary streams so that there should be no problems when editing the PDF in a text editor. However, it forces almost //**all**// streams in the PDF to be encoded, which makes the whole PDF human-unreadable, so it is not really helpful.
+Of all tested tools, ''mutool clean'' is the only one which maintains original object reference numbers (e.g. ''/Metadata 22 0 R'' referencing object #22).
+<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-a'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
+==== pdfcpu ====
+pdfcpu does not seem to support decompressing PDFs.
+==== PDFtk server (pdftk) ====
+[[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk uncompress]]:
+<code>
+pdftk "in.pdf" output "out.pdf" uncompress
+</code>
+PDFtk separates individual PDF dictionary elements by newlines (''0A''). E.g. a page definition looks like this:
+<code>
+<<
+/pdftk_PageNum 7
+/Metadata 23 0 R
+/Rotate 0
+/Resources 24 0 R
+/Type /Page
+/Parent 25 0 R
+/Contents 26 0 R
+/MediaBox [0 0 370.158 591.26]
+/CropBox [0 0 370.158 591.26]
+>>
+</code>
+It also adds its own elements (e.g. ''/pdftk_PageNum'').
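Because each dictionary entry sits on its own line, ordinary line-oriented tools can locate entries before you open the file in an editor. The sketch below only reads the file. ''list_refs'' is a hypothetical helper, and it assumes ''out.pdf'' is the result of the ''pdftk ... uncompress'' command above.

```shell
#!/bin/sh
# Sketch: list indirect references of one key in an uncompressed PDF.
# Assumption: list_refs is a hypothetical helper; the input is the output
# of the pdftk uncompress command above (one dictionary entry per line).

# list_refs <key> <file>   e.g. list_refs Metadata out.pdf
# Prints "<line number>:/<key> <obj> <gen> R" for every matching line.
list_refs() {
    # -a treats the file as text even though it may contain binary streams
    grep -a -n "^/$1 [0-9][0-9]* [0-9][0-9]* R" "$2"
}

if [ -f "out.pdf" ]; then
    list_refs Metadata "out.pdf" || echo "no /Metadata references found"
fi
```

The printed line numbers can then be used to jump straight to the relevant page dictionaries in a text editor.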
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#option-stream-data|qpdf --stream-data=uncompress]]:
+<code>
+qpdf "in.pdf" --stream-data=uncompress "out_qpdf.pdf"
+</code>
+This is effectively equivalent to using both [[https://qpdf.readthedocs.io/en/stable/cli.html#option-compress-streams|--compress-streams=n]] and [[https://qpdf.readthedocs.io/en/stable/cli.html#option-decode-level|--decode-level=generalized]].
+There is also another switch, ''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-qdf|--qdf]]'', which TODO
+QPDF currently [[https://github.com/qpdf/qpdf/issues/339|does not support]] maintaining the original object ID.
+===== New cases to come… =====
 <html>
-<!--
+<!-- ———————————————————————————————————————————————————————————————————————————————————————————————
-===== Case #TODO: TODO =====
+———————————————————————————————————————————————————————————————————————————————————————————————  -->
+<!-- New Case Template BEGIN
+===== - New Case TODO =====
 **Example use-case:** TODO.
-==== TODO ====
+==== TL;DR: Summary first ====
+<color blue/lightgrey>**Conclusion:** TODO</color>
+==== Coherent PDF (cpdf) ====
+[[https://www.coherentpdf.com/cpdfmanual/TODO|cpdf -TODO]]:
 <code>
-TODO
+cpdf -TODO "in.pdf" -o "out.pdf"
 </code>
--->
+==== MuPDF (mutool) ====
+[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool TODO]]:
+<code>
+mutool TODO "in.pdf" "out.pdf"
+</code>
+==== pdfcpu ====
+[[https://pdfcpu.io/core/TODO|pdfcpu TODO]]:
+<code>
+pdfcpu TODO "in.pdf" "out.pdf"
+</code>
+==== PDFtk server (pdftk) ====
+[[https://www.pdflabs.com/docs/pdftk-man-page/#TODO|pdftk TODO]]:
+<code>
+pdftk "in.pdf" output "out.pdf" TODO
+</code>
+==== QPDF ====
+[[https://qpdf.readthedocs.io/en/stable/cli.html#TODO|qpdf --TODO]]:
+<code>
+qpdf "in.pdf" --TODO "out.pdf"
+</code>
+--><!-- New Case Template END -->
 </html>