blog:odborny:2025-05-07-command-line_tools_for_pdf_processing

Differences

This page shows the differences between the selected revision and the current revision of the page.

blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/01/08 21:30] – Case #004: Text in Calibri-created PDF files cannot be copied/searched in Preview Róbert Toth
blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/01/20 17:42] (current) – [6. Decompress the whole PDF for editing in text editor] Róbert Toth
Line 6: Line 6:
   * …   * …
  
-TODO+Due to the nature of the topic, this post is (and probably will remain) a work-in-progress. 
  
 ===== General: Overview of PDF-processing and manipulation tools ===== ===== General: Overview of PDF-processing and manipulation tools =====
 +
 +==== Coherent PDF (cpdf) ====
 +  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo)
 +  * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1
 +  * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html
  
 ==== MuPDF (mutool) ==== ==== MuPDF (mutool) ====
Line 15: Line 21:
     * https://mupdf.com/releases/history     * https://mupdf.com/releases/history
     * https://github.com/ArtifexSoftware/mupdf/blob/master/CHANGES     * https://github.com/ArtifexSoftware/mupdf/blob/master/CHANGES
-  * **Manual:**      https://mupdf.readthedocs.io/en/latest/+  * **Manual:**      https://mupdf.readthedocs.io/en/latest/tools/mutool.html 
 + 
 +==== pdfcpu ==== 
 +  * **Download:**    https://github.com/pdfcpu/pdfcpu 
 +  * **Changelog:**   https://pdfcpu.io/changelog.html 
 +  * **Manual:**      https://pdfcpu.io/about/command_set
  
 ==== PDFtk server (pdftk) ==== ==== PDFtk server (pdftk) ====
Line 22: Line 33:
   * **Manual:**      https://www.pdflabs.com/docs/pdftk-man-page/ ([[https://www.pdflabs.com/docs/pdftk-cli-examples/|examples here]])   * **Manual:**      https://www.pdflabs.com/docs/pdftk-man-page/ ([[https://www.pdflabs.com/docs/pdftk-cli-examples/|examples here]])
  
-==== Coherent PDF (cpdf) ==== +==== QPDF ==== 
-  * **Download:**    https://www.coherentpdf.com/eval.html (fully functional demo) +  * **Download:**    https://github.com/qpdf/qpdf/releases (for macOS using [[https://ports.macports.org/port/qpdf/details/|MacPorts QPDF port]]) 
-  * **Changelog:**   https://www.coherentpdf.com/cpdfmanual/cpdfmanualap2.html#x26-177000B.1 +  * **Changelog:**   https://qpdf.readthedocs.io/en/stable/release-notes.html 
-  * **Manual:**      https://www.coherentpdf.com/cpdfmanual/cpdfmanual.html +  * **Manual:**      https://qpdf.readthedocs.io/en/stable/cli.html
- +
-==== pdfcpu ==== +
-  * **Download:**    https://github.com/pdfcpu/pdfcpu +
-  * **Changelog:**   https://pdfcpu.io/changelog.html +
-  * **Manual:**      https://pdfcpu.io/about/command_set+
  
 <html> <html>
Line 41: Line 47:
 </html> </html>
  
-==== Other (non-tested) tools ==== +==== Other (non-/not-yet-tested) tools ==== 
-  * **QPDF:**        https://qpdf.sourceforge.io/ +TODO
-  * …+
 <html> <html>
 <!-- <!--
Line 50: Line 55:
 </html> </html>
  
-===== Case #000: Minimize PDF size =====+ 
 +===== Minimize PDF size =====
 **Example use-case:** Obvious. **Example use-case:** Obvious.
  
Line 65: Line 71:
 [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-660005.3|cpdf -squeeze]]: [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-660005.3|cpdf -squeeze]]:
 <code> <code>
-cpdf -squeeze "src.pdf" [-squeeze-no-recompress] -o "dst.pdf"+cpdf -squeeze "in.pdf" [-squeeze-no-recompress] -o "out.pdf"
 </code> </code>
  
Line 71: Line 77:
 [[https://pdfcpu.io/core/optimize.html|pdfcpu optimize]]: [[https://pdfcpu.io/core/optimize.html|pdfcpu optimize]]:
 <code> <code>
-pdfcpu optimize "src.pdf" "dst.pdf"+pdfcpu optimize "in.pdf" "out.pdf"
 </code> </code>
  
Line 77: Line 83:
 [[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]: [[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
 <code> <code>
-mutool clean -gggg -l -d -z -s "src.pdf" "dst.pdf"+mutool clean -gggg -l -d -z -s "in.pdf" "out.pdf"
 </code> </code>
 ''mutool clean'' has many options and it takes some experimentation to see what actually shrinks the PDF size: ''mutool clean'' has many options and it takes some experimentation to see what actually shrinks the PDF size:
Line 93: Line 99:
 [[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk compress]] is not suitable for the task of really shrinking the PDF size to minimum, as the developer himself claims: [[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk compress]] is not suitable for the task of really shrinking the PDF size to minimum, as the developer himself claims:
 <code> <code>
-pdftk "src.pdf" output "dst.pdf" compress+pdftk "in.pdf" output "out.pdf" compress
 </code> </code>
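Since the effect of each tool depends heavily on the input file, it may help to compare the results side by side. The following is only a sketch: ''size_report'' is a hypothetical helper name (not part of any of the tools above), and the ''out_*.pdf'' names in the usage line are placeholders for files produced by the commands above.

```shell
# Hypothetical helper: print each file's size in bytes and as a
# percentage of the first file (the unprocessed original).
size_report() {
  base=$(wc -c < "$1" | tr -d ' ')
  if [ "$base" -eq 0 ]; then
    echo "size_report: $1 is empty" >&2
    return 1
  fi
  for f in "$@"; do
    size=$(wc -c < "$f" | tr -d ' ')
    printf '%s\t%s bytes\t%s%%\n' "$f" "$size" "$(( size * 100 / base ))"
  done
}

# Usage (placeholder file names):
#   size_report "in.pdf" "out_cpdf.pdf" "out_pdfcpu.pdf" "out_mutool.pdf"
```

The percentage uses integer arithmetic, which should be precise enough to see which option actually makes a difference.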
 +
 +==== QPDF ====
 +N/A (not tested yet TODO)
  
 ==== Adobe Acrobat ==== ==== Adobe Acrobat ====
Line 101: Line 110:
   ; <color blue>Fonts</color> tab             : ✅ always on, with font subsetting turned on   ; <color blue>Fonts</color> tab             : ✅ always on, with font subsetting turned on
   ; <color blue>Transparency</color> tab      : 🚫 This causes optimisation to take an extremely long time in some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).   ; <color blue>Transparency</color> tab      : 🚫 This causes optimisation to take an extremely long time in some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).
-  ; <color blue>Discard Objects</color> tab   : ✅ Everything on, except **Discard bookmarks** (you never want that) and also **Convert smooth lines to curves** and **Detect and merge image fragments** (these are not needed in 99% of cases and they cause optimisation to take significantly longer, while they usually do not lower the PDF size at all).+  ; <color blue>Discard Objects</color> tab   : ✅ Everything on, except **Discard bookmarks** (you never want that) and also **Convert smooth lines to curves** and **Detect and merge image fragments** (these are not needed in 99% of cases and they cause optimisation to take significantly longer, while they usually do not lower the PDF size at all). You might also want to **Discard all Javascript actions**, but if the PDF has any internal links (e.g. a table of contents, references or an index pointing to specific places in the PDF), this option breaks them.
   ; <color blue>Discard User Data</color> tab : ✅ Everything on.   ; <color blue>Discard User Data</color> tab : ✅ Everything on.
   ; <color blue>Clean Up</color> tab          : ✅ Everything on.   ; <color blue>Clean Up</color> tab          : ✅ Everything on.
  
-===== Case #001: Split each page of PDF into several pages (posterisation) =====+ 
 +===== Split each page of PDF into several pages (posterisation) =====
 **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two. **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two.
  
Line 114: Line 124:
 [[https://mupdf.readthedocs.io/en/latest/mutool-poster.html|mutool poster]]: [[https://mupdf.readthedocs.io/en/latest/mutool-poster.html|mutool poster]]:
 <code> <code>
-mutool poster -x 2 "src.pdf" "dst.pdf"+mutool poster -x 2 "in.pdf" "out.pdf"
 </code> </code>
 Note that ''mutool poster'' sometimes sets //both// mediabox and cropbox – the latter seems unnecessary. More importantly, ''mutool poster'' causes problems when ''mutool trim'' is used afterwards – for some reason, it leaves the PDF completely empty.  This does not happen when ''mutool trim'' is used after ''cpdf -chop''. Note that ''mutool poster'' sometimes sets //both// mediabox and cropbox – the latter seems unnecessary. More importantly, ''mutool poster'' causes problems when ''mutool trim'' is used afterwards – for some reason, it leaves the PDF completely empty.  This does not happen when ''mutool trim'' is used after ''cpdf -chop''.
Line 121: Line 131:
 [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch9.html#x13-940009.4|cpdf -chop]]: [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch9.html#x13-940009.4|cpdf -chop]]:
 <code> <code>
-cpdf -chop "2 1" "src.pdf" -o "dst.pdf"+cpdf -chop "2 1" "in.pdf" -o "out.pdf"
 </code> </code>
 Unlike MuPDF, ''cpdf -chop'' only sets mediabox, which is enough. Unlike MuPDF, ''cpdf -chop'' only sets mediabox, which is enough.
Line 127: Line 137:
 Moreover, ''cpdf'' can also remove page labels, if needed: Moreover, ''cpdf'' can also remove page labels, if needed:
 <code> <code>
-cpdf -chop "2 1" "src.pdf" AND -remove-page-labels -o "dstNoLabels.pdf"+cpdf -chop "2 1" "in.pdf" AND -remove-page-labels -o "dstNoLabels.pdf"
 </code> </code>
  
Line 133: Line 143:
 [[https://www.pdflabs.com/docs/pdftk-man-page/#dest-op-shuffle|pdftk shuffle]] && [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch3.html#x7-570003.6|cpdf -mediabox]]: [[https://www.pdflabs.com/docs/pdftk-man-page/#dest-op-shuffle|pdftk shuffle]] && [[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch3.html#x7-570003.6|cpdf -mediabox]]:
 <code> <code>
-pdftk A=src.pdf shuffle A A output dst.pdf +pdftk A=in.pdf shuffle A A output out.pdf 
-cpdf -mediabox "0mm 0mm a5landscape" "src.pdf" odd -o "srcOdd.pdf" +cpdf -mediabox "0mm 0mm a5landscape" "in.pdf" odd -o "srcOdd.pdf" 
-cpdf -mediabox "148.5mm 0mm a5landscape" "srcOdd.pdf" even -o "dst.pdf"+cpdf -mediabox "148.5mm 0mm a5landscape" "srcOdd.pdf" even -o "out.pdf"
 </code> </code>
 The second step might also be done in Adobe Acrobat "Crop pages" function. The second step might also be done in Adobe Acrobat "Crop pages" function.
  
 +==== QPDF ====
 +N/A (not tested yet TODO)
  
-===== Case #002: Crop pages of PDF =====+ 
 +===== Crop pages of PDF =====
 See MuPDF documentation on [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html#mutool-trim-defined-boxes|different PDF page boxes]] ([media|crop|art|trim|bleed]box). See MuPDF documentation on [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html#mutool-trim-defined-boxes|different PDF page boxes]] ([media|crop|art|trim|bleed]box).
  
Line 155: Line 168:
 Note that Acrobat won't let you crop MediaBox, only other boxes (duh!). Note that Acrobat won't let you crop MediaBox, only other boxes (duh!).
   - Go to "Edit PDF" and then "Crop pages" function.   - Go to "Edit PDF" and then "Crop pages" function.
 +
 +==== QPDF ====
 +N/A (not tested yet TODO)
  
  
-===== Case #003: Remove cropped content from PDF =====+===== Remove cropped content from PDF =====
 **Example use-case:** You have cropped some pages but you want to actually remove the content, since otherwise it is only hidden but remains in PDF – this can be seen when you inspect the PDF in Adobe Acrobat via "Edit PDF" and zoom out the page – the cropped content will be selectable, although not visible, since it is out of the page margins. **Example use-case:** You have cropped some pages but you want to actually remove the content, since otherwise it is only hidden but remains in PDF – this can be seen when you inspect the PDF in Adobe Acrobat via "Edit PDF" and zoom out the page – the cropped content will be selectable, although not visible, since it is out of the page margins.
  
Line 165: Line 181:
 [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html|mutool trim]]: [[https://mupdf.readthedocs.io/en/latest/mutool-trim.html|mutool trim]]:
 <code> <code>
-mutool trim -o "dst.pdf" -b cropbox "src.pdf"+mutool trim -o "out.pdf" -b cropbox "in.pdf"
 </code> </code>
  
Line 175: Line 191:
   - Run preflight and the script   - Run preflight and the script
  
 +==== QPDF ====
 +N/A (not tested yet TODO)
  
-===== Case #004: Text in Calibri-created PDF files cannot be copied/searched in Preview =====+ 
 +===== Text in Calibri-created PDF files cannot be copied/searched in Preview =====
 Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search or copy text in PDFs generated by Calibri. This is because PDFs created this way use CID fonts, which Apple's PDFKit (the framework Preview is built on) does not support. Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search or copy text in PDFs generated by Calibri. This is because PDFs created this way use CID fonts, which Apple's PDFKit (the framework Preview is built on) does not support.
  
Line 196: Line 215:
 </code> </code>
  
 +
 +===== - Decompress the whole PDF for editing in text editor =====
 +**Example use-case:** Sometimes you need to see or edit the raw contents of a PDF. For example, there is some metadata at the level of individual pages (like ) which no existing program will actually clean (Acrobat's "Find hidden information" lists it, but keeps it there when you choose to remove it).
 +
 +So editing the contents of the PDF directly in a text editor might come in handy.
 +
 +==== TL;DR: Summary first ====
 +<color blue/lightgrey>**Conclusion:** TODO</color>
 +
 +==== Coherent PDF (cpdf) ====
 +[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch5.html#x9-640005.1|cpdf -decompress]]:
 +<code>
 +cpdf -decompress [-no-preserve-objstm] "in.pdf" -o "out.pdf"
 +</code>
 +As the manual mentions, ''[[https://www.coherentpdf.com/cpdfmanual/cpdfmanualch1.html#x5-360001.12|-no-preserve-objstm]]'' removes data from separate object streams and puts it back into the normal flow of the PDF, which should make the PDF easier to edit directly.
 +
 +<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-no-preserve-objstm'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
 +
 +==== MuPDF (mutool) ====
 +[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool clean]]:
 +<code>
 +mutool clean -d "in.pdf" "out.pdf"
 +</code>
 +The developer specifically describes ''mutool clean'' as a way to "make a PDF file human editable" and to "expand compressed streams", so this is the command to use.
 +
 +There is one additional switch related to PDF decompression: **''-a''**, which ASCII-hex-encodes binary streams. This encodes binary streams safely, so there should be no problems when editing the PDF in a text editor. However, it forces almost //**all**// streams in the PDF to be encoded, which makes the whole PDF unreadable for humans, so it is not really helpful.
 +
 +Of all tested tools, ''mutool clean'' is the only one which maintains original object reference numbers (e.g. ''/Metadata 22 0 R'' referencing object #22).
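To check which object numbers a given key points to before and after processing, the decompressed file can be searched as text. This is a minimal sketch: ''list_refs'' is a hypothetical helper name, and it assumes the reference is written in the usual ''/Key N G R'' form on a single line.

```shell
# Hypothetical helper: list every "/Key N G R" indirect reference in a
# decompressed PDF, prefixed with the line number where it occurs.
# -a makes grep treat the file as text even if binary streams remain.
list_refs() {
  key=$1
  file=$2
  LC_ALL=C grep -a -n -o "/$key [0-9][0-9]* [0-9][0-9]* R" "$file"
}

# Usage:
#   list_refs Metadata "out.pdf"
```

Running it on both the original decompressed file and the processed one shows whether the numbers were kept.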
 +
 +<color red>**Warning:**</color> After processing a PDF using this command (with or without the ''-a'' switch), I have not been able to use Adobe Acrobat's "Optimize" or "Compare documents" function on it ever again – no matter how much tinkering, document-processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).
 +
 +==== pdfcpu ====
 +pdfcpu does not seem to support decompressing PDFs.
 +
 +==== PDFtk server (pdftk) ====
 +[[https://www.pdflabs.com/docs/pdftk-man-page/#dest-compress|pdftk uncompress]]:
 +<code>
 +pdftk "in.pdf" output "out.pdf" uncompress
 +</code>
 +PDFtk separates individual PDF dictionary elements by newlines (''0A''). E.g. a page definition looks like this:
 +<code>
 +<<
 +/pdftk_PageNum 7
 +/Metadata 23 0 R
 +/Rotate 0
 +/Resources 24 0 R
 +/Type /Page
 +/Parent 25 0 R
 +/Contents 26 0 R
 +/MediaBox [0 0 370.158 591.26]
 +/CropBox [0 0 370.158 591.26]
 +>>
 +</code>
 +It also adds its own elements (e.g. ''/pdftk_PageNum'').
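Because every dictionary entry sits on its own line, the uncompressed file is easy to process with line-oriented tools. As a rough sketch (''pages_with_metadata'' is a hypothetical helper name, and the script assumes the exact layout of the sample above: ''<<'' and ''>>'' alone on a line, one entry per line, no nested dictionaries), this lists which pages carry a ''/Metadata'' reference:

```shell
# Hypothetical helper: for a file produced by `pdftk ... uncompress`, print
# "page <n>: <metadata reference>" for every dictionary that contains both
# a /pdftk_PageNum entry and a /Metadata entry.
pages_with_metadata() {
  LC_ALL=C awk '
    /^<<$/              { page = ""; meta = "" }
    /^\/pdftk_PageNum / { page = $2 }
    /^\/Metadata /      { meta = $2 " " $3 " " $4 }
    /^>>$/              {
      if (page != "" && meta != "") print "page " page ": " meta
      page = ""; meta = ""
    }
  ' "$1"
}

# Usage:
#   pages_with_metadata "out.pdf"
```

For the page definition shown above, this would report page 7 with the reference ''23 0 R''.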
 +
 +
 +==== QPDF ====
 +[[https://qpdf.readthedocs.io/en/stable/cli.html#option-stream-data|qpdf --stream-data=uncompress]]:
 +<code>
 +qpdf "in.pdf" --stream-data=uncompress "out_qpdf.pdf"
 +</code>
 +This is effectively equivalent to using both [[https://qpdf.readthedocs.io/en/stable/cli.html#option-compress-streams|--compress-streams=n]] and [[https://qpdf.readthedocs.io/en/stable/cli.html#option-decode-level|--decode-level=generalized]].
 +
 +There is also another switch, ''[[https://qpdf.readthedocs.io/en/stable/cli.html#option-qdf|--qdf]]'', which TODO
 +
 +QPDF currently [[https://github.com/qpdf/qpdf/issues/339|does not support]] maintaining the original object ID.
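For completeness, a decompress / recompress round trip with QPDF could look roughly like this. It is an untested sketch: the function names and the ''edited.pdf'' / ''final.pdf'' file names are made up for illustration, and whether QPDF accepts a hand-edited file without complaints will depend on the edit.

```shell
# Hypothetical wrappers around the QPDF --stream-data option.
qpdf_decompress() {
  # $1 = input PDF, $2 = output PDF with uncompressed streams
  qpdf "$1" --stream-data=uncompress "$2"
}

qpdf_recompress() {
  # $1 = (edited) uncompressed PDF, $2 = output PDF with compressed streams
  qpdf "$1" --stream-data=compress "$2"
}

# Usage:
#   qpdf_decompress "in.pdf" "out_qpdf.pdf"
#   ... edit a copy of out_qpdf.pdf, saved as edited.pdf ...
#   qpdf_recompress "edited.pdf" "final.pdf"
```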
 +
 +
 +===== New cases to come… =====
 <html> <html>
-<!-- +<!-- ——————————————————————————————————————————————————————————————————————————————————————————————— 
-===== Case #TODO: TODO =====+———————————————————————————————————————————————————————————————————————————————————————————————  --> 
 +<!-- New Case Template BEGIN 
 +===== - New Case TODO =====
 **Example use-case:** TODO. **Example use-case:** TODO.
  
-==== TODO ====+==== TL;DR: Summary first ==== 
 +<color blue/lightgrey>**Conclusion:** TODO</color> 
 + 
 +==== Coherent PDF (cpdf) ==== 
 +[[https://www.coherentpdf.com/cpdfmanual/TODO|cpdf -TODO]]:
 <code> <code>
-TODO+cpdf -TODO "in.pdf" -o "out.pdf"
 </code> </code>
--->+ 
 +==== MuPDF (mutool) ==== 
 +[[https://mupdf.readthedocs.io/en/1.24.0/mutool-clean.html|mutool TODO]]: 
 +<code> 
 +mutool TODO "in.pdf" "out.pdf" 
 +</code> 
 + 
 +==== pdfcpu ==== 
 +[[https://pdfcpu.io/core/TODO|pdfcpu TODO]]: 
 +<code> 
 +pdfcpu TODO "in.pdf" "out.pdf" 
 +</code> 
 + 
 +==== PDFtk server (pdftk) ==== 
 +[[https://www.pdflabs.com/docs/pdftk-man-page/#TODO|pdftk TODO]]: 
 +<code> 
 +pdftk "in.pdf" output "out.pdf" TODO 
 +</code> 
 + 
 +==== QPDF ==== 
 +[[https://qpdf.readthedocs.io/en/stable/cli.html#TODO|qpdf --TODO]]: 
 +<code> 
 +qpdf "in.pdf" --TODO "out.pdf" 
 +</code> 
 +--><!-- New Case Template END -->
 </html> </html>
  
blog/odborny/2025-05-07-command-line_tools_for_pdf_processing.1767904202.txt.gz · Last modified: 2026/01/08 21:30 by Róbert Toth