blog:odborny:2025-05-07-command-line_tools_for_pdf_processing
Differences
Here you can see the differences between the selected revision and the current version of this page.
| blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/01/08 21:30] – Case #004: Text in Calibri-created PDF files cannot be copied/searched in Preview Róbert Toth | blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/01/20 17:42] (current) – [6. Decompress the whole PDF for editing in text editor] Róbert Toth | ||
|---|---|---|---|
| Line 6: | Line 6: | ||
| * … | * … | ||
| - | TODO | + | Due to the nature of the topic, this post is (and probably will remain) a work-in-progress. |
| ===== General: Overview of PDF-processing and manipulation tools ===== | ===== General: Overview of PDF-processing and manipulation tools ===== | ||
| + | |||
| + | ==== Coherent PDF (cpdf) ==== | ||
| + | * **Download: | ||
| + | * **Changelog: | ||
| + | * **Manual: | ||
| ==== MuPDF (mutool) ==== | ==== MuPDF (mutool) ==== | ||
| Line 15: | Line 21: | ||
| * https:// | * https:// | ||
| * https:// | * https:// | ||
| - | * **Manual: | + | * **Manual: |
| + | |||
| + | ==== pdfcpu ==== | ||
| + | * **Download: | ||
| + | * **Changelog: | ||
| + | * **Manual: | ||
| ==== PDFtk server (pdftk) ==== | ==== PDFtk server (pdftk) ==== | ||
| Line 22: | Line 33: | ||
| * **Manual: | * **Manual: | ||
| - | ==== Coherent PDF (cpdf) | + | ==== QPDF ==== |
| - | * **Download: | + | * **Download: |
| - | * **Changelog: | + | * **Changelog: |
| - | * **Manual: | + | * **Manual: |
| - | + | ||
| - | ==== pdfcpu ==== | + | |
| - | * **Download: | + | |
| - | * **Changelog: | + | |
| - | * **Manual: | + | |
| < | < | ||
| Line 41: | Line 47: | ||
| </ | </ | ||
| - | ==== Other (non-tested) tools ==== | + | ==== Other (non-/not-yet-tested) tools ==== |
| - | * **QPDF: | + | TODO |
| - | * … | + | |
| < | < | ||
| <!-- | <!-- | ||
| Line 50: | Line 55: | ||
| </ | </ | ||
| - | ===== Case #000: Minimize PDF size ===== | + | |
| + | ===== - Minimize PDF size ===== | ||
| **Example use-case:** Obvious. | **Example use-case:** Obvious. | ||
| Line 65: | Line 71: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | cpdf -squeeze "src.pdf" [-squeeze-no-recompress] -o "dst.pdf" | + | cpdf -squeeze "in.pdf" [-squeeze-no-recompress] -o "out.pdf" |
| </ | </ | ||
| Line 71: | Line 77: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | pdfcpu optimize "src.pdf" "dst.pdf" | + | pdfcpu optimize "in.pdf" "out.pdf" |
| </ | </ | ||
| Line 77: | Line 83: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | mutool clean -gggg -l -d -z -s "src.pdf" "dst.pdf" | + | mutool clean -gggg -l -d -z -s "in.pdf" "out.pdf" |
| </ | </ | ||
| '' | '' | ||
| Line 93: | Line 99: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | pdftk "src.pdf" output "dst.pdf" compress | + | pdftk "in.pdf" output "out.pdf" compress |
| </ | </ | ||
| + | |||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| ==== Adobe Acrobat ==== | ==== Adobe Acrobat ==== | ||
| Line 101: | Line 110: | ||
| ; <color blue> | ; <color blue> | ||
| ; <color blue> | ; <color blue> | ||
| - | ; <color blue> | + | ; <color blue> |
| ; <color blue> | ; <color blue> | ||
| ; <color blue> | ; <color blue> | ||
| - | ===== Case #001: Split each page of PDF into several pages (posterisation) ===== | + | |
| + | ===== - Split each page of PDF into several pages (posterisation) ===== | ||
| **Example use-case:** you have a (scanned) PDF where each page contains two physical pages, and you want to split each one into two. | **Example use-case:** you have a (scanned) PDF where each page contains two physical pages, and you want to split each one into two. | ||
| Line 114: | Line 124: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | mutool poster -x 2 "src.pdf" "dst.pdf" | + | mutool poster -x 2 "in.pdf" "out.pdf" |
| </ | </ | ||
| Note that '' | Note that '' | ||
| Line 121: | Line 131: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | cpdf -chop "2 1" "src.pdf" -o "dst.pdf" | + | cpdf -chop "2 1" "in.pdf" -o "out.pdf" |
| </ | </ | ||
| Unlike MuPDF, '' | Unlike MuPDF, '' | ||
| Line 127: | Line 137: | ||
| Moreover, '' | Moreover, '' | ||
| < | < | ||
| - | cpdf -chop "2 1" "src.pdf" AND -remove-page-labels -o " | + | cpdf -chop "2 1" "in.pdf" AND -remove-page-labels -o " |
| </ | </ | ||
| Line 133: | Line 143: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | pdftk A=src.pdf shuffle A A output | + | pdftk A=in.pdf shuffle A A output |
| - | cpdf -mediabox "0mm 0mm a5landscape" | + | cpdf -mediabox "0mm 0mm a5landscape" |
| - | cpdf -mediabox " | + | cpdf -mediabox " |
| </ | </ | ||
| The second step might also be done in Adobe Acrobat "Crop pages" function. | The second step might also be done in Adobe Acrobat "Crop pages" function. | ||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| - | ===== Case #002: Crop pages of PDF ===== | + | |
| + | ===== - Crop pages of PDF ===== | ||
| See MuPDF documentation on [[https:// | See MuPDF documentation on [[https:// | ||
| Line 155: | Line 168: | ||
| Note that Acrobat won't let you crop MediaBox, only other boxes (duh!). | Note that Acrobat won't let you crop MediaBox, only other boxes (duh!). | ||
| - Go to "Edit PDF" and then "Crop pages" function. | - Go to "Edit PDF" and then "Crop pages" function. | ||
| + | |||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| - | ===== Case #003: Remove cropped content from PDF ===== | + | ===== - Remove cropped content from PDF ===== |
| **Example use-case:** You have cropped some pages but want to actually remove the cropped content. Otherwise it is only hidden and remains in the PDF: when you inspect the file in Adobe Acrobat via "Edit PDF" and zoom out the page, the cropped content is still selectable, although not visible, because it lies outside the page margins. | **Example use-case:** You have cropped some pages but want to actually remove the cropped content. Otherwise it is only hidden and remains in the PDF: when you inspect the file in Adobe Acrobat via "Edit PDF" and zoom out the page, the cropped content is still selectable, although not visible, because it lies outside the page margins. | ||
| Line 165: | Line 181: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | mutool trim -o "dst.pdf" -b cropbox "src.pdf" | + | mutool trim -o "out.pdf" -b cropbox "in.pdf" |
| </ | </ | ||
| Line 175: | Line 191: | ||
| - Run preflight and the script | - Run preflight and the script | ||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| - | ===== Case #004: Text in Calibri-created PDF files cannot be copied/ | + | |
| + | ===== - Text in Calibri-created PDF files cannot be copied/ | ||
| The Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search/copy text from PDFs generated by Calibri. This is because a PDF created this way uses CID fonts, something which Preview app based on Apple' | The Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search/copy text from PDFs generated by Calibri. This is because a PDF created this way uses CID fonts, something which Preview app based on Apple' | ||
| Line 196: | Line 215: | ||
| </ | </ | ||
| + | |||
| + | ===== - Decompress the whole PDF for editing in text editor ===== | ||
| + | **Example use-case:** There are cases when you need to see or edit the actual text contents of a PDF. For example, there is some metadata at the level of individual pages (like ) which no existing program will actually clean (Acrobat "Find hidden information" | ||
| + | |||
| + | So directly editing the text contents of the PDF can come in handy. | ||
| + | |||
| + | ==== TL;DR: Summary first ==== | ||
| + | <color blue/ | ||
| + | |||
| + | ==== Coherent PDF (cpdf) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | cpdf -decompress [-no-preserve-objstm] " | ||
| + | </ | ||
| + | As mentioned in the manual, '' | ||
| + | |||
| + | <color red> | ||
| + | |||
| + | ==== MuPDF (mutool) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | mutool clean -d " | ||
| + | </ | ||
| + | '' | ||
| + | |||
| + | There is one additional switch which deals with PDF decompression: | ||
| + | |||
| + | Of all tested tools, '' | ||
| + | |||
| + | <color red> | ||
| + | |||
| + | ==== pdfcpu ==== | ||
| + | pdfcpu does not seem to support decompressing PDFs. | ||
| + | |||
| + | ==== PDFtk server (pdftk) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | pdftk " | ||
| + | </ | ||
| + | PDFtk separates individual PDF dictionary elements by newlines ('' | ||
| + | < | ||
| + | << | ||
| + | / | ||
| + | /Metadata 23 0 R | ||
| + | /Rotate 0 | ||
| + | /Resources 24 0 R | ||
| + | /Type /Page | ||
| + | /Parent 25 0 R | ||
| + | /Contents 26 0 R | ||
| + | /MediaBox [0 0 370.158 591.26] | ||
| + | /CropBox [0 0 370.158 591.26] | ||
| + | >> | ||
| + | </ | ||
| + | It also adds its own elements (e.g. ''/ | ||
| + | |||
| + | |||
| + | ==== QPDF ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | qpdf " | ||
| + | </ | ||
| + | This is effectively equivalent to using both [[https:// | ||
| + | |||
| + | There is also another switch, '' | ||
| + | |||
| + | QPDF currently [[https:// | ||
| + | |||
| + | |||
| + | ===== New cases to come… ===== | ||
| < | < | ||
| - | <!-- | + | < |
| - | ===== Case # | + | ——————————————————————————————————————————————————————————————————————————————————————————————— |
| + | <!-- New Case Template BEGIN | ||
| + | ===== - New Case TODO ===== | ||
| **Example use-case:** TODO. | **Example use-case:** TODO. | ||
| - | ==== TODO ==== | + | ==== TL;DR: Summary first ==== |
| + | <color blue/ | ||
| + | |||
| + | ==== Coherent PDF (cpdf) ==== | ||
| + | [[https:// | ||
| < | < | ||
| - | TODO | + | cpdf -TODO " |
| </ | </ | ||
| - | --> | + | |
| + | ==== MuPDF (mutool) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | mutool TODO " | ||
| + | </ | ||
| + | |||
| + | ==== pdfcpu ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | pdfcpu TODO " | ||
| + | </ | ||
| + | |||
| + | ==== PDFtk server (pdftk) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | pdftk " | ||
| + | </ | ||
| + | |||
| + | ==== QPDF ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | qpdf " | ||
| + | </ | ||
| + | -->< | ||
| </ | </ | ||
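
The size-minimizing commands listed under "Minimize PDF size" can be compared side by side with a small wrapper. The following is only a minimal sketch and not part of the original post: the helper names ''size_of'', ''try_tool'' and ''compare_sizes'' are hypothetical, the command lines are copied from the examples in that section, and any tool that is not installed (or that fails) is simply reported as skipped.

```shell
#!/bin/sh
# Sketch: run each size-minimizing command from the post on one input file
# and print the resulting file sizes. Helper names are illustrative only.

# Print the size of a file in bytes (tr strips the padding some wc variants add).
size_of() {
    wc -c < "$1" | tr -d ' '
}

# try_tool NAME OUTFILE COMMAND...
# Runs COMMAND if NAME is on PATH; prints "NAME<TAB>bytes" on success,
# or "NAME<TAB>skipped" if the tool is missing, fails, or writes no output.
try_tool() {
    name=$1
    out=$2
    shift 2
    if command -v "$name" >/dev/null 2>&1 && "$@" >/dev/null 2>&1 && [ -f "$out" ]; then
        printf '%s\t%s\n' "$name" "$(size_of "$out")"
    else
        printf '%s\tskipped\n' "$name"
    fi
}

# compare_sizes INPUT.pdf OUTPUT_DIR
# Writes one result file per tool into OUTPUT_DIR and prints a size table.
compare_sizes() {
    src=$1
    dir=$2
    printf 'original\t%s\n' "$(size_of "$src")"
    try_tool cpdf   "$dir/cpdf.pdf"   cpdf -squeeze "$src" -o "$dir/cpdf.pdf"
    try_tool pdfcpu "$dir/pdfcpu.pdf" pdfcpu optimize "$src" "$dir/pdfcpu.pdf"
    try_tool mutool "$dir/mutool.pdf" mutool clean -gggg -l -d -z -s "$src" "$dir/mutool.pdf"
    try_tool pdftk  "$dir/pdftk.pdf"  pdftk "$src" output "$dir/pdftk.pdf" compress
}
```

A possible invocation, assuming the functions above are saved to a file and sourced into the current shell, is ''compare_sizes "in.pdf" "$(mktemp -d)"''. Smaller output does not by itself mean the result is better, so the files should still be checked visually.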
blog/odborny/2025-05-07-command-line_tools_for_pdf_processing.1767904202.txt.gz · Last modified: 2026/01/08 21:30 by Róbert Toth
