blog:odborny:2025-05-07-command-line_tools_for_pdf_processing
Differences
Here you can see the differences between the selected revision and the current version of this page.
| blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2025/05/07 22:15] – Róbert Toth | blog:odborny:2025-05-07-command-line_tools_for_pdf_processing [2026/04/30 10:56] (current) – [1. Minimize PDF size] Róbert Toth | ||
|---|---|---|---|
| Line 6: | Line 6: | ||
| * … | * … | ||
| - | TODO | + | Due to the nature of the topic, this post is (and probably will remain) a work-in-progress. |
| ===== General: Overview of PDF-processing and manipulation tools ===== | ===== General: Overview of PDF-processing and manipulation tools ===== | ||
| + | |||
| + | ==== Coherent PDF (cpdf) ==== | ||
| + | * **Download: | ||
| + | * **Changelog: | ||
| + | * **Manual: | ||
| ==== MuPDF (mutool) ==== | ==== MuPDF (mutool) ==== | ||
| * **Download: | * **Download: | ||
| - | * **Changelog: | + | * **Changelog: |
| - | * **Manual: | + | * https:// |
| + | * https:// | ||
| + | * **Manual: | ||
| + | |||
| + | ==== pdfcpu ==== | ||
| + | * **Download: | ||
| + | * **Changelog: | ||
| + | * **Manual: | ||
| ==== PDFtk server (pdftk) ==== | ==== PDFtk server (pdftk) ==== | ||
| Line 20: | Line 33: | ||
| * **Manual: | * **Manual: | ||
| - | ==== Coherent PDF (cpdf) | + | ==== QPDF ==== |
| - | * **Download: | + | * **Download: |
| - | * **Changelog: | + | * **Changelog: |
| - | * **Manual: | + | * **Manual: |
| - | + | ||
| - | ==== pdfcpu ==== | + | |
| - | * **Download: | + | |
| - | * **Changelog: | + | |
| - | * **Manual: | + | |
| < | < | ||
| Line 39: | Line 47: | ||
| </ | </ | ||
| - | ==== Other (non-tested) tools ==== | + | ==== Other (non-/not-yet-tested) tools ==== |
| - | * **QPDF: | + | TODO |
| - | * … | + | |
| < | < | ||
| <!-- | <!-- | ||
| Line 48: | Line 55: | ||
| </ | </ | ||
| - | ===== Case #000: Minimize PDF size ===== | + | |
| + | ===== - Minimize PDF size ===== | ||
| **Example use-case:** Obvious. | **Example use-case:** Obvious. | ||
| - | ==== cpdf ==== | + | ==== TL;DR: Summary first ==== |
| + | * **Effectiveness of the tools** (ordered by file size): | ||
| + | * <color green> | ||
| + | * ❶ Acrobat wins (but is paid), ❷ mutool is second & free (but is hard to configure), ❸ cpdf is third but is easiest to use | ||
| + | * I have also tested all possible combinations between these tools, and the results are best when Acrobat Optimizer is followed by //yet another// tool. So… | ||
| + | * **Effectiveness of the // | ||
| + | * <color green> | ||
| + | <color blue/ | ||
| + | |||
| + | ==== Coherent PDF (cpdf) | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | cpdf -squeeze "src.pdf" [-squeeze-no-recompress] -o "dst.pdf" | + | cpdf -squeeze "in.pdf" [-squeeze-no-recompress] -o "out.pdf" |
| </ | </ | ||
| Line 60: | Line 77: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | pdfcpu optimize "src.pdf" "dst.pdf" | + | pdfcpu optimize "in.pdf" "out.pdf" |
| </ | </ | ||
| + | ==== MuPDF (mutool) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | mutool clean -gggg -l -d -z -s " | ||
| + | </ | ||
| + | '' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| - | ===== Case #001: Split each page of PDF into several pages (posterisation) ===== | + | ==== PDFtk server (pdftk) |
| + | [[https:// | ||
| + | < | ||
| + | pdftk " | ||
| + | </ | ||
| + | |||
| + | ==== QPDF ==== | ||
| + | [[https:// | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | * **'' | ||
| + | So the resulting command is: | ||
| + | < | ||
| + | qpdf " | ||
| + | </ | ||
| + | |||
| + | ==== Adobe Acrobat ==== | ||
| + | Acrobat [[https:// | ||
| + | ; <color blue> | ||
| + | ; <color blue> | ||
| + | ; <color blue> | ||
| + | ; <color blue> | ||
| + | ; <color blue> | ||
| + | ; <color blue> | ||
| + | |||
| + | |||
| + | ===== - Split each page of PDF into several pages (posterisation) ===== | ||
| **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two. | **Example use-case:** you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two. | ||
| + | |||
| + | ==== TL;DR: Summary first ==== | ||
| + | <color blue/ | ||
| ==== MuPDF (mutool) ==== | ==== MuPDF (mutool) ==== | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | mutool poster -x 2 "src.pdf" "dst.pdf" | + | mutool poster -x 2 "in.pdf" "out.pdf" |
| </ | </ | ||
| + | Note that '' | ||
| ==== Coherent PDF (cpdf) ==== | ==== Coherent PDF (cpdf) ==== | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | cpdf -chop "1 2" "src.pdf" -o "dst.pdf" | + | cpdf -chop "2 1" " |
| + | </ | ||
| + | Unlike MuPDF, '' | ||
| + | |||
| + | Moreover, '' | ||
| + | < | ||
| + | cpdf -chop "2 1" "in.pdf" | ||
| </ | </ | ||
| Line 82: | Line 152: | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | pdftk A=src.pdf shuffle A A output | + | pdftk A=in.pdf shuffle A A output |
| - | cpdf -mediabox "0mm 0mm a5landscape" | + | cpdf -mediabox "0mm 0mm a5landscape" |
| - | cpdf -mediabox " | + | cpdf -mediabox " |
| </ | </ | ||
| The second step might also be done in Adobe Acrobat "Crop pages" function. | The second step might also be done in Adobe Acrobat "Crop pages" function. | ||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| - | ===== Case #002: Crop pages of PDF ===== | + | |
| + | ===== - Crop pages of PDF ===== | ||
| See MuPDF documentation on [[https:// | See MuPDF documentation on [[https:// | ||
| + | |||
| + | Mediabox is a " | ||
| ==== Coherent PDF (cpdf) ==== | ==== Coherent PDF (cpdf) ==== | ||
| Line 97: | Line 172: | ||
| cpdf -mediabox "0mm 0mm a4portrait" | cpdf -mediabox "0mm 0mm a4portrait" | ||
| </ | </ | ||
| + | Note that adjusting page sizes by cropping mediabox is (to my knowledge) the only way to do it without altering the PDF content in any way (as explained [[https:// | ||
| ==== Adobe Acrobat ==== | ==== Adobe Acrobat ==== | ||
| - | Note that Acrobat won't let you crop MediaBox. | + | Note that Acrobat won't let you crop MediaBox, only other boxes (duh!). |
| - Go to "Edit PDF" and then "Crop pages" function. | - Go to "Edit PDF" and then "Crop pages" function. | ||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| - | ===== Case #003: Remove cropped content from PDF ===== | + | |
| + | ===== - Remove cropped content from PDF ===== | ||
| **Example use-case:** You have cropped some pages and want to actually remove the cropped content; otherwise it is only hidden but remains in the PDF. You can see this by inspecting the PDF in Adobe Acrobat via "Edit PDF" and zooming out: the cropped content is selectable, although not visible, since it lies outside the page margins. | **Example use-case:** You have cropped some pages and want to actually remove the cropped content; otherwise it is only hidden but remains in the PDF. You can see this by inspecting the PDF in Adobe Acrobat via "Edit PDF" and zooming out: the cropped content is selectable, although not visible, since it lies outside the page margins. | ||
| - | mutool trim -o " | + | According to its author, it seems that [[https:// |
| ==== MuPDF (mutool) ==== | ==== MuPDF (mutool) ==== | ||
| [[https:// | [[https:// | ||
| < | < | ||
| - | mutool trim -o "dst.pdf" -b cropbox "src.pdf" | + | mutool trim -o "out.pdf" -b cropbox "in.pdf" |
| </ | </ | ||
| + | |||
| + | <color red> | ||
| ==== Adobe Acrobat ==== | ==== Adobe Acrobat ==== | ||
| Line 119: | Line 200: | ||
| - Run preflight and the script | - Run preflight and the script | ||
| + | ==== QPDF ==== | ||
| + | N/A (not tested yet TODO) | ||
| + | |||
| + | ===== - Text in Calibri-created PDF files cannot be copied/ | ||
| + | The Preview app (or any PDFKit-based PDF viewer, such as Skim) cannot search/copy text from PDFs generated by Calibri. This is because a PDF created this way uses CID fonts, something which the Preview app based on Apple' | ||
| + | |||
| + | This leads to a situation where **such PDF files can be properly shown and visually read, but trying to select (and copy) or search the text inside the PDF fails**. See these sources for reference: | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | ==== Fixing the problem with GhostScript ==== | ||
| + | This can be fixed by converting/ | ||
| + | < | ||
| + | gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ | ||
| + | -dPrinted=false -dQUIET -sOutputFile=" | ||
| + | </ | ||
| + | This was mentioned [[https:// | ||
| + | |||
| + | GhostScript can be installed, e.g., via MacPorts: | ||
| + | < | ||
| + | sudo port install ghostscript | ||
| + | </ | ||
| + | |||
| + | |||
| + | ===== - Decompress the whole PDF for editing in text editor ===== | ||
| + | **Example use-case:** There are cases when you need to see or edit the actual text contents of a PDF. For example, there is some page-level metadata (like ) which no existing program will actually clean (Acrobat "Find hidden information" | ||
| + | |||
| + | So directly editing the text contents of the PDF might come in handy. | ||
| + | |||
| + | ==== TL;DR: Summary first ==== | ||
| + | <color blue/ | ||
| + | |||
| + | ==== Coherent PDF (cpdf) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | cpdf -decompress [-no-preserve-objstm] " | ||
| + | </ | ||
| + | As mentioned in the manual, '' | ||
| + | |||
| + | <color red> | ||
| + | |||
| + | ==== MuPDF (mutool) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | mutool clean -d " | ||
| + | </ | ||
| + | '' | ||
| + | |||
| + | There is one additional switch which deals with PDF decompression: | ||
| + | |||
| + | Of all tested tools, '' | ||
| + | |||
| + | <color red> | ||
| + | |||
| + | ==== pdfcpu ==== | ||
| + | pdfcpu does not seem to support decompressing PDFs. | ||
| + | |||
| + | ==== PDFtk server (pdftk) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | pdftk " | ||
| + | </ | ||
| + | PDFtk separates individual PDF dictionary elements by newlines ('' | ||
| + | < | ||
| + | << | ||
| + | / | ||
| + | /Metadata 23 0 R | ||
| + | /Rotate 0 | ||
| + | /Resources 24 0 R | ||
| + | /Type /Page | ||
| + | /Parent 25 0 R | ||
| + | /Contents 26 0 R | ||
| + | /MediaBox [0 0 370.158 591.26] | ||
| + | /CropBox [0 0 370.158 591.26] | ||
| + | >> | ||
| + | </ | ||
| + | It also adds its own elements (e.g. ''/ | ||
| + | |||
| + | |||
| + | ==== QPDF ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | qpdf " | ||
| + | </ | ||
| + | This is effectively equivalent to using both [[https:// | ||
| + | |||
| + | There is also another switch, '' | ||
| + | |||
| + | QPDF currently [[https:// | ||
| + | |||
| + | |||
| + | ===== New cases to come… ===== | ||
| < | < | ||
| - | <!-- | + | < |
| - | ===== Case # | + | ——————————————————————————————————————————————————————————————————————————————————————————————— |
| + | <!-- New Case Template BEGIN | ||
| + | ===== - New Case TODO ===== | ||
| **Example use-case:** TODO. | **Example use-case:** TODO. | ||
| - | ==== TODO ==== | + | ==== TL;DR: Summary first ==== |
| + | <color blue/ | ||
| + | |||
| + | ==== Coherent PDF (cpdf) ==== | ||
| + | [[https:// | ||
| < | < | ||
| - | TODO | + | cpdf -TODO " |
| </ | </ | ||
| - | --> | + | |
| + | ==== MuPDF (mutool) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | mutool TODO " | ||
| + | </ | ||
| + | |||
| + | ==== pdfcpu ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | pdfcpu TODO " | ||
| + | </ | ||
| + | |||
| + | ==== PDFtk server (pdftk) ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | pdftk " | ||
| + | </ | ||
| + | |||
| + | ==== QPDF ==== | ||
| + | [[https:// | ||
| + | < | ||
| + | qpdf " | ||
| + | </ | ||
| + | -->< | ||
| </ | </ | ||
| ~~socialite~~ | ~~socialite~~ | ||
| - | {{tag>draft pdf}} | + | {{tag> |
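
As a rough illustration of the "Minimize PDF size" comparison above, here is a minimal sketch of a script that runs whichever of the tools are installed and reports the smallest result. It uses only the `cpdf -squeeze` and `pdfcpu optimize` invocations shown in the post; the function names and the `out-cpdf.pdf` / `out-pdfcpu.pdf` file names are hypothetical, chosen just for this example, and the other tools would need to be added by hand.

```shell
#!/bin/sh
# Sketch only: compare the output sizes of the installed size-reducing tools.
# Function names and out-*.pdf file names are assumptions made for this example.

# Print the size of a file in bytes (wc -c works on both macOS and Linux).
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Print the name of the smallest existing file among the arguments.
# Missing files are skipped, so a tool that failed simply drops out.
smallest_file() {
    best=""
    best_size=""
    for f in "$@"; do
        [ -f "$f" ] || continue
        size=$(file_size "$f")
        if [ -z "$best" ] || [ "$size" -lt "$best_size" ]; then
            best="$f"
            best_size="$size"
        fi
    done
    if [ -n "$best" ]; then
        printf '%s\n' "$best"
    fi
}

# Run each installed tool on one input file and print the smallest file,
# which may be the input itself if no tool managed to shrink it.
compare_tools() {
    src="$1"
    rm -f "out-cpdf.pdf" "out-pdfcpu.pdf"   # avoid comparing stale outputs
    if command -v cpdf >/dev/null 2>&1; then
        cpdf -squeeze "$src" -o "out-cpdf.pdf"
    fi
    if command -v pdfcpu >/dev/null 2>&1; then
        pdfcpu optimize "$src" "out-pdfcpu.pdf"
    fi
    smallest_file "$src" "out-cpdf.pdf" "out-pdfcpu.pdf"
}
```

After sourcing the script, `compare_tools "in.pdf"` prints the name of the smallest file. Including the input in the comparison is deliberate, since an already optimized PDF can come out larger after another pass.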
blog/odborny/2025-05-07-command-line_tools_for_pdf_processing.1746648913.txt.gz · Last modified: 2025/05/07 22:15 by Róbert Toth
