Command-line tools for PDF processing

A collection of command-line solutions for different PDF manipulation use cases, such as:

  • PDF splitting (“explode” one multipage PDF into a set of single-page PDFs)
  • cropping PDF pages

TODO

General: Overview of PDF-processing and manipulation tools

MuPDF (mutool)

PDFtk server (pdftk)

Coherent PDF (cpdf)

pdfcpu

Other (non-tested) tools

Case #000: Minimize PDF size

Example use case: Obvious.

TL;DR: Summary first

  • Effectiveness of the tools (ordered by file size):
    • acrobat optimize ≪ mutool clean ≪ cpdf squeeze < pdfcpu optimize ≪ pdftk compress
    • ❶ Acrobat wins (but is paid), ❷ mutool is second & free (but is hard to configure), ❸ cpdf is third but is easiest to use
    • I have also tested all possible combinations between these tools, and the results are best when Acrobat Optimizer is followed by yet another tool. So…
  • Effectiveness of the combined tools (ordered by file size):
    • acrobat optimize + cpdf squeeze < acrobat optimize + pdfcpu optimize ≪ acrobat optimize + mutool clean

Conclusion: Always optimize PDF in Acrobat PDF Optimizer, then squeeze it with cpdf squeeze!

Coherent PDF (cpdf)

cpdf -squeeze:

cpdf -squeeze "src.pdf" [-squeeze-no-recompress] -o "dst.pdf"
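Since cpdf squeeze is recommended above as the final pass, a small wrapper can apply it to many files and keep the result only when it is actually smaller. This is a minimal sketch rather than a tested workflow: the function names, the temporary-file suffix and the `optimized/` folder in the usage comment are hypothetical; only the `cpdf -squeeze ... -o ...` call comes from the command above.

```shell
#!/bin/sh
# Size of a file in bytes (wc -c reads the file from stdin).
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Print whichever of two files is smaller (the first one on a tie).
smaller_of() {
    if [ "$(file_size "$2")" -lt "$(file_size "$1")" ]; then
        printf '%s\n' "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Squeeze one PDF in place, but only if cpdf actually made it smaller.
squeeze_in_place() {
    src=$1
    tmp="$src.squeezed.tmp"    # hypothetical temporary name
    cpdf -squeeze "$src" -o "$tmp" || { rm -f "$tmp"; return 1; }
    if [ "$(smaller_of "$src" "$tmp")" = "$tmp" ]; then
        mv "$tmp" "$src"
    else
        rm -f "$tmp"
    fi
}

# Usage, assuming a folder of PDFs that were already optimized elsewhere:
#   for f in optimized/*.pdf; do squeeze_in_place "$f"; done
```

Keeping the original on a tie avoids rewriting files that cpdf could not improve.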

pdfcpu

pdfcpu optimize:

pdfcpu optimize "src.pdf" "dst.pdf"

MuPDF (mutool)

mutool clean:

mutool clean -gggg -l -d -z -s "src.pdf" "dst.pdf"

mutool clean has many options and it takes some experimentation to see what actually shrinks the PDF size:

  • gggg (Garbage collect unused objects / compact xref table / merge duplicate objects / check streams for duplication): first three g's do not affect the file size for me, and the fourth makes it sometimes a bit larger and sometimes a bit smaller. I usually use the whole gggg parameter.
  • l (Linearize PDF): this makes the PDF ready for “Fast web view”, but makes it slightly larger. Note that MuPDF 1.26.0 removes linearisation support, so it does not really make sense to use it.
  • d (Decompress streams): this decompresses the whole file, which makes it larger – but when combined with z (Deflate uncompressed streams), it allows better PDF compression
  • z (Deflate uncompressed streams): this is the most important switch, and when combined with d (Decompress streams) it lowers PDF size even more
  • f (Compress font streams): no effect for me
  • i (Compress image streams): no effect for me
  • c (Clean content streams): no effect for me
  • s (Sanitize content streams): this actually lowers file size for me
  • AA (Recreate appearance streams for annotations): no effect for me (and not sure what it really does – at least while there are no annotations, it does not affect PDF file size at all)
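The experimentation described above can be partly automated. The sketch below runs `mutool clean` once per flag combination and prints the resulting sizes, smallest first. The function names and the chosen flag sets are only illustrative assumptions (and `-l` is left out because of the linearisation note above); adjust the list to whatever you want to compare.

```shell
#!/bin/sh
# Size of a file in bytes.
file_size() { wc -c < "$1" | tr -d ' '; }

# Run `mutool clean` once per flag set and print "<bytes> <flags>" lines,
# smallest result first. Flag sets that make mutool fail are skipped.
compare_clean_flags() {
    src=$1
    out=$(mktemp -d) || return 1
    for flags in "-gggg" "-z" "-d -z" "-gggg -d -z" "-gggg -d -z -s"; do
        dst="$out/result.pdf"
        # $flags is deliberately unquoted so "-d -z" splits into two options.
        if mutool clean $flags "$src" "$dst" 2>/dev/null; then
            printf '%s %s\n' "$(file_size "$dst")" "$flags"
        fi
    done | sort -n
    rm -rf "$out"
}

# Usage:
#   compare_clean_flags "src.pdf"
```

Results depend heavily on the input file, so it is worth running this on a few representative PDFs rather than trusting one measurement.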

PDFtk server (pdftk)

pdftk compress is not suitable for shrinking a PDF to its minimum size, as the developer himself states. For completeness, the command is:

pdftk "src.pdf" output "dst.pdf" compress

Adobe Acrobat

Acrobat PDF Optimizer has tons of options. After thorough testing, the best settings usually seem to be these:

  • Images tab: ⚠️ This is a per-PDF setting; it depends on your particular use case and file. My default is to have it turned off.
  • Fonts tab: ✅ Always on, with font subsetting turned on.
  • Transparency tab: 🚫 This causes optimisation to take an extremely long time for some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).
  • Discard Objects tab: ✅ Everything on, except Discard bookmarks (you never want that) and also Convert smooth lines to curves and Detect and merge image fragments (these are not needed in 99% of cases and they make optimisation take significantly longer, while they usually do not lower the PDF size at all).
  • Discard User Data tab: ✅ Everything on.
  • Clean Up tab: ✅ Everything on.

Case #001: Split each page of PDF into several pages (posterisation)

Example use case: You have (scanned) pages where each PDF page contains two physical pages, and you want to crop those into two.

TL;DR: Summary first

Conclusion: Use cpdf chop. MuPDF has some problems and combining pdftk with manual cropping in Adobe Acrobat is tedious.

MuPDF (mutool)

mutool poster:

mutool poster -x 2 "src.pdf" "dst.pdf"

Note that mutool poster sometimes sets both mediabox and cropbox – the latter seems unnecessary. More importantly, mutool poster causes problems when mutool trim is used afterwards – for some reason, it leaves the PDF completely empty. This does not happen when mutool trim is used after cpdf chop.

Coherent PDF (cpdf)

cpdf -chop:

cpdf -chop "2 1" "src.pdf" -o "dst.pdf"

Unlike MuPDF, cpdf chop only sets mediabox, which is enough.

Moreover, cpdf can also remove page labels, if needed:

cpdf -chop "2 1" "src.pdf" AND -remove-page-labels -o "dstNoLabels.pdf"

Indirect way: Duplicate each page, then crop out left or right half on odd and even pages

pdftk shuffle && cpdf -mediabox:

pdftk A=src.pdf shuffle A A output dup.pdf
cpdf -mediabox "0mm 0mm a5landscape" "dup.pdf" odd -o "dupOdd.pdf"
cpdf -mediabox "148.5mm 0mm a5landscape" "dupOdd.pdf" even -o "dst.pdf"

The second step might also be done in Adobe Acrobat “Crop pages” function.
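The crop offsets used in the indirect way are simply half of the page width. As a worked example, the helper below prints the two `cpdf -mediabox` strings (in the “x y w h” form, in mm) for the left and right half of a page. The function name and the 297 × 210 mm page size in the usage comment are assumptions made for illustration; substitute the real dimensions of your pages.

```shell
#!/bin/sh
# Print the two cpdf -mediabox strings ("x y w h", in mm) that select the
# left and the right half of a page that is $1 mm wide and $2 mm high.
half_boxes() {
    awk -v w="$1" -v h="$2" 'BEGIN {
        half = w / 2
        printf "0mm 0mm %gmm %gmm\n", half, h          # left half
        printf "%gmm 0mm %gmm %gmm\n", half, half, h   # right half
    }'
}

# Usage for a hypothetical 297 x 210 mm spread:
#   half_boxes 297 210
#   # prints:
#   #   0mm 0mm 148.5mm 210mm
#   #   148.5mm 0mm 148.5mm 210mm
#
# The first line can then be passed to cpdf for the odd pages, e.g.:
#   left=$(half_boxes 297 210 | sed -n 1p)
#   cpdf -mediabox "$left" "in.pdf" odd -o "out.pdf"
```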

Case #002: Crop pages of PDF

See the MuPDF documentation on the different PDF page boxes ([media|crop|art|trim|bleed]box).

The mediabox is the “physical” size of the page, while the other boxes are in a way only “virtual”: they specify which content should be visible at what point and in which cases, but they do not alter the real PDF page size.
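Before cropping, it helps to check which boxes a file actually defines. The sketch below filters the output of `cpdf -page-info` down to the page headers and box lines. It assumes that cpdf prints one `Name: value` line per box (`MediaBox:`, `CropBox:` and so on) under a `Page N:` header; if your cpdf version formats the output differently, adjust the pattern. The function name is hypothetical.

```shell
#!/bin/sh
# Print only the page headers and the box lines from `cpdf -page-info`.
# Assumes lines of the form "Page 1:", "MediaBox: ...", "CropBox: ..." etc.
show_boxes() {
    cpdf -page-info "$1" |
        grep -E '^(Page|MediaBox|CropBox|BleedBox|TrimBox|ArtBox)'
}

# Usage:
#   show_boxes "src.pdf"
```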

Coherent PDF (cpdf)

cpdf -mediabox (/cropbox/artbox/trimbox/bleedbox):

cpdf -mediabox "0mm 0mm a4portrait" "srcA3.pdf" -o "dstA4.pdf"

Note that adjusting page sizes by cropping the mediabox is (to my knowledge) the only way to do it without altering the PDF content in any way (as explained in the Coherent PDF manual). Cropping other boxes might lead to the PDF structure being changed.

Adobe Acrobat

Note that Acrobat won't let you crop MediaBox, only other boxes (duh!).

  1. Go to “Edit PDF” and then use the “Crop pages” function.

Case #003: Remove cropped content from PDF

Example use case: You have cropped some pages, but you want to actually remove the content, since otherwise it is only hidden and remains in the PDF. You can see this when you inspect the PDF in Adobe Acrobat via “Edit PDF” and zoom out the page: the cropped content is selectable, although not visible, since it lies outside the page margins.

According to its author, cpdf is not going to support this feature.

MuPDF (mutool)

mutool trim:

mutool trim -o "dst.pdf" -b cropbox "src.pdf"
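Because `mutool trim` is reported in Case #001 to work after `cpdf chop`, the two steps can be chained. The following is a minimal sketch under that assumption: the function name and the temporary file are hypothetical, and `-b mediabox` is used here instead of `-b cropbox` because cpdf chop only sets the mediabox.

```shell
#!/bin/sh
# Split every page into a left and a right half with cpdf, then let mutool
# remove the content that falls outside the new pages.
chop_and_trim() {
    src=$1
    dst=$2
    tmpdir=$(mktemp -d) || return 1
    tmp="$tmpdir/chopped.pdf"           # intermediate chopped PDF
    cpdf -chop "2 1" "$src" -o "$tmp" &&
        mutool trim -o "$dst" -b mediabox "$tmp"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

# Usage:
#   chop_and_trim "src.pdf" "dst.pdf"
```

The intermediate file lives in a throwaway directory, so a failed run does not leave a half-processed PDF next to the source.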

Adobe Acrobat

  1. Download custom user script CropBoxFix
  2. Import it into Acrobat Preflight (by double-clicking)
  3. Run preflight and the script

blog/odborny/2025-05-07-command-line_tools_for_pdf_processing.txt · Last modified: 2025/05/09 11:44 by Róbert Toth