Contents

General: Overview of PDF-processing and manipulation tools
1. Minimize PDF size
2. Split each page of PDF into several pages (posterisation)
3. Crop pages of PDF
4. Remove cropped content from PDF
5. Text in Calibre-created PDF files cannot be copied/searched in Preview
- Fixing the problem with GhostScript
6. Decompress the whole PDF for editing in text editor
New cases to come…

Command-line tools for PDF processing

A collection of command-line solutions for different PDF‑manipulation use‑cases, such as:

- PDF splitting (“explode” one multi‑page PDF into a set of single‑page PDFs)
- cropping PDF pages
- …

Due to the nature of the topic, this post is (and probably will remain) a work‑in‑progress.

General: Overview of PDF-processing and manipulation tools

Other (non-/not-yet-tested) tools

1. Minimize PDF size

Example use‑case: Obvious.

TL;DR: Summary first

Effectiveness of the tools (ordered by file size):
- acrobat optimize ≪ mutool clean ≪ cpdf squeeze < pdfcpu optimize ≪ pdftk compress
- ❶ Acrobat wins (but is paid), ❷ mutool is second & free (but is hard to configure), ❸ cpdf is third but is easiest to use
- I have also tested all possible combinations between these tools, and the results are best when Acrobat Optimizer is followed by yet another tool. So…
Effectiveness of the combined tools (ordered by file size):
- acrobat optimize + cpdf squeeze < acrobat optimize + pdfcpu optimize ≪ acrobat optimize + mutool clean

Conclusion: Always optimize the PDF in Acrobat PDF Optimizer first, then squeeze it with cpdf squeeze!
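To reproduce a ranking like the one above on your own files, it helps to list the candidate outputs by size. Here is a minimal sketch in POSIX shell; the function name rank_by_size and the file names in the usage comment are made up for illustration:

```shell
# rank_by_size FILE...
# Prints "<bytes> <file>" for every given file, smallest first.
rank_by_size() {
    for f in "$@"; do
        # Reading from stdin makes wc print only the count (no file name);
        # tr strips the padding that some wc implementations add.
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -n
}

# Usage (hypothetical file names):
#   cpdf -squeeze "in.pdf" -o "out_cpdf.pdf"
#   pdfcpu optimize "in.pdf" "out_pdfcpu.pdf"
#   rank_by_size "in.pdf" "out_cpdf.pdf" "out_pdfcpu.pdf"
```

Including the original file in the list makes it obvious when a tool has made the PDF larger rather than smaller.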

Coherent PDF (cpdf)

cpdf -squeeze:

cpdf -squeeze "in.pdf" [-squeeze-no-recompress] -o "out.pdf"

pdfcpu

pdfcpu optimize:

pdfcpu optimize "in.pdf" "out.pdf"

MuPDF (mutool)

mutool clean:

mutool clean -gggg -l -d -z -s "in.pdf" "out.pdf"

mutool clean has many options and it takes some experimentation to see what actually shrinks the PDF size:

- ‑gggg (Garbage collect unused objects / compact xref table / merge duplicate objects / check streams for duplication): the first three g's do not affect the file size for me, and the fourth makes it sometimes a bit larger and sometimes a bit smaller. I usually use the whole ‑gggg parameter.
- ‑l (Linearize PDF): this makes the PDF ready for “Fast web view”, but makes it slightly larger. Note that MuPDF 1.26.0 removes linearisation support, so it does not really make sense to use it.
- ‑d (Decompress streams): this decompresses the whole file, which makes it larger, but when combined with ‑z (Deflate uncompressed streams) it allows better PDF compression.
- ‑z (Deflate uncompressed streams): this is the most important switch, and when combined with ‑d (Decompress streams) it lowers the PDF size even more.
- ‑f (Compress font streams): no effect for me.
- ‑i (Compress image streams): no effect for me.
- ‑c (Clean content streams): no effect for me.
- ‑s (Sanitize content streams): this actually lowers the file size for me.
- ‑AA (Recreate appearance streams for annotations): no effect for me (and I am not sure what it really does; at least when there are no annotations, it does not affect the PDF file size at all).
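Since the effect of each switch varies from file to file, the experimentation can be scripted. The following is only a sketch: the function name compare_mutool_flags and the particular flag sets are my own choice (built from the switches discussed above, leaving out ‑l), not something prescribed by MuPDF:

```shell
# compare_mutool_flags IN.pdf
# Runs "mutool clean" once per flag set and prints "<bytes> <flags>",
# smallest result first.
compare_mutool_flags() {
    src=$1
    tmp_out="${TMPDIR:-/tmp}/mutool_flags_$$.pdf"
    for flags in "-z" "-d -z" "-d -z -s" "-gggg -d -z -s"; do
        # $flags is deliberately unquoted so it splits into separate options
        mutool clean $flags "$src" "$tmp_out" || continue
        printf '%s %s\n' "$(wc -c < "$tmp_out" | tr -d ' ')" "$flags"
    done | sort -n
    rm -f "$tmp_out"
}

# Usage:
#   compare_mutool_flags "in.pdf"
```

Add or remove flag sets in the for loop to match the switches you want to compare on your own file.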

PDFtk server (pdftk)

pdftk compress is not suitable for really shrinking a PDF to the minimum, as the developer himself states.

pdftk "in.pdf" output "out.pdf" compress

QPDF

N/A (not tested yet)

Adobe Acrobat

Acrobat PDF Optimizer has tons of options. After thorough testing, the best settings usually seem to be these:

- Images tab: ⚠️ This is a per‑PDF setting; it depends on your particular use‑case and file. My default is to have it turned off.
- Fonts tab: ✅ Always on, with font subsetting turned on.
- Transparency tab: 🚫 This causes optimisation to take an extremely long time in some PDFs. My default is to have it turned off (and I cannot remember a situation when I actually needed to turn it on).
- Discard Objects tab: ✅ Everything on, except Discard bookmarks (you never want that) and also Convert smooth lines to curves and Detect and merge image fragments (these are not needed in 99% of cases and they make optimisation take significantly longer, while they usually do not lower the PDF size at all). You might also want to Discard all JavaScript actions, but if there are any internal links (e.g. a table of contents, references or an index pointing to specific places in the PDF), this breaks them.
- Discard User Data tab: ✅ Everything on.
- Clean Up tab: ✅ Everything on.

2. Split each page of PDF into several pages (posterisation)

Example use‑case: you have (scanned) pages where each PDF page contains two physical pages, and want to crop those into two.

TL;DR: Summary first

Conclusion: Use cpdf ‑chop. MuPDF has some problems and combining pdftk with manual cropping in Adobe Acrobat is tedious.

MuPDF (mutool)

mutool poster:

mutool poster -x 2 "in.pdf" "out.pdf"

Note that mutool poster sometimes sets both mediabox and cropbox – the latter seems unnecessary. More importantly, mutool poster causes problems when mutool trim is used afterwards – for some reason, it leaves the PDF completely empty. This does not happen when mutool trim is used after cpdf ‑chop.

Coherent PDF (cpdf)

cpdf -chop:

cpdf -chop "2 1" "in.pdf" -o "out.pdf"

Unlike MuPDF, cpdf ‑chop only sets mediabox, which is enough.

Moreover, cpdf can also remove page labels, if needed:

cpdf -chop "2 1" "in.pdf" AND -remove-page-labels -o "dstNoLabels.pdf"

Indirect way: Duplicate each page, then crop out left or right half on odd and even pages

pdftk shuffle && cpdf -mediabox:

pdftk A="in.pdf" shuffle A A output "dup.pdf"
cpdf -mediabox "0mm 0mm a5landscape" "dup.pdf" odd -o "srcOdd.pdf"
cpdf -mediabox "148.5mm 0mm a5landscape" "srcOdd.pdf" even -o "out.pdf"

The second step (cropping) can also be done with Adobe Acrobat's “Crop pages” function.

QPDF

N/A (not tested yet)

3. Crop pages of PDF

See the MuPDF documentation on the different PDF page boxes ([media|crop|art|trim|bleed]box).

Mediabox is the “physical” size of the page, while the other boxes are in a way only “virtual”: they specify which content should be visible at what point and in which cases, but they do not alter the real PDF page size.

Coherent PDF (cpdf)

cpdf -mediabox (/cropbox/artbox/trimbox/bleedbox):

cpdf -mediabox "0mm 0mm a4portrait" "srcA3.pdf" -o "dstA4.pdf"

Note that adjusting page sizes by cropping the mediabox is (to my knowledge) the only way to do it without altering the PDF content in any way (as explained in the Coherent PDF manual). Cropping other boxes might lead to the PDF structure being changed.
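The cpdf commands here take millimetres, while the box values stored inside the PDF are in points (the default PDF unit, 1/72 inch, with 25.4 mm to the inch). When comparing the two, a small converter helps. This is just a sketch using awk, and the function names mm_to_pt and pt_to_mm are made up:

```shell
# mm_to_pt MILLIMETRES -> PDF points, rounded to two decimals
mm_to_pt() {
    awk -v mm="$1" 'BEGIN { printf "%.2f\n", mm * 72 / 25.4 }'
}

# pt_to_mm POINTS -> millimetres, rounded to two decimals
pt_to_mm() {
    awk -v pt="$1" 'BEGIN { printf "%.2f\n", pt * 25.4 / 72 }'
}

# Usage:
#   mm_to_pt 210    # width of an A4 portrait page in points
#   mm_to_pt 297    # height of an A4 portrait page in points
```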

Adobe Acrobat

Note that Acrobat won't let you crop MediaBox, only other boxes (duh!).

Go to “Edit PDF” and then “Crop pages” function.

QPDF

N/A (not tested yet)

4. Remove cropped content from PDF

Example use‑case: You have cropped some pages but want to actually remove the content, since otherwise it is only hidden and remains in the PDF. You can see this when you inspect the PDF in Adobe Acrobat via “Edit PDF” and zoom out: the cropped content is selectable, although not visible, since it lies outside the page boundaries.

According to its author, it seems that cpdf is not going to support this feature.

MuPDF (mutool)

mutool trim:

mutool trim -o "out.pdf" -b cropbox "in.pdf"

Warning: In some PDFs, I have seen mutool trim also cut out some content that was actually visible on‑page (that is, inside mediabox & cropbox). I don't know whether this was because the PDF itself was malformed or whether it is a problem with mutool, but just in case, it is better to avoid it and use the Acrobat method below.

Adobe Acrobat

Download custom user script CropBoxFix
Import it into Acrobat Preflight (by double‑clicking)
Run preflight and the script

QPDF

N/A (not tested yet)

5. Text in Calibre-created PDF files cannot be copied/searched in Preview

The Preview app (or any PDFKit‑based PDF viewer, such as Skim) cannot search or copy text from PDFs generated by Calibre. This is because PDFs created this way use CID fonts, which Preview, being based on Apple's PDFKit, does not support.

This leads to a situation where such PDF files can be properly shown and visually read, but trying to select (and copy) or search the text inside the PDF fails. See these sources for reference:

Fixing the problem with GhostScript

This can be fixed by converting/running the PDF file through the following GhostScript command:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH \
-dPrinted=false -dQUIET -sOutputFile="out.pdf" "in.pdf"

This was mentioned here.

GhostScript can be installed e.g. by MacPorts:

sudo port install ghostscript
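If you have a whole batch of such files, the same command can be wrapped in a loop. A minimal sketch, assuming the flags shown above work for your files; the function name fix_for_preview and the ".fixed.pdf" naming are my own invention:

```shell
# fix_for_preview FILE.pdf...
# Runs each PDF through GhostScript and saves "<name>.fixed.pdf" beside it.
fix_for_preview() {
    for src in "$@"; do
        dst="${src%.pdf}.fixed.pdf"
        gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
           -dNOPAUSE -dBATCH -dPrinted=false -dQUIET \
           -sOutputFile="$dst" "$src" || echo "failed: $src" >&2
    done
}

# Usage:
#   fix_for_preview *.pdf
```

The originals are left untouched, so you can check the fixed copies in Preview before deleting anything.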

6. Decompress the whole PDF for editing in text editor

Example use‑case: There are some cases when you need to see or edit the actual text contents of the PDF. For example, there is some metadata at the level of individual pages (like ) which no existing program will actually clean (Acrobat's “Find hidden information” lists it, but keeps it there when you choose to remove it).

So directly editing the text contents of the PDF might come in handy.

TL;DR: Summary first

Conclusion:

Coherent PDF (cpdf)

cpdf -decompress:

cpdf -decompress [-no-preserve-objstm] "in.pdf" -o "out.pdf"

As the manual mentions, -no-preserve-objstm removes data from separate object streams and puts it back into the normal flow of the PDF, which should make the PDF easier to edit directly.

Warning: After processing a PDF using this command (with or without the ‑no‑preserve‑objstm switch), I have not been able to use Adobe Acrobat's “Optimize” or “Compare documents” function on it ever again – no matter how much tinkering, document‑processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).

MuPDF (mutool)

mutool clean:

mutool clean -d "in.pdf" "out.pdf"

The developer specifically describes mutool clean as intended to “make a PDF file human editable” and to “expand compressed streams”, so this is the command to go with.

There is one additional switch which deals with PDF decompression: ‑a, which ASCII‑hex‑encodes binary streams. This safely encodes binary streams so that there should be no problems when editing the PDF in a text editor. However, it forces almost all streams in the PDF to be encoded, which makes the whole PDF human‑unreadable, so it is not really helpful.

Of all tested tools, mutool clean is the only one which maintains original object reference numbers (e.g. /Metadata 22 0 R referencing object #22).

Warning: After processing a PDF using this command (with or without the ‑a switch), I have not been able to use Adobe Acrobat's “Optimize” or “Compare documents” function on it ever again – no matter how much tinkering, document‑processing, transforming and cleaning I did on the PDF. So I consider this method to be unreliable if you want to maintain Adobe Acrobat compatibility (which I always do).

pdfcpu

pdfcpu does not seem to support decompressing PDFs.

PDFtk server (pdftk)

pdftk uncompress:

pdftk "in.pdf" output "out.pdf" uncompress

PDFtk separates individual PDF dictionary elements by newlines (0A). E.g. a page definition looks like this:

<<
/pdftk_PageNum 7
/Metadata 23 0 R
/Rotate 0
/Resources 24 0 R
/Type /Page
/Parent 25 0 R
/Contents 26 0 R
/MediaBox [0 0 370.158 591.26]
/CropBox [0 0 370.158 591.26]
>>

It also adds its own elements (e.g. /pdftk_PageNum).
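Once the file is uncompressed, page dictionaries like the one above can be inspected with ordinary text tools. As a sketch (the function name list_page_boxes is made up, and the -a and -o options assume GNU or BSD grep), this prints every page box entry found in the file:

```shell
# list_page_boxes FILE.pdf
# Prints each /MediaBox, /CropBox, /ArtBox, /TrimBox and /BleedBox entry
# found in an uncompressed PDF, one per line.
list_page_boxes() {
    # -a treats the (partly binary) PDF as text; -o prints only the matches
    grep -a -o -E '/(Media|Crop|Art|Trim|Bleed)Box *\[[^]]*\]' "$1"
}

# Usage:
#   pdftk "in.pdf" output "out.pdf" uncompress
#   list_page_boxes "out.pdf"
```

This only sees boxes written as literal arrays in uncompressed parts of the file, so treat it as a quick check rather than a full parser.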

QPDF

qpdf --stream-data=uncompress:

qpdf "in.pdf" --stream-data=uncompress "out_qpdf.pdf"

This is effectively equivalent to using both --compress-streams=n and --decode-level=generalized.

There is also another switch, --qdf, which writes the file in QDF mode, a normalized form meant for viewing and editing in a text editor.

QPDF currently does not support maintaining the original object ID.

New cases to come…
