A Modern PDF Cleanup Workflow
This note provides a workflow for taking a less than optimized PDF and optimizing it for viewing and printing. It isn’t a cure-all for sick PDF’s, but it does work for a lot of them. I’ve struggled with badly scanned PDF’s for a long time and this workflow represents my current best approach.
The note also provides a cookbook of solutions to problems I have run up against and the solutions that I currently use to address those problems.
Caveats
Not many, but they’re important to note
- Every PDF is unique and no one solution fits them all
- The workflow is a MacOS workflow. It should translate to the BSD’s and Linuxes without much difficulty. If you’re on Windows, YMMV.
Packages
A variety of tools are useful if you are going to work with scanned images and pdfs. This note uses PhotoScape X to deal with Color adjustments. Feel free to use your tool of choice. If you can achieve good results with one of the other packages, drop me a line, I’ll happily change my workflow.
- ghostscript - pdf tools
- ImageMagick - conversion tools for images
- libtiff - tiff tools
- mupdf - more pdf tools
- poppler - more pdf tools
- pandoc - coversion tools for documents
- PhotoScape X - app that does nice batch operations on images
- tesseract and tesseract-eng - OCR tools
Install Packages
Get PhotoScape X. It’s available in the Apple App Store and the free version works great. I just downloaded it and archived off a copy of the app for later use as neeed. I am not a fan of apps, but this one is too good to ignore.
The other packages are available via macports:
sudo port install ghostscript ImageMagick libtiff mupdf poppler pandoc tesseract tesseract-eng
Short Version
Here is the short version, keep reading afterward for details and the cookbook. There are lots of details to follow. This pdf stuff is tricky. Testing is very time consuming and great results are hard to obtain. This is my current workflow for taking a less than optimal pdf and improving it. Caveats apply.
1. Phase One
- Create a work area
- Copy in a pdf
-
Extract tiffs
mkdir -p ~/pdf-work/{input,output,output-photoscape,output-small,output-ocr} cd ~/pdf-work/input cp ~/Desktop/input.pdf . cd ../input pdfimages -tiff -p input.pdf ../output/output
2. Phase Two
- Use PhotoScape to adjust colors
- Open PhotoScape X
- Click Batch
- Drag the output folder into the window
- Choose Color->Grayscale
- Magic Color
- Lighten Shadows - 100
- Darken Highlights - 50
- Click SAVE
- Change the Image Format to TIFF
- Select DPI and change it to 150
- Choose a custom destination to put the output (output-photoscape)
- Click OK
Phase Three
- Resize and compress the tiffs
- Combine the tiffs into a single tiff
-
OCR the single tiff and produce the OCR’ed PDF
cd ../output-photoscape for i in *.tiff; do convert $i -resize 1200x -compress zip ../output-small/$i.tiff;done cd ../output-small tiffcp *.tiff ../input/multi-image-input.tiff cd ../input tesseract multi-image-input.tiff ../output -l eng PDF
The result is found in output.pdf
The Details
The following gets into the details about working with a not-so-great scanned pdf, trying to make it better and more useful - cleaner looking and with OCR. In the following discussion, I will be using a copy of Adrian Nye’s Volume 4 of The Definitive Guides to the X Window System about Xt Intrinsics from archive.org. This PDF is particularly suited to being reworked. It has huge images, it’s color, the color is neither needed, nor clear, it isn’t OCR’ed, and it’s a great book.
Tools provided by the packages
We will use a variety of tools in the exploration. Here is a summary list:
- convert from the package ImageMagick, convert between image formats
- gs from the package ghostscript, extract images from pdf (single image tiffs)
- mutool from the package mupdf - get info about images in pdf
- pandoc from the package pandoc, convert document formats (pdf, html, markdown, latex, etc)
- pdfimages from the package poppler, extract images from pdf (multi image tiff) and get info about images in pdf
- pdfinfo from the package poppler, get info from pdf
- pdfunite from the package poppler, combine pdfs into a single pdf
- PhotoScape X from the app PhotoScape X, color adjustment
- tesseract from the package tesseract, OCR of images and creation of pdf
- tiffcp from the package libtiff, combinine single-image tiffs into a multi-image tiff
- tiffsplit from the package libtiff, split multi-image tiffs into single-image tiffs
1. Setting up a work area
Make a directory to work from (preferably on an SSD) with input and output subdirs, change into it, download our work of interest, and copy it to input.pdf to use as our source.
mkdir -p ~/pdf-work/{input,output,output-photoscape,output-small,output-ocr}
cd ~/pdf-work
aria2c https://archive.org/download/xtoolkitintrinsic04nyemiss/xtoolkitintrinsic04nyemiss_200KB_jp2.pdf -o nye-vol-04.pdf
cp nye-vol-04.pdf input/input.pdf
You might want to just use preview to drag a sampling of the pages into a pdf that you name input.pdf and work with that until you’re convinced you want to work with the many, many page input.pdf :).
2. Getting Information from a PDF
Let’s take a look at the meta-data about the PDF using pdfinfo from the poppler package.
pdfinfo input.pdf
...
Producer: iText 1.3 by lowagie.com (based on itext-paulo-153)
CreationDate: Sun Jan 8 00:02:44 2006 CST
ModDate: Sun Jan 8 00:02:44 2006 CST
Custom Metadata: no
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 622
Encrypted: no
Page size: 475 x 637 pts
Page rot: 0
File size: 127682751 bytes
Optimized: no
PDF version: 1.5
The first things to notice are Pages, Page size, and File size:
Pages: 622
Page size: 475 x 637 pts
File size: 127682751 bytes
This is one big pdf!
Note that points are 1/72 of an inch. We can use the units
command to find the conversion factors:
units point in
* 0.013888889
/ 72
We can then do the math to figure out the size, in inches, of the pdf:
echo '475/72' | bc -l
6.59722222222222222222
echo '637/72' | bc -l
8.84722222222222222222
So, it’s a 6.5x9 pdf (I’m not convinced, but that’s what the pdf thinks it is, so we’ll roll with it).
3. Getting information about a PDF’s images
For this, we can use mutool from the mupdf package and pdfimages from the poppler package.
Let’s start with mutool. this utility will display information about all of the images in a pdf, so be prepared for some lenghty output:
mutool info input.pdf
input.pdf:
PDF-1.5
Info object (1939 0 R):
<</CreationDate(D:20060108060244Z)/Producer(iText 1.3 by lowagie.com \(based on itext-paulo-153\))/ModDate(D:20060108060244Z)>>
Pages: 622
Retrieving info from pages 1-622...
Mediaboxes (15):
1 (4 0 R): [ 0 0 475 637 ]
2 (7 0 R): [ 0 0 455 630 ]
42 (131 0 R): [ 0 0 461 629 ]
48 (149 0 R): [ 0 0 464 632 ]
50 (155 0 R): [ 0 0 457 632 ]
52 (162 0 R): [ 0 0 468 637 ]
54 (168 0 R): [ 0 0 466 636 ]
74 (230 0 R): [ 0 0 471 639 ]
128 (397 0 R): [ 0 0 465 637 ]
162 (503 0 R): [ 0 0 458 642 ]
198 (614 0 R): [ 0 0 461 639 ]
200 (620 0 R): [ 0 0 468 642 ]
316 (980 0 R): [ 0 0 468 638 ]
332 (1030 0 R): [ 0 0 471 638 ]
490 (1519 0 R): [ 0 0 463 638 ]
Images (622):
1 (4 0 R): [ JPX ] 2644x3542 1bpc ImageMask (1 0 R)
2 (7 0 R): [ JPX ] 2528x3502 1bpc ImageMask (5 0 R)
...
There is a lot of information being displayed. But, we’re mostly concerned with the image resolutions at this point.
Pixels are picture elements and they don’t readily convert to more intuitive units like inches. But, we can do a conversion for the monitor that’ll give us a hint as to the actual size.
First, let’s get the dpi information from X (go get XQuartz and install it, if you don’t already have it):
xdpyinfo | grep dots
resolution: 96x96 dots per inch
and
xrandr | grep -w connected
default connected 2560x1440+0+0 0mm x 0mm
This tells us how many dots per inch our monitor has and how many dots there are both horizontally and vertically.
Converting to inches is pretty straightforward since we now know the DPI and dots:
echo '2560/96' | bc -l
26.66666666666666666666
(base) nebula:~ wsenn$ echo '1440/96' | bc -l
15.00000000000000000000
This monitor is 27 x 15… get the measuring tape out… I wish, it’s actually 23.5 x 13.25. Prolly some weird pixel thing… Not worth worrying about, it’s close enough :). Email me if you know something useful here.
The question is how big is that image in relation to what we know about the monitor:
echo '2600/96' | bc -l
27.08333333333333333333
echo '3500/96' | bc -l
36.45833333333333333333
27 in x 36.5 in
Wow! That’s huge :). It’s about the same width as my monitor and more than twice as tall. Way more than we need for viewing or printing.
Similarly, we can get image information using pdfimages from the poppler package:
pdfimages -list input.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2644 3542 rgb 3 8 jpx no 1 0 401 400 200K 0.7%
2 1 image 2528 3502 rgb 3 8 jpx no 5 0 400 400 200K 0.8%
pdfimages is a lot slower than mutools, but we do get some additonal information making it a good utility to have around.
2. Extracting images from a PDF
To extract images, we can use pdfimages from the poppler package:
cd ~/pdf-work/input
pdfimages -tiff -p input.pdf ../output/output
This will extract all of the images from the pdf and put them into ../output. The images will be named ‘output-XXX-YYY.tif’. If there are a lot of big images, it will take a while… and use a lot of disk space to extract them all, so be patient.
3. Tweaking colors
I’m sure there are better ways to do this, I just haven’t figured them out yet. Email me with your tips. The command-line tools don’t seem to know about “magic color”. Twiddle as you like with Lighten/Darken, etc.
To tweak colors, we can use PhotoScape X and preview our tweaks in real-time:
- Open PhotoScape X
- Click Batch
- Drag the output folder into the window
- Choose Color->Grayscale
- Magic Color
- Lighten Shadows - 100
- Darken Highlights - 50
- Click SAVE
- Change the Image Format to TIFF
- Select DPI and change it to 150
- Choose a custom destination to put the output (output-photoscape)
- Click OK
Be patient. It’ll work.
4. Resizing images
I am going to resize the images to 1200 pixels wide and I’m going to preserve the scaling. I am also going to compress the images. You may want to tweak the size. We will use the convert utility from the ImageMagick package:
cd ~/pdf-work/output-photoscape
for i in *.tiff; do convert $i -resize 1200x -compress zip ../output-small/$i.tiff;done
1200x might not be your speed, just change it as you see fit. I’m still trying to figure out an optimal size.
Be patient, it’ll take a bit.
5. Recombining single-image tiffs
To combine a bunch of individual tiffs into a multi-image tiff suitable for additional processing, we will use tiffcp from the libtiff package:
cd ~/pdf-work/output-small
tiffcp *.tiff ../input/multi-image-input.tiff
6. OCRing the images
We will be using tesseract from the tesseract package to perform OCR on our images. tesseract will work with either single-image tiffs or with a multi-image tiff. I will show both options, but for the workflow, we will only be concerned with the multi-image tiff version (a single file with many images).
Option 1. Multi-image tiffs
cd ~/pdf-work/input
tesseract multi-image-input.tiff ../output -l eng PDF
Option 2. Single-image tiffs
Don’t do this if you did option 1!
cd ~/pdf-work/output-small
for i in *.tiff; do tesseract $i ../output-ocr/$i -l eng PDF;done
Using option 2, there will be lots of pdfs to combine. We will use pdfunite from the poppler package to do the combination:
cd ~/pdf-work/output-ocr
pdfunite *.pdf ../output.pdf
Either option will take quite a while and both will produce an OCR’ed PDF.
The result is a cleaner, better looking, and more functional PDF.
That’s it for the exploration! On to the cookbook!
Cookbook Solutions to various problems
This section provides snippets solving specific problems arising while working with pdf’s and images.
Extract images to single-image tiff files
pdfimages -tiff -p input.pdf output
This creates a bunch of tiff images named output-XXX-YYY.tif
Extract images to a multi-image tiff file
gs -q -dNOPAUSE -dBATCH -sDEVICE=tifflzw -sPAPERSIZE=letter \
-sOutputFile=output.tiff input.pdf
This will extract all of the images in a pdf into a single .tiff
A useful option to keep in mind is -r for resolution. A setting like -r300 specifies a desired DPI.
Combine single-image tiffs into a multi-image tiff
tiffcp -c zip *.tif ../output.tiff
This combines all of the .tif files into a single .tiff
Adjust Colors of Multiple Images at Once
- open PhotoScape X to batch correct the color
- click batch
- add in your image folder
- adjust colors
- save the results
This creates a bunch of tiff images named whatever-XXX-YYY.tiff
OCR a .tiff file and produce a PDF
The accuracy of tesseract rivals adobe now… finally.
tesseract input.tiff output -l eng PDF
Split a multi-image tiff into single-image tiffs
tiffsplit input.tiff
This will extract all of the images from the tiff into multiple tiffs with funky names like xaaa.tif xaab.tif and so on, but it does what it says :).
Resize multiple tiffs
for i in *.tif; do convert $i -resize 1200x ../$i.tiff;done
This resizes tiffs to 1200xwhatever preserving the scale.
Resize multi-image tiff
convert input.tiff -resize 1200x output.tiff
This does the same thing for a multi-image tiff.
Join PDFs
Option 1 - using pdfunit from the poppler package
pdfunite *.pdf ../output.pdf
This joins all of the pdf files in a directory into a single pdf. It presumes that the pdfs are numbered appropriately so they are in sort order.
Option 2 - using MacOS’s delivered script
- deactivate any python environments you have running that aren’t the system python
- run the script
conda deactivate
python '/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py' -o 'senn_w_database_project.pdf' [your list of pdfs]
Reach out to me if you find any issues or have suggestions.
- will
post last updated 2023-02-01 20:39:00 -0600