How to Bring PDF Logical Page Numbers into Correspondence with Document Page Numbers
The intended audience of this post is those who are very familiar with the use of the Unix command line. Knowledge of the PostScript and/or PDF formats is useful but unnecessary. I hope to write software to make this task easier for those without the aforementioned knowledge.
One problem that plagues those who often read PDFs scanned from physical documents—and even those who read some digitally-typeset PDFs—is that the page numbers in the document do not correspond with the logical page numbers on the PDF. Before continuing, it is perhaps necessary to introduce the different types of page numbers: document page numbers, those reproduced in the document itself (usually in the header or footer); ordinal page numbers, obtained by counting pages from the first page in the PDF file—these start from 1 and increment by 1 without exception; logical page numbers, those displayed to the user by a PDF reader; and page indices, the zero-based indices used to internally reference pages within the PDF format (equivalent to ordinal page numbers minus one). Without special metadata to define a page number scheme, the logical page numbers of a PDF file are ordinal page numbers. However, many documents don't use ordinal page numbers—they number front matter with Roman numerals, restart page numbering for each section, or do similar things. This post discusses how to fix that incongruence.
The discrepancy between document page numbers and logical page numbers present in many PDF files makes it more difficult to follow internal references within the document, and creates ambiguity when one references the document in other media. Luckily, the PDF format allows one to define logical page numbers independent of the fixed, ordinal page numbers present in every PDF document, and most PDF readers at least support displaying these logical numbers. Unfortunately, I know of no free or open-source program that makes it easy to add logical page numbers to a PDF document.
Semi-automatically fixing PDF page numbering
I am currently writing such a program, to help people who are uncomfortable with or uninterested in the command line or in editing binary files bring the logical page numbers of a PDF into correspondence with the page numbers of the document it contains. That software is not yet complete, but the procedure below, which has served me very well for years, is. Even once working software exists to automate this task, it's worth describing on a technical level how to bring the page numbers of a PDF into correspondence with those of the document.
The manual procedure
You'll need the qpdf
command-line tool and
a text editor that can deal with binary data and large files.
Neovim is confirmed to work in the
latter role, but I suspect other editors will also work.
For the below steps, assume that /kettle/blue/australia.pdf
is our source PDF.
Preliminary steps
First, create a temporary directory:
~% cd $(mktemp -d)
Next, use qpdf
to make the PDF editable.
This uncompresses data within the PDF, normalizes objects, saves relative
references in such a way that they can be restored after data are added or
removed, adds whitespace to the objects containing structured data,
and performs similar modifications to make the PDF file editable (while keeping
it a valid PDF file).
/tmp/tmp.bluekettle% qpdf --qdf /kettle/blue/australia.pdf doc.qdf
Open the file in your editor of choice.
/tmp/tmp.bluekettle% nvim doc.qdf
Look for a section of the file like the one below.
The numbers needn't match; the important part is that there be a line with /Type /Catalog.
%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
>>
endobj
Modify it as such:
%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
  /PageLabels
  << /Nums
    [
      page labels here
    ]
  >>
>>
endobj
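For files where this edit has to be repeated, the same insertion can be scripted. The following is only a sketch, not part of the procedure: `add_page_labels` is a hypothetical helper, and it assumes that the catalog's `/Type /Catalog` entry sits on a line of its own (as in the listing above) and that the file does not already define `/PageLabels`. It works on bytes so that binary stream data elsewhere in the file passes through untouched.

```python
def add_page_labels(qdf: bytes, entries: list[bytes]) -> bytes:
    """Insert a /PageLabels block right after the '/Type /Catalog' line.

    `entries` holds one already-formatted '$index << ... >>' item per element.
    """
    if b"/PageLabels" in qdf:
        raise ValueError("file already defines /PageLabels; edit it by hand")

    lines = qdf.split(b"\n")
    for i, line in enumerate(lines):
        # Assumption: the catalog's type entry is alone on its line.
        if line.strip() == b"/Type /Catalog":
            block = [b"  /PageLabels", b"  << /Nums", b"    ["]
            block += [b"      " + entry for entry in entries]
            block += [b"    ]", b"  >>"]
            lines[i + 1:i + 1] = block  # splice the block in after the match
            return b"\n".join(lines)
    raise ValueError("no '/Type /Catalog' line found")


sample = b"1 0 obj\n<<\n  /Pages 6 0 R\n  /Type /Catalog\n>>\nendobj\n"
patched = add_page_labels(sample, [b"0 << /S /r /St 1 >>"])
print(patched.decode("latin-1"))
```

The result still has stale offsets, exactly as after a manual edit, so it needs the same repair step as a hand-edited file before it is a valid PDF.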
Next, figure out how the pages will be labeled.
Labeling pages
The bracketed [ … ] block after /Nums in the above Catalog object (its lines 9 through 11) constitutes an array in the PDF file format.
Our task is to populate this array with key-value pairs containing the index of
the first page to which a particular numbering scheme applies as the key and the
numbering scheme as the value.
This is not particularly easy.
To keep this as understandable as possible, I will spare you the Backus-Naur form grammars.
Instead, values that the user must substitute are written as $single_words_with_prepended_dollar_signs,
optional components are enclosed in square brackets [like this],
and potentially repeatable components are followed by an ellipsis (like this…).
Every other syntactical element comes from the deeply weird syntax of
PostScript/PDF files.
Each key-value pair is formatted like $index << $numscheme >>,
and $numscheme is itself a set of key-value pairs describing the page
numbering scheme that applies from that index onward.
Indices are zero-based, so subtract one from the page numbers you get from a
normal PDF reader.
The pairs must be listed in ascending order of index, and the first pair must
be for index 0.
Constructing a numbering scheme
This section is a simplified version of what is provided for by the PDF standard document, but should be adequate for nearly all purposes.
Written out together with its enclosing << >>, a numbering scheme has the form << $key $val [$key2 $val2]… >>,
where $key and $key2 are keys from this list:
- /S is the numbering style: a $val of:
  - /D selects decimal Arabic numerals (e.g. 1, 2, 3, …)
  - /R selects uppercase Roman numerals (e.g. I, II, III, …)
  - /r selects lowercase Roman numerals (e.g. i, ii, iii, …)
  - /A selects uppercase Latin-alphabet letters (e.g. A, B, C, …)
  - /a likewise but lowercase (e.g. a, b, c)

  If the /S key is omitted, page numbers will consist solely of the prefix (or be blank).
- /P is a prefix: $val will be prepended to the page number (or will be the entire page number if /S is omitted). Note that in PostScript (from which PDF is derived), strings are formatted (like this), with parentheses around the text.
- /St is the number at which page numbering starts. Successive page numbers will be derived from incrementing this number. Note that this value must be \(\ge 1\). Also, this value must always be an Arabic-numeral integer, even when such are not used for page numbering.
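The rules in this list can be captured in a small helper that builds the text of one numbering scheme and rejects values the list rules out. This is a sketch under assumptions: `numscheme` and `pdf_string` are hypothetical names, and prefixes are assumed to be plain ASCII. In a PDF literal string, backslashes and parentheses inside the text have to be escaped with a backslash, which `pdf_string` takes care of.

```python
VALID_STYLES = {"D", "R", "r", "A", "a"}


def pdf_string(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def numscheme(style=None, prefix=None, start=None) -> str:
    """Return '<< ... >>' for one numbering scheme (hypothetical helper)."""
    parts = ["<<"]
    if style is not None:
        if style not in VALID_STYLES:
            raise ValueError(f"unknown numbering style: {style!r}")
        parts.append("/S /" + style)
    if prefix is not None:
        parts.append("/P " + pdf_string(prefix))
    if start is not None:
        # /St is always a decimal integer of at least 1, whatever the style.
        if not isinstance(start, int) or start < 1:
            raise ValueError("start must be an integer >= 1")
        parts.append(f"/St {start}")
    parts.append(">>")
    return " ".join(parts)


print(numscheme(style="r", start=1))
print(numscheme(prefix="A-", style="D", start=1))
print(numscheme(prefix="FC"))
```

The order of keys inside `<< >>` does not matter to a PDF reader, so the helper emits them in a fixed order for readability only.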
Labeling pages: the basics for those in a hurry
The following is appropriate for most conventionally-formatted documents:
%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
  /PageLabels
  << /Nums
    [
      0 << /P (FC) >>
      1 << /P (IFC) >>
      2 << /S /r /St 1 >>
      $x << /S /D /St 1 >>
      $y << /P (IBC) >>
      $z << /P (BC) >>
    ]
  >>
>>
endobj
Replace $x with the zero-based page number of the first page with Arabic
numbering, $y with the index of the page containing the inner back cover, and
$z with $y+1.
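Since PDF readers show one-based page numbers, the subtraction is easy to get wrong by one. A minimal sketch of the substitution, assuming the layout of the template above (front cover, inner front cover, Roman-numbered front matter, Arabic-numbered body, two back-cover pages); `conventional_nums` is a hypothetical helper that takes the page numbers as a PDF reader displays them for an unlabeled file:

```python
def conventional_nums(first_arabic_page: int, inner_back_cover_page: int) -> list[str]:
    """Fill in $x, $y and $z from one-based (ordinal) page numbers."""
    x = first_arabic_page - 1        # index of the first Arabic-numbered page
    y = inner_back_cover_page - 1    # index of the inner back cover
    z = y + 1                        # the back cover follows immediately
    if not 2 < x < y:
        raise ValueError("expected covers, then front matter, then body pages")
    return [
        "0 << /P (FC) >>",
        "1 << /P (IFC) >>",
        "2 << /S /r /St 1 >>",
        f"{x} << /S /D /St 1 >>",
        f"{y} << /P (IBC) >>",
        f"{z} << /P (BC) >>",
    ]


# Example: Arabic numbering starts on the 11th page of the file and the
# inner back cover is the 211th page.
for entry in conventional_nums(11, 211):
    print(entry)
```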
A somewhat in-depth example
Consider a document with the following ordinal and document page numbers.
Ordinal page number | Document page number |
---|---|
1 | Front cover |
2 | Inner front cover |
3 | Spine |
4 through 13 | i through x |
14 through 19 | 1-1 through 1-6 |
20 through 49 | 2-1 through 2-30 |
50 through 59 | A-1 through A-10 |
60 | Inner back cover |
61 | Back cover |
We can align the logical page numbers to the document page numbers by defining
the following /Catalog block.
%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
  /PageLabels
  << /Nums
    [
      0 << /P (FC) >>
      1 << /P (IFC) >>
      2 << /P (Spine) >>
      3 << /S /r /St 1 >>
      13 << /P (1-) /S /D /St 1 >>
      19 << /P (2-) /S /D /St 1 >>
      49 << /P (A-) /S /D /St 1 >>
      59 << /P (IBC) >>
      60 << /P (BC) >>
    ]
  >>
>>
endobj
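To check an array like this before opening the file in a viewer, it helps to compute the label each page should receive. The sketch below mimics how a PDF reader resolves labels: find the last entry whose index is not greater than the page's index, then count up from that entry's start value. It is an illustration only; `RANGES` is transcribed by hand from the array above, `label_for` is a hypothetical helper, and only the /D and /r styles used in this example are handled.

```python
from bisect import bisect_right


def roman(n: int) -> str:
    """Lowercase Roman numeral for a positive integer."""
    table = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, glyph in table:
        count, n = divmod(n, value)
        out.append(glyph * count)
    return "".join(out)


# (first index, prefix, style, start), transcribed from the /Nums array.
RANGES = [
    (0, "FC", None, 1),
    (1, "IFC", None, 1),
    (2, "Spine", None, 1),
    (3, "", "r", 1),
    (13, "1-", "D", 1),
    (19, "2-", "D", 1),
    (49, "A-", "D", 1),
    (59, "IBC", None, 1),
    (60, "BC", None, 1),
]


def label_for(ordinal: int) -> str:
    """Logical page label for a one-based ordinal page number."""
    index = ordinal - 1
    starts = [r[0] for r in RANGES]
    # The applicable range is the last one starting at or before this index.
    first, prefix, style, start = RANGES[bisect_right(starts, index) - 1]
    number = start + (index - first)
    if style == "D":
        return prefix + str(number)
    if style == "r":
        return prefix + roman(number)
    return prefix  # no /S key: the label is the prefix alone


for page in (1, 4, 13, 14, 19, 20, 49, 50, 59, 60, 61):
    print(page, label_for(page))
```

Comparing the printed labels for the first and last page of each range against the printed document is usually enough to catch an off-by-one index.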
Cleaning up
Run fix-qdf to fix object references.
It reads its input from stdin and writes its output to stdout, so use it with I/O
redirection:
/tmp/tmp.bluekettle% fix-qdf <doc.qdf >doc-prelim.pdf
Open doc-prelim.pdf in a PDF viewer and check that page numbers are right.
If not, edit doc.qdf and re-run fix-qdf.
Once page numbers are in correspondence,
save the output of fix-qdf.
This output is a fully-compliant PDF document suitable for use and storage.
Since qpdf --qdf decompresses parts of the input file and adds whitespace,
and because fix-qdf doesn't strip out any of that,
the resulting file will be bigger than the input.
However, in my experience the increase in file size is negligible and not worth
recompressing the final PDF file to eliminate.
Thanks to TheMagicalC for proofreading an earlier version of this post.