How to Bring PDF Logical Page Numbers into Correspondence with Document Page Numbers
The intended audience of this post is those who are very familiar with the use of the Unix command line. Knowledge of the PostScript and/or PDF formats is useful but unnecessary. I hope to write software to make this task easier for those without the aforementioned knowledge.
One problem that plagues those who often read PDFs scanned from physical documents—and even those who read some digitally-typeset PDFs—is that the page numbers in the document do not correspond with the logical page numbers on the PDF. Before continuing, it is perhaps necessary to introduce the different types of page numbers: document page numbers, those reproduced in the document itself (usually in the header or footer); ordinal page numbers, obtained by counting pages from the first page in the PDF file—these start from 1 and increment by 1 without exception; logical page numbers, those displayed to the user by a PDF reader; and page indices, the zero-based indices used to internally reference pages within the PDF format (equivalent to ordinal page numbers minus one). Without special metadata to define a page number scheme, the logical page numbers of a PDF file are ordinal page numbers. However, many documents don't use ordinal page numbers—they number front matter with Roman numerals, restart page numbering for each section, or do similar things. This post discusses how to fix that incongruence.
The discrepancy between document page numbers and logical page numbers present in many PDF files makes it more difficult to follow internal references within the document, and creates ambiguity when one references the document in other media. Luckily, the PDF format allows one to define logical page numbers independent of the fixed, ordinal page numbers present in every PDF document, and most PDF readers at least support displaying these logical numbers. Unfortunately, I know of no free or open-source program that makes it easy to add logical page numbers to a PDF document.
Semi-automatically fixing PDF page numbering
I am currently writing such a program, so that people uncomfortable with or uninterested in command-line usage or editing binary files can bring the logical page numbers of a PDF into correspondence with the page numbers of the document it contains. That software is not yet complete, but the procedure below, which has served me very well for years, is. Even once working software exists to automate this task, it's worth describing on a technical level how to bring the page numbers of a PDF into correspondence with those of the document.
The manual procedure
You'll need the qpdf command-line tool and
a text editor that can deal with binary data and large files.
Neovim is confirmed to work in the
latter role, but I suspect other editors will also work.
For the below steps, assume that /kettle/blue/australia.pdf is our source PDF.
Preliminary steps
First, create a temporary directory:

    ~% cd $(mktemp -d)

Next, use qpdf to make the PDF editable.
This uncompresses data within the PDF, normalizes objects, saves relative
references in such a way that they can be restored after data are added or
removed, adds whitespace to the objects containing structured data,
and performs similar modifications to make the PDF file editable (while keeping
it a valid PDF file).

    /tmp/tmp.bluekettle% qpdf --qdf /kettle/blue/australia.pdf doc.qdf

Open the file in your editor of choice.

    /tmp/tmp.bluekettle% nvim doc.qdf

Look for a section of the file like the one below.
The numbers needn't match—the important part is that there be a line with
/Type /Catalog.

    %% Original object ID: 1178 0
    1 0 obj
    <<
      /Metadata 3 0 R
      /Pages 6 0 R
      /Type /Catalog
    >>
    endobj

Modify it as such:

    %% Original object ID: 1178 0
    1 0 obj
    <<
      /Metadata 3 0 R
      /Pages 6 0 R
      /Type /Catalog
      /PageLabels
      << /Nums
        [
          page labels here
        ]
      >>
    >>
    endobj

Next, figure out how the pages will be labeled.
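As an aside, the catalog edit shown above could also be scripted once the page labels (covered next) are worked out. The following is only a rough sketch, not part of qpdf: `add_page_labels` is a hypothetical helper, and it assumes that the QDF file keeps `/Type /Catalog` on a line of its own, as in the excerpt above.

```python
import re

# Assumption: qpdf's QDF output writes "/Type /Catalog" on its own line,
# as in the excerpt above. add_page_labels is a hypothetical helper.
CATALOG_LINE = re.compile(rb"^([ \t]*)/Type /Catalog[ \t]*$", re.MULTILINE)


def add_page_labels(qdf: bytes, nums_body: bytes) -> bytes:
    """Return the QDF data with a /PageLabels entry added to the catalog.

    nums_body is the text that goes between the square brackets of /Nums.
    """
    if b"/PageLabels" in qdf:
        # Refuse to create a duplicate key; such files need a manual edit.
        raise ValueError("file already mentions /PageLabels; edit it by hand")
    match = CATALOG_LINE.search(qdf)
    if match is None:
        raise ValueError("no /Type /Catalog line found")
    indent = match.group(1)  # reuse the indentation of the matched line
    entry = b"\n" + indent + b"/PageLabels << /Nums [ " + nums_body + b" ] >>"
    # Insert the new entry right after the "/Type /Catalog" line.
    return qdf[:match.end()] + entry + qdf[match.end():]
```

Applied to doc.qdf, this would mean reading the file in binary mode, writing the result back in binary mode, and then running fix-qdf as described under "Cleaning up".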
Labeling pages
The square brackets after /Nums in the above Catalog object delimit an array in the PDF file
format.
Our task is to populate this array with key-value pairs containing the index of
the first page to which a particular numbering scheme applies as the key and the
numbering scheme as the value.
This is not particularly easy.
In order that this may be made as understandable as possible, I will spare the
Backus-Naur form
grammars and instead present variables that must have values substituted in by
the user as $single_words_with_prepended_dollar_signs,
optional components with
square brackets [like this],
and potentially repeatable components with an ellipsis, like this….
Every other syntactical element comes from the deeply weird syntax of
PostScript/PDF files.
Each key-value pair is formatted like $index $numscheme,
where $numscheme is a dictionary, delimited by << and >>, whose own key-value pairs describe the page
numbering scheme that applies from that page onward.
Indices are zero-based, so subtract one from the page numbers you get from a
normal PDF reader.
Constructing a numbering scheme
This section simplifies what the PDF standard provides for, but it should be adequate for nearly all purposes.
Numbering schemes have the form << $key $val [$key2 $val2]… >>,
where $key and $key2 are keys from this list:
- /S is numbering style: a $val of:
  - /D selects decimal Arabic numerals (e.g. 1, 2, 3, …)
  - /R selects uppercase Roman numerals (e.g. I, II, III, …)
  - /r selects lowercase Roman numerals (e.g. i, ii, iii, …)
  - /A selects uppercase Latin-alphabet letters (e.g. A, B, C, …)
  - /a likewise but lowercase (e.g. a, b, c)

  If a /S key is omitted, page numbers will consist solely of the prefix (or be blank).
- /P is a prefix: $val will be prepended to the page number (or will be the entire page number if /S is omitted). Note that in PostScript (from which PDF is derived), strings are formatted (like this), with parentheses around the text.
- /St is the number at which page numbering starts. Successive page numbers will be derived from incrementing this number. Note that this value must be \(\ge 1\). Also, this value must always be an Arabic-numeral integer, even when such are not used for page numbering.
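To make the meaning of these keys concrete, here is a small sketch of how a label could be derived from a scheme. It is not taken from any particular PDF reader, and the function names are my own. One detail comes from the PDF specification rather than from the list above: the letter styles continue with AA, BB, and so on after Z.

```python
from typing import Optional

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    """Uppercase Roman numeral for n >= 1."""
    out = []
    for value, glyph in ROMAN:
        count, n = divmod(n, value)
        out.append(glyph * count)
    return "".join(out)


def to_letters(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    letter = chr(ord("A") + (n - 1) % 26)
    return letter * ((n - 1) // 26 + 1)


def page_label(offset: int, style: Optional[str] = None,
               prefix: str = "", start: int = 1) -> str:
    """Label of the page `offset` pages after the first page of a range.

    style mirrors /S (without the slash), prefix mirrors /P, start mirrors /St.
    """
    if start < 1:
        raise ValueError("/St must be at least 1")
    if style is None:
        return prefix  # no /S key: the label is the prefix alone
    n = start + offset
    numeral = {
        "D": str(n),
        "R": to_roman(n),
        "r": to_roman(n).lower(),
        "A": to_letters(n),
        "a": to_letters(n).lower(),
    }[style]
    return prefix + numeral
```

For instance, `page_label(5, "D", prefix="1-")` describes the sixth page of a range written as `<< /P (1-) /S /D /St 1 >>`.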
Labeling pages: the basics for those in a hurry
The following is appropriate for most conventionally-formatted documents:

    %% Original object ID: 1178 0
    1 0 obj
    <<
      /Metadata 3 0 R
      /Pages 6 0 R
      /Type /Catalog
      /PageLabels
      << /Nums
        [
          0 << /P (FC) >>
          1 << /P (IFC) >>
          2 << /S /r /St 1 >>
          $x << /S /D /St 1 >>
          $y << /P (IBC) >>
          $z << /P (BC) >>
        ]
      >>
    >>
    endobj

Replace $x with the zero-based page number of the first page with Arabic
numbering, $y with the index of the page containing the inner back cover, and
$z with $y+1.
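Since the off-by-one between ordinal page numbers and indices is easy to get wrong, the entries can also be generated from the ordinal page numbers a PDF reader shows. This is a minimal sketch under my own naming (`pdf_string`, `nums_entry`, and `basic_nums` are hypothetical helpers), and it only covers the /P, /S, and /St keys described above.

```python
from typing import List, Optional


def pdf_string(text: str) -> str:
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def nums_entry(ordinal_page: int, style: Optional[str] = None,
               prefix: Optional[str] = None, start: Optional[int] = None) -> str:
    """Format one /Nums pair from a one-based ordinal page number."""
    if ordinal_page < 1:
        raise ValueError("ordinal page numbers start at 1")
    if style is not None and style not in ("D", "R", "r", "A", "a"):
        raise ValueError("unknown numbering style")
    if start is not None and start < 1:
        raise ValueError("/St must be at least 1")
    parts = []
    if prefix is not None:
        parts.append("/P " + pdf_string(prefix))
    if style is not None:
        parts.append("/S /" + style)
    if start is not None:
        parts.append("/St " + str(start))
    body = " ".join(parts)
    # The index is zero-based, hence the subtraction.
    return f"{ordinal_page - 1} << {body} >>"


def basic_nums(first_arabic_page: int, inner_back_cover_page: int) -> List[str]:
    """Entries for the conventional layout above, from ordinal page numbers."""
    return [
        nums_entry(1, prefix="FC"),
        nums_entry(2, prefix="IFC"),
        nums_entry(3, style="r", start=1),
        nums_entry(first_arabic_page, style="D", start=1),
        nums_entry(inner_back_cover_page, prefix="IBC"),
        nums_entry(inner_back_cover_page + 1, prefix="BC"),
    ]


print("\n".join(basic_nums(first_arabic_page=13, inner_back_cover_page=119)))
```

The printed lines are meant to be pasted between the square brackets of /Nums. The page numbers 13 and 119 are made up for illustration.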
A somewhat in-depth example
Consider a document with the following ordinal and document page numbers.
| Ordinal page number | Document page number |
|---|---|
| 1 | Front cover |
| 2 | Inner front cover |
| 3 | Spine |
| 4 through 13 | i through x |
| 14 through 19 | 1-1 through 1-6 |
| 20 through 49 | 2-1 through 2-30 |
| 50 through 59 | A-1 through A-10 |
| 60 | Inner back cover |
| 61 | Back cover |
We can align the logical page numbers to the document page numbers by defining
the following /Catalog block.

    %% Original object ID: 1178 0
    1 0 obj
    <<
      /Metadata 3 0 R
      /Pages 6 0 R
      /Type /Catalog
      /PageLabels
      << /Nums
        [
          0 << /P (FC) >>
          1 << /P (IFC) >>
          2 << /P (Spine) >>
          3 << /S /r /St 1 >>
          13 << /P (1-) /S /D /St 1 >>
          19 << /P (2-) /S /D /St 1 >>
          49 << /P (A-) /S /D /St 1 >>
          59 << /P (IBC) >>
          60 << /P (BC) >>
        ]
      >>
    >>
    endobj

Cleaning up
Run fix-qdf to fix object references.
It reads its input from stdin and writes its output to stdout, so use it with I/O
redirection:

    /tmp/tmp.bluekettle% fix-qdf <doc.qdf >doc-prelim.pdf

Open doc-prelim.pdf in a PDF viewer and check that page numbers are right.
If not, edit doc.qdf and re-run fix-qdf.
Once page numbers are in correspondence,
save the output of fix-qdf.
This output is a fully-compliant PDF document suitable for use and storage.
Since qpdf --qdf decompresses parts of the input file and adds whitespace,
and because fix-qdf doesn't strip out any of that,
the resulting file will be bigger than the input.
However, in my experience the file size gained is negligible and not worth
recompressing the final PDF file to eliminate.
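When comparing the viewer against the document, it can help to have the expected label of every page written out beforehand. The sketch below expands a list of ranges into per-page labels. It uses the ranges from the in-depth example above, handles only the decimal and lowercase Roman styles that example needs, and assumes every range starts counting at 1. The names `RANGES` and `expected_label` are my own.

```python
from bisect import bisect_right

# (zero-based start index, prefix, style) for each range in the in-depth
# example; a style of None means the label is the prefix alone.
RANGES = [
    (0, "FC", None), (1, "IFC", None), (2, "Spine", None),
    (3, "", "r"), (13, "1-", "D"), (19, "2-", "D"), (49, "A-", "D"),
    (59, "IBC", None), (60, "BC", None),
]

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
         (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
         (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]


def lower_roman(n):
    """Lowercase Roman numeral for n >= 1."""
    out = []
    for value, glyph in ROMAN:
        count, n = divmod(n, value)
        out.append(glyph * count)
    return "".join(out)


def expected_label(page_index, ranges=RANGES):
    """Label for a zero-based page index, taken from the last range that starts at or before it."""
    starts = [start for start, _, _ in ranges]
    pos = bisect_right(starts, page_index) - 1
    if pos < 0:
        raise ValueError("page precedes the first range")
    start, prefix, style = ranges[pos]
    n = page_index - start + 1  # every range here starts counting at 1
    if style is None:
        return prefix
    if style == "D":
        return prefix + str(n)
    if style == "r":
        return prefix + lower_roman(n)
    raise ValueError("style not handled by this sketch")


# Print ordinal page number and expected label for the 61-page example.
for ordinal in range(1, 62):
    print(ordinal, expected_label(ordinal - 1))
```

Looking ranges up with a binary search over the start indices mirrors how the /Nums array is meant to be read: each entry applies until the next one begins.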
Thanks to TheMagicalC for proofreading an earlier version of this post.