How to Bring PDF Logical Page Numbers into Correspondence with Document Page Numbers

Nov 11, 2024

The intended audience of this post is those who are very familiar with the use of the Unix command line. Knowledge of the PostScript and/or PDF formats is useful but unnecessary. I hope to write software to make this task easier for those without the aforementioned knowledge.

One problem that plagues those who often read PDFs scanned from physical documents, and even those who read some digitally typeset PDFs, is that the page numbers in the document do not correspond with the logical page numbers of the PDF. Before continuing, it is perhaps necessary to introduce the different types of page numbers:

- document page numbers, those reproduced in the document itself (usually in the header or footer);
- ordinal page numbers, obtained by counting pages from the first page in the PDF file (these start from 1 and increment by 1 without exception);
- logical page numbers, those displayed to the user by a PDF reader; and
- page indices, the zero-based indices used to internally reference pages within the PDF format (equivalent to ordinal page numbers minus one).

Without special metadata to define a page number scheme, the logical page numbers of a PDF file are ordinal page numbers. However, many documents don't use ordinal page numbers: they number front matter with Roman numerals, restart page numbering for each section, or do similar things. This post discusses how to fix that incongruence.

The discrepancy between document page numbers and logical page numbers present in many PDF files makes it more difficult to follow internal references within the document, and creates ambiguity when one references the document in other media. Luckily, the PDF format allows one to define logical page numbers independent of the fixed, ordinal page numbers present in every PDF document, and most PDF readers at least support displaying these logical numbers. Unfortunately, I know of no free or open-source program that makes it easy to add logical page numbers to a PDF document.

Semi-automatically fixing PDF page numbering

I am currently writing such a program to make it easier for people uncomfortable with or uninterested in command-line usage or editing binary files to bring the logical page numbers within a PDF into correspondence with the page numbers of the document contained within. That software is not yet complete, but the procedure below, which has served me very well for years, is. Even once working software exists to automate this task, it's worth describing on a technical level how to bring the page numbers of a PDF into correspondence with those of the document.

The manual procedure

You'll need the qpdf command-line tool and a text editor that can deal with binary data and large files. Neovim is confirmed to work in the latter role, but I suspect other editors will also work.

For the below steps, assume that /kettle/blue/australia.pdf is our source PDF.

Preliminary steps

First, create a temporary directory:

~% cd $(mktemp -d)

Next, use qpdf to make the PDF editable. This uncompresses data within the PDF, normalizes objects, saves relative references in such a way that they can be restored after data are added or removed, adds whitespace to the objects containing structured data, and performs similar modifications to make the PDF file editable (while keeping it a valid PDF file).

/tmp/tmp.bluekettle% qpdf --qdf /kettle/blue/australia.pdf doc.qdf

Open the file in your editor of choice.

/tmp/tmp.bluekettle% nvim doc.qdf

Look for a section of the file like the one below. The numbers needn't match; the important part is that there be a line with /Type /Catalog.

%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
>>
endobj

Modify it as such:

%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
  /PageLabels
  << /Nums
    [
      page labels here
    ]
  >>
>>
endobj
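
The catalog edit above can also be scripted. What follows is only a minimal Python sketch, not a feature of qpdf: the function name add_page_labels and its arguments are hypothetical, and it assumes that the QDF file writes /Type /Catalog on a line of its own (as in the excerpt above) and that the file has no /PageLabels entry yet. It works on bytes rather than text because a QDF file may still contain binary stream data.

```python
def add_page_labels(qdf: bytes, nums_body: bytes) -> bytes:
    """Insert a /PageLabels dictionary after the '/Type /Catalog' line.

    Hypothetical helper: `qdf` is the content of a file made by
    `qpdf --qdf`, and `nums_body` holds the lines that go between the
    square brackets of the /Nums array.
    """
    marker = b"/Type /Catalog"
    pos = qdf.find(marker)
    if pos == -1:
        raise ValueError("no /Type /Catalog line found")
    if b"/PageLabels" in qdf:
        # Assumption: refuse to touch a file that already has labels.
        raise ValueError("file already contains /PageLabels")
    end_of_line = qdf.find(b"\n", pos)
    if end_of_line == -1:
        raise ValueError("catalog line is not newline-terminated")
    addition = (
        b"  /PageLabels\n"
        b"  << /Nums\n"
        b"    [\n" + nums_body + b"\n"
        b"    ]\n"
        b"  >>\n"
    )
    # Splice the new dictionary entry in right after the catalog line.
    return qdf[: end_of_line + 1] + addition + qdf[end_of_line + 1 :]


sample = b"1 0 obj\n<<\n  /Pages 6 0 R\n  /Type /Catalog\n>>\nendobj\n"
print(add_page_labels(sample, b"      0 << /S /r /St 1 >>").decode("ascii"))
```

The spliced file has stale byte offsets, exactly as a hand-edited one does, so it still has to go through fix-qdf as described under "Cleaning up".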

Next, figure out how the pages will be labeled.

Labeling pages

Lines 9-11 of the above Catalog object (the square brackets and whatever lies between them) constitute an array in the PDF file format. Our task is to populate this array with key-value pairs, each consisting of the index of the first page to which a particular numbering scheme applies (the key) and the numbering scheme itself (the value). This is not particularly easy.

In order that this may be made as understandable as possible, I will spare you the Backus-Naur form grammars and instead present variables that must have values substituted in by the user as $single_words_with_prepended_dollar_signs, optional components with square brackets [like this], and potentially repeatable components with an ellipsis like this…. Every other syntactical element comes from the deeply weird syntax of PostScript/PDF files.

Each key-value pair is formatted like $index << $numscheme >>, and $numscheme is itself a set of key-value pairs describing the page numbering scheme of the PDF. Indices are zero-based, so subtract one from the page numbers you get from a normal PDF reader.

Constructing a numbering scheme

This section is a simplified version of what is provided for by the PDF standard document, but should be adequate for nearly all purposes.

Together with its enclosing double angle brackets, a numbering scheme has the form << $key $val [$key2 $val2]… >>, where $key and $key2 are keys from this list:

/S is the numbering style. A $val of:
- /D selects decimal Arabic numerals (e.g. 1, 2, 3, …)
- /R selects uppercase Roman numerals (e.g. I, II, III, …)
- /r selects lowercase Roman numerals (e.g. i, ii, iii, …)
- /A selects uppercase Latin-alphabet letters (e.g. A, B, C, …)
- /a selects lowercase Latin-alphabet letters (e.g. a, b, c, …)
If the /S key is omitted, page numbers will consist solely of the prefix (or be blank if there is no prefix either).

/P is a prefix: $val is a string that will be prepended to the page number (or will be the entire page number if /S is omitted). Note that in PostScript (from which PDF is derived), strings are formatted (like this), with parentheses around the text.

/St is the number at which page numbering starts. Successive page numbers are derived by incrementing this number. This value must be at least 1, and is taken to be 1 if the key is omitted. It must always be written as an Arabic-numeral integer, even when Arabic numerals are not used for page numbering.
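
To make the interplay of /S, /P and /St concrete, here is a rough Python sketch of how a reader might turn one numbering scheme into the label of a given page. The function names and argument layout are my own invention for illustration; the letter styles follow the rule in the PDF specification that the 27th page is AA, the 28th BB, and so on. Treat it as an approximation of reader behavior rather than a reference implementation.

```python
def to_roman(n: int) -> str:
    """Uppercase Roman numeral for a positive integer."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, numeral in pairs:
        count, n = divmod(n, value)
        out.append(numeral * count)
    return "".join(out)


def to_letters(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def page_label(offset: int, style=None, prefix: str = "", start: int = 1) -> str:
    """Label of the page `offset` pages after the first page of a scheme.

    `style` mirrors /S (one of "D", "R", "r", "A", "a", or None),
    `prefix` mirrors /P and `start` mirrors /St.
    """
    if start < 1:
        raise ValueError("/St must be at least 1")
    n = start + offset
    if style is None:
        number = ""  # no /S key: the label is the prefix alone
    elif style == "D":
        number = str(n)
    elif style == "R":
        number = to_roman(n)
    elif style == "r":
        number = to_roman(n).lower()
    elif style == "A":
        number = to_letters(n)
    elif style == "a":
        number = to_letters(n).lower()
    else:
        raise ValueError(f"unknown numbering style: {style!r}")
    return prefix + number


print(page_label(0, "r"))                # first page of << /S /r >>
print(page_label(5, "D", prefix="1-"))   # sixth page of << /P (1-) /S /D >>
print(page_label(0, None, prefix="FC"))  # << /P (FC) >>
```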

Labeling pages: the basics for those in a hurry

The following is appropriate for most conventionally-formatted documents:

%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
  /PageLabels
  << /Nums
    [
      0  << /P (FC)      >>
      1  << /P (IFC)     >>
      2  << /S /r /St  1 >>
      $x << /S /D /St  1 >>
      $y << /P (IBC)     >>
      $z << /P (BC)      >>
    ]
  >>
>>
endobj

Replace $x with the index (the ordinal page number minus one) of the first page with Arabic numbering, $y with the index of the page containing the inner back cover, and $z with $y+1 (the index of the back cover).
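
The substitutions for $x, $y and $z are simple arithmetic, which the following Python sketch spells out. It is an illustration only: the function name is hypothetical, and it assumes that the inner back cover and back cover are the last two pages of the file, so that both indices can be derived from the total page count.

```python
def hurry_entries(first_arabic_ordinal: int, page_count: int) -> list:
    """Fill in the $x, $y and $z placeholders of the template above.

    `first_arabic_ordinal` is the ordinal (one-based) number of the first
    page with Arabic numbering, as shown by a reader without page labels.
    Assumption: the two back covers are the final two pages of the file.
    """
    x = first_arabic_ordinal - 1  # ordinal page number -> page index
    y = page_count - 2            # inner back cover, second to last page
    z = y + 1                     # back cover, last page
    if not 3 <= x < y:
        raise ValueError("page numbers do not fit the template")
    return [
        "0 << /P (FC) >>",
        "1 << /P (IFC) >>",
        "2 << /S /r /St 1 >>",
        f"{x} << /S /D /St 1 >>",
        f"{y} << /P (IBC) >>",
        f"{z} << /P (BC) >>",
    ]


# Example: a 200-page file whose page "1" is the 11th page of the file.
for line in hurry_entries(11, 200):
    print(line)
```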

A somewhat in-depth example

Consider a document with the following ordinal and document page numbers.

Ordinal page number	Document page number
1	Front cover
2	Inner front cover
3	Spine
4 through 13	i through x
14 through 19	1-1 through 1-6
20 through 49	2-1 through 2-30
50 through 59	A-1 through A-10
60	Inner back cover
61	Back cover

We can align the logical page numbers to the document page numbers by defining the following /Catalog block.

%% Original object ID: 1178 0
1 0 obj
<<
  /Metadata 3 0 R
  /Pages 6 0 R
  /Type /Catalog
  /PageLabels
  << /Nums
    [
      0  << /P (FC)              >>
      1  << /P (IFC)             >>
      2  << /P (Spine)           >>
      3  << /S /r /St  1         >>
      13 << /P (1-) /S /D /St  1 >>
      19 << /P (2-) /S /D /St  1 >>
      49 << /P (A-) /S /D /St  1 >>
      59 << /P (IBC)             >>
      60 << /P (BC)              >>
    ]
  >>
>>
endobj
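
Going from a table of ordinal page numbers to /Nums entries is mechanical: each key is the ordinal number of the first page of a section, minus one. The Python sketch below shows one way to generate the entries for this example. The nums_entries function and the SECTIONS list are hypothetical names of mine, and the sketch assumes that prefixes contain no parentheses or backslashes, which would need escaping inside a PDF string.

```python
def nums_entries(sections) -> list:
    """Turn (first_ordinal_page, scheme) pairs into /Nums entry lines.

    Each scheme is a dict with optional keys "P" (prefix), "S" (style)
    and "St" (start), mirroring the PDF keys of the same names.
    """
    if not sections or sections[0][0] != 1:
        # The PDF specification requires an entry for page index 0.
        raise ValueError("the first section must start at ordinal page 1")
    lines = []
    previous = 0
    for first_ordinal, scheme in sections:
        if first_ordinal <= previous:
            raise ValueError("sections must be in increasing page order")
        previous = first_ordinal
        parts = []
        if "P" in scheme:
            parts.append(f"/P ({scheme['P']})")
        if "S" in scheme:
            parts.append(f"/S /{scheme['S']}")
        if "St" in scheme:
            parts.append(f"/St {scheme['St']}")
        # Key is the zero-based page index, hence the minus one.
        lines.append(f"{first_ordinal - 1} << {' '.join(parts)} >>")
    return lines


# The example document, keyed by the first ordinal page of each section.
SECTIONS = [
    (1, {"P": "FC"}),
    (2, {"P": "IFC"}),
    (3, {"P": "Spine"}),
    (4, {"S": "r", "St": 1}),
    (14, {"P": "1-", "S": "D", "St": 1}),
    (20, {"P": "2-", "S": "D", "St": 1}),
    (50, {"P": "A-", "S": "D", "St": 1}),
    (60, {"P": "IBC"}),
    (61, {"P": "BC"}),
]

for line in nums_entries(SECTIONS):
    print(line)
```

Apart from column alignment, the printed lines should match the array in the Catalog object above and can be pasted between its square brackets.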

Cleaning up

Run fix-qdf to fix the object references that editing has invalidated. It reads its input from stdin and writes its output to stdout, so use it with I/O redirection:

/tmp/tmp.bluekettle% fix-qdf <doc.qdf >doc-prelim.pdf

Open doc-prelim.pdf in a PDF viewer and check that the page numbers are right. If not, edit doc.qdf and re-run fix-qdf. Once the page numbers are in correspondence, keep the output of fix-qdf: it is a fully compliant PDF document suitable for use and storage. Since qpdf --qdf decompresses parts of the input file and adds whitespace, and because fix-qdf doesn't strip out any of that, the resulting file will be bigger than the input. However, in my experience the increase in file size is negligible and not worth recompressing the final PDF file to eliminate.

Thanks to TheMagicalC for proofreading an earlier version of this post.