The Illusion of Immutability: PDF Reality Drift

Executive summary

The PDF format deliberately stores what a human sees separately from what a machine reads, and that gap — Reality Drift — can be widened until the two never agree: a glyph drawn as “9” can extract as “1,” a field can store “Terminate” while painting “Renew,” and a valid cryptographic signature certifies the divergence either way. Cross-corpus measurement shows this is ambient, not exotic — even a clean real-world control drifts at a measurable baseline, while adversarial files drift almost completely. As legal workflows and AI pipelines increasingly trust the text layer without a human in the loop, closing that gap stops being a forensic curiosity and becomes a precondition for document integrity.

What you see is not what the machine processes

Digital commerce, global legal frameworks, and automated AI ingestion pipelines all rest on a single foundational assumption: what appears on the screen is exactly what the machine reads. For three decades the Portable Document Format (ISO 32000) has been the bedrock of that trust, governing land deeds, corporate acquisitions, treaties, and financial compliance filings.

But a PDF is not a flat, static piece of digital paper. It is a compiled, procedural rendering script. Because the specification deliberately separates visual appearance from semantic data, it permits a phenomenon we call Reality Drift — a complete divergence between what the human eye sees and what the machine parser extracts. Applied to digital forensics, electronic signatures, and Retrieval-Augmented Generation (RAG) pipelines, Reality Drift quietly invalidates the trust architecture those systems were built on.

This is a mechanism deep-dive. For the measured corpus-level prevalence behind it, see PDF Reality Drift and The PDF Is Not the Document.

1. The anatomy of the divergence: pixels vs. semantics

To understand why one PDF can report two different realities, you have to separate how characters are drawn from how they are stored. When an application renders a page, it reads content-stream instructions (BT to begin a text object, font operators, and strings of character codes that select glyphs) to place vector shapes at specific X/Y coordinates on the canvas. For searching, copying, and machine extraction, the parser instead relies on a separate structure: the font’s /ToUnicode CMap (character map).

The font-encoding exploitation: the visual glyph and the Unicode assignment are disconnected.

The font-encoding exploitation

An attacker can craft a custom embedded font in which the drawn glyph and the underlying Unicode value are completely disconnected:

Visual layer: byte 0x41 is mapped to draw the glyph shape for “9”.
Extraction layer: the /ToUnicode map routes 0x41 to U+0031 — the character “1”.

A compliance officer reading the page sees a contractual rate of 9%. An automated financial ingestion system or an LLM RAG script parsing the raw text layer extracts 1% — with zero visual indication that any remapping occurred.
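
The two lookups can be modelled in a few lines. This is a minimal sketch, not a font parser: the two dictionaries are hypothetical stand-ins for the font's glyph program and its /ToUnicode CMap, and the function names are illustrative only.

```python
# Hypothetical, simplified model of one embedded font: two independent
# lookup tables keyed by the same character code.
GLYPH_FOR_CODE = {0x41: "9", 0x25: "%"}      # shape the renderer draws
TOUNICODE_FOR_CODE = {0x41: "1", 0x25: "%"}  # value text extraction returns


def rendered_text(codes: bytes) -> str:
    """Text as a human would read it from the painted glyphs."""
    return "".join(GLYPH_FOR_CODE[c] for c in codes)


def extracted_text(codes: bytes) -> str:
    """Text as a parser would report it via the /ToUnicode mapping."""
    return "".join(TOUNICODE_FOR_CODE[c] for c in codes)


def drifting_codes(codes: bytes) -> list[int]:
    """Character codes whose drawn glyph and Unicode value disagree."""
    return sorted(
        {c for c in codes if GLYPH_FOR_CODE[c] != TOUNICODE_FOR_CODE[c]}
    )


content = bytes([0x41, 0x25])
print(rendered_text(content))                     # 9%
print(extracted_text(content))                    # 1%
print([hex(c) for c in drifting_codes(content)])  # ['0x41']
```

Nothing in the content stream changes between the two readings; only the table consulted differs, which is why the remap leaves no visual trace.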

2. Cryptographic signatures: validating flawed bytes

The standard institutional rebuttal is that digital signatures (PAdES, Adobe CDS) guarantee integrity: alter the document after signing and the validation chain breaks. That defence overlooks the architectural separation between a form field’s value and its appearance stream.

A valid signature certifies the bytes — it never evaluates whether /V matches /AP.

An interactive form field carries two relevant entries: /V (the stored value) and /AP (the appearance dictionary, whose streams hold the literal graphics commands that paint the field). A signature covers a byte range: it verifies that the raw bytes have not changed since signing, not that the painted appearance matches the stored value. An attacker crafts a form where /V is "Terminate" while the /AP stream paints "Renew". Both sit inside the signed byte range, so the file passes full validation. The human signs a renewal, the automated execution system ingests a termination, and the security infrastructure reports a perfectly valid signature.
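
A check for this mismatch can be sketched as follows, assuming the field's /V string and its already-decompressed appearance stream have been obtained from some PDF library. The helper names are hypothetical, and the regular expression deliberately handles only simple literal strings shown with Tj.

```python
import re

# A literal string operand followed by the Tj (show text) operator.
# Simplification: ignores escaped parentheses, hex strings, and TJ arrays.
TJ_LITERAL = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def painted_text(appearance_stream: bytes) -> str:
    """Concatenate the literal strings the appearance stream paints."""
    parts = TJ_LITERAL.findall(appearance_stream)
    return " ".join(p.decode("latin-1") for p in parts)


def value_matches_appearance(stored_value: str, appearance_stream: bytes) -> bool:
    """True when the stored /V text equals the text painted by /AP."""
    painted = painted_text(appearance_stream).strip().casefold()
    return stored_value.strip().casefold() == painted


ap = b"/Tx BMC q BT /Helv 12 Tf 2 4 Td (Renew) Tj ET Q EMC"
print(painted_text(ap))                           # Renew
print(value_matches_appearance("Terminate", ap))  # False
print(value_matches_appearance("Renew", ap))      # True
```

This comparison trusts the bytes inside the appearance stream, so it can itself be defeated by a remapped font; comparing /V against OCR of the rendered field would be the stricter variant.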

Real-world example: our form-security study reproduced this on a routine e-sign approval form. A checkbox whose stored value was /V Off (do not approve) shipped with /AS and /AP entries that both rendered the box ticked. Worse, the form set /NeedAppearances true (ISO 32000 §12.7.2), which asks each viewer to regenerate the appearance from /V at open time. A viewer that honours the flag redraws the box from /V, while one that keeps the embedded appearance shows the tick, so the same signed bytes can display “approved” in one reader and “not approved” in another, with the signature valid in both.

Full mechanism, including NeedAppearances and DocMDP certification: PDF Forms as Executable Security Boundaries.

3. Provenance survives the strip: orphaned objects

In high-stakes releases — litigation discovery, government declassification — files are run through sanitisation pipelines meant to strip authorship metadata and revision history. Most pipelines do this by editing the trailer and removing the primary metadata references: the /Info dictionary or the XMP stream. But the PDF format stores changes incrementally. When a file is updated or superficially “stripped,” the original objects — past edits, older redaction layers, original machine names — are usually unlinked from the object tree but not purged from the binary.

Deleting the pointer unlinks the object; it does not erase the bytes. Provenance survives the strip.

These are orphaned objects. Unless a sanitiser performs exhaustive garbage collection — rebuilding the cross-reference table and writing a fresh binary from scratch — the historical metadata stays in the compressed streams. A forensic object-graph walker bypasses the trailer pointers, scans the raw byte sequence, and reconstructs the document’s true production origin. In the full DOJ Epstein release this was not theoretical: every one of 16,971 PDFs was metadata-stripped from view, yet the toolchain (OmniPage CSDK 21.1) was still recoverable from exactly this kind of residue.
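
The raw-byte scan can be illustrated with a minimal sketch. It assumes the provenance keys sit in uncompressed objects as simple literal strings; a real walker would also have to inflate object streams and handle hex strings. The function names and the producer string in the sample are hypothetical.

```python
import re

# Literal-string values of a few provenance keys, wherever they occur in the
# raw bytes. Simplification: ignores hex strings, escaped parentheses, and
# anything stored inside compressed streams.
PROVENANCE = re.compile(rb"/(Producer|Creator|Author)\s*\(([^()\\]*)\)")
INFO_REF = re.compile(rb"/Info\s+\d+\s+\d+\s+R")


def raw_provenance(pdf_bytes: bytes) -> list[tuple[str, str]]:
    """Provenance key/value pairs found anywhere in the file body."""
    return [
        (key.decode("ascii"), value.decode("latin-1"))
        for key, value in PROVENANCE.findall(pdf_bytes)
    ]


def links_info_dictionary(pdf_bytes: bytes) -> bool:
    """True when some dictionary still holds an /Info reference."""
    return INFO_REF.search(pdf_bytes) is not None


# Hypothetical "stripped" file: the trailer no longer points at object 5,
# but object 5 is still physically present.
stripped = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"5 0 obj\n<< /Producer (ExampleScanner 1.0) >>\nendobj\n"
    b"trailer\n<< /Size 6 /Root 1 0 R >>\n%%EOF\n"
)
print(links_info_dictionary(stripped))  # False
print(raw_provenance(stripped))         # [('Producer', 'ExampleScanner 1.0')]
```

A viewer that follows the trailer reports no metadata for this file, while the scan that ignores the trailer still recovers the producer string.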

4. Parser disagreement: six engines, six realities

The ISO PDF specification is vast and historically layered, so writing a perfectly compliant parser is an open problem. When real parsers — Adobe Reader, Foxit, Poppler, PDF.js, MuPDF, Ghostscript — hit a malformed header, a structural error, or conflicting syntax, they don’t crash. They run custom, non-standardised error-correction loops. Feed them a file with dual conflicting cross-reference streams or mismatched version declarations and every engine follows a different recovery path.

Same bytes, three recovery paths, three documents — the basis of the parser-discrepancy attack.

This creates a detection gap. A security scanner guarding an API gateway might analyse a file with one library (say Poppler) and clear it. The file then passes into an internal workflow where a worker opens it with another engine (say Adobe Reader), triggering an active JavaScript payload or a malicious prompt sequence the gatekeeper never saw. In our controlled test, 11 crafted PDFs run through six production parsers produced a different reading every time.

Real-world example — this is the exact shape of a mail-gateway bypass. A document with two cross-reference tables is scanned at the perimeter by a Poppler-based engine that follows the first xref — a clean, empty-looking invoice — and is delivered. The recipient opens it in Adobe Reader, which prioritises the final incremental update and resolves a second object tree carrying an /OpenAction JavaScript trigger. Same bytes, same hash, same signature; the scanner and the victim’s reader simply disagree about which document the file is. Our six-parser run measured exactly this disagreement across 11 crafted files.
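
A cheap triage step for this class of file is to count the cross-reference entry points the raw bytes advertise. The sketch below is an assumption-laden illustration, not a parser: legitimate incremental updates and linearised files also contain several startxref markers, so a positive result only means the file deserves differential parsing.

```python
import re

# Each "startxref <offset>" marker names a place where a reader may begin
# resolving the object tree.
STARTXREF = re.compile(rb"startxref\s+(\d+)")


def xref_offsets(pdf_bytes: bytes) -> list[int]:
    """Byte offsets advertised by every startxref marker, in file order."""
    return [int(offset) for offset in STARTXREF.findall(pdf_bytes)]


def has_competing_xrefs(pdf_bytes: bytes) -> bool:
    """True when the file advertises more than one distinct entry point."""
    return len(set(xref_offsets(pdf_bytes))) > 1


# Hypothetical file body with an original revision and one appended update.
sample = (
    b"%PDF-1.7\n...first revision...\nstartxref\n120\n%%EOF\n"
    b"...appended revision...\nstartxref\n480\n%%EOF\n"
)
print(xref_offsets(sample))         # [120, 480]
print(has_competing_xrefs(sample))  # True
```

Which of the two offsets a given engine trusts is exactly the recovery decision that differs between parsers.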

The full six-parser run: Parser Disagreement: Six Parsers, Eleven Divergences.

Reality Drift is measurable, not theoretical

These mechanisms leave a measurable signature: the fraction of documents whose machine-extracted text layer disagrees with the rendered page. We measured that fraction with the same multi-axis engine across three very different corpora — an adversarial set crafted to drift, a real-world government release, and a benign control — kept strictly separate.

Drift is everywhere — even a benign control carries a non-trivial baseline. Corpora measured separately, never blended.

The pattern is the point: even a clean, real-world control drifts at a measurable baseline, while an adversarially crafted set drifts almost completely. Drift is not an exotic edge case — it is an ambient property of the format that an attacker can dial up at will.

🤖 Why this matters for AI

Every mechanism above targets the text layer — and the text layer is exactly what an LLM ingests. A human reviewer is a built-in control: they see the rendered page and would notice “Renew” turning into “Terminate.” A RAG pipeline has no such control. It extracts the machine-readable stream, embeds it into a vector database, and serves it back as ground truth — ingesting the 1% no human ever saw, not the 9% on the page.

That makes Reality Drift a clean data-poisoning primitive: a single crafted PDF can plant contradictory facts in a knowledge base, smuggle instructions into a model via an accessibility or alternate-text layer, or pass an automated compliance gate while the document a person signed says the opposite. As agents begin to act on extracted text — approving payments, summarising contracts, filing reports — with no human in the loop, the pixel-vs-semantic gap moves from a forensic curiosity to a live integrity risk in the pipeline.

Deep-dive on RAG and training-corpus poisoning: PDF Structural Problems in AI Ingestion Pipelines.

5. Countermeasures for AI and enterprise ingestion

Because signature validation and metadata strippers don’t address low-level architectural drift, organisations processing untrusted PDFs at scale need a different approach — three layers of it.

Multi-engine structural analysis

Evaluate every document across multiple isolated parser implementations concurrently. Run the file through an array of independent forensic engines in separate namespaces and compare their extracted text-layer and structural outputs programmatically. Any variance across engines is itself the signal — it indicates structural manipulation, regardless of whether any single engine flags it.

The variance is the detector. If two honest engines disagree, the document is the problem.
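
The comparison step can be sketched independently of any particular engine. In this minimal example the extracted texts are assumed to have been collected already, one string per engine; the engine names and the whitespace/case normalisation are illustrative assumptions.

```python
import re
from itertools import combinations


def normalise(text: str) -> str:
    """Collapse whitespace and case so only real content differences count."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def disagreeing_pairs(outputs: dict[str, str]) -> list[tuple[str, str]]:
    """Every pair of engines whose normalised text output differs."""
    return [
        (a, b)
        for a, b in combinations(sorted(outputs), 2)
        if normalise(outputs[a]) != normalise(outputs[b])
    ]


outputs = {
    "engine_a": "Invoice total: 100 EUR",
    "engine_b": "Invoice  total: 100 EUR\n",
    "engine_c": "Invoice total: 900 EUR",
}
print(disagreeing_pairs(outputs))
# [('engine_a', 'engine_c'), ('engine_b', 'engine_c')]
```

An empty list means the engines agree; any non-empty result is a reason to quarantine the file, whichever engine turns out to be "right".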

Reality-Drift validation (Jaccard)

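Ingestion pipelines should actively compute the word overlap between a document’s extracted text layer and a localised OCR scan of the rendered pixels. The Jaccard similarity index gives a bounded score in [0, 1]:

J(T, O) = |T ∩ O| / |T ∪ O|

where T is the set of words in the extracted text layer and O is the set of words recovered by OCR. A score of 1 means the two word sets are identical; a score of 0 means they share no words.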

If the parser’s textual output diverges from the visually rendered characters beyond a nominal tolerance, the file is an active Reality-Drift risk and should be isolated from AI ingestion and RAG vector databases before it ever reaches a model.
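
A minimal sketch of that score, assuming both texts are already available as strings. The tokenisation (lower-cased word characters) and the 0.8 threshold are illustrative assumptions, not values taken from the measurements above.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased set of word tokens; punctuation is ignored."""
    return set(re.findall(r"\w+", text.casefold()))


def jaccard(text_layer: str, ocr_text: str) -> float:
    """Jaccard similarity of the two word sets, in [0, 1]."""
    a, b = words(text_layer), words(ocr_text)
    if not a and not b:
        return 1.0  # two empty pages agree trivially
    return len(a & b) / len(a | b)


def is_drift_risk(text_layer: str, ocr_text: str, threshold: float = 0.8) -> bool:
    """True when overlap falls below the (assumed) tolerance threshold."""
    return jaccard(text_layer, ocr_text) < threshold


text_layer = "The contractual rate is 1%"
ocr_text = "The contractual rate is 9%"
print(round(jaccard(text_layer, ocr_text), 3))  # 0.667
print(is_drift_risk(text_layer, ocr_text))      # True
```

Because the score works on sets, a single swapped digit in a long page barely moves it. A practical threshold has to be tuned against OCR noise on clean documents, and high-value fields such as amounts may need an exact comparison instead.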

Definitive garbage collection

To safely sterilise documents for legal or corporate release, pipelines cannot rely on surface-level property clearing. Sanitisation must perform absolute garbage collection: parse the complete object graph, discard every object that is no longer reachable from the document root, decode and re-encode the font mappings so that extraction agrees with the drawn glyphs, and rewrite a fresh canonical binary that aligns the human view with the machine data. Anything short of a full rebuild leaves orphaned provenance and drift in place.
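
The reachability part of that rebuild is a standard mark-and-sweep pass. The sketch below assumes the object graph has already been parsed into a mapping from object number to the object numbers it references; the sample graph and function names are hypothetical.

```python
def reachable(objects: dict[int, set[int]], roots: set[int]) -> set[int]:
    """Object numbers reachable from the roots by following references."""
    seen: set[int] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen or obj not in objects:
            continue  # already visited, or a dangling reference
        seen.add(obj)
        stack.extend(objects[obj] - seen)
    return seen


def orphans(objects: dict[int, set[int]], roots: set[int]) -> set[int]:
    """Objects present in the file but unreachable from any root."""
    return set(objects) - reachable(objects, roots)


# Hypothetical parsed graph: object number -> referenced object numbers.
objects = {
    1: {2},    # catalog -> page tree
    2: {3},    # page tree -> page
    3: {4},    # page -> content stream
    4: set(),
    7: {8},    # old metadata dictionary, no longer referenced
    8: set(),  # old stream referenced only by object 7
}
print(sorted(orphans(objects, roots={1})))    # [7, 8]
print(sorted(reachable(objects, roots={1})))  # [1, 2, 3, 4]
```

A sanitiser built this way writes out only the reachable set, so objects 7 and 8 never enter the new file rather than being unlinked and left behind.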

Frequently asked questions

What is Reality Drift in a PDF?

A divergence between what a human sees on the rendered page and what a machine parser extracts from the same file. Because ISO 32000 stores visual appearance separately from machine-readable semantics, the two layers can be made to disagree completely — a glyph drawn as “9” can extract as “1”.

Can a PDF show one number to a human and a different number to an AI?

Yes. A custom font can map a byte to a glyph shape (say “9”) while the /ToUnicode CMap routes the same byte to a different Unicode value (say “1”). The reader sees 9%; an LLM, RAG pipeline, or financial ingestion system reading the text layer extracts 1% — with no visual indication a remap occurred.

Does a valid digital signature guarantee a PDF was not manipulated?

No. A PAdES or Adobe CDS signature certifies that a byte range has not changed since signing. It does not verify that a form field’s stored value (/V) matches its painted appearance (/AP). An attacker can store "Terminate" while the appearance stream draws "Renew" — and the signature still validates.

Can authorship metadata survive PDF sanitisation or redaction?

Often, yes. Most sanitisers only delete the trailer pointer to the /Info dictionary or XMP stream. Because PDFs update incrementally, the original objects become orphaned — unlinked from the tree but still present in the binary. Only an exhaustive garbage-collection rewrite truly removes them.

Why do different PDF readers show different content for the same file?

The specification is large and historically layered, so parsers run non-standardised error-correction on malformed structures. Given conflicting cross-reference streams, Adobe Reader, PDF.js, and Poppler each follow a different recovery path and can display different text.

How can I detect Reality Drift in documents?

Run each document through multiple isolated engines and treat any variance as a signal; compute a Jaccard word-overlap score between the extracted text and an OCR scan of the pixels, isolating files below a tolerance; and sanitise with absolute garbage collection. The PQ PDF Forensics Scanner implements these checks.

Conclusion: redefining document trust

The friction around the PDF format highlights one modern reality: standards compliance is not operational security. While architecture boards focus on the theoretical intent of the ISO specification, defensive engineers have to operate inside the reality of compiled software behaviour. There, a single file can present a contractual 9% to a human and a 1% to a machine, pass a valid signature while storing the opposite of what it paints, and carry its full authorship history through a sanitiser that believed it stripped everything.

As AI systems increasingly ingest documents autonomously, with no human in the loop, closing the gap between visual rendering and semantic data stops being a niche forensic exercise. It becomes a requirement for legal and operational data integrity. That gap — not malware — is what the PQ PDF scanner was built to measure.
