🔬 Security Research

Immutable Bytes, Mutable Meaning

A PDF isn't one document: it's a rendering program over an object graph, and each parser, renderer, signature validator, or AI pipeline decides for itself what the file says. This page is the single entry point to one argument, built in layers and measured across 24,824 real PDFs in three separate corpora.

Document AI assumes semantic determinism. The PDF format never provided it.

Ground truth, retrieval correctness, reproducible evaluation, hallucination control — every one quietly assumes a document means the same thing to every reader. For PDFs that assumption is false, and now that machines do the reading it is a measurable failure, not a curiosity. You don't have to care about PDFs to care about that.

Immutable bytes do not guarantee immutable meaning.

Imagine an LLM trained on a document no human has ever seen. Nothing was hidden and nothing was hacked — the PDF simply held several valid machine-readable realities, the ingestion pipeline picked one, and enshrined it as ground truth. That is not a thought experiment; it is what the studies below measure.

PDF was engineered to guarantee one thing: visual fidelity — that a page looks the same on every screen and printer. It never promised semantic determinism — that every system reading the file extracts the same meaning from it. For thirty years that gap was invisible, because humans read the pixels and the assumption “one parser, one truth” was never tested.

Machines changed the stakes. RAG knowledge bases, LLM training corpora, compliance pipelines and legal-discovery systems now read the object graph, not the page — and the gap becomes a security and correctness problem: forms signed while their value and their appearance disagree, knowledge bases that silently enshrine the wrong reality, evidence that reads one way to a person and another to a tool. This program names that property, measures how often it occurs, and shows what it enables.

Why this surfaced now

The flaw isn't new — the readers are. For thirty years a human read the rendered page and the gap between pixel and payload never mattered. Now machines read the structure at scale, and a latent property of the format becomes an active security and correctness problem.

For thirty years we assumed a document had one meaning, because a human read the page. Now machines read the structure — and many PDFs never had a single meaning at all.

One file, as many documents as there are readers

The same bytes, the same hash, the same valid signature — handed to five systems that each return a different document. None is malfunctioning; the PDF format guarantees they'll agree on how the page looks, never on what it says.

See the gap in one file

The claim sounds abstract until you watch a single file say two things at once. None of the files below is corrupt. None trips a malware alert. Each is simply a different document depending on who — or what — is reading it.

Exhibit A The character that isn't the glyph

👁 What the page shows a human

9 (rendered on screen and in print)

⌨ What text extraction returns

1 (copied, indexed, fed to the model)

The font's ToUnicode map points the glyph drawn as 9 at the character 1. The page is honest to the eye and lying to the machine — and a digital signature over the bytes certifies both readings. Proven and diagrammed in The Illusion of Immutability.
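The remap can be sketched in a few lines. This is a minimal illustration, assuming a simplified ToUnicode CMap that uses only `bfchar` entries; the CMap text and the glyph code `0x1C` are invented for the example, and real fonts may also use `bfrange` sections that this sketch ignores.

```python
import re

# Hypothetical, simplified ToUnicode CMap fragment (not taken from a real file).
# Assume glyph code 0x1C is the outline that draws "9" on the page,
# while the map declares that it means U+0031 ("1").
TOUNICODE_CMAP = """
2 beginbfchar
<1B> <0038>
<1C> <0031>
endbfchar
"""

def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map glyph codes to the Unicode strings a text extractor would return."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def extract_text(glyph_codes: list[int], cmap: dict[int, str]) -> str:
    """Return what copy, search and indexing see for the given glyph codes."""
    return "".join(cmap.get(code, "\ufffd") for code in glyph_codes)

cmap = parse_bfchar(TOUNICODE_CMAP)
extracted = extract_text([0x1C], cmap)  # the page draws "9" for this code
```

The renderer never consults this map; it draws the glyph outline. Only extraction does, which is why the eye and the machine can part ways without either being broken.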

Exhibit B The signed form that stores a different number

👁 Appearance stream (/AP) — what the signer saw

$1,000.00 (the rendered field on the page)

⌨ Field value (/V) — stored & submitted

$9,000.00 (the value a backend actually reads)

A single form field carries two independent representations. The certificate covers the file's bytes — including both — so the signature stays valid while the value disagrees with its own appearance. See PDF Forms as Executable Security Boundaries.
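A check for this divergence can be sketched as a comparison between the stored value and the text the appearance stream paints. The sketch below assumes an already-decompressed appearance stream that shows text with simple literal strings and the `Tj` operator; the field contents are hypothetical and mirror Exhibit B. `TJ` arrays, hex strings, escape sequences and font encodings are deliberately out of scope.

```python
import re

def appearance_text(ap_stream: str) -> str:
    """Concatenate the literal strings shown by Tj operators in a content stream."""
    return "".join(re.findall(r"\(((?:\\.|[^\\()])*)\)\s*Tj", ap_stream))

def v_ap_diverges(field_value: str, ap_stream: str) -> bool:
    """True when the stored value (/V) differs from what the appearance (/AP) paints."""
    return field_value.strip() != appearance_text(ap_stream).strip()

# Hypothetical field: /V holds one amount, /AP paints another.
FIELD_V = "$9,000.00"
FIELD_AP = "/Tx BMC q BT /Helv 12 Tf 2 4 Td ($1,000.00) Tj ET Q EMC"

diverges = v_ap_diverges(FIELD_V, FIELD_AP)
```

A string comparison like this is only a first-pass signal. Formatting differences such as currency symbols or thousands separators would need normalizing before a mismatch is treated as deception.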

Exhibit C One 686-byte file, six parsers, no consensus

Parser	Pages	JavaScript	Encrypted	AcroForm
MuPDF	1	None	No	No
Poppler	1	Present	No	No
Ghostscript	1	None	—	—
qpdf	1	—	No	—
pdfminer	1	Present	No	No
pdf.js	1	Present	No	—

The same file hides a JavaScript action in the /Names/JavaScript tree. Three parsers surface it; the others don't surface it as shipped — and that is not a matter of parsers being wrong. Each answers a different question: Ghostscript rasterizes and never enumerates interactive actions; qpdf has no path to enumerate it; and MuPDF can (mutool show file.pdf js with MuJS compiled in), but the builds shipped in most security stacks, distros and DFIR kits don't compile it in, and default workflows run mutool info, which never lists it. The point isn't which engine is correct — it's that two security pipelines built on two engines see two different files. A malware scanner built on an engine that never surfaces the action waves the file through as clean; this is a detection-visibility gap, not a spec violation. Real scanner output, all eleven crafted files, in Parser Disagreement: Six Parsers, Eleven Divergences.
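The table can be turned into a small disagreement check. This is a sketch rather than the scanner's actual logic: it encodes the six rows above by hand, treats a dash as "this engine does not report the field" (an assumption made here so that silence is not counted as a conflicting answer), and groups parsers by the value they return.

```python
# Rows from the Exhibit C table; None stands for a dash (field not reported).
REPORTS = {
    "MuPDF":       {"pages": 1, "javascript": "None",    "encrypted": "No", "acroform": "No"},
    "Poppler":     {"pages": 1, "javascript": "Present", "encrypted": "No", "acroform": "No"},
    "Ghostscript": {"pages": 1, "javascript": "None",    "encrypted": None, "acroform": None},
    "qpdf":        {"pages": 1, "javascript": None,      "encrypted": "No", "acroform": None},
    "pdfminer":    {"pages": 1, "javascript": "Present", "encrypted": "No", "acroform": "No"},
    "pdf.js":      {"pages": 1, "javascript": "Present", "encrypted": "No", "acroform": None},
}

def disagreements(reports: dict) -> dict:
    """Return {field: {value: [parsers]}} for fields with more than one reported value."""
    by_field = {}
    for parser, report in reports.items():
        for field, value in report.items():
            if value is None:  # engine is silent on this field; skip it
                continue
            by_field.setdefault(field, {}).setdefault(value, []).append(parser)
    return {field: groups for field, groups in by_field.items() if len(groups) > 1}

result = disagreements(REPORTS)
```

On these rows only the JavaScript column splits, with Poppler, pdfminer and pdf.js on one side and MuPDF and Ghostscript on the other. A pipeline that runs a single engine never sees that there was a split at all.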

Same bytes. Same hash. A valid signature. Three different documents.

See it on your own file

Every exhibit above is what the engine sees on a real upload. Drop in a PDF and watch the three verdict axes — threat, deception, structural — resolve in seconds, with the parser-disagreement and reality-drift signals called out. Zero-retention: nothing is stored.

Open the PDF Forensics Scanner →

24,824

real PDFs analyzed · three corpora

16,971

complete DOJ Epstein release

0.34%

false-positive rate (6,281-PDF control)

18.6%

human-vs-machine reality drift

1 in 3

files where six parsers disagree

3

verdict axes (threat · deception · structural)

Findings, visualized

24,824 real PDFs · three corpora, measured separately

16,971 DOJ Epstein release · 6,281 real-world benign control · 1,572 curated malicious / edge-case — never blended into a single number.

Six parsers, no consensus

Production parsers disagree on 502 of 1,572 files, counting only differences in extracted content or structure that a security or ingestion pipeline would act on: page count, JavaScript visibility, encryption, or form values. Spec-legal optional-feature absence and version strings are not counted as disagreement.

Human-vs-machine drift

18.6% of the Epstein release reads differently to a person than to a text extractor — nearly 1 in 5.

Reality drift in tax forms

43 of 44 IRS tax forms drift between the rendered page and the extracted text layer.
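The headline figures can be cross-checked with simple arithmetic. The snippet below uses only numbers stated on this page; the one derived quantity, the approximate count of drifting Epstein files, is back-computed from a rounded percentage and should be read as an estimate, not a reported result.

```python
# Corpus sizes as stated above; the three are measured separately, never blended.
EPSTEIN, BENIGN_CONTROL, CURATED = 16_971, 6_281, 1_572

total = EPSTEIN + BENIGN_CONTROL + CURATED

# Parser disagreement is measured on the curated corpus only.
disagreement_rate = 502 / CURATED          # roughly one file in three

# Reality drift in the IRS tax-form sample.
tax_form_drift_rate = 43 / 44

# Estimate only: implied file count behind the rounded 18.6% figure.
approx_drifting_epstein_files = round(0.186 * EPSTEIN)
```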

🧷 Start here · the synthesis

The PDF Is Not the Document — 24,824 PDFs, Three Corpora

Finding — One finding across an adversarial detection set, a real-world benign control, and the entire 16,971-PDF Epstein release: a PDF is a stack of representations that can disagree, and malware is only one axis.

The synthesis over the entire program below and the three corpora behind it — 24,824 real PDFs measured separately, never blended. Detection without reputation, a 0.34% false-positive rate, 18.6% human-vs-machine drift at scale, metadata that survives the strip in orphaned objects, and a numbering sequence only 2.4% complete. The strongest evidence that document forensics and malware scanning are different disciplines.

Read the full study →

Malware is only one axis. The scanner grades threat, deception and structural integrity independently: a file can carry zero malware and still rank high on deception. Collapsing the three into a single “risk score” is exactly how a clean-looking, lying document slips through.
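Why a single score hides the problem can be shown with a toy verdict. The sketch assumes each axis is scored from 0.0 to 1.0 and that a pipeline alerts at 0.7; the scale, the threshold, the averaging rule and the example scores are all hypothetical and are not the scanner's actual model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    # Hypothetical 0.0 (clean) to 1.0 (severe) scale for each axis.
    threat: float
    deception: float
    structural: float

    def collapsed(self) -> float:
        """A naive single risk score: the mean of the three axes."""
        return (self.threat + self.deception + self.structural) / 3

    def flagged_axes(self, threshold: float = 0.7) -> list[str]:
        """Axes that cross the alert threshold when graded independently."""
        return [name for name, score in vars(self).items() if score >= threshold]

# A malware-free document whose appearance and content disagree sharply.
doc = Verdict(threat=0.0, deception=0.9, structural=0.1)
single_score = doc.collapsed()
flags = doc.flagged_axes()
```

With these illustrative numbers the averaged score sits well below the alert threshold, while the independent grading still flags deception.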

One property. PDF is the proof.

Parser disagreement, reality drift, V/AP divergence and the rest aren't separate topics — and they aren't really about PDF. They're independent lines of evidence for one general property — Semantic Nondeterminism: identical bytes that yield multiple valid semantic interpretations across different consumers, despite nothing in the file having changed. PDF is simply where it can be measured at scale, because PDF exposes the object graph the divergence hides in. The format is the proof; the property is the point.

Semantic Nondeterminism: PDF is the proof

Parser disagreement · Reality drift · V/AP divergence · OCR-layer divergence · Accessibility-tree divergence · ToUnicode remapping · AI-ingestion failure

Seven independent lines of evidence — every one measured in PDF, the document format that exposes its own object graph. The same assumption lives unmeasured under search, e-discovery, AI ingestion and compliance.
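One way to state the property formally is given below. This is a possible formalization for orientation, not a definition taken from the studies; the symbols $b$, $C$, $c_i$, $M$ and $H$ are introduced here for illustration.

```latex
Let $b \in \{0,1\}^*$ be a file, $H$ a cryptographic hash, and
$C = \{c_1, \dots, c_n\}$ a set of consumers, each a function
$c_i : \{0,1\}^* \to M$ into a space $M$ of interpretations
(text, page count, field values, and so on).

The file $b$ is \emph{semantically nondeterministic} with respect to $C$ if
\[
  \exists\, i \neq j : \quad c_i(b) \neq c_j(b),
\]
where every $c_i$ is a conforming reader of the format.
The input is the same $b$ for every consumer, so $H(b)$ is identical
throughout: integrity of the bytes holds while agreement on meaning fails.
```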

The argument, in five layers

Read top to bottom and the case builds from first principles to field measurement: what a PDF is, the one property it lacks, the four mechanisms that exploit the gap, the prevalence at scale, and a real-world application. Each study stands alone — together they define the discipline.

Layer I

Foundational framing — what a PDF actually is

Before any attack, a reset of first principles. A PDF is not the document; it is a rendering program over an object graph. Separate the file from the document and the central illusion collapses: identical bytes, an unbroken hash, even a valid signature say something about a container — not about what any given system will read as truth.

🪞

The Illusion of Immutability

Finding — A PDF separates what you see from what the machine reads — a glyph drawn "9" can extract "1", and a valid signature certifies it anyway.

How Reality Drift weaponises the gap between page and payload against legal integrity and AI ingestion: font-encoding remap, V/AP divergence under a valid signature, orphaned-object provenance, parser disagreement — diagrammed and measured across three corpora.

Reality drift · V/AP divergence · Orphaned objects · RAG / AI ingestion

Read the study →

Layer II

The unifying theory — one root cause

Every failure mode below is a symptom of a single structural fact. PDF was engineered to guarantee visual fidelity — that a page looks the same everywhere. It never promised semantic determinism: that every extractor reads the same meaning. This is the keystone that ties the threads together.

🧩

The PDF Semantic Determinism Problem

Finding — Parser disagreement, V/AP divergence, AI-ingestion failure and reality drift converge on one root cause: PDF guarantees pixels, not a single meaning.

The framework that names the property — stable cross-extractor semantics under real pipelines — explains why a format built for visual fidelity has no canonical machine-readable interpretation, and sets out what an actual fix would require.

Framework · Root cause · Determinism

Read the study →

Layer III

Mechanism threads — how one file forks into many

Four distinct structural mechanisms, each turning a single file into different documents for different readers. These are not variations on one bug; they are independent routes to the same outcome — what this PDF is depends on who is reading it.

⚖️

Parser Disagreement: Six Parsers, Eleven Divergences

Finding — 11 crafted PDFs run through six production parsers — every file produced a different reading. Same bytes, different document.

MuPDF, Poppler, Ghostscript, qpdf, pdfminer and pdf.js, each in isolated namespaces, disagree on page count, text, JavaScript visibility and structure for the same file — the basis of parser-discrepancy attacks.

Parser discrepancy · Keyword injection · Structural ambiguity

Read the study →

📋

PDF Forms as Executable Security Boundaries

Finding — A digital signature can certify a form while /V (the value) and /AP (the appearance) disagree — what gets signed is not what gets read.

Form fields carry two independent representations. V/AP divergence, NeedAppearances, DocMDP and FieldMDP certification mean a "signed" document can render one value and store another.

V/AP divergence · DocMDP · AcroForm

Read the study →

🌀

PDF Reality Drift

Finding — One file, many realities: 43 of 44 IRS tax forms drift between the rendered page and the extracted text layer.

Thirteen structural drift vectors — OCR layers, accessibility trees, incremental revisions, ToUnicode remaps, optional-content groups — make legitimate, professionally produced PDFs among the most semantically unstable of all.
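Drift between the rendered page and the text layer can be approximated with a similarity score. The sketch assumes the visible text has already been recovered (for example by OCR of the rendered page) and the text layer has already been extracted; both arrive here as plain strings, and the sample strings are invented. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in metric, not the measure used in the studies.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Collapse whitespace and case so layout noise is not counted as drift."""
    return " ".join(text.lower().split())

def drift_score(visible_text: str, extracted_text: str) -> float:
    """0.0 means the two readings match; values near 1.0 mean they share little."""
    a, b = normalize(visible_text), normalize(extracted_text)
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Hypothetical pair: what a person reads versus what the text layer returns.
visible = "Total due:  $1,000.00"
extracted = "Total due: $9,000.00"

score = drift_score(visible, extracted)
```

A caution on the design: a one-character change in an amount produces a very small score, so a threshold on similarity alone can miss exactly the differences that matter most. Field-level or numeric comparison would be a sensible complement.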

Reality drift · OCR / text-layer · Accessibility

Read the study →

🤖

PDF Structural Problems in AI Ingestion Pipelines

Finding — AI ingestion can be poisoned by the document itself — the model ingests text the human reader never sees.

When V/AP divergence and parser disagreement reach a RAG knowledge base or an LLM training corpus, single-parser extraction silently picks one reality and enshrines it as ground truth — quietly poisoning what the model learns or retrieves.
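One defensive pattern is to refuse to pick a reality silently. The sketch below runs several extractors over the same bytes and only ingests when they agree after whitespace normalization; otherwise it quarantines the file with every reading attached. The extractor callables are hypothetical stand-ins for real parser bindings, and exact string equality is a deliberately strict assumption.

```python
from typing import Callable

Extractor = Callable[[bytes], str]

def ingest_or_quarantine(data: bytes, extractors: dict[str, Extractor]):
    """Return ("ingest", text) on consensus, else ("quarantine", per-extractor text)."""
    outputs = {name: " ".join(fn(data).split()) for name, fn in extractors.items()}
    if len(set(outputs.values())) == 1:
        return ("ingest", next(iter(outputs.values())))
    return ("quarantine", outputs)

# Toy extractors standing in for two different parsing engines.
def engine_a(data: bytes) -> str:
    return "Total due: $1,000.00"

def engine_b(data: bytes) -> str:
    return "Total due: $9,000.00"

decision, detail = ingest_or_quarantine(b"%PDF-1.7 ...", {"a": engine_a, "b": engine_b})
```

Quarantine here means a human or a stricter policy decides which reading, if any, becomes ground truth, instead of whichever parser happened to run.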

AI poisoning · V/AP divergence · Parser discrepancy · RAG / LLM

Read the study →

Layer IV

Empirical scale — does it hold in the wild?

A theory is only as good as its measurement. The mechanisms above are stress-tested against a large, multi-domain corpus to answer the question that decides whether this is a curiosity or a property of the ecosystem: how often does it actually happen? Its corpus-level numbers consolidate and supersede the per-study prevalence figures in the earlier deep-dives — the mechanisms they document still stand; the authoritative counts live here and in the synthesis above.

📊

PDF Forensics at Scale — a 7,800-PDF Study

Finding — Parser disagreement on 502 of 1,572 files (~1 in 3); a 0.34% false-positive rate on a 6,281-PDF real-world control — detection by analysis, not reputation.

1,572 curated malicious/edge-case PDFs plus a 6,281-PDF real-world benign control, across eight domains including 400 live-malware samples. A multi-axis threat-vs-deception-vs-capability verdict, an honest false-positive measurement, and the prevalence numbers that consolidate the mechanism studies above — superseding their corpus-level figures while the mechanisms themselves still hold.

Multi-axis verdict · Detection-by-analysis · False-positive control

Read the study →

Layer V

Field application — a high-stakes case

The machinery, applied to a socially loaded, contested real-world disclosure where “what the document says” genuinely matters. This is the exhibit showing that the work is not academic: multi-engine, multi-layer forensics yields a different answer than opening the file and squinting.

🗂️

The Epstein Files, Forensically — 16,971 PDFs

Finding — A complete pass over the entire DOJ Epstein release: malware-clean, but 100% metadata-stripped (toolchain still recoverable — OmniPage CSDK 21.1), and 18.6% read differently to humans than to machines.

The first complete automated forensic pass over the whole DOJ disclosure — every one of 16,971 PDFs, all 47 engines. A document-integrity story: uniformly re-processed, metadata stripped from view but recoverable, and for nearly 1 in 5 documents the machine-readable text layer diverges from the visible page.

Reality drift · Metadata recovery · Parser discrepancy · Real-world corpus

Read the study →

Why this is its own subject

The thirty-year PDF research canon — rendering fidelity, compression, digital signatures, malware, OCR, accessibility, PDF/A conformance — shares one unexamined assumption: that a PDF has a single, knowable content, and that one parser reading it yields the truth. Cross-parser semantic divergence is not unknown territory: DARPA's SafeDocs program and the LangSec community (Bratus et al., 2019–) formalized why format ambiguity makes parsers disagree, and built the theory this work stands on. What has been missing is operational measurement at corpus scale — not whether divergence can occur, but whether and how often production engines extract the same meaning from the same bytes across tens of thousands of real-world files. That is what we measured, and this program extends that prior work from the parser to the pipeline. And if semantic determinism cannot be assumed for the world's most common document format, then AI, search, compliance, and evidentiary systems built on top of documents must treat semantic determinism as a property to be verified, not assumed.

Across 24,824 real PDFs the same file routinely produces different documents — different page counts, different text, different JavaScript visibility, a value that disagrees with its own signed appearance. That isn't a crafted edge case; it's a property of the ecosystem, latent for as long as humans read the page and now active because machines read the structure. The threat model shifts from “is this file malware?” to “what realities can this file present, to whom, and who will believe which one?” — and that question is the subject this program exists to map.

The finding is about PDF; the assumption it breaks is not. The assumption that a document means one thing regardless of who reads it is the unstated foundation of digital forensics, e-discovery, search and information retrieval, compliance, and now AI training and retrieval. Semantic determinism is the assumption they all share; PDF is the proof it was never guaranteed. Name it, measure it, and you can start hardening ground truth everywhere a document is ingested.

Frequently asked

What is semantic nondeterminism?

Semantic nondeterminism is the property whereby identical bytes — an unchanged file, same hash — yield multiple valid machine-readable interpretations across different consumers. It is technical and falsifiable: same input, different extracted meaning, nothing in the file changed. This research measures it in PDF, the document format whose object graph exposes the divergence.

Is this only a PDF problem?

The finding is measured in PDF; the assumption it breaks is not. PDF is simply where the property is testable at scale, because the format exposes its own object graph. The assumption that a document means one thing regardless of who reads it is the unstated foundation of search, e-discovery, AI ingestion and compliance. We have proven it fails in PDF, so those systems must treat semantic determinism as a property to be verified, not assumed.

Why does it matter for AI and RAG pipelines?

When a single PDF carries more than one valid reading, single-parser extraction silently picks one and enshrines it as ground truth. The text a model trains or retrieves on can differ from the page a human reviewed — so retrieval correctness, reproducible evaluation and hallucination control all rest on an assumption the document never guaranteed.

How common is it, really?

Common enough to be a property of the ecosystem, not an edge case. Six production parsers disagree on roughly one file in three (502 of 1,572) — counting only divergence in extracted content or structure a security or ingestion pipeline would act on, not spec-legal optional-feature absence or version strings. 43 of 44 IRS tax forms drift between the rendered page and the extracted text layer. 18.6% of the 16,971-PDF DOJ Epstein release reads differently to a machine than to a person. Detection is by analysis, not reputation — a 0.34% false-positive rate on a 6,281-PDF real-world control.

Doesn't a hash or digital signature already prevent this?

No. A hash proves the bytes are unchanged; a signature certifies them. Neither constrains meaning. A form field can be signed while its stored value disagrees with its rendered appearance; a font can draw the character 9 while text extraction returns 1. The signature stays valid and the document still says two different things to two different readers.

These findings drive the scanner. Test the whole multi-axis pipeline on your own files with the PDF Forensics Scanner · how the engine works
Building AI, RAG or document pipelines? Apply this research → AI Document Integrity