The PDF Semantic Determinism Problem: Toward a Unified Framework

New — population-scale companion: PDF Forensics at Scale runs the scanner against 1,572 real-world PDFs — including 400 live malware samples — reporting live-malware detection, the real-world false-positive rate, and the files that crash a scanner (and how the engine was hardened).

Contents

The Root Cause
Thread 1: Parser Disagreement
Thread 2: V/AP Divergence
Thread 3: AI Ingestion
Thread 4: Reality Drift
One PDF, Many Realities — The Diagram
The Corpus Numbers
The Unified Theory
What Semantic Determinism Is Not
What This Means Operationally
What I Am Going to Keep Publishing

The Root Cause

PDF was designed to answer one question reliably: does this document look the same on any device? It answers that question exceptionally well. But visual fidelity and semantic determinism are different properties. The format guarantees the first. It makes no provision for the second.

Semantic determinism — as used in this article, the term refers operationally to stable cross-extractor semantic equivalence under real-world extraction pipelines — means: a document returns the same extractable semantic representation regardless of which layer, which parser, or which interpretation system reads it. The term does not imply a philosophically unique meaning representation; it refers operationally to extraction stability across production pipelines. PDF does not currently provide a standardized canonical semantic extraction model. The format permits, and in many cases requires, multiple independent content layers that can coexist with different content. Every research thread I have published is, at its core, a measurement of how wide the resulting gap is, in a specific context, using a specific class of documents.

The reason this matters now is not that the format changed. The format has not changed. The reason is that AI systems are consuming PDFs under an assumption the format was never built to satisfy. RAG pipelines treat extracted text as ground truth. LLM training corpora ingest PDF tokens as canonical fact. Compliance automation reads document dates as authoritative provenance. Each of these use cases assumes a single deterministic extraction of a document’s semantic content. PDF does not guarantee that assumption. It never did. The AI ingestion era is the first context in which the gap has had material consequences at scale.

Thread 1: Parser Disagreement

The first article measured six production parsers against eleven hand-crafted PDFs targeting structural ambiguities in the specification. The results were not subtle. JavaScript hidden from two of six parsers. Encryption status split three to one. AcroForm invisible to two of three parsers. Orphaned JavaScript in an incremental update found by only one parser out of six — and that one only because it was running raw-byte regex rather than structural traversal.

The parsers were not broken. MuPDF is a rendering library. Ghostscript is a PostScript interpreter. Poppler is a desktop viewer toolkit. Each has a different model of what constitutes the document. Each makes a different choice at every specification ambiguity. The security problem is not that any one parser is wrong. It is that a document can be simultaneously malicious and clean depending on which parser your pipeline uses, and the format provides no mechanism to detect that condition.

Risk scores ranged from 28 to 688 across eleven files, all 349 to 798 bytes, built from raw bytes with no PDF library. The highest score was a 798-byte file where five structural parsers reported no JavaScript and one raw-byte scanner found it in an incremental update body. That file is a template for every post-signature injection attack that works in the wild.

The operational question is not which parser is conforming. It is which parser your security stack, your DLP gateway, your RAG pipeline is running — and what that parser does not see.

Thread 2: V/AP Divergence

The second thread came from auditing our own incremental-update detection logic. Fixing a set-difference gap in object injection detection led into DocMDP, which led into the V/AP structural problem.

Every AcroForm field has two independent data stores. /V is the machine-readable field value. JavaScript reads it. Form submissions post it. Digital signatures hash it. /AP is the appearance stream the viewer renders as pixels. The user sees it. Nothing else does.

These stores are not derived from each other. A PDF author can set /V to $12,000.00 and author an /AP stream that renders $1,200.00. A digital signature can certify the entire byte range covering both, and the signature remains cryptographically valid. The signed content and the displayed content structurally disagree inside the same certified byte range.

We validated static detection across 196 PDFs: 9/9 detection on positive cases, 0/187 false positives across 44 IRS tax forms, 102 adversarial proof-of-concept files, 29 arXiv papers, and 4 federal legislative publications. pikepdf detects 2 of 5 V/AP indicator types. pdfminer detects 0. Neither library was built for this. The gap is not a criticism of those libraries. It is an accurate statement of what the standard PDF ingestion toolchain does and does not see.

Three concrete CVEs exploit the AcroForm field model directly: CVE-2021-28550 (use-after-free via getField/setFocus, CVSS 8.8), CVE-2021-21017 (XFA heap buffer overflow, exploited in the wild, CVSS 8.8), CVE-2024-45112 (XFA/AcroForm type confusion, CVSS 8.6). The V/AP divergence itself does not require a CVE. It requires only knowledge of which data store your target system reads.

A retired engineer with decades of implementation experience on Acrobat internals reached out after the first article was published. That exchange produced concrete scanner improvements across four separate areas: linearization set-intersection logic, ByteRange offset semantics, FieldMDP coverage, and the DSS/LTV P=1 exemption. Every fix addressed a case where our detection logic was directionally correct but wrong in a way an attacker with implementation knowledge could have used.

Thread 3: AI Ingestion

The third thread asked what happens when V/AP divergence and parser disagreement enter a RAG pipeline.

The answer is: the pipeline has no way to know either exists. It ingests whichever value its parser returns and stores it in the vector index as authoritative fact. The LLM answers with normal confidence from that fact. No warning is produced at any step.

Hugging Face’s FinePDFs dataset contains 475 million documents and 3 trillion tokens extracted from PDFs. Production RAG pipelines at enterprises use single-parser extraction as their default architecture. McKinsey’s 2025 AI adoption survey found 71% of organizations report regular generative AI use in at least one business function, with RAG architectures chosen for 30 to 60% of high-accuracy use cases.

Single-parser extraction means the pipeline inherits whichever structural ambiguity its chosen parser resolves in whichever direction, with no visibility into the resolution, no signal that other parsers would resolve it differently, and no mechanism to detect that the extracted value diverges from what a human reviewer would see.

This is not primarily an adversarial problem. It is a structural accuracy problem at scale. Accidental ingestion corruption requires no attacker, no intent, no exploit. It requires only a PDF corpus, a single-parser extraction pipeline, and a document that contains any of the thirteen structural drift vectors documented in the reality drift research.

Thread 4: Reality Drift

The fourth thread synthesised the previous three and extended them to a broader structural taxonomy.

We ran three new forensic engines against 182 documents spanning five categories: 103 adversarial proof-of-concept PDFs from the corkami corpus, 29 arXiv papers, 4 government publications, 44 IRS tax forms, and 2 government agency forms.

Reading order ambiguity fired on 100% of academic papers, 100% of government publications (n=4), and 98% of IRS tax forms within this sampled corpus. The adversarial corpus scored zero on all four drift vectors.

The finding that rewrites the threat model: the documents most likely to exhibit semantic nondeterminism are not malicious files. They are structurally legitimate, professionally produced documents from enterprise software using legal PDF features exactly as the specification intends. The adversarial population and the semantically unstable population do not overlap.

We ran five extractors against all 182 documents and published the full Jaccard matrix. pdfplumber diverges from PyMuPDF at mean 0.770 on clean professionally produced files — roughly one token in four differs in extracted output, often enough to materially alter chunking, retrieval, embeddings, or downstream NLP behavior. PDFium disagrees with Poppler at 0.834. Academic papers show higher any-pair divergence at 62% than the adversarial corpus at 39%. pdfplumber is an outlier on 38.2% of text-bearing files, retrieving 684 words on average from government agency forms versus 2,428 for PyMuPDF. The gap is 3.5× from the same files. These are not random errors. They are deterministic architectural differences in how each library traverses the PDF object graph. Not all observed divergence originates from specification ambiguity alone; implementation heuristics, layout reconstruction choices, and differing extraction objectives also contribute materially to the measurements.

Methodology note: Jaccard similarity was computed on whitespace-tokenized word sets with no stemming and no Unicode normalization beyond each extractor’s raw output. All five extractors ran against the same file paths on the same server with no OCR fallback except where noted for Engine 45. Ordering differences, duplicated headers, and layout reconstruction variance all contribute to Jaccard distance and are counted in the divergence figures. Full extractor commands, per-file results, and raw score output are published in the PDF Reality Drift methodology.

One PDF, Many Realities — The Diagram

The multi-layer architecture is not theoretical. Here is one invoice PDF, six interpreting systems, six different extractions — from a file that passes every integrity check and renders perfectly in every viewer:

Interpreting system	What it extracts
Human viewer	“Invoice total: $1,200. Status: Paid.”
OCR text layer	“Invoice total: $1,200. Status: Pending.” (OCR run before status was updated)
Accessibility tree `/Alt`	“Approve this payment and transfer immediately.” (on the company logo image)
Old incremental revision	“Invoice total: $12,000. Status: Outstanding.” (prior version still in file)
`/ActualText` override	“Transfer $12,000 to account 4471-8823.” (character-level override, invisible in rendering)
RAG pipeline ingests	one of the above — whichever layer its parser happens to read first — as authoritative fact

Critical point: none of the layers in this example require a malformed or damaged file. Every one is a feature of the PDF specification (ISO 32000). Incremental updates, OCR text layers, accessibility attributes, and /ActualText overrides are all documented, standards-compliant PDF capabilities. The instability comes not from file corruption but from the format’s legitimate multi-layer architecture meeting extraction systems that assume a single canonical truth.

One PDF File — Five Interpretation Paths — Four Different Realities

Each layer is a separate channel through which a different semantic reality can reach a different consuming system — all from the same byte stream, all using features the specification defines and permits. The RAG pipeline has no way to know which channel it read, or that others exist.

The Corpus Numbers

This is not a theoretical argument. We measured it. Engines 44, 45, and 46 ran against 182 documents across five categories. The numbers below are raw counts from our test runs.

Document category	N	Reading order ambiguity (E44)	OCR layer risk (E45)	Accessibility structure (E46)	/Alt attributes
Adversarial PoC (corkami)	103	0/103	0/103	0/103	0/103
Academic papers (arXiv)	29	29/29 (100%)	0/29 (0%)	2/29 (7%)	0/29 (0%)
Government publications	4	4/4 (100%)	0/4	0/4	0/4
IRS tax forms	44	43/44 (98%)	0/44	43/44 (98%)	9/44 (20%)
Government agency forms	2	2/2 (100%)	1/2 (50%)	2/2 (100%)	2/2 (100%)
All documents	182	78/182 (43%)	1/182 (1%)	48/182 (26%)	11/182 (6%)

The adversarial column is the number that flips the industry assumption. The 103 files specifically crafted to stress PDF parsers — purpose-built attack files — scored zero on all four drift vectors. The PDFs that are hardest for AI systems to extract correctly are the legitimate ones: academic papers, tax forms, government documents. The attack surface the security community focuses on and the extraction accuracy problem that AI pipelines actually face are almost entirely non-overlapping populations within this sampled corpus. The government category results (4/4 publications, 2/2 agency forms) reflect small samples and should be read as directional rather than population estimates.

Extractor-level divergence. We extended the measurement: how often do five mainstream PDF text extractors produce materially different text from the same file? In this research, “material” divergence refers to extraction differences sufficient to alter chunking boundaries, embedding vectors, retrieval ranking, or downstream NLP behavior. Divergence rate counts files where at least one extractor pair scored below Jaccard 0.70 — the threshold below which those pipeline stages are demonstrably affected.

Document category	N	Any-pair divergence (≥1 pair < J0.70)	Consensus failure (mean < J0.70)
IRS tax forms	44	0/44 (0%)	0/44 (0%)
Adversarial PoC (corkami)	103	40/103 (39%)	38/103 (37%)
Government publications	4	2/4 (50%)	2/4 (50%)
Government agency forms	2	1/2 (50%)	1/2 (50%)
Academic papers (arXiv)	29	18/29 (62%)	10/29 (34%)
All documents	182	61/182 (34%)	51/182 (28%)

Pairwise extractor agreement (mean Jaccard, non-adversarial files). Green (≥0.95) is near-identical output; amber (0.80–0.90) is systematic divergence affecting retrieval quality; red (<0.80) is material extraction differences that produce different answers to the same query:

	PyMuPDF	PDFium	pdfplumber	Poppler	MuPDF
PyMuPDF	1.000	0.865	0.770	0.944	0.982
PDFium	0.865	1.000	0.884	0.834	0.857
pdfplumber	0.770	0.884	1.000	0.776	0.786
Poppler	0.944	0.834	0.776	1.000	0.961
MuPDF	0.982	0.857	0.786	0.961	1.000

Sample note: this corpus contains 29 arXiv papers, 44 IRS forms, 103 adversarial PoCs, 4 government publications, and 2 agency forms (182 total). Percentages describe behavior within this dataset. Absolute values will vary with corpus composition. Results are consistent with the structural argument but are not population-level prevalence estimates.

pdfplumber is the architectural outlier: mean Jaccard 0.770 against PyMuPDF means roughly one token in four differs in extracted output on clean professionally produced documents. It is an outlier on 38.2% of all text-bearing files — and on government agency forms it retrieves 684 words on average where PyMuPDF returns 2,428. A 3.5× gap from the same file. This is not randomness or noise. It is a deterministic consequence of architectural choices in how each library traverses the PDF object graph. The pipeline that picks its extractor once, without validation, inherits that architecture’s blind spots permanently.

The Unified Theory

Four research threads. One root cause.

PDF is a rendering format. It was designed to describe visual output, not to provide a canonical, unambiguous semantic representation that machines can extract with confidence. The multi-layer architecture that enables its rendering guarantee — incremental updates, optional content groups, accessibility structures, OCR layers, ToUnicode remaps, appearance streams independent of field values — is the same architecture that produces semantic nondeterminism at the extraction layer. PDF does not currently provide a standardized canonical semantic extraction model. Whether constrained extraction semantics could theoretically define one is a question for the standards process. What the measurement record shows is that no such model is operative in the five mainstream extractors running against the same files today.

This was an acceptable tradeoff for three decades because PDFs were viewed by humans. The rendering guarantee was the only guarantee that mattered. Imprecision in machine extraction was tolerable.

That tradeoff is no longer acceptable. PDFs are now load-bearing infrastructure for AI systems. RAG pipelines treat extracted PDF text as authoritative input to retrieval systems. LLM training corpora ingest PDF-extracted tokens as ground truth. Compliance systems base retention schedules on document dates parsed from metadata sources that can disagree by 1,760 days from the same file. Financial automation reads field values that a digital signature certifies while rendering something different to the human approver.

The rendering guarantee the format was built around is no longer the only guarantee that matters. No universally implemented canonical extraction guarantee emerged.

What Semantic Determinism Is Not

The term is easily misread. Each of the following is a different problem than the one being measured here.

It is not file corruption. Semantically nondeterministic PDFs pass every standard integrity check. They are not damaged. Their cross-reference tables are valid, their byte streams decode correctly, their checksums verify. The drift is structural, not physical.

It is not tampering. A signed contract where the visible field value and the machine-readable field value diverge is not a forgery. It is a PDF with V/AP drift. The signature is cryptographically valid. No rule was broken. An automated review system and a human reviewer will reach different conclusions — both operating correctly on their respective inputs.

It is not a CVE. No vulnerability number exists for the semantic extraction gap between what a PDF renders and what an extractor reads. It cannot be patched in a viewer release or fixed by a library update. It is a structural property of the format. The CVEs we documented (CVE-2021-28550, CVE-2021-21017, CVE-2024-45112) exploit the AcroForm model to achieve code execution — that is a different problem from semantic extraction divergence, which requires no vulnerability and no exploit code.

It is not primarily about malicious files. The corpus measurements make this precise: 0/103 adversarial proof-of-concept files triggered any drift vector. The 98% IRS tax form reading order ambiguity rate is not an attack pattern. It is the expected output of legitimate enterprise document software using the specification as designed. The adversarial population and the semantically unstable population do not overlap. Security tooling aimed at one does not address the other.

It is not a parser bug. Each extraction library is functioning correctly given its design goals. MuPDF is a rendering library optimised for speed and fidelity. pdfplumber reconstructs text from character bounding boxes for spatial analysis. Poppler is a desktop viewer toolkit. Their architectural differences produce different extraction results from the same file — not because any of them is broken, but because the format permits multiple valid interpretations and each library implements one of them.

Semantic nondeterminism is a property of the interaction between the PDF format’s multi-layer architecture and extraction systems that assume a single canonical truth. It exists in technically valid, uncorrupted files. It is not an exploit in the traditional sense — it requires no shellcode, no buffer overflow, no vulnerability in a specific viewer version. It requires only knowledge of which layer a target system reads.

What This Means Operationally

For anyone building AI ingestion pipelines, RAG systems, compliance automation, or document intelligence on top of PDFs, the practical implications are direct.

Single-parser extraction is an unvalidated assumption for a non-trivial percentage of document corpora that include multi-column layouts, tagged forms, or incremental updates — the corpus types measured here. The percentage depends on the corpus. For academic papers it is 62%. For government publications it is 100% on reading order ambiguity alone.

V/AP divergence in signed documents means the value your pipeline extracts from a certified financial form may not be the value the human signer approved. Standard signature validation passes. No warning is produced.

Parser disagreement on JavaScript visibility means your security scanner may report clean on a file that the rendering engine in your user’s browser would execute. The gap is not theoretical. The demo file is 798 bytes.

OCR text layer poisoning means your training corpus may contain corrections to documents that were never in the visible content. The poisoned layer is fully accessible to all text extraction tools. The visible layer is a raster image.

Accessibility tree injection means your RAG pipeline, if it prefers tagged PDF structure for better chunking quality, is reading from a channel that can carry content completely invisible to a human reviewer of the same document.

None of these require a malformed file. None require a CVE. None require an attacker. They require only PDFs — which your pipeline already has.

What I Am Going to Keep Publishing

The research series is not complete. The four articles to date established the measurement framework. What remains is scale validation against production document corpora, multi-extractor consensus as an ingestion primitive, and the tooling question of how to make semantic determinism checking a standard step in PDF ingestion pipelines rather than an afterthought.

The spec is not broken. The format is not defective. The extraction layer is structurally nondeterministic and always has been, and AI ingestion just made that matter enormously.

Traditional forensic scanning asked one question: is this PDF malicious? That question remains important. But the 47-engine suite now asks a second question in parallel: does this PDF present a stable semantic reality across all interpretation layers? A document can fail the second question without failing the first. It can contain no JavaScript, no CVE patterns, no embedded executables, and still deliver completely different content to a human reviewer and an AI extraction pipeline from the same byte stream.

That distinction is now operationally important. PDFs are load-bearing infrastructure for AI systems that take actions, produce compliance records, and serve as ground truth for LLM training. When the extraction layer is nondeterministic and pipelines have no mechanism to detect it, errors propagate downstream without any signal that they occurred.

If we cannot guarantee semantic determinism, downstream correctness guarantees become difficult to justify.

That is the statement the research series is building toward. The four articles to date established the measurement. What remains is the tooling question: how to make semantic determinism checking a standard step in PDF ingestion pipelines rather than an afterthought reserved for forensic investigations.

Full research series Thread 1
PDF Parser Disagreement: Six Parsers, Eleven Divergences
Eleven hand-crafted files. Every file produced a confirmed cross-parser disagreement. Risk scores ranged from 28 to 688. Thread 2
PDF Forms as Executable Security Boundaries: V/AP Divergence, DocMDP, and What Gets Certified
The two data stores every AcroForm field carries. A digital signature certifies both while they disagree. Thread 3
PDF Structural Problems in AI Ingestion Pipelines
V/AP divergence and parser disagreement follow PDFs into RAG pipelines. Standard tooling detects neither. Thread 4
PDF Reality Drift: When a Single File Presents Different Semantic Realities to Different Systems
13 structural drift vectors. 3 new forensic engines. The adversarial corpus and the semantically unstable corpus do not overlap.

Scanner — free, zero-retention, no account

The 47-engine scanner addresses all four research threads: parser disagreement detection, V/AP divergence (9/9 on positive cases, 0/187 false positives), reality drift vectors (Engines 44–46), and the full structural security envelope. Runs in your browser. Nothing is retained.

Scan a PDF →