New — population-scale companion: PDF Forensics at Scale runs the scanner against 1,572 real-world PDFs — including 400 live malware samples — reporting live-malware detection, the real-world false-positive rate, and the files that crash a scanner (and how the engine was hardened).
The Root Cause
PDF was designed to answer one question reliably: does this document look the same on any device? It answers that question exceptionally well. But visual fidelity and semantic determinism are different properties. The format guarantees the first. It makes no provision for the second.
Semantic determinism, as used in this article, means that a document returns the same extractable semantic representation regardless of which layer, which parser, or which interpretation system reads it. The term is operational: it refers to stable cross-extractor semantic equivalence under real-world extraction pipelines, not to a philosophically unique meaning representation. PDF does not currently provide a standardized canonical semantic extraction model. The format permits, and in many cases requires, multiple independent content layers that can coexist with different content. Every research thread I have published is, at its core, a measurement of how wide the resulting gap is, in a specific context, using a specific class of documents.
The reason this matters now is not that the format changed. The format has not changed. The reason is that AI systems are consuming PDFs under an assumption the format was never built to satisfy. RAG pipelines treat extracted text as ground truth. LLM training corpora ingest PDF tokens as canonical fact. Compliance automation reads document dates as authoritative provenance. Each of these use cases assumes a single deterministic extraction of a document’s semantic content. PDF does not guarantee that assumption. It never did. The AI ingestion era is the first context in which the gap has had material consequences at scale.
Thread 1: Parser Disagreement
The first article measured six production parsers against eleven hand-crafted PDFs targeting structural ambiguities in the specification. The results were not subtle. JavaScript hidden from two of six parsers. Encryption status split three to one. AcroForm invisible to two of three parsers. Orphaned JavaScript in an incremental update found by only one parser out of six — and that one only because it was running raw-byte regex rather than structural traversal.
The parsers were not broken. MuPDF is a rendering library. Ghostscript is a PostScript interpreter. Poppler is a desktop viewer toolkit. Each has a different model of what constitutes the document. Each makes a different choice at every specification ambiguity. The security problem is not that any one parser is wrong. It is that a document can be simultaneously malicious and clean depending on which parser your pipeline uses, and the format provides no mechanism to detect that condition.
Risk scores ranged from 28 to 688 across eleven files, all 349 to 798 bytes, built from raw bytes with no PDF library. The highest score was a 798-byte file where five structural parsers reported no JavaScript and one raw-byte scanner found it in an incremental update body. That file is a template for every post-signature injection attack that works in the wild.
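The raw-byte approach that caught the orphaned script can be sketched in a few lines. This is a minimal illustration, not the scanner's actual logic. It assumes uncompressed objects and treats everything after the first `%%EOF` marker as appended revision data. The function name `scan_appended_revisions` is hypothetical.

```python
import re

# /JS or /JavaScript as a complete PDF name (not a prefix of a longer name).
JS_TOKEN = re.compile(rb"/(?:JS|JavaScript)(?![A-Za-z0-9])")
EOF_MARKER = b"%%EOF"


def scan_appended_revisions(data: bytes) -> dict:
    """Count %%EOF markers and look for JavaScript names after the first one.

    Assumption: bytes after the first %%EOF belong to incremental updates.
    Objects inside compressed object streams are invisible to this scan.
    """
    first_eof = data.find(EOF_MARKER)
    if first_eof == -1:
        return {"revisions": 0, "js_in_update": False}
    tail = data[first_eof + len(EOF_MARKER):]
    return {
        "revisions": data.count(EOF_MARKER),
        "js_in_update": bool(JS_TOKEN.search(tail)),
    }
```

A structural parser that follows only the final cross-reference table never visits an object nothing points to. A byte scan has no such blind spot, but it cannot tell a live object from a dead one. Linearized files also carry more than one `%%EOF` without having been updated, so the revision count here is a hint, not a verdict.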
Thread 2: V/AP Divergence
The second thread came from auditing our own incremental-update detection logic. Fixing a set-difference gap in object injection detection led into DocMDP, which led into the V/AP structural problem.
Every AcroForm field has two independent data stores.
/V is the machine-readable field value. JavaScript reads it. Form submissions post it. Digital signatures hash it. /AP is the appearance stream the viewer renders as pixels. The user sees it. Nothing else does.
These stores are not derived from each other. A PDF author can set /V to $12,000.00 and author an /AP stream that renders $1,200.00. A digital signature can certify the entire byte range covering both, and the signature remains cryptographically valid. The signed content and the displayed content structurally disagree inside the same certified byte range.
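A simplified sketch of the comparison follows. It assumes the /V string and the decoded (already decompressed) /AP content stream have been obtained with whatever PDF library the pipeline uses. It handles only literal strings drawn with the `Tj` operator. `TJ` arrays, hex strings, escape sequences, and font encodings are ignored. The function names are hypothetical and this is not the detection logic validated below.

```python
import re

# Literal strings shown with the Tj operator, e.g. "($1,200.00) Tj".
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def appearance_text(ap_stream: bytes) -> str:
    """Concatenate the literal strings drawn by Tj in a decoded appearance stream."""
    parts = [m.group(1) for m in TJ_LITERAL.finditer(ap_stream)]
    return b"".join(parts).decode("latin-1")


def _strip_whitespace(text: str) -> str:
    return "".join(text.split())


def v_ap_diverges(field_value: str, ap_stream: bytes) -> bool:
    """True when the rendered text does not match the /V value (whitespace ignored)."""
    return _strip_whitespace(field_value) != _strip_whitespace(appearance_text(ap_stream))


# Hypothetical appearance stream for a text field that renders $1,200.00.
example_ap = b"/Tx BMC q BT /Helv 12 Tf 0 g 2 4 Td ($1,200.00) Tj ET Q EMC"
diverges = v_ap_diverges("$12,000.00", example_ap)
```

A plain string comparison will over-report on fields whose appearance is legitimately formatted differently from the stored value (dates, currency formatting scripts), so a real check would need per-field normalization.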
We validated static detection across 196 PDFs: 9/9 detection on positive cases, 0/187 false positives across 44 IRS tax forms, 102 adversarial proof-of-concept files, 29 arXiv papers, and 4 federal legislative publications. pikepdf detects 2 of 5 V/AP indicator types. pdfminer detects 0. Neither library was built for this. The gap is not a criticism of those libraries. It is an accurate statement of what the standard PDF ingestion toolchain does and does not see.
Three concrete CVEs exploit the AcroForm field model directly: CVE-2021-28550 (use-after-free via getField/setFocus, CVSS 8.8), CVE-2021-21017 (XFA heap buffer overflow, exploited in the wild, CVSS 8.8), CVE-2024-45112 (XFA/AcroForm type confusion, CVSS 8.6). The V/AP divergence itself does not require a CVE. It requires only knowledge of which data store your target system reads.
A retired engineer with decades of implementation experience on Acrobat internals reached out after the first article was published. That exchange produced concrete scanner improvements across four separate areas: linearization set-intersection logic, ByteRange offset semantics, FieldMDP coverage, and the DSS/LTV P=1 exemption. Every fix addressed a case where our detection logic was directionally correct but wrong in a way an attacker with implementation knowledge could have used.
Thread 3: AI Ingestion
The third thread asked what happens when V/AP divergence and parser disagreement enter a RAG pipeline.
The answer is: the pipeline has no way to know either exists. It ingests whichever value its parser returns and stores it in the vector index as authoritative fact. The LLM answers with normal confidence from that fact. No warning is produced at any step.
Hugging Face’s FinePDFs dataset contains 475 million documents and 3 trillion tokens extracted from PDFs. Production RAG pipelines at enterprises use single-parser extraction as their default architecture. McKinsey’s 2025 AI adoption survey found 71% of organizations report regular generative AI use in at least one business function, with RAG architectures chosen for 30 to 60% of high-accuracy use cases.
Single-parser extraction means the pipeline inherits whichever structural ambiguity its chosen parser resolves in whichever direction, with no visibility into the resolution, no signal that other parsers would resolve it differently, and no mechanism to detect that the extracted value diverges from what a human reviewer would see.
Thread 4: Reality Drift
The fourth thread synthesised the previous three and extended them to a broader structural taxonomy.
We ran three new forensic engines against 182 documents spanning five categories: 103 adversarial proof-of-concept PDFs from the corkami corpus, 29 arXiv papers, 4 government publications, 44 IRS tax forms, and 2 government agency forms.
Reading order ambiguity fired on 100% of academic papers, 100% of government publications (n=4), and 98% of IRS tax forms within this sampled corpus. The adversarial corpus scored zero on all four drift vectors.
We ran five extractors against all 182 documents and published the full Jaccard matrix. pdfplumber diverges from PyMuPDF at mean 0.770 on clean professionally produced files — roughly one token in four differs in extracted output, often enough to materially alter chunking, retrieval, embeddings, or downstream NLP behavior. PDFium disagrees with Poppler at 0.834. Academic papers show higher any-pair divergence at 62% than the adversarial corpus at 39%. pdfplumber is an outlier on 38.2% of text-bearing files, retrieving 684 words on average from government agency forms versus 2,428 for PyMuPDF. The gap is 3.5× from the same files. These are not random errors. They are deterministic architectural differences in how each library traverses the PDF object graph. Not all observed divergence originates from specification ambiguity alone; implementation heuristics, layout reconstruction choices, and differing extraction objectives also contribute materially to the measurements.
Methodology note: Jaccard similarity was computed on whitespace-tokenized word sets with no stemming and no Unicode normalization beyond each extractor’s raw output. All five extractors ran against the same file paths on the same server with no OCR fallback except where noted for Engine 45. Ordering differences, duplicated headers, and layout reconstruction variance all contribute to Jaccard distance and are counted in the divergence figures. Full extractor commands, per-file results, and raw score output are published in the PDF Reality Drift methodology.
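The metric described in the methodology note can be written down directly. This is a minimal sketch of Jaccard similarity on whitespace-tokenized word sets with no stemming or normalization. Treating two empty outputs as identical is an assumption made here, not something the methodology states.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the whitespace-tokenized word sets of two extractions."""
    tokens_a = set(text_a.split())
    tokens_b = set(text_b.split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # assumption: two empty outputs count as identical
    return len(tokens_a & tokens_b) / len(union)


# Two hypothetical extractions of the same line that differ in one word.
score = jaccard(
    "Invoice total: $1,200. Status: Paid.",
    "Invoice total: $1,200. Status: Pending.",
)
```

With five tokens on each side and four shared, the example scores 4/6, which shows how quickly a single differing token moves the score on short texts.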
One PDF, Many Realities — The Diagram
The multi-layer architecture is not theoretical. Here is one invoice PDF, six interpreting systems, six different extractions — from a file that passes every integrity check and renders perfectly in every viewer:
| Interpreting system | What it extracts |
|---|---|
| Human viewer | “Invoice total: $1,200. Status: Paid.” |
| OCR text layer | “Invoice total: $1,200. Status: Pending.” (OCR run before status was updated) |
| Accessibility tree /Alt | “Approve this payment and transfer immediately.” (on the company logo image) |
| Old incremental revision | “Invoice total: $12,000. Status: Outstanding.” (prior version still in file) |
| /ActualText override | “Transfer $12,000 to account 4471-8823.” (character-level override, invisible in rendering) |
| RAG pipeline ingests | one of the above — whichever layer its parser happens to read first — as authoritative fact |
Critical point: none of the layers in this example require a malformed or damaged file. Every one is a feature of the PDF specification (ISO 32000). Incremental updates, OCR text layers, accessibility attributes, and /ActualText overrides are all documented, standards-compliant PDF capabilities. The instability comes not from file corruption but from the format’s legitimate multi-layer architecture meeting extraction systems that assume a single canonical truth.
Each layer is a separate channel through which a different semantic reality can reach a different consuming system — all from the same byte stream, all using features the specification defines and permits. The RAG pipeline has no way to know which channel it read, or that others exist.
The Corpus Numbers
This is not a theoretical argument. We measured it. Engines 44, 45, and 46 ran against 182 documents across five categories. The numbers below are raw counts from our test runs.
| Document category | N | Reading order ambiguity (E44) | OCR layer risk (E45) | Accessibility structure (E46) | /Alt attributes |
|---|---|---|---|---|---|
| Adversarial PoC (corkami) | 103 | 0/103 | 0/103 | 0/103 | 0/103 |
| Academic papers (arXiv) | 29 | 29/29 (100%) | 0/29 (0%) | 2/29 (7%) | 0/29 (0%) |
| Government publications | 4 | 4/4 (100%) | 0/4 | 0/4 | 0/4 |
| IRS tax forms | 44 | 43/44 (98%) | 0/44 | 43/44 (98%) | 9/44 (20%) |
| Government agency forms | 2 | 2/2 (100%) | 1/2 (50%) | 2/2 (100%) | 2/2 (100%) |
| All documents | 182 | 78/182 (43%) | 1/182 (1%) | 48/182 (26%) | 11/182 (6%) |
The adversarial row is the number that flips the industry assumption. The 103 files specifically crafted to stress PDF parsers, purpose-built attack files, scored zero on all four drift vectors. The PDFs that are hardest for AI systems to extract correctly are the legitimate ones: academic papers, tax forms, government documents. The attack surface the security community focuses on and the extraction accuracy problem that AI pipelines actually face are almost entirely non-overlapping populations within this sampled corpus. The government category results (4/4 publications, 2/2 agency forms) reflect small samples and should be read as directional rather than population estimates.
Extractor-level divergence. We extended the measurement: how often do five mainstream PDF text extractors produce materially different text from the same file? In this research, “material” divergence refers to extraction differences sufficient to alter chunking boundaries, embedding vectors, retrieval ranking, or downstream NLP behavior. Divergence rate counts files where at least one extractor pair scored below Jaccard 0.70 — the threshold below which those pipeline stages are demonstrably affected.
| Document category | N | Any-pair divergence (≥1 pair < J0.70) | Consensus failure (mean < J0.70) |
|---|---|---|---|
| IRS tax forms | 44 | 0/44 (0%) | 0/44 (0%) |
| Adversarial PoC (corkami) | 103 | 40/103 (39%) | 38/103 (37%) |
| Government publications | 4 | 2/4 (50%) | 2/4 (50%) |
| Government agency forms | 2 | 1/2 (50%) | 1/2 (50%) |
| Academic papers (arXiv) | 29 | 18/29 (62%) | 10/29 (34%) |
| All documents | 182 | 61/182 (34%) | 51/182 (28%) |
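The two columns above can be sketched as a per-file classification. Assumptions are labeled in the code: "consensus failure" is read here as the mean over all extractor pairs falling below the threshold, and the function name and extractor labels are hypothetical.

```python
from itertools import combinations
from statistics import mean

THRESHOLD = 0.70  # the J0.70 cut-off used in the table


def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity on whitespace-tokenized word sets."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def divergence_flags(outputs: dict, threshold: float = THRESHOLD) -> dict:
    """Classify one file from a mapping of extractor name -> extracted text."""
    if len(outputs) < 2:
        raise ValueError("need at least two extractor outputs to compare")
    scores = {
        (a, b): jaccard(outputs[a], outputs[b])
        for a, b in combinations(sorted(outputs), 2)
    }
    return {
        # At least one extractor pair disagrees materially.
        "any_pair_divergence": any(s < threshold for s in scores.values()),
        # Assumed reading: the mean over all pairs is below the threshold.
        "consensus_failure": mean(scores.values()) < threshold,
        "pair_scores": scores,
    }
```

A file can trip the first flag without the second when a single extractor is the odd one out and the rest agree, which matches the gap between the two columns for academic papers.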
Pairwise extractor agreement (mean Jaccard, non-adversarial files). Green (≥0.95) is near-identical output; amber (0.80–0.90) is systematic divergence affecting retrieval quality; red (<0.80) is material extraction differences that produce different answers to the same query:
| | PyMuPDF | PDFium | pdfplumber | Poppler | MuPDF |
|---|---|---|---|---|---|
| PyMuPDF | 1.000 | 0.865 | 0.770 | 0.944 | 0.982 |
| PDFium | 0.865 | 1.000 | 0.884 | 0.834 | 0.857 |
| pdfplumber | 0.770 | 0.884 | 1.000 | 0.776 | 0.786 |
| Poppler | 0.944 | 0.834 | 0.776 | 1.000 | 0.961 |
| MuPDF | 0.982 | 0.857 | 0.786 | 0.961 | 1.000 |
Sample note: this corpus contains 29 arXiv papers, 44 IRS forms, 103 adversarial PoCs, 4 government publications, and 2 agency forms (182 total). Percentages describe behavior within this dataset. Absolute values will vary with corpus composition. Results are consistent with the structural argument but are not population-level prevalence estimates.
pdfplumber is the architectural outlier: mean Jaccard 0.770 against PyMuPDF means roughly one token in four differs in extracted output on clean professionally produced documents. It is an outlier on 38.2% of all text-bearing files — and on government agency forms it retrieves 684 words on average where PyMuPDF returns 2,428. A 3.5× gap from the same file. This is not randomness or noise. It is a deterministic consequence of architectural choices in how each library traverses the PDF object graph. The pipeline that picks its extractor once, without validation, inherits that architecture’s blind spots permanently.
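One way to see the outlier in the published matrix is to average each extractor's agreement with the other four. The values below are copied from the table above. The row-mean summary is an illustration added here, not the per-file outlier definition behind the 38.2% figure.

```python
from statistics import mean

EXTRACTORS = ["PyMuPDF", "PDFium", "pdfplumber", "Poppler", "MuPDF"]

# Mean pairwise Jaccard on non-adversarial files, as published in the table.
MATRIX = [
    [1.000, 0.865, 0.770, 0.944, 0.982],
    [0.865, 1.000, 0.884, 0.834, 0.857],
    [0.770, 0.884, 1.000, 0.776, 0.786],
    [0.944, 0.834, 0.776, 1.000, 0.961],
    [0.982, 0.857, 0.786, 0.961, 1.000],
]


def mean_agreement(names, matrix):
    """Average each row of the matrix, skipping the diagonal self-comparison."""
    return {
        name: mean(value for j, value in enumerate(row) if j != i)
        for i, (name, row) in enumerate(zip(names, matrix))
    }


agreement = mean_agreement(EXTRACTORS, MATRIX)
lowest = min(agreement, key=agreement.get)
```

On these figures pdfplumber has the lowest mean agreement with the rest of the field, at about 0.80, while PyMuPDF and MuPDF sit near 0.89 to 0.90.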
The Unified Theory
Four research threads. One root cause.
PDF is a rendering format. It was designed to describe visual output, not to provide a canonical, unambiguous semantic representation that machines can extract with confidence. The multi-layer architecture that enables its rendering guarantee — incremental updates, optional content groups, accessibility structures, OCR layers, ToUnicode remaps, appearance streams independent of field values — is the same architecture that produces semantic nondeterminism at the extraction layer. PDF does not currently provide a standardized canonical semantic extraction model. Whether constrained extraction semantics could theoretically define one is a question for the standards process. What the measurement record shows is that no such model is operative in the five mainstream extractors running against the same files today.
This was an acceptable tradeoff for three decades because PDFs were viewed by humans. The rendering guarantee was the only guarantee that mattered. Imprecision in machine extraction was tolerable.
That tradeoff is no longer acceptable. PDFs are now load-bearing infrastructure for AI systems. RAG pipelines treat extracted PDF text as authoritative input to retrieval systems. LLM training corpora ingest PDF-extracted tokens as ground truth. Compliance systems base retention schedules on document dates parsed from metadata sources that can disagree by 1,760 days from the same file. Financial automation reads field values that a digital signature certifies while rendering something different to the human approver.
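The date disagreement can be measured with a few lines of standard-library code. This sketch assumes the two sources are the Info dictionary (PDF date strings of the form `D:YYYYMMDD...`) and the XMP packet (ISO 8601 strings with a full `YYYY-MM-DD` date), and that both strings have already been read out of the file. Time of day and timezone are ignored. The function names and sample values are hypothetical.

```python
import re
from datetime import datetime

# Leading part of a PDF date string: D:YYYY, with optional MM and DD.
PDF_DATE = re.compile(r"^D:(\d{4})(\d{2})?(\d{2})?")


def parse_pdf_date(value: str) -> datetime:
    """Parse the calendar date from an Info-dictionary string such as D:20190304120000Z."""
    match = PDF_DATE.match(value.strip())
    if not match:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day = match.groups()
    # Omitted month and day default to 01.
    return datetime(int(year), int(month or "01"), int(day or "01"))


def parse_xmp_date(value: str) -> datetime:
    """Parse the leading YYYY-MM-DD of an ISO 8601 date (assumes a full date is present)."""
    return datetime.strptime(value.strip()[:10], "%Y-%m-%d")


def date_gap_days(info_date: str, xmp_date: str) -> int:
    """Absolute gap in whole days between the two metadata dates."""
    return abs((parse_pdf_date(info_date) - parse_xmp_date(xmp_date)).days)


# Hypothetical values for one file.
gap = date_gap_days("D:20200101120000Z", "2020-01-11T08:00:00Z")
```

A retention or provenance system could record the gap alongside whichever date it uses, so that a large disagreement is visible instead of silently resolved by parser choice.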
What Semantic Determinism Is Not
The term is easily misread. Each of the following is a different problem than the one being measured here.
Semantic nondeterminism is a property of the interaction between the PDF format’s multi-layer architecture and extraction systems that assume a single canonical truth. It exists in technically valid, uncorrupted files. It is not an exploit in the traditional sense — it requires no shellcode, no buffer overflow, no vulnerability in a specific viewer version. It requires only knowledge of which layer a target system reads.
What This Means Operationally
For anyone building AI ingestion pipelines, RAG systems, compliance automation, or document intelligence on top of PDFs, the practical implications are direct.
None of these require a malformed file. None require a CVE. None require an attacker. They require only PDFs — which your pipeline already has.
What I Am Going to Keep Publishing
The research series is not complete. The four articles to date established the measurement framework. What remains is scale validation against production document corpora, multi-extractor consensus as an ingestion primitive, and the tooling question of how to make semantic determinism checking a standard step in PDF ingestion pipelines rather than an afterthought.
The spec is not broken. The format is not defective. The extraction layer is structurally nondeterministic and always has been, and AI ingestion just made that matter enormously.
Traditional forensic scanning asked one question: is this PDF malicious? That question remains important. But the 47-engine suite now asks a second question in parallel: does this PDF present a stable semantic reality across all interpretation layers? A document can fail the second question without failing the first. It can contain no JavaScript, no CVE patterns, no embedded executables, and still deliver completely different content to a human reviewer and an AI extraction pipeline from the same byte stream.
That distinction is now operationally important. PDFs are load-bearing infrastructure for AI systems that take actions, produce compliance records, and serve as ground truth for LLM training. When the extraction layer is nondeterministic and pipelines have no mechanism to detect it, errors propagate downstream without any signal that they occurred.
That is the statement the research series is building toward. The four articles to date established the measurement. What remains is the tooling question: how to make semantic determinism checking a standard step in PDF ingestion pipelines rather than an afterthought reserved for forensic investigations.
Thread 1
PDF Parser Disagreement: Six Parsers, Eleven Divergences
Eleven hand-crafted files. Every file produced a confirmed cross-parser disagreement. Risk scores ranged from 28 to 688.
Thread 2
PDF Forms as Executable Security Boundaries: V/AP Divergence, DocMDP, and What Gets Certified
The two data stores every AcroForm field carries. A digital signature certifies both while they disagree.
Thread 3
PDF Structural Problems in AI Ingestion Pipelines
V/AP divergence and parser disagreement follow PDFs into RAG pipelines. Standard tooling detects neither.
Thread 4
PDF Reality Drift: When a Single File Presents Different Semantic Realities to Different Systems
13 structural drift vectors. 3 new forensic engines. The adversarial corpus and the semantically unstable corpus do not overlap.
The 47-engine scanner addresses all four research threads: parser disagreement detection, V/AP divergence (9/9 on positive cases, 0/187 false positives), reality drift vectors (Engines 44–46), and the full structural security envelope. Runs in your browser. Nothing is retained.