What does “the PDF is not the document” mean?

A PDF is a stack of representations — a rendered page, a hidden text layer, an object graph, a metadata record, and the byte range a signature covers — and they can disagree. The page a human sees is not necessarily what a search index, e-discovery tool, or AI system extracts. Whether the file is malware is only one of the questions you can ask it.

How many PDFs did this study cover?

24,824 real PDFs across three separate corpora: a 1,572-document curated detection set including 400 live malware samples, a 6,281-document GovDocs1 real-world benign control, and the complete 16,971-document DOJ Epstein release. Each corpus measures a different thing, so the figures are reported separately and never blended.

Is document forensics the same as malware scanning?

No. On the entire 16,971-document real-world release the malware axis was empty — zero exploits — yet the scanner surfaced uniform re-processing, a fully stripped but recoverable toolchain, and a machine-readable text layer that diverges from the visible page on 18.6% of documents. The richest findings came from the corpus with no malware at all.

Can three independent engines really agree on the same documents?

Yes, exactly. Three engines sharing no mechanism (an object-graph structural signature, a producer string recovered from orphaned objects, and six-parser version disagreement) each selected the identical 10,499 documents, with 100% agreement (intersection equals union). Independent measurements that converge on the same files confirm, by triangulation, that the production pipeline is real.

Why does this matter for AI and RAG pipelines?

An automated reader consumes the text layer and object graph, not the rendered page. Across these corpora that machine-readable layer frequently differs from what the page shows, the extracted text depends on which parser you use, and a release can be a small slice of its own numbering. Any system that indexes or feeds PDFs to a model inherits all of it.

The PDF Is Not the Document — Forensics Across 24,000 Real PDFs

Executive Summary

Over the past few months we ran the same multi-axis PDF forensic engine against three corpora built for three different purposes. Read together, they make one argument that no single corpus could make alone: document forensics is not malware scanning, and a PDF's most consequential properties are usually the ones a malware verdict never measures.

It catches real malice without reputation. On the curated detection set, the engine identified 400 live malware samples by structure and behaviour, not by hash lookup or prior sightings.
It stays quiet on clean files. On the 6,281-document real-world benign control, the false-positive rate was 0.34% (20 of 5,899 genuinely benign documents): detection by analysis, measured honestly against a diverse population it had never seen.
It still finds plenty when there is no malware to find. Across the entire 16,971-document DOJ Epstein release the malware axis was empty — zero exploits — yet the engine surfaced uniform re-processing, a totally stripped (but recoverable) toolchain, and a machine-readable layer that diverges from the visible page on 18.6% of documents.
The page a human sees is not the only document in the file. This is the through-line. It shows up as parser disagreement, reality drift, value-versus-appearance divergence, and hidden text layers — in crafted files, in ordinary files, and in a real government production at scale.
And it publishes new primary findings, not just a synthesis. Provenance recovered from orphaned objects on 65.8% of a real production and byte-verified with two independent tools; a manufacturing fingerprint so uniform it resolves 16,971 files into exactly two physical page layouts; and Bates-sequence continuity, a completeness measure showing the release occupies just 2.4% of its own numbered range.

The number to take away is not a blended accuracy figure. It is the coverage: 24,824 real PDFs across an adversarial set, a benign control, and a single real-world production, with the same engine reaching the same conclusion in all three.

Executive summary
Three corpora, three jobs
The thesis: a PDF is a stack of representations
Finding 1 — detection without reputation
Finding 2 — the human-vs-machine gap
Finding 3 — provenance survives the strip
Finding 4 — the manufacturing fingerprint
Finding 5 — three engines, one population
Finding 6 — the numbering tells a story
Finding 7 — signed is not the same as read
What is new since the earlier studies
What this means for AI & RAG
Scope & what we did not do
The seven studies
Methodology

Three Corpora, Three Jobs

A single corpus can only answer a single kind of question. Detection accuracy needs known malice. A false-positive rate needs known-benign diversity. Real-world behaviour needs a real population nobody curated. We used three, and we keep their numbers apart on purpose.

Corpus	Size	What it is	What it measures
Curated detection set	1,572	Eight domains: 400 live malware samples (MalwareBazaar), 950 real-world Mozilla pdf.js files, government AcroForm/XFA forms, academic and legislative publications, adversarial proof-of-concept files, and hand-crafted fixtures.	Detection — does it catch malice, and on what evidence?
Real-world benign control	6,281	The GovDocs1 corpus (Digital Corpora) — web-crawled government and academic PDFs, deliberately ordinary and diverse.	False positives — does it stay quiet on clean files it has never seen?
Real-world production	16,971	The complete DOJ Epstein release — every released PDF, all 47 engines, zero files skipped.	Behaviour at scale — what does it find when nothing is malicious?
Total: 24,824 real PDFs. These are three separate measurements. We never merge them into a single rate, because a homogeneous clean production is not a detection benchmark, and a release with real content-integrity findings is not a false-positive control.

That last point is worth stating plainly, because it is the most common mistake. Adding the 16,971 Epstein documents to the benign control would look like a bigger, more impressive study and would actually make it worse: the Epstein corpus is malware-clean, so it adds no detection signal, and the engine legitimately raised content-integrity findings on 18.6% of it — findings that are real, not false positives. Blending the two would either hide a true signal or fake a false-positive rate. So we don't.

The Thesis: A PDF Is a Stack of Representations

Open a PDF and you see one page. The file contains several documents at once, and a different reader consumes each one:

Representation	Who reads it	How it can disagree
The rendered page	A human, a viewer	The visible image; the baseline the other layers can contradict
The text layer	Search, e-discovery, OCR, AI/RAG	Can say something different from the page (reality drift)
The object graph	Parsers, indexers	Different parsers reconstruct a different document from the same bytes
The metadata record	Provenance and classification tools	Can be stripped from view while surviving in orphaned objects
The signed byte range	Signature validators	Can certify a value that differs from the appearance shown (V/AP divergence)

A verdict that reduces all of this to one number — is it malware? — cannot express it. The PDF Semantic Determinism study sets out the root cause: the format guarantees pixels, not a single canonical meaning. Everything below is that abstract problem showing up in measured data, across three corpora.

Finding 1 — Detection Without Reputation (curated set + benign control)

The engine caught 400 live malware samples by structure and behaviour rather than reputation, and stayed quiet on clean files at a 0.34% false-positive rate over a 6,281-document real-world control.

The baseline question still matters: does the engine catch real malware, and does it stay quiet otherwise? Measured on the two corpora built for exactly that:

400 live malware samples from MalwareBazaar were identified by structure and behaviour — dangerous content-stream operators, exploit filter parameters, behavioural detonation, object-graph anomalies — not by hash reputation or a record of having seen the file before. A novel sample with no reputation is still caught on what it is and what it does.
0.34% false positives (20 of 5,899 genuinely benign documents) on the 6,281-document GovDocs1 control. The headline 3.3% flag rate on that corpus collapses to 0.34% once the genuinely malicious files mixed into the crawl are removed: an honest number, measured on a diverse population the engine had never seen.
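
The arithmetic behind the 0.34% figure is simple, but the denominator is the part that is easy to get wrong: it is the 5,899 benign documents, not the full 6,281-document crawl. A minimal sketch using only the counts stated above; the function name is illustrative and not part of the scanner:

```python
def false_positive_rate(flagged_benign: int, total_benign: int) -> float:
    """Share of known-benign documents that were flagged anyway."""
    if total_benign <= 0:
        raise ValueError("total_benign must be positive")
    return flagged_benign / total_benign


# Counts stated in the text: 20 flagged among 5,899 benign documents.
rate = false_positive_rate(20, 5_899)
print(f"{rate:.2%}")  # prints 0.34%
```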

The point of this finding is not the number on its own; it is what the number proves about the method. Detection-by-analysis is what lets the same engine say something useful about the Epstein corpus, where reputation is worthless because the files are brand new and clean. Full detail is in PDF Forensics at Scale.

Finding 2 — The Human-vs-Machine Gap (all three corpora)

The text a machine extracts diverges from the page a human sees in every corpus — 18.6% of a real release, 43.5% of ordinary benign documents, nearly every government form — yet almost never in hand-crafted adversarial files.

This is the through-line, and it appears in every corpus we measured. The most telling part is the last column: the gap is pervasive in real documents and effectively absent in adversarial ones — the opposite of an attack signature.

Corpus	The human-vs-machine gap, measured	In adversarial files
Curated detection set (1,572)	~1 in 3 files diverge across parsers (502); reading-order ambiguity on 45 of 46 government forms and all 34 academic / legislative publications	0 of 103 proof-of-concept files triggered any drift vector
Real-world benign control (GovDocs1, 6,281)	43.5% parser divergence (2,641); 69.6% reading-order ambiguity (4,223); 80.0% carry at least one extraction-divergence vector (4,850)	n/a (benign control)
Real-world production (Epstein, 16,971)	18.6% OCR text vs image drift (3,159); 61.9% version-decode disagreement (10,499); 44.7% reading-order ambiguity (7,587)	n/a (malware-clean release)
The metrics are not identical across corpora and we do not blend them: the production's 61.9% is version-declaration disagreement, while the control's 43.5% counts page-count, JavaScript, encryption, and form disagreement. We show each corpus's own measurement. The constant across all three is direction — the gap rises with how real and uncurated the documents are, and vanishes on hand-crafted attack files.

The Epstein figure is the strongest version of the argument, because it is the largest and the least curated. On 18.6% of a real government release, the text a machine ingests does not match the page a person reads — measured conservatively, flagging a page only when more than roughly 70% of words differ between the embedded text and a fresh OCR of the rendered image. The 61.9% parser-disagreement figure is narrower than it sounds and we are precise about it: the parsers agree on page count and object count; they disagree on the file's declared version and the decode path that follows from it (a 1.3 header over 1.5-era object streams). That is not six parsers extracting six different texts — it is six parsers disagreeing on how the file should be read, which is the structural root of the drift.
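
The "1.3 header over 1.5-era object streams" condition can be approximated from raw bytes. The sketch below is a heuristic under stated assumptions, not the logic of the six parsers used in the study. It relies on one fact about the format: object streams and cross-reference streams were introduced in PDF 1.5. It ignores a catalog `/Version` entry, which can legitimately override the header, and the function names are ours.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
# Stream dictionaries are stored uncompressed, so these markers are visible in raw bytes.
PDF15_MARKERS = re.compile(rb"/Type\s*/(ObjStm|XRef)\b")


def declared_version(pdf_bytes: bytes):
    """Return the header version as (major, minor), or None if no header is found."""
    match = HEADER_RE.search(pdf_bytes[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def header_understates_structure(pdf_bytes: bytes) -> bool:
    """True when the header declares < 1.5 but 1.5-era structures are present."""
    version = declared_version(pdf_bytes)
    if version is None:
        return False
    return version < (1, 5) and PDF15_MARKERS.search(pdf_bytes) is not None
```

A file that trips this check is one where a strict reader and a lenient reader have a reason to take different decode paths, which is the kind of disagreement the 61.9% figure counts.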

Two readers, one file, two different documents. Detail in PDF Reality Drift and Parser Disagreement.

Finding 3 — Provenance Survives the Strip (Epstein, 16,971)

Active metadata was 100% empty, but a deep object walk recovered the full production toolchain from orphaned objects on 65.8% of the release — byte-verified with two independent tools, proving "metadata removed" only meant "metadata unlinked."

Every one of the 16,971 documents had its active metadata removed: author, title, producer, all empty, no XMP. On the surface, the authoring fingerprints look erased. They were not erased; they were unlinked.

PDFs are hoarders. When a pipeline strips the visible metadata, the original information is often left behind in an orphaned object — a dictionary inside a compressed object stream that the final trailer no longer references. Walking every cross-reference object recovers it. On 11,165 documents (65.8%) a complete authoring fingerprint survived in those orphans even though the active record was blank, and it does not describe one pipeline. It resolves the corpus into four distinct generators, each of which lines up exactly with a structural population we isolated from the bytes alone:

Recovered generator	Files	Share	What it is
`OmniPage CSDK 21.1` → `Processing-CLI`	10,499	61.9%	Commercial OCR capture feeding a batch-assembly step — the dominant scanning pipeline
`pypdf`	227	1.3%	A final Python manipulation pass over a minority of files
`ReportLab PDF Library`	425	2.5%	Born-digital generation — a distinct sub-population, not scanned
`Microsoft® Office Word 2007`	14	0.1%	Genuine born-digital originals, all carrying a 2013 creation date
No recoverable fingerprint	5,806	34.2%	Orphaned record empty as well as the active one

We verified this the careful way, and the verification is itself a finding. We pulled the original files and read the active, trailer-referenced metadata directly from the bytes with two independent tools — pikepdf and pdfinfo — across the populations a surface reader would most expect to retain metadata. Both agree: the active record is empty, while the toolchain strings live only in orphaned dictionaries that a deep object walk recovers.

Population sampled	Files read from source bytes	Active `/Info` populated	XMP present
pypdf cohort	212	0	0
ReportLab cohort	16	0	0
OmniPage scanned majority	235	0	0
Confirmed independently by `pikepdf` (final-trailer resolution) and `pdfinfo`. The recovered toolchain strings are present only in orphaned object-stream dictionaries.

The visible metadata strip is a lock on the front door; the back door was left open. This is the practical lesson for anyone who believes "remove metadata" actually removed it: a surface tool reports empty, the file looks clean, and the full provenance is still inside it. A forensic reader that walks the object graph rather than trusting the trailer recovers what the strip was meant to hide — on nearly two-thirds of a real production.
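
To make "unlinked, not deleted" concrete, here is a brute-force sketch that uses only the Python standard library. It is not the scanner's deep object walk and it does not resolve the trailer: it inflates every Flate-compressed stream it can and looks for `/Producer` or `/Creator` strings that do not appear in the uncompressed bytes. The function names and the returned dictionary keys are assumptions for illustration, and filters other than Flate, predictors, and encryption are ignored.

```python
import re
import zlib

STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
INFO_RE = re.compile(rb"/(Producer|Creator)\s*\(((?:[^()\\]|\\.)*)\)")


def inflate_streams(pdf_bytes: bytes):
    """Yield the inflated body of every stream that decodes as zlib data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (image, other filter, or damaged)
        yield data


def scan(blob: bytes) -> set:
    """Collect (key, value) pairs for Producer/Creator literal strings."""
    return {(k.decode("latin-1"), v.decode("latin-1")) for k, v in INFO_RE.findall(blob)}


def recover_toolchain_strings(pdf_bytes: bytes) -> dict:
    """Separate strings visible in plain bytes from those found only inside streams."""
    visible = scan(pdf_bytes)
    buried = set()
    for data in inflate_streams(pdf_bytes):
        buried |= scan(data)
    return {"plain": visible, "in_streams": buried - visible}
```

A non-empty `in_streams` set on a file whose active record reads as blank is a prompt for a proper object-graph walk with a PDF library, which is what establishes that the dictionary is genuinely orphaned.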

Finding 4 — The Manufacturing Fingerprint (Epstein, 16,971)

A single automated pipeline stamps all 16,971 files with the same mechanical signature — identical hex-encoded Helvetica overlay, the same retained bloat, and exactly two physical page layouts across the entire release.

Provenance hides in the metadata. Process hides in the geometry. When a single automated pipeline produces seventeen thousand documents, it stamps every one of them with the same mechanical signature, and that signature is measurable down to the content-stream index. We characterised the page-identifier overlay and the structural bloat the stamping step leaves behind, across all 16,971 files:

Manufacturing trait	Prevalence	What it means
Bates overlay in a dedicated content stream	100% (16,971)	The identifier is drawn by its own injected stream, not part of the page
Identifier hex-encoded	100%	The Bates text is written as a hex string, the same way every time
Drawn in Helvetica	100%	One non-embedded font for every stamp in the corpus
Stamp size exactly 12.0 pt	80.9% (13,730)	The remainder are the same 12 pt stamp scaled by the page transform
Empty leftover content stream	100%	The pipeline leaves an unused stream behind in every file
Legacy `/ProcSet` array retained	100%	A deprecated resource declaration carried in every file
Retained page `/Thumb` thumbnail	97.4% (16,532)	An abandoned thumbnail nobody removed

The overlay does not just look uniform; it resolves into exactly two physical page layouts and nothing else. A five-stream page with the Bates stamp at content-stream index 3 accounts for 83.1% (14,110) of the corpus; a four-stream page with the stamp at index 2 accounts for the other 16.9% (2,861). A release assembled by hand, or by more than one tool, does not converge on two byte-level page shapes across seventeen thousand files. This is the fingerprint of a single stamping pass, and it is the kind of signal a malware verdict never looks for and a structural engine reads immediately.
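
The two-layout result is a tally over one small record per file. A minimal sketch, assuming each file has already been reduced to a hypothetical `(content_stream_count, stamp_stream_index)` pair; how the scanner extracts those two values is not shown here:

```python
from collections import Counter


def layout_census(records):
    """Map each (stream_count, stamp_index) layout to (file count, share of corpus)."""
    counts = Counter(records)
    total = sum(counts.values())
    return {layout: (n, n / total) for layout, n in counts.items()}


# Counts stated in the text: five-stream pages stamped at index 3,
# four-stream pages stamped at index 2.
records = [(5, 3)] * 14_110 + [(4, 2)] * 2_861
for layout, (n, share) in sorted(layout_census(records).items(), reverse=True):
    print(layout, n, f"{share:.1%}")
```

The interesting output is the number of keys, not the shares: a hand-assembled or multi-tool release would be expected to produce many more than two layouts.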

The forensic value generalises well beyond this corpus: any production line that touches documents in bulk leaves a measurable manufacturing signature, and the absence or mixing of that signature is itself evidence. We added dedicated Bates-stamp and file-bloat detection to the scanner during this work, so it now records these traits on every file it sees.

Finding 5 — Three Engines, One Population (Epstein, 16,971 — new analysis)

The release reduces to 181 object-graph templates, and three independent engines select the identical 10,499 documents with 100% agreement — proof by triangulation that no single-purpose tool could produce.

This is a finding only a multi-engine scanner can produce, and it is new here. Run a single tool over the corpus and you get one opinion. Run dozens of independent engines and you can ask a question no single tool can answer: do unrelated measurements agree? On this corpus the answer is exact, and it is striking.

Start with the object graph. Every PDF has an internal skeleton — the order and kind of its core objects (catalog, fonts, encodings, pages, object streams, cross-reference streams). Reduced to a signature, the entire 16,971-document release collapses to just 181 distinct object-graph templates. One template alone accounts for 61.9% (10,499) of the corpus; the top ten cover 85%. The internal structure is as stamped as the Bates number on the page.
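
One way to picture a template signature is as a hash over the ordered list of object kinds, with all content discarded. This is a sketch under assumptions: the text describes the signature only as "the order and kind of core objects", so the kind labels, the separator, and the choice of SHA-256 below are ours, not the scanner's.

```python
import hashlib
from collections import Counter


def template_signature(object_kinds) -> str:
    """Hash the ordered sequence of object kinds; content plays no part."""
    joined = "|".join(object_kinds)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def template_census(corpus) -> Counter:
    """corpus: mapping of document id -> ordered list of object kinds."""
    return Counter(template_signature(kinds) for kinds in corpus.values())


corpus = {
    "doc-a": ["Catalog", "Pages", "Font", "ObjStm", "XRef"],
    "doc-b": ["Catalog", "Pages", "Font", "ObjStm", "XRef"],  # same skeleton, different content
    "doc-c": ["Catalog", "Pages", "XRef"],
}
census = template_census(corpus)
print(len(census), census.most_common(1)[0][1])  # prints 2 2
```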

Now the cross-validation. That dominant 10,499-document template is not the only engine that isolates this population. Three completely independent engines, sharing no mechanism, each select a set of documents — and they select the identical files:

Independent engine	What it measures	Documents selected
Campaign attribution	Object-graph structural signature	10,499
Metadata recovery	Recovered `OmniPage CSDK 21.1` producer string in orphaned objects	10,499
Differential parsing	Six external parsers disagree on the declared version	10,499
Intersection of all three sets: 10,499. Union: 10,499. Agreement (Jaccard): 100.0%. Not the same count — the same files.

The shape of the object graph, a string buried in an orphaned dictionary, and the way six outside parsers stumble on the file's version have nothing to do with one another. That they partition out the exact same 10,499 documents is not a coincidence; it is proof, by triangulation, that the dominant scanning pipeline (the OmniPage capture chain recovered in Finding 3) is a single coherent production and not an artefact of any one measurement. This is what a multi-engine forensic approach buys that a single-purpose scanner cannot: engines that check each other.
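
The agreement measure is the Jaccard index extended to several sets: the size of the intersection divided by the size of the union. A minimal sketch with illustrative document ids; it also shows why equal counts alone would not be enough.

```python
def jaccard_agreement(*selections) -> float:
    """Size of the intersection divided by size of the union, across all selections."""
    sets = [set(s) for s in selections]
    if not sets:
        raise ValueError("need at least one selection")
    union = set().union(*sets)
    if not union:
        return 1.0  # every engine selected nothing, so they agree trivially
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


structural = {"d1", "d2", "d3"}
producer = {"d1", "d2", "d3"}
version = {"d1", "d2", "d3"}
print(jaccard_agreement(structural, producer, version))  # prints 1.0

# Same count, different files: the score drops well below 1.
print(jaccard_agreement({"d1", "d2", "d3"}, {"d2", "d3", "d4"}))  # prints 0.5
```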

One more layer, and it is the counter-intuitive one. The production is uniform, but the documents are not. On the subset for which the scanner recorded a fuzzy content hash (8,081 files), near-duplication is almost entirely absent: 99.1% have no near-twin in the corpus, and the largest near-identical family is ten documents. So the picture is a uniform factory turning out unique products — one pipeline, a handful of structural templates, and seventeen thousand genuinely distinct documents passing through it. That combination is itself a signature: it is what the bulk digitisation of a real, heterogeneous archive looks like, and it is not what templated or synthetic content would look like.

Finding 6 — The Numbering Tells a Story (Epstein, 16,971)

The page numbering is internally perfect (16,914 of 16,970 boundaries exact, zero overlaps) yet covers only 2.4% of its own range, in sixteen islands — a completeness measure no prior study had a corpus large enough to perform.

The same manufacturing uniformity that stamps every page (Finding 4) makes the page identifiers a sequence, and a sequence can be checked for completeness. This turned out to be the most quietly remarkable result in the entire corpus, and it is a forensic technique none of the earlier studies had: Bates-sequence continuity analysis.

The released documents are internally page-perfect: for 16,914 of 16,970 consecutive document boundaries, the next document's first Bates number equals the previous document's first number plus its exact page count, with zero overlaps anywhere in the corpus. The numbering is clean, strictly per-page, and never reused. Yet the 16,971 documents (67,143 pages) sit inside a numbered range running to 2,853,705. The release accounts for about 2.4% of its own numbered range, and the present documents collapse into sixteen contiguous islands — with 86.8% of all documents (14,723) packed into the first 40,000 numbers, followed by three contiguous deserts:

Absent contiguous range (between two released documents)	Numbered pages absent
`EFTA00039881` → `EFTA01262782`	1,222,900
`EFTA01264396` → `EFTA02209722`	945,325
`EFTA02212972` → `EFTA02730265`	517,290
+ 53 further gap boundaries	2,787,447 numbered pages absent in total

What our data establishes is structural: the released PDFs occupy 2.4% of a page-contiguous, reuse-free numbering range, in sixteen islands, with three multi-hundred-thousand-page gaps. What it cannot establish is the cause of the absences — withheld, redacted in full, released only as non-PDF exhibits, or part of separate productions. We report the shape and scale of what is absent, not the reason for it.
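
The continuity rule stated above (the next document's first number equals the previous first number plus its page count) translates directly into a boundary check. A minimal sketch: the `Doc` record and the report keys are hypothetical, and the page counts in the example are assumed for illustration. Only the two boundary identifiers come from the first row of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    first_bates: int  # numeric part of the first page identifier
    page_count: int


def continuity_report(docs) -> dict:
    """Classify every consecutive document boundary as exact, overlap, or gap."""
    ordered = sorted(docs, key=lambda d: d.first_bates)
    exact = overlaps = 0
    gaps = []  # (first absent number, last absent number, pages absent)
    for prev, nxt in zip(ordered, ordered[1:]):
        expected = prev.first_bates + prev.page_count
        delta = nxt.first_bates - expected
        if delta == 0:
            exact += 1
        elif delta < 0:
            overlaps += 1
        else:
            gaps.append((expected, nxt.first_bates - 1, delta))
    return {
        "boundaries": max(len(ordered) - 1, 0),
        "exact": exact,
        "overlaps": overlaps,
        "gaps": gaps,
        "absent_pages": sum(g[2] for g in gaps),
    }


docs = [Doc(39879, 2), Doc(39881, 1), Doc(1262782, 3), Doc(1262785, 1)]
report = continuity_report(docs)
print(report["exact"], report["overlaps"], report["absent_pages"])  # prints 2 0 1222900
```

Dividing the released page count by the top of the numbered range gives the coverage figure: 67,143 / 2,853,705 is about 2.4%.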

Full treatment in The Epstein Files, Forensically.

Finding 7 — Signed Is Not the Same as Read (curated forms + fixtures)

A digital signature can certify a form field whose stored value and displayed appearance disagree, so the document that was signed is not the document that is read — caught with five rendering-free checks.

The most counter-intuitive form of the human-vs-machine gap lives in interactive forms and signatures. A form field carries two independent representations: its value (/V, the data) and its appearance stream (/AP, what is drawn). They can disagree. A digital signature covers a byte range, and that range can include a value while the viewer renders a regenerated appearance showing something else — so the document that was signed is not the document that is displayed.

We catch this without rendering, through five checks: /NeedAppearances on a signed file, checkbox /V-versus-/AS mismatch, appearance-stream text extraction with font-encoding remap (so a custom encoding that draws "9" where the value is "1" is caught), image-based appearance streams, and blank appearance streams. A "signed" form can store one number and show another, entirely within the signed bytes. Detail in PDF Forms as Executable Security Boundaries.
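
Two of the five checks need nothing beyond dictionary lookups, which is why they work without rendering. The sketch below is an assumption-laden illustration: it takes the form as plain Python dictionaries keyed by PDF names, assumes each checkbox field has been merged with its widget so that `/V` and `/AS` sit side by side, and treats every `/Btn` field that carries an `/AS` entry as a checkbox. A real implementation would read these entries through a PDF library and handle field inheritance and radio groups.

```python
def check_form(acroform: dict, fields: list, is_signed: bool) -> list:
    """Return human-readable findings for two rendering-free checks."""
    findings = []

    # Check 1: a signed file that asks the viewer to regenerate appearances.
    if is_signed and acroform.get("/NeedAppearances") is True:
        findings.append("/NeedAppearances is set on a signed file")

    # Check 2: stored checkbox value disagrees with the displayed appearance state.
    for field in fields:
        if field.get("/FT") != "/Btn" or "/AS" not in field:
            continue
        value = field.get("/V", "/Off")  # assumption: a missing value means unchecked
        if value != field["/AS"]:
            name = field.get("/T", "<unnamed>")
            findings.append(f"{name}: /V is {value} but /AS is {field['/AS']}")
    return findings


fields = [
    {"/T": "agree", "/FT": "/Btn", "/V": "/Yes", "/AS": "/Off"},
    {"/T": "subscribe", "/FT": "/Btn", "/V": "/Yes", "/AS": "/Yes"},
]
for finding in check_form({"/NeedAppearances": True}, fields, is_signed=True):
    print(finding)
```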

What Is New Since the Earlier Studies

A synthesis should earn its place. Five things in this body of work did not exist in the seven studies on their own and came directly out of running the engine against a real production at full scale:

Byte-verified metadata recovery. The orphaned-object recovery in Finding 3 was confirmed against the original file bytes with two independent tools (pikepdf and pdfinfo), establishing that "metadata removed" can mean "metadata unlinked, not deleted" — and that a deep object walk recovers it on 65.8% of a real production. This is a reproducible forensic result, not an inference.
The manufacturing-fingerprint measurement. Quantifying a production pipeline's mechanical signature down to the content-stream index (Finding 4), and showing it resolves to exactly two page layouts across 16,971 files, is a generalisable technique for distinguishing automated bulk production from hand assembly.
Multi-engine cross-validation by triangulation. Showing that three independent engines partition out the identical 10,499 documents with 100% agreement (Finding 5) is a method, not just a result: unrelated measurements that converge are a proof structure single-purpose tools cannot offer, and it reduced 16,971 files to 181 object-graph templates along the way.
Bates-sequence continuity analysis. Treating a numbered production as a sequence and measuring its completeness (Finding 6) is a new analytical lens, only possible on a corpus large and uniform enough to have a sequence at all.
A new engine capability. The scanner gained dedicated Bates-stamp and file-bloat forensics, shipped during this work — it now records how a page identifier is applied (font, size, encoding, dedicated content stream) and the structural bloat a stamping pipeline leaves behind, on every file it sees.

The deeper result is the cross-corpus one: the engine's value did not depend on there being malware to find. On the one corpus where the malware axis was completely empty, it produced the richest structural findings of all three. That is the strongest evidence we have that document forensics and malware scanning are different disciplines.

What This Means for AI & RAG

Every finding above lands on the same operational point for anyone feeding PDFs to an automated reader. The page is for humans; the machine consumes a different representation, and across these corpora that representation is frequently not what the page shows.

You ingest the text layer, not the page. On 18.6% of a real release that layer diverged from the image — a model can answer fluently from text the source page does not support, with no visible cue.
Your corpus depends on your parser. On 61.9% of that release, independent libraries disagreed on how to decode the file. Pin one extractor and version it, or two teams build measurably different indexes from the identical documents.
The shipped text is provisional. Where pages are fixed low-resolution rasters, a modern OCR pass produces a different, usually better text layer than the one bundled in the file. Treat embedded OCR as a draft.
Absence is not evidence of absence. When a release is a small, clustered slice of its own numbering, a retrieval system that returns "nothing about X" is reporting the contents of the slice, not of the underlying production.
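
One way to act on "pin one extractor and version it" is to store the extractor's identity next to every piece of extracted text, so that two indexes built from the same file can be compared. A minimal sketch; the record layout and the extractor names are hypothetical and do not refer to any real library:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRecord:
    source_sha256: str      # hash of the PDF bytes
    extractor: str          # which library produced the text
    extractor_version: str  # the pinned version
    text_sha256: str        # hash of the extracted text


def make_record(pdf_bytes: bytes, text: str, extractor: str, version: str) -> ExtractionRecord:
    return ExtractionRecord(
        source_sha256=hashlib.sha256(pdf_bytes).hexdigest(),
        extractor=extractor,
        extractor_version=version,
        text_sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )


def same_source_different_text(a: ExtractionRecord, b: ExtractionRecord) -> bool:
    """True when two extractions of the identical file produced different text."""
    return a.source_sha256 == b.source_sha256 and a.text_sha256 != b.text_sha256


pdf = b"%PDF-1.3 example bytes"
team_one = make_record(pdf, "Total due: 100", "extractor-a", "1.0")
team_two = make_record(pdf, "Total due: 700", "extractor-b", "2.3")
print(same_source_different_text(team_one, team_two))  # prints True
```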

None of this concerns document content or its truth. It concerns whether an automated reader can faithfully recover what the page shows — and for a large share of real PDFs, by construction, it cannot without deliberate countermeasures. See PDF Structural Problems in AI Ingestion Pipelines.

Scope & What We Did Not Do

This is a structural synthesis. Its boundaries are deliberate:

We did not blend the corpora into a single metric. Detection figures come only from the curated set; the false-positive rate comes only from the benign control; behavioural and structural figures at scale come only from the Epstein release. Each number is labelled with its source.
We did not assess the truth, accuracy, or meaning of any document's contents. Every finding concerns how files were produced and encoded, not what they say.
We did not attribute motive. Metadata stripping, the numbering gaps, and reality drift are reported as observable facts; whether any was deliberate is outside what PDF structure can establish.
We did not re-open concluded research to inflate it. The earlier studies keep their own scoped figures; this article cites them, it does not overwrite them.

The Seven Studies

This synthesis stands on seven prior pieces of original research. Each is the full treatment of a mechanism summarised above.

Study	One-line finding
PDF Forensics at Scale	A 0.34% false-positive rate on a 6,281-PDF real-world control, with live-malware detection by analysis rather than reputation.
The Epstein Files, Forensically	All 16,971 DOJ release PDFs: malware-clean, 100% metadata-stripped but recoverable, 18.6% reality drift, 2.4% of the Bates range present.
Parser Disagreement	Eleven crafted PDFs through six parsers — every file produced a different reading from the same bytes.
PDF Reality Drift	One file, many realities: 43 of 44 IRS tax forms drift between the rendered page and the extracted text.
The PDF Semantic Determinism Problem	The root cause: a format built for visual fidelity guarantees pixels, not a single canonical meaning.
PDF Structural Problems in AI Ingestion	AI ingestion can be poisoned by the document itself — the model reads text the human never sees.
PDF Forms as Executable Security Boundaries	A signature can certify a form while value and appearance disagree — what gets signed is not what gets read.

Methodology

Every figure derives from machine-readable per-document output — one structured result per PDF — produced by the same 47-engine PQ PDF forensic scanner across all three corpora. The verdict model is multi-axis: threat (malware and exploit), deception (content-integrity), and structure (neutral capability) are scored on separate axes, so a complex-but-legitimate file is never mistaken for an attack and a deceptive one is not waved through because it carries no exploit.

Corpus provenance: the curated detection set and the GovDocs1 benign control are described in full in PDF Forensics at Scale; the 16,971-document release and its per-engine methodology in The Epstein Files, Forensically. The OCR-divergence measure flags a page only when normalized word-set overlap between the embedded text and a fresh OCR of the rendered image falls below 0.30, with gates that exclude blank and figure-only pages. Metadata-recovery results were confirmed against original file bytes with two independent readers. No figure in this article averages across corpora.
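
The OCR-divergence rule can be sketched as follows. The 0.30 threshold comes from the paragraph above; everything else is an assumption made for illustration: we read "normalized word-set overlap" as Jaccard overlap of lowercased alphanumeric tokens, and we stand in for the blank-page and figure-only gates with a simple minimum word count on both sides.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")


def word_set(text: str) -> set:
    """Lowercase the text and keep alphanumeric tokens as a set."""
    return set(WORD_RE.findall(text.lower()))


def word_overlap(embedded: str, ocr: str) -> float:
    """Jaccard overlap between the two word sets (1.0 when both are empty)."""
    a, b = word_set(embedded), word_set(ocr)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_drifts(embedded: str, ocr: str, threshold: float = 0.30, min_words: int = 5) -> bool:
    """Flag a page only when both sides carry enough text and overlap is below threshold."""
    if len(word_set(embedded)) < min_words or len(word_set(ocr)) < min_words:
        return False  # assumed gate: treat as blank or figure-only and skip
    return word_overlap(embedded, ocr) < threshold
```

Because the gate returns `False` for sparse pages, the sketch errs toward under-reporting, which matches the conservative intent described in Finding 2.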
