PQ PDF Tools Secure document utilities for everyday workflows.

Security Research — Published 8 June 2026

The PDF Is Not the Document

Three corpora, 24,824 real PDFs, seven prior studies, one finding. A curated detection set with 400 live malware samples, a 6,281-document real-world benign control, and the complete 16,971-document DOJ Epstein release — measured separately, never blended.

A PDF is not a document. It is a stack of representations — a rendered page, a hidden text layer, an object graph, a metadata record, a region of bytes a signature vouches for — and those representations can disagree with one another. Whether the file is malware is only one of the questions you can ask it. Across three very different corpora, the more interesting questions turned out to be the other ones.

This article consolidates seven prior PQ PDF studies and adds the result that ties them together: the same multi-axis forensic engine, run against adversarial files, against ordinary benign documents, and against a single massive real-world release where nothing was malicious at all, keeps finding the same structural truth. Every figure below is sourced to the specific corpus it came from. We do not average across them, because they were built to measure different things.

Who this matters for:

  • DFIR & malware analysts
  • e-Discovery & legal review
  • RAG & document-AI builders
  • Search & indexing engineers
  • PDF tooling authors
  • Records & archival teams

Executive Summary

Over the past year we ran the same multi-axis PDF forensic engine against three corpora built for three different purposes. Read together, they make one argument that no single corpus could make alone: document forensics is not malware scanning, and a PDF's most consequential properties are usually the ones a malware verdict never measures.

  • It catches real malice without reputation. On the curated detection set, the engine identified 400 live malware samples by structure and behaviour, not by hash lookup or prior sightings.
  • It stays quiet on clean files. On the 6,281-document real-world benign control, the false-positive rate was 0.34% (20 of 5,899 genuinely-benign documents) — detection by analysis, measured honestly against a diverse population it had never seen.
  • It still finds plenty when there is no malware to find. Across the entire 16,971-document DOJ Epstein release the malware axis was empty — zero exploits — yet the engine surfaced uniform re-processing, a totally stripped (but recoverable) toolchain, and a machine-readable layer that diverges from the visible page on 18.6% of documents.
  • The page a human sees is not the only document in the file. This is the through-line. It shows up as parser disagreement, reality drift, value-versus-appearance divergence, and hidden text layers — in crafted files, in ordinary files, and in a real government production at scale.
  • And it publishes new primary findings, not just a synthesis. Provenance recovered from orphaned objects on 65.8% of a real production and byte-verified with two independent tools; a manufacturing fingerprint so uniform it resolves 16,971 files into exactly two physical page layouts; and Bates-sequence continuity, a completeness measure showing the release occupies just 2.4% of its own numbered range.

The number to take away is not a blended accuracy figure. It is the coverage: 24,824 real PDFs across an adversarial set, a benign control, and a single real-world production, with the same engine reaching the same conclusion in all three.

Contents

  • Executive summary
  • Three corpora, three jobs
  • The thesis: a PDF is a stack of representations
  • Finding 1 — detection without reputation
  • Finding 2 — the human-vs-machine gap
  • Finding 3 — provenance survives the strip
  • Finding 4 — the manufacturing fingerprint
  • Finding 5 — three engines, one population
  • Finding 6 — the numbering tells a story
  • Finding 7 — signed is not the same as read
  • What is new since the earlier studies
  • What this means for AI & RAG
  • Scope & what we did not do
  • The seven studies
  • Methodology

Three Corpora, Three Jobs

A single corpus can only answer a single kind of question. Detection accuracy needs known malice. A false-positive rate needs known-benign diversity. Real-world behaviour needs a real population nobody curated. We used three, and we keep their numbers apart on purpose.

Corpus | Size | What it is | What it measures
Curated detection set | 1,572 | Eight domains: 400 live malware samples (MalwareBazaar), 950 real-world Mozilla pdf.js files, government AcroForm/XFA forms, academic and legislative publications, adversarial proof-of-concept files, and hand-crafted fixtures. | Detection: does it catch malice, and on what evidence?
Real-world benign control | 6,281 | The GovDocs1 corpus (Digital Corpora): web-crawled government and academic PDFs, deliberately ordinary and diverse. | False positives: does it stay quiet on clean files it has never seen?
Real-world production | 16,971 | The complete DOJ Epstein release: every released PDF, all 47 engines, zero files skipped. | Behaviour at scale: what does it find when nothing is malicious?

Total: 24,824 real PDFs. These are three separate measurements. We never merge them into a single rate, because a homogeneous clean production is not a detection benchmark, and a release with real content-integrity findings is not a false-positive control.

That last point is worth stating plainly, because it is the most common mistake. Adding the 16,971 Epstein documents to the benign control would look like a bigger, more impressive study and would actually make it worse: the Epstein corpus is malware-clean, so it adds no detection signal, and the engine legitimately raised content-integrity findings on 18.6% of it — findings that are real, not false positives. Blending the two would either hide a true signal or fake a false-positive rate. So we don't.
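The arithmetic behind that warning can be made concrete. The sketch below uses only figures quoted in this article; treating the 3,159 drift findings as the only Epstein flags is a simplifying assumption made for illustration, not a claim about the scanner's full output.

```python
# Figures quoted in this article.
BENIGN_DOCS = 5_899        # genuinely-benign GovDocs1 documents
BENIGN_FALSE_POS = 20      # false positives on that control
EPSTEIN_DOCS = 16_971      # malware-clean production
EPSTEIN_FINDINGS = 3_159   # documents with real content-integrity findings (18.6%)


def rate(flagged: int, total: int) -> float:
    """Return flagged/total as a percentage."""
    return 100.0 * flagged / total


# The control measured on its own.
honest = rate(BENIGN_FALSE_POS, BENIGN_DOCS)

# Blend 1: count the Epstein files as extra clean documents.
# The denominator grows, the rate shrinks, and the real findings vanish.
hidden = rate(BENIGN_FALSE_POS, BENIGN_DOCS + EPSTEIN_DOCS)

# Blend 2: count the Epstein findings as if they were false positives.
# True signals are mislabelled and the rate balloons.
faked = rate(BENIGN_FALSE_POS + EPSTEIN_FINDINGS, BENIGN_DOCS + EPSTEIN_DOCS)

print(f"control only: {honest:.2f}%")
print(f"blend, findings ignored: {hidden:.2f}%")
print(f"blend, findings counted as false positives: {faked:.2f}%")
```

Neither blended figure describes anything the engine actually did, which is why the two corpora are reported separately.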

The Thesis: A PDF Is a Stack of Representations

Open a PDF and you see one page. The file contains several documents at once, and a different reader consumes each one:

Representation | Who reads it | How it can disagree
The rendered page | A human, a viewer | The visible image
The text layer | Search, e-discovery, OCR, AI/RAG | Can say something different from the page (reality drift)
The object graph | Parsers, indexers | Different parsers reconstruct a different document from the same bytes
The metadata record | Provenance and classification tools | Can be stripped from view while surviving in orphaned objects
The signed byte range | Signature validators | Can certify a value that differs from the appearance shown (V/AP divergence)

A verdict that reduces all of this to one number — is it malware? — cannot express it. The PDF Semantic Determinism study sets out the root cause: the format guarantees pixels, not a single canonical meaning. Everything below is that abstract problem showing up in measured data, across three corpora.

Why this matters in plain terms

When you open a PDF, you trust that the page in front of you is the document. It usually is not the whole story. The same file hands a different, invisible version of itself to every search engine, e-discovery platform, and AI system that reads it — and those versions can quietly disagree with the page you see. This article measures how often that happens in real documents, and why "is it a virus?" is the least interesting question you can ask a PDF.

Finding 1 — Detection Without Reputation (curated set + benign control)

The engine caught 400 live malware samples by structure and behaviour rather than reputation, and stayed quiet on clean files at a 0.34% false-positive rate over a 6,281-document real-world control.

The baseline question still matters: does the engine catch real malware, and does it stay quiet otherwise? Measured on the two corpora built for exactly that:

  • 400 live malware samples from MalwareBazaar were identified by structure and behaviour — dangerous content-stream operators, exploit filter parameters, behavioural detonation, object-graph anomalies — not by hash reputation or a record of having seen the file before. A novel sample with no reputation is still caught on what it is and what it does.
  • 0.34% false positives (20 of 5,899 genuinely-benign documents) on the 6,281-document GovDocs1 control. The headline 3.3% flag rate on that corpus collapses to 0.34% once the genuinely-malicious files mixed into the crawl are removed — an honest number, measured on a diverse population the engine had never seen.

The point of this finding is not the number on its own; it is what the number proves about the method. Detection-by-analysis is what lets the same engine say something useful about the Epstein corpus, where reputation is worthless because the files are brand new and clean. Full detail is in PDF Forensics at Scale.

Finding 2 — The Human-vs-Machine Gap (all three corpora)

The text a machine extracts diverges from the page a human sees in every corpus (18.6% of a real release, 43.5% of ordinary benign documents, nearly every government form), yet in none of the 103 hand-crafted proof-of-concept files.

This is the through-line, and it appears in every corpus we measured. The most telling part is the last column: the gap is pervasive in real documents and effectively absent in adversarial ones — the opposite of an attack signature.

Corpus | The human-vs-machine gap, measured | In adversarial files
Curated detection set (1,572) | ~1 in 3 files diverge across parsers (502); reading-order ambiguity on 45 of 46 government forms and all 34 academic / legislative publications | 0 of 103 proof-of-concept files triggered any drift vector
Real-world benign control (6,281, GovDocs1) | 43.5% parser divergence (2,641); 69.6% reading-order ambiguity (4,223); 80.0% carry at least one extraction-divergence vector (4,850) | n/a (benign control)
Real-world production (16,971, Epstein) | 18.6% OCR text vs image drift (3,159); 61.9% version-decode disagreement (10,499); 44.7% reading-order ambiguity (7,587) | n/a (malware-clean release)

The metrics are not identical across corpora and we do not blend them: the production's 61.9% is version-declaration disagreement, while the control's 43.5% counts page-count, JavaScript, encryption, and form disagreement. We show each corpus's own measurement. The constant across all three is direction: the gap rises with how real and uncurated the documents are, and vanishes on hand-crafted attack files.

The Epstein figure is the strongest version of the argument, because it is the largest and the least curated. On 18.6% of a real government release, the text a machine ingests does not match the page a person reads — measured conservatively, flagging a page only when more than roughly 70% of words differ between the embedded text and a fresh OCR of the rendered image. The 61.9% parser-disagreement figure is narrower than it sounds and we are precise about it: the parsers agree on page count and object count; they disagree on the file's declared version and the decode path that follows from it (a 1.3 header over 1.5-era object streams). That is not six parsers extracting six different texts — it is six parsers disagreeing on how the file should be read, which is the structural root of the drift.
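The drift test described above can be sketched in a few lines. This is a minimal illustration, not the scanner's implementation: the 0.30 threshold comes from the Methodology section, while the tokenisation, the use of Jaccard overlap as the "normalized word-set overlap", and the `MIN_WORDS` gate are assumptions introduced here.

```python
import re

DRIFT_THRESHOLD = 0.30  # overlap threshold stated in the Methodology section
MIN_WORDS = 5           # hypothetical gate for blank or figure-only pages


def word_set(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(embedded: str, ocr: str) -> float:
    """Jaccard overlap of the two word sets (1.0 means identical vocabularies)."""
    a, b = word_set(embedded), word_set(ocr)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_drifts(embedded: str, ocr: str) -> bool:
    """Flag a page only when both layers have enough words and overlap is low."""
    if len(word_set(embedded)) < MIN_WORDS or len(word_set(ocr)) < MIN_WORDS:
        return False  # gated out: too little text to compare meaningfully
    return overlap(embedded, ocr) < DRIFT_THRESHOLD


page_text = "The quick brown fox jumps over the lazy dog"
print(page_drifts(page_text, page_text))
print(page_drifts("alpha beta gamma delta epsilon zeta",
                  "one two three four five six"))
```

Comparing sets rather than sequences keeps the measure conservative: reordered or re-wrapped text does not count as drift, only a genuinely different vocabulary does.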

Two readers, one file, two different documents. Detail in PDF Reality Drift and Parser Disagreement.

Finding 3 — Provenance Survives the Strip (Epstein, 16,971)

Active metadata was 100% empty, but a deep object walk recovered the full production toolchain from orphaned objects on 65.8% of the release — byte-verified with two independent tools, proving "metadata removed" only meant "metadata unlinked."

Every one of the 16,971 documents had its active metadata removed: author, title, producer, all empty, no XMP. On the surface, the authoring fingerprints look erased. They were not erased; they were unlinked.

PDFs are hoarders. When a pipeline strips the visible metadata, the original information is often left behind in an orphaned object — a dictionary inside a compressed object stream that the final trailer no longer references. Walking every cross-reference object recovers it. On 11,165 documents (65.8%) a complete authoring fingerprint survived in those orphans even though the active record was blank, and it does not describe one pipeline. It resolves the corpus into four distinct generators, each of which lines up exactly with a structural population we isolated from the bytes alone:

Recovered generator | Files | Share | What it is
OmniPage CSDK 21.1 → Processing-CLI | 10,499 | 61.9% | Commercial OCR capture feeding a batch-assembly step; the dominant scanning pipeline
pypdf | 227 | 1.3% | A final Python manipulation pass over a minority of files
ReportLab PDF Library | 425 | 2.5% | Born-digital generation; a distinct sub-population, not scanned
Microsoft® Office Word 2007 | 14 | 0.1% | Genuine born-digital originals, all carrying a 2013 creation date
No recoverable fingerprint | 5,806 | 34.2% | Orphaned record empty as well as the active one

We verified this the careful way, and the verification is itself a finding. We pulled the original files and read the active, trailer-referenced metadata directly from the bytes with two independent tools — pikepdf and pdfinfo — across the populations a surface reader would most expect to retain metadata. Both agree: the active record is empty, while the toolchain strings live only in orphaned dictionaries that a deep object walk recovers.

Population sampled | Files read from source bytes | Active /Info populated | XMP present
pypdf cohort | 212 | 0 | 0
ReportLab cohort | 16 | 0 | 0
OmniPage scanned majority | 235 | 0 | 0

Confirmed independently by pikepdf (final-trailer resolution) and pdfinfo. The recovered toolchain strings are present only in orphaned object-stream dictionaries.

The visible metadata strip is a lock on the front door; the back door was left open. This is the practical lesson for anyone who believes "remove metadata" actually removed it: a surface tool reports empty, the file looks clean, and the full provenance is still inside it. A forensic reader that walks the object graph rather than trusting the trailer recovers what the strip was meant to hide — on nearly two-thirds of a real production.
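The difference between "unlinked" and "deleted" is a reachability question, and it can be shown on a toy object table. The sketch below is not a PDF parser and not the scanner's code: `Ref`, the dictionary layout, and the five-object file are hypothetical stand-ins for what a real reader such as pikepdf would expose after decoding the cross-reference data.

```python
from collections import deque

PROVENANCE_KEYS = ("Producer", "Creator", "Author")


class Ref:
    """An indirect reference to another object, by object number."""

    def __init__(self, num: int):
        self.num = num


def refs_in(value):
    """Yield every Ref found inside a nested dict/list value."""
    if isinstance(value, Ref):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from refs_in(v)
    elif isinstance(value, list):
        for v in value:
            yield from refs_in(v)


def reachable(trailer: dict, objects: dict) -> set:
    """Object numbers reachable from the trailer by following references."""
    seen, queue = set(), deque(r.num for r in refs_in(trailer))
    while queue:
        num = queue.popleft()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        queue.extend(r.num for r in refs_in(objects[num]))
    return seen


def orphaned_provenance(trailer: dict, objects: dict) -> dict:
    """Provenance fields held by objects the trailer can no longer reach."""
    live = reachable(trailer, objects)
    found = {}
    for num, obj in objects.items():
        if num in live or not isinstance(obj, dict):
            continue
        fields = {k: obj[k] for k in PROVENANCE_KEYS if k in obj}
        if fields:
            found[num] = fields
    return found


# Toy file: object 4 is the old Info dictionary, unlinked but never deleted.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)]},
    3: {"Type": "Page", "Parent": Ref(2)},
    4: {"Producer": "OmniPage CSDK 21.1"},
    5: {},  # the blank Info record that replaced it
}
trailer = {"Root": Ref(1), "Info": Ref(5)}

print(orphaned_provenance(trailer, objects))
```

A surface reader follows the trailer to object 5 and reports empty metadata; walking every object and subtracting the reachable set is what surfaces object 4.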

Finding 4 — The Manufacturing Fingerprint (Epstein, 16,971)

A single automated pipeline stamps all 16,971 files with the same mechanical signature — identical hex-encoded Helvetica overlay, the same retained bloat, and exactly two physical page layouts across the entire release.

Provenance hides in the metadata. Process hides in the geometry. When a single automated pipeline produces seventeen thousand documents, it stamps every one of them with the same mechanical signature, and that signature is measurable down to the content-stream index. We characterised the page-identifier overlay and the structural bloat the stamping step leaves behind, across all 16,971 files:

Manufacturing trait | Prevalence | What it means
Bates overlay in a dedicated content stream | 100% (16,971) | The identifier is drawn by its own injected stream, not part of the page
Identifier hex-encoded | 100% | The Bates text is written as a hex string, the same way every time
Drawn in Helvetica | 100% | One non-embedded font for every stamp in the corpus
Stamp size exactly 12.0 pt | 80.9% (13,730) | The remainder are the same 12 pt stamp scaled by the page transform
Empty leftover content stream | 100% | The pipeline leaves an unused stream behind in every file
Legacy /ProcSet array retained | 100% | A deprecated resource declaration carried in every file
Retained page /Thumb thumbnail | 97.4% (16,532) | An abandoned thumbnail nobody removed

The overlay does not just look uniform; it resolves into exactly two physical page layouts and nothing else. A five-stream page with the Bates stamp at content-stream index 3 accounts for 83.1% (14,110) of the corpus; a four-stream page with the stamp at index 2 accounts for the other 16.9% (2,861). A release assembled by hand, or by more than one tool, does not converge on two byte-level page shapes across seventeen thousand files. This is the fingerprint of a single stamping pass, and it is the kind of signal a malware verdict never looks for and a structural engine reads immediately.
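A layout census of this kind reduces each page to a pair (number of content streams, index of the stamp stream) and counts the distinct pairs. The sketch below assumes the streams have already been classified into roles; the role names and the toy pages are hypothetical, and a real scanner would derive them from each page's /Contents array.

```python
from collections import Counter


def layout_of(streams: list) -> tuple:
    """Return (stream count, index of the stamp stream) for one page.

    `streams` is a hypothetical pre-parsed list of stream roles such as
    "content", "empty" or "bates". A page with no stamp gets index -1.
    """
    stamp_index = streams.index("bates") if "bates" in streams else -1
    return (len(streams), stamp_index)


def layout_census(pages: list) -> Counter:
    """Tally how many pages share each physical layout."""
    return Counter(layout_of(p) for p in pages)


# Toy corpus mirroring the two layouts described above; the order of the
# non-stamp streams is invented for illustration.
five_stream = ["content", "content", "empty", "bates", "content"]
four_stream = ["content", "empty", "bates", "content"]
pages = [five_stream] * 5 + [four_stream] * 1

census = layout_census(pages)
for layout, count in census.most_common():
    print(layout, count)
```

The number of distinct keys in the census is the signal: a single stamping pass yields a handful, while hand assembly or mixed tooling would be expected to yield many.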

The forensic value generalises well beyond this corpus: any production line that touches documents in bulk leaves a measurable manufacturing signature, and the absence or mixing of that signature is itself evidence. We added dedicated Bates-stamp and file-bloat detection to the scanner during this work, so it now records these traits on every file it sees.

Finding 5 — Three Engines, One Population (Epstein, 16,971 — new analysis)

The release reduces to 181 object-graph templates, and three independent engines select the identical 10,499 documents with 100% agreement — proof by triangulation that no single-purpose tool could produce.

This is a finding only a multi-engine scanner can produce, and it is new here. Run a single tool over the corpus and you get one opinion. Run dozens of independent engines and you can ask a question no single tool can answer: do unrelated measurements agree? On this corpus the answer is exact, and it is striking.

Start with the object graph. Every PDF has an internal skeleton — the order and kind of its core objects (catalog, fonts, encodings, pages, object streams, cross-reference streams). Reduced to a signature, the entire 16,971-document release collapses to just 181 distinct object-graph templates. One template alone accounts for 61.9% (10,499) of the corpus; the top ten cover 85%. The internal structure is as stamped as the Bates number on the page.
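One way to reduce a skeleton to a signature is to hash the ordered list of object kinds and count how many documents share each hash. This is a sketch of that idea only: the kind labels, the separator, and the truncated SHA-256 are assumptions, not the scanner's actual signature scheme.

```python
import hashlib
from collections import Counter


def template_signature(object_kinds: list) -> str:
    """Hash the ordered sequence of core object kinds into a short signature."""
    joined = ">".join(object_kinds)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


def template_census(corpus: dict) -> Counter:
    """Count how many documents share each object-graph signature."""
    return Counter(template_signature(kinds) for kinds in corpus.values())


# Hypothetical three-document corpus: two files share one skeleton.
corpus = {
    "a.pdf": ["catalog", "pages", "font", "objstm", "xref"],
    "b.pdf": ["catalog", "pages", "font", "objstm", "xref"],
    "c.pdf": ["catalog", "pages", "font", "font", "xref"],
}

census = template_census(corpus)
print(len(census), census.most_common(1))
```

Because the signature depends on order as well as kind, two files built from the same objects in a different sequence land in different templates.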

Now the cross-validation. The object-graph engine that surfaces that dominant 10,499-document template is not the only one that isolates this population. Three completely independent engines, sharing no mechanism, each select a set of documents, and they select the identical files:

Independent engine | What it measures | Documents selected
Campaign attribution | Object-graph structural signature | 10,499
Metadata recovery | Recovered OmniPage CSDK 21.1 producer string in orphaned objects | 10,499
Differential parsing | Six external parsers disagree on the declared version | 10,499

Intersection of all three sets: 10,499. Union: 10,499. Agreement (Jaccard): 100.0%. Not merely the same count: the same files.
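The agreement figure is a Jaccard index generalised to three sets: the size of the common intersection divided by the size of the union. A minimal sketch follows; the document IDs are hypothetical placeholders.

```python
def jaccard(*sets):
    """Intersection size over union size for two or more sets."""
    if len(sets) < 2:
        raise ValueError("need at least two sets to compare")
    union = set().union(*sets)
    if not union:
        return 1.0  # all sets empty: trivially identical
    return len(set.intersection(*sets)) / len(union)


# Hypothetical document IDs selected by three engines.
campaign = {"doc-001", "doc-002", "doc-003"}
metadata = {"doc-001", "doc-002", "doc-003"}
parsing = {"doc-001", "doc-002", "doc-003"}
print(jaccard(campaign, metadata, parsing))

# Same count, different files: the score drops below 1.0.
other = {"doc-002", "doc-003", "doc-004"}
print(jaccard(campaign, other))
```

The second call is the case a count comparison would miss: both sets hold three documents, yet only two are shared.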

The shape of the object graph, a string buried in an orphaned dictionary, and the way six outside parsers stumble on the file's version have nothing to do with one another. That they partition out the exact same 10,499 documents is not a coincidence; it is proof, by triangulation, that the dominant OmniPage scanning pipeline, labelled Pipeline A, is a single coherent production and not an artefact of any one measurement. This is what a multi-engine forensic approach buys that a single-purpose scanner cannot: engines that check each other.

One more layer, and it is the counter-intuitive one. The production is uniform, but the documents are not. On the subset for which the scanner recorded a fuzzy content hash (8,081 files), near-duplication is almost entirely absent: 99.1% have no near-twin in the corpus, and the largest near-identical family is ten documents. So the picture is a uniform factory turning out unique products — one pipeline, a handful of structural templates, and seventeen thousand genuinely distinct documents passing through it. That combination is itself a signature: it is what the bulk digitisation of a real, heterogeneous archive looks like, and it is not what templated or synthetic content would look like.

Finding 6 — The Numbering Tells a Story (Epstein, 16,971)

The page numbering is internally perfect (16,914 of 16,970 boundaries exact, zero overlaps) yet covers only 2.4% of its own range, in sixteen islands — a completeness measure no prior study had a corpus large enough to perform.

The same manufacturing uniformity that stamps every page (Finding 4) makes the page identifiers a sequence, and a sequence can be checked for completeness. This turned out to be the most quietly remarkable result in the entire corpus, and it is a forensic technique none of the earlier studies had: Bates-sequence continuity analysis.

The released documents are internally page-perfect: for 16,914 of 16,970 consecutive document boundaries, the next document's first Bates number equals the previous document's first number plus its exact page count, with zero overlaps anywhere in the corpus. The numbering is clean, strictly per-page, and never reused. Yet the 16,971 documents (67,143 pages) sit inside a numbered range running to 2,853,705. The release accounts for about 2.4% of its own numbered range, and the present documents collapse into sixteen contiguous islands — with 86.8% of all documents (14,723) packed into the first 40,000 numbers, followed by three contiguous deserts:

Absent contiguous range (between two released documents) | Numbered pages absent
EFTA00039881 → EFTA01262782 | 1,222,900
EFTA01264396 → EFTA02209722 | 945,325
EFTA02212972 → EFTA02730265 | 517,290
+ 53 further gap boundaries | 2,787,447 numbered pages absent in total

What our data establishes is structural: the released PDFs occupy 2.4% of a page-contiguous, reuse-free numbering range, in sixteen islands, with three gaps of roughly 517,000 to 1.22 million pages each. What it cannot establish is the cause of the absences: withheld, redacted in full, released only as non-PDF exhibits, or part of separate productions. We report the shape and scale of what is absent, not the reason for it.
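The continuity check itself is simple once each document is reduced to its first Bates number and page count. The sketch below is an illustration under that assumption, with invented toy numbers; it classifies each boundary as exact, overlapping, or a gap, and reports coverage of the numbered span.

```python
def bates_continuity(docs):
    """Analyse (first_number, page_count) pairs as one numbered sequence."""
    if not docs:
        raise ValueError("need at least one document")
    ordered = sorted(docs)
    exact = overlaps = 0
    gaps = []  # (last present number, next present number, pages absent)
    for (first, pages), (next_first, _) in zip(ordered, ordered[1:]):
        expected = first + pages  # where the next document should start
        if next_first == expected:
            exact += 1
        elif next_first < expected:
            overlaps += 1
        else:
            gaps.append((expected - 1, next_first, next_first - expected))
    present = sum(pages for _, pages in ordered)
    last_first, last_pages = ordered[-1]
    span = last_first + last_pages - ordered[0][0]
    return {
        "boundaries": len(ordered) - 1,
        "exact": exact,
        "overlaps": overlaps,
        "gaps": gaps,
        "coverage": present / span,
    }


# Toy production: five documents, one gap between numbers 6 and 101.
docs = [(1, 3), (4, 2), (6, 1), (101, 4), (105, 1)]
report = bates_continuity(docs)
print(report)
```

A boundary counts as exact only when the next document begins on the very next number, which is the same per-page test applied to the 16,970 boundaries above.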

Full treatment in The Epstein Files, Forensically.

Finding 7 — Signed Is Not the Same as Read (curated forms + fixtures)

A digital signature can certify a form field whose stored value and displayed appearance disagree, so the document that was signed is not the document that is read — caught with five rendering-free checks.

The most counter-intuitive form of the human-vs-machine gap lives in interactive forms and signatures. A form field carries two independent representations: its value (/V, the data) and its appearance stream (/AP, what is drawn). They can disagree. A digital signature covers a byte range, and that range can include a value while the viewer renders a regenerated appearance showing something else — so the document that was signed is not the document that is displayed.

We catch this without rendering, through five checks: /NeedAppearances on a signed file, checkbox /V-versus-/AS mismatch, appearance-stream text extraction with font-encoding remap (so a custom encoding that draws "9" where the value is "1" is caught), image-based appearance streams, and blank appearance streams. A "signed" form can store one number and show another, entirely within the signed bytes. Detail in PDF Forms as Executable Security Boundaries.
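Two of those five checks are plain dictionary comparisons and can be sketched without any rendering. The functions below operate on hypothetical pre-parsed dictionaries keyed by PDF names; they illustrate the idea and are not the scanner's code. Treating a missing /V or /AS as /Off is an assumption of this sketch.

```python
def checkbox_mismatch(field: dict) -> bool:
    """True when a checkbox's stored value and displayed state disagree.

    "/V" is the stored value and "/AS" the appearance state being drawn.
    A missing entry is treated as "/Off" (an assumption of this sketch).
    """
    value = field.get("/V", "/Off")
    shown = field.get("/AS", "/Off")
    return value != shown


def regenerates_signed_appearance(acroform: dict, is_signed: bool) -> bool:
    """True when a signed file asks the viewer to rebuild field appearances."""
    return is_signed and bool(acroform.get("/NeedAppearances", False))


# The data says checked, the page shows unchecked.
print(checkbox_mismatch({"/V": "/Yes", "/AS": "/Off"}))
print(regenerates_signed_appearance({"/NeedAppearances": True}, is_signed=True))
```

Both checks read only the stored structure, so they work on files that no viewer has opened.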

What Is New Since the Earlier Studies

A synthesis should earn its place. Five things in this body of work did not exist in the seven studies on their own and came directly out of running the engine against a real production at full scale:

  • Byte-verified metadata recovery. The orphaned-object recovery in Finding 3 was confirmed against the original file bytes with two independent tools (pikepdf and pdfinfo), establishing that "metadata removed" can mean "metadata unlinked, not deleted" — and that a deep object walk recovers it on 65.8% of a real production. This is a reproducible forensic result, not an inference.
  • The manufacturing-fingerprint measurement. Quantifying a production pipeline's mechanical signature down to the content-stream index (Finding 4), and showing it resolves to exactly two page layouts across 16,971 files, is a generalisable technique for distinguishing automated bulk production from hand assembly.
  • Multi-engine cross-validation by triangulation. Showing that three independent engines partition out the identical 10,499 documents with 100% agreement (Finding 5) is a method, not just a result: unrelated measurements that converge are a proof structure single-purpose tools cannot offer, and it reduced 16,971 files to 181 object-graph templates along the way.
  • Bates-sequence continuity analysis. Treating a numbered production as a sequence and measuring its completeness (Finding 6) is a new analytical lens, only possible on a corpus large and uniform enough to have a sequence at all.
  • A new engine capability. The scanner gained dedicated Bates-stamp and file-bloat forensics, shipped during this work — it now records how a page identifier is applied (font, size, encoding, dedicated content stream) and the structural bloat a stamping pipeline leaves behind, on every file it sees.

The deeper result is the cross-corpus one: the engine's value did not depend on there being malware to find. On the one corpus where the malware axis was completely empty, it produced the richest structural findings of all three. That is the strongest evidence we have that document forensics and malware scanning are different disciplines.

What This Means for AI & RAG

Every finding above lands on the same operational point for anyone feeding PDFs to an automated reader. The page is for humans; the machine consumes a different representation, and across these corpora that representation is frequently not what the page shows.

  • You ingest the text layer, not the page. On 18.6% of a real release that layer diverged from the image — a model can answer fluently from text the source page does not support, with no visible cue.
  • Your corpus depends on your parser. On 61.9% of that release, independent libraries disagreed on how to decode the file. Pin one extractor and version it, or two teams build measurably different indexes from the identical documents.
  • The shipped text is provisional. Where pages are fixed low-resolution rasters, a modern OCR pass produces a different, usually better text layer than the one bundled in the file. Treat embedded OCR as a draft.
  • Absence is not evidence of absence. When a release is a small, clustered slice of its own numbering, a retrieval system that returns "nothing about X" is reporting the contents of the slice, not of the underlying production.
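Pinning an extractor is easier to enforce when every piece of ingested text carries a record of what produced it. The sketch below is one possible shape for such a record; the class, field names, and extractor labels are all hypothetical and not part of any PQ PDF tool.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRecord:
    source_sha256: str  # hash of the PDF bytes
    extractor: str      # hypothetical name of the pinned extractor
    version: str        # exact extractor version
    text_sha256: str    # hash of the text that extractor produced


def record(pdf_bytes: bytes, text: str, extractor: str, version: str) -> ExtractionRecord:
    """Build a provenance record for one extraction run."""
    return ExtractionRecord(
        hashlib.sha256(pdf_bytes).hexdigest(),
        extractor,
        version,
        hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )


def diverged(a: ExtractionRecord, b: ExtractionRecord) -> bool:
    """True when the same file yielded different text under two extractions."""
    return a.source_sha256 == b.source_sha256 and a.text_sha256 != b.text_sha256


pdf = b"%PDF-1.3 toy bytes"
first = record(pdf, "Total due: 100", "extractor-a", "1.0")
second = record(pdf, "Total due: 700", "extractor-b", "2.3")
print(diverged(first, second))
```

Storing the record beside each indexed chunk makes a parser change visible as a diff in the index rather than as an unexplained change in answers.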

None of this concerns document content or its truth. It concerns whether an automated reader can faithfully recover what the page shows — and for a large share of real PDFs, by construction, it cannot without deliberate countermeasures. See PDF Structural Problems in AI Ingestion Pipelines.

Scope & What We Did Not Do

This is a structural synthesis. Its boundaries are deliberate:

  • We did not blend the corpora into a single metric. Detection figures come only from the curated set; the false-positive rate comes only from the benign control; behavioural and structural figures at scale come only from the Epstein release. Each number is labelled with its source.
  • We did not assess the truth, accuracy, or meaning of any document's contents. Every finding concerns how files were produced and encoded, not what they say.
  • We did not attribute motive. Metadata stripping, the numbering gaps, and reality drift are reported as observable facts; whether any was deliberate is outside what PDF structure can establish.
  • We did not re-open concluded research to inflate it. The earlier studies keep their own scoped figures; this article cites them, it does not overwrite them.

The Seven Studies

This synthesis stands on seven prior pieces of original research. Each is the full treatment of a mechanism summarised above.

Study | One-line finding
PDF Forensics at Scale | A 0.34% false-positive rate on a 6,281-PDF real-world control, with live-malware detection by analysis rather than reputation.
The Epstein Files, Forensically | All 16,971 DOJ release PDFs: malware-clean, 100% metadata-stripped but recoverable, 18.6% reality drift, 2.4% of the Bates range present.
Parser Disagreement | Eleven crafted PDFs through six parsers: every file produced a different reading from the same bytes.
PDF Reality Drift | One file, many realities: 43 of 44 IRS tax forms drift between the rendered page and the extracted text.
The PDF Semantic Determinism Problem | The root cause: a format built for visual fidelity guarantees pixels, not a single canonical meaning.
PDF Structural Problems in AI Ingestion | AI ingestion can be poisoned by the document itself: the model reads text the human never sees.
PDF Forms as Executable Security Boundaries | A signature can certify a form while value and appearance disagree: what gets signed is not what gets read.

Methodology

Every figure derives from machine-readable per-document output — one structured result per PDF — produced by the same 47-engine PQ PDF forensic scanner across all three corpora. The verdict model is multi-axis: threat (malware and exploit), deception (content-integrity), and structure (neutral capability) are scored on separate axes, so a complex-but-legitimate file is never mistaken for an attack and a deceptive one is not waved through because it carries no exploit.
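Keeping the axes separate can be pictured as a record with three independent scores, where the structure axis never raises an alarm by itself. This is a sketch of the concept only: the field types, the integer scores, and the summary labels are assumptions, not the scanner's verdict schema.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    threat: int = 0     # malware and exploit evidence
    deception: int = 0  # content-integrity evidence
    structure: int = 0  # neutral capability, never an alarm on its own

    def summary(self) -> str:
        """Report the most serious axis that carries any evidence."""
        if self.threat:
            return "malicious"
        if self.deception:
            return "deceptive"
        return "clean"


# A complex but legitimate form: lots of structure, nothing else.
print(Verdict(structure=9).summary())
# A file with no exploit whose text layer contradicts the page.
print(Verdict(deception=1).summary())
```

Collapsing the three fields into one number would force the first example to look suspicious or the second to look safe, which is the failure the separate axes avoid.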

Corpus provenance: the curated detection set and the GovDocs1 benign control are described in full in PDF Forensics at Scale; the 16,971-document release and its per-engine methodology in The Epstein Files, Forensically. The OCR-divergence measure flags a page only when normalized word-set overlap between the embedded text and a fresh OCR of the rendered image falls below 0.30, with gates that exclude blank and figure-only pages. Metadata-recovery results were confirmed against original file bytes with two independent readers. No figure in this article averages across corpora.


© 2026 PQ PDF — All rights reserved.
