How many PDFs and malware samples does the PDF Forensics at Scale study cover?

Two corpora totalling nearly 8,000 real PDFs. A 1,572-document curated detection set across eight domains — 400 live malware samples from MalwareBazaar, 950 Mozilla pdf.js test files, real IRS and US-agency AcroForm/XFA forms, arXiv papers and GovInfo federal legislation, corkami adversarial proof-of-concept files, PDF 2.0 examples, and hand-crafted value/appearance and hidden-JavaScript fixtures — and a separate 6,281-document GovDocs1 real-world benign control (Digital Corpora's web-crawled government and academic PDFs) used to measure and harden the false-positive rate. Every file was scored by the same 47-module forensic engine the public scanner runs.

What was the malware detection rate?

Of the 400 live malware samples, 399 completed full analysis and every one was classified high-risk or dangerous on the threat axis (median threat score 203); none scored clean or low. The remaining sample was a resource-exhaustion case (a decompression bomb) that was separately flagged as resisting analysis rather than scored. The figure characterises behaviour on this 400-sample snapshot; it is not a claim of universal detection across all PDF malware families.

What was the false-positive rate on legitimate documents?

Measured against an independent 6,281-document GovDocs1 benign control (web-crawled government and academic PDFs), the scanner grades 96.7% of the corpus clean or low and flags 3.3% suspicious-or-higher. That 3.3% is not error: most of it is real malware the web-crawled corpus genuinely contains (2.7% of the corpus) plus a small number of genuine content-integrity findings. After removing those, the false-positive rate on genuinely-benign documents is 0.34% (20 of 5,899).

Is the data and code reproducible?

Yes. The per-file results for all 1,572 curated files (multi-axis scores and per-domain detection flags), the benign-corpus manifest, the SHA-256 hashes of all 400 malware samples, and the scan harness, resource governor and analysis scripts are published as a downloadable bundle. The 6,281-document GovDocs1 benign control is the public Digital Corpora corpus (retrievable by zipfile range), with the per-file verdicts and the false-positive breakdown published in the bundle. The live malware binaries are not redistributed but each is retrievable from MalwareBazaar by hash.

PDF Forensics at Scale — a 7,800-PDF Study

Executive Summary

Reducing a PDF to a single malware verdict and a single score leaves out most of what matters about it. A PDF is a rendering format, not a statement of fact: the value a human sees, the value a parser extracts, and the value a signature covers can all differ inside the same file. This study measures a scanner built around that reality, scoring every document on three independent axes — threat (can it execute an exploit), integrity (does it deceive: value ≠ appearance, uncovered signature, reality drift), and capability (does it merely contain script, forms, or embedded files) — across two corpora totalling nearly 8,000 real PDFs: a 1,572-document curated set for measuring detection, and an independent 6,281-document real-world benign control for measuring (and hardening) the false-positive rate.

Corpus — two sets, ~7,850 real PDFs. A 1,572-document curated detection set across eight domains (400 live malware samples from MalwareBazaar, 950 Mozilla pdf.js real-world files, real government AcroForm/XFA forms, academic and legislative publications, adversarial proof-of-concept files, and hand-crafted fixtures), plus an independent 6,281-document GovDocs1 real-world benign control (Digital Corpora's web-crawled government and academic PDFs) used as the false-positive control. The detection figures below are from the curated set; the false-positive figures are from the 6,281-PDF control.
Malware. Of the 400 live samples in the curated set, 399 completed analysis and every one was classified high-risk or dangerous (median threat score 203); the remaining sample was a decompression bomb, flagged as resisting analysis rather than scored. No sample was graded clean or low.
False positives — measured against 6,281 real-world PDFs. The curated set is too small and too clean to measure a false-positive rate, so the rate is measured against an independent 6,281-document GovDocs1 benign control (Digital Corpora's web-crawled government and academic PDFs — the broadest producer diversity available). The scanner grades 96.7% of the corpus clean or low and flags 3.3% suspicious-or-higher. That 3.3% is not error: most of it is real malware the web-crawled corpus genuinely carries (2.7% of the corpus) plus a handful of real content-integrity findings; after removing them, the false-positive rate on genuinely-benign documents is 0.34% (20 of 5,899). See the real-world control.
Parser disagreement — arguably the most consequential result. The six parsers diverged on 502 of the 1,572 curated files (roughly one in three) on page count, JavaScript visibility, encryption status, or AcroForm presence. This is not a security finding alone: it means a third of real-world PDFs can yield a different answer depending on which library reads them — with implications for forensics, compliance, e-discovery, digital-signature validation, document indexing and AI ingestion, even when no malware is involved.
AI ingestion. These conflicting realities do not stop at the human reviewer: value/appearance divergence, reality drift and parser disagreement follow a PDF into RAG pipelines and training corpora, where a single-parser extraction layer indexes whichever reality it happened to read as authoritative fact.

A note on reading these numbers: the false-positive rate is now measured against a large, producer-diverse benign control (6,281 GovDocs1 documents), which directly addresses the small-negative-class weakness of the original curated set. The remaining 3.3% is not error — GovDocs1 contains real malware, and ~2.7 points of it are genuine detections. The malware detection figures remain a snapshot of already-known samples. These results show the scanner is not signature-dependent, does not alarm on the structural quirks of ordinary real-world PDFs, and still fires on real exploit content — not that detection is perfect in production. The Scope & Limitations section states exactly how far each result generalises.

Contents

Executive summary
The curated corpus: 1,572 PDFs across eight domains
How the scanner reaches a verdict
Results across the corpus
False positives: a 6,281-PDF real-world control
Malware & exploit detection
Forms, signatures & value/appearance divergence
Reality drift: rendered page vs extracted text
Parser disagreement
What this means for AI ingestion
Capability is not threat: the multi-axis verdict
A verdict on every file
Behaviour over signatures
Scope & limitations
Methodology & reproducibility
Data & code availability
References

The Curated Corpus: 1,572 PDFs Across Eight Domains

A scanner is only as credible as the corpus it is measured against. Rather than a set of files built to demonstrate a feature, this study uses a large, diverse population that is mostly not ours: real documents, real malware, and adversarial files written by other people for other purposes. This section describes the curated detection corpus (1,572 files), the set used to measure detection. A second, far larger corpus, the 6,281-document GovDocs1 real-world benign control used to measure and harden the false-positive rate, is covered separately in the real-world control section.

Domain	Files	Source
Real-world structural	950	Mozilla pdf.js corpus — fonts, CID/Type3, encodings, broken headers, annotations, JavaScript, and deliberately-malformed exploit test files
Live malware	400	MalwareBazaar (abuse.ch) samples tagged PDF — real malicious documents, analysed statically
Adversarial PoC	103	corkami / structural edge cases — truncated xrefs, version mismatches, orphan objects, encoding tricks
Government forms	46	Real IRS / US agency AcroForm and XFA forms (W-9, 1040, 941, VA-10091, …)
Clean publications	34	arXiv papers (2023–2025) and GovInfo federal legislation
Value/appearance divergence	28	Hand-crafted V/AP and evasion fixtures
PDF 2.0 features	6	PDF Association PDF 2.0 example files
Hidden JavaScript	5	Names-tree, orphan/sleeper, ObjStm, OpenAction, annotation-/AA fixtures
Total	1,572	Eight domains, benign through actively malicious

Two blocks anchor the measurement. The live malware set is the detection control: 400 known-bad documents, so any one scoring clean is a miss. The clean publications and real government forms are an initial false-positive control: legitimate documents, several of them complex (XFA forms, digital signatures, embedded JavaScript), so any one graded as malware is an over-call. This curated benign set is small (86 documents), which is exactly why the real false-positive measurement was moved to the 6,281-document GovDocs1 control in the real-world control section. A small, clean control can confirm the scanner does not alarm on complex legitimate documents, but it cannot surface the false-positive classes that only a large, messy, real-world population reveals. The pdf.js block sits between the two controls: Mozilla maintains it precisely because it contains pathological and exploit test files, so a high score there is frequently correct.

How the Scanner Reaches a Verdict

Every upload runs through a single 47-module pipeline — structural validation, differential multi-parser analysis, a behavioural sandbox, YARA, entropy and steganography analysis, signature and DocMDP forensics, JavaScript and XFA analysis, and the content-integrity engines (value/appearance divergence, reading-order, OCR-layer, ToUnicode, accessibility-tree). To be precise about the number: these are 47 analysis passes over a shared parse of the file — not 47 independent detection products, and not 47 separate renderers (the six renderers are one of the passes). The count describes coverage, not redundancy.

Their findings feed a multi-axis verdict. Rather than collapsing everything into one number, the scanner scores three independent axes:

Threat — exploitation and execution: shellcode patterns, CVE signatures, auto-executing JavaScript that performs dangerous operations, launch actions, attack chains.
Integrity / deception — the document asserting a different reality to a human than to a machine: value/appearance divergence, signature coverage gaps, reality drift between the rendered page and the extracted text.
Capability (structural) — neutral facts about what the document can do: it contains a form, JavaScript, an embedded file, XFA. These are reported but do not, by themselves, drive a malware verdict.

The Multi-Axis Verdict — Three Independent Scores

The headline band follows the threat axis. Integrity and capability are reported independently, so a complex-but-legitimate document is graded for review on its own terms instead of being alarmed on as malware.

This separation is the difference between a scanner that cries wolf and one that is useful. A legitimate tax form contains JavaScript and XFA — real capability — but no threat; the multi-axis verdict keeps it off the malware band while still surfacing what it contains. The sections below report how each axis performed across the corpus.

Each indicator carries a severity that contributes weighted points to its axis (critical = 50, high = 25, medium = 10, low = 3, capped at three occurrences per indicator). The threat axis determines the headline band:

Threat score	Band	Meaning
0	Clean	No threat-axis indicators
1–29	Low	Capability or weak signals only
30–149	Suspicious	Worth review
150–349	High-risk	Strong threat indicators
350+	Dangerous	Strong evidence of an exploit chain or active malicious behaviour

These are weighted heuristic aggregates, not CVSS-equivalent severities.
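The weighting and banding rules above can be sketched in a few lines. This is a minimal illustration, not the scanner's code: the function names and the `(indicator_name, severity)` input shape are assumptions made for the example, while the weights, the three-occurrence cap and the band thresholds are taken from the text and table above.

```python
from collections import Counter

# Severity weights and the per-indicator cap, as stated in the text.
SEVERITY_POINTS = {"critical": 50, "high": 25, "medium": 10, "low": 3}
MAX_OCCURRENCES = 3

# Lower bound of each threat band, highest first, from the band table.
BANDS = [
    (350, "Dangerous"),
    (150, "High-risk"),
    (30, "Suspicious"),
    (1, "Low"),
    (0, "Clean"),
]


def threat_score(indicators):
    """Sum weighted points for (indicator_name, severity) pairs.

    The input shape is hypothetical. Each distinct indicator contributes
    at most MAX_OCCURRENCES times, however often it fires.
    """
    counts = Counter(indicators)
    return sum(
        SEVERITY_POINTS[severity] * min(n, MAX_OCCURRENCES)
        for (_, severity), n in counts.items()
    )


def band(score):
    """Map a threat-axis score to its headline band."""
    for floor, name in BANDS:
        if score >= floor:
            return name
    raise ValueError("score must be non-negative")


# Five hits of one critical indicator count as three, plus one high indicator.
example = [("launch_action", "critical")] * 5 + [("openaction_js", "high")]
print(threat_score(example), band(threat_score(example)))
```

The cap keeps a single noisy indicator from dominating the axis: in the example, five critical hits contribute 150 points rather than 250.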

Execution-vector gating: why complexity is not threat

The weights and bands are only half the verdict. The correlation engine applies a hard gate: a document that cannot execute code — no JavaScript, no /Launch, no embedded executable, no /RichMedia, and no behavioural-sandbox signal — is forced to the low band regardless of its raw score. Without an execution vector, a high feature count reflects document complexity, not threat capability. Pattern-only findings (YARA and stream-inspector hits) are additionally downgraded from critical/high to medium when no execution vector is present. This single rule is the main reason a feature-rich legitimate document — a signed XFA tax form with hundreds of scripts — does not appear malicious, and it is why the dangerous band carries a second gate: it requires a confirmed execution vector or attack chain, not merely a high aggregate. Conversely, the correlation engine raises severity when independent engines agree (a multi-engine consensus bonus) and when a divergence coincides with an execution vector — so a parser split that hides a script is escalated while a harmless structural quirk is not.
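A rough sketch of the first gate and the pattern-only downgrade follows. The feature names, the `Finding` type and both function names are hypothetical, chosen only for illustration. The sketch also assumes the gate acts as a cap (a Clean document stays Clean), and it does not model the second gate on the dangerous band or the consensus bonus.

```python
from dataclasses import dataclass, replace

# Execution vectors named in the text; the string labels are illustrative.
EXECUTION_VECTORS = frozenset(
    {"javascript", "launch", "embedded_executable", "richmedia", "sandbox_signal"}
)


@dataclass(frozen=True)
class Finding:
    engine: str
    severity: str
    pattern_only: bool = False  # True for YARA / stream-inspector style hits


def has_execution_vector(features):
    """True if the document exposes at least one way to execute code."""
    return bool(EXECUTION_VECTORS & set(features))


def downgrade_pattern_findings(findings, features):
    """Lower pattern-only critical/high findings to medium when nothing can run."""
    if has_execution_vector(features):
        return list(findings)
    return [
        replace(f, severity="medium")
        if f.pattern_only and f.severity in ("critical", "high")
        else f
        for f in findings
    ]


def gated_band(raw_band, features):
    """Cap the headline band at Low when no execution vector is present.

    Assumption: the gate is a cap, so a Clean document is left as Clean.
    """
    if raw_band != "Clean" and not has_execution_vector(features):
        return "Low"
    return raw_band


# A feature-rich form with no way to execute code is held at Low.
print(gated_band("High-risk", {"xfa", "acroform", "signature"}))
```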

Results Across the Curated Corpus

The headline result is the threat-band distribution by domain across the 1,572-document curated set (the real-world false-positive results are in the real-world control). The two control blocks behave exactly as a working detector should: every live-malware sample lands in high-risk or dangerous, and no clean publication, government form or PDF 2.0 file reaches the malware band (suspicious or above on the threat axis). The pdf.js block spreads across every band because it is itself a mix of benign and deliberately-malicious test files.

Domain	n	Clean	Low	Suspicious	High-risk	Dangerous
Live malware	399	0	0	0	347	52
pdf.js real-world	949	550	270	109	17	3
Adversarial PoC	103	2	30	37	23	11
Government forms	46	0	45	1	0	0
Clean publications	34	14	19	1	0	0
V/AP fixtures	28	24	1	0	3	0
PDF 2.0 examples	6	6	0	0	0	0
Hidden-JavaScript	5	0	0	0	3	2

Two notes on the apparent “high” counts in the real-world and adversarial blocks. The 20 pdf.js files at high-risk or above are predominantly genuine exploit and fuzzer test files Mozilla ships on purpose — correct detections, not false positives. The adversarial block scores high because the corkami set is, by construction, malformed and structurally hostile; a forensic scanner is meant to react to it. The single suspicious clean-publication and the single suspicious government form are integrity-axis observations (a structural anomaly worth a glance), not malware calls — no file in either control block reached the threat-axis malware band.

Per-domain capability & detection signals

The same corpus, viewed by which signals fired in how many files of each domain. This is the at-scale evidence behind the per-mechanism sections that follow.

Domain	n	V/AP	Parser disagree	Reading-order	Accessibility tree	Signed	JavaScript	XFA
Live malware	399	0	131	55	69	0	30	3
pdf.js real-world	949	21	217	44	87	8	16	29
Adversarial PoC	103	0	56	0	0	0	22	1
Government forms	46	0	46	45	46	46	0	45
Clean publications	34	0	25	34	3	4	0	0
V/AP fixtures	28	9	20	1	1	2	1	0
Hidden-JavaScript	5	0	5	0	0	0	5	0

Read across the government-forms row: all 46 are signed, 45 use XFA, all 46 trigger a parser disagreement, 45 show reading-order ambiguity, and all 46 carry an accessibility tree — these are genuinely complex documents — yet 0 fire a value/appearance divergence and 0 reach the malware band. That row is the whole thesis of the multi-axis verdict in one line: maximum capability, correctly graded as low threat.

False Positives: a 6,281-PDF Real-World Control

The curated corpus above is useful for measuring detection, but it is the wrong instrument for measuring false positives: 86 hand-picked benign documents cannot represent the structural chaos of PDFs as they exist in the wild. The real false-positive rate is measured against an independent 6,281-document GovDocs1 benign control (Digital Corpora's web-crawled corpus of U.S. government and academic PDFs, spanning two decades of producers, tools and malformations). Of these, 6,065 complete analysis under the adaptive governor; the remaining 216 stall or produce no output and are excluded, so the shares below are fractions of the 6,065 analysed documents.

Run against this control, the scanner grades 96.7% of the corpus clean or low and flags 3.3% suspicious-or-higher. The verdict-band distribution:

Verdict band on the 6,065-PDF benign control	Documents	Share
Clean	2,662	43.9%
Low (capability noted, no threat)	3,203	52.8%
Suspicious	94	1.5%
High-risk	87	1.4%
Dangerous	19	0.3%
Clean or low	5,865	96.7%
Suspicious-or-higher	200	3.3%

That 3.3% is not error. The corpus is web-crawled and contains real malicious PDFs: 166 (2.7% of the corpus) are threat-driven: auto-executing /OpenAction → /Launch droppers, shadow-document signature bypasses and confirmed exploit patterns, i.e. true positives the corpus genuinely carries. Another 14 are genuine content-integrity findings (parser disagreement, weak signature crypto, real shadow/orphan objects, real value/appearance divergence). That leaves 20 residual false positives, a long tail of low-volume edges. After removing the real malware the corpus carries, the false-positive rate on genuinely-benign documents is 20 / 5,899 = 0.34% (0.76% if every non-malware flag is counted as an error).
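The headline rate can be recomputed from the counts quoted above. The short sketch below only restates that arithmetic; the variable and function names are illustrative, and the inputs are the figures reported in this section.

```python
# Counts quoted in this section.
ANALYSED = 6065          # documents that completed analysis
FLAGGED = 200            # graded suspicious or higher
REAL_MALWARE = 166       # threat-driven true positives in the corpus
RESIDUAL_FALSE_POSITIVES = 20


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Denominator for the false-positive rate: analysed documents minus real malware.
genuinely_benign = ANALYSED - REAL_MALWARE

flagged_share = share(FLAGGED, ANALYSED)
malware_share = share(REAL_MALWARE, ANALYSED)
false_positive_rate = share(RESIDUAL_FALSE_POSITIVES, genuinely_benign)

print(genuinely_benign)
print(f"{flagged_share:.1f}% flagged, {malware_share:.1f}% real malware, "
      f"{false_positive_rate:.2f}% false positives")
```

The point of the subtraction is that real malware is removed from the denominator as well as the numerator, which is why the rate is quoted over 5,899 documents rather than 6,065.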

How the scanner separates benign real-world structure from attacks

The multi-axis verdict turns on the difference between a structural feature and a structural feature carrying a payload. These are the real-world patterns that most distinguish ordinary documents from attacks — benign on their own, a finding only when an execution vector or genuine divergence accompanies them:

Real-world structure (benign on its own)	When the scanner treats it as a finding
Linearized/incremental trailer ordering. A linearized PDF's `/Size` is non-monotonic by file offset; the engine read the drop as objects being hidden — firing on 67% of real PDFs.	Compare revisions along the true `/Prev` chain, not file order. A genuine `/Size 20→6` object-hiding shrink still fires.
Linearized first-page object override. A normal editorial re-save re-defines the first-page object; scored as deception.	Gate on a JS/Launch payload in the update. A first-page override carrying a real execution vector still grades high-risk.
`setpagedevice` operator. Standard PostScript page-setup op in every Distiller/Ghostscript PDF, flagged as an RCE pass-through.	Informational — it is inert under PDF rendering. Genuine PostScript-exec ops (`) run`, `) exec`) still fire high.
qpdf “damaged”. Recoverable real-world malformation (truncated/off-by-one xref) graded as intentional corruption.	Scored medium; the Correlation Engine escalates it when combined with JavaScript/executable content.
Bare IP-address URL. Legacy and government PDFs link to raw IPs (file/stats servers); flagged as C2 beacons.	A plain IP is a weak signal (medium). A non-standard port or randomised/DGA host still grades high.
Action fan-in. Many form fields sharing one action (format/calculate, navigation) read as “trigger maximization.”	Severity follows the convergent action type: shared `/Launch` still high; shared navigation is normal structure.
Image colorspace mismatch. A DeviceGray colorspace wrapping a 3-channel JPEG (grayscale stored as RGB) read as “extra channels hiding data.”	Measure actual channel divergence: identical channels (R≈G≈B) are benign; genuinely divergent color data still flags as a carrier.
Tiny embedded image. A 1×1 embedded spacer pixel flagged as a remote tracking beacon — but an embedded pixel fetches nothing.	Informational; a real beacon needs a remote resource. A tiny image in a confirmed phishing context still escalates.
ToUnicode glyph remap. The flagship semantic-determinism check fired critical on whole-alphabet case folds (`A→a`) and broken digit→control maps.	Score only a true meaning substitution: both sides alphanumeric and differing beyond case. `A→`Cyrillic `А` / `1→9` / `$1,200→$12,000` still fire.
Null-byte injection. Non-BOM UTF-16 metadata (titles, bookmarks, field values — ~50% null by construction) counted as token-evasion nulls.	Strip alternating-null UTF-16 text runs first. An isolated `/Launch\x00` token-split injection still counts.

The same discipline applies to the content-integrity and structural axes — each of these real-world patterns is benign on its own and a finding only with corroboration:

Null-byte “injection”: stop counting consecutive null runs (zero-fill padding) and nulls inside string literals; only isolated token-splitting nulls (/Launch\0) remain a signal.
Large object injection in the final revision: require a signature (/ByteRange) to be present — an unsigned multi-revision edit is publishing, not injection — and drop /EmbeddedFile from the “execution vector” set.
Info vs XMP ModDate disagreement and oversized JPEG EXIF/COM segments: downgraded — editors routinely touch one date field, and large EXIF is normal photo metadata (ICC profiles, thumbnails).
qpdf structural “errors”: exclude benign linearization-hint inconsistencies that qpdf reads through fine.
ToUnicode glyph remap (two refinements): require the remap to stay within a character category (a letter→digit map is a broken subset font, not a meaning substitution), and treat 6+ distinct remaps as a systematically scrambled font rather than a targeted attack.
OCR / text-layer mismatch: require the rendered image to OCR to substantial text before flagging — an empty-OCR figure/photo/map page with a caption is not a poisoned scan.

Every one of these distinctions is validated both ways: the benign real-world pattern grades clean/low, and a constructed positive control carrying the genuine payload still fires. The residual on genuinely-benign documents is 0.34% (20 of 5,899); the rest of the flagged population is real malware the corpus carries (166) and genuine content-integrity findings (14).
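As one concrete case, the ToUnicode refinements described above (both sides alphanumeric, same character category, differing beyond case, and six or more distinct remaps treated as a scrambled font) might be sketched as follows. The function names and return labels are hypothetical, and applying the six-remap threshold to qualifying remaps only is an assumption of this sketch rather than a documented behaviour of the engine.

```python
def _category(ch):
    """Coarse character category used by this sketch: 'digit', 'letter' or None."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return None


def is_meaning_substitution(drawn, extracted):
    """True when a glyph remap changes meaning rather than case or font plumbing."""
    if len(drawn) != 1 or len(extracted) != 1:
        return False
    cat_drawn, cat_extracted = _category(drawn), _category(extracted)
    # Both sides must be alphanumeric and stay within one category.
    if cat_drawn is None or cat_drawn != cat_extracted:
        return False
    # Whole-alphabet case folds (A -> a) are not a substitution.
    return drawn.casefold() != extracted.casefold()


def classify_font(remaps, scramble_threshold=6):
    """Classify a font's (drawn, extracted) remap pairs.

    Assumption: the threshold counts distinct qualifying remaps.
    """
    hits = {pair for pair in remaps if is_meaning_substitution(*pair)}
    if not hits:
        return "benign"
    if len(hits) >= scramble_threshold:
        return "scrambled-font"
    return "targeted-substitution"


print(is_meaning_substitution("A", "a"))       # case fold only
print(is_meaning_substitution("A", "\u0410"))  # Latin A drawn, Cyrillic A extracted
print(classify_font([("1", "9")]))
```

String-level effects such as `$1,200→$12,000` arise from these per-glyph maps combined with layout, which this sketch does not model.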

Malware & Exploit Detection

The 400 live MalwareBazaar samples are the detection control. Of the 400, 399 completed full analysis and every one was classified high-risk or dangerous (median threat score 203); none scored clean or low. The remaining sample was a decompression bomb that exhausted the analysis budget — it was separately flagged as resisting analysis rather than scored (see A verdict on every file).

Metric	Result
Samples that completed analysis	399 / 400 (1 resource-bomb flagged separately)
Classified high-risk or dangerous	399 / 399 analyzed
Threat score range (min / median / mean / max)	160 / 203 / 256 / 999
Band split	347 high-risk, 52 dangerous
Scored clean or low (missed)	0
Detection signals	YARA (CVE + heap-spray patterns), behavioural sandbox (network/exec/mmap), object & action analysis, XRef integrity, differential parsing

Detection statistics

Treating the 399 analysed malware samples as positives and the 86 genuinely-benign real documents (clean publications, real government/XFA forms, PDF 2.0) as negatives gives a labelled two-class problem. (This ROC uses the small curated negative class; the production false-positive rate is measured separately on the 6,281-PDF real-world control in the real-world control section.) The pdf.js block is excluded from this calculation because it deliberately mixes benign and malicious files and cannot be cleanly labelled. Malware scored 160–999 and the benign control scored 0–38, so the two classes do not overlap.

Metric	@ suspicious (≥30)	@ high-risk (≥150)
True-positive rate (recall)	1.000 (95% CI 0.990–1.000)	1.000 (0.990–1.000)
False-positive rate	0.023 (0.006–0.081)	0.000 (0.000–0.043)
Precision	0.995	1.000
F1	0.997	1.000
ROC AUC	1.000 — complete separation (malware min 160 vs benign max 38)

95% confidence intervals are Wilson score intervals on n = 399 positives and n = 86 negatives. An AUC of 1.000 reflects complete class separation on this corpus; with larger and more varied negatives the boundary cases would grow, and these intervals are the honest bound on how far the result generalises. Every malware score is published per SHA-256 in malware-detection-breakdown.csv, so the table above can be recomputed from the raw data.
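For readers who want to recompute the intervals, the Wilson score interval has a closed form that needs only the standard library. The sketch below uses the usual formula with z ≈ 1.96 for 95% coverage; the function name is illustrative, and the inputs are the counts stated above (399 of 399 positives detected, 2 of 86 negatives at the suspicious threshold, 0 of 86 at the high-risk threshold).

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


recall_ci = wilson_interval(399, 399)   # recall at either threshold
fpr_suspicious_ci = wilson_interval(2, 86)
fpr_high_risk_ci = wilson_interval(0, 86)

for label, (lo, hi) in [
    ("recall", recall_ci),
    ("FPR @ suspicious", fpr_suspicious_ci),
    ("FPR @ high-risk", fpr_high_risk_ci),
]:
    print(f"{label}: {lo:.3f} to {hi:.3f}")
```

Unlike the simple normal approximation, the Wilson interval stays informative when the observed proportion is exactly 0 or 1, which is the situation for three of the four cells in the table.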

Per-engine contribution — and an honest caveat

Recording which engines raised a high or critical indicator on each of the 398 fully analysed samples shows how the verdict is reached — and exposes the corpus’s single biggest limitation in one number:

Engine / signal	Fired on	%
Threat-intelligence hash match	398	100%
Object analysis	57	14%
YARA (CVE + heap-spray)	38	10%
qpdf structural integrity	38	10%
Campaign attribution	38	10%
Pattern scanner	34	9%
ClamAV signatures	33	8%
Differential parsing	30	8%
Embedded-file / polyglot	19 / 18	5%
Structural, entropy, stego, JS, OCR, … (long tail)	≤13 each	≤3%

The 100% threat-intelligence hit rate is the most important caveat in this paper. It means every sample was already catalogued — which is exactly what a MalwareBazaar corpus is. On this corpus the verdict is therefore over-determined: a known-bad hash alone would flag every file, so the study cannot cleanly separate detection-by-prior-knowledge from detection-by-analysis. The analytical engines (YARA, ClamAV, structural integrity, differential parsing, object and embedded-file analysis) each fire on a subset, demonstrating they contribute independent signal, but the clean test of signature-independent detection — samples evaluated with the hash/reputation databases removed — is one this corpus could not answer on its own, so we ran it separately (below). The honest reading of the malware result is: the scanner reliably flags known-bad PDFs through multiple redundant paths.

Detection by analysis vs. by reputation

To separate the two, we scanned 402 MalwareBazaar PDFs outside our corpus with the hash/reputation engines disabled (URLhaus, MalwareBazaar and ThreatFox hash lookups, ClamAV, and TLSH similarity all forced to miss) and measured whether the pure analysis engines (YARA, sandbox, JavaScript-AST, exploit/structure, correlation) still raised a high/critical finding. The results, broken down by malicious-PDF type:

Malicious PDF type	Share	Caught by analysis alone
Carries a technical execution vector (JS / Launch / embedded / XFA)	10%	93% high-confidence (100% flagged)
Social-engineering (phishing link / dropper, no exploit)	90%	16% high-confidence (63% any signal)
All 402 novel samples	100%	24% high-confidence, 67% any signal

Two things follow, and both are worth stating plainly. First, the exploit and structural engines are genuinely strong: when a malicious PDF carries a technical exploit, analysis catches it 93% of the time with no hash help at all — signature-independent detection, demonstrated. Second, the reason the overall analysis number is only 24% is that ~90% of real-world malicious PDFs are social-engineering — a clean-looking document with a phishing link and no exploit. There is structurally little for any static analyzer to find in those; in production they are caught by the hash and URL-reputation feeds. The takeaway is not “the scanner misses malware” — it is that known malware is caught near-100% by reputation, novel technical exploits are caught ~93% by analysis, and novel social-engineering is a limit of static analysis generally, not of this engine specifically. The per-sample analysis outcomes are published by SHA-256 in novel-malware-analysis-detection.csv.
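The overall figures are a share-weighted blend of the two per-type rows, which a few lines of arithmetic make explicit. The helper name is illustrative; the shares and rates are the rounded values from the table, so the blend reproduces the reported totals only approximately.

```python
def blended_rate(segments):
    """Share-weighted average of per-segment rates.

    segments: iterable of (share, rate) pairs, with shares as fractions.
    """
    segments = list(segments)
    total_share = sum(share for share, _ in segments)
    return sum(share * rate for share, rate in segments) / total_share


# (share of samples, detection rate) for the two rows of the table.
high_confidence = blended_rate([(0.10, 0.93), (0.90, 0.16)])
any_signal = blended_rate([(0.10, 1.00), (0.90, 0.63)])

print(f"high-confidence: {high_confidence:.1%}, any signal: {any_signal:.1%}")
```

Because nine in ten samples sit in the social-engineering row, the overall number tracks that row closely regardless of how strong the exploit-path detection is.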

Architecturally, the redundancy still matters: a sample that hides JavaScript from one parser is surfaced by another, by the raw-byte pass, or by sandbox syscall capture — the hidden-JavaScript fixtures (payloads in a compressed Names tree, an orphan/sleeper object, or an object stream) are caught even when most structural parsers miss them. That is a property of the design, not a claim this corpus proves about novel malware.

The samples are kept private and not redistributed; per-sample scores are published by SHA-256 in the data bundle. Origin study: PDF Malware Scanner.

Forms, Signatures & Value/Appearance Divergence

A PDF form field stores its value (/V) and its rendered appearance (/AP) in two independent places, with no obligation to agree. When they diverge, a human sees one thing and a parser — or an LLM — reads another. The scanner detects this with five checks operating on the raw object model: /NeedAppearances, checkbox /V-vs-/AS, decoded /AP-stream text vs /V (with hex and UTF-16 decoding and /Opt resolution), blank-appearance detection, and missing-appearance detection.
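The five checks can be illustrated on a toy field model. Everything here is a simplified assumption for the sake of the example: fields are plain dictionaries rather than parsed PDF objects, the appearance-stream text is taken as already extracted, non-UTF-16 strings are decoded as Latin-1 rather than PDFDocEncoding, and /Opt resolution is not modelled. The function names and finding labels are hypothetical.

```python
def decode_pdf_string(raw: bytes) -> str:
    """Decode a field value given as raw bytes or as a <hex> string literal."""
    if raw.startswith(b"<") and raw.endswith(b">"):
        raw = bytes.fromhex(raw[1:-1].decode("ascii"))
    if raw.startswith(b"\xfe\xff"):  # UTF-16BE byte-order mark
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")  # simplification of PDFDocEncoding


def vap_findings(field: dict) -> list:
    """Return value/appearance findings for one toy form field."""
    findings = []
    if field.get("need_appearances"):
        findings.append("need-appearances")
    if field.get("type") == "checkbox":
        # Stored state (/V) should match the appearance state (/AS).
        if field.get("V") != field.get("AS"):
            findings.append("checkbox-v-as-mismatch")
        return findings
    value = decode_pdf_string(field.get("V", b""))
    appearance = field.get("AP_text")  # text already decoded from the /AP stream
    if appearance is None:
        findings.append("missing-appearance")
    elif value and not appearance.strip():
        findings.append("blank-appearance")
    elif appearance.strip() != value.strip():
        findings.append("value-appearance-divergence")
    return findings


# The stored value says 1200 while the rendered appearance shows 12000.
print(vap_findings({"type": "text", "V": b"1200", "AP_text": "12000"}))
```

Decoding before comparing matters: a hex-encoded UTF-16 value and a plain-text appearance that show the same digits should compare equal, otherwise every such form would be a false positive.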

The dedicated validation study measured this directly: 9 of 9 hand-crafted V/AP positives detected — including evasion variants using hex-encoded values, Unicode confusables, and font-encoding remaps — with 0 of 187 false positives (0.00%) across the benign validation set of 44 IRS forms, agency XFA forms, federal publications, academic papers and adversarial files. Signed documents are graded on signature coverage: the scanner models DocMDP permission levels, so a P=2 certified form that a recipient legitimately fills and re-saves is recognised as permitted form-filling, not flagged as a shadow-document attack — while a genuine post-signature execution vector still escalates. At the full 1,572-file scale here the result holds and strengthens: the deception axis fired on 0 of 86 genuinely-benign real documents (clean publications, 46 real government forms including complex multi-signature XFA forms, and PDF 2.0 examples) and on 0 of 399 live malware samples — malware is caught on the threat axis by other signals, not mistaken for a value/appearance problem. No legitimate document in the curated corpus was graded into the malware band.

Origin study: PDF Form Security — V/AP, DocMDP, FieldMDP.

Reality Drift: When the Rendered Page and the Extracted Text Disagree

A PDF can show one thing to a human eye and hand a different string to any program that extracts its text. The scanner detects the full family of these reality-drift mechanisms: ToUnicode CMap remapping (the glyph drawn is not the character extracted), OCR text-layer mismatch (a scanned image with a selectable-text overlay that disagrees with it), accessibility injection (/Alt and /ActualText strings that a screen reader or an AI ingests but no reader sees), reading-order ambiguity in multi-column layouts, homoglyph and right-to-left-override spoofing, and metadata desynchronisation between the document info dictionary and XMP.

The full taxonomy is thirteen structural drift vectors, each a place where the rendered reality and the machine reality can be made to disagree:

Incremental update chains (“ghost revisions”)
Object-stream compression cloaking
Optional content groups — hidden layers (OCG)
Rendering-time logic (JavaScript, actions, triggers)
Embedded files and nested containers
Alternate representations — dual-reality PDFs
Font-level semantic attacks (ToUnicode remapping)
Spatial ambiguity and reading-order collapse
Metadata desynchronisation
Malformed-but-tolerated structures (parser differential)
Accessibility trees as hidden semantic channels
Embedded OCR lies (hidden text-layer poisoning)
PDF as a polyglot container

These findings land on the integrity axis, not the malware axis — a document whose extracted text contradicts its rendered appearance is a content-integrity problem even when it carries no exploit. The dedicated prevalence study (182 documents) found reading-order ambiguity in 43 of 44 IRS tax forms (98%) and all of the academic papers and government publications sampled, while 0 of 103 adversarial proof-of-concept files triggered any drift vector — confirming these vectors describe everyday structural ambiguity in real documents, not an attack signature. At the full 1,572-file scale here the pattern holds and sharpens: reading-order ambiguity fired on 45 of 46 government forms (98%) and on all 34 academic and legislative publications, accessibility structure was present in all 46 government forms, and once again 0 of 103 adversarial proof-of-concept files triggered any drift vector. Reality drift is pervasive in real documents and absent in adversarial ones — the opposite of an attack signature, and exactly why it belongs on the integrity axis. The scanner reports each drift vector per page so an analyst can see exactly where the rendered reality and the machine reality diverge.

Origin studies: PDF Reality Drift and PDF Semantic Determinism.

Parser Disagreement

Six production parsers do not return the same account of the same file. They disagree on page count, JavaScript presence, encryption status, and whether an AcroForm exists — because the PDF specification leaves those behaviours underspecified. The scanner runs all six on every upload and reports the disagreements directly: a JavaScript-visibility discrepancy (one parser sees the script, five do not) is itself a strong signal, and is one of the ways hidden-JavaScript payloads are caught.

This is not a rare condition. Across the 1,572-file corpus the six parsers diverged on 502 files — roughly one in three — on page count, JavaScript visibility, encryption status, or AcroForm presence. Differential parsing is reported on the informational axis when the divergence is benign (a version or object-count disagreement) and escalated when it coincides with an execution vector — so a parser split that hides a script is treated as the threat it is, while a harmless structural quirk is not.
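The routing described here can be sketched as a comparison of per-parser reports followed by a gate. This is a minimal illustration under stated assumptions: the parser names, the report fields, and the rule that any disagreement on a `javascript` field counts as coinciding with an execution vector are all hypothetical, not the scanner's actual schema or logic.

```python
def disagreements(reports: dict) -> dict:
    """Return every field on which the parsers did not all report the same value."""
    fields = set().union(*(r.keys() for r in reports.values()))
    out = {}
    for field in sorted(fields):
        values = {name: r[field] for name, r in reports.items() if field in r}
        if len(set(values.values())) > 1:
            out[field] = values
    return out


def axis_for(diffs: dict, execution_fields=("javascript",)) -> str:
    """Route a divergence: escalate only when it touches an execution-related field."""
    if not diffs:
        return "none"
    if any(field in diffs for field in execution_fields):
        return "escalated"
    return "informational"


# Hypothetical reports: the parsers agree on pages but split on script visibility.
reports = {
    "parser_a": {"page_count": 3, "javascript": True},
    "parser_b": {"page_count": 3, "javascript": False},
}
diffs = disagreements(reports)
axis = axis_for(diffs)
```

Keeping the comparison separate from the routing means a harmless version or object-count split is still recorded, but only raises the verdict when the gate says it matters.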

On the larger real-world control the rate is higher still. Across the 6,281-document GovDocs1 corpus (6,065 analysed), the six parsers disagreed on page count, JavaScript visibility, encryption status, or AcroForm presence in 43.5% of files (2,641); 69.6% (4,223) carried multi-column reading-order ambiguity; and 80.0% (4,850) carried at least one extraction-divergence vector. Four in five ordinary, benign documents contain a mechanism that can change what a single-parser extraction pipeline sees — the failure mode that matters most for RAG and training-data ingestion, where whichever reading one parser produces becomes the document the model indexes, retrieves, and learns from.

The security reading (hidden-JavaScript detection) is the narrow one. The broader finding is that roughly a third of the curated corpus, and 43.5% of the real-world control, can yield a materially different answer depending on which library reads the file, with no malware involved at all. That has consequences well beyond malware scanning:

Forensics & e-discovery — two tools can extract different page counts or text from the same exhibit, so “what the document says” depends on the tool of record.
Digital-signature validation — if parsers disagree on what content or how many revisions a file contains, they can disagree on what a signature actually covers.
Compliance & archiving — a PDF/A or retention pipeline that validates with one parser and renders with another can certify content a reader never sees.
Document indexing & AI ingestion — a single-parser extraction layer indexes whichever account it happened to read as authoritative.

A 32% disagreement rate is the kind of structural-integrity problem that exists independent of any attacker.

The government-forms result, in detail

The sharpest case is the real government forms: all 46 of 46 produced a parser disagreement, and it was the same two divergences every time. Each form stores its AcroForm and JavaScript inside compressed object streams (/ObjStm), which the six parsers resolve differently:

Divergence (all 46 forms)	What the parsers report	Consequence
AcroForm visibility	`MuPDF=none, Poppler=AcroForm, pdfminer=AcroForm`	A MuPDF-based tool reports no interactive form on a real fillable government form
JavaScript visibility	`MuPDF=none, Poppler=JS, Ghostscript=none, pdfminer=JS, pdfjs=none`	Of the parsers that read JavaScript, three see none and two do — the form’s field logic is invisible to some tools (qpdf, the sixth, checks structure rather than JS text)

This is not random noise — it is a single, reproducible structural pattern (compressed-object-stream resolution) that makes a third of real-world PDFs, and every government form in this corpus, read differently depending on the library. The per-form divergences for all 46 are published in parser-disagreement-gov-forms.json (and .csv), so the result can be inspected form by form and reproduced against the named parsers.
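Why object streams produce this kind of split can be shown on a synthetic fragment. The sketch below makes no claim about how any named parser works internally; it only shows that a key stored inside a Flate-compressed stream is invisible to anything that inspects the undecoded bytes. The fragment is hand-built for illustration and is not a valid PDF.

```python
import re
import zlib

OBJSTM = re.compile(rb"/Type\s*/ObjStm")
KEYS = (b"/AcroForm", b"/JavaScript")


def raw_scan(data: bytes) -> dict:
    """Look for form and script keys in the bytes exactly as given, without decoding."""
    return {
        "has_object_streams": OBJSTM.search(data) is not None,
        "visible_keys": [k.decode() for k in KEYS if k in data],
    }


# Illustrative payload: catalog entries that would live inside an object stream.
payload = b"<< /Type /Catalog /AcroForm 5 0 R /Names << /JavaScript 6 0 R >> >>"
hidden = zlib.compress(payload)
fragment = (
    b"1 0 obj << /Type /ObjStm /Filter /FlateDecode >> stream\n"
    + hidden
    + b"\nendstream endobj"
)

before = raw_scan(fragment)               # undecoded view
after = raw_scan(zlib.decompress(hidden)) # view after inflating the stream
```

A tool that stops at the undecoded view and one that inflates the stream will give different answers about the same bytes, which is the shape of the divergence in the table above.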

Origin study: Parser Disagreement — Six Parsers, Eleven Divergences.

What This Means for AI Ingestion

Value/appearance divergence, reality drift, and parser disagreement do not stop at the human reviewer — they follow the document into RAG pipelines and LLM training corpora. A single-parser extraction layer implicitly assumes there is one canonical truth inside every PDF; for documents with structural ambiguity, there is not, and the wrong value enters the knowledge base as authoritative fact with no rejection signal. The same checks that make this scanner a forensic tool — multi-parser extraction, V/AP divergence detection, reality-drift analysis — are exactly the pre-ingestion checks an AI pipeline needs to avoid silently indexing content no human ever saw.
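A pre-ingestion check of this kind can be sketched as a consensus gate over the text returned by several extractors. The extractor names and the accept/reject policy below are assumptions made for illustration; real pipelines would likely tolerate small differences rather than require identical fingerprints.

```python
import hashlib
import unicodedata


def fingerprint(text: str) -> str:
    """Hash the text after Unicode and whitespace normalisation."""
    normalised = " ".join(unicodedata.normalize("NFKC", text).split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def ingestion_gate(extractions: dict) -> dict:
    """Accept a document only when every extractor produced the same normalised text."""
    prints = {name: fingerprint(text) for name, text in extractions.items()}
    return {"accept": len(set(prints.values())) == 1, "fingerprints": prints}


# Hypothetical outputs from two extractors reading the same file.
same = ingestion_gate({"extractor_a": "Total:  100\n", "extractor_b": "Total: 100"})
split = ingestion_gate({"extractor_a": "Total: 100", "extractor_b": "Total: 900"})
```

The point of the gate is the rejection signal: when the extractors split, the document is held back for review instead of one reading being indexed as fact.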

Origin study: PDF Structural Problems in AI Ingestion Pipelines.

Capability Is Not Threat: The Multi-Axis Verdict in Practice

The hardest job for a forensic scanner is not catching obvious malware; it is staying quiet on legitimate documents that happen to be complex. The multi-axis verdict is what makes that distinction, and the corpus shows it working in both directions:

Document	What it contains	Verdict
Live malware sample	Exploit pattern, auto-exec payload	Dangerous — threat axis
Real IRS / XFA tax form	JavaScript, XFA, embedded files	Low — capability, no threat
Signed government XFA form (filled)	Hundreds of form scripts, multiple signatures covering part of the file	Suspicious — integrity review, driver = integrity, not malware
arXiv paper / federal legislation	Standard publication structure	Clean
Form with value/appearance mismatch	`/V` ≠ rendered `/AP`	Flagged — deception axis

A document’s scripting, forms, and embedded files are reported as capability, never mistaken for malware on their own. CVE attribution requires the actual exploit evidence, not merely the structure a vulnerable feature uses, so a dynamic XFA form is not labelled a heap-overflow exploit merely because it contains the XFA scripting API. The result across the corpus: malware is caught, complex legitimate documents are graded accurately, and not one of the 86 curated benign documents (clean publications and real government forms) was graded into the malware band. That result held up, and was then stress-tested far harder, against the 6,281-document real-world control described in the real-world control section.
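The separation between capability and verdict can be sketched as a small data structure. This is an illustrative model only: the class, the axis names passed in, and the rule that the overall band is the highest band on any axis are assumptions, and the per-axis bands are taken as inputs rather than computed from the scanner's real scoring weights.

```python
from dataclasses import dataclass, field

# Band names as used in this article, ordered from least to most severe.
BANDS = ["clean", "low", "suspicious", "high-risk", "dangerous"]


@dataclass
class Assessment:
    axis_bands: dict                      # e.g. {"threat": "clean", "integrity": "suspicious"}
    capabilities: list = field(default_factory=list)

    def verdict(self) -> dict:
        """Overall band is the worst axis band; capabilities are listed, never scored."""
        driver = max(self.axis_bands, key=lambda a: BANDS.index(self.axis_bands[a]))
        return {
            "band": self.axis_bands[driver],
            "driver": driver,
            "capabilities": list(self.capabilities),
        }


# Modelled on the signed-form row of the table: many scripts, but an integrity driver.
signed_form = Assessment(
    axis_bands={"threat": "clean", "integrity": "suspicious"},
    capabilities=["JavaScript", "XFA", "signatures"],
).verdict()
```

Because `capabilities` never feeds into the band, adding more scripts or embedded files to the list cannot on its own move a document toward the malware band.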

A Verdict on Every File

A forensic scanner must never be silenced by a hostile upload. Decompression bombs, pathological object graphs, and files engineered to hang a parser are treated as a finding, not an error: the scanner bounds its own analysis and, when a document cannot be fully analysed within budget, returns a graceful “analysis could not complete — file resisted full analysis” verdict graded at least suspicious, with the partial findings from every engine that completed. Across all 1,572 files, every upload received a verdict; the only documents that could not be fully analysed were genuine resource bombs, which are exactly the documents a scanner should flag rather than choke on.

Behaviour Over Signatures: an Architectural Distinction

Signature-based antivirus answers one question — “have we seen this before?” — and answers it well: against catalogued threats, multi-engine AV is mature and effective. This scanner answers a different question — “what does this PDF actually do, and do its realities agree?” — using behavioural execution, structural differential analysis, and content-integrity checks that are designed to work without a prior signature. The two are complementary, not interchangeable: one recognises known-bad bytes, the other reasons about structure and behaviour. A caveat we make explicit elsewhere applies here too — on this corpus every malware sample also matched threat intelligence, so the corpus shows the analytical engines contributing alongside hash lookup, not replacing it; isolating behaviour-only detection would need novel samples this corpus does not contain.

This is an architectural comparison of what each approach analyses — not a controlled benchmark. We did not run VirusTotal, MetaDefender or any other product against this corpus, so the table below is illustrative of design, not a measured head-to-head. The one experimental result we do report is on this scanner alone: complete separation between the live-malware and benign-control sets (see the detection statistics above).

What each approach is designed to analyse (architectural, not measured):
Dimension	Signature-based AV (general)	This scanner (by design)
Core question	Has this been seen before?	What does this file do?
PDF structural analysis	Scans bytes, not PDF structure	Static engines on xref / objects / streams
Behavioural sandbox	General, not PDF-specific	Six PDF renderers, isolated namespaces, syscall capture
Content integrity (V/AP, drift)	Not in scope	Value/appearance, reading-order, OCR, accessibility, ToUnicode

Privacy. Every file in this study, and every file submitted to the public tool, is analysed on the server and deleted immediately afterwards — no account, no file retention, no third-party upload. A forensic scan should not itself become a data-exposure event.

Scope & Limitations

This is a measurement on a specific corpus, not a universal benchmark. Five limitations bound how far the results generalise, and we state them plainly:

The ROC negative class is small — but the false-positive rate is not. The AUC of 1.000 is computed against only 86 curated benign-control documents (46 government forms, 34 publications, 6 PDF 2.0 files); perfect separation on 86 negatives is a clean result on that set, not evidence of perfect separation in production. The false-positive rate, however, is now measured against a far larger and more varied population: the 6,281-document GovDocs1 benign control (the real-world control), where the scanner flags 3.3% suspicious-or-higher and, after removing the real malware the corpus carries, has a 0.34% false-positive rate on genuinely-benign documents — not zero, but a production-facing number a small curated control could never have produced. That is why a large benign control matters.
Most files do not participate in the classification metrics. The TPR/FPR/precision/AUC figures use 399 malware positives and 86 benign negatives — 485 of the 1,572 files. The 950 pdf.js files and 103 corkami adversarial files are deliberately excluded from those statistics because they mix benign and malicious content and cannot be cleanly labelled. They inform the prevalence and behavioural sections, but readers should not assume all 1,572 files contributed equally to the detection numbers.
There is no comparison baseline. We did not run pdfid, peepdf, pdf-parser, or any commercial scanner against this corpus. The study supports “this scanner performs well on this corpus,” not “it performs better than tool X” — that claim would require a controlled side-by-side we have not done.
MalwareBazaar bias — and we can quantify it. The 400 malware samples are drawn from a public repository of already-known, already-clustered files, and the per-engine data proves it: the threat-intelligence hash lookup matched 100% of samples. Every file was already catalogued, so a hash match alone would flag the whole set and the corpus cannot separate detection-by-prior-knowledge from detection-by-analysis. The analytical engines each fire on a subset (so they do contribute), but a clean test of behaviour-only detection — novel PDFs, private or targeted campaigns, or zero-days with no threat-intel hit — is one this corpus does not contain. The result shows reliable flagging of known-bad through redundant paths, not exhaustive zero-day detection.
Snapshot in time. The malware set is one feed pulled on one day; a different feed or date would give different files.
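The arithmetic behind the first two limitations can be checked directly. The sketch below recomputes the quoted shares from the counts given in this article and adds one thing the article does not state: a standard exact one-sided upper bound on a rate when zero events are observed, used here to show how little 0 of 86 constrains the production false-positive rate. The function names are illustrative.

```python
def rate(hits: int, total: int) -> float:
    """Plain proportion."""
    return hits / total


def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on a rate when 0 events were seen in n trials."""
    return 1 - (1 - confidence) ** (1 / n)


# Counts taken from the text above.
govdocs_fpr = rate(20, 5899)               # false positives on genuinely benign GovDocs1 files
classified_share = rate(399 + 86, 1572)    # files that enter the TPR/FPR/AUC metrics
curated_fpr_bound = zero_event_upper_bound(86)  # what 0 of 86 can and cannot rule out
```

With 86 negatives and no false positives, the 95% bound is about 3.4%, so the curated control alone cannot distinguish a near-zero rate from one roughly ten times the measured 0.34%; the larger control is what narrows it.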

Live malware binaries are kept private and are not redistributed (their SHA-256 hashes are published so each can be retrieved from MalwareBazaar); the benign corpus manifest, the per-file results, and the analysis code are published in the Data & Code Availability section below, so every figure here can be recomputed and challenged.

Methodology & Reproducibility

The “47” is a coverage figure, not a measure of independence: it is 47 analysis passes over one shared parse of the file, of very different weights — some are major subsystems (the behavioural sandbox, the six-parser differential, the XFA parser), others are lightweight checks (a metadata field, an entropy reading). They group into six functional categories:

Category	Passes	Examples
Structural & parsing	10	structure validator, six-parser differential, xref-integrity graph, qpdf check, content/object-stream & trailer-chain forensics
Malware, exploit & behavioural	15	YARA, CVE matcher, ClamAV, behavioural sandbox, JS AST de-obfuscation & emulation, polyglot/embedded-binary, entropy topology, steganography, ML
Signatures, forms & document integrity	8	signature forensics, DocMDP/FieldMDP, AcroForm field forensics, XFA/FormCalc, revision history, annotation & named-tree analysis
Content integrity / reality drift	7	value/appearance, reading-order, OCR-layer, accessibility-tree, ToUnicode, codec validation, compliance-fraud
Metadata & threat intelligence	6	metadata/ExifTool reconciliation, URL extraction, threat-intel hash lookup, phishing, campaign attribution
Correlation & scoring	1	severity fusion, execution-vector gating, multi-axis verdict
Total	47	analysis passes, not 47 independent products

The full corpus was scanned under a resource-governed scheduler so a single hostile file could neither exhaust the host nor skew the run: each scan ran in its own control group bounded to 80% of available memory with a fixed system reserve, concurrency scaled to free memory, and a progress-watchdog terminated any scan stuck on one engine. Two engine safeguards mean every file produces a verdict rather than a crash: a resource-bomb guard that detects decompression and pixel bombs by bounded decode (so content-heavy engines never materialise a multi-hundred-MB buffer), and a fallback that converts any residual hang or out-of-memory kill into a graceful “file defeated analysis” verdict. Across all 1,572 files the only documents that could not be fully analysed were genuine resource bombs, which are flagged as such.
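Bounded decode, the idea behind the resource-bomb guard, can be sketched for Flate streams with Python's standard `zlib` module. This is an illustration of the technique rather than the scanner's code; the function name and the choice of limit are assumptions.

```python
import zlib


def bounded_inflate(data: bytes, limit: int) -> tuple:
    """Inflate at most `limit` bytes and report whether the stream finished.

    A stream that has not reached its end within the limit is either a
    decompression bomb or truncated; both are worth flagging.
    """
    if limit <= 0:
        raise ValueError("limit must be positive (0 would mean unlimited in zlib)")
    inflater = zlib.decompressobj()
    out = inflater.decompress(data, limit)  # never returns more than `limit` bytes
    return out, inflater.eof


# A small input that expands to 10 MB of zeros, checked against a 1 MB budget.
bomb = zlib.compress(b"\0" * 10_000_000)
partial, finished = bounded_inflate(bomb, 1_000_000)

small, small_finished = bounded_inflate(zlib.compress(b"hello"), 1_000_000)
```

The guard never allocates more than the budget, so the decision to flag the file is made before the oversized buffer could exist.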

The corpus is reproducible from public sources: the Mozilla pdf.js test set, the corkami adversarial PDFs, the PDF Association PDF 2.0 examples, GovInfo federal publications, arXiv papers, and real IRS/agency forms; live malware was pulled from MalwareBazaar by SHA-256. The per-file results, the corpus manifest, the malware hash list, and the scripts are published in the data bundle below.

Data & Code Availability

The full reproducibility bundle is published, not held on request:

Artifact	Contents
per-file-results.jsonl	One record per file for all 1,572 PDFs: multi-axis scores (threat / deception / structural), verdict band, and per-domain detection flags (V/AP, parser-disagreement, reading-order, accessibility, signed, JavaScript, XFA)
malware-sha256.txt	SHA-256 of all 400 live malware samples — binaries are not redistributed, but each is retrievable from MalwareBazaar by hash
malware-detection-breakdown.csv	Per-sample result keyed by SHA-256: threat score, verdict band and driver for every one of the 400 malware samples — join to the hash list, retrieve from MalwareBazaar, and re-scan to verify each score independently
benign-corpus-manifest.txt	Exact file list of the benign and adversarial corpus, grouped by source (pdf.js, corkami, PDF 2.0, GovInfo, arXiv, IRS/agency forms, V/AP and hidden-JS fixtures)
govdocs1-benign-control-results.jsonl	One record per file for all 6,281 GovDocs1 real-world benign-control PDFs (6,065 analysed): public Digital Corpora path (`govdocs1/NNN/NNNNNN.pdf`), corrected verdict band & driver, multi-axis scores, high/critical engines, and which hardening pass produced the verdict. Re-fetch any path from Digital Corpora and re-scan to verify
govdocs1-verdict-summary.csv	The scanner's verdict-band distribution over the 6,065-PDF benign control (96.7% clean/low, 3.3% suspicious+) and the breakdown of the flagged set into real malware, genuine integrity findings and the 0.34% false-positive residual
novel-malware-analysis-detection.csv	402 MalwareBazaar PDFs outside our corpus, scored with hash/reputation engines disabled: per sample, execution-vector presence and whether pure analysis raised a high/critical (or any) finding — the detection-by-analysis measurement (93% on exec-bearing exploits)
novel-malware-sha256.txt	SHA-256 of the 402 novel samples — binaries not redistributed; retrievable from MalwareBazaar by hash
scan_harness.py	Resumable, resource-bounded batch scan harness
scan_governor.py	Adaptive resource governor (80%-available memory, fixed OS reserve, memory-scaled concurrency, progress-stall watchdog)
analyze.py	Per-domain aggregation over the results JSONL — reproduces the numbers in this article
README.md	Bundle index and reproduction steps

The scoring weights and verdict bands needed to interpret the result records are documented in the verdict and methodology sections above. The scanner itself is the same engine the free public tool runs, so any file in the manifest can be re-scanned directly.
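When re-fetching a sample for re-scanning, it is worth confirming that the downloaded file matches a published hash first. The sketch below assumes, without confirmation from the bundle, that `malware-sha256.txt` holds one hexadecimal digest per line; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large samples are never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_hash_list(path) -> set:
    """Read a hash list, assuming one hex digest per line (format is an assumption)."""
    lines = Path(path).read_text().splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def is_listed(sample_path, hash_list_path) -> bool:
    """True when the sample's SHA-256 appears in the published list."""
    return sha256_of(sample_path) in load_hash_list(hash_list_path)
```

Live samples should only be handled in an isolated environment; hashing a file does not execute it, but opening it in a viewer might.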

References

ISO 32000-1:2008 — Portable Document Format, Part 1. AcroForm field values (§12.7), appearance streams, filters (RunLengthDecode, FlateDecode, DCTDecode, JPXDecode). (specification)
ISO 32000-2:2020 — PDF 2.0. DocMDP / MDP transform permission levels (§12.8.2.2), /NeedAppearances, ToUnicode CMaps. (specification)
abuse.ch — MalwareBazaar malicious-sample repository (PDF-tagged samples, retrieved by SHA-256). bazaar.abuse.ch (threat-intel feed)
Mozilla pdf.js test corpus — real-world and regression PDFs (fonts, CID/Type3, encodings, malformed and exploit test files). github.com/mozilla/pdf.js (public corpus)
corkami / Ange Albertini — PDF format-edge and polyglot proof-of-concept files. (public corpus)
Mladenov, V., Mainka, C., Rohlmann, S., Schwenk, J. — “Shadow Attacks: Hiding and Replacing Content in Signed PDFs,” NDSS 2021; and the PDF Insecurity series. (peer-reviewed)
CVE-2021-21017, CVE-2024-45112 (Acrobat XFA/AcroForm); CVE-2010-1240 (/Launch + embedded file) — pattern references used by the YARA rule sets. (NVD / Adobe advisories)
Parser & analysis tooling: MuPDF, Poppler, Ghostscript, qpdf, pdfminer.six, pdf.js, Tesseract OCR, YARA. (open-source tools)

Origin Research

This page carries the current numbers. The pages below first characterised each mechanism — useful for the construction detail and methodology behind a specific vector — but they were measured on smaller corpora that predate this study, so where a figure differs, the value on this page is the current one:

PDF Malware Scanner — the engines and what each detects
PDF Form Security — value/appearance divergence, DocMDP, FieldMDP construction
PDF Reality Drift — rendered glyph vs extracted text
PDF Semantic Determinism — one file, different realities to human, parser, and LLM
Parser Disagreement — the eleven hand-built divergence cases
PDF AI Ingestion Pipelines — how these problems enter RAG and training corpora
