Executive Summary
Reducing a PDF to a single malware verdict and a single score leaves out most of what matters about it. A PDF is a rendering format, not a statement of fact: the value a human sees, the value a parser extracts, and the value a signature covers can all differ inside the same file. This study measures a scanner built around that reality, scoring every document on three independent axes — threat (can it execute an exploit), integrity (does it deceive: value ≠ appearance, uncovered signature, reality drift), and capability (does it merely contain script, forms, or embedded files) — across a large, deliberately diverse corpus.
- Corpus. 1,572 PDFs across eight domains, including 400 live malware samples from MalwareBazaar, 950 Mozilla pdf.js real-world files, real government AcroForm/XFA forms, academic and legislative publications, adversarial proof-of-concept files, and hand-crafted fixtures.
- Malware. Of the 400 live samples, 399 completed analysis and every one was classified high-risk or dangerous (median threat score 203); the remaining sample was a decompression bomb, flagged as resisting analysis rather than scored. No sample was graded clean or low.
- False positives. None of the 86 genuinely-benign real documents (clean publications, 46 real government forms including complex multi-signature XFA forms, and PDF 2.0 examples) was graded into the malware band — even though those forms carry signatures, JavaScript and XFA. Capability was reported as capability, not mistaken for threat.
- Parser disagreement (arguably the most consequential result). The six parsers diverged on 502 of 1,572 files (roughly one in three) on page count, JavaScript visibility, encryption status, or AcroForm presence. This is not only a security finding: it means a third of real-world PDFs can yield a different answer depending on which library reads them, with implications for forensics, compliance, e-discovery, digital-signature validation, document indexing and AI ingestion, even when no malware is involved.
- AI ingestion. These conflicting realities do not stop at the human reviewer: value/appearance divergence, reality drift and parser disagreement follow a PDF into RAG pipelines and training corpora, where a single-parser extraction layer indexes whichever reality it happened to read as authoritative fact.
A note on reading these numbers: the detection and false-positive figures are clean on this corpus, but the benign control is small (86 documents) and the malware is a snapshot of already-known samples, every one of which also matched a threat-intelligence hash. They show that the scanner flags known-bad PDFs through several redundant paths and does not alarm on complex legitimate documents; they do not show that detection is signature-independent or perfect in production. The Scope & Limitations section states exactly how far each result generalises.
- Executive summary
- The corpus: 1,572 PDFs across eight domains
- How the scanner reaches a verdict
- Results across the corpus
- Malware & exploit detection
- Forms, signatures & value/appearance divergence
- Reality drift: rendered page vs extracted text
- Parser disagreement
- What this means for AI ingestion
- Capability is not threat: the multi-axis verdict
- A verdict on every file
- Behaviour over signatures
- Scope & limitations
- Methodology & reproducibility
- Data & code availability
- References
The Corpus: 1,572 PDFs Across Eight Domains
A scanner is only as credible as the corpus it is measured against. Rather than a set of files built to demonstrate a feature, this study uses a large, diverse population that is mostly not ours — real documents, real malware, and adversarial files written by other people for other purposes.
| Domain | Files | Source |
|---|---|---|
| Real-world structural | 950 | Mozilla pdf.js corpus — fonts, CID/Type3, encodings, broken headers, annotations, JavaScript, and deliberately-malformed exploit test files |
| Live malware | 400 | MalwareBazaar (abuse.ch) samples tagged PDF — real malicious documents, analysed statically |
| Adversarial PoC | 103 | corkami / structural edge cases — truncated xrefs, version mismatches, orphan objects, encoding tricks |
| Government forms | 46 | Real IRS / US agency AcroForm and XFA forms (W-9, 1040, 941, VA-10091, …) |
| Clean publications | 34 | arXiv papers (2023–2025) and GovInfo federal legislation |
| Value/appearance divergence | 28 | Hand-crafted V/AP and evasion fixtures |
| PDF 2.0 features | 6 | PDF Association PDF 2.0 example files |
| Hidden JavaScript | 5 | Names-tree, orphan/sleeper, ObjStm, OpenAction, annotation-/AA fixtures |
| Total | 1,572 | Eight domains, benign through actively malicious |
Two blocks anchor the measurement. The live malware set is the detection control — 400 known-bad documents, so any one scoring clean is a miss. The clean publications and real government forms are the false-positive control — legitimate documents, several of them complex (XFA forms, digital signatures, embedded JavaScript), so any one graded as malware is an over-call. The pdf.js block sits between them: Mozilla maintains it precisely because it contains pathological and exploit test files, so a high score there is frequently correct.
How the Scanner Reaches a Verdict
Every upload runs through a single 47-module pipeline — structural validation, differential multi-parser analysis, a behavioural sandbox, YARA, entropy and steganography analysis, signature and DocMDP forensics, JavaScript and XFA analysis, and the content-integrity engines (value/appearance divergence, reading-order, OCR-layer, ToUnicode, accessibility-tree). To be precise about the number: these are 47 analysis passes over a shared parse of the file — not 47 independent detection products, and not 47 separate renderers (the six renderers are one of the passes). The count describes coverage, not redundancy.
Their findings feed a multi-axis verdict. Rather than collapsing everything into one number, the scanner scores three independent axes:
- Threat — exploitation and execution: shellcode patterns, CVE signatures, auto-executing JavaScript that performs dangerous operations, launch actions, attack chains.
- Integrity / deception — the document asserting a different reality to a human than to a machine: value/appearance divergence, signature coverage gaps, reality drift between the rendered page and the extracted text.
- Capability (structural) — neutral facts about what the document can do: it contains a form, JavaScript, an embedded file, XFA. These are reported but do not, by themselves, drive a malware verdict.
This separation is the difference between a scanner that cries wolf and one that is useful. A legitimate tax form contains JavaScript and XFA — real capability — but no threat; the multi-axis verdict keeps it off the malware band while still surfacing what it contains. The sections below report how each axis performed across the corpus.
Each indicator carries a severity that contributes weighted points to its axis (critical = 50, high = 25, medium = 10, low = 3, capped at three occurrences per indicator). The threat axis determines the headline band:
| Threat score | Band | Meaning |
|---|---|---|
| 0 | Clean | No threat-axis indicators |
| 1–29 | Low | Capability or weak signals only |
| 30–149 | Suspicious | Worth review |
| 150–349 | High-risk | Strong threat indicators |
| 350+ | Dangerous | Strong evidence of an exploit chain or active malicious behaviour |
These are weighted heuristic aggregates, not CVSS-equivalent severities.
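The weighting and banding described above can be sketched in a few lines. This is a minimal illustration, not the scanner's code: the severity weights, the cap of three occurrences and the band floors come from the text, while the `(indicator, severity)` input shape, the function names and the indicator labels in the example are assumptions made for the sketch.

```python
from collections import Counter

# Severity weights and the per-indicator cap, as stated in the text.
WEIGHTS = {"critical": 50, "high": 25, "medium": 10, "low": 3}
MAX_OCCURRENCES = 3

# Lower bound of each band, highest first, taken from the band table.
BANDS = [(350, "Dangerous"), (150, "High-risk"), (30, "Suspicious"),
         (1, "Low"), (0, "Clean")]


def axis_score(findings):
    """Sum weighted points for (indicator, severity) pairs on one axis.

    Assumption: repeats of the same (indicator, severity) pair count as
    occurrences of one indicator and are capped at MAX_OCCURRENCES.
    """
    counts = Counter(findings)
    return sum(WEIGHTS[sev] * min(n, MAX_OCCURRENCES)
               for (_, sev), n in counts.items())


def threat_band(score):
    """Map a threat-axis score to its headline band."""
    for floor, name in BANDS:
        if score >= floor:
            return name
    raise ValueError("score must be non-negative")


# Hypothetical indicator names, for illustration only.
example = [("heap_spray_pattern", "critical")] * 5 + [("suspicious_uri", "low")]
example_score = axis_score(example)       # 3 capped criticals + 1 low
example_band = threat_band(example_score)
```

Under these assumptions, three or more occurrences of a single critical indicator contribute 150 points, which is exactly the High-risk floor.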
Execution-vector gating: why complexity is not threat
The weights and bands are only half the verdict. The correlation engine applies a hard
gate: a document that cannot execute code — no JavaScript, no
/Launch, no embedded executable, no /RichMedia, and no
behavioural-sandbox signal — is forced to the low band regardless of its raw
score. Without an execution vector, a high feature count reflects document
complexity, not threat capability. Pattern-only findings (YARA and stream-inspector hits)
are additionally downgraded from critical/high to medium when no execution vector is
present. This single rule is the main reason a feature-rich legitimate document — a
signed XFA tax form with hundreds of scripts — does not appear malicious, and it is
why the dangerous band carries a second gate: it requires a confirmed execution vector or
attack chain, not merely a high aggregate. Conversely, the correlation engine raises
severity when independent engines agree (a multi-engine consensus bonus) and when a
divergence coincides with an execution vector — so a parser split that hides a script
is escalated while a harmless structural quirk is not.
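One way to read the two gates and the pattern-only downgrade is sketched below. The field names, the engine labels `yara` and `stream_inspector`, and the decision to leave a Clean result as Clean rather than raise it to Low are assumptions of this sketch; the text specifies only the rules themselves. The consensus bonus is not modelled.

```python
from dataclasses import dataclass

BAND_ORDER = ["Clean", "Low", "Suspicious", "High-risk", "Dangerous"]
PATTERN_ONLY_SOURCES = {"yara", "stream_inspector"}  # hypothetical labels


@dataclass
class Evidence:
    javascript: bool = False
    launch_action: bool = False
    embedded_executable: bool = False
    rich_media: bool = False
    sandbox_signal: bool = False
    confirmed_chain: bool = False  # confirmed execution vector or attack chain

    @property
    def has_execution_vector(self):
        return any((self.javascript, self.launch_action,
                    self.embedded_executable, self.rich_media,
                    self.sandbox_signal))


def effective_severity(source, severity, ev):
    """Downgrade pattern-only critical/high hits when nothing can execute."""
    if (source in PATTERN_ONLY_SOURCES and not ev.has_execution_vector
            and severity in ("critical", "high")):
        return "medium"
    return severity


def gated_band(raw_band, ev):
    """Apply the execution-vector gate, then the Dangerous-band gate."""
    if not ev.has_execution_vector:
        # Gate 1: no execution vector, so the band is capped at Low.
        return min(raw_band, "Low", key=BAND_ORDER.index)
    if raw_band == "Dangerous" and not ev.confirmed_chain:
        # Gate 2: a high aggregate alone does not reach Dangerous.
        return "High-risk"
    return raw_band
```

For example, `gated_band("Dangerous", Evidence())` yields `"Low"` in this sketch, because the document has no way to execute code regardless of how many features it carries.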
Results Across the Corpus
The headline result is the threat-band distribution by domain. The two control blocks behave as a working detector should: every live-malware sample lands in high-risk or dangerous, and no clean publication, government form or PDF 2.0 file reaches high-risk or above. Two control documents (one publication and one government form) land in suspicious; both are discussed in the notes after the table. The pdf.js block spreads across every band because it is itself a mix of benign and deliberately-malicious test files.
| Domain | n | Clean | Low | Suspicious | High-risk | Dangerous |
|---|---|---|---|---|---|---|
| Live malware | 399 | 0 | 0 | 0 | 347 | 52 |
| pdf.js real-world | 949 | 550 | 270 | 109 | 17 | 3 |
| Adversarial PoC | 103 | 2 | 30 | 37 | 23 | 11 |
| Government forms | 46 | 0 | 45 | 1 | 0 | 0 |
| Clean publications | 34 | 14 | 19 | 1 | 0 | 0 |
| V/AP fixtures | 28 | 24 | 1 | 0 | 3 | 0 |
| PDF 2.0 examples | 6 | 6 | 0 | 0 | 0 | 0 |
| Hidden-JavaScript | 5 | 0 | 0 | 0 | 3 | 2 |
Two notes on the apparent “high” counts in the real-world and adversarial blocks. The 20 pdf.js files at high-risk or above are predominantly genuine exploit and fuzzer test files Mozilla ships on purpose — correct detections, not false positives. The adversarial block scores high because the corkami set is, by construction, malformed and structurally hostile; a forensic scanner is meant to react to it. The single suspicious clean-publication and the single suspicious government form are integrity-axis observations (a structural anomaly worth a glance), not malware calls — no file in either control block reached the threat-axis malware band.
Per-domain capability & detection signals
The same corpus, viewed by which signals fired in how many files of each domain. This is the at-scale evidence behind the per-mechanism sections that follow.
| Domain | n | V/AP | Parser disagree | Reading-order | Accessibility tree | Signed | JavaScript | XFA |
|---|---|---|---|---|---|---|---|---|
| Live malware | 399 | 0 | 131 | 55 | 69 | 0 | 30 | 3 |
| pdf.js real-world | 949 | 21 | 217 | 44 | 87 | 8 | 16 | 29 |
| Adversarial PoC | 103 | 0 | 56 | 0 | 0 | 0 | 22 | 1 |
| Government forms | 46 | 0 | 46 | 45 | 46 | 46 | 0 | 45 |
| Clean publications | 34 | 0 | 25 | 34 | 3 | 4 | 0 | 0 |
| V/AP fixtures | 28 | 9 | 20 | 1 | 1 | 2 | 1 | 0 |
| Hidden-JavaScript | 5 | 0 | 5 | 0 | 0 | 0 | 5 | 0 |
Read across the government-forms row: all 46 are signed, 45 use XFA, all 46 trigger a parser disagreement, 45 show reading-order ambiguity, and all 46 carry an accessibility tree — these are genuinely complex documents — yet 0 fire a value/appearance divergence and 0 reach the malware band. That row is the whole thesis of the multi-axis verdict in one line: maximum capability, correctly graded as low threat.
Malware & Exploit Detection
The 400 live MalwareBazaar samples are the detection control. Of the 400, 399 completed full analysis and every one was classified high-risk or dangerous (median threat score 203); none scored clean or low. The remaining sample was a decompression bomb that exhausted the analysis budget — it was separately flagged as resisting analysis rather than scored (see A verdict on every file).
| Metric | Result |
|---|---|
| Samples that completed analysis | 399 / 400 (1 resource-bomb flagged separately) |
| Classified high-risk or dangerous | 399 / 399 analysed |
| Threat score range (min / median / mean / max) | 160 / 203 / 256 / 999 |
| Band split | 347 high-risk, 52 dangerous |
| Scored clean or low (missed) | 0 |
| Detection signals | YARA (CVE + heap-spray patterns), behavioural sandbox (network/exec/mmap), object & action analysis, XRef integrity, differential parsing |
Detection statistics
Treating the 399 analysed malware samples as positives and the 86 genuinely-benign real documents (clean publications, real government/XFA forms, PDF 2.0) as negatives gives a labelled two-class problem. The pdf.js block is excluded from this calculation because it deliberately mixes benign and malicious files and cannot be cleanly labelled. Malware scored 160–999; the benign control scored 0–38 — the two classes do not overlap.
| Metric | @ suspicious (≥30) | @ high-risk (≥150) |
|---|---|---|
| True-positive rate (recall) | 1.000 (95% CI 0.990–1.000) | 1.000 (0.990–1.000) |
| False-positive rate | 0.023 (0.006–0.081) | 0.000 (0.000–0.043) |
| Precision | 0.995 | 1.000 |
| F1 | 0.997 | 1.000 |
| ROC AUC | 1.000 — complete separation (malware min 160 vs benign max 38) | |
95% confidence intervals are Wilson score intervals on n = 399 positives and n = 86 negatives. An AUC of 1.000 reflects complete class separation on this corpus; with larger and more varied negatives the boundary cases would grow, and these intervals are the honest bound on how far the result generalises. Every malware score is published per SHA-256 in malware-detection-breakdown.csv, so the table above can be recomputed from the raw data.
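The table can be recomputed from four counts. The sketch below uses the standard Wilson score interval; the false-positive counts (2 at the suspicious threshold, 0 at high-risk) are read off the band table, where one government form and one publication land in suspicious. Function and variable names are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


positives, negatives = 399, 86
tp, fn = 399, 0              # every analysed malware sample scored >= 150
fp_suspicious, fp_high = 2, 0

tpr_ci = wilson_interval(tp, positives)
fpr_suspicious_ci = wilson_interval(fp_suspicious, negatives)
fpr_high_ci = wilson_interval(fp_high, negatives)

precision_suspicious = tp / (tp + fp_suspicious)
f1_suspicious = 2 * tp / (2 * tp + fp_suspicious + fn)
```

Rounded to three places, these reproduce the intervals and the precision in the table. The F1 at the suspicious threshold works out to 798/800 = 0.9975, which the table shows as 0.997.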
Per-engine contribution — and an honest caveat
Recording which engines raised a high or critical indicator on each of the 398 fully analysed samples shows how the verdict is reached — and exposes the corpus’s single biggest limitation in one number:
| Engine / signal | Fired on | % |
|---|---|---|
| Threat-intelligence hash match | 398 | 100% |
| Object analysis | 57 | 14% |
| YARA (CVE + heap-spray) | 38 | 10% |
| qpdf structural integrity | 38 | 10% |
| Campaign attribution | 38 | 10% |
| Pattern scanner | 34 | 9% |
| ClamAV signatures | 33 | 8% |
| Differential parsing | 30 | 8% |
| Embedded-file / polyglot | 19 / 18 | 5% |
| Structural, entropy, stego, JS, OCR, … (long tail) | ≤13 each | ≤3% |
The 100% threat-intelligence hit rate is the most important caveat in this paper. It means every sample was already catalogued — which is exactly what a MalwareBazaar corpus is. On this corpus the verdict is therefore over-determined: a known-bad hash alone would flag every file, so the study cannot cleanly separate detection-by-prior-knowledge from detection-by-analysis. The analytical engines (YARA, ClamAV, structural integrity, differential parsing, object and embedded-file analysis) each fire on a subset, demonstrating they contribute independent signal, but the clean test of signature-independent detection — novel samples with no threat-intel hit — is one this corpus does not contain. The honest reading of the malware result is: the scanner reliably flags known-bad PDFs through multiple redundant paths; whether it would catch a true zero-day on behaviour alone is a question this corpus cannot answer.
Architecturally, the redundancy still matters: a sample that hides JavaScript from one parser is surfaced by another, by the raw-byte pass, or by sandbox syscall capture — the hidden-JavaScript fixtures (payloads in a compressed Names tree, an orphan/sleeper object, or an object stream) are caught even when most structural parsers miss them. That is a property of the design, not a claim this corpus proves about novel malware.
The samples are kept private and not redistributed; per-sample scores are published by SHA-256 in the data bundle. Origin study: PDF Malware Scanner.
Forms, Signatures & Value/Appearance Divergence
A PDF form field stores its value (/V) and its rendered appearance
(/AP) in two independent places, with no obligation to agree. When they
diverge, a human sees one thing and a parser — or an LLM — reads another. The
scanner detects this with five checks operating on the raw object model:
/NeedAppearances, checkbox /V-vs-/AS, decoded
/AP-stream text vs /V (with hex and UTF-16 decoding and
/Opt resolution), blank-appearance detection, and missing-appearance
detection.
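A reduced sketch of three of these checks follows, working on plain dictionaries rather than a real PDF object model. It assumes the appearance-stream text has already been decoded into a hypothetical `AP_text` key, treats non-UTF-16 hex strings as Latin-1 (a simplification of PDFDocEncoding), and leaves out `/NeedAppearances` and `/Opt` resolution. The finding labels are invented for the example.

```python
def decode_pdf_string(raw):
    """Decode a hex string literal such as <FEFF00390039>; pass others through."""
    if raw.startswith("<") and raw.endswith(">"):
        data = bytes.fromhex(raw[1:-1])
        if data.startswith(b"\xfe\xff"):          # UTF-16BE byte-order mark
            return data[2:].decode("utf-16-be")
        return data.decode("latin-1")
    return raw


def normalise(text):
    """Collapse whitespace so layout differences do not count as divergence."""
    return " ".join(text.split())


def check_field(field):
    """Return divergence findings for one simplified form field."""
    if field.get("FT") == "Btn":
        # Checkbox: the stored value and the appearance state should agree.
        state = field.get("AS")
        if state is not None and state != field.get("V"):
            return ["checkbox_v_as_mismatch"]
        return []

    value = decode_pdf_string(field.get("V", ""))
    ap_text = field.get("AP_text")
    if ap_text is None:
        return ["missing_appearance"] if value else []
    if not ap_text.strip():
        return ["blank_appearance"] if value else []
    if normalise(ap_text) != normalise(value):
        return ["v_ap_text_mismatch"]
    return []


# The stored value decodes to "9900" while the rendered text reads "100".
hex_field = {"FT": "Tx", "V": "<FEFF0039003900300030>", "AP_text": "100"}
hex_findings = check_field(hex_field)
```

Decoding before comparison matters here: a naive string comparison of the hex literal against the rendered text would also report a mismatch, but it would report one for every hex-encoded field, including the honest ones.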
The dedicated validation study measured this directly: 9 of 9 hand-crafted V/AP positives detected — including evasion variants using hex-encoded values, Unicode confusables, and font-encoding remaps — with 0 of 187 false positives (0.00%) across the benign validation set of 44 IRS forms, agency XFA forms, federal publications, academic papers and adversarial files. Signed documents are graded on signature coverage: the scanner models DocMDP permission levels, so a P=2 certified form that a recipient legitimately fills and re-saves is recognised as permitted form-filling, not flagged as a shadow-document attack — while a genuine post-signature execution vector still escalates. At the full 1,572-file scale here the result holds and strengthens: the deception axis fired on 0 of 86 genuinely-benign real documents (clean publications, 46 real government forms including complex multi-signature XFA forms, and PDF 2.0 examples) and on 0 of 399 live malware samples — malware is caught on the threat axis by other signals, not mistaken for a value/appearance problem. No legitimate document in the corpus was graded into the malware band.
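The DocMDP reasoning can be illustrated with a small lookup. The permission levels follow the PDF specification's DocMDP transform parameters (P=1 permits no changes; P=2 permits form filling, page-template instantiation and signing; P=3 additionally permits annotation changes). The change labels and the function name are hypothetical, and how the scanner actually classifies post-signature revisions is not described in this study.

```python
# Changes permitted after certification, per DocMDP permission level P.
ALLOWED_CHANGES = {
    1: set(),
    2: {"form_fill", "page_template", "sign"},
    3: {"form_fill", "page_template", "sign", "annotate"},
}


def classify_post_signature_changes(p, changes):
    """Return ("permitted", []) or ("violation", [offending changes])."""
    if p not in ALLOWED_CHANGES:
        raise ValueError(f"unknown DocMDP permission level: {p}")
    disallowed = sorted(set(changes) - ALLOWED_CHANGES[p])
    if disallowed:
        return "violation", disallowed
    return "permitted", []


# A P=2 certified form that a recipient fills and re-saves.
filled = classify_post_signature_changes(2, ["form_fill"])
# The same form with a script added after signing (hypothetical label).
tampered = classify_post_signature_changes(2, ["form_fill", "add_javascript"])
```

In this sketch the filled form is reported as permitted, while the added script falls outside every permission level and is returned as the offending change.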
Origin study: PDF Form Security — V/AP, DocMDP, FieldMDP.
Reality Drift: When the Rendered Page and the Extracted Text Disagree
A PDF can show one thing to a human eye and hand a different string to any program that
extracts its text. The scanner detects the full family of these reality-drift
mechanisms: ToUnicode CMap remapping (the glyph drawn is not the character
extracted), OCR text-layer mismatch (a scanned image with a selectable-text overlay that
disagrees with it), accessibility injection (/Alt and
/ActualText strings that a screen reader or an AI ingests but no reader sees),
reading-order ambiguity in multi-column layouts, homoglyph and right-to-left-override
spoofing, and metadata desynchronisation between the document info dictionary and XMP.
The full taxonomy is thirteen structural drift vectors, each a place where the rendered reality and the machine reality can be made to disagree:
- Incremental update chains (“ghost revisions”)
- Object-stream compression cloaking
- Optional content groups — hidden layers (OCG)
- Rendering-time logic (JavaScript, actions, triggers)
- Embedded files and nested containers
- Alternate representations — dual-reality PDFs
- Font-level semantic attacks (ToUnicode remapping)
- Spatial ambiguity and reading-order collapse
- Metadata desynchronisation
- Malformed-but-tolerated structures (parser differential)
- Accessibility trees as hidden semantic channels
- Embedded OCR lies (hidden text-layer poisoning)
- PDF as a polyglot container
These findings land on the integrity axis, not the malware axis — a document whose extracted text contradicts its rendered appearance is a content-integrity problem even when it carries no exploit. The dedicated prevalence study (182 documents) found reading-order ambiguity in 43 of 44 IRS tax forms (98%) and all of the academic papers and government publications sampled, while 0 of 103 adversarial proof-of-concept files triggered any drift vector — confirming these vectors describe everyday structural ambiguity in real documents, not an attack signature. At the full 1,572-file scale here the pattern holds and sharpens: reading-order ambiguity fired on 45 of 46 government forms (98%) and on all 34 academic and legislative publications, accessibility structure was present in all 46 government forms, and once again 0 of 103 adversarial proof-of-concept files triggered any drift vector. Reality drift is pervasive in real documents and absent in adversarial ones — the opposite of an attack signature, and exactly why it belongs on the integrity axis. The scanner reports each drift vector per page so an analyst can see exactly where the rendered reality and the machine reality diverge.
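Two of the mechanisms named above, right-to-left-override spoofing and homoglyph substitution, can be approximated on extracted text with the standard library alone. The sketch below is a simple heuristic, not the scanner's detector: it flags Unicode bidirectional control characters and tokens that mix Latin, Cyrillic or Greek letters, using the character-name prefix as a stand-in for a proper script property. Finding labels are invented for the example.

```python
import unicodedata

# Explicit bidirectional embedding, override and isolate controls.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")
SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def script_of(ch):
    """Rough script lookup based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return None


def spoofing_findings(text):
    """Flag bidi controls and whitespace-separated tokens that mix scripts."""
    findings = []
    if any(ch in BIDI_CONTROLS for ch in text):
        findings.append("bidi_control")
    for token in text.split():
        scripts = {s for s in map(script_of, token) if s}
        if len(scripts) > 1:
            findings.append(f"mixed_script:{token}")
    return findings


# U+0430 is CYRILLIC SMALL LETTER A, visually identical to Latin "a".
homoglyph_hits = spoofing_findings("pay to p\u0430ypal")
# U+202E reverses the display order of the characters that follow it.
override_hits = spoofing_findings("invoice\u202efdp.exe")
```

A heuristic like this will also flag legitimate multilingual text, which is consistent with treating the result as an integrity observation for review rather than a threat indicator.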
Origin studies: PDF Reality Drift and PDF Semantic Determinism.
Parser Disagreement
Six production parsers do not return the same account of the same file. They disagree on page count, JavaScript presence, encryption status, and whether an AcroForm exists — because the PDF specification leaves those behaviours underspecified. The scanner runs all six on every upload and reports the disagreements directly: a JavaScript-visibility discrepancy (one parser sees the script, five do not) is itself a strong signal, and is one of the ways hidden-JavaScript payloads are caught.
This is not a rare condition. Across the 1,572-file corpus the six parsers diverged on 502 files — roughly one in three — on page count, JavaScript visibility, encryption status, or AcroForm presence. Differential parsing is reported on the informational axis when the divergence is benign (a version or object-count disagreement) and escalated when it coincides with an execution vector — so a parser split that hides a script is treated as the threat it is, while a harmless structural quirk is not.
The security reading — hidden-JavaScript detection — is the narrow one. The broader finding is that roughly a third of real-world PDFs can yield a materially different answer depending on which library reads them, with no malware involved at all. That has consequences well beyond malware scanning:
- Forensics & e-discovery — two tools can extract different page counts or text from the same exhibit, so “what the document says” depends on the tool of record.
- Digital-signature validation — if parsers disagree on what content or how many revisions a file contains, they can disagree on what a signature actually covers.
- Compliance & archiving — a PDF/A or retention pipeline that validates with one parser and renders with another can certify content a reader never sees.
- Document indexing & AI ingestion — a single-parser extraction layer indexes whichever account it happened to read as authoritative.
A 32% disagreement rate is the kind of structural-integrity problem that exists independent of any attacker.
The government-forms result, in detail
The sharpest case is the real government forms: all 46 of 46 produced a
parser disagreement and, tellingly, the same two divergences every time. Each
form stores its AcroForm and JavaScript inside compressed object streams (/ObjStm),
which the six parsers resolve differently:
| Divergence (all 46 forms) | What the parsers report | Consequence |
|---|---|---|
| AcroForm visibility | MuPDF=none, Poppler=AcroForm, pdfminer=AcroForm | A MuPDF-based tool reports no interactive form on a real fillable government form |
| JavaScript visibility | MuPDF=none, Poppler=JS, Ghostscript=none, pdfminer=JS, pdfjs=none | Three of five parsers see no JavaScript; two do — the form’s field logic is invisible to some tools |
This is not random noise — it is a single, reproducible structural pattern (compressed-object-stream resolution) that makes a third of real-world PDFs, and every government form in this corpus, read differently depending on the library. The per-form divergences for all 46 are published in parser-disagreement-gov-forms.json (and .csv), so the result can be inspected form by form and reproduced against the named parsers.
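The comparison behind that table can be sketched as a grouping of parsers by the answer each one gives. The parser reports below are transcribed from the table, including the fact that it lists three parsers for AcroForm visibility and five for JavaScript; the dictionary layout and the function name are assumptions, and this is not the schema of the published JSON file.

```python
def divergences(reports):
    """Group parsers by reported value; keep only fields where they disagree.

    reports maps parser name -> {field: value}. A parser that does not
    report a field is left out of that field's comparison.
    """
    fields = sorted({field for report in reports.values() for field in report})
    result = {}
    for field in fields:
        groups = {}
        for parser, report in reports.items():
            if field in report:
                groups.setdefault(report[field], []).append(parser)
        if len(groups) > 1:
            result[field] = groups
    return result


# Values as reported in the government-forms table.
gov_form_reports = {
    "MuPDF": {"acroform": False, "javascript": False},
    "Poppler": {"acroform": True, "javascript": True},
    "Ghostscript": {"javascript": False},
    "pdfminer": {"acroform": True, "javascript": True},
    "pdfjs": {"javascript": False},
}
gov_form_divergences = divergences(gov_form_reports)
```

Grouping by value rather than returning a bare boolean keeps the parser names attached to each side of the split, which is what an analyst needs in order to reproduce the disagreement against the named tools.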
Origin study: Parser Disagreement — Six Parsers, Eleven Divergences.
What This Means for AI Ingestion
Value/appearance divergence, reality drift, and parser disagreement do not stop at the human reviewer — they follow the document into RAG pipelines and LLM training corpora. A single-parser extraction layer implicitly assumes there is one canonical truth inside every PDF; for documents with structural ambiguity, there is not, and the wrong value enters the knowledge base as authoritative fact with no rejection signal. The same checks that make this scanner a forensic tool — multi-parser extraction, V/AP divergence detection, reality-drift analysis — are exactly the pre-ingestion checks an AI pipeline needs to avoid silently indexing content no human ever saw.
Origin study: PDF Structural Problems in AI Ingestion Pipelines.
Capability Is Not Threat: The Multi-Axis Verdict in Practice
The hardest job for a forensic scanner is not catching obvious malware — it is not alarming on legitimate documents that happen to be complex. The multi-axis verdict is what makes that distinction, and the corpus shows it working in both directions:
| Document | What it contains | Verdict |
|---|---|---|
| Live malware sample | Exploit pattern, auto-exec payload | Dangerous — threat axis |
| Real IRS / XFA tax form | JavaScript, XFA, embedded files | Low — capability, no threat |
| Signed government XFA form (filled) | Hundreds of form scripts, multiple signatures covering part of the file | Suspicious — integrity review, driver = integrity, not malware |
| arXiv paper / federal legislation | Standard publication structure | Clean |
| Form with value/appearance mismatch | /V ≠ rendered /AP | Flagged — deception axis |
A document’s scripting, forms, and embedded files are reported as capability, never mistaken for malware on their own. CVE attribution requires the actual exploit evidence, not merely the structure a vulnerable feature uses, so a dynamic XFA form is not labelled a heap-overflow exploit because it contains the XFA scripting API. The result across the corpus: malware is caught, complex legitimate documents are graded accurately, and not one of the 86 genuinely-benign real documents (clean publications, real government forms, and PDF 2.0 examples) was graded into the malware band.
A Verdict on Every File
A forensic scanner must never be silenced by a hostile upload. Decompression bombs, pathological object graphs, and files engineered to hang a parser are treated as a finding, not an error: the scanner bounds its own analysis and, when a document cannot be fully analysed within budget, returns a graceful “analysis could not complete — file resisted full analysis” verdict graded at least suspicious, with the partial findings from every engine that completed. Across all 1,572 files, every upload received a verdict; the only documents that could not be fully analysed were genuine resource bombs, which are exactly the documents a scanner should flag rather than choke on.
Behaviour Over Signatures: an Architectural Distinction
Signature-based antivirus answers one question — “have we seen this before?” — and answers it well: against catalogued threats, multi-engine AV is mature and effective. This scanner answers a different question — “what does this PDF actually do, and do its realities agree?” — using behavioural execution, structural differential analysis, and content-integrity checks that are designed to work without a prior signature. The two are complementary, not interchangeable: one recognises known-bad bytes, the other reasons about structure and behaviour. A caveat we make explicit elsewhere applies here too — on this corpus every malware sample also matched threat intelligence, so the corpus shows the analytical engines contributing alongside hash lookup, not replacing it; isolating behaviour-only detection would need novel samples this corpus does not contain.
This is an architectural comparison of what each approach analyses — not a controlled benchmark. We did not run VirusTotal, MetaDefender or any other product against this corpus, so the table below is illustrative of design, not a measured head-to-head. The one experimental result we do report is on this scanner alone: complete separation between the live-malware and benign-control sets (see the detection statistics above).
| Dimension | Signature-based AV (general) | This scanner (by design) |
|---|---|---|
| Core question | Has this been seen before? | What does this file do? |
| PDF structural analysis | Scans bytes, not PDF structure | Static engines on xref / objects / streams |
| Behavioural sandbox | General, not PDF-specific | Six PDF renderers, isolated namespaces, syscall capture |
| Content integrity (V/AP, drift) | Not in scope | Value/appearance, reading-order, OCR, accessibility, ToUnicode |
Privacy. Every file in this study, and every file submitted to the public tool, is analysed on the server and deleted immediately afterwards — no account, no file retention, no third-party upload. A forensic scan should not itself become a data-exposure event.
Scope & Limitations
This is a measurement on a specific corpus, not a universal benchmark. Five limitations bound how far the results generalise, and we state them plainly:
- The negative class is small. The AUC of 1.000 and the 0% false-positive rate are computed against only 86 benign-control documents (46 government forms, 34 publications, 6 PDF 2.0 files). Perfect separation on 86 negatives is a clean result on this corpus — it is not evidence of perfect separation in production, where a larger and more varied benign population would produce boundary cases. The Wilson interval (FPR upper bound 0.043) is the honest ceiling implied by this sample size.
- Most files do not participate in the classification metrics. The TPR/FPR/precision/AUC figures use 399 malware positives and 86 benign negatives — 485 of the 1,572 files. The 950 pdf.js files and 103 corkami adversarial files are deliberately excluded from those statistics because they mix benign and malicious content and cannot be cleanly labelled. They inform the prevalence and behavioural sections, but readers should not assume all 1,572 files contributed equally to the detection numbers.
- There is no comparison baseline. We did not run pdfid, peepdf, pdf-parser, or any commercial scanner against this corpus. The study supports “this scanner performs well on this corpus,” not “it performs better than tool X” — that claim would require a controlled side-by-side we have not done.
- MalwareBazaar bias — and we can quantify it. The 400 malware samples are drawn from a public repository of already-known, already-clustered files, and the per-engine data proves it: the threat-intelligence hash lookup matched 100% of samples. Every file was already catalogued, so a hash match alone would flag the whole set and the corpus cannot separate detection-by-prior-knowledge from detection-by-analysis. The analytical engines each fire on a subset (so they do contribute), but a clean test of behaviour-only detection — novel PDFs, private or targeted campaigns, or zero-days with no threat-intel hit — is one this corpus does not contain. The result shows reliable flagging of known-bad through redundant paths, not exhaustive zero-day detection.
- Snapshot in time. The malware set is one feed pulled on one day; a different feed or date would give different files.
Live malware binaries are kept private and are not redistributed (their SHA-256 hashes are published so each can be retrieved from MalwareBazaar); the benign corpus manifest, the per-file results, and the analysis code are published in the Data & Code Availability section below, so every figure here can be recomputed and challenged.
Methodology & Reproducibility
The “47” is a coverage figure, not a measure of independence: it is 47 analysis passes over one shared parse of the file, of very different weights — some are major subsystems (the behavioural sandbox, the six-parser differential, the XFA parser), others are lightweight checks (a metadata field, an entropy reading). They group into six functional categories:
| Category | Passes | Examples |
|---|---|---|
| Structural & parsing | 10 | structure validator, six-parser differential, xref-integrity graph, qpdf check, content/object-stream & trailer-chain forensics |
| Malware, exploit & behavioural | 15 | YARA, CVE matcher, ClamAV, behavioural sandbox, JS AST de-obfuscation & emulation, polyglot/embedded-binary, entropy topology, steganography, ML |
| Signatures, forms & document integrity | 8 | signature forensics, DocMDP/FieldMDP, AcroForm field forensics, XFA/FormCalc, revision history, annotation & named-tree analysis |
| Content integrity / reality drift | 7 | value/appearance, reading-order, OCR-layer, accessibility-tree, ToUnicode, codec validation, compliance-fraud |
| Metadata & threat intelligence | 6 | metadata/ExifTool reconciliation, URL extraction, threat-intel hash lookup, phishing, campaign attribution |
| Correlation & scoring | 1 | severity fusion, execution-vector gating, multi-axis verdict |
| Total | 47 | analysis passes, not 47 independent products |
The full corpus was scanned under a resource-governed scheduler so a single hostile file could neither exhaust the host nor skew the run: each scan ran in its own control group bounded to 80% of available memory with a fixed system reserve, concurrency scaled to free memory, and a progress-watchdog terminated any scan stuck on one engine. Two engine safeguards mean every file produces a verdict rather than a crash: a resource-bomb guard that detects decompression and pixel bombs by bounded decode (so content-heavy engines never materialise a multi-hundred-MB buffer), and a fallback that converts any residual hang or out-of-memory kill into a graceful “file defeated analysis” verdict. Across all 1,572 files the only documents that could not be fully analysed were genuine resource bombs, which are flagged as such.
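The bounded-decode idea can be illustrated for Flate streams with Python's standard `zlib` module. This is a minimal sketch and not the scanner's actual guard: the function name, the chunk size, and the idea of a single byte limit are assumptions, and pixel bombs would need an analogous check on decoded image dimensions.

```python
import zlib


def inflates_beyond(data, limit, chunk_size=64 * 1024):
    """Return True if `data` (a zlib/Flate stream) expands past `limit` bytes.

    Output is decoded in bounded chunks and discarded, so peak memory stays
    near `chunk_size` regardless of how large the stream would become.
    Malformed input raises zlib.error, which the caller should handle.
    """
    decoder = zlib.decompressobj()
    produced = 0
    pending = data
    while not decoder.eof:
        # The second argument caps how many bytes this call may emit.
        out = decoder.decompress(pending, chunk_size)
        produced += len(out)
        if produced > limit:
            return True
        # Input that was not consumed because of the cap is kept here.
        pending = decoder.unconsumed_tail
        if not out and not pending:
            break  # truncated stream: no input left and nothing more to emit
    return False
```

The point of the design is that the decision is made while decoding, so the full output never has to exist in memory before the guard can say no.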
The corpus is reproducible from public sources: the Mozilla pdf.js test set, the corkami adversarial PDFs, the PDF Association PDF 2.0 examples, GovInfo federal publications, arXiv papers, and real IRS/agency forms; live malware was pulled from MalwareBazaar by SHA-256. The per-file results, the corpus manifest, the malware hash list, and the scripts are published in the data bundle below.
Data & Code Availability
The full reproducibility bundle is published, not held on request:
| Artifact | Contents |
|---|---|
| per-file-results.jsonl | One record per file for all 1,572 PDFs: multi-axis scores (threat / deception / structural), verdict band, and per-domain detection flags (V/AP, parser-disagreement, reading-order, accessibility, signed, JavaScript, XFA) |
| malware-sha256.txt | SHA-256 of all 400 live malware samples — binaries are not redistributed, but each is retrievable from MalwareBazaar by hash |
| malware-detection-breakdown.csv | Per-sample result keyed by SHA-256: threat score, verdict band and driver for every one of the 400 malware samples — join to the hash list, retrieve from MalwareBazaar, and re-scan to verify each score independently |
| benign-corpus-manifest.txt | Exact file list of the benign and adversarial corpus, grouped by source (pdf.js, corkami, PDF 2.0, GovInfo, arXiv, IRS/agency forms, V/AP and hidden-JS fixtures) |
| scan_harness.py | Resumable, resource-bounded batch scan harness |
| scan_governor.py | Adaptive resource governor (80%-available memory, fixed OS reserve, memory-scaled concurrency, progress-stall watchdog) |
| analyze.py | Per-domain aggregation over the results JSONL — reproduces the numbers in this article |
| README.md | Bundle index and reproduction steps |
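The governor's memory-scaled concurrency can be pictured as a small calculation. The sketch below is an assumption about how such a rule might look, not the contents of scan_governor.py: the per-scan memory estimate, the worker cap, and the order in which the 80% fraction and the reserve are applied are all hypothetical.

```python
def scan_concurrency(available_bytes, reserve_bytes, per_scan_bytes,
                     fraction=0.80, max_workers=32):
    """Pick a worker count from free memory (illustrative rule only).

    budget = fraction of currently available memory, minus a fixed reserve.
    Each worker is assumed to need about `per_scan_bytes`.
    """
    budget = int(available_bytes * fraction) - reserve_bytes
    if budget < per_scan_bytes:
        return 1  # always allow one scan so the run can still make progress
    return min(max_workers, budget // per_scan_bytes)
```

With 16 GiB available, a 2 GiB reserve and a 2 GiB per-scan estimate, this rule yields five concurrent scans. Re-evaluating it between scans is what would make the concurrency adaptive.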
The scoring weights and verdict bands needed to interpret the result records are documented in the verdict and methodology sections above. The scanner itself is the same engine the free public tool runs, so any file in the manifest can be re-scanned directly.
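Recomputing a headline figure from the per-file records amounts to counting a flag over the JSONL file. The following sketch shows the shape of that aggregation. The field name `parser_disagreement` is a placeholder, since the record schema is not reproduced here, and the function is not taken from analyze.py.

```python
import json


def flag_rate(jsonl_lines, flag):
    """Count records whose `flag` field is truthy; return (hits, total)."""
    hits = total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        total += 1
        if record.get(flag):
            hits += 1
    return hits, total
```

Opening `per-file-results.jsonl` and passing the file object with the real flag name should give back the 502 of 1,572 parser-disagreement count reported in this article, if the bundle matches the text.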
References
- ISO 32000-1:2008 — Portable Document Format, Part 1. AcroForm field values (§12.7), appearance streams, filters (RunLengthDecode, FlateDecode, DCTDecode, JPXDecode). (specification)
- ISO 32000-2:2020 — PDF 2.0. DocMDP / MDP transform permission levels (§12.8.2.2), /NeedAppearances, ToUnicode CMaps. (specification)
- abuse.ch — MalwareBazaar malicious-sample repository (PDF-tagged samples, retrieved by SHA-256). bazaar.abuse.ch (threat-intel feed)
- Mozilla pdf.js test corpus — real-world and regression PDFs (fonts, CID/Type3, encodings, malformed and exploit test files). github.com/mozilla/pdf.js (public corpus)
- corkami / Ange Albertini — PDF format-edge and polyglot proof-of-concept files. (public corpus)
- Mladenov, V., Mainka, C., Rohlmann, S., Schwenk, J. — “Shadow Attacks: Hiding and Replacing Content in Signed PDFs,” NDSS 2021; and the PDF Insecurity series. (peer-reviewed)
- CVE-2021-21017, CVE-2024-45112 (Acrobat XFA/AcroForm); CVE-2010-1240 (/Launch + embedded file) — pattern references used by the YARA rule sets. (NVD / Adobe advisories)
- Parser & analysis tooling: MuPDF, Poppler, Ghostscript, qpdf, pdfminer.six, pdf.js, Tesseract OCR, YARA. (open-source tools)
Origin Research
This page carries the current numbers. The pages below first characterised each mechanism — useful for the construction detail and methodology behind a specific vector — but they were measured on smaller corpora that predate this study, so where a figure differs, the value on this page is the current one:
- PDF Malware Scanner — the engines and what each detects
- PDF Form Security — value/appearance divergence, DocMDP, FieldMDP construction
- PDF Reality Drift — rendered glyph vs extracted text
- PDF Semantic Determinism — one file, different realities to human, parser, and LLM
- Parser Disagreement — the eleven hand-built divergence cases
- PDF AI Ingestion Pipelines — how these problems enter RAG and training corpora