PQ PDF Tools Secure document utilities for everyday workflows.

Security Research — Published 1 June 2026

PDF Forensics at Scale

A single PDF can present several conflicting realities at once. Whether it is malware is only one of them — this study scores four, measured across 1,572 real documents.

The same PDF can show one thing to a human eye and hand a different reality to the program that extracts its text, the parser that indexes it, or the signature that vouches for it. A verdict that reduces a document to a single number — is it malware? — cannot express that. This scanner scores four independent questions of every document — is it malicious, is it deceptive, is it structurally ambiguous, and what can it execute? — on separate axes, so deception is caught and a complex-but-legitimate file is not mistaken for an attack.

Threat, integrity, and capability are separate dimensions. Collapsing them into one risk score is what makes a single-number verdict both miss the deceptive document and cry wolf on the merely complex one — independent of how sophisticated the underlying engines are. This study runs that multi-axis verdict against one large, diverse population of real documents — including 400 live malware samples — and reports what it finds. It consolidates and updates our earlier per-mechanism research with a far larger corpus; the figures here are the current ones.

Who this matters for
  • DFIR & malware analysts
  • Detection engineers
  • Email & upload gateway operators
  • PDF tooling authors
  • RAG & document-AI builders
  • LLM training-data engineers
  • Security researchers

Executive Summary

Reducing a PDF to a single malware verdict and a single score leaves out most of what matters about it. A PDF is a rendering format, not a statement of fact: the value a human sees, the value a parser extracts, and the value a signature covers can all differ inside the same file. This study measures a scanner built around that reality, scoring every document on three independent axes — threat (can it execute an exploit), integrity (does it deceive: value ≠ appearance, uncovered signature, reality drift), and capability (does it merely contain script, forms, or embedded files) — across a large, deliberately diverse corpus.

  • Corpus. 1,572 PDFs across eight domains, including 400 live malware samples from MalwareBazaar, 950 Mozilla pdf.js real-world files, real government AcroForm/XFA forms, academic and legislative publications, adversarial proof-of-concept files, and hand-crafted fixtures.
  • Malware. Of the 400 live samples, 399 completed analysis and every one was classified high-risk or dangerous (median threat score 203); the remaining sample was a decompression bomb, flagged as resisting analysis rather than scored. No sample was graded clean or low.
  • False positives. None of the 86 genuinely-benign real documents (clean publications, 46 real government forms including complex multi-signature XFA forms, and PDF 2.0 examples) was graded into the malware band — even though those forms carry signatures, JavaScript and XFA. Capability was reported as capability, not mistaken for threat.
  • Parser disagreement — arguably the most consequential result. The six parsers diverged on 502 of 1,572 files (roughly one in three) on page count, JavaScript visibility, encryption status, or AcroForm presence. This is not a security finding alone: it means a third of real-world PDFs can yield a different answer depending on which library reads them — with implications for forensics, compliance, e-discovery, digital-signature validation, document indexing and AI ingestion, even when no malware is involved.
  • AI ingestion. These conflicting realities do not stop at the human reviewer: value/appearance divergence, reality drift and parser disagreement follow a PDF into RAG pipelines and training corpora, where a single-parser extraction layer indexes whichever reality it happened to read as authoritative fact.

A note on reading these numbers: the detection and false-positive figures are clean on this corpus, but the benign control is small (86 documents) and the malware is a snapshot of already-known samples. They show the scanner is not signature-dependent and does not alarm on complex legitimate documents — not that detection is perfect in production. The Scope & Limitations section states exactly how far each result generalises.

Contents
  1. Executive summary
  2. The corpus: 1,572 PDFs across eight domains
  3. How the scanner reaches a verdict
  4. Results across the corpus
  5. Malware & exploit detection
  6. Forms, signatures & value/appearance divergence
  7. Reality drift: rendered page vs extracted text
  8. Parser disagreement
  9. What this means for AI ingestion
  10. Capability is not threat: the multi-axis verdict
  11. A verdict on every file
  12. Behaviour over signatures
  13. Scope & limitations
  14. Methodology & reproducibility
  15. Data & code availability
  16. References

The Corpus: 1,572 PDFs Across Eight Domains

A scanner is only as credible as the corpus it is measured against. Rather than a set of files built to demonstrate a feature, this study uses a large, diverse population that is mostly not ours — real documents, real malware, and adversarial files written by other people for other purposes.

| Domain | Files | Source |
| --- | --- | --- |
| Real-world structural | 950 | Mozilla pdf.js corpus: fonts, CID/Type3, encodings, broken headers, annotations, JavaScript, and deliberately-malformed exploit test files |
| Live malware | 400 | MalwareBazaar (abuse.ch) samples tagged PDF: real malicious documents, analysed statically |
| Adversarial PoC | 103 | corkami / structural edge cases: truncated xrefs, version mismatches, orphan objects, encoding tricks |
| Government forms | 46 | Real IRS / US agency AcroForm and XFA forms (W-9, 1040, 941, VA-10091, …) |
| Clean publications | 34 | arXiv papers (2023–2025) and GovInfo federal legislation |
| Value/appearance divergence | 28 | Hand-crafted V/AP and evasion fixtures |
| PDF 2.0 features | 6 | PDF Association PDF 2.0 example files |
| Hidden JavaScript | 5 | Names-tree, orphan/sleeper, ObjStm, OpenAction, annotation-/AA fixtures |
| Total | 1,572 | Eight domains, benign through actively malicious |

Two blocks anchor the measurement. The live malware set is the detection control — 400 known-bad documents, so any one scoring clean is a miss. The clean publications and real government forms are the false-positive control — legitimate documents, several of them complex (XFA forms, digital signatures, embedded JavaScript), so any one graded as malware is an over-call. The pdf.js block sits between them: Mozilla maintains it precisely because it contains pathological and exploit test files, so a high score there is frequently correct.

How the Scanner Reaches a Verdict

Every upload runs through a single 47-module pipeline — structural validation, differential multi-parser analysis, a behavioural sandbox, YARA, entropy and steganography analysis, signature and DocMDP forensics, JavaScript and XFA analysis, and the content-integrity engines (value/appearance divergence, reading-order, OCR-layer, ToUnicode, accessibility-tree). To be precise about the number: these are 47 analysis passes over a shared parse of the file — not 47 independent detection products, and not 47 separate renderers (the six renderers are one of the passes). The count describes coverage, not redundancy.

Their findings feed a multi-axis verdict. Rather than collapsing everything into one number, the scanner scores three independent axes:

  • Threat — exploitation and execution: shellcode patterns, CVE signatures, auto-executing JavaScript that performs dangerous operations, launch actions, attack chains.
  • Integrity / deception — the document asserting a different reality to a human than to a machine: value/appearance divergence, signature coverage gaps, reality drift between the rendered page and the extracted text.
  • Capability (structural) — neutral facts about what the document can do: it contains a form, JavaScript, an embedded file, XFA. These are reported but do not, by themselves, drive a malware verdict.
[Figure: The Multi-Axis Verdict, Three Independent Scores. One PDF document feeds three panels. THREAT (exploitation / execution): CVE patterns, shellcode, auto-exec payloads, launch / attack chains; live malware → all flagged. INTEGRITY (deception / tampering): value ≠ appearance, signature coverage gap, reality drift; signed form → review. CAPABILITY (neutral facts): forms, JS, XFA, embedded files; not a malware verdict.]
The headline band follows the threat axis. Integrity and capability are reported independently, so a complex-but-legitimate document is graded for review on its own terms instead of being alarmed on as malware.

This separation is the difference between a scanner that cries wolf and one that is useful. A legitimate tax form contains JavaScript and XFA — real capability — but no threat; the multi-axis verdict keeps it off the malware band while still surfacing what it contains. The sections below report how each axis performed across the corpus.

Each indicator carries a severity that contributes weighted points to its axis (critical = 50, high = 25, medium = 10, low = 3, capped at three occurrences per indicator). The threat axis determines the headline band:

| Threat score | Band | Meaning |
| --- | --- | --- |
| 0 | Clean | No threat-axis indicators |
| 1–29 | Low | Capability or weak signals only |
| 30–149 | Suspicious | Worth review |
| 150–349 | High-risk | Strong threat indicators |
| 350+ | Dangerous | Strong evidence of an exploit chain or active malicious behaviour |

These are weighted heuristic aggregates, not CVSS-equivalent severities.
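
The weighting and banding above can be sketched in a few lines. This is a minimal illustration rather than the scanner's own code: the names `axis_score` and `threat_band`, and the representation of indicators as `(name, severity)` pairs, are assumptions made for the example. Only the weights, the cap of three occurrences per indicator, and the band thresholds come from the text.

```python
from collections import Counter

# Severity weights and the per-indicator cap, as stated in the text.
SEVERITY_POINTS = {"critical": 50, "high": 25, "medium": 10, "low": 3}
MAX_OCCURRENCES = 3


def axis_score(indicators):
    """Sum weighted points for (indicator_name, severity) pairs on one axis.

    Hypothetical helper: each distinct indicator counts at most
    MAX_OCCURRENCES times, however often it fires.
    """
    counts = Counter(indicators)
    return sum(
        SEVERITY_POINTS[severity] * min(n, MAX_OCCURRENCES)
        for (_name, severity), n in counts.items()
    )


def threat_band(score):
    """Map a threat-axis score to the headline band from the table."""
    if score == 0:
        return "clean"
    if score < 30:
        return "low"
    if score < 150:
        return "suspicious"
    if score < 350:
        return "high-risk"
    return "dangerous"


# Five hits of one critical indicator count as three, plus one high indicator.
example = [("launch_action", "critical")] * 5 + [("auto_exec_js", "high")]
example_score = axis_score(example)      # 3 * 50 + 25
example_band = threat_band(example_score)
```

The cap keeps one noisy indicator from dominating the aggregate: the fourth and fifth `launch_action` hits in the example add nothing to the score.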

Execution-vector gating: why complexity is not threat

The weights and bands are only half the verdict. The correlation engine applies a hard gate: a document that cannot execute code — no JavaScript, no /Launch, no embedded executable, no /RichMedia, and no behavioural-sandbox signal — is forced to the low band regardless of its raw score. Without an execution vector, a high feature count reflects document complexity, not threat capability. Pattern-only findings (YARA and stream-inspector hits) are additionally downgraded from critical/high to medium when no execution vector is present. This single rule is the main reason a feature-rich legitimate document — a signed XFA tax form with hundreds of scripts — does not appear malicious, and it is why the dangerous band carries a second gate: it requires a confirmed execution vector or attack chain, not merely a high aggregate. Conversely, the correlation engine raises severity when independent engines agree (a multi-engine consensus bonus) and when a divergence coincides with an execution vector — so a parser split that hides a script is escalated while a harmless structural quirk is not.
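
A rough sketch of the two gates and the pattern-only downgrade follows. It is written under stated assumptions and is not the correlation engine itself: the feature labels, the `Finding` type, the `confirmed_vector` and `attack_chain` flags, and the choice to fall back to high-risk when the second gate fails are all hypothetical. The consensus bonus and divergence escalation described above are not modelled.

```python
from dataclasses import dataclass

BANDS = ["clean", "low", "suspicious", "high-risk", "dangerous"]
# Execution vectors named in the text; the string labels are illustrative.
EXECUTION_VECTORS = {
    "javascript", "launch", "embedded_executable", "richmedia", "sandbox_signal",
}
PATTERN_ONLY_ENGINES = {"yara", "stream_inspector"}


@dataclass(frozen=True)
class Finding:
    engine: str
    severity: str  # "critical", "high", "medium" or "low"


def has_execution_vector(features):
    return bool(EXECUTION_VECTORS & set(features))


def downgrade_pattern_only(findings, features):
    """Without an execution vector, pattern-only critical/high findings become medium."""
    if has_execution_vector(features):
        return list(findings)
    return [
        Finding(f.engine, "medium")
        if f.engine in PATTERN_ONLY_ENGINES and f.severity in ("critical", "high")
        else f
        for f in findings
    ]


def gated_band(raw_band, features, confirmed_vector=False, attack_chain=False):
    """Apply both gates to a band derived from the raw threat score."""
    if not has_execution_vector(features):
        # Gate 1: nothing can execute, so the band is capped at low.
        return BANDS[min(BANDS.index(raw_band), BANDS.index("low"))]
    if raw_band == "dangerous" and not (confirmed_vector or attack_chain):
        # Gate 2: a high aggregate alone is not enough.
        # Falling back to high-risk is an assumption of this sketch.
        return "high-risk"
    return raw_band
```

Under this sketch, a document with a raw band of dangerous but no execution vector comes out as low, which mirrors the signed-XFA-form behaviour the paragraph describes.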

Results Across the Corpus

The headline result is the threat-band distribution by domain. The two control blocks behave exactly as a working detector should: every live-malware sample lands in high-risk or dangerous, and no clean publication, government form or PDF 2.0 file reaches the malware band (high-risk or above on the threat axis). The pdf.js block spreads across every band because it is itself a mix of benign and deliberately-malicious test files.

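| Domain | n | Clean | Low | Suspicious | High-risk | Dangerous |
| --- | --- | --- | --- | --- | --- | --- |
| Live malware | 399 | 0 | 0 | 0 | 347 | 52 |
| pdf.js real-world | 949 | 550 | 270 | 109 | 17 | 3 |
| Adversarial PoC | 103 | 2 | 30 | 37 | 23 | 11 |
| Government forms | 46 | 0 | 45 | 1 | 0 | 0 |
| Clean publications | 34 | 14 | 19 | 1 | 0 | 0 |
| V/AP fixtures | 28 | 24 | 1 | 0 | 3 | 0 |
| PDF 2.0 examples | 6 | 6 | 0 | 0 | 0 | 0 |
| Hidden-JavaScript | 5 | 0 | 0 | 0 | 3 | 2 |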

Two notes on the apparent “high” counts in the real-world and adversarial blocks. The 20 pdf.js files at high-risk or above are predominantly genuine exploit and fuzzer test files Mozilla ships on purpose — correct detections, not false positives. The adversarial block scores high because the corkami set is, by construction, malformed and structurally hostile; a forensic scanner is meant to react to it. The single suspicious clean-publication and the single suspicious government form are integrity-axis observations (a structural anomaly worth a glance), not malware calls — no file in either control block reached the threat-axis malware band.

Per-domain capability & detection signals

The same corpus, viewed by which signals fired in how many files of each domain. This is the at-scale evidence behind the per-mechanism sections that follow.

DomainnV/APParser disagreeReading-orderAccessibility treeSignedJavaScriptXFA
Live malware399013155690303
pdf.js real-world94921217448781629
Adversarial PoC103056000221
Government forms46046454646045
Clean publications34025343400
V/AP fixtures2892011210
Hidden-JavaScript50500050

Read across the government-forms row: all 46 are signed, 45 use XFA, all 46 trigger a parser disagreement, 45 show reading-order ambiguity, and all 46 carry an accessibility tree — these are genuinely complex documents — yet 0 fire a value/appearance divergence and 0 reach the malware band. That row is the whole thesis of the multi-axis verdict in one line: maximum capability, correctly graded as low threat.

Malware & Exploit Detection

The 400 live MalwareBazaar samples are the detection control. Of the 400, 399 completed full analysis and every one was classified high-risk or dangerous (median threat score 203); none scored clean or low. The remaining sample was a decompression bomb that exhausted the analysis budget — it was separately flagged as resisting analysis rather than scored (see A verdict on every file).

| Metric | Result |
| --- | --- |
| Samples that completed analysis | 399 / 400 (1 resource-bomb flagged separately) |
| Classified high-risk or dangerous | 399 / 399 analyzed |
| Threat score range (min / median / mean / max) | 160 / 203 / 256 / 999 |
| Band split | 347 high-risk, 52 dangerous |
| Scored clean or low (missed) | 0 |
| Detection signals | YARA (CVE + heap-spray patterns), behavioural sandbox (network/exec/mmap), object & action analysis, XRef integrity, differential parsing |

Detection statistics

Treating the 399 analysed malware samples as positives and the 86 genuinely-benign real documents (clean publications, real government/XFA forms, PDF 2.0) as negatives gives a labelled two-class problem. The pdf.js block is excluded from this calculation because it deliberately mixes benign and malicious files and cannot be cleanly labelled. Malware scored 160–999; the benign control scored 0–38 — the two classes do not overlap.

| Metric | @ suspicious (≥30) | @ high-risk (≥150) |
| --- | --- | --- |
| True-positive rate (recall) | 1.000 (95% CI 0.990–1.000) | 1.000 (0.990–1.000) |
| False-positive rate | 0.023 (0.006–0.081) | 0.000 (0.000–0.043) |
| Precision | 0.995 | 1.000 |
| F1 | 0.997 | 1.000 |
| ROC AUC | 1.000, complete separation (malware min 160 vs benign max 38) | |

95% confidence intervals are Wilson score intervals on n = 399 positives and n = 86 negatives. An AUC of 1.000 reflects complete class separation on this corpus; with larger and more varied negatives the boundary cases would grow, and these intervals are the honest bound on how far the result generalises. Every malware score is published per SHA-256 in malware-detection-breakdown.csv, so the table above can be recomputed from the raw data.
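
The intervals can be checked by hand with the standard Wilson score formula. The sketch below is an illustration, not the study's analysis code. The count of 2 false positives at the suspicious threshold is inferred from the band table (one publication and one government form at suspicious) and from the reported rate of 0.023 on 86 negatives; the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts from the text: 399 malware positives, 86 benign negatives.
positives, negatives = 399, 86
true_positives = 399
fp_at_suspicious = 2   # inferred from the band table, see lead-in
fp_at_high_risk = 0

recall_ci = wilson_interval(true_positives, positives)         # about (0.990, 1.000)
fpr_suspicious_ci = wilson_interval(fp_at_suspicious, negatives)  # about (0.006, 0.081)
fpr_high_risk_ci = wilson_interval(fp_at_high_risk, negatives)    # about (0.000, 0.043)

precision_at_suspicious = true_positives / (true_positives + fp_at_suspicious)
```

With zero observed false positives the upper bound reduces to z² / (n + z²), which is why 86 negatives cannot support a ceiling below roughly 4.3% however clean the observed result is.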

Per-engine contribution — and an honest caveat

Recording which engines raised a high or critical indicator on each of the 398 fully analysed samples shows how the verdict is reached — and exposes the corpus’s single biggest limitation in one number:

| Engine / signal | Fired on | % |
| --- | --- | --- |
| Threat-intelligence hash match | 398 | 100% |
| Object analysis | 57 | 14% |
| YARA (CVE + heap-spray) | 38 | 10% |
| qpdf structural integrity | 38 | 10% |
| Campaign attribution | 38 | 10% |
| Pattern scanner | 34 | 9% |
| ClamAV signatures | 33 | 8% |
| Differential parsing | 30 | 8% |
| Embedded-file / polyglot | 19 / 18 | 5% |
| Structural, entropy, stego, JS, OCR, … (long tail) | ≤13 each | ≤3% |

The 100% threat-intelligence hit rate is the most important caveat in this paper. It means every sample was already catalogued — which is exactly what a MalwareBazaar corpus is. On this corpus the verdict is therefore over-determined: a known-bad hash alone would flag every file, so the study cannot cleanly separate detection-by-prior-knowledge from detection-by-analysis. The analytical engines (YARA, ClamAV, structural integrity, differential parsing, object and embedded-file analysis) each fire on a subset, demonstrating they contribute independent signal, but the clean test of signature-independent detection — novel samples with no threat-intel hit — is one this corpus does not contain. The honest reading of the malware result is: the scanner reliably flags known-bad PDFs through multiple redundant paths; whether it would catch a true zero-day on behaviour alone is a question this corpus cannot answer.

Architecturally, the redundancy still matters: a sample that hides JavaScript from one parser is surfaced by another, by the raw-byte pass, or by sandbox syscall capture — the hidden-JavaScript fixtures (payloads in a compressed Names tree, an orphan/sleeper object, or an object stream) are caught even when most structural parsers miss them. That is a property of the design, not a claim this corpus proves about novel malware.

The samples are kept private and not redistributed; per-sample scores are published by SHA-256 in the data bundle. Origin study: PDF Malware Scanner.

Forms, Signatures & Value/Appearance Divergence

A PDF form field stores its value (/V) and its rendered appearance (/AP) in two independent places, with no obligation to agree. When they diverge, a human sees one thing and a parser — or an LLM — reads another. The scanner detects this with five checks operating on the raw object model: /NeedAppearances, checkbox /V-vs-/AS, decoded /AP-stream text vs /V (with hex and UTF-16 decoding and /Opt resolution), blank-appearance detection, and missing-appearance detection.
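
To make the /V-versus-/AP comparison concrete, here is a simplified sketch covering three of the five checks (decoded text mismatch, blank appearance, missing appearance). It does not parse a PDF: the field is modelled as a plain dictionary whose `"V"` entry holds the raw string bytes and whose `"AP_text"` entry holds text already extracted from the appearance stream. Both keys and both function names are assumptions for illustration, and non-UTF-16 strings are decoded as Latin-1 rather than full PDFDocEncoding.

```python
def decode_pdf_string(raw):
    """Decode a PDF string body: hex form <...> and UTF-16BE with a FE FF BOM.

    Simplification: a literal string that happens to start with '<' and end
    with '>' would be misread as hex; a real parser knows the token type.
    """
    if raw.startswith(b"<") and raw.endswith(b">"):
        digits = "".join(raw[1:-1].decode("ascii").split())
        if len(digits) % 2:
            digits += "0"  # an odd-length hex string is padded with a final 0
        raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")


def vap_findings(field):
    """Compare the stored value with the text a human would see rendered."""
    findings = []
    value = decode_pdf_string(field["V"]).strip()
    ap_text = field.get("AP_text")  # None means the field has no /AP at all
    if ap_text is None:
        if value:
            findings.append("missing-appearance")
    elif not ap_text.strip():
        if value:
            findings.append("blank-appearance")
    elif ap_text.strip() != value:
        findings.append("value-appearance-mismatch")
    return findings


# The stored value is "1000" as hex-encoded UTF-16BE; the page shows "10".
tampered = {"V": b"<FEFF0031003000300030>", "AP_text": "10"}
tampered_findings = vap_findings(tampered)
```

Decoding before comparing is the point of the example: a byte-level comparison of the hex-encoded value against the rendered text would report a mismatch for every such field, including honest ones.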

The dedicated validation study measured this directly: 9 of 9 hand-crafted V/AP positives detected — including evasion variants using hex-encoded values, Unicode confusables, and font-encoding remaps — with 0 of 187 false positives (0.00%) across the benign validation set of 44 IRS forms, agency XFA forms, federal publications, academic papers and adversarial files. Signed documents are graded on signature coverage: the scanner models DocMDP permission levels, so a P=2 certified form that a recipient legitimately fills and re-saves is recognised as permitted form-filling, not flagged as a shadow-document attack — while a genuine post-signature execution vector still escalates. At the full 1,572-file scale here the result holds and strengthens: the deception axis fired on 0 of 86 genuinely-benign real documents (clean publications, 46 real government forms including complex multi-signature XFA forms, and PDF 2.0 examples) and on 0 of 399 live malware samples — malware is caught on the threat axis by other signals, not mistaken for a value/appearance problem. No legitimate document in the corpus was graded into the malware band.

Origin study: PDF Form Security — V/AP, DocMDP, FieldMDP.

Reality Drift: When the Rendered Page and the Extracted Text Disagree

A PDF can show one thing to a human eye and hand a different string to any program that extracts its text. The scanner detects the full family of these reality-drift mechanisms: ToUnicode CMap remapping (the glyph drawn is not the character extracted), OCR text-layer mismatch (a scanned image with a selectable-text overlay that disagrees with it), accessibility injection (/Alt and /ActualText strings that a screen reader or an AI ingests but no reader sees), reading-order ambiguity in multi-column layouts, homoglyph and right-to-left-override spoofing, and metadata desynchronisation between the document info dictionary and XMP.
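
Two of these mechanisms, right-to-left-override spoofing and homoglyph substitution, can be screened for in extracted text with the standard library alone. The sketch below is a crude approximation and not the scanner's detector: it infers script from Unicode character names, considers only Latin, Cyrillic and Greek, and the function names are hypothetical.

```python
import unicodedata

# Explicit bidirectional embedding, override and isolate controls.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def script_of(ch):
    """Rough script label taken from the Unicode character name."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(script):
            return script
    return None


def spoofing_findings(text):
    """Flag bidi control characters and tokens that mix look-alike scripts."""
    findings = []
    if any(ch in BIDI_CONTROLS for ch in text):
        findings.append("bidi-control")
    for token in text.split():
        scripts = {s for s in map(script_of, token) if s}
        if len(scripts) > 1:
            findings.append(f"mixed-script:{token}")
    return findings


# "paypal" with a Cyrillic "а" (U+0430) in place of the Latin "a".
homoglyph_hits = spoofing_findings("p\u0430ypal")
# A right-to-left override hidden inside a file name.
override_hits = spoofing_findings("invoice\u202efdp.exe")
```

A check like this runs on the extracted string, so it catches only what the extraction layer exposes; a ToUnicode remap that substitutes one plain Latin letter for another needs the glyph-versus-character comparison described above.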

The full taxonomy is thirteen structural drift vectors, each a place where the rendered reality and the machine reality can be made to disagree:

  1. Incremental update chains (“ghost revisions”)
  2. Object-stream compression cloaking
  3. Optional content groups — hidden layers (OCG)
  4. Rendering-time logic (JavaScript, actions, triggers)
  5. Embedded files and nested containers
  6. Alternate representations — dual-reality PDFs
  7. Font-level semantic attacks (ToUnicode remapping)
  8. Spatial ambiguity and reading-order collapse
  9. Metadata desynchronisation
  10. Malformed-but-tolerated structures (parser differential)
  11. Accessibility trees as hidden semantic channels
  12. Embedded OCR lies (hidden text-layer poisoning)
  13. PDF as a polyglot container

These findings land on the integrity axis, not the malware axis — a document whose extracted text contradicts its rendered appearance is a content-integrity problem even when it carries no exploit. The dedicated prevalence study (182 documents) found reading-order ambiguity in 43 of 44 IRS tax forms (98%) and all of the academic papers and government publications sampled, while 0 of 103 adversarial proof-of-concept files triggered any drift vector — confirming these vectors describe everyday structural ambiguity in real documents, not an attack signature. At the full 1,572-file scale here the pattern holds and sharpens: reading-order ambiguity fired on 45 of 46 government forms (98%) and on all 34 academic and legislative publications, accessibility structure was present in all 46 government forms, and once again 0 of 103 adversarial proof-of-concept files triggered any drift vector. Reality drift is pervasive in real documents and absent in adversarial ones — the opposite of an attack signature, and exactly why it belongs on the integrity axis. The scanner reports each drift vector per page so an analyst can see exactly where the rendered reality and the machine reality diverge.

Origin studies: PDF Reality Drift and PDF Semantic Determinism.

Parser Disagreement

Six production parsers do not return the same account of the same file. They disagree on page count, JavaScript presence, encryption status, and whether an AcroForm exists — because the PDF specification leaves those behaviours underspecified. The scanner runs all six on every upload and reports the disagreements directly: a JavaScript-visibility discrepancy (one parser sees the script, five do not) is itself a strong signal, and is one of the ways hidden-JavaScript payloads are caught.

This is not a rare condition. Across the 1,572-file corpus the six parsers diverged on 502 files — roughly one in three — on page count, JavaScript visibility, encryption status, or AcroForm presence. Differential parsing is reported on the informational axis when the divergence is benign (a version or object-count disagreement) and escalated when it coincides with an execution vector — so a parser split that hides a script is treated as the threat it is, while a harmless structural quirk is not.

The security reading — hidden-JavaScript detection — is the narrow one. The broader finding is that roughly a third of real-world PDFs can yield a materially different answer depending on which library reads them, with no malware involved at all. That has consequences well beyond malware scanning:

  • Forensics & e-discovery — two tools can extract different page counts or text from the same exhibit, so “what the document says” depends on the tool of record.
  • Digital-signature validation — if parsers disagree on what content or how many revisions a file contains, they can disagree on what a signature actually covers.
  • Compliance & archiving — a PDF/A or retention pipeline that validates with one parser and renders with another can certify content a reader never sees.
  • Document indexing & AI ingestion — a single-parser extraction layer indexes whichever account it happened to read as authoritative.

A 32% disagreement rate is the kind of structural-integrity problem that exists independent of any attacker.

The government-forms result, in detail

The sharpest case is the real government forms: all 46 of 46 produced a parser disagreement, and — tellingly — the same two, every time. Each form stores its AcroForm and JavaScript inside compressed object streams (/ObjStm), which the six parsers resolve differently:

| Divergence (all 46 forms) | What the parsers report | Consequence |
| --- | --- | --- |
| AcroForm visibility | MuPDF=none, Poppler=AcroForm, pdfminer=AcroForm | A MuPDF-based tool reports no interactive form on a real fillable government form |
| JavaScript visibility | MuPDF=none, Poppler=JS, Ghostscript=none, pdfminer=JS, pdfjs=none | Three of five parsers see no JavaScript; two do, so the form’s field logic is invisible to some tools |

This is not random noise — it is a single, reproducible structural pattern (compressed-object-stream resolution) that makes a third of real-world PDFs, and every government form in this corpus, read differently depending on the library. The per-form divergences for all 46 are published in parser-disagreement-gov-forms.json (and .csv), so the result can be inspected form by form and reproduced against the named parsers.
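
The differential comparison itself is simple once each parser's account is reduced to a small report. The sketch below assumes hypothetical per-parser dictionaries (it does not invoke MuPDF, Poppler or any other parser) and uses the values from the government-forms table above; the function name and field keys are illustrative.

```python
def divergences(reports):
    """Group parsers by the value each reports; keep only fields that split.

    reports: {parser_name: {field: value}}
    returns: {field: {value: [parser_name, ...]}}
    """
    fields = set().union(*(report.keys() for report in reports.values()))
    split = {}
    for field in sorted(fields):
        groups = {}
        for parser, report in reports.items():
            if field in report:
                groups.setdefault(report[field], []).append(parser)
        if len(groups) > 1:
            split[field] = groups
    return split


# Values taken from the government-forms table; a missing key means the
# table lists no report from that parser for that field.
gov_form = {
    "MuPDF": {"acroform": "none", "javascript": "none"},
    "Poppler": {"acroform": "AcroForm", "javascript": "JS"},
    "Ghostscript": {"javascript": "none"},
    "pdfminer": {"acroform": "AcroForm", "javascript": "JS"},
    "pdfjs": {"javascript": "none"},
}

gov_form_split = divergences(gov_form)
hides_script = "javascript" in gov_form_split
```

Keeping the parser names attached to each value, rather than reducing the result to a boolean, is what lets an analyst see which tool of record would have missed the form or the script.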

Origin study: Parser Disagreement — Six Parsers, Eleven Divergences.

What This Means for AI Ingestion

Value/appearance divergence, reality drift, and parser disagreement do not stop at the human reviewer — they follow the document into RAG pipelines and LLM training corpora. A single-parser extraction layer implicitly assumes there is one canonical truth inside every PDF; for documents with structural ambiguity, there is not, and the wrong value enters the knowledge base as authoritative fact with no rejection signal. The same checks that make this scanner a forensic tool — multi-parser extraction, V/AP divergence detection, reality-drift analysis — are exactly the pre-ingestion checks an AI pipeline needs to avoid silently indexing content no human ever saw.

Origin study: PDF Structural Problems in AI Ingestion Pipelines.

Capability Is Not Threat: The Multi-Axis Verdict in Practice

The hardest job for a forensic scanner is not catching obvious malware — it is not alarming on legitimate documents that happen to be complex. The multi-axis verdict is what makes that distinction, and the corpus shows it working in both directions:

| Document | What it contains | Verdict |
| --- | --- | --- |
| Live malware sample | Exploit pattern, auto-exec payload | Dangerous (threat axis) |
| Real IRS / XFA tax form | JavaScript, XFA, embedded files | Low (capability, no threat) |
| Signed government XFA form (filled) | Hundreds of form scripts, multiple signatures covering part of the file | Suspicious (integrity review; driver = integrity, not malware) |
| arXiv paper / federal legislation | Standard publication structure | Clean |
| Form with value/appearance mismatch | /V ≠ rendered /AP | Flagged (deception axis) |

A document’s scripting, forms, and embedded files are reported as capability, never mistaken for malware on their own. CVE attribution requires the actual exploit evidence, not merely the structure a vulnerable feature uses, so a dynamic XFA form is not labelled a heap-overflow exploit because it contains the XFA scripting API. The result across the corpus: malware is caught, complex legitimate documents are graded accurately, and not one of the 86 genuinely-benign real documents (clean publications, real government forms, and PDF 2.0 examples) was graded into the malware band.

A Verdict on Every File

A forensic scanner must never be silenced by a hostile upload. Decompression bombs, pathological object graphs, and files engineered to hang a parser are treated as a finding, not an error: the scanner bounds its own analysis and, when a document cannot be fully analysed within budget, returns a graceful “analysis could not complete — file resisted full analysis” verdict graded at least suspicious, with the partial findings from every engine that completed. Across all 1,572 files, every upload received a verdict; the only documents that could not be fully analysed were genuine resource bombs, which are exactly the documents a scanner should flag rather than choke on.

Behaviour Over Signatures: an Architectural Distinction

Signature-based antivirus answers one question — “have we seen this before?” — and answers it well: against catalogued threats, multi-engine AV is mature and effective. This scanner answers a different question — “what does this PDF actually do, and do its realities agree?” — using behavioural execution, structural differential analysis, and content-integrity checks that are designed to work without a prior signature. The two are complementary, not interchangeable: one recognises known-bad bytes, the other reasons about structure and behaviour. A caveat we make explicit elsewhere applies here too — on this corpus every malware sample also matched threat intelligence, so the corpus shows the analytical engines contributing alongside hash lookup, not replacing it; isolating behaviour-only detection would need novel samples this corpus does not contain.

This is an architectural comparison of what each approach analyses — not a controlled benchmark. We did not run VirusTotal, MetaDefender or any other product against this corpus, so the table below is illustrative of design, not a measured head-to-head. The one experimental result we do report is on this scanner alone: complete separation between the live-malware and benign-control sets (see the detection statistics above).

What each approach is designed to analyse (architectural, not measured):
| Dimension | Signature-based AV (general) | This scanner (by design) |
| --- | --- | --- |
| Core question | Has this been seen before? | What does this file do? |
| PDF structural analysis | Scans bytes, not PDF structure | Static engines on xref / objects / streams |
| Behavioural sandbox | General, not PDF-specific | Six PDF renderers, isolated namespaces, syscall capture |
| Content integrity (V/AP, drift) | Not in scope | Value/appearance, reading-order, OCR, accessibility, ToUnicode |

Privacy. Every file in this study, and every file submitted to the public tool, is analysed on the server and deleted immediately afterwards — no account, no file retention, no third-party upload. A forensic scan should not itself become a data-exposure event.

Scope & Limitations

This is a measurement on a specific corpus, not a universal benchmark. Five limitations bound how far the results generalise, and we state them plainly:

  1. The negative class is small. The AUC of 1.000 and the 0% false-positive rate are computed against only 86 benign-control documents (46 government forms, 34 publications, 6 PDF 2.0 files). Perfect separation on 86 negatives is a clean result on this corpus — it is not evidence of perfect separation in production, where a larger and more varied benign population would produce boundary cases. The Wilson interval (FPR upper bound 0.043) is the honest ceiling implied by this sample size.
  2. Most files do not participate in the classification metrics. The TPR/FPR/precision/AUC figures use 399 malware positives and 86 benign negatives — 485 of the 1,572 files. The 950 pdf.js files and 103 corkami adversarial files are deliberately excluded from those statistics because they mix benign and malicious content and cannot be cleanly labelled. They inform the prevalence and behavioural sections, but readers should not assume all 1,572 files contributed equally to the detection numbers.
  3. There is no comparison baseline. We did not run pdfid, peepdf, pdf-parser, or any commercial scanner against this corpus. The study supports “this scanner performs well on this corpus,” not “it performs better than tool X” — that claim would require a controlled side-by-side we have not done.
  4. MalwareBazaar bias — and we can quantify it. The 400 malware samples are drawn from a public repository of already-known, already-clustered files, and the per-engine data proves it: the threat-intelligence hash lookup matched 100% of samples. Every file was already catalogued, so a hash match alone would flag the whole set and the corpus cannot separate detection-by-prior-knowledge from detection-by-analysis. The analytical engines each fire on a subset (so they do contribute), but a clean test of behaviour-only detection — novel PDFs, private or targeted campaigns, or zero-days with no threat-intel hit — is one this corpus does not contain. The result shows reliable flagging of known-bad through redundant paths, not exhaustive zero-day detection.
  5. Snapshot in time. The malware set is one feed pulled on one day; a different feed or date would give different files.
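
The Wilson ceiling quoted in limitation 1 can be recomputed from the two counts alone. The sketch below assumes the conventional two-sided 95% level (z = 1.96); that level is an assumption on our part, chosen because it reproduces the 0.043 figure. The function name `wilson_interval` is illustrative and is not taken from the published bundle.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval (an assumed level).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# 0 false positives observed among 86 benign controls.
low, high = wilson_interval(0, 86)
print(f"FPR interval: [{low:.3f}, {high:.3f}]")
```

With zero observed false positives the lower bound collapses to zero and the upper bound simplifies to z² / (n + z²), so the ceiling shrinks only as the benign sample grows.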

Live malware binaries are kept private and are not redistributed (their SHA-256 hashes are published so each can be retrieved from MalwareBazaar); the benign corpus manifest, the per-file results, and the analysis code are published in the Data & Code Availability section below, so every figure here can be recomputed and challenged.

Methodology & Reproducibility

The “47” is a coverage figure, not a measure of independence: it is 47 analysis passes over one shared parse of the file, of very different weights — some are major subsystems (the behavioural sandbox, the six-parser differential, the XFA parser), others are lightweight checks (a metadata field, an entropy reading). They group into six functional categories:

| Category | Passes | Examples |
| --- | --- | --- |
| Structural & parsing | 10 | structure validator, six-parser differential, xref-integrity graph, qpdf check, content/object-stream & trailer-chain forensics |
| Malware, exploit & behavioural | 15 | YARA, CVE matcher, ClamAV, behavioural sandbox, JS AST de-obfuscation & emulation, polyglot/embedded-binary, entropy topology, steganography, ML |
| Signatures, forms & document integrity | 8 | signature forensics, DocMDP/FieldMDP, AcroForm field forensics, XFA/FormCalc, revision history, annotation & named-tree analysis |
| Content integrity / reality drift | 7 | value/appearance, reading-order, OCR-layer, accessibility-tree, ToUnicode, codec validation, compliance-fraud |
| Metadata & threat intelligence | 6 | metadata/ExifTool reconciliation, URL extraction, threat-intel hash lookup, phishing, campaign attribution |
| Correlation & scoring | 1 | severity fusion, execution-vector gating, multi-axis verdict |
| Total | 47 | analysis passes, not 47 independent products |

The full corpus was scanned under a resource-governed scheduler so a single hostile file could neither exhaust the host nor skew the run: each scan ran in its own control group bounded to 80% of available memory with a fixed system reserve, concurrency scaled to free memory, and a progress-watchdog terminated any scan stuck on one engine. Two engine safeguards mean every file produces a verdict rather than a crash: a resource-bomb guard that detects decompression and pixel bombs by bounded decode (so content-heavy engines never materialise a multi-hundred-MB buffer), and a fallback that converts any residual hang or out-of-memory kill into a graceful “file defeated analysis” verdict. Across all 1,572 files the only documents that could not be fully analysed were genuine resource bombs, which are flagged as such.
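
The bounded-decode idea behind the resource-bomb guard can be illustrated for a single zlib (Flate) stream. This is a minimal sketch under stated assumptions, not the scanner's implementation: the function name `bounded_inflate` and the 1 MB limit are hypothetical, and a real guard would also need to cover image (pixel) bombs and nested filters. It relies only on the standard `zlib.decompressobj().decompress(data, max_length)` call, which caps how many output bytes are produced.

```python
import zlib
from typing import Tuple


def bounded_inflate(data: bytes, limit: int) -> Tuple[bytes, bool]:
    """Inflate a zlib stream without ever holding more than limit + 1 output bytes.

    Returns (output truncated to `limit` bytes, exceeded flag).
    A malformed stream raises zlib.error, which the caller must handle.
    """
    decoder = zlib.decompressobj()
    # Ask for one byte more than the limit: receiving it proves the stream
    # would expand past the limit, without decoding the rest.
    out = decoder.decompress(data, limit + 1)
    exceeded = len(out) > limit
    return out[:limit], exceeded


# A highly compressible payload stands in for a decompression bomb.
bomb = zlib.compress(b"\x00" * 5_000_000)
out, exceeded = bounded_inflate(bomb, 1_000_000)  # hypothetical 1 MB limit
print(len(bomb), len(out), exceeded)
```

The design point is that the verdict ("this stream is a bomb") comes from hitting the cap, so the engine never needs to materialise the full expanded buffer to reach it.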

The corpus is reproducible from public sources: the Mozilla pdf.js test set, the corkami adversarial PDFs, the PDF Association PDF 2.0 examples, GovInfo federal publications, arXiv papers, and real IRS/agency forms; live malware was pulled from MalwareBazaar by SHA-256. The per-file results, the corpus manifest, the malware hash list, and the scripts are published in the data bundle below.
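
Because the malware set is identified by SHA-256, anyone rebuilding the corpus can confirm that a retrieved sample is the one listed before re-scanning it. The sketch below assumes the hash list is a plain text file with one hexadecimal digest per line; that layout, the `#` comment handling, and the helper names are assumptions for illustration, not a description of the published file.

```python
import hashlib
from typing import Set


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large samples are never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_hash_list(path: str) -> Set[str]:
    """Read digests, assuming one hex digest per line (hypothetical layout)."""
    hashes = set()
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip().lower()
            if entry and not entry.startswith("#"):
                hashes.add(entry)
    return hashes


def is_listed_sample(sample_path: str, hash_list_path: str) -> bool:
    """True if the file's SHA-256 appears in the hash list."""
    return sha256_of(sample_path) in load_hash_list(hash_list_path)
```

A call such as `is_listed_sample("sample.pdf", "malware-sha256.txt")` would then gate a re-scan; live samples should only be handled in an isolated analysis environment.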

Data & Code Availability

The full reproducibility bundle is published, not held on request:

| Artifact | Contents |
| --- | --- |
| per-file-results.jsonl | One record per file for all 1,572 PDFs: multi-axis scores (threat / deception / structural), verdict band, and per-domain detection flags (V/AP, parser-disagreement, reading-order, accessibility, signed, JavaScript, XFA) |
| malware-sha256.txt | SHA-256 of all 400 live malware samples; binaries are not redistributed, but each is retrievable from MalwareBazaar by hash |
| malware-detection-breakdown.csv | Per-sample result keyed by SHA-256: threat score, verdict band and driver for every one of the 400 malware samples. Join to the hash list, retrieve from MalwareBazaar, and re-scan to verify each score independently |
| benign-corpus-manifest.txt | Exact file list of the benign and adversarial corpus, grouped by source (pdf.js, corkami, PDF 2.0, GovInfo, arXiv, IRS/agency forms, V/AP and hidden-JS fixtures) |
| scan_harness.py | Resumable, resource-bounded batch scan harness |
| scan_governor.py | Adaptive resource governor (80%-available memory, fixed OS reserve, memory-scaled concurrency, progress-stall watchdog) |
| analyze.py | Per-domain aggregation over the results JSONL; reproduces the numbers in this article |
| README.md | Bundle index and reproduction steps |

The scoring weights and verdict bands needed to interpret the result records are documented in the verdict and methodology sections above. The scanner itself is the same engine the free public tool runs, so any file in the manifest can be re-scanned directly.
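
For readers who want to recompute prevalence figures themselves, a per-domain aggregation over a JSONL file could look roughly like the sketch below. It is not the published analyze.py. The key names `verdict_band` and `flags`, and the example flag names, are hypothetical placeholders; the real record schema should be taken from the bundle itself.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def summarise(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count verdict bands and raised detection flags across JSONL records.

    Assumes each record has a 'verdict_band' string and a 'flags' mapping of
    flag name to boolean; both key names are hypothetical.
    """
    bands: Counter = Counter()
    flags: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        bands[record.get("verdict_band", "unknown")] += 1
        for name, raised in record.get("flags", {}).items():
            if raised:
                flags[name] += 1
    return {"bands": bands, "flags": flags}


# Two made-up records standing in for lines of the results file.
sample_lines = [
    '{"verdict_band": "clean", "flags": {"javascript": false, "xfa": false}}',
    '{"verdict_band": "malicious", "flags": {"javascript": true, "xfa": false}}',
]
summary = summarise(sample_lines)
print(dict(summary["bands"]), dict(summary["flags"]))
```

Streaming the file line by line keeps memory flat regardless of corpus size, which matters less at 1,572 records than it would for a larger re-run.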

References

  1. ISO 32000-1:2008 — Portable Document Format, Part 1. AcroForm field values (§12.7), appearance streams, filters (RunLengthDecode, FlateDecode, DCTDecode, JPXDecode). (specification)
  2. ISO 32000-2:2020 — PDF 2.0. DocMDP / MDP transform permission levels (§12.8.2.2), /NeedAppearances, ToUnicode CMaps. (specification)
  3. abuse.ch — MalwareBazaar malicious-sample repository (PDF-tagged samples, retrieved by SHA-256). bazaar.abuse.ch (threat-intel feed)
  4. Mozilla pdf.js test corpus — real-world and regression PDFs (fonts, CID/Type3, encodings, malformed and exploit test files). github.com/mozilla/pdf.js (public corpus)
  5. corkami / Ange Albertini — PDF format-edge and polyglot proof-of-concept files. (public corpus)
  6. Mladenov, V., Mainka, C., Rohlmann, S., Schwenk, J. — “Shadow Attacks: Hiding and Replacing Content in Signed PDFs,” NDSS 2021; and the PDF Insecurity series. (peer-reviewed)
  7. CVE-2021-21017, CVE-2024-45112 (Acrobat XFA/AcroForm); CVE-2010-1240 (/Launch + embedded file) — pattern references used by the YARA rule sets. (NVD / Adobe advisories)
  8. Parser & analysis tooling: MuPDF, Poppler, Ghostscript, qpdf, pdfminer.six, pdf.js, Tesseract OCR, YARA. (open-source tools)

Origin Research

This page carries the current numbers. The pages below first characterised each mechanism — useful for the construction detail and methodology behind a specific vector — but they were measured on smaller corpora that predate this study, so where a figure differs, the value on this page is the current one:

  • PDF Malware Scanner — the engines and what each detects
  • PDF Form Security — value/appearance divergence, DocMDP, FieldMDP construction
  • PDF Reality Drift — rendered glyph vs extracted text
  • PDF Semantic Determinism — one file, different realities to human, parser, and LLM
  • Parser Disagreement — the eleven hand-built divergence cases
  • PDF AI Ingestion Pipelines — how these problems enter RAG and training corpora


