# PDF Forensics at Scale — reproducibility bundle

Companion data and code for https://pqpdf.com/pdf-forensics-at-scale.php — a forensic
measurement of a 47-module PDF scanner across 1,572 PDFs spanning eight domains.

## Files
- `per-file-results.jsonl` — one JSON record per file (all 1,572): multi-axis scores
  (threat / deception / structural), verdict band, and per-domain detection flags
  (V/AP, parser-disagreement, reading-order, accessibility, signed, JavaScript, XFA).
- `malware-detection-breakdown.csv` — per-sample result keyed by SHA-256: threat score,
  verdict band and driver for every one of the 400 malware samples (each result is
  independently verifiable by retrieving the sample from MalwareBazaar by hash and re-scanning it).
- `malware-sha256.txt` — SHA-256 of the 400 live malware samples. The binaries are not
  redistributed; each is retrievable from MalwareBazaar (https://bazaar.abuse.ch/) by hash.
- `benign-corpus-manifest.txt` — exact file list of the benign and adversarial corpus,
  grouped by source (Mozilla pdf.js, corkami, PDF Association PDF 2.0, GovInfo, arXiv,
  real IRS/agency forms, and the hand-crafted V/AP and hidden-JS fixtures).
- `malware-per-engine-contribution.csv` — for each engine, the number of the 398
  fully-analysed malware samples on which it raised a high/critical indicator (note: the
  threat-intelligence hash lookup fired on 100%, evidencing the known-sample nature of the corpus).
- `parser-disagreement-gov-forms.json` / `.csv` — the per-form parser divergences for all
  46 government forms (every disagreement type with the per-parser values), so the
  '46/46 government forms disagree' result can be inspected and reproduced sample by sample.
- `scan_harness.py` — resumable, resource-bounded batch scan harness.
- `scan_governor.py` — adaptive resource governor (per-scan memory capped at 80% of available
  memory, with a fixed OS reserve; concurrency scaled to free memory; progress-stall watchdog).
- `analyze.py` — discrepancy / per-domain aggregation over the results JSONL.
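
Because the malware binaries are not redistributed, it can help to confirm that a sample retrieved from MalwareBazaar really is one of the listed files before re-scanning it. The following is a minimal sketch, not part of the bundle: it assumes `malware-sha256.txt` holds one hex digest per line, and the helper names `sha256_of`, `load_known_hashes` and `is_listed` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_known_hashes(list_path):
    """Load the hash list, ASSUMING one hex digest per line (blank lines skipped)."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def is_listed(sample_path, known_hashes):
    """True if the sample's SHA-256 appears in the published list."""
    return sha256_of(sample_path) in known_hashes
```

A call such as `is_listed("sample.pdf", load_known_hashes("malware-sha256.txt"))` would then gate the re-scan; the same digest can be used to look up the matching row in `malware-detection-breakdown.csv`.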

## Reproducing
Run each PDF through the scanner (public tool at /tools/scan.php) using `scan_harness.py` or
`scan_governor.py`, producing one result JSON per file. `analyze.py` then aggregates those
results into the per-domain numbers reported in the article. The scoring weights and verdict
bands are documented in the article's methodology section.
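
For a quick sanity check of the published results without running the full pipeline, the JSONL can be tallied directly. This is a rough sketch and not a description of what `analyze.py` does: the record keys `domain` and `verdict` are assumptions (inspect one line of `per-file-results.jsonl` for the actual field names), and `read_jsonl` and `verdicts_by_domain` are hypothetical helpers.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def verdicts_by_domain(records, domain_key="domain", verdict_key="verdict"):
    """Count (domain, verdict band) pairs.

    The default key names are ASSUMED, not taken from the bundle; records
    missing a key are counted under "unknown".
    """
    counts = Counter()
    for rec in records:
        pair = (rec.get(domain_key, "unknown"), rec.get(verdict_key, "unknown"))
        counts[pair] += 1
    return counts
```

Calling `verdicts_by_domain(read_jsonl("per-file-results.jsonl"))` returns a `Counter` whose totals should sum to the number of records in the file, which gives a simple check against the 1,572 files stated above.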
