PDF Parser Disagreement: Six Parsers, Eleven Divergences — Empirical Evidence

New — population-scale companion: PDF Forensics at Scale runs the scanner against 1,572 real-world PDFs — including 400 live malware samples — reporting live-malware detection, the real-world false-positive rate, and the files that crash a scanner (and how the engine was hardened).

Why Multi-Parser PDF Security Is Not Optional

Every PDF security scanner — antivirus, email gateway, DLP system — uses exactly one PDF parser. That parser is the scanner’s entire view of reality. If the parser that renders the file in the victim’s browser reports a different reality than the scanner’s parser, the malicious content is invisible to the scanner by construction. This is the PDF parser disagreement problem.

This is not a theoretical risk. PDF parser confusion has been exploited in documented malware campaigns. The problem persists because the PDF specification contains genuine ambiguities — fields where the spec either contradicts itself, gives parsers discretion, or fails to define behavior for malformed input.

The eleven tests below are not proof-of-concept exploits. They are minimal reproducible demonstrations of real parsing divergence, verified against live production scanners.

Who this matters for: SOC & incident response • DFIR & malware analysts • RAG & document AI builders • DLP & email gateway engineers • PDF/A & PDF/UA archivists • PDF standards contributors

Methodology

We wrote eleven PDFs by hand, in Python, without using any PDF library. Each file was built from raw bytes with byte-accurate cross-reference tables, targeting a single ambiguity per file. Files range from 349 to 798 bytes. No fonts, no images, no embedded content beyond the specific structural feature under test.
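As a rough illustration of what "built from raw bytes with byte-accurate cross-reference tables" involves, the sketch below assembles a one-page PDF and records each object's offset as it is written. The helper name `build_pdf` and the object bodies are assumptions made for this example; they are not the generator used for the test files.

```python
def build_pdf(objects: list[bytes], header: bytes = b"%PDF-1.4\n") -> bytes:
    """Assemble a PDF from raw object bodies, numbered 1..N (hypothetical helper)."""
    out = bytearray(header)
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0: head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


minimal = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
])
```

Because offsets are captured while the bytes are appended, changing any object body (for example, adding a `/Version` key to the Catalog) keeps the xref table correct without manual recalculation.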

Each file was submitted to the PQ PDF Forensic Scanner, which runs six parsers in parallel inside isolated Linux namespaces and compares their output across seven structural dimensions. The parsers, the commands used to invoke them, and what each one extracts:

Parser	Binary	What it extracts
MuPDF	`mutool info` + `mutool show xref` + `mutool show trailer`	pages, objects, version, JS, AcroForm, OpenAction, encryption
Poppler	`pdfinfo` + `pdfdetach -list`	pages, version, JS, encryption, form type, embedded files, suspects
Ghostscript	`gs -sDEVICE=nullpage` (render pass)	pages rendered, JS triggered, OpenAction, Launch actions
qpdf	`qpdf --show-npages`, `--show-encryption`, `--check`	pages, encryption, linearization, version, structural integrity
pdfminer	Python subprocess (`pdfminer.six`)	pages, encryption, JS (Names tree), OpenAction, AcroForm
pdf.js / Node	`node -e` (raw byte regex scan)	pages (/Count), JS (/S /JavaScript), OpenAction, encryption

All six parsers ran on the same Linux server against the same file path. No network calls, no sandboxing differences. The only variable is the parser.

The Disagreement at a Glance

Before the eleven individual tests, here is the clearest single example: the demo PDF (Test 3 — 686 bytes, downloadable below). Six parsers. Same file. Same machine. Same instant. JavaScript presence: 3 say yes, 2 say no, 1 has no path to check.

Parser	Pages	PDF Version	JavaScript	Encrypted	AcroForm
MuPDF (`mutool info`)	1	1.4	No	No	No
Poppler (`pdfinfo`)	1	1.4	Yes	No	No
Ghostscript (nullpage render)	1	—	No	—	—
qpdf (`--show-npages --check`)	1	1.4	—	No	—
pdfminer (Python library)	1	—	Yes	No	No
pdf.js / Node (raw byte scan)	1	—	Yes	No	—

The file contains a JavaScript action object in the /Names/JavaScript catalog tree, with its payload compressed via FlateDecode. Poppler, pdfminer, and pdf.js find it through three independent paths. MuPDF and Ghostscript do not. A scanner built on either of those two would silently pass this file as clean. This is the core problem.
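The comparison behind a table like the one above can be sketched in a few lines. This is a minimal illustration, not the scanner's implementation: the function name `compare_dimension` and the result keys are assumptions, and `None` stands for a parser that has no path to check the dimension.

```python
from collections import Counter
from typing import Optional


def compare_dimension(observations: dict[str, Optional[str]]) -> dict:
    """Summarise one dimension across parsers; None means 'no path to check'."""
    reported = {p: v for p, v in observations.items() if v is not None}
    tally = Counter(reported.values())
    return {
        "disagreement": len(tally) > 1,  # more than one distinct answer
        "tally": dict(tally),
        "silent": sorted(p for p, v in observations.items() if v is None),
    }


# JavaScript column from the demo PDF table above.
js = compare_dimension({
    "mupdf": "none", "poppler": "JS", "ghostscript": "none",
    "qpdf": None, "pdfminer": "JS", "pdfjs_node": "JS",
})
```

Keeping "silent" parsers separate from "no" answers matters: a parser that cannot check a dimension should not be counted as a vote for "clean".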

Test 1 — PDF Version Header vs. Catalog Override

Technique

The PDF specification (ISO 32000-1 §7.5.2) states that if a document's Catalog dictionary contains a /Version entry, that version supersedes the %PDF-x.y header. We built a file where the header says %PDF-1.4 but the Catalog says /Version /1.7.

1 0 obj
<< /Type /Catalog /Pages 2 0 R /Version /1.7 >>
endobj

File size: 349 bytes. The %PDF-1.4 magic bytes appear at offset 0; the catalog /Version override appears at byte 51.

Scanner Output — Live Results

Parser	Reported Version	Source
MuPDF	1.7	Catalog `/Version` — spec-compliant
Poppler	1.7	Catalog `/Version` — spec-compliant
Ghostscript	—	Render-only; does not report version
qpdf	1.4	`%PDF-x.y` header — ignores Catalog override
pdfminer	—	Does not extract version
pdf.js / Node	—	Does not extract version

Scanner finding [medium]: “Differential Parsing: PDF Version Mismatch — MuPDF=1.7, Poppler=1.7, qpdf=1.4”

Why It Matters

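Version determines which PDF features a parser accepts. A file can claim PDF 1.4 in its header, satisfying a scanner that gates on version (for example, one that assumes XFA forms, which arrived in PDF 1.5, cannot be present), while carrying a /Version /1.7 Catalog that legitimizes XFA and other later features in the rendering engine. JavaScript actions predate PDF 1.4 (they were introduced in PDF 1.3), so a 1.4 header alone never rules out JavaScript. Risk score: 28.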
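The two version sources can be read side by side with a pair of regular expressions. This is a deliberately naive raw-byte sketch (the function name `versions` is an assumption); it would also match a `/Version` key inside a comment or stream, so it illustrates the divergence rather than a robust detector.

```python
import re
from typing import Optional


def versions(data: bytes) -> tuple[Optional[str], Optional[str]]:
    """Return (header version, Catalog /Version override) as seen in raw bytes."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    catalog = re.search(rb"/Version\s*/(\d\.\d)", data)
    return (
        header.group(1).decode() if header else None,
        catalog.group(1).decode() if catalog else None,
    )


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /Version /1.7 >>\nendobj\n"
)
header_version, catalog_version = versions(sample)
mismatch = catalog_version is not None and catalog_version != header_version
```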

Test 2 — /Count vs. Actual Page Nodes

Technique

The /Pages dictionary's /Count field is supposed to reflect the total number of leaf page nodes in the tree. We set /Count 3 while providing only two /Page objects in the /Kids array.

2 0 obj
<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 3 >>
endobj

File size: 432 bytes. Two /Page objects exist; /Count claims three.

Scanner Output — Live Results

Parser	Reported Pages	Strategy
MuPDF	3	Trusts `/Count` field
Poppler	3	Trusts `/Count` field
Ghostscript	2	Renders actual pages — counts “Page N” output lines
qpdf	3	Trusts `/Count` field
pdfminer	2	Iterates leaf `/Page` nodes — ignores `/Count`
pdf.js / Node	3	Regex match on first `/Count \d+` occurrence

Scanner finding [medium]: “Differential Parsing: Page Count Disagreement — MuPDF=3, Poppler=3, Ghostscript=2, qpdf=3, pdfminer=2, pdfjs_node=3”

Why It Matters

Scanners that report page count from /Count can be lied to about document length. More critically, this technique underlies shadow-page attacks: an incremental update can silently add pages whose xref entries are only partially visible to parsers that traverse the page tree structurally, while remaining fully rendered in browsers. Risk score: 35.
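The two counting strategies in the table can be contrasted directly: read the declared `/Count`, then walk `/Kids` and count leaf `/Page` objects. The sketch below is an assumption-laden illustration (regex-based object extraction, generation 0 only, no cycle protection, no object streams), not how any of the six parsers is implemented.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+0\s+obj(.*?)endobj", re.S)
PAGES_RE = re.compile(rb"/Type\s*/Pages\b")
PAGE_RE = re.compile(rb"/Type\s*/Page\b")  # \b keeps this from matching /Pages


def count_vs_leaves(data: bytes) -> tuple[int, int]:
    """Return (declared /Count on the root Pages node, leaf pages actually found)."""
    objs = {int(n): body for n, body in OBJ_RE.findall(data)}

    def leaves(num: int) -> int:
        body = objs.get(num, b"")
        if PAGES_RE.search(body):
            kids = re.search(rb"/Kids\s*\[(.*?)\]", body, re.S)
            refs = re.findall(rb"(\d+)\s+0\s+R", kids.group(1)) if kids else []
            return sum(leaves(int(r)) for r in refs)
        return 1 if PAGE_RE.search(body) else 0

    # Root Pages node: a /Pages dictionary with no /Parent entry.
    root = next(n for n, b in objs.items()
                if PAGES_RE.search(b) and b"/Parent" not in b)
    declared = int(re.search(rb"/Count\s+(\d+)", objs[root]).group(1))
    return declared, leaves(root)


sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 3 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)
declared, actual = count_vs_leaves(sample)
```

A mismatch between the two numbers is the same signal the Test 2 table shows as a 3-versus-2 split.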

Test 3 — JavaScript Hidden from Two of Six Parsers

Technique

JavaScript was placed in the document's /Names/JavaScript tree, referenced via an /OpenAction on the Catalog. The JavaScript payload was stored in a FlateDecode-compressed stream (object 7), with the action object referencing it as an indirect object.

1 0 obj
<< /Type /Catalog /Pages 2 0 R /Names 6 0 R /OpenAction 5 0 R >>

5 0 obj
<< /Type /Action /S /JavaScript /JS 7 0 R >>

6 0 obj
<< /JavaScript << /Names [(startup) 5 0 R] >> >>

7 0 obj
<< /Length 29 /Filter /FlateDecode >>
stream
[zlib-compressed: app.alert("PQPDF-differential-test");]
endstream

File size: 686 bytes.

Scanner Output — Live Results

Parser	Detects JS?	Detection path
MuPDF	✗ No	`mutool info` looks for “JavaScript: yes” in rendered info — not reported here
Poppler	✓ Yes	`pdfinfo` parses the Names tree and reports “JavaScript: yes”
Ghostscript	✗ No	Render output and stderr contain no “JavaScript” string
qpdf	—	qpdf's targeted commands do not expose JS presence
pdfminer	✓ Yes	Resolves `/Names/JavaScript` catalog entry via `resolve1()`
pdf.js / Node	✓ Yes	Regex `/\/S\s*\/JavaScript/` matches action object dictionary outside streams

Scanner finding [critical]: “Differential Parsing: JavaScript Visibility Discrepancy — MuPDF=none, Poppler=JS, Ghostscript=none, pdfminer=JS, pdfjs_node=JS”

Why It Matters

A scanner built on MuPDF or Ghostscript as its sole parser would report this file as JavaScript-free. The file would pass JavaScript-based policy gates, email filters, and DLP rules. Yet any tool that resolves the action the way Poppler, the pdf.js rendering engine (Firefox), or pdfminer does would see the JavaScript, and a JavaScript-capable viewer would execute it on open. (pdfminer itself is a text-extraction library and does not run scripts.) This is a single-parser scanner's blind spot made concrete. Risk score: 528. The scanner's other engines independently confirmed JavaScript via object analysis and YARA, which is exactly why differential parsing alone is insufficient.
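The layout in Test 3 separates two things a scanner might look for: the script text, which sits inside a FlateDecode stream, and the action dictionary, which stays in plain object syntax. The sketch below reproduces that split with Python's `zlib`; the object bytes are illustrative and are not the exact bytes of the 686-byte test file.

```python
import re
import zlib

payload = b'app.alert("PQPDF-differential-test");'
stream = zlib.compress(payload)

# Action dictionary stays uncompressed; the script body is deflated.
obj5 = b"5 0 obj\n<< /Type /Action /S /JavaScript /JS 7 0 R >>\nendobj\n"
obj7 = (
    b"7 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(stream)
    + stream
    + b"\nendstream\nendobj\n"
)
body = obj5 + obj7

payload_visible = b"app.alert" in body  # a string search for the script text
action_visible = re.search(rb"/S\s*/JavaScript", body) is not None
recovered = zlib.decompress(stream)     # what a parser sees after decoding
```

A keyword search for the script text comes up empty on the raw bytes, while the `/S /JavaScript` pattern still matches the action dictionary. That is why the raw-byte scan and the structural parsers can both reach "JavaScript present" by different routes.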

Test 4 — Encryption Oracle: Three Parsers Say Encrypted, One Says Clear

Technique

We placed a syntactically present but semantically null /Encrypt dictionary in the trailer, with /V 0 (encryption version zero — “unknown algorithm” per ISO 32000-1 §7.6). The file content itself is unencrypted; only the dictionary reference exists.

4 0 obj
<< /Filter /Standard /V 0 /R 2
   /O (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
   /U (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) /P -4 >>
endobj

trailer
<< /Size 5 /Root 1 0 R /Encrypt 4 0 R >>

File size: 501 bytes. The content streams are plaintext; only the /Encrypt reference exists.

Scanner Output — Live Results

Parser	Reports Encrypted?	Behavior
MuPDF	⚠ Yes (encrypted)	Sees `/Encrypt` in trailer — reports file as encrypted
Poppler	✓ No (clear)	Attempts parse without password; succeeds — reports “not encrypted”
Ghostscript	—	Render-only; no encryption status reported via nullpage device
qpdf	⚠ Yes (encrypted)	`--show-encryption` does not output “File is not encrypted” — reports encrypted
pdfminer	—	Encryption attribute indeterminate for V=0
pdf.js / Node	⚠ Yes (encrypted)	Regex `/\/Encrypt/` matches the trailer reference

Scanner finding [critical]: “Differential Parsing: Encryption Status Mismatch — MuPDF=encrypted, Poppler=clear, qpdf=encrypted, pdfjs_node=encrypted”

Why It Matters

Many scanners skip deep analysis of encrypted PDFs because they cannot decrypt streams without the key. A file that falsely presents as encrypted to the scanner, while remaining fully readable to the renderer, bypasses the scanner's content analysis entirely. The inverse is also exploitable: a file that hides its true encryption state so that it appears to be clear text can confuse parsers that skip decryption when they think the file is unencrypted. Risk score: 106.

Test 5 — One Parser Finds JavaScript That Five Others Miss

Technique

We built a base document containing object 5 as a benign /XObject /Subtype /Form, then appended an incremental update that redefines object 5 as a JavaScript action and adds a second JavaScript action as object 6. A conforming parser must use the last xref table, which makes the JavaScript action the “live” definition of object 5. However, since neither object is referenced from the document tree (/Catalog has no /OpenAction or /Names/JavaScript entry), structural parsers do not traverse to them.

[Base document: obj 5 = benign /XObject]
[Base xref: obj 5 → offset 0x128]
%%EOF

[Incremental update body:]
5 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert("hidden");) >>
endobj
6 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert("also hidden");) >>
endobj

[Update xref: obj 5 → new offset, obj 6 → new offset]
[Trailer: /Prev → base xref offset]
%%EOF

File size: 798 bytes. The JavaScript objects exist in the xref-indexed object space but are unreachable from /Root.

Scanner Output — Live Results

Parser	Detects JS?	Why
MuPDF	✗ No	Traverses document tree from /Root — no path to the orphan JS objects
Poppler	✗ No	Traverses document tree from /Root — no path to the orphan JS objects
Ghostscript	✗ No	Renders only reachable content — orphan objects not processed
qpdf	—	Targeted commands do not expose unreferenced object content
pdfminer	✗ No	Resolves catalog then traverses the tree — orphans not visited
pdf.js / Node	✓ Yes	Regex over raw byte stream after stripping `stream…endstream` blocks — matches `/S /JavaScript` in the incremental update body

Scanner finding [critical]: “Differential Parsing: JavaScript Visibility Discrepancy — MuPDF=none, Poppler=none, Ghostscript=none, pdfminer=none, pdfjs_node=JS”

Why It Matters

This is the sharpest result in the set: five structural parsers miss the JavaScript entirely, while the raw-byte regex scanner finds it. The JavaScript objects are real: they exist in the file's object space, properly indexed in the xref table. A reader vulnerability that makes unreferenced objects reachable (for example, a use-after-free triggered by incremental-update processing) could execute a payload that those parsers report today as “no JavaScript.” Only the raw-byte layer catches this class of staging attack. Risk score: 688.
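The raw-byte strategy described in the table (strip `stream…endstream` blocks, then match `/S /JavaScript`) can be approximated in Python as follows. This is a translation of the described behaviour under stated assumptions, not the scanner's Node.js source; the names `raw_js_scan`, `base`, `update`, and `decoy` are made up for the example.

```python
import re

STREAM_RE = re.compile(rb"stream\r?\n.*?endstream", re.S)
JS_RE = re.compile(rb"/S\s*/JavaScript")


def raw_js_scan(data: bytes) -> bool:
    """True if a JavaScript action keyword appears outside stream bodies."""
    no_streams = STREAM_RE.sub(b"", data)
    return JS_RE.search(no_streams) is not None


base = (
    b"5 0 obj\n<< /Type /XObject /Subtype /Form /Length 0 >>\n"
    b"stream\n\nendstream\nendobj\n%%EOF\n"
)
# Incremental update redefines object 5; nothing in the Catalog points at it.
update = base + (
    b'5 0 obj\n<< /Type /Action /S /JavaScript /JS (app.alert("hidden");) >>\n'
    b"endobj\n%%EOF\n"
)
# Keyword inside a stream body is removed before matching.
decoy = b"1 0 obj\n<< /Length 14 >>\nstream\n/S /JavaScript\nendstream\nendobj\n"
```

The scan does not care whether an object is reachable from `/Root`, which is exactly why it sees the orphan action that tree-walking parsers skip.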

Test 6 — AcroForm Invisible to Two of Three Comparing Parsers

Technique

The base document contains no form. An incremental update appends an /AcroForm object and a new Catalog revision that references it. Per the PDF specification, the updated Catalog supersedes the original — the document now has an AcroForm. Parsers that process only the base xref miss the update entirely.

[Base: /Catalog has no /AcroForm]
%%EOF

[Incremental update:]
4 0 obj << /Fields [] /DR <<>> /DA (/Helv 12 Tf 0 g) >> endobj  ← AcroForm
1 0 obj << /Type /Catalog /Pages 2 0 R /AcroForm 4 0 R >> endobj  ← updated Catalog

[Update xref: obj 1 → new Catalog, obj 4 → AcroForm]
%%EOF

File size: 582 bytes. The original Catalog remains in the base body; only parsers reading the update xref chain find the AcroForm.

Scanner Output — Live Results

Parser	Detects AcroForm?	Behavior
MuPDF	✗ No	`mutool info` reports “Forms: none” — does not expose the incremental AcroForm
Poppler	✓ Yes	`pdfinfo` reports “Form: AcroForm” — correctly reads the updated Catalog
Ghostscript	—	No form-type reporting in nullpage render output
qpdf	—	AcroForm presence not exposed via targeted fast commands
pdfminer	✗ No	Resolves Catalog but does not report AcroForm as visible feature
pdf.js / Node	—	AcroForm not in the pdf.js byte-scan dimension set

Scanner finding [medium]: “Differential Parsing: AcroForm Visibility Discrepancy — MuPDF=none, Poppler=AcroForm, pdfminer=none”

Why It Matters

AcroForm detection matters for two reasons. First, AcroForms enable auto-submit actions (/SubmitForm) that exfiltrate data on document open, a common phishing vector that DLP systems watch for. Second, XFA (the /XFA entry inside an AcroForm dictionary) can execute arbitrary FormCalc or JavaScript in Adobe Acrobat and certain enterprise readers. A scanner that reports “no form” based on MuPDF will miss an XFA auto-exec payload delivered via incremental update. Risk score: 111.

Part II — Keyword Injection vs. Structural Parsing

The first six tests targeted ambiguities in the PDF object model — fields where the specification leaves room for interpretation. These are structural disagreements: parsers disagree because they make different choices about which part of the object tree is authoritative.

The next five tests target a different class of vulnerability: the gap between a raw-byte regex scanner and a structural PDF parser. The pdf.js component of the scanner is implemented as a Node.js script that runs regex patterns over the raw file bytes (stripping stream bodies but not comment lines). This is significantly faster than full parsing and catches many real threats — but it also means the scanner can be fooled in both directions: keywords in comments or stream bodies that it finds but structural parsers ignore, or objects accessible only after ObjStm decompression that structural parsers find but the regex misses.

The following five tests demonstrate this empirically.

Test 7 — JavaScript in Page Additional Actions (/AA)

Technique

A JavaScript action was placed in a page's /AA (Additional Actions) dictionary under the /O key (on-open trigger). This is a legitimate PDF feature for triggering actions when a page is opened or closed. It is not in the /Names/JavaScript catalog entry that most parsers use as their primary JS detection path, and it is not in the document-level /OpenAction.

3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]
   /AA << /O << /Type /Action /S /JavaScript
                /JS (app.alert("aa-open");) >> >> >>
endobj

File size: 430 bytes. The JavaScript action sits entirely inside the page object dictionary, not in any stream body.

Scanner Output — Live Results

Parser	Detects JS?	Why
MuPDF	✗ No	`mutool info` checks “JavaScript: yes” string and `/JavaScript` in trailer — does not surface /AA actions
Poppler	✓ Yes	`pdfinfo` traverses page /AA dictionaries and reports JavaScript presence
Ghostscript	✗ No	Render output does not include “JavaScript” string; /AA not triggered in nullpage device
qpdf	—	Targeted commands do not expose JS presence
pdfminer	✗ No	Only checks `/Names/JavaScript` catalog entry — /AA is not traversed
pdf.js / Node	✓ Yes	Regex `/\/S\s*\/JavaScript/` matches the action dict in page object (not inside a stream body)

Scanner finding [critical]: “Differential Parsing: JavaScript Visibility Discrepancy — MuPDF=none, Poppler=JS, Ghostscript=none, pdfminer=none, pdfjs_node=JS”

Why It Matters

Poppler and pdf.js agree there is JavaScript, but for entirely different reasons: Poppler traverses the structural action tree; pdf.js finds the keyword in raw bytes. MuPDF and pdfminer, which are commonly used as the sole parser in security tools, both miss it. A scanner built on mutool info or pdfminer's catalog walker will report this file as JavaScript-free. Risk score: 498 — the other 47 engines independently found and confirmed the JavaScript via object analysis.

Test 8 — False /Count Injected in Content Stream Body

Technique

The pdf.js scanner searches for page count using a raw-byte regex (/\/Count\s+(\d+)/) on the unstripped file bytes, taking the first match in the file. We placed the content stream object — whose body contains /Count 99 as a PDF graphics operator argument — before the /Pages object in the file body. The real /Count 1 appears later in the file.

%PDF-1.4

4 0 obj  ← content stream appears FIRST in file
<< /Length 38 >>
stream
/Count 99 cm
BT /F1 12 Tf (Hello) Tj ET
endstream
endobj

2 0 obj  ← real /Pages dict appears later
<< /Type /Pages /Kids [3 0 R] /Count 1 >>

File size: 460 bytes. Structurally valid: one page, correct xref. The fake /Count is inside a stream body, which structural parsers ignore for page counting.

Scanner Output — Live Results

Parser	Reported Pages	Strategy
MuPDF	1	Reads xref → /Pages → /Count field at correct offset
Poppler	1	Reads xref → /Pages → /Count field at correct offset
Ghostscript	1	Renders one page, counts “Page N” output lines
qpdf	1	`--show-npages` uses the structural page tree
pdfminer	1	Iterates leaf `/Page` nodes from the tree
pdf.js / Node	99	Raw regex hits `/Count 99` in stream body first — takes that value

Scanner finding [critical]: “Differential Parsing: Page Count Disagreement — MuPDF=1, Poppler=1, Ghostscript=1, qpdf=1, pdfminer=1, pdfjs_node=99”

Why It Matters

This test shows the inverse of the structural-parsing blind spots above: the raw-byte scanner reports data that is completely wrong and that no structural parser accepts. A scanner reporting 99 pages for a 1-page file produces false positives in page-count anomaly detection, masks the actual page count in reports, and could exhaust downstream systems that allocate resources proportional to page count. The page delta of 98 drives the indicator to critical severity with the maximum score bonus. Risk score: 153.
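The first-match behaviour is easy to reproduce. The sketch below runs the same `/Count` pattern twice, once on the unstripped bytes and once after removing stream bodies; the stripped variant is an assumed mitigation for this one case, not something the text above says the scanner does.

```python
import re

data = (
    b"%PDF-1.4\n"
    b"4 0 obj\n<< /Length 38 >>\nstream\n/Count 99 cm\n"
    b"BT /F1 12 Tf (Hello) Tj ET\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
)

COUNT_RE = re.compile(rb"/Count\s+(\d+)")

# First match in file order on unstripped bytes: lands inside the stream body.
naive = int(COUNT_RE.search(data).group(1))

# Same pattern after dropping stream...endstream blocks.
stripped = re.sub(rb"stream\r?\n.*?endstream", b"", data, flags=re.S)
guarded = int(COUNT_RE.search(stripped).group(1))
```

Stripping streams removes this particular decoy, but a first-match regex still depends on file order rather than on the xref chain.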

Test 9 — OpenAction JavaScript Added via Incremental Update

Technique

The base document is clean: a three-object PDF with no actions. An incremental update appends a JavaScript action object and a new Catalog revision that sets /OpenAction to point at it. This is the same update-chain mechanism as Test 6 (AcroForm), applied to JavaScript and OpenAction instead.

[Base document: /Catalog has no /OpenAction]
%%EOF

[Incremental update:]
4 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert("openaction-hidden");) >>
endobj
1 0 obj
<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>  ← new Catalog
endobj

[xref2: obj 1 → new Catalog offset, obj 4 → JS action offset]
[trailer2: /Prev → base xref]
%%EOF

File size: 608 bytes.

Scanner Output — Live Results

Parser	Detects JS?	Detects OpenAction?
MuPDF	✗ No	No
Poppler	✗ No	—
Ghostscript	✗ No	No
qpdf	—	—
pdfminer	✗ No	No
pdf.js / Node	✓ Yes	Yes

Scanner finding [critical]: “Differential Parsing: JavaScript Visibility Discrepancy — MuPDF=none, Poppler=none, Ghostscript=none, pdfminer=none, pdfjs_node=JS”

Why It Matters

The OpenAction points to a JavaScript action in the incremental update — the canonical malware delivery mechanism for PDF auto-exec exploits. Yet five of six parsers report no JavaScript. Only the raw-byte scanner finds it, because /S /JavaScript appears in the update body's raw object text. The structural parsers’ failure here (they should follow the update chain to the new Catalog and traverse /OpenAction) suggests their JS detection paths only check specific catalog keys, not all action references. Risk score: 648.
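One cheap way to surface this pattern is to split the file at each `%%EOF` and check which security-relevant keys first appear after the base revision. The sketch below is a raw-byte heuristic with assumed names (`revisions`, `late_additions`) and an assumed key list; it inherits the keyword-injection weaknesses of any raw scan and does not replace following the xref chain.

```python
def revisions(data: bytes) -> list[bytes]:
    """Split a file into the byte ranges that precede each %%EOF marker."""
    return [part for part in data.split(b"%%EOF") if part.strip()]


def late_additions(
    data: bytes,
    keys: tuple[bytes, ...] = (b"/OpenAction", b"/AcroForm", b"/JavaScript"),
) -> list[tuple[int, str]]:
    """List (revision number, key) pairs for keys seen in revision 2 or later."""
    found = []
    for idx, rev in enumerate(revisions(data)[1:], start=2):
        for key in keys:
            if key in rev:
                found.append((idx, key.decode()))
    return found


sample = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"...base xref and trailer elided...\n%%EOF\n"
    b"4 0 obj\n<< /Type /Action /S /JavaScript /JS (app.alert(1);) >>\nendobj\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>\nendobj\n"
    b"...update xref and trailer elided...\n%%EOF\n"
)
added = late_additions(sample)
```

A clean base revision followed by an update that introduces `/OpenAction` and a JavaScript action is the shape of this test file, and it is worth flagging regardless of what any single parser reports.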

Test 10 — Dual %%EOF: First /Count Wins in Raw Scan

Technique

The file contains two complete and syntactically valid PDF document structures, each with its own xref table, trailer, and %%EOF marker. The first document describes a 2-page file; the second describes a 1-page file. Standard PDF parsing requires reading from the physical end of the file to find the last %%EOF, then following the startxref backward. Five parsers do this correctly and load the 1-page document. pdf.js’s raw regex for page count picks up the first /Count it encounters in file order, which belongs to the 2-page document.

[2-page document]
2 0 obj << /Type /Pages /Kids [3 0 R 5 0 R] /Count 2 >>  ← first /Count in file
...
startxref [offset of 2-page xref]
%%EOF

[1-page document starts here]
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >>  ← second /Count in file
...
startxref [offset of 1-page xref]
%%EOF  ← last %%EOF — correct parsers start here

File size: 772 bytes.

Scanner Output — Live Results

Parser	Reported Pages	Parsing strategy
MuPDF	1	Reads from last %%EOF → last xref → 1-page /Pages dict
Poppler	1	Reads from last %%EOF → last xref → 1-page /Pages dict
Ghostscript	1	Renders the 1-page document found from last %%EOF
qpdf	1	`--show-npages` uses last xref
pdfminer	1	Loads document from last trailer, finds 1-page tree
pdf.js / Node	2	Raw regex first `/Count \d+` match hits the 2-page document’s `/Count 2`

Scanner finding [medium]: “Differential Parsing: Page Count Disagreement — MuPDF=1, Poppler=1, Ghostscript=1, qpdf=1, pdfminer=1, pdfjs_node=2”

Why It Matters

This is the complement to Test 8: in Test 8, a fake count in a stream body fools the raw-byte scanner. Here, a completely valid PDF structure at an earlier file position fools it. Both attacks exploit the same root cause: the raw-byte scanner reads linearly while structural parsers read the xref chain. Dual-%%EOF files are also a polyglot technique — the file is simultaneously two different valid PDFs, which can confuse signature-based detection and archival tools. Risk score: 90.
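The structural entry point can be sketched as a search backward from the end of the file for the last `startxref`. The byte offsets in the sample are illustrative placeholders rather than the offsets of the 772-byte test file, and the helper names are assumptions.

```python
def last_startxref(data: bytes) -> int:
    """Return the xref offset named by the final startxref keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    # The offset is the first whitespace-delimited token after the keyword.
    return int(data[pos + len(b"startxref"):].split()[0])


def eof_count(data: bytes) -> int:
    """Count %%EOF markers; more than one means updates or stacked documents."""
    return data.count(b"%%EOF")


dual = (
    b"%PDF-1.4\n...2-page document...\nstartxref\n120\n%%EOF\n"
    b"...1-page document...\nstartxref\n455\n%%EOF\n"
)
entry = last_startxref(dual)
markers = eof_count(dual)
```

A parser that starts from `entry` never consults the earlier structure, while a linear scan meets the earlier structure first. Counting `%%EOF` markers is a simple way to know that the two views may differ.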

Test 11 — /Encrypt Keyword in a PDF Comment Line

Technique

A PDF comment line (any line beginning with %, per ISO 32000-1 §7.2.4) is syntactically invisible to all parsers — they discard everything from % to the end of the line. The pdf.js scanner, however, runs its encryption detection as a raw regex on the full byte stream: /\/Encrypt/.test(noStreams). The variable noStreams strips stream…endstream blocks but does not strip comment lines. We placed /Encrypt inside a comment.

%PDF-1.4
%âãÏÓ
% Document policy: /Encrypt /Standard /V 4 /R 4 /P -3904

1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
...
trailer << /Size 4 /Root 1 0 R >>   ← no /Encrypt here
startxref
...
%%EOF

File size: 392 bytes. The trailer has no /Encrypt entry. The file is structurally unencrypted.

Scanner Output — Live Results

Parser	Reports Encrypted?	Source
MuPDF	No (clear)	Checks `/Encrypt` in trailer — not present
Poppler	No (clear)	`pdfinfo` parses trailer, reports “Encrypted: no”
Ghostscript	—	No encryption status in render output
qpdf	No (clear)	`--show-encryption` outputs “File is not encrypted”
pdfminer	No (clear)	Document encryption attribute is None / False
pdf.js / Node	⚠ Yes (encrypted)	Regex `/\/Encrypt/` matches the comment line — comment not stripped

Scanner finding [critical]: “Differential Parsing: Encryption Status Mismatch — MuPDF=clear, Poppler=clear, qpdf=clear, pdfminer=clear, pdfjs_node=encrypted”

Why It Matters

This is a pure keyword-injection false positive: the raw-byte scanner reports encryption that does not exist. An attacker can use this to trigger defensive behaviors in systems that skip content analysis of encrypted files (because they cannot decrypt them), while the content itself is fully readable. Conversely, a benign file can be misreported as encrypted simply by including a comment that mentions encryption policy in plain text. Risk score: 96.
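The false positive disappears if comments are removed before the keyword test. The sketch below shows both behaviours; it assumes stream bodies have already been stripped and that no literal string contains a `%`, since a `%` inside a string or stream is data, not a comment. The function name `has_encrypt_keyword` is hypothetical.

```python
import re

COMMENT_RE = re.compile(rb"%[^\r\n]*")  # from % to end of line
ENCRYPT_RE = re.compile(rb"/Encrypt\b")


def has_encrypt_keyword(data: bytes, strip_comments: bool) -> bool:
    """Keyword test for /Encrypt, optionally ignoring comment text."""
    text = COMMENT_RE.sub(b"", data) if strip_comments else data
    return ENCRYPT_RE.search(text) is not None


sample = (
    b"%PDF-1.4\n"
    b"% Document policy: /Encrypt /Standard /V 4 /R 4 /P -3904\n"
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"trailer << /Size 4 /Root 1 0 R >>\n"
    b"%%EOF\n"
)
raw_hit = has_encrypt_keyword(sample, strip_comments=False)
clean_hit = has_encrypt_keyword(sample, strip_comments=True)
```

Checking the trailer dictionary itself remains the more reliable test; comment stripping only narrows the gap between the raw scan and the structural parsers.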

Summary: All Eleven Tests

Part I — Structural ambiguities (spec interpretation differences between parsers)

Test	Technique	Split	Severity	Risk
1 — Header vs. Catalog version	`%PDF-1.4` header / `/Version /1.7` Catalog	MuPDF/Poppler=1.7 • qpdf=1.4	medium	28
2 — /Count vs. actual pages	`/Count 3` with 2 Kids	4 parsers=3 • GS/pdfminer=2	medium	35
3 — JS in compressed Names tree	`/Names/JavaScript` via FlateDecode stream	Poppler/pdfminer/pdfjs=JS • MuPDF/GS=none	critical	528
4 — Null encryption (/V 0)	`/Encrypt /V 0` in trailer	MuPDF/qpdf/pdfjs=enc • Poppler=clear	critical	106
5 — Orphan JS via incremental update	obj 5 redefined as JS action, not in doc tree	pdfjs=JS • MuPDF/Poppler/GS/qpdf/pdfminer=none	critical	688
6 — AcroForm via incremental update	AcroForm added in update, base Catalog clean	Poppler=AcroForm • MuPDF/pdfminer=none	medium	111

Part II — Keyword injection (raw-byte scanner vs. structural parsers)

Test	Technique	Split	Severity	Risk
7 — JS in page /AA	JavaScript in `/AA /O` page action	Poppler/pdfjs=JS • MuPDF/GS/pdfminer=none	critical	498
8 — False /Count in stream body	`/Count 99` in content stream before real /Pages	pdfjs=99 pages • all others=1	critical	153
9 — OpenAction via incremental update	`/OpenAction` JS added in update body	pdfjs=JS • MuPDF/Poppler/GS/pdfminer=none	critical	648
10 — Dual %%EOF page confusion	2-page doc before 1-page doc; raw scan hits first	pdfjs=2 pages • all others=1	medium	90
11 — /Encrypt keyword in comment	`% /Encrypt …` in a PDF comment line	pdfjs=encrypted • MuPDF/Poppler/qpdf/pdfminer=clear	critical	96

Every file produced a confirmed cross-parser disagreement. Seven of eleven triggered a critical-severity indicator. File sizes ranged from 349 to 798 bytes. No PDF library was used to construct any test file — all were built from raw bytes with byte-accurate cross-reference tables.

Why This Matters Beyond the Lab

The results above are not a critique of any individual parser. Each parser is doing something reasonable given the specification’s ambiguities and its own engineering priorities. MuPDF is a rendering library optimized for speed. Poppler’s pdfinfo is a metadata tool. Ghostscript is a PostScript interpreter with PDF rendering bolted on. Each has a different view of “what the document is.” The security problem is using any one of them as the sole arbiter of document content.

Security Scanning

Any scanner that passes a file as “no JavaScript” or “not encrypted” based on a single parser is making a claim the data does not support. The correct claim is “no JavaScript visible to this parser.” Attackers who understand parser-specific behavior can construct PDFs that are simultaneously malicious to the renderer and benign to the scanner — using nothing more than knowledge of the specification and which parser makes which choice at each ambiguity. No exploit code required.

RAG Pipelines and AI Ingestion

Retrieval-augmented generation (RAG) systems, document AI platforms, and enterprise knowledge bases ingest PDFs using a single parser — often pdfminer or MuPDF — to extract text. When that parser sees a different page tree than the rendering engine, the extracted text diverges from what a human reader sees. Test 2 demonstrates this: four parsers report 3 pages, two report 2. An AI system ingesting 2 pages indexes different content than a human reading 3. In adversarial contexts, this gap can be exploited to inject content that is invisible to the AI indexer but visible to the human reader — or vice versa.

Fraud Detection and Legal Review

Incremental update attacks (demonstrated in Tests 5, 6, and 9) are a known vector for document fraud: a signed PDF is modified via an incremental update without invalidating the digital signature. If the forensic tool examining the document uses a parser that ignores the update chain, it reports the original (unmodified) content as authoritative. Poppler and pdf.js follow the update chain; MuPDF and pdfminer may not surface the update’s additions in their standard output modes. A legal review relying on MuPDF-extracted text would examine a different document than one viewed in a Poppler-based reader.

Compliance and DLP

Data loss prevention systems and compliance gateways classify PDFs based on content analysis. If the DLP system uses a parser that reports “no AcroForm” (Test 6) but the recipient’s reader opens an AcroForm with an auto-submit exfiltration action, the DLP gateway lets it pass. If the system uses a parser that reports “encrypted” for a file with a null /V 0 encryption dictionary (Test 4) and skips deep content analysis of encrypted files, the plaintext content bypasses all content inspection. Parser choice is policy, whether the DLP vendor acknowledges it or not.

Standards Context: Where ISO 32000 Leaves Behavior Undefined

These findings are not bugs in individual parsers. They are consequences of genuine ambiguities and underspecified behaviors in the PDF specification itself. Understanding where the standard leaves room for interpretation is essential for anyone contributing to parser development, archival workflows, or accessibility tooling.

ISO 32000-1 and ISO 32000-2

ISO 32000-1:2008 (PDF 1.7) and ISO 32000-2:2020 (PDF 2.0) specify the PDF format. Several of the ambiguities demonstrated above trace directly to the standard:

Version resolution (§7.5.2): The standard states that a /Version entry in the document Catalog “shall override” the file header version. Parsers differ on whether this override applies to all version-gated feature checks or only to conformance declarations. Test 1 shows qpdf reading 1.4 while MuPDF and Poppler read 1.7 from the same file.
Page tree /Count field (§7.7.3.2): The standard defines /Count as “the number of leaf nodes (page objects) that are descendants of this node.” It does not specify what a conforming reader must do when /Count disagrees with the actual number of descendants. Test 2 shows a 4-2 parser split on this.
Incremental update processing (§7.5.6): The standard is clear that the last cross-reference section defines the current object states, but does not define the minimum set of incremental revision chain entries a conforming reader must process when extracting metadata. Tests 5, 6, and 9 expose this gap.
Encryption dictionary (§7.6): The standard defines /V 0 as “an algorithm that is undocumented and no longer supported.” It does not specify whether a conforming reader encountering /V 0 should treat the document as encrypted or as an error. Test 4 shows a 3-1 split on this exact question.

PDF/A and PDF/UA Implications

PDF/A (ISO 19005) mandates that conforming files be renderable consistently across readers without external dependencies. Parser disagreement on page count or document structure violates the spirit of this guarantee: an archival tool validating a PDF/A file using MuPDF may report it as conforming while a Poppler-based reader renders additional pages from an incremental update that MuPDF does not surface.

PDF/UA (ISO 14289) requires that the logical reading order and document structure be accessible to assistive technology. When a screen reader’s underlying PDF parser reports different page content than the visual renderer, accessibility compliance becomes parser-dependent rather than document-dependent. A document that passes PDF/UA validation against one parser may fail accessibility requirements in the rendering stack used by the actual reader.

A Note for PDF Association and Adobe Reviewers

The disagreements demonstrated here do not represent parser bugs — they represent specification gaps that reasonable implementations fill differently. The appropriate resolution is not to fix individual parsers but to add normative language to ISO 32000 and ISO 19005 that specifies required behavior for malformed or ambiguous inputs. Areas that would benefit from tightened language: the precedence of /Count vs. structural traversal; the semantics of /V 0 encryption dictionaries; the minimum update-chain depth that conforming readers must process for metadata extraction; and the normative treatment of /Version catalog overrides in feature-gating contexts.

Three Things Disagreement Does Not Mean

Parser disagreement is a forensic signal, not a verdict. Before using differential analysis operationally, three common misreadings need to be corrected.

1. Disagreement Is Not Evidence of Attack

Most cross-parser disagreements in production PDF traffic come from benign sources: export software that writes technically non-conformant but harmless files, old PDF generators with known quirks, document repair tools that add non-standard recovery structures, and interoperability edge cases between PDF/A workflows and generic readers. The 11 test files above were constructed to isolate disagreement patterns — in practice, most disagreements arrive packaged with enough clean context to score conservatively.

The correct interpretation is:

disagreement = elevated forensic interest — not confirmed compromise

The risk score reflects the combination of the disagreement with what the other 46 engines observe. A disagreement on page count in a file with clean metadata, no entropy anomalies, a known producer string, and a valid digital signature scores low. The same disagreement in a file with stripped metadata, high-entropy streams, and no producer scores high. The differential finding is a multiplier, not a standalone accusation.

2. Aggressive Differential Analysis Can Be Noisy

If every cross-parser discrepancy triggered a high-severity alert, enterprise adoption would collapse under false positive volume. PDF interoperability is genuinely messy: Microsoft Office export, Adobe Acrobat, LibreOffice, Chrome’s built-in print-to-PDF, and various SaaS document platforms all produce files with at least one quirk that at least one parser handles differently. A scanner that treats all of those as critical findings is not useful.

Signal quality depends on three things that are harder than detection itself:

Scoring calibration — weighting disagreements by dimension (JS visibility disagreement is qualitatively different from a version-string discrepancy).
Context from corroborating engines — a disagreement that no other engine corroborates scores near zero; one that five engines corroborate scores multiplicatively higher.
Explainability — analysts need to know which parsers disagreed on what value, not just that a score crossed a threshold. The differential engine outputs per-dimension parser value sets precisely so that a human reviewer can reproduce the disagreement independently.

Operational tuning remains an ongoing process. The scoring weights in the table above reflect current calibration against real-world traffic; they are not immutable constants.

3. Output Normalization Across Parsers Is a Genuinely Hard Engineering Problem

Each of the six parsers produces output in a different format with different semantics. MuPDF’s mutool info emits human-readable key-value text. Poppler’s pdfinfo emits different key-value text with different field names. qpdf emits structured JSON with its own schema. Ghostscript does not emit document metadata at all — it renders, and the scanner infers structure from render success/failure and page output. pdfminer exposes a Python object model. The Node.js subprocess runs raw regex on bytes.
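As a rough illustration of the first step, key-value text of the kind `pdfinfo` prints can be split into a dictionary before any field mapping happens. The sample text is hand-written to resemble `pdfinfo` output, and the helper name `parse_key_value` is hypothetical; real output carries more fields, and each of the other parsers needs its own handling.

```python
from typing import Dict

def parse_key_value(text: str) -> Dict[str, str]:
    """Split 'Key:   value' lines at the first colon; skip lines without one."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

# Hand-written sample in the style of pdfinfo output (not captured from a real run).
sample = "\n".join([
    "CreationDate:    Mon Jan 1 10:00:00 2024",
    "Pages:           3",
    "JavaScript:      yes",
    "PDF version:     1.4",
])
parsed = parse_key_value(sample)
print(parsed["Pages"], parsed["JavaScript"], parsed["PDF version"])
```

Splitting at the first colon only is a deliberate choice here: values such as timestamps contain colons of their own.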

Normalizing these into a comparable set of dimensions requires:

Field mapping across six distinct output schemas
Type normalization (booleans expressed as “yes”, “true”, “1”, or implied by field presence vs. absence)
Handling parser-specific absences: a parser that does not report a dimension should be excluded from comparisons for that dimension, not treated as reporting a null value
Accounting for different repair behaviors: when a parser encounters a malformed xref, it may silently repair it and return a value, return an error, or return a partial result — all three need different handling

This normalization layer is not glamorous and does not lend itself to clean demos. But it is the prerequisite for everything else on this page. Getting it wrong produces spurious disagreements that have nothing to do with the PDF — they are artifacts of schema mismatches between parsers. Getting it right requires treating each parser as an unreliable witness that must be cross-examined rather than believed.
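Two of the requirements above, type normalization and the exclusion of non-reporting parsers, can be sketched in a few lines. This is an illustration under stated assumptions, not the scanner's code: the token sets and the names `normalize_bool` and `distinct_values` are hypothetical, and the "implied by field presence" case is left out because it depends on each parser's schema.

```python
from typing import Dict, Optional, Set

TRUE_TOKENS = {"yes", "true", "1"}
FALSE_TOKENS = {"no", "false", "0"}

def normalize_bool(raw: Optional[str]) -> Optional[bool]:
    """Map a parser's textual boolean to True/False; None means 'not reported'."""
    if raw is None:
        return None
    token = raw.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None  # unrecognized token: treat as not reported rather than guess

def distinct_values(reports: Dict[str, Optional[str]]) -> Set[bool]:
    """Distinct normalized values, excluding parsers that did not report."""
    normalized = (normalize_bool(value) for value in reports.values())
    return {value for value in normalized if value is not None}

# Three spellings of "true" plus one parser that reports nothing for this dimension.
reports = {"poppler": "yes", "qpdf": "true", "pdfminer": "1", "ghostscript": None}
print(distinct_values(reports))
```

Because the non-reporting parser is dropped rather than mapped to False, the example yields a single value and no spurious disagreement.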

The Same Architectural Weakness Breaks AI Ingestion

The specification ambiguities documented above do not respect the boundary between security tooling and AI tooling. The same structural properties of a PDF that allow malicious content to hide from a security scanner also allow content to hide from — or be misrepresented to — a RAG pipeline, an OCR system, or an enterprise document ingestion stack. The failure modes are different, but the root cause is identical: a single parser is being asked to render a definitive account of a document whose structure is genuinely ambiguous.

The Structural Symmetry

In security contexts, the relevant question is: does this file contain malicious content? If two parsers disagree, one is missing the content and the attacker wins. In AI ingestion contexts, the relevant question is: what does this document say? If two parsers disagree, the knowledge base indexed by one pipeline reflects a different document than the one a human reader opens. Neither failure requires an attacker. Both follow from the same set of specification gaps shown in Tests 1 through 11.

Test	Security failure mode	AI ingestion failure mode
Test 2: /Count lie	Scanner clears a multi-page attack doc as single-page	RAG pipeline indexes 2 pages; reader opens 3; knowledge base is incomplete
Test 3: JS in compressed stream	JS hidden from structural parsers → scanner misses auto-exec payload	Any text in compressed streams invisible to extractors that skip stream bodies
Test 5: Incremental update	Post-signature content addition bypasses audit	Ingestion pipeline indexes original version; updated content never reaches the index
Test 6: AcroForm in update	Form with auto-submit exfiltration passes DLP	Form fields and their values missed by extractors that don't follow update chains
Test 11: Keyword in comment	Raw-byte scanner falsely reports encryption; file skipped	Extraction tool treating comments as metadata ingests false metadata alongside document text

Why Scientific, Legal, and Patent PDFs Break Parsers Harder

Most RAG benchmarks use clean, well-formed PDFs. Production document corpora don’t. The hardest categories for parsers are precisely the categories most valuable for enterprise AI:

Scientific papers (arXiv, journal PDFs): Multi-column layouts cause different parsers to extract text in different reading orders. MuPDF’s column detection, pdfminer’s position-sorted extraction, and PyPDF’s stream-order extraction all produce different token sequences from the same two-column paper. The LLM downstream receives different context.
Patent documents (USPTO, EPO): Patent PDFs frequently combine scanned page images with a selectable-text overlay added by the patent office OCR pipeline. Parsers diverge on whether to read the overlay text, attempt independent OCR on the image, or report no text content at all. The same patent can produce 4,000 words from one extractor and zero from another.
Legal contracts with redactions: Some redaction tools draw black rectangles over existing text objects without removing the underlying text from the PDF structure. A parser that reads the visual layer sees a redacted document. A parser that reads the object tree sees the original unredacted text. This is not hypothetical: it has affected court filings in publicized cases.
Digitally signed PDFs: Signature validation locks specific byte ranges. Tools that process the full file rather than the signed byte range can see content that technically post-dates the signature. Ingestion pipelines that do not reconstruct the revision chain index a different document than the one the signature covers.

Adversarial Documents: The Prompt Injection Angle

Parser disagreement creates a natural split-view attack surface for AI pipelines. A PDF where the visual rendering layer (what the human reviewer sees in Adobe Acrobat) shows benign content while the extraction layer (what the RAG pipeline’s pdfminer or PyMuPDF call returns) contains additional or different content is a prompt injection vector that requires no exploit code — only knowledge of the specification and which parser makes which choice at each ambiguity. Test 3 demonstrates a minimal version of this: the human opening the file in a viewer sees an empty page; parsers that follow the /Names/JavaScript tree see JavaScript code.

The adversarial content does not have to be JavaScript. It can be natural-language instruction text embedded in a compressed stream, an incremental update, or a content layer that only some parsers surface. If that text reaches the LLM’s context window and the human reviewer’s visual inspection did not reveal it, the pipeline has been injected.

Multi-Parser Verification as an AI Pipeline Primitive

The same multi-parser comparison approach that flags disagreements as a security signal can serve as a completeness check for AI ingestion. If three parsers extract 1,400 words from a PDF and a fourth extracts 1,900, the 500-word gap is worth investigating before the document reaches the index. The investigation does not have to be automated to be valuable: surfacing "parsers disagree on content length" as a flag for human review is qualitatively better than silently ingesting whatever one parser happened to return.

Parser agreement is a weak signal that extraction is complete. Parser disagreement is a strong signal that at least one parser is wrong. Neither guarantees correctness, but even the weak signal is better than flying blind with a single extractor.
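The completeness check described in this section can be sketched as a comparison of word counts across extractors. The 10% tolerance and the function name `needs_review` are assumptions chosen for illustration; whitespace splitting is a crude stand-in for whatever tokenization a real pipeline uses.

```python
from typing import Dict

def needs_review(texts: Dict[str, str], max_gap_ratio: float = 0.10) -> bool:
    """Flag a document when extractors disagree on word count by more than the tolerance."""
    counts = [len(text.split()) for text in texts.values()]
    largest = max(counts)
    if largest == 0:
        return False  # every extractor returned nothing; a different problem
    return (largest - min(counts)) / largest > max_gap_ratio

# Synthetic texts mirroring the example above: three extractors at 1,400 words, one at 1,900.
texts = {
    "parser_a": "w " * 1400,
    "parser_b": "w " * 1400,
    "parser_c": "w " * 1400,
    "parser_d": "w " * 1900,
}
print(needs_review(texts))
```

The flag only says that a human or a slower check should look at the file before it is indexed; it does not say which extractor is right.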

How PQPDF Resolves Parser Disagreement

Detecting a disagreement is the easy part. The harder question is: which parser is correct? PQPDF does not attempt to declare a single parser the winner. Instead, it treats the disagreement itself as a security signal and uses multi-layer consensus to determine the risk weight of each discrepancy.

1. Parser Consensus and Signal Weighting

The differential parsing engine collects output from all six parsers across seven structural dimensions: page count, JavaScript presence, encryption status, PDF version, AcroForm presence, embedded file count, and object count. For each dimension, it records the set of distinct values reported. A dimension with one value across all parsers is consistent. A dimension with two or more distinct values is flagged.

Dimension	Discrepancy threshold	Severity
Page count	Any difference between parsers; score scales with delta magnitude	medium → critical
JavaScript presence	At least one reports JS, at least one does not	critical (+50 score)
Encryption status	Any difference between parsers	critical (+40 score)
PDF version	Any difference between parsers	medium (+10 score)
AcroForm presence	Any difference between parsers	medium (+15 score)
Embedded file count	Any difference between parsers	medium → high
Object count	>10% relative difference between parsers	medium (+15 score)

Structural parsers (MuPDF, Poppler, qpdf) are considered higher-confidence sources for version and xref validity. The raw-byte scanner (pdf.js / Node) is treated as a broad-net detector: if it sees a keyword, the keyword exists in the file regardless of structural context — that matters independently of whether structural parsers agree.
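The consensus step can be sketched as follows. The fixed increments are copied from the table above; page count and embedded file count are omitted because the table gives only a severity range for them. Everything else is an assumption for illustration: the dimension keys, the function names, and the choice to measure the object-count difference relative to the largest reported value.

```python
from typing import Dict, Set

# Fixed increments from the table; scaled dimensions are intentionally left out.
FIXED_SCORES = {"javascript": 50, "encryption": 40, "pdf_version": 10, "acroform": 15}
OBJECT_COUNT_SCORE = 15
OBJECT_COUNT_THRESHOLD = 0.10  # ">10% relative difference"

def flagged_dimensions(reports: Dict[str, Dict[str, object]]) -> Dict[str, Set[object]]:
    """Per dimension, the set of distinct values, kept only when parsers disagree."""
    values: Dict[str, Set[object]] = {}
    for parser_output in reports.values():
        # A parser that omits a dimension simply contributes nothing to it.
        for dimension, value in parser_output.items():
            values.setdefault(dimension, set()).add(value)
    return {dim: vals for dim, vals in values.items() if len(vals) > 1}

def differential_score(reports: Dict[str, Dict[str, object]]) -> int:
    flagged = flagged_dimensions(reports)
    score = sum(points for dim, points in FIXED_SCORES.items() if dim in flagged)
    counts = flagged.get("object_count")
    if counts and (max(counts) - min(counts)) / max(counts) > OBJECT_COUNT_THRESHOLD:
        score += OBJECT_COUNT_SCORE
    return score

# Hypothetical parser outputs, not results from the tests in this article.
reports = {
    "poppler":  {"javascript": True,  "pdf_version": "1.4", "object_count": 9},
    "mupdf":    {"javascript": False, "pdf_version": "1.4", "object_count": 9},
    "pdfminer": {"javascript": True,  "pdf_version": "1.4", "object_count": 7},
}
print(flagged_dimensions(reports))
print(differential_score(reports))
```

Returning the full value set per dimension, rather than a boolean, keeps the output explainable: a reviewer can see which values were in dispute.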

2. xref Structural Validation

qpdf --check produces a structural validity verdict independent of content analysis. A finding of “structural integrity intact” from qpdf combined with a JS visibility discrepancy between other parsers tells the correlation engine something specific: the discrepancy is not due to a malformed file that parsers repair differently — the xref is valid, so the disagreement is about interpretation of a structurally sound document. This is scored differently from a case where qpdf also reports xref errors, which suggests the parsers are resolving a genuinely malformed file.
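A minimal sketch of this distinction, assuming qpdf's documented exit statuses (0 for no problems, 3 for warnings only, 2 for errors). The verdict labels and the function names are hypothetical, and `qpdf_check` needs the `qpdf` binary on the PATH, so the example calls only the pure functions.

```python
import subprocess

def classify_qpdf_exit(code: int) -> str:
    """Translate a qpdf exit status into a coarse structural verdict."""
    return {0: "clean", 3: "warnings", 2: "errors"}.get(code, "unknown")

def qpdf_check(path: str, timeout_s: float = 10.0) -> str:
    """Run `qpdf --check` on a file and classify the result (requires qpdf installed)."""
    proc = subprocess.run(["qpdf", "--check", path], capture_output=True, timeout=timeout_s)
    return classify_qpdf_exit(proc.returncode)

def interpret_discrepancy(js_discrepancy: bool, qpdf_verdict: str) -> str:
    """Separate 'parsers interpret a sound file differently' from 'parsers repair a broken file'."""
    if not js_discrepancy:
        return "no-discrepancy"
    if qpdf_verdict == "clean":
        return "interpretation-disagreement"
    if qpdf_verdict in ("warnings", "errors"):
        return "repair-disagreement"
    return "inconclusive"

print(interpret_discrepancy(True, classify_qpdf_exit(0)))
print(interpret_discrepancy(True, classify_qpdf_exit(2)))
```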

3. Incremental Update Reconstruction

For documents with incremental revisions, the scanner uses qpdf --json (selectively) and pdfminer’s trailer chain traversal to reconstruct the full revision history. Each revision’s object additions and replacements are recorded separately. A JavaScript action object that appears only in revision 2 (after a signature on revision 1) is flagged not just as “has JavaScript” but as “post-signature JS insertion” — a specific pattern associated with signature bypass attacks. The revision history feeds Engine 21 (Signature Forensics) for cross-correlation. Engine 21 now also validates DocMDP certification constraints against incremental updates — including /P permission level verification, /Contents structural integrity (all-zero placeholder detection, DER SEQUENCE header check), and ByteRange coverage integrity (must start at offset 0; inner gap must contain only the /Contents blob).
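The ByteRange coverage rules named above lend themselves to a short sketch. It assumes the four ByteRange integers have already been extracted and that /Contents is a hex string with no internal whitespace; the function name `byte_range_problems` is hypothetical, and Engine 21's actual checks are not shown in this article.

```python
from typing import List, Sequence

HEX_DIGITS = b"0123456789abcdefABCDEF"

def byte_range_problems(data: bytes, byte_range: Sequence[int]) -> List[str]:
    """Return human-readable problems with a signature /ByteRange; empty if none found."""
    if len(byte_range) != 4:
        return ["ByteRange must contain exactly four integers"]
    start1, len1, start2, len2 = byte_range
    problems: List[str] = []
    if start1 != 0:
        problems.append("first range does not start at offset 0")
    gap_start = start1 + len1
    if not (0 <= gap_start < start2 <= len(data)):
        problems.append("ranges are out of order or outside the file")
        return problems
    gap = data[gap_start:start2]      # bytes excluded from the signed digest
    interior = gap[1:-1]
    if not (gap.startswith(b"<") and gap.endswith(b">")
            and all(c in HEX_DIGITS for c in interior)):
        problems.append("gap holds more than one hex-string /Contents blob")
    elif interior and not interior.strip(b"0"):
        problems.append("/Contents is an all-zero placeholder")
    if start2 + len2 < len(data):
        problems.append("bytes follow the signed range (later revision?)")
    return problems

# Synthetic byte strings standing in for a signed file.
signed = b"AAAA<00ff>BBBB"
print(byte_range_problems(signed, [0, 4, 10, 4]))
print(byte_range_problems(signed + b"%later", [0, 4, 10, 4]))
```

Bytes after the signed range are reported as a finding rather than an error, since a legitimate incremental save produces the same pattern as a post-signature insertion.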

4. Multi-Engine Correlation (Engine 43)

Differential parsing findings are one input to Engine 43 (Correlation Engine), which cross-references all 47 engine outputs. A JavaScript visibility discrepancy alone scores conservatively. The same discrepancy combined with a high-entropy compressed stream (Engine 7), a missing /Producer metadata field (Engine 6), and a qpdf xref reconstruction note (Engine 11) — confirmed by three independent engines — scores multiplicatively higher. This compound-indicator design is why the risk scores in the tests above range from 28 to 688: the differential finding is a multiplier on the other signals, not a standalone verdict.

5. Hard Isolation and Timeout

All six parsers run inside isolated Linux namespaces (separate network, PID, and mount spaces). A hard 30-second SIGALRM wraps the entire engine; each individual parser subprocess has its own 5–12 second timeout. pdfminer runs as a subprocess specifically because a Python thread cannot be forcibly terminated from another thread, whereas a separate process can be hard-killed the moment it exceeds its timeout. No malformed PDF can stall the scan, block the job queue, or cause a resource exhaustion denial-of-service through the differential parsing engine.
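The per-parser timeout can be sketched with the standard library alone. This covers only the timeout, not namespace isolation or the engine-wide SIGALRM; the wrapper name `run_parser` and the result shape are assumptions, and the child command is a stand-in for a hung parser.

```python
import subprocess
import sys
from typing import Dict, List

def run_parser(cmd: List[str], timeout_s: float) -> Dict[str, object]:
    """Run one parser command; subprocess.run kills the child if the timeout expires."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": b""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout}

# A child that sleeps far longer than its budget stands in for a parser stuck on a malformed file.
hung = run_parser([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5)
quick = run_parser([sys.executable, "-c", "print('done')"], timeout_s=10)
print(hung["status"], quick["status"])
```

A timeout is returned as its own status rather than folded into "error", so that a parser that hangs can be treated as not reporting instead of as reporting a value.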

Download the Demo PDF

The primary demonstration file (Test 3) is available for independent verification. This is the exact binary that produced the scanner results above — byte-for-byte identical, unmodified.

Property	Value
File	`parser-disagreement-demo.pdf`
Size	686 bytes
MD5	`4607e660f3d14fbb1978ce191c8b4080`
PDF version	1.4 (header)
Technique	JavaScript in `/Names/JavaScript` via FlateDecode-compressed stream + `/OpenAction`
Expected JS split	Poppler=yes • pdfminer=yes • pdf.js=yes • MuPDF=no • Ghostscript=no

↓ Download parser-disagreement-demo.pdf (686 bytes)
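Before running the reproduction commands, it is worth confirming that the downloaded bytes match the published size and MD5. A minimal sketch using only the standard library; the function name `matches_published` is hypothetical, and the two constants are copied from the table above.

```python
import hashlib

EXPECTED_SIZE = 686
EXPECTED_MD5 = "4607e660f3d14fbb1978ce191c8b4080"

def matches_published(data: bytes) -> bool:
    """True only if both the length and the MD5 digest match the published values."""
    return len(data) == EXPECTED_SIZE and hashlib.md5(data).hexdigest() == EXPECTED_MD5
```

Typical use is to read the file in binary mode and pass the bytes in, for example `matches_published(open("parser-disagreement-demo.pdf", "rb").read())`. MD5 is adequate here as a transfer-integrity check against the value in the table; it is not a defence against a deliberately crafted collision.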

Reproduction commands:

```sh
# Poppler: reports "JavaScript: yes"
pdfinfo parser-disagreement-demo.pdf

# MuPDF: reports nothing for JavaScript
mutool info parser-disagreement-demo.pdf

# Ghostscript: no JS in output
gs -dNOPAUSE -dBATCH -sDEVICE=nullpage parser-disagreement-demo.pdf 2>&1 | grep -i javascript

# pdfminer: sees JS via the Names tree
python3 -c "
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
with open('parser-disagreement-demo.pdf','rb') as f:
    d=PDFDocument(PDFParser(f))
    nm=resolve1(d.catalog).get('Names')
    print('JavaScript' in (resolve1(nm) if nm else {}))
"

# pdf.js (Node): raw regex matches /S /JavaScript outside stream bodies
node -e "
const r=require('fs').readFileSync('parser-disagreement-demo.pdf','latin1');
const ns=r.replace(/\bstream[\s\S]*?endstream\b/g,' ');
console.log(/\/Type\s*\/JavaScript|\/S\s*\/JavaScript/.test(ns));
"
```

The JavaScript payload in the file is app.alert("PQPDF-differential-test"); — a display-only alert with no side effects. The file does not contact any network endpoint and contains no executable code outside the PDF JavaScript sandbox.

Use This in Your Workflow

The eleven PDFs and the methodology above are designed to be used as a validation suite for any scanner, pipeline, or tool that processes PDF files. Here is how different teams can put this research to work directly.

Role	Action	What to look for
Security engineers	Download the demo PDF and run it through your scanner	Does your tool report JavaScript? If not, its parser most likely does not follow the /Names/JavaScript path
DFIR analysts	Run a suspicious PDF through all six parsers using the reproduction commands above	Any dimension where parsers disagree is a structural anomaly worth investigating — especially JS presence and encryption status
RAG / AI builders	Compare your ingestion parser’s page count and text output against a second parser on the same file	Page count delta ≥1 means your index may be missing content visible to the user’s viewer
DLP / gateway engineers	Submit Test 4 (null `/V 0` encryption) and Test 11 (`/Encrypt` in comment) to your gateway	Does the gateway skip content inspection because it thinks the file is encrypted? That’s a policy bypass
PDF tool developers	Run all 11 test files through your parser and record output for each dimension	Compare against the table above — any column where your parser joins the minority is a spec interpretation worth documenting

If your tool reports “no JavaScript” for a file where three of the six parsers in this study report it, you have a documented blind spot. That is not a failure; it is a known limitation of single-parser design. The question is whether your threat model accounts for it.

Scan Your PDFs with Multi-Parser Analysis

The PQ PDF Forensic Scanner runs differential parsing on every scan — no configuration required. All six parsers run in parallel. Every disagreement is flagged, scored, and fed into the 47-engine correlation layer. Upload any PDF and see what each parser reports, where they disagree, and what the disagreement implies for security risk.

→ PDF Forensics Scanner — 47 Engines, Multi-Parser, Free

No account. No file retention. Differential parsing, behavioral sandbox, YARA, ClamAV, offline threat intelligence (6.4M+ indicators), and AI synthesis — all running on the same file in parallel.