Why Multi-Parser PDF Security Is Not Optional
Most PDF security scanners, whether antivirus, email gateway, or DLP system, rely on a single PDF parser. That parser is the scanner’s entire view of reality. If the parser that renders the file in the victim’s browser reports a different reality than the scanner’s parser, the malicious content is invisible to the scanner by construction. This is the PDF parser disagreement problem.
This is not a theoretical risk. PDF parser confusion has been exploited in documented malware campaigns. The problem persists because the PDF specification contains genuine ambiguities — fields where the spec either contradicts itself, gives parsers discretion, or fails to define behavior for malformed input.
The eleven tests below are not proof-of-concept exploits. They are minimal reproducible demonstrations of real parsing divergence, verified against live production scanners.
Methodology
We wrote eleven PDFs by hand, one per test, in Python, without using any PDF library. Each file was built from raw bytes with byte-accurate cross-reference tables, targeting a single ambiguity per file. Files range from 349 to 798 bytes. No fonts, no images, no embedded content beyond the specific structural feature under test.
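The construction approach described above can be sketched as follows. This is a minimal illustration, not necessarily the generator used for the test files: the helper name `build_pdf` and the three-object document are assumptions made for the example.

```python
def build_pdf(objects, header=b"%PDF-1.4\n"):
    """Serialize objects numbered 1..n plus a classic xref table, all from raw bytes."""
    out = bytearray(header)
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"               # object 0: head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off          # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Hypothetical one-page document: Catalog, Pages, Page.
pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
])
```

Because offsets are recorded as the buffer grows, editing any object body cannot leave the cross-reference table stale, which matters when a test depends on every parser accepting the file as structurally valid.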
Each file was submitted to the PQ PDF Forensic Scanner, which runs six parsers in parallel inside isolated Linux namespaces and compares their output across seven structural dimensions. Parser versions used:
| Parser | Binary | What it extracts |
|---|---|---|
| MuPDF | mutool info + mutool show xref + mutool show trailer | pages, objects, version, JS, AcroForm, OpenAction, encryption |
| Poppler | pdfinfo + pdfdetach -list | pages, version, JS, encryption, form type, embedded files, suspects |
| Ghostscript | gs -sDEVICE=nullpage (render pass) | pages rendered, JS triggered, OpenAction, Launch actions |
| qpdf | qpdf --show-npages, --show-encryption, --check | pages, encryption, linearization, version, structural integrity |
| pdfminer | Python subprocess (pdfminer.six) | pages, encryption, JS (Names tree), OpenAction, AcroForm |
| pdf.js / Node | node -e (raw byte regex scan) | pages (/Count), JS (/S /JavaScript), OpenAction, encryption |
All six parsers ran on the same Linux server against the same file path. No network calls, no sandboxing differences. The only variable is the parser.
The Disagreement at a Glance
Before the eleven individual tests, here is the clearest single example: the demo PDF (Test 3 — 686 bytes, downloadable below). Six parsers. Same file. Same machine. Same instant. JavaScript presence: 3 say yes, 2 say no, 1 has no path to check.
| Parser | Pages | PDF Version | JavaScript | Encrypted | AcroForm |
|---|---|---|---|---|---|
| MuPDF (mutool info) | 1 | 1.4 | No | No | No |
| Poppler (pdfinfo) | 1 | 1.4 | Yes | No | No |
| Ghostscript (nullpage render) | 1 | — | No | — | — |
| qpdf (--show-npages --check) | 1 | 1.4 | — | No | — |
| pdfminer (Python library) | 1 | — | Yes | No | No |
| pdf.js / Node (raw byte scan) | 1 | — | Yes | No | — |
The file contains a JavaScript action object
in the /Names/JavaScript catalog tree, with its payload compressed
via FlateDecode. Poppler, pdfminer, and pdf.js find it through
three independent paths. MuPDF and Ghostscript do not. A scanner built on
either of those two would silently pass this file as clean. This is
the core problem.
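The comparison itself is simple to express. The sketch below groups parsers by the value each reported for one dimension, using the JavaScript column of the table above; the function name `split_by_value` is hypothetical, and `None` stands for a parser with no path to check.

```python
from collections import defaultdict


def split_by_value(reports):
    """Group parsers by reported value; None means 'no path to check' and is skipped."""
    groups = defaultdict(list)
    for parser, value in reports.items():
        if value is not None:
            groups[value].append(parser)
    return dict(groups)


# JavaScript column from the demo PDF table.
js_reports = {
    "MuPDF": False,
    "Poppler": True,
    "Ghostscript": False,
    "qpdf": None,
    "pdfminer": True,
    "pdf.js": True,
}

groups = split_by_value(js_reports)
disagreement = len(groups) > 1   # True: two distinct answers were reported
```

Keeping the per-value parser lists, rather than a single boolean, lets a reviewer see which parsers sit on each side of the split.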
Test 1 — PDF Version Header vs. Catalog Override
Technique
The PDF specification (ISO 32000-1 §7.5.2) states that if a document's Catalog
dictionary contains a /Version entry, that version supersedes the
%PDF-x.y header. We built a file where the header says
%PDF-1.4 but the Catalog says /Version /1.7.
1 0 obj
<< /Type /Catalog /Pages 2 0 R /Version /1.7 >>
endobj
File size: 349 bytes. The %PDF-1.4 magic bytes appear at offset 0; the catalog /Version override appears at byte 51.
Scanner Output — Live Results
| Parser | Reported Version | Source |
|---|---|---|
| MuPDF | 1.7 | Catalog /Version — spec-compliant |
| Poppler | 1.7 | Catalog /Version — spec-compliant |
| Ghostscript | — | Render-only; does not report version |
| qpdf | 1.4 | %PDF-x.y header — ignores Catalog override |
| pdfminer | — | Does not extract version |
| pdf.js / Node | — | Does not extract version |
Why It Matters
Version determines which PDF features a parser accepts. A file can claim PDF 1.4
compliance (no JavaScript, no XFA) in its header — satisfying a version-gating
scanner — while carrying a /Version /1.7 Catalog that legitimizes
JavaScript and full AcroForm XFA support in the rendering engine. Risk score: 28.
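A scanner can at least surface both version claims side by side. The following is a raw-byte sketch, not a structural parse: the helper names are hypothetical, and the regex would also match a /Version name that appears outside the Catalog.

```python
import re


def header_version(data):
    """Version from the %PDF-x.y header at the start of the file."""
    m = re.match(rb"%PDF-(\d\.\d)", data)
    return m.group(1).decode() if m else None


def catalog_version(data):
    """First /Version name found in the raw bytes (naive: not tied to the Catalog)."""
    m = re.search(rb"/Version\s*/(\d\.\d)", data)
    return m.group(1).decode() if m else None


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /Version /1.7 >>\nendobj\n"
)

versions = (header_version(sample), catalog_version(sample))
mismatch = versions[0] != versions[1]   # True for the Test 1 layout
```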
Test 2 — /Count vs. Actual Page Nodes
Technique
The /Pages dictionary's /Count field is supposed to
reflect the total number of leaf page nodes in the tree. We set
/Count 3 while providing only two /Page objects in the
/Kids array.
2 0 obj
<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 3 >>
endobj
File size: 432 bytes. Two /Page objects exist; /Count claims three.
Scanner Output — Live Results
| Parser | Reported Pages | Strategy |
|---|---|---|
| MuPDF | 3 | Trusts /Count field |
| Poppler | 3 | Trusts /Count field |
| Ghostscript | 2 | Renders actual pages — counts “Page N” output lines |
| qpdf | 3 | Trusts /Count field |
| pdfminer | 2 | Iterates leaf /Page nodes — ignores /Count |
| pdf.js / Node | 3 | Regex match on first /Count \d+ occurrence |
Why It Matters
Scanners that report page count from /Count can be lied to about
document length. More critically, this technique underlies shadow-page attacks:
an incremental update can silently add pages whose xref entries are only partially
visible to parsers that traverse the page tree structurally, while remaining fully
rendered in browsers. Risk score: 35.
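The two counting strategies can be contrasted in a few lines. This is a simplified raw-byte sketch under stated assumptions: it counts every /Type /Page dictionary in the file rather than walking /Kids, so orphaned page objects would inflate the result, and the helper names are hypothetical.

```python
import re


def declared_count(data):
    """Value of the first /Count entry (what a /Count-trusting reader reports)."""
    m = re.search(rb"/Count\s+(\d+)", data)
    return int(m.group(1)) if m else None


def leaf_pages(data):
    """Number of /Type /Page dictionaries; the lookahead excludes /Type /Pages."""
    return len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data))


sample = (
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 3 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)

declared, actual = declared_count(sample), leaf_pages(sample)   # 3 and 2
```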
Test 3 — JavaScript Hidden from Two of Six Parsers
Technique
JavaScript was placed in the document's /Names/JavaScript tree,
referenced via an /OpenAction on the Catalog. The JavaScript payload
was stored in a FlateDecode-compressed stream (object 7), with the
action object referencing it as an indirect object.
1 0 obj
<< /Type /Catalog /Pages 2 0 R /Names 6 0 R /OpenAction 5 0 R >>
5 0 obj
<< /Type /Action /S /JavaScript /JS 7 0 R >>
6 0 obj
<< /JavaScript << /Names [(startup) 5 0 R] >> >>
7 0 obj
<< /Length 29 /Filter /FlateDecode >>
stream
[zlib-compressed: app.alert("PQPDF-differential-test");]
endstream
File size: 686 bytes.
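A compressed stream object of this kind can be produced with Python's standard `zlib` module. The sketch below is illustrative only: `flate_stream_object` is a hypothetical helper, and it computes /Length from the actual compressed size rather than reproducing the value shown above.

```python
import zlib


def flate_stream_object(num, payload):
    """Wrap payload in a FlateDecode stream object with a matching /Length."""
    compressed = zlib.compress(payload)
    return (
        b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n"
        % (num, len(compressed))
        + compressed
        + b"\nendstream\nendobj\n"
    )


obj7 = flate_stream_object(7, b'app.alert("PQPDF-differential-test");')
```

Once compressed, the script text typically no longer appears verbatim in the file, so only the /S /JavaScript keyword in the uncompressed action dictionary remains visible to a raw-byte scan.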
Scanner Output — Live Results
| Parser | Detects JS? | Detection path |
|---|---|---|
| MuPDF | ✗ No | mutool info looks for “JavaScript: yes” in rendered info — not reported here |
| Poppler | ✓ Yes | pdfinfo parses the Names tree and reports “JavaScript: yes” |
| Ghostscript | ✗ No | Render output and stderr contain no “JavaScript” string |
| qpdf | — | qpdf's targeted commands do not expose JS presence |
| pdfminer | ✓ Yes | Resolves /Names/JavaScript catalog entry via resolve1() |
| pdf.js / Node | ✓ Yes | Regex /\/S\s*\/JavaScript/ matches action object dictionary outside streams |
Why It Matters
A scanner built on MuPDF or Ghostscript as its sole parser would report this file as JavaScript-free. The file would pass JavaScript-based policy gates, email filters, and DLP rules. Yet a viewer whose engine resolves the /OpenAction and supports JavaScript, such as the pdf.js renderer in Firefox, could execute the script on open. This is a single-parser scanner's blind spot made concrete. Risk score: 528. The scanner's other engines independently confirmed JavaScript via object analysis and YARA, which is exactly why differential parsing alone is insufficient.
Test 4 — Encryption Oracle: Three Parsers Say Encrypted, One Says Clear
Technique
We placed a syntactically present but semantically null /Encrypt
dictionary in the trailer, with /V 0 (encryption version zero —
“unknown algorithm” per ISO 32000-1 §7.6). The file content itself is
unencrypted; only the dictionary reference exists.
4 0 obj
<< /Filter /Standard /V 0 /R 2
/O (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
/U (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) /P -4 >>
endobj
trailer
<< /Size 5 /Root 1 0 R /Encrypt 4 0 R >>
File size: 501 bytes. The content streams are plaintext; only the /Encrypt reference exists.
Scanner Output — Live Results
| Parser | Reports Encrypted? | Behavior |
|---|---|---|
| MuPDF | ⚠ Yes (encrypted) | Sees /Encrypt in trailer — reports file as encrypted |
| Poppler | ✓ No (clear) | Attempts parse without password; succeeds — reports “not encrypted” |
| Ghostscript | — | Render-only; no encryption status reported via nullpage device |
| qpdf | ⚠ Yes (encrypted) | --show-encryption does not output “File is not encrypted” — reports encrypted |
| pdfminer | — | Encryption attribute indeterminate for V=0 |
| pdf.js / Node | ⚠ Yes (encrypted) | Regex /\/Encrypt/ matches the trailer reference |
Why It Matters
Many scanners skip deep analysis of encrypted PDFs because they cannot decrypt streams without the key. A file that falsely presents as encrypted to the scanner, while remaining fully readable to the renderer, bypasses the scanner's content analysis entirely. The inverse is also exploitable: a file that hides its true encryption state and appears to be clear text can confuse parsers that skip decryption when they believe the file is unencrypted. Risk score: 106.
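One defensive response is to report the /Encrypt reference and its /V value separately instead of collapsing them into a single "encrypted" flag. The sketch below assumes a classic (non-stream) trailer and an indirect /Encrypt reference; `encrypt_status` is a hypothetical helper, not part of any parser named here.

```python
import re


def encrypt_status(data):
    """Return ('absent', None) or ('present', V), where V is the dictionary's /V value."""
    start = data.rfind(b"trailer")
    trailer = data[start:] if start >= 0 else b""
    ref = re.search(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R", trailer)
    if not ref:
        return ("absent", None)
    # Locate the referenced object body by its "N G obj ... endobj" wrapper.
    pattern = rb"(?<!\d)%d\s+%d\s+obj(.*?)endobj" % (int(ref.group(1)), int(ref.group(2)))
    obj = re.search(pattern, data, re.S)
    v = re.search(rb"/V\s+(\d+)", obj.group(1)) if obj else None
    return ("present", int(v.group(1)) if v else None)


sample = (
    b"4 0 obj\n<< /Filter /Standard /V 0 /R 2\n"
    b"/O (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)\n"
    b"/U (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) /P -4 >>\nendobj\n"
    b"trailer\n<< /Size 5 /Root 1 0 R /Encrypt 4 0 R >>\n"
)

status = encrypt_status(sample)   # ('present', 0)
```

A policy layer could then treat `('present', 0)` as a reason to continue content analysis rather than skip it; that policy choice is a suggestion, not something the test results prescribe.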
Test 5 — One Parser Finds JavaScript That Five Others Miss
Technique
We built a base document containing object 5 as a benign
/XObject /Subtype /Form, then appended an incremental update that
redefines object 5 as a JavaScript action and adds a second JavaScript action as
object 6. A conforming parser must use the last xref table — making the JS
objects the “live” definition of object 5. However, since neither
object is referenced from the document tree (/Catalog has no
/OpenAction or /Names/JavaScript entry), structural
parsers do not traverse to them.
[Base document: obj 5 = benign /XObject]
[Base xref: obj 5 → offset 0x128]
%%EOF
[Incremental update body:]
5 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert("hidden");) >>
endobj
6 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert("also hidden");) >>
endobj
[Update xref: obj 5 → new offset, obj 6 → new offset]
[Trailer: /Prev → base xref offset]
%%EOF
File size: 798 bytes. The JavaScript objects exist in the xref-indexed object space but are unreachable from /Root.
Scanner Output — Live Results
| Parser | Detects JS? | Why |
|---|---|---|
| MuPDF | ✗ No | Traverses document tree from /Root — no path to the orphan JS objects |
| Poppler | ✗ No | Traverses document tree from /Root — no path to the orphan JS objects |
| Ghostscript | ✗ No | Renders only reachable content — orphan objects not processed |
| qpdf | — | Targeted commands do not expose unreferenced object content |
| pdfminer | ✗ No | Resolves catalog then traverses the tree — orphans not visited |
| pdf.js / Node | ✓ Yes | Regex over raw byte stream after stripping stream…endstream blocks — matches /S /JavaScript in the incremental update body |
Why It Matters
This is the sharpest result in the set: none of the five structural parsers surfaces the JavaScript (four report none, and qpdf's targeted commands have no path to check), while the raw-byte regex scanner finds it. The JavaScript objects are real: they exist in the file's object space, properly indexed in the xref table. A vulnerability in a PDF reader that allows unreferenced objects to be executed (for example, a use-after-free triggered by incremental-update processing that makes “orphan” objects reachable) would activate a payload that those parsers report today as “no JavaScript.” Only the raw-byte layer catches this class of staging attack. Risk score: 688.
Test 6 — AcroForm Invisible to Two of Three Comparing Parsers
Technique
The base document contains no form. An incremental update appends an
/AcroForm object and a new Catalog revision that references it.
Per the PDF specification, the updated Catalog supersedes the original —
the document now has an AcroForm. Parsers that process only the base xref
miss the update entirely.
[Base: /Catalog has no /AcroForm]
%%EOF
[Incremental update:]
4 0 obj << /Fields [] /DR <<>> /DA (/Helv 12 Tf 0 g) >> endobj ← AcroForm
1 0 obj << /Type /Catalog /Pages 2 0 R /AcroForm 4 0 R >> endobj ← updated Catalog
[Update xref: obj 1 → new Catalog, obj 4 → AcroForm]
%%EOF
File size: 582 bytes. The original Catalog remains in the base body; only parsers reading the update xref chain find the AcroForm.
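The update mechanism can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the generator used for the test file: `append_update` is a hypothetical helper, it assumes a classic xref table and that /Root stays at object 1, and the base document here is a stand-in.

```python
import re


def append_update(base, new_objects):
    """Append an incremental update; new_objects maps object number -> body bytes."""
    prev = int(re.findall(rb"startxref\s+(\d+)", base)[-1])   # offset of the base xref
    size = int(re.findall(rb"/Size\s+(\d+)", base)[-1])
    out = bytearray(base)
    if not out.endswith(b"\n"):
        out += b"\n"
    offsets = {}
    for num in sorted(new_objects):
        offsets[num] = len(out)
        out += b"%d 0 obj\n" % num + new_objects[num] + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n"
    for num in sorted(offsets):                   # one single-entry subsection per object
        out += b"%d 1\n%010d 00000 n \n" % (num, offsets[num])
    new_size = max(size, max(offsets) + 1)
    out += b"trailer\n<< /Size %d /Root 1 0 R /Prev %d >>\n" % (new_size, prev)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Stand-in base document with a hand-assembled xref table.
body = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n"
)
xref = b"xref\n0 4\n0000000000 65535 f \n" + b"".join(
    b"%010d 00000 n \n" % body.index(b"%d 0 obj" % n) for n in (1, 2, 3)
)
base = body + xref + b"trailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % len(body)

updated = append_update(base, {
    4: b"<< /Fields [] /DR << >> /DA (/Helv 12 Tf 0 g) >>",      # AcroForm
    1: b"<< /Type /Catalog /Pages 2 0 R /AcroForm 4 0 R >>",     # replacement Catalog
})
```

The base bytes are left untouched: the update only appends, and the new trailer's /Prev entry links back to the original cross-reference table. A reader that stops at the first xref it understands sees the form-free Catalog.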
Scanner Output — Live Results
| Parser | Detects AcroForm? | Behavior |
|---|---|---|
| MuPDF | ✗ No | mutool info reports “Forms: none” — does not expose the incremental AcroForm |
| Poppler | ✓ Yes | pdfinfo reports “Form: AcroForm” — correctly reads the updated Catalog |
| Ghostscript | — | No form-type reporting in nullpage render output |
| qpdf | — | AcroForm presence not exposed via targeted fast commands |
| pdfminer | ✗ No | Resolves Catalog but does not report AcroForm as visible feature |
| pdf.js / Node | — | AcroForm not in the pdf.js byte-scan dimension set |
Why It Matters
AcroForm detection matters for two reasons. First, AcroForms enable auto-submit
actions (/SubmitForm) that exfiltrate data on document open, a common
phishing vector that DLP systems watch for. Second, XFA (the /XFA entry in
an AcroForm dictionary) can execute arbitrary FormCalc or JavaScript in Adobe Acrobat and
certain enterprise readers. A scanner that reports “no form” based on
MuPDF will miss an XFA auto-exec payload delivered via incremental update.
Risk score: 111.
Part II — Keyword Injection vs. Structural Parsing
The first six tests targeted ambiguities in the PDF object model — fields where the specification leaves room for interpretation. These are structural disagreements: parsers disagree because they make different choices about which part of the object tree is authoritative.
The next five tests target a different class of vulnerability: the gap between a raw-byte regex scanner and a structural PDF parser. The pdf.js component of the scanner is implemented as a Node.js script that runs regex patterns over the raw file bytes (stripping stream bodies but not comment lines). This is significantly faster than full parsing and catches many real threats — but it also means the scanner can be fooled in both directions: keywords in comments or stream bodies that it finds but structural parsers ignore, or objects accessible only after ObjStm decompression that structural parsers find but the regex misses.
The following five tests demonstrate this empirically.
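For readers who want to reproduce the raw-byte behavior without Node, here is a Python approximation of the scan as described above. Only the three regexes quoted in this article are used; the function name, the stream-stripping pattern, and the return shape are assumptions, and the production script may differ.

```python
import re

# Strips stream bodies by keyword; comment lines are deliberately left in place.
STREAM = re.compile(rb"stream\r?\n.*?endstream", re.S)


def raw_scan(data):
    """Keyword scan over raw bytes: no xref lookup, no object resolution."""
    no_streams = STREAM.sub(b"", data)
    count = re.search(rb"/Count\s+(\d+)", data)   # first match, unstripped bytes
    return {
        "javascript": bool(re.search(rb"/S\s*/JavaScript", no_streams)),
        "encrypted": bool(re.search(rb"/Encrypt", no_streams)),
        "pages": int(count.group(1)) if count else None,
    }


sample = (
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]\n"
    b"/AA << /O << /Type /Action /S /JavaScript\n"
    b'/JS (app.alert("aa-open");) >> >> >>\nendobj\n'
)

result = raw_scan(sample)
```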
Test 7 — JavaScript in Page Additional Actions (/AA)
Technique
A JavaScript action was placed in a page's /AA
(Additional Actions) dictionary under the /O key (on-open trigger).
This is a legitimate PDF feature for triggering actions when a page is opened or
closed. It is not in the /Names/JavaScript catalog entry that most
parsers use as their primary JS detection path, and it is not in the
document-level /OpenAction.
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]
/AA << /O << /Type /Action /S /JavaScript
/JS (app.alert("aa-open");) >> >> >>
endobj
File size: 430 bytes. The JavaScript action sits entirely inside the page object dictionary, not in any stream body.
Scanner Output — Live Results
| Parser | Detects JS? | Why |
|---|---|---|
| MuPDF | ✗ No | mutool info checks “JavaScript: yes” string and /JavaScript in trailer — does not surface /AA actions |
| Poppler | ✓ Yes | pdfinfo traverses page /AA dictionaries and reports JavaScript presence |
| Ghostscript | ✗ No | Render output does not include “JavaScript” string; /AA not triggered in nullpage device |
| qpdf | — | Targeted commands do not expose JS presence |
| pdfminer | ✗ No | Only checks /Names/JavaScript catalog entry — /AA is not traversed |
| pdf.js / Node | ✓ Yes | Regex /\/S\s*\/JavaScript/ matches the action dict in page object (not inside a stream body) |
Why It Matters
Poppler and pdf.js agree there is JavaScript, but for entirely different reasons:
Poppler traverses the structural action tree; pdf.js finds the keyword in raw bytes.
MuPDF and pdfminer, which are commonly used as the sole parser in security tools,
both miss it. A scanner built on mutool info or pdfminer's catalog
walker will report this file as JavaScript-free. Risk score: 498. The scanner's
other engines independently found and confirmed the JavaScript via object analysis.
Test 8 — False /Count Injected in Content Stream Body
Technique
The pdf.js scanner searches for page count using a raw-byte regex
(/\/Count\s+(\d+)/) on the unstripped file bytes, taking the
first match in the file. We placed the content stream object
— whose body contains /Count 99 as a PDF graphics operator
argument — before the /Pages object in the file body.
The real /Count 1 appears later in the file.
%PDF-1.4
4 0 obj ← content stream appears FIRST in file
<< /Length 38 >>
stream
/Count 99 cm
BT /F1 12 Tf (Hello) Tj ET
endstream
endobj
2 0 obj ← real /Pages dict appears later
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
File size: 460 bytes. Structurally valid: one page, correct xref. The fake /Count is inside a stream body, which structural parsers ignore for page counting.
Scanner Output — Live Results
| Parser | Reported Pages | Strategy |
|---|---|---|
| MuPDF | 1 | Reads xref → /Pages → /Count field at correct offset |
| Poppler | 1 | Reads xref → /Pages → /Count field at correct offset |
| Ghostscript | 1 | Renders one page, counts “Page N” output lines |
| qpdf | 1 | --show-npages uses the structural page tree |
| pdfminer | 1 | Iterates leaf /Page nodes from the tree |
| pdf.js / Node | 99 | Raw regex hits /Count 99 in stream body first — takes that value |
Why It Matters
This test shows the inverse of the structural-parsing blind spots above: the raw-byte scanner reports data that is completely wrong and that no structural parser accepts. A scanner reporting 99 pages for a 1-page file produces false positives in page-count anomaly detection, masks the actual page count in reports, and could exhaust downstream systems that allocate resources proportional to page count. The page delta of 98 drives the indicator to critical severity with the maximum score bonus. Risk score: 153.
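The effect is easy to reproduce, along with one possible mitigation. The sketch below runs the same first-match regex with and without stream bodies removed; `first_count` is a hypothetical helper, and the stripping relies on the stream and endstream keywords rather than /Length, so a stream body that itself contains "endstream" would defeat it.

```python
import re

STREAM_BODY = re.compile(rb"stream\r?\n.*?endstream", re.S)
COUNT = re.compile(rb"/Count\s+(\d+)")


def first_count(data, strip_streams):
    """First /Count value in file order, optionally ignoring stream bodies."""
    if strip_streams:
        data = STREAM_BODY.sub(b"stream\nendstream", data)
    m = COUNT.search(data)
    return int(m.group(1)) if m else None


sample = (
    b"%PDF-1.4\n"
    b"4 0 obj\n<< /Length 38 >>\nstream\n"
    b"/Count 99 cm\nBT /F1 12 Tf (Hello) Tj ET\n"
    b"endstream\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
)

naive = first_count(sample, strip_streams=False)      # 99
stripped = first_count(sample, strip_streams=True)    # 1
```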
Test 9 — OpenAction JavaScript Added via Incremental Update
Technique
The base document is clean: a three-object PDF with no actions. An incremental
update appends a JavaScript action object and a new Catalog revision that sets
/OpenAction to point at it. This is the same update-chain mechanism
as Test 6 (AcroForm), applied to JavaScript and OpenAction instead.
[Base document: /Catalog has no /OpenAction]
%%EOF
[Incremental update:]
4 0 obj
<< /Type /Action /S /JavaScript /JS (app.alert("openaction-hidden");) >>
endobj
1 0 obj
<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >> ← new Catalog
endobj
[xref2: obj 1 → new Catalog offset, obj 4 → JS action offset]
[trailer2: /Prev → base xref]
%%EOF
File size: 608 bytes.
Scanner Output — Live Results
| Parser | Detects JS? | Detects OpenAction? |
|---|---|---|
| MuPDF | ✗ No | No |
| Poppler | ✗ No | — |
| Ghostscript | ✗ No | No |
| qpdf | — | — |
| pdfminer | ✗ No | No |
| pdf.js / Node | ✓ Yes | Yes |
Why It Matters
The OpenAction points to a JavaScript action in the incremental update,
the canonical malware delivery mechanism for PDF auto-exec exploits. Yet four
parsers report no JavaScript and qpdf has no path to check. Only the raw-byte
scanner finds it, because /S /JavaScript appears in the update body's raw
object text. The structural parsers should follow the update chain to the
new Catalog and traverse /OpenAction; their failure to do so suggests that
their JS detection paths only check specific catalog keys, not all action
references. Risk score: 648.
Test 10 — Dual %%EOF: First /Count Wins in Raw Scan
Technique
The file contains two complete and syntactically valid PDF document structures,
each with its own xref table, trailer, and
%%EOF marker. The first document describes a 2-page file; the second
describes a 1-page file. Standard PDF parsing requires reading from the physical
end of the file to find the last %%EOF, then following the
startxref backward. Five parsers do this correctly and load the
1-page document. pdf.js’s raw regex for page count picks up the first
/Count it encounters in file order, which belongs to the 2-page
document.
[2-page document]
2 0 obj << /Type /Pages /Kids [3 0 R 5 0 R] /Count 2 >> ← first /Count in file
...
startxref [offset of 2-page xref]
%%EOF
[1-page document starts here]
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> ← second /Count in file
...
startxref [offset of 1-page xref]
%%EOF ← last %%EOF — correct parsers start here
File size: 772 bytes.
Scanner Output — Live Results
| Parser | Reported Pages | Parsing strategy |
|---|---|---|
| MuPDF | 1 | Reads from last %%EOF → last xref → 1-page /Pages dict |
| Poppler | 1 | Reads from last %%EOF → last xref → 1-page /Pages dict |
| Ghostscript | 1 | Renders the 1-page document found from last %%EOF |
| qpdf | 1 | --show-npages uses last xref |
| pdfminer | 1 | Loads document from last trailer, finds 1-page tree |
| pdf.js / Node | 2 | Raw regex first /Count \d+ match hits the 2-page document’s /Count 2 |
Why It Matters
This is the complement to Test 8: in Test 8, a fake count in a stream body fools the raw-byte scanner. Here, a completely valid PDF structure at an earlier file position fools it. Both attacks exploit the same root cause: the raw-byte scanner reads linearly while structural parsers read the xref chain. Dual-%%EOF files are also a polyglot technique — the file is simultaneously two different valid PDFs, which can confuse signature-based detection and archival tools. Risk score: 90.
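The two reading orders can be contrasted directly. In this sketch the sample is schematic: the objects are abbreviated, and the startxref offsets 100 and 900 are placeholders rather than real byte positions. `last_startxref` is a hypothetical helper.

```python
import re


def last_startxref(data):
    """Offset named by the final startxref keyword, found by searching from the end."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        return None
    m = re.match(rb"startxref\s+(\d+)", data[pos:])
    return int(m.group(1)) if m else None


sample = (
    b"2 0 obj << /Type /Pages /Kids [3 0 R 5 0 R] /Count 2 >> endobj\n"
    b"startxref\n100\n%%EOF\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n"
    b"startxref\n900\n%%EOF\n"
)

# Linear scan: the first /Count in file order belongs to the 2-page document.
first_count = int(re.search(rb"/Count\s+(\d+)", sample).group(1))   # 2
# Structural entry point: the last startxref belongs to the 1-page document.
xref_offset = last_startxref(sample)                                # 900
```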
Test 11 — /Encrypt Keyword in a PDF Comment Line
Technique
A PDF comment line (any line beginning with %, per ISO 32000-1
§7.2.4) is syntactically invisible to structural parsers: they discard
everything from % to the end of the line. The pdf.js scanner,
however, runs its encryption detection as a raw regex on the stream-stripped bytes:
/\/Encrypt/.test(noStreams). The variable noStreams
has stream…endstream blocks removed but keeps
comment lines. We placed /Encrypt inside a comment.
%PDF-1.4
%âãÏÓ
% Document policy: /Encrypt /Standard /V 4 /R 4 /P -3904
1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
...
trailer << /Size 4 /Root 1 0 R >> ← no /Encrypt here
startxref
...
%%EOF
File size: 392 bytes. The trailer has no /Encrypt entry. The file is structurally unencrypted.
Scanner Output — Live Results
| Parser | Reports Encrypted? | Source |
|---|---|---|
| MuPDF | No (clear) | Checks /Encrypt in trailer — not present |
| Poppler | No (clear) | pdfinfo parses trailer, reports “Encrypted: no” |
| Ghostscript | — | No encryption status in render output |
| qpdf | No (clear) | --show-encryption outputs “File is not encrypted” |
| pdfminer | No (clear) | Document encryption attribute is None / False |
| pdf.js / Node | ⚠ Yes (encrypted) | Regex /\/Encrypt/ matches the comment line — comment not stripped |
Why It Matters
This is a pure keyword-injection false positive: the raw-byte scanner reports encryption that does not exist. An attacker can use this to trigger defensive behaviors in systems that skip content analysis of encrypted files (because they cannot decrypt them), while the content itself is fully readable. Conversely, a benign file can be misreported as encrypted simply by including a comment that mentions encryption policy in plain text. Risk score: 96.
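A raw-byte scanner can reduce this false positive by dropping comment lines before matching. The sketch below is deliberately conservative and only removes lines whose first non-blank character is %, which covers the layout used in this test; a % inside a string literal that happens to start a line would be stripped incorrectly. The helper name `has_encrypt` is hypothetical.

```python
import re

# Whole-line comments only: optional indentation, then % to end of line.
COMMENT_LINE = re.compile(rb"^[ \t]*%[^\r\n]*", re.M)


def has_encrypt(data, strip_comments):
    """Keyword test for /Encrypt, optionally ignoring whole-line comments."""
    if strip_comments:
        data = COMMENT_LINE.sub(b"", data)
    return re.search(rb"/Encrypt", data) is not None


sample = (
    b"%PDF-1.4\n"
    b"%\xe2\xe3\xcf\xd3\n"
    b"% Document policy: /Encrypt /Standard /V 4 /R 4 /P -3904\n"
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"trailer << /Size 4 /Root 1 0 R >>\n"
    b"%%EOF\n"
)

naive = has_encrypt(sample, strip_comments=False)      # True: matches the comment
filtered = has_encrypt(sample, strip_comments=True)    # False: trailer has no /Encrypt
```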
Summary: All Eleven Tests
Part I — Structural ambiguities (spec interpretation differences between parsers)
| Test | Technique | Split | Severity | Risk |
|---|---|---|---|---|
| 1 — Header vs. Catalog version | %PDF-1.4 header / /Version /1.7 Catalog | MuPDF/Poppler=1.7 • qpdf=1.4 | medium | 28 |
| 2 — /Count vs. actual pages | /Count 3 with 2 Kids | 4 parsers=3 • GS/pdfminer=2 | medium | 35 |
| 3 — JS in compressed Names tree | /Names/JavaScript via FlateDecode stream | Poppler/pdfminer/pdfjs=JS • MuPDF/GS=none | critical | 528 |
| 4 — Null encryption (/V 0) | /Encrypt /V 0 in trailer | MuPDF/qpdf/pdfjs=enc • Poppler=clear | critical | 106 |
| 5 — Orphan JS via incremental update | obj 5 redefined as JS action, not in doc tree | pdfjs=JS • MuPDF/Poppler/GS/qpdf/pdfminer=none | critical | 688 |
| 6 — AcroForm via incremental update | AcroForm added in update, base Catalog clean | Poppler=AcroForm • MuPDF/pdfminer=none | medium | 111 |
Part II — Keyword injection (raw-byte scanner vs. structural parsers)
| Test | Technique | Split | Severity | Risk |
|---|---|---|---|---|
| 7 — JS in page /AA | JavaScript in /AA /O page action | Poppler/pdfjs=JS • MuPDF/GS/pdfminer=none | critical | 498 |
| 8 — False /Count in stream body | /Count 99 in content stream before real /Pages | pdfjs=99 pages • all others=1 | critical | 153 |
| 9 — OpenAction via incremental update | /OpenAction JS added in update body | pdfjs=JS • MuPDF/Poppler/GS/pdfminer=none | critical | 648 |
| 10 — Dual %%EOF page confusion | 2-page doc before 1-page doc; raw scan hits first | pdfjs=2 pages • all others=1 | medium | 90 |
| 11 — /Encrypt keyword in comment | % /Encrypt … in a PDF comment line | pdfjs=encrypted • MuPDF/Poppler/qpdf/pdfminer=clear | critical | 96 |
Every file produced a confirmed cross-parser disagreement. Seven of eleven triggered a critical-severity indicator. File sizes ranged from 349 to 798 bytes. No PDF library was used to construct any test file — all were built from raw bytes with byte-accurate cross-reference tables.
Why This Matters Beyond the Lab
The results above are not a critique of any individual parser. Each parser is
doing something reasonable given the specification’s ambiguities and its
own engineering priorities. MuPDF is a rendering library optimized for speed.
Poppler’s pdfinfo is a metadata tool. Ghostscript is a
PostScript interpreter with PDF rendering bolted on. Each has a different
view of “what the document is.” The security problem is
using any one of them as the sole arbiter of document content.
Security Scanning
Any scanner that passes a file as “no JavaScript” or “not encrypted” based on a single parser is making a claim the data does not support. The correct claim is “no JavaScript visible to this parser.” Attackers who understand parser-specific behavior can construct PDFs that are simultaneously malicious to the renderer and benign to the scanner — using nothing more than knowledge of the specification and which parser makes which choice at each ambiguity. No exploit code required.
RAG Pipelines and AI Ingestion
Retrieval-augmented generation (RAG) systems, document AI platforms, and enterprise knowledge bases ingest PDFs using a single parser — often pdfminer or MuPDF — to extract text. When that parser sees a different page tree than the rendering engine, the extracted text diverges from what a human reader sees. Test 2 demonstrates this: four parsers report 3 pages, two report 2. An AI system ingesting 2 pages indexes different content than a human reading 3. In adversarial contexts, this gap can be exploited to inject content that is invisible to the AI indexer but visible to the human reader — or vice versa.
Fraud Detection and Legal Review
Incremental update attacks (demonstrated in Tests 5, 6, and 9) are a known vector for document fraud: a signed PDF is modified via an incremental update without invalidating the digital signature. If the forensic tool examining the document uses a parser that ignores the update chain, it reports the original (unmodified) content as authoritative. Poppler and pdf.js follow the update chain; MuPDF and pdfminer may not surface the update’s additions in their standard output modes. A legal review relying on MuPDF-extracted text would examine a different document than one viewed in a Poppler-based reader.
Compliance and DLP
Data loss prevention systems and compliance gateways classify PDFs based on
content analysis. If the DLP system uses a parser that reports
“no AcroForm” (Test 6) but the recipient’s reader opens an
AcroForm with an auto-submit exfiltration action, the DLP gateway lets
it pass. If the system uses a parser that reports “encrypted” for
a file with a null /V 0 encryption dictionary (Test 4) and skips
deep content analysis of encrypted files, the plaintext content bypasses
all content inspection. Parser choice is policy, whether the DLP vendor
acknowledges it or not.
Standards Context: Where ISO 32000 Leaves Behavior Undefined
These findings are not bugs in individual parsers. They are consequences of genuine ambiguities and underspecified behaviors in the PDF specification itself. Understanding where the standard leaves room for interpretation is essential for anyone contributing to parser development, archival workflows, or accessibility tooling.
ISO 32000-1 and ISO 32000-2
ISO 32000-1:2008 (PDF 1.7) and ISO 32000-2:2020 (PDF 2.0) specify the PDF format. Several of the ambiguities demonstrated above trace directly to the standard:
- Version resolution (§7.5.2): The standard states that a /Version entry in the document Catalog “shall override” the file header version. Parsers differ on whether this override applies to all version-gated feature checks or only to conformance declarations. Test 1 shows qpdf reading 1.4 while MuPDF and Poppler read 1.7 from the same file.
- Page tree /Count field (§7.7.3.2): The standard defines /Count as “the number of leaf nodes (page objects) that are descendants of this node.” It does not specify what a conforming reader must do when /Count disagrees with the actual number of descendants. Test 2 shows a 4-2 parser split on this.
- Incremental update processing (§7.5.6): The standard is clear that the last cross-reference section defines the current object states, but does not define the minimum set of incremental revision chain entries a conforming reader must process when extracting metadata. Tests 5, 6, and 9 expose this gap.
- Encryption dictionary (§7.6): The standard defines /V 0 as “an algorithm that is undocumented and no longer supported.” It does not specify whether a conforming reader encountering /V 0 should treat the document as encrypted or as an error. Test 4 shows a 3-1 split on this exact question.
PDF/A and PDF/UA Implications
PDF/A (ISO 19005) mandates that conforming files be renderable consistently across readers without external dependencies. Parser disagreement on page count or document structure violates the spirit of this guarantee: an archival tool validating a PDF/A file using MuPDF may report it as conforming while a Poppler-based reader renders additional pages from an incremental update that MuPDF does not surface.
PDF/UA (ISO 14289) requires that the logical reading order and document structure be accessible to assistive technology. When a screen reader’s underlying PDF parser reports different page content than the visual renderer, accessibility compliance becomes parser-dependent rather than document-dependent. A document that passes PDF/UA validation against one parser may fail accessibility requirements in the rendering stack used by the actual reader.
A Note for PDF Association and Adobe Reviewers
The disagreements demonstrated here do not represent parser bugs — they
represent specification gaps that reasonable implementations fill differently.
The appropriate resolution is not to fix individual parsers but to add
normative language to ISO 32000 and ISO 19005 that specifies required behavior
for malformed or ambiguous inputs. Areas that would benefit from tightened
language: the precedence of /Count vs. structural traversal;
the semantics of /V 0 encryption dictionaries; the minimum
update-chain depth that conforming readers must process for metadata extraction;
and the normative treatment of /Version catalog overrides in
feature-gating contexts.
Three Things Disagreement Does Not Mean
Parser disagreement is a forensic signal, not a verdict. Before using differential analysis operationally, three common misreadings need to be corrected.
1. Disagreement Is Not Evidence of Attack
Most cross-parser disagreements in production PDF traffic come from benign sources: export software that writes technically non-conformant but harmless files, old PDF generators with known quirks, document repair tools that add non-standard recovery structures, and interoperability edge cases between PDF/A workflows and generic readers. The 11 test files above were constructed to isolate disagreement patterns — in practice, most disagreements arrive packaged with enough clean context to score conservatively.
The correct interpretation is:
disagreement = elevated forensic interest — not confirmed compromise
The risk score reflects the combination of the disagreement with what the other 43 engines observe. A disagreement on page count in a file with clean metadata, no entropy anomalies, a known producer string, and a valid digital signature scores low. The same disagreement in a file with stripped metadata, high-entropy streams, and no producer scores high. The differential finding is a multiplier, not a standalone accusation.
2. Aggressive Differential Analysis Can Be Noisy
If every cross-parser discrepancy triggered a high-severity alert, enterprise adoption would collapse under false positive volume. PDF interoperability is genuinely messy: Microsoft Office export, Adobe Acrobat, LibreOffice, Chrome’s built-in print-to-PDF, and various SaaS document platforms all produce files with at least one quirk that at least one parser handles differently. A scanner that treats all of those as critical findings is not useful.
Signal quality depends on three things that are harder than detection itself:
- Scoring calibration — weighting disagreements by dimension (JS visibility disagreement is qualitatively different from a version-string discrepancy).
- Context from corroborating engines — a disagreement that no other engine corroborates scores near zero; one that five engines corroborate scores multiplicatively higher.
- Explainability — analysts need to know which parsers disagreed on what value, not just that a score crossed a threshold. The differential engine outputs per-dimension parser value sets precisely so that a human reviewer can reproduce the disagreement independently.
Operational tuning remains an ongoing process. The scoring weights published in this article reflect current calibration against real-world traffic; they are not immutable constants.
3. Output Normalization Across Parsers Is a Genuinely Hard Engineering Problem
Each of the six parsers produces output in a different format with different
semantics. MuPDF’s mutool info emits human-readable key-value
text. Poppler’s pdfinfo emits different key-value text with
different field names. qpdf emits structured JSON with its own schema. Ghostscript
does not emit document metadata at all — it renders, and the scanner infers
structure from render success/failure and page output. pdfminer exposes a Python
object model. The Node.js subprocess runs raw regex on bytes.
Normalizing these into a comparable set of dimensions requires:
- Field mapping across six distinct output schemas
- Type normalization (booleans expressed as “yes”, “true”, “1”, or implied by field presence vs. absence)
- Handling parser-specific absences: a parser that does not report a dimension should be excluded from comparisons for that dimension, not treated as reporting a null value
- Accounting for different repair behaviors: when a parser encounters a malformed xref, it may silently repair it and return a value, return an error, or return a partial result — all three need different handling
This normalization layer is not glamorous and does not lend itself to clean demos. But it is the prerequisite for everything else on this page. Getting it wrong produces spurious disagreements that have nothing to do with the PDF — they are artifacts of schema mismatches between parsers. Getting it right requires treating each parser as an unreliable witness that must be cross-examined rather than believed.
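The type-normalization and absence-handling requirements above can be sketched in a few lines. This is a minimal illustration, not the scanner's actual normalization layer: the `TRUTHY`/`FALSY` spellings, the function names, and the sample report are all hypothetical, and a real mapping would have to be built per parser.

```python
from typing import Dict, Optional

# Hypothetical spellings; real parser output needs a per-parser mapping.
TRUTHY = {"yes", "true", "1"}
FALSY = {"no", "false", "0"}


def normalize_bool(raw: Optional[str]) -> Optional[bool]:
    """Map a raw field value to True/False, or None if it was not reported."""
    if raw is None:
        return None  # the parser did not report this dimension at all
    token = raw.strip().lower()
    if token in TRUTHY:
        return True
    if token in FALSY:
        return False
    return None  # unrecognized value: treat as unreported rather than guess


def comparable_values(raw_by_parser: Dict[str, Optional[str]]) -> Dict[str, bool]:
    """Keep only the parsers that actually reported a value for this dimension."""
    normalized = {name: normalize_bool(raw) for name, raw in raw_by_parser.items()}
    return {name: value for name, value in normalized.items() if value is not None}


# Hypothetical raw values for the "JavaScript present" dimension.
js_reports = {"poppler": "yes", "pdfminer": "True", "mupdf": None, "node": "1"}
values = comparable_values(js_reports)
disagreement = len(set(values.values())) > 1
print(values, disagreement)
```

Dropping the silent parser before comparison is the important design choice here: treating `None` as a value would manufacture a disagreement that has nothing to do with the file.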
The Same Architectural Weakness Breaks AI Ingestion
The specification ambiguities documented above do not respect the boundary between security tooling and AI tooling. The same structural properties of a PDF that allow malicious content to hide from a security scanner also allow content to hide from — or be misrepresented to — a RAG pipeline, an OCR system, or an enterprise document ingestion stack. The failure modes are different, but the root cause is identical: a single parser is being asked to render a definitive account of a document whose structure is genuinely ambiguous.
The Structural Symmetry
In security contexts, the relevant question is: does this file contain malicious content? If two parsers disagree, one is missing the content and the attacker wins. In AI ingestion contexts, the relevant question is: what does this document say? If two parsers disagree, the knowledge base indexed by one pipeline reflects a different document than the one a human reader opens. Neither failure requires an attacker. Both follow from the same set of specification gaps shown in Tests 1 through 11.
| Test | Security failure mode | AI ingestion failure mode |
|---|---|---|
| Test 2: /Count lie | Scanner clears a multi-page attack doc as single-page | RAG pipeline indexes 2 pages; reader opens 3; knowledge base is incomplete |
| Test 3: JS in compressed stream | JS hidden from structural parsers → scanner misses auto-exec payload | Any text in compressed streams invisible to extractors that skip stream bodies |
| Test 5: Incremental update | Post-signature content addition bypasses audit | Ingestion pipeline indexes original version; updated content never reaches the index |
| Test 6: AcroForm in update | Form with auto-submit exfiltration passes DLP | Form fields and their values missed by extractors that don't follow update chains |
| Test 11: Keyword in comment | Raw-byte scanner falsely reports encryption; file skipped | Extraction tool treating comments as metadata ingests false metadata alongside document text |
Why Scientific, Legal, and Patent PDFs Break Parsers Harder
Most RAG benchmarks use clean, well-formed PDFs. Production document corpora don’t. The hardest categories for parsers are precisely the categories most valuable for enterprise AI:
- Scientific papers (arXiv, journal PDFs): Multi-column layouts cause different parsers to extract text in different reading orders. MuPDF’s column detection, pdfminer’s position-sorted extraction, and PyPDF’s stream-order extraction all produce different token sequences from the same two-column paper. The LLM downstream receives different context.
- Patent documents (USPTO, EPO): Patent PDFs frequently combine scanned page images with a selectable-text overlay added by the patent office OCR pipeline. Parsers diverge on whether to read the overlay text, attempt independent OCR on the image, or report no text content at all. The same patent can produce 4,000 words from one extractor and zero from another.
- Legal contracts with redactions: Some redaction tools draw black rectangles over existing text objects without removing the underlying text from the PDF structure. A parser that reads the visual layer sees a redacted document. A parser that reads the object tree sees the original unredacted text. This is not hypothetical: it has affected court filings in publicized cases.
- Digitally signed PDFs: Signature validation locks specific byte ranges. Tools that process the full file rather than the signed byte range can see content that technically post-dates the signature. Ingestion pipelines that do not reconstruct the revision chain index a different document than the one the signature covers.
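For the signed-PDF case, one rough way to notice content that sits outside the signed span is to compare the end of each `/ByteRange` with the file length. The sketch below is a raw-byte heuristic under stated assumptions: the function name is hypothetical, it only sees `/ByteRange` arrays written as plain text in the file (not ones stored inside compressed object streams), and it does not validate the signature itself.

```python
import re
from typing import List

# Matches a four-integer array such as: /ByteRange [0 840 960 240]
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def bytes_after_signed_range(data: bytes) -> List[int]:
    """For each /ByteRange found, return how many bytes follow the signed span."""
    trailing = []
    for match in BYTE_RANGE.finditer(data):
        _start1, _len1, start2, len2 = (int(group) for group in match.groups())
        signed_end = start2 + len2  # offset one past the last signed byte
        trailing.append(max(0, len(data) - signed_end))
    return trailing


# Synthetic example: a 60-byte "signed" file, then 25 bytes appended afterwards.
signed = b"/ByteRange [0 10 20 40]".ljust(60, b" ")
appended = signed + b"X" * 25
print(bytes_after_signed_range(signed), bytes_after_signed_range(appended))
```

A non-zero result does not prove tampering, since legitimate incremental saves also append bytes, but it tells an ingestion pipeline that the text it is about to index may not be the text the signature covers.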
Adversarial Documents: The Prompt Injection Angle
Parser disagreement creates a natural split-view attack surface for AI
pipelines. A PDF where the visual rendering layer (what the human reviewer sees
in Adobe Acrobat) shows benign content while the extraction layer (what the
RAG pipeline’s pdfminer or PyMuPDF call returns) contains additional or
different content is a prompt injection vector that requires no exploit code
— only knowledge of the specification and which parser makes which
choice at each ambiguity. Test 3 demonstrates a minimal version of this: the
human opening the file in a viewer sees an empty page; parsers that follow
the /Names/JavaScript tree see JavaScript code.
The adversarial content does not have to be JavaScript. It can be natural-language instruction text embedded in a compressed stream, an incremental update, or a content layer that only some parsers surface. If that text reaches the LLM’s context window and the human reviewer’s visual inspection did not reveal it, the pipeline has been injected.
Multi-Parser Verification as an AI Pipeline Primitive
The same multi-parser comparison approach that flags disagreements as a security signal can serve as a completeness check for AI ingestion. If three parsers extract 1,400 words from a PDF and a fourth extracts 1,900, the 500-word gap is worth investigating before the document reaches the index. The investigation does not have to be automated to be valuable: surfacing "parsers disagree on content length" as a flag for human review is qualitatively better than silently ingesting whatever one parser happened to return.
Parser agreement is a weak signal that extraction is complete. Parser disagreement is a strong signal that at least one parser is wrong. Neither guarantees correctness, but checking for agreement is better than flying blind with a single extractor.
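The completeness check described above amounts to comparing word counts across extractors and flagging a large spread. The following is a minimal sketch; the function name, the 10% tolerance, and the parser labels are assumptions chosen for illustration, and the input texts would come from whichever extractors a pipeline already runs.

```python
from typing import Dict


def word_count_gap(texts_by_parser: Dict[str, str], tolerance: float = 0.10) -> dict:
    """Compare extracted word counts and flag the file if the spread is large."""
    counts = {name: len(text.split()) for name, text in texts_by_parser.items()}
    low, high = min(counts.values()), max(counts.values())
    # Gap relative to the largest extraction; 0.0 if every parser returned nothing.
    relative_gap = (high - low) / high if high else 0.0
    return {
        "counts": counts,
        "gap_words": high - low,
        "relative_gap": relative_gap,
        "needs_review": relative_gap > tolerance,
    }


# Mirrors the example in the text: three extractors at 1,400 words, one at 1,900.
texts = {
    "parser_a": "w " * 1400,
    "parser_b": "w " * 1400,
    "parser_c": "w " * 1400,
    "parser_d": "w " * 1900,
}
report = word_count_gap(texts)
print(report["gap_words"], report["needs_review"])
```

Word count is a coarse proxy: two extractors can return the same number of words in a different reading order, so a clean result here says nothing about ordering.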
How PQPDF Resolves Parser Disagreement
Detecting a disagreement is the easy part. The harder question is: which parser is correct? PQPDF does not attempt to declare a single parser the winner. Instead, it treats the disagreement itself as a security signal and uses multi-layer consensus to determine the risk weight of each discrepancy.
1. Parser Consensus and Signal Weighting
The differential parsing engine collects output from all six parsers across seven structural dimensions: page count, JavaScript presence, encryption status, PDF version, AcroForm presence, embedded file count, and object count. For each dimension, it records the set of distinct values reported. A dimension with one value across all parsers is consistent. A dimension with two or more distinct values is flagged.
| Dimension | Discrepancy threshold | Severity |
|---|---|---|
| Page count | Any difference between parsers; score scales with delta magnitude | medium → critical |
| JavaScript presence | At least one reports JS, at least one does not | critical (+50 score) |
| Encryption status | Any difference between parsers | critical (+40 score) |
| PDF version | Any difference between parsers | medium (+10 score) |
| AcroForm presence | Any difference between parsers | medium (+15 score) |
| Embedded file count | Any difference between parsers | medium → high |
| Object count | >10% relative difference between parsers | medium (+15 score) |
Structural parsers (MuPDF, Poppler, qpdf) are considered higher-confidence sources for version and xref validity. The raw-byte scanner (pdf.js / Node) is treated as a broad-net detector: if it sees a keyword, the keyword exists in the file regardless of structural context — that matters independently of whether structural parsers agree.
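The per-dimension comparison can be sketched as a set-of-distinct-values check, with the object-count rule handled separately. This is an illustrative sketch only: the function name, dimension keys, and sample reports are hypothetical, and it assumes the 10% object-count threshold is measured relative to the largest reported value, which the table does not specify.

```python
from typing import Any, Dict


def flag_dimensions(reports: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """reports maps parser name -> {dimension: value}; a missing key means 'not reported'."""
    dimensions = {dim for report in reports.values() for dim in report}
    flagged = {}
    for dim in sorted(dimensions):
        # Only parsers that reported this dimension take part in the comparison.
        values = {name: r[dim] for name, r in reports.items() if dim in r}
        if dim == "object_count":
            low, high = min(values.values()), max(values.values())
            differs = high > 0 and (high - low) / high > 0.10
        else:
            differs = len(set(values.values())) > 1
        if differs:
            flagged[dim] = values  # keep per-parser values so a reviewer can reproduce it
    return flagged


# Hypothetical normalized reports for one file.
reports = {
    "mupdf": {"pages": 1, "javascript": False, "object_count": 100},
    "poppler": {"pages": 1, "javascript": True},
    "pdfminer": {"pages": 1, "javascript": True, "object_count": 95},
}
flagged = flag_dimensions(reports)
print(flagged)
```

Returning the full per-parser value map, rather than a boolean, matches the explainability goal stated earlier: an analyst can see which parser reported what.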
2. xref Structural Validation
qpdf --check produces a structural validity verdict independent
of content analysis. A finding of “structural integrity intact” from
qpdf combined with a JS visibility discrepancy between other parsers tells the
correlation engine something specific: the discrepancy is not due to a malformed
file that parsers repair differently — the xref is valid, so the
disagreement is about interpretation of a structurally sound document.
This is scored differently from a case where qpdf also reports xref errors,
which suggests the parsers are resolving a genuinely malformed file.
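The distinction drawn above can be sketched as a small classifier over the `qpdf --check` exit status. qpdf exits with 0 when it finds no problems, 2 on errors, and 3 on warnings without errors; everything else in this sketch (the function names, the category labels, and the choice to treat warnings like errors) is an assumption for illustration, not the scanner's scoring logic.

```python
import subprocess
from typing import Set


def qpdf_check_exit_code(path: str, timeout: float = 10.0) -> int:
    """Run `qpdf --check` and return its exit status (requires qpdf on PATH)."""
    result = subprocess.run(
        ["qpdf", "--check", path], capture_output=True, timeout=timeout
    )
    return result.returncode


def classify_discrepancy(qpdf_exit_code: int, js_values: Set[bool]) -> str:
    """Label a JS-visibility finding by whether the file is structurally sound."""
    if len(js_values) < 2:
        return "consistent"  # all reporting parsers agree
    if qpdf_exit_code == 0:
        # Structure is valid, so parsers differ in how they interpret it.
        return "interpretation-disagreement"
    # Errors (2) or warnings (3): parsers may be repairing the file differently.
    return "repair-disagreement"


print(classify_discrepancy(0, {True, False}))
print(classify_discrepancy(2, {True, False}))
```

Usage on a real file would look like `classify_discrepancy(qpdf_check_exit_code("sample.pdf"), {True, False})`, where `sample.pdf` is a placeholder path.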
3. Incremental Update Reconstruction
For documents with incremental revisions, the scanner uses
qpdf --json (selectively) and pdfminer’s trailer chain
traversal to reconstruct the full revision history. Each revision’s object
additions and replacements are recorded separately. A JavaScript action object
that appears only in revision 2 (after a signature on revision 1) is flagged
not just as “has JavaScript” but as “post-signature JS
insertion” — a specific pattern associated with signature bypass
attacks. The revision history feeds Engine 22 (Signature Forensics) for
cross-correlation.
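A much cruder approximation of revision reconstruction can be done on raw bytes by splitting the file at `%%EOF` markers. The sketch below is a heuristic stand-in, not the qpdf/pdfminer reconstruction described above: the function names are hypothetical, linearized files and `%%EOF` bytes inside streams can inflate the revision count, and it does not check whether a signature actually covers the first revision.

```python
import re
from typing import List

EOF_MARKER = re.compile(rb"%%EOF")
JS_KEYWORD = re.compile(rb"/JavaScript")


def revision_boundaries(data: bytes) -> List[int]:
    """Return the byte offset just past each %%EOF marker, one per apparent revision."""
    return [match.end() for match in EOF_MARKER.finditer(data)]


def js_added_after_first_revision(data: bytes) -> bool:
    """True if a /JavaScript keyword appears only in bytes appended after revision 1."""
    bounds = revision_boundaries(data)
    if len(bounds) < 2:
        return False  # no incremental update detected
    first_end = bounds[0]
    in_first = JS_KEYWORD.search(data[:first_end]) is not None
    in_later = JS_KEYWORD.search(data[first_end:]) is not None
    return in_later and not in_first


# Synthetic byte strings shaped like a base revision plus one incremental update.
original = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
update = b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
print(js_added_after_first_revision(original), js_added_after_first_revision(original + update))
```

Because it never decompresses streams, this check misses JavaScript stored inside object streams; it is only useful as a cheap first pass before a structural parser walks the trailer chain.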
4. Multi-Engine Correlation (Engine 43)
Differential parsing findings are one input to Engine 43 (Correlation Engine),
which cross-references all 44 engine outputs. A JavaScript visibility discrepancy
alone scores conservatively. The same discrepancy combined with a high-entropy
compressed stream (Engine 7), a missing /Producer metadata field
(Engine 6), and a qpdf xref reconstruction note (Engine 11) — confirmed
by three independent engines — scores multiplicatively higher. This
compound-indicator design is why the risk scores in the tests above range from
28 to 688: the differential finding is a multiplier on the other signals,
not a standalone verdict.
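To make the "multiplier, not a standalone verdict" idea concrete, here is a toy scoring function. The formula, the `damping` and `boost` parameters, and their default values are entirely hypothetical; the article does not publish the correlation engine's arithmetic, so this only illustrates the shape of a compound-indicator score.

```python
def correlated_score(
    base_weight: float,
    corroborating_engines: int,
    damping: float = 0.25,
    boost: float = 1.5,
) -> float:
    """Scale a differential finding by how many other engines corroborate it."""
    if corroborating_engines < 0:
        raise ValueError("corroborating_engines must be non-negative")
    if corroborating_engines == 0:
        # An uncorroborated disagreement is kept, but scored conservatively.
        return base_weight * damping
    # Each corroborating engine multiplies the score rather than adding to it.
    return base_weight * boost ** corroborating_engines


# Using the +50 JavaScript weight from the table as the base.
alone = correlated_score(50, 0)
with_three = correlated_score(50, 3)
print(alone, with_three)
```

The point of the multiplicative form is that corroboration changes the order of magnitude of the score, while a lone disagreement stays near the bottom of the range.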
5. Hard Isolation and Timeout
All six parsers run inside isolated Linux namespaces (separate network, PID, and mount spaces). A hard 30-second SIGALRM wraps the entire engine; each individual parser subprocess has its own 5–12 second timeout. pdfminer runs as a subprocess because a Python thread stuck inside a parse cannot be forcibly stopped from another thread, whereas a subprocess can always be killed. No malformed PDF can stall the scan, block the job queue, or cause a resource exhaustion denial-of-service through the differential parsing engine.
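The per-parser timeout can be sketched with the standard library alone. This is a minimal illustration of the hard-kill behaviour only: it does not set up Linux namespaces or the outer SIGALRM, and the function name and status labels are hypothetical.

```python
import subprocess
import sys
from typing import List


def run_parser(command: List[str], timeout_seconds: float) -> dict:
    """Run one parser command; the child is killed if it exceeds the timeout."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return {"status": "timeout", "stdout": ""}
    except FileNotFoundError:
        return {"status": "missing", "stdout": ""}  # parser binary not installed
    status = "ok" if completed.returncode == 0 else "error"
    return {"status": status, "stdout": completed.stdout}


# A deliberately slow child process stands in for a parser stuck on a malformed file.
slow = run_parser([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)
fast = run_parser([sys.executable, "-c", "print('ok')"], 10)
print(slow["status"], fast["status"])
```

Reporting `timeout` and `missing` as distinct statuses matters for the comparison step: a parser that never answered should be excluded from a dimension, not counted as disagreeing.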
Download the Demo PDF
The primary demonstration file (Test 3) is available for independent verification. This is the exact binary that produced the scanner results above — byte-for-byte identical, unmodified.
| Property | Value |
|---|---|
| File | parser-disagreement-demo.pdf |
| Size | 686 bytes |
| MD5 | 4607e660f3d14fbb1978ce191c8b4080 |
| PDF version | 1.4 (header) |
| Technique | JavaScript in /Names/JavaScript via FlateDecode-compressed stream + /OpenAction |
| Expected JS split | Poppler=yes • pdfminer=yes • pdf.js=yes • MuPDF=no • Ghostscript=no |
↓ Download parser-disagreement-demo.pdf (686 bytes)
Reproduction commands:
# Poppler — reports: JavaScript: yes
pdfinfo parser-disagreement-demo.pdf
# MuPDF — reports nothing for JavaScript
mutool info parser-disagreement-demo.pdf
# Ghostscript — no JS in output
gs -dNOPAUSE -dBATCH -sDEVICE=nullpage parser-disagreement-demo.pdf 2>&1 | grep -i javascript
# pdfminer — sees JS via Names tree
python3 -c "
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
# Keep the file open while resolving: pdfminer reads objects lazily
with open('parser-disagreement-demo.pdf','rb') as f:
    d=PDFDocument(PDFParser(f))
    nm=resolve1(d.catalog).get('Names')
    print('JavaScript' in (resolve1(nm) if nm else {}))
"
# pdf.js (Node) — raw regex matches /S /JavaScript outside stream bodies
node -e "
const r=require('fs').readFileSync('parser-disagreement-demo.pdf','latin1');
const ns=r.replace(/\bstream[\s\S]*?endstream\b/g,' ');
console.log(/\/Type\s*\/JavaScript|\/S\s*\/JavaScript/.test(ns));
"
The JavaScript payload in the file is app.alert("PQPDF-differential-test");
— a display-only alert with no side effects. The file does not contact
any network endpoint and contains no executable code outside the PDF
JavaScript sandbox.
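The size and MD5 listed in the property table can be checked before trusting any parser output from a downloaded copy. The sketch below uses only the standard library; the function name is hypothetical, and the expected values are copied from the table above.

```python
import hashlib
from pathlib import Path

# Values copied from the property table for parser-disagreement-demo.pdf.
EXPECTED_SIZE = 686
EXPECTED_MD5 = "4607e660f3d14fbb1978ce191c8b4080"


def matches_published_file(
    path: str, expected_size: int = EXPECTED_SIZE, expected_md5: str = EXPECTED_MD5
) -> bool:
    """Return True only if the file has the published size and MD5 digest."""
    data = Path(path).read_bytes()
    return len(data) == expected_size and hashlib.md5(data).hexdigest() == expected_md5


# Example call, assuming the demo file is in the current directory:
# print(matches_published_file("parser-disagreement-demo.pdf"))
```

MD5 is used here only because it is the digest the table publishes; it confirms the download matches the published bytes, not that the file is safe.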
Use This in Your Workflow
The eleven PDFs and the methodology above are designed to be used as a validation suite for any scanner, pipeline, or tool that processes PDF files. Here is how different teams can put this research to work directly.
| Role | Action | What to look for |
|---|---|---|
| Security engineers | Download the demo PDF and run it through your scanner | Does your tool report JavaScript? If not, its parser is probably not following the /Names/JavaScript path, a blind spot that a single-parser scanner cannot detect on its own |
| DFIR analysts | Run a suspicious PDF through each parser covered by the reproduction commands above | Any dimension where parsers disagree is a structural anomaly worth investigating, especially JS presence and encryption status |
| RAG / AI builders | Compare your ingestion parser’s page count and text output against a second parser on the same file | Page count delta ≥1 means your index may be missing content visible to the user’s viewer |
| DLP / gateway engineers | Submit Test 4 (null /V 0 encryption) and Test 11 (/Encrypt in comment) to your gateway | Does the gateway skip content inspection because it thinks the file is encrypted? That’s a policy bypass |
| PDF tool developers | Run all 11 test files through your parser and record output for each dimension | Compare against the table above — any column where your parser joins the minority is a spec interpretation worth documenting |
If your tool reports “no JavaScript” where three of six parsers in this study disagree, you have a documented blind spot. That’s not a failure — it’s a known limitation of single-parser design. The question is whether your threat model accounts for it.
Scan Your PDFs with Multi-Parser Analysis
The PQ PDF Forensic Scanner runs differential parsing on every scan — no configuration required. All six parsers run in parallel. Every disagreement is flagged, scored, and fed into the 44-engine correlation layer. Upload any PDF and see what each parser reports, where they disagree, and what the disagreement implies for security risk.
→ PDF Forensics Scanner — 44 Engines, Multi-Parser, Free
No account. No file retention. Differential parsing, behavioral sandbox, YARA, ClamAV, offline threat intelligence (6.4M+ indicators), and AI synthesis — all running on the same file in parallel.