PQ PDF Tools Secure document utilities for everyday workflows.

Security Research — Published 26 May 2026

PDF Structural Problems
in AI Ingestion Pipelines

A factual analysis of V/AP divergence, parser disagreement, and what happens when both enter a knowledge base at scale

PDF is one of the dominant formats for enterprise and high-value document corpora feeding RAG systems and LLM training pipelines. The format carries two structural problems that standard ingestion tooling does not detect: V/AP divergence (a field’s machine-readable value and its rendered appearance can disagree inside the same signed document) and parser disagreement (six production parsers return six different accounts of the same file). Both problems were present before AI ingestion. Pipelines that rely on a single-pass extraction inherit both — and face a third: a single-parser extraction layer implicitly assumes there is one canonical truth inside every PDF. For documents with structural ambiguity, there is not.

Key Claims
  • PDF AcroForm fields store values in two independent locations with no obligation to agree
  • /NeedAppearances true + a digital signature means certified content and rendered content have different provenance by design
  • Six production parsers disagree on JavaScript presence, encryption status, page count, and AcroForm presence — empirically verified across 11 hand-crafted files
  • RAG and training pipelines treat extracted PDF text as authoritative; that assumption is false when the source document contains structural ambiguity
  • Static V/AP divergence detection: 8/8 positives, 0/187 false positives on a 196-document validation corpus
Who this matters for
  • RAG & document AI builders
  • LLM training data engineers
  • Legal & financial AI platforms
  • DLP & email gateway engineers
  • DFIR & document forensics
  • PDF standards contributors
Contents
  1. Scale Context
  2. The V/AP Structural Problem
  3. NeedAppearances Compounds the Problem
  4. Parser Disagreement: Six Parsers, Eleven Files
  5. Methodology and Reproducibility
  6. What This Means for AI Pipelines
  7. The Core Architectural Mismatch
  8. Ingestion Corruption vs. Adversarial Injection
  9. Growth Projections
  10. What the Spec Does and Does Not Address
  11. Scope Note
  12. References

Scale Context

Retrieval-augmented generation (RAG) has moved from experimental to production infrastructure. McKinsey’s 2025 AI adoption survey found that 71% of organizations report regular generative AI use in at least one business function.[18] Vectara, a RAG platform vendor, reports enterprises choosing RAG architectures for 30–60% of AI use cases that require high accuracy or custom data (vendor prediction).[8] The global AI training dataset market was estimated at $3.2 billion in 2025 and is projected to reach $16.3 billion by 2033, a CAGR of 22.6% (Grand View Research; industry analyst forecast, independent verification not available).[7]

PDF is the dominant format for the document corpora feeding these systems. In September 2025, Hugging Face released FinePDFs — 475 million documents, 3 trillion tokens, 3.65 terabytes — sourced exclusively from PDFs spanning 105 Common Crawl snapshots from 2013 to February 2025. The dataset covers 1,733 languages, with English alone comprising over 1.1 trillion tokens and Spanish, German, French, Russian, and Japanese each exceeding 100 billion tokens. This is the largest publicly available PDF-derived corpus and is already being used in LLM pretraining pipelines.

PDFs carry a disproportionate share of high-value content: legal contracts, regulatory filings, scientific papers, patents, financial instruments, government publications. These are not documents where extraction errors are cosmetic. They are documents where a wrong number has operational consequences.

The Structural Problem That AI Ingestion Inherits

PDF AcroForm fields have two independent data stores that have no obligation to agree with each other.

/V is the machine-readable field value. JavaScript reads it. Form submissions post it. Digital signatures include it in their byte-range hash.

/AP (specifically /AP /N) is an appearance stream — a self-contained PDF content stream that the viewer renders as pixels. It is what the human sees on screen.

These two stores are not derived from each other. A PDF author can set /V to $12,000.00 and author an /AP stream that renders $1,200.00. Both values coexist in the same file. A digital signature can certify the entire byte range covering both, and the signature remains cryptographically valid. The signed content and the displayed content structurally disagree inside the same certified byte range.
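
The disagreement can be made concrete in a few lines of Python. This is a minimal sketch under stated assumptions: the field object and the appearance stream are hypothetical hand-written bytes rather than output from a real document, and the regexes only handle simple literal strings with no escapes or nested parentheses.

```python
import re
import zlib

# Hypothetical field object: /V holds the machine-readable value.
FIELD_OBJ = b"<< /FT /Tx /T (Total) /V ($12,000.00) /AP << /N 7 0 R >> >>"

# Hypothetical /AP /N content stream, Flate-compressed as it often is on disk.
AP_STREAM = zlib.compress(b"/Tx BMC BT /Helv 12 Tf 2 4 Td ($1,200.00) Tj ET EMC")

def field_value(obj: bytes) -> str:
    """Return the literal-string /V value, or '' if absent (simplified)."""
    m = re.search(rb"/V\s*\(([^)]*)\)", obj)
    return m.group(1).decode("latin-1") if m else ""

def appearance_text(stream: bytes) -> str:
    """Concatenate the string operands of Tj operators in the decompressed stream."""
    content = zlib.decompress(stream)
    shown = re.findall(rb"\(([^)]*)\)\s*Tj", content)
    return b"".join(shown).decode("latin-1")

value = field_value(FIELD_OBJ)        # what a machine consumer reads
rendered = appearance_text(AP_STREAM)  # what the viewer paints
diverges = value.strip() != rendered.strip()
print(value, rendered, diverges)
```

Nothing in the file format forces `value` and `rendered` to match; the comparison has to be made explicitly by whoever consumes the document.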

This is called the V/AP problem. It was documented academically in Mainka, Mladenov, Rohlmann, and Schwenk, “Shadow Attacks: Hiding and Replacing Content in Signed PDFs,” NDSS/CCS 2021, which identified three attack classes (Hide, Hide-and-Replace, Replace) exploiting the gap between the signed byte range and what the viewer renders. The disclosure reached 28 PDF viewer vendors and produced patches from Adobe, Foxit, LibreOffice, and others. An earlier research series by Müller, Mladenov, Somorovsky, and Schwenk (2017–2019) at pdf-insecurity.org systematically mapped signature-validation weaknesses across major PDF viewers, establishing the framework that the Shadow Attack work built on. The V/AP structural separation is a consequence of the format’s design, not a defect that can be patched at the specification level.

In operational contexts, this class of discrepancy is directly applicable to invoice fraud and document-centric financial workflows: an automated payment system reads /V while the human reviewer sees what /AP renders, producing a gap between the displayed and processed values that survives signature validation intact. No modification of the signature byte range is required, and no viewer warning is produced.

V/AP Divergence: Two Data Stores, One Signed Byte Range
[Figure: a single PDF document holds /V (machine-readable value, "$12,000.00") and /AP /N (appearance stream rendered as pixels, "$1,200.00"), with no obligation to agree. The digital signature byte range covers both, so the signature is valid regardless of the disagreement. A parser extracts one representation, with no signal that another exists, and passes it to the knowledge base / vector index.]

What AI Ingestion Adds

Many RAG pipelines and LLM training pipelines extract text from PDFs using a single parser — typically pdfminer.six, PyMuPDF, Docling, or a proprietary extractor. These parsers do not universally agree on which data store to read. Some read /V. Some read text operators in /AP. Some do both, depending on the field type and the extraction mode. None are documented as implementing the full five-indicator V/AP divergence check described in the PQPDF V/AP research (NeedAppearances detection, checkbox /V / /AS comparison, AP stream text extraction with /Opt resolution, blank AP detection, missing AP detection). Pipelines that normalize PDFs to plain text before embedding, rasterize pages for OCR, or flatten forms before extraction may avoid some of these failure modes — but those are additional processing steps that most single-pass extraction pipelines do not perform.

The result: a pipeline without explicit V/AP divergence checking will extract one of the two conflicting values with no signal that a conflict exists. The extracted value enters the knowledge base, the vector index, or the training corpus as authoritative fact.

NeedAppearances Compounds the Problem

There is a flag in the AcroForm dictionary called /NeedAppearances. When it is set to true, a conforming viewer is expected to construct appearance streams for the form’s widget annotations from the /V field values when the document is opened, which in practice replaces whatever appearance streams are stored in the file. This is legitimate in programmatic form-fill workflows where appearance generation is deferred for performance reasons; DocuSign uses it, and so do mail-merge pipelines.

When combined with a digital signature, the consequence is structural: the byte-range hash covers the /AP streams stored on disk at signing time. The viewer then regenerates the appearance from /V after opening.

What later viewers render may not be identical to the originally signed appearance state — and the document provides no mechanism to detect this. The signature is still valid. ISO 32000-2 is aware of this behaviour and does not classify it as a specification defect.

For AI ingestion pipelines this creates a specific fork: a parser reading /AP as the display value reads appearance data that was stale at signing time and was never shown to any human reviewer. A parser reading /V reads the field value. Whether those two agree is unknown at ingestion time unless the pipeline explicitly checks.
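
A pipeline can at least flag the combination before deciding which store to trust. The sketch below is a rough illustration, not the PQPDF implementation: the function name and the severity labels are assumptions chosen for this example, and a raw-byte regex will miss an AcroForm dictionary stored inside a compressed object stream.

```python
import re

NEED_AP = re.compile(rb"/NeedAppearances\s+true")
SIG_FIELD = re.compile(rb"/FT\s*/Sig")

def need_appearances_severity(pdf_bytes: bytes) -> str:
    """Classify raw PDF bytes as 'none', 'medium', or 'critical' (labels are illustrative)."""
    if not NEED_AP.search(pdf_bytes):
        return "none"
    # The flag alone is common in form-fill output; with a signature field
    # present, the signed appearance and the rendered one can differ.
    return "critical" if SIG_FIELD.search(pdf_bytes) else "medium"

# Hypothetical fragments standing in for real files.
unsigned_form = b"<< /AcroForm << /NeedAppearances true /Fields [4 0 R] >> >>"
signed_form = unsigned_form + b" 5 0 obj << /FT /Sig /T (Signature1) >> endobj"

print(need_appearances_severity(unsigned_form), need_appearances_severity(signed_form))
```

A "critical" result does not say the values disagree; it says the document is in the state where they can, so both /V and /AP are worth extracting and comparing.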

The PQPDF V/AP research validated static detection across 196 PDFs including 44 IRS tax forms, 102 Corkami adversarial proof-of-concept files, 29 arXiv papers, and 4 federal legislative publications. Detection rate on 8 hand-crafted positive cases: 8/8. False positive rate on 187 negatives: 0/187. These are validation corpus results on a specific hand-crafted test set; they characterize the detector’s behavior on those cases and are not a claim of universal detection effectiveness across the PDF ecosystem.
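
To see why these counts should not be read as ecosystem-wide rates, it helps to ask what 0/187 and 8/8 would bound even under the generous assumption that the files were independent random draws from the population of interest (a hand-crafted corpus is not). The sketch below uses the standard exact binomial bound for the all-or-nothing case; the function names are made up for this example.

```python
def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on a rate when 0 of n independent trials were events."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

def lower_bound_all_events(n: int, confidence: float = 0.95) -> float:
    """One-sided lower bound on a rate when n of n independent trials were events."""
    return (1.0 - confidence) ** (1.0 / n)

fp_upper = upper_bound_zero_events(187)  # false positives: 0 / 187
det_lower = lower_bound_all_events(8)    # detections: 8 / 8
print(f"false-positive rate could be as high as {fp_upper:.1%}")
print(f"detection rate could be as low as {det_lower:.1%}")
```

Under that assumption the 95% bounds come out at about 1.6% for the false-positive rate and about 69% for the detection rate, which is the quantitative version of the caveat above: eight positives is a small sample.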

For comparison: pikepdf detects 2 of the 5 V/AP indicator types. pdfminer.six detects 0. Neither library was built for this, and neither claims to be. The gap is not a criticism of those libraries; it is an accurate statement of what is and is not detected in the standard PDF ingestion toolchain.

Parser Disagreement: Six Parsers, Eleven Files

The V/AP problem concerns disagreement between two data stores within a single file. The parser disagreement problem concerns what happens when different parsers each read a single file and return different answers about what it contains.

The PQPDF parser disagreement research constructed 11 minimal hand-crafted PDFs deliberately targeting structural ambiguities in the PDF specification — areas where the spec is underspecified, contradictory, or leaves parser behaviour to implementer discretion.

Each file was run through six production-grade parsers in parallel: MuPDF (mutool), Poppler (pdfinfo), Ghostscript (nullpage render), qpdf, pdfminer.six, and pdf.js/Node. Under these ambiguity conditions, every file produced at least one confirmed cross-parser disagreement. Seven of eleven triggered critical-severity findings on JavaScript visibility, encryption status, or page count.

This does not mean all normal, well-formed PDFs produce major parser disagreement. It means these specific structural ambiguities, when present in a document, reliably do — and the document reaching an ingestion pipeline has no obligation to announce their presence.
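
Once each parser's answers have been collected, surfacing a disagreement is mechanical. The following is a minimal sketch of that comparison step only; it assumes the per-parser reports have already been gathered into dictionaries, and the report values shown are hypothetical rather than taken from the test corpus.

```python
from collections import defaultdict

def find_disagreements(reports: dict) -> dict:
    """Group parsers by the value each reported for every dimension.

    Returns only the dimensions where more than one distinct value was reported.
    """
    by_dim = defaultdict(lambda: defaultdict(list))
    for parser, dims in reports.items():
        for dim, value in dims.items():
            by_dim[dim][value].append(parser)
    return {dim: dict(groups) for dim, groups in by_dim.items() if len(groups) > 1}

# Hypothetical reports for one file (illustrative values only).
reports = {
    "mupdf": {"javascript": False, "pages": 1},
    "ghostscript": {"javascript": False, "pages": 1},
    "poppler": {"javascript": True, "pages": 1},
    "pdfminer": {"javascript": True, "pages": 1},
}
conflicts = find_disagreements(reports)
print(conflicts)
```

A single-parser pipeline never builds the `reports` structure in the first place, which is why it cannot see the conflict.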

Selected results:

JavaScript in a compressed Names tree (686 bytes).

The file places a JavaScript action in the /Names/JavaScript catalog tree, payload compressed via FlateDecode. MuPDF and Ghostscript report no JavaScript. Poppler, pdfminer, and pdf.js/Node find it through three independent code paths. A scanner or ingestion pipeline built on MuPDF or Ghostscript as the sole parser would pass this file as JavaScript-free. The PQPDF scanner scored this file 528 out of 1000 (dangerous threshold ≥500); scores are weighted heuristic aggregates across analysis passes, not CVSS-equivalent severity ratings. The remaining 43 analysis modules independently confirmed JavaScript presence via object analysis and YARA.

Orphan JavaScript via incremental update (798 bytes).

A base document is clean. An incremental update redefines an object as a JavaScript action, but the action is not referenced from the document tree: it exists in the xref-indexed object space but is unreachable by tree traversal. All five structural parsers (MuPDF, Poppler, Ghostscript, qpdf, pdfminer) report no JavaScript because they traverse the document tree from /Root and never reach the orphan object. Only the raw-byte regex scanner (pdf.js/Node) finds it, because it scans the raw file bytes. Scanner score: 688 out of 1000 (weighted heuristic aggregate across all analysis modules; critical threshold ≥600).
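
The difference between the two strategies can be shown on a toy file. This is a deliberately simplified sketch: the sample bytes are hypothetical, objects are found by regex rather than by reading a cross-reference table, and object streams are ignored. It only illustrates why traversal from /Root and a raw scan give different answers.

```python
import re

OBJ = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj", re.S)
REF = re.compile(rb"(\d+)\s+\d+\s+R\b")
ROOT = re.compile(rb"/Root\s+(\d+)\s+\d+\s+R")

# Hypothetical file: a clean base revision, then an incremental update that
# redefines object 4 as a JavaScript action nothing in the tree points to.
SAMPLE = (
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n"
    b"3 0 obj << /Type /Page /Parent 2 0 R >> endobj\n"
    b"4 0 obj << /Note (unused) >> endobj\n"
    b"trailer << /Size 5 /Root 1 0 R >>\n%%EOF\n"
    b"4 0 obj << /S /JavaScript /JS (void 0) >> endobj\n"
    b"trailer << /Size 5 /Root 1 0 R >>\n%%EOF\n"
)

def load_objects(data: bytes) -> dict:
    """Map object number to body; later (incremental) definitions override earlier ones."""
    return {int(m.group(1)): m.group(2) for m in OBJ.finditer(data)}

def reachable_from_root(data: bytes, objects: dict) -> set:
    """Follow indirect references starting at the /Root named by the last trailer."""
    stack = [int(ROOT.findall(data)[-1])]
    seen = set()
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(int(r) for r in REF.findall(objects[num]))
    return seen

def orphan_javascript(data: bytes) -> list:
    """Object numbers that mention /JavaScript but are unreachable from /Root."""
    objects = load_objects(data)
    seen = reachable_from_root(data, objects)
    return sorted(n for n, body in objects.items() if b"/JavaScript" in body and n not in seen)

tree_sees_js = any(b"/JavaScript" in body for n, body in load_objects(SAMPLE).items()
                   if n in reachable_from_root(SAMPLE, load_objects(SAMPLE)))
raw_sees_js = b"/JavaScript" in SAMPLE
print(tree_sees_js, raw_sees_js, orphan_javascript(SAMPLE))
```

Neither answer is wrong on its own terms. The tree walk describes what a viewer would act on; the raw scan describes what bytes are present. A pipeline that wants both has to run both.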

Null encryption dictionary (501 bytes).

The trailer contains an /Encrypt entry with /V 0 (an algorithm the spec describes as “undocumented and no longer supported”). The file content is unencrypted. MuPDF, qpdf, and pdf.js report the file as encrypted. Poppler opens it successfully and reports it as not encrypted. This is an obscure edge case — /V 0 is a legacy artefact that appears rarely in real documents. Its relevance here is as a demonstration of the parser disagreement pattern, not a claim that encryption detection is broadly unreliable. The practical consequence is specific: scanners that classify files as encrypted before inspecting content may skip that inspection, and parsers disagree on whether this file qualifies.

AcroForm invisible to two of three parsers (582 bytes).

A base document has no form. An incremental update appends an AcroForm. Poppler reads the updated Catalog and correctly reports an AcroForm. MuPDF and pdfminer do not surface the incremental AcroForm. A DLP system using MuPDF would report no form on a document that contains an AcroForm auto-submit exfiltration action in the update body.

False page count via keyword injection (460 bytes).

The content stream body contains /Count 99 before the real /Pages dictionary, which specifies /Count 1. Five structural parsers read the xref, find the /Pages object, and report 1 page. The raw-byte regex scanner reads the first /Count \d+ match in file order and reports 99 pages. This is the inverse failure mode: the raw-byte scanner reports data that no structural parser accepts, producing a false positive that could exhaust downstream systems allocating resources proportional to page count.
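
The two readings can be reproduced on a few hypothetical bytes. In this sketch the "structural" reader is only a stand-in (it locates the dictionary typed /Pages by regex instead of following the xref table from /Root), but it is enough to show why first-match scanning and dictionary-scoped lookup return different counts.

```python
import re

# Hypothetical fragment: a decoy /Count inside a stream precedes the real /Pages dictionary.
SAMPLE = (
    b"4 0 obj << /Length 18 >> stream\n"
    b"% /Count 99 decoy\n"
    b"endstream endobj\n"
    b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n"
)

def raw_page_count(data: bytes):
    """First '/Count N' in file order, regardless of where it appears."""
    m = re.search(rb"/Count\s+(\d+)", data)
    return int(m.group(1)) if m else None

def structural_page_count(data: bytes):
    """'/Count N' read only from inside the dictionary whose /Type is /Pages."""
    pages = re.search(rb"<<[^<>]*/Type\s*/Pages\b[^<>]*>>", data)
    if not pages:
        return None
    m = re.search(rb"/Count\s+(\d+)", pages.group(0))
    return int(m.group(1)) if m else None

print(raw_page_count(SAMPLE), structural_page_count(SAMPLE))
```

Any downstream step that sizes buffers or schedules work from the raw figure is trusting attacker-controlled text.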

These are not exotic attack scenarios requiring specialized exploit development. Each file is between 349 and 798 bytes, built from raw bytes without any PDF library, targeting ambiguities that exist in the PDF specification itself. ISO 32000-1 and ISO 32000-2 leave the following behaviors underspecified or contradictory: the precedence of /Count versus structural traversal; the semantics of /V 0 encryption dictionaries; the minimum incremental update chain depth a conforming reader must process for metadata extraction; and the scope of /Version catalog overrides in feature-gating contexts.

Methodology and Reproducibility

The claims in this article derive from two primary research pages, each of which publishes full methodology. This section summarises the key details; full commands, raw scanner output, and individual file results are in the linked sources.

Parser Disagreement: File Construction and Parser Commands

The 11 test PDFs were written by hand in Python without calling any PDF library. Each file was built from raw bytes with byte-accurate cross-reference tables, targeting one ambiguity per file. Files range from 349 to 798 bytes. No fonts, no images, no embedded content beyond the specific structural feature under test. All six parsers ran on the same Linux server against the same file path with no network calls and no sandboxing differences between parsers. The only variable is the parser.

| Parser | Command / invocation | Dimensions extracted |
|---|---|---|
| MuPDF | mutool info + mutool show xref + mutool show trailer | pages, objects, version, JS, AcroForm, OpenAction, encryption |
| Poppler | pdfinfo + pdfdetach -list | pages, version, JS, encryption, form type, embedded files |
| Ghostscript | gs -sDEVICE=nullpage (render pass) | pages rendered, JS triggered, OpenAction, Launch actions |
| qpdf | qpdf --show-npages, --show-encryption, --check | pages, encryption, linearization, version, structural integrity |
| pdfminer.six | Python subprocess (pdfminer.six) | pages, encryption, JS (Names tree), OpenAction, AcroForm |
| pdf.js / Node | node -e (raw byte regex scan) | pages (/Count), JS (/S /JavaScript), OpenAction, encryption |

Full per-file results including verbatim scanner output: pdf-parser-disagreement.php

V/AP Detection: The Five Checks

All five checks operate on the raw PDF object model without rendering or OCR. They are implemented in the PQPDF scanner’s AcroForm field analysis module and their findings feed into a weighted multi-module correlation layer alongside signature forensics and behavioural sandbox results.

  1. /NeedAppearances detection — regex match against the AcroForm dictionary. Presence alone is medium severity; presence combined with a digital signature (/Sig field) escalates to critical.
  2. Checkbox / radio /V vs /AS comparison — a pure string comparison of two name objects in the widget annotation dictionary. If /V /Yes and /AS /Off, the field appears unchecked to the viewer while the machine-readable value is checked.
  3. AP stream text extraction — decompress the /AP /N content stream; parse Tj/TJ operators; PDF-unescape and whitespace-normalise; compare to /V with hex-string decoding (bytes.fromhex(), UTF-16BE BOM detection) and /Opt export-value resolution for listbox / combobox fields.
  4. Blank AP stream detection — the /AP /N content stream decompresses to an empty or whitespace-only byte sequence; the field is covered by any signature but renders blank to the viewer.
  5. Missing AP detection — the widget annotation has no /AP key at all; the viewer must synthesise a default appearance, which may not match /V.
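
Checks 2, 4, and 5, plus the hex-string decoding used in check 3, are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the scanner's implementation: the widget is modelled as a plain Python dict with hypothetical keys ("FT", "V", "AS", and "AP" holding Flate-compressed stream bytes), the finding labels are invented for this example, and Latin-1 stands in for PDFDocEncoding.

```python
import zlib

def decode_pdf_hex_string(hex_body: str) -> str:
    """Decode the inside of a <...> hex string, honouring a UTF-16BE byte-order mark."""
    raw = bytes.fromhex(hex_body)  # ASCII whitespace between digits is ignored
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")  # simplification of PDFDocEncoding

def widget_indicators(widget: dict) -> list:
    """Return illustrative finding labels for one widget annotation."""
    findings = []
    ap = widget.get("AP")
    if ap is None:
        findings.append("missing-ap")    # viewer must synthesise an appearance
    elif not zlib.decompress(ap).strip():
        findings.append("blank-ap")      # stream is empty or whitespace-only
    if widget.get("FT") == "Btn" and "AS" in widget and widget.get("V") != widget["AS"]:
        findings.append("checkbox-v-as-mismatch")  # e.g. /V /Yes with /AS /Off
    return findings

checkbox = {"FT": "Btn", "V": "Yes", "AS": "Off", "AP": zlib.compress(b"q Q")}
print(decode_pdf_hex_string("FEFF00310032"), widget_indicators(checkbox))
```

Keeping each check as an independent finding, instead of collapsing them into a single pass/fail, lets a later correlation step weigh them differently for signed and unsigned documents.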

V/AP Validation Corpus

196 PDFs were submitted to the scanner across six categories. All predictions for positive test files were stated before scanning. One scan error occurred (blank-AP edge case); it is not counted in the denominator.

| Category | Files | Prediction | Result |
|---|---|---|---|
| Structural V/AP positive cases (hand-crafted) | 8 scanned | V/AP indicator should fire | 8 / 8 (100%) |
| Evasion: hex-encoded /V | 1 | Hex-decode handles this; should detect | Detected [HIGH] Value/Appearance Mismatch |
| Evasion: Unicode confusable in /V (Cyrillic а / Latin a) | 1 | Byte-level comparison catches it | Detected [HIGH] Value/Appearance Mismatch |
| Evasion: font encoding remap (/Differences swapping digit glyph) | 1 | Font glyph table now parsed; should detect | Detected: rendered 9200.00 vs /V 1200.00 |
| Hand-crafted clean controls (text, checkbox, listbox) | 3 | No V/AP indicator | 0 / 3 false positives |
| Tool-generated clean PDFs (qpdf, pdflatex, wkhtmltopdf) | 3 | No V/AP indicator | 0 / 3 false positives |
| IRS tax forms: 44 real AcroForm documents. W-9, W-4, 1040, 941, 1120, 1065, 433-A, 1099-NEC, and 37 others. Real JavaScript, embedded files, XFA. Overall risk scores 328–486; zero V/AP indicators. | 44 | No V/AP indicator | 0 / 44 false positives |
| US agency forms (VA-10091, VA-40-1330) | 2 | No V/AP indicator | 0 / 2 false positives |
| US federal legislation (GovInfo PDFs): Infrastructure Investment and Jobs Act, Consolidated Appropriations Act 2021, Tax Cuts and Jobs Act 2017, CARES Act. | 4 | No V/AP indicator | 0 / 4 false positives |
| Academic papers from arXiv: 2023–2025 papers across CS, physics, and mathematics. Standard pdflatex output; no AcroForms. | 29 | No V/AP indicator | 0 / 29 false positives |
| Corkami PDF PoC adversarial files: deliberately malformed or structurally unusual PDFs (truncated xrefs, version mismatches, orphaned objects, compressed object streams, JS obfuscation, signature edge cases, encoding tricks). | 102 | No V/AP indicator | 0 / 102 false positives |

| Metric | Value | Scope note |
|---|---|---|
| V/AP detection rate | 8 / 8 (100%) | All 8 scanned positive cases detected; 1 scan error not counted in denominator; 0 false negatives after font-encoding-remap fix |
| False-positive rate | 0 / 187 (0.00%) | Across 44 IRS forms, 2 agency forms, 4 federal publications, 29 arXiv papers, 102 Corkami PoC files, 6 hand-crafted and tool-generated clean controls |
| Confirmed false negatives | 0 | All hand-crafted evasion attempts detected after implementation fixes |

These are validation corpus results on a specific hand-crafted test set with a known-positive class. They characterise detector behaviour on those cases and are not a claim of universal detection effectiveness across the broader PDF ecosystem. Corpus and generation scripts are available from the authors on request. Full methodology including exact scoring weights, AcroForm module implementation detail, and per-file verbatim output: pdf-form-security.php

What This Means for AI Pipelines Specifically

When a RAG pipeline ingests a PDF on which parsers disagree about page count, the knowledge base can end up holding different content from what a human reader sees in their viewer. Test 2 in the PQPDF corpus demonstrates this directly: four parsers report 3 pages, two report 2. The one-page gap is not benign if the missing page contains material relevant to a query the RAG system will later answer.

Incremental update attacks (demonstrated in Tests 5, 6, and 9 of the PQPDF research) are a known vector for document fraud: a signed PDF is modified via an incremental update without invalidating the digital signature. If the ingestion tool reads only the base revision, it indexes the original unmodified content as authoritative, even though the document has been altered post-signing.
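
A cheap first signal that a file carries more than one revision is the number of %%EOF markers, since each incremental update appends its own. The sketch below is a heuristic under stated assumptions, not a revision parser: the marker can also occur inside stream data, linearized files can contain more than one without any update, and the sample bytes are hypothetical.

```python
EOF_MARKER = b"%%EOF"

def revision_boundaries(data: bytes) -> list:
    """End offset of every %%EOF marker; more than one suggests incremental updates."""
    ends, pos = [], 0
    while (idx := data.find(EOF_MARKER, pos)) != -1:
        pos = idx + len(EOF_MARKER)
        ends.append(pos)
    return ends

def base_revision(data: bytes) -> bytes:
    """Bytes up to and including the first %%EOF (the whole input if none is found)."""
    ends = revision_boundaries(data)
    return data[:ends[0]] if ends else data

# Hypothetical two-revision file.
sample = b"%PDF-1.7\n1 0 obj << >> endobj\n%%EOF\n1 0 obj << /Changed true >> endobj\n%%EOF\n"
print(len(revision_boundaries(sample)), b"/Changed" in base_revision(sample))
```

Recording the revision count alongside the extracted text lets a later stage decide whether the base and final revisions need to be extracted and compared separately.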

Patent documents frequently combine scanned page images with a selectable-text overlay added by the patent office OCR pipeline. Different parsers diverge on whether to read the overlay text, attempt independent OCR on the image layer, or report no text at all. The same patent can produce thousands of words from one extractor and zero from another.

Multi-column scientific papers produce a different failure: parsers extract text in different reading orders, yielding different token sequences from the same document and delivering different context to the LLM.

Legal contracts with redactions expose a different failure class. Some redaction tools draw visual rectangles over text objects without removing the underlying text from the PDF structure. A parser reading the visual layer sees a redacted document. A parser reading the object tree recovers the original unredacted text. This has affected court filings in publicized cases and applies directly to any legal knowledge base populated by PDF ingestion.

The Core Architectural Mismatch

PDF is a rendering format. It was designed to describe what a document should look like on screen or in print. It was not designed as a semantic truth format — a canonical, unambiguous representation of document content that machines can extract and reason over with confidence.

Pipelines that rely on a single extraction pass without cross-parser validation treat it as if it were. That single-interpretation assumption breaks whenever the document contains:

  • V/AP divergence in AcroForm fields
  • /NeedAppearances true combined with signed appearance streams
  • Incremental updates that different parsers resolve differently
  • Encrypted-looking structures that are not actually encrypted
  • Redacted text that remains in the object tree
  • Multi-layer content where parsers choose different layers
  • Compressed streams that some parsers decompress and others do not

The failure is not that parsers are buggy. The failure is that the format permits multiple coexisting representations of the same content, and AI extraction pipelines inherit whichever representation their chosen parser returns — without knowing that other representations exist.

Ambiguity at the parser layer becomes authority at the embedding layer. A wrong number extracted from a financial form enters the vector index as a fact. A missing page produces an incomplete summary. A hidden text layer that only some parsers surface produces context that was never visible to a human reviewer of the document. The downstream LLM has no way to distinguish any of these from correctly extracted content, and operates on all of them with the same confidence.

Ingestion Corruption vs. Adversarial Injection: Two Distinct Problems

These failure modes need precise labels because the defenses and the evidence base are different for each.

Accidental ingestion corruption is the dominant case and requires no attacker. It happens when a knowledge base ingests a PDF corpus using a single parser without V/AP or parser-disagreement checks. The ingested values may be wrong not because anyone intended them to be, but because the extraction layer resolved a structural ambiguity differently than the document’s author intended or than the display layer renders. The downstream LLM receives incorrect facts, operates on them, and produces incorrect outputs with normal confidence. Developer reports consistently describe this class of problem: extraction inconsistency, layout destruction across columns, OCR divergence on scanned overlays, and page count mismatches even without any malicious input. This is silent extraction failure at scale, not adversarial attack.

Adversarial hidden-context injection is a structurally grounded capability, but evidence for its active exploitation specifically via PDF structural mechanisms in LLM pipelines remains limited in public disclosures. The parser disagreement research identifies the mechanism: a PDF where the visual rendering layer shows benign content while the extraction layer returns additional or different content can deliver natural-language instruction text to an LLM’s context window that never appeared in any human reviewer’s inspection of the document. No exploit code is required — only knowledge of the specification and which parser makes which choice at each structural ambiguity. The adversarial content can be embedded in a compressed stream, an incremental update, or a content layer that only some parsers surface.

Research confirms document-level injection is an active area broadly: a November 2025 preprint documented that five carefully crafted documents can manipulate AI responses 90% of the time through RAG document injection (generic document manipulation, not PDF-structural specifically).[11] The MDPI paper on prompt injection in LLMs (January 2026) documents that adversarial documents can be positioned to match target queries while containing malicious content invisible to text-based inspection.[10] The V/AP mechanism adds a specific subclass: /V contains adversarial content while /AP renders legitimate-looking content to human reviewers. Whether this specific subclass is being actively exploited in production pipelines is not currently documented in public disclosures.

The accidental corruption case is the more prevalent problem today and requires no adversary. The adversarial injection case is a real structural capability that requires an attacker with specific knowledge of the target pipeline’s parser choices. Both are real. They require different defenses. Conflating them overstates the adversarial threat while underselling the much larger accidental-corruption problem that affects every single-parser ingestion pipeline right now.

Growth Projections and Trajectory

The scale of PDF ingestion is not decreasing. The AI training dataset market is projected to grow from $3.2 billion in 2025 to $16.3 billion by 2033 (Grand View Research[7] — industry analyst forecast). Text datasets represent the largest segment within that market: $1.29 billion measured in 2025, projected to $2.12 billion by 2028 (ibid.).

FinePDFs (3 trillion tokens from 475 million documents) is a public dataset with a documented release date and verifiable size.[6] Production training pipelines at major labs are not public. The FinePDFs scale provides a reference point for the lower bound of PDF-token ingestion at public-dataset scale.

RAG deployment numbers come from multiple sources of varying independence. Vectara, a RAG platform vendor, reported enterprises choosing RAG for 30–60% of high-accuracy use cases (January 2025[8] — vendor prediction). Grand View Research measured the document retrieval segment at 32.4% of global RAG revenue in 2024.[7] Menlo Ventures, a venture capital firm, reported enterprise generative AI spending at $37 billion in 2025, up 3.2x from $11.5 billion in 2024, based on an enterprise survey.[9] These figures are self-reported or analyst-estimated; they are included as directional context, not verified measurements.

In that context, the PDF structural problems described above are not niche edge cases. They are present in the dominant document format feeding the dominant enterprise AI architecture during a period of 3x annual spending growth. The extraction layer is not keeping pace with the deployment layer.

What the Spec Does and Does Not Address

ISO 32000-2:2020 is aware of the V/AP structural separation. The spec does not classify it as a defect. The /NeedAppearances behavior — where signed bytes cover stale appearance streams and viewers regenerate from /V at open time — is documented as a consequence of the format’s design.

The incremental update processing ambiguities (how deep must a conforming reader follow the xref chain; what constitutes a permitted modification under DocMDP P=2) are not fully resolved in the current specification. The parser disagreement research specifically identifies four areas where tightened normative language in ISO 32000 and ISO 19005 would reduce parser divergence: the precedence of /Count versus structural traversal; the semantics of /V 0 encryption dictionaries; the minimum update-chain depth for metadata extraction; and the normative treatment of /Version catalog overrides in feature-gating contexts.

PDF/A (ISO 19005) mandates that conforming files be renderable consistently across readers. Parser disagreement on page count or document structure challenges the interoperability guarantees PDF/A aims to provide. A PDF/A archival tool validating a file using MuPDF may report conformance while a Poppler-based reader renders additional content from an incremental update that MuPDF does not surface. PDF/UA (ISO 14289) accessibility compliance similarly becomes parser-dependent rather than document-dependent when the underlying parsers disagree on content.

Scope Note

The V/AP and parser-disagreement problems are not the dominant PDF threat class in terms of raw volume. Most malicious PDFs observed in the wild are credential-harvesting overlays and social engineering lures that do not touch AcroForm field structure at all. Those are URL reputation and static JavaScript problems. V/AP divergence matters for signed documents carrying legal or financial authority — contracts, invoices, regulatory filings — where the gap between what a human sees and what a machine processes has direct operational consequences. Parser disagreement matters for any single-parser ingestion pipeline operating on a document corpus where content accuracy is required.

Both classes are real. They require different tools. The same pipeline cannot address them with the same check.

References

Source types are labelled: peer-reviewed, industry analyst forecast, vendor report/prediction, VC survey, specification, or primary research. Market forecasts are analyst projections, not independently verified measurements.

  1. Mainka, C., Mladenov, V., Rohlmann, S., Schwenk, J. “Shadow Attacks: Hiding and Replacing Content in Signed PDFs.” NDSS Symposium 2021 / ACM CCS 2021. (peer-reviewed)
  2. Müller, J., Mladenov, V., Somorovsky, J., Schwenk, J. PDF Insecurity series, 2017–2019. pdf-insecurity.org (peer-reviewed / disclosed research)
  3. ISO 32000-2:2020. Document management — Portable Document Format — Part 2: PDF 2.0. §12.8.2.2 (DocMDP), §12.7.3 (AcroForm), §12.7.4 (Field types and /V semantics). (specification)
  4. PQ PDF. “PDF Forms as Executable Security Boundaries: V/AP Divergence, DocMDP, and What Gets Certified.” Published 24 May 2026. pqpdf.com/pdf-form-security.php (primary research — this site)
  5. PQ PDF. “PDF Parser Disagreement: Six Parsers, Eleven Divergences.” Published 20 May 2026. pqpdf.com/pdf-parser-disagreement.php (primary research — this site)
  6. Hugging Face. FinePDFs dataset. 475 million documents, 3 trillion tokens. Released September 2025. huggingface.co/datasets/HuggingFaceFW/finepdfs (public dataset — verifiable)
  7. Grand View Research. “AI Training Dataset Market Size, Share & Trends Analysis Report, 2026–2033.” CAGR 22.6%, projected $16.32 billion by 2033. (industry analyst forecast — paywalled; independent verification not available)
  8. Vectara. “Enterprise RAG Predictions for 2025.” January 2025. vectara.com/blog/top-enterprise-rag-predictions (vendor report — Vectara is a RAG platform; treat as industry prediction, not independent measurement)
  9. Menlo Ventures. Enterprise generative AI spending survey, 2025. (VC-firm enterprise survey — methodology not independently verified)
  10. Prompt Injection Attacks in Large Language Models and AI Agent Systems: A Comprehensive Review. MDPI Information, vol. 17, issue 1. Published January 2026. mdpi.com/2078-2489/17/1/54 (peer-reviewed survey)
  11. Preprints.org. “Prompt Injection Attacks in Large Language Models.” Posted November 2025. preprints.org/manuscript/202511.0088/v1 (preprint — not peer-reviewed at time of citation)
  12. CVE-2021-28550 (APSB21-29): AcroForm + getField/setFocus JavaScript, use-after-free, CVSS 8.8. (NVD / Adobe advisory)
  13. CVE-2021-21017 (APSB21-09): XFA heap buffer overflow, exploited in the wild, CVSS 8.8. (NVD / Adobe advisory)
  14. CVE-2024-45112 (APSB24-70): XFA/AcroForm type confusion, CVSS 8.6. (NVD / Adobe advisory)
  15. CVE-2023-21608 (APSB23-01): Use-after-free via AcroForm event.target JavaScript, CVSS 7.8. (NVD / Adobe advisory)
  16. MITRE ATT&CK T1566.001: Phishing: Spearphishing Attachment. attack.mitre.org/techniques/T1566/001/
  17. Stevens, D. PDF analysis tools: pdfid.py, pdf-parser.py. blog.didierstevens.com/programs/pdf-tools/
  18. McKinsey & Company. “The state of AI: How organizations are rewiring to capture value.” McKinsey Global Survey, 2025. (consulting-firm survey — methodology available on request from McKinsey; 71% figure from the 2025 edition of their annual AI adoption survey)

Scan Your PDFs Before They Enter the Pipeline

The PQ PDF Forensic Scanner runs V/AP divergence detection, differential multi-parser analysis, behavioral sandbox, YARA, and a 44-module correlation layer on every upload — no configuration required. The 44 modules span structural integrity checks, entropy analysis, YARA rule sets, heuristic scoring passes, and cross-parser disagreement detection; they are not 44 separate rendering engines. See exactly what each parser reports, where they disagree, and what divergence means for document authenticity and ingestion integrity.

→ PDF Forensics Scanner — Multi-Parser, 44-Module Analysis, Free

→ Full V/AP Divergence Methodology — DocMDP, FieldMDP, NeedAppearances

→ Parser Disagreement Research — Six Parsers, Eleven Divergences

No account. No file retention. Differential parsing, V/AP divergence detection, behavioral sandbox, YARA, ClamAV, offline threat intelligence (6.4M+ indicators), and AI synthesis — all running on the same file in parallel.



© 2026 PQ PDF — All rights reserved.
