AI Document Integrity

Your AI may read a different PDF than your users do.

Parser disagreement, OCR drift, hidden layers and semantic divergence silently corrupt RAG pipelines and training data. Measure the gap before it reaches production.

Built for researchers, security teams and AI engineers investigating document integrity at scale.

Test your PDFs now → | Read the research

24,824 PDFs analyzed

1 in 3: parser disagreement

18.6%: DOJ corpus semantic drift

The problem hides in plain sight

A PDF isn't one document. The page a person reads and the text a machine extracts are not guaranteed to match — and your AI ingests the machine's version. When they differ, the model learns, retrieves and answers from a version of the document no human ever saw.

RAG poisoning

Retrieval pipelines index extracted text. If extraction diverges from the rendered page, your assistant cites content that isn't there — confidently.

Corrupted training data

Fine-tuning on parsed PDFs bakes in extraction errors, hidden layers and reading-order scrambles at scale — invisibly.

Compliance & e-discovery

When the value stored differs from the value shown — on signed forms, contracts and filings — automated review reaches the wrong conclusion.

Silent, not loud

None of this throws an error. It degrades answer quality and audit integrity quietly, until someone downstream is wrong and can't say why.

One measured finding

18.6%

Across the 16,971-PDF DOJ Epstein release, 18.6% of files read differently to a machine than to a human — the extracted text layer diverged from the rendered page.

In an adversarial corpus, 502 of 1,572 PDFs (~1 in 3) produced materially different results across parsers. Among IRS tax forms, 43 of 44 exhibited semantic drift.
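
The rates quoted above follow directly from the stated counts. A minimal sketch of the arithmetic (the helper name `rate_percent` is ours, for illustration only):

```python
def rate_percent(affected: int, total: int) -> float:
    """Share of affected files as a percentage, rounded to one decimal place."""
    return round(affected / total * 100, 1)


# Counts as stated on this page.
adversarial_rate = rate_percent(502, 1572)  # parser disagreement, adversarial corpus
irs_rate = rate_percent(43, 44)             # semantic drift, IRS tax forms

print(f"Adversarial corpus: {adversarial_rate}% of PDFs")
print(f"IRS tax forms: {irs_rate}% of forms")
```

This puts the adversarial corpus at 31.9%, which is where "about 1 in 3" comes from, and the IRS forms at 97.7%.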

See the methodology and corpus →

See it on your own documents

The same engine behind the research runs as a live scanner — 47 forensic engines that measure parser disagreement, value-vs-appearance drift, hidden layers and OCR-vs-render divergence. Upload a PDF and see exactly where machine and human readings split. Zero retention: files are deleted immediately after analysis. No external APIs, no third-party processing — analysis runs entirely within the PQ PDF environment.

Upload a PDF & measure the gap → | Browse all research

Who uses this?

RAG and AI ingestion pipelines: Verify that models ingest the same document users read.
Document AI and OCR systems: Detect hidden layers, OCR drift and semantic mismatch.
Security teams: Investigate parser confusion, hidden content and malicious PDFs.
Legal and compliance teams: Validate document integrity and value-vs-appearance consistency.
Researchers: Study parser disagreement and document semantics at scale.

Frequently asked

What is AI document integrity?

AI document integrity is the degree to which the text your AI ingests from a document matches what a human actually sees on the page. When the two diverge, models retrieve, learn and answer from content no human read.

How is this different from OCR accuracy?

OCR error is only one source of drift. We also measure parser disagreement, hidden layers, reading-order scrambles and value-vs-appearance divergence — the gaps OCR metrics miss.

Do my documents leave your servers?

No. Analysis runs entirely within the PQ PDF environment — no external APIs, no third-party processing, and zero retention: files are deleted immediately after analysis.

What does it actually measure?

Parser disagreement across six independent parsers, OCR-vs-render divergence, hidden or invisible layers, reading order, and value-vs-appearance (V/AP) mismatches — pinpointing where machine and human readings split.
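
To make "parser disagreement" concrete, here is a minimal sketch of one way to score it. It is not the scanner's actual method: the parser names, the sample strings, the 0.9 threshold and the use of Python's `difflib` similarity ratio are all assumptions chosen for illustration. The idea is to extract text from the same PDF with several parsers, normalize away cosmetic differences, and flag the file when any pair of outputs falls below a similarity threshold.

```python
import difflib
import itertools
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms, whitespace runs and case."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def pairwise_similarity(extractions: dict[str, str]) -> dict[tuple[str, str], float]:
    """Return a similarity ratio in [0, 1] for every pair of parser outputs."""
    cleaned = {name: normalize(text) for name, text in extractions.items()}
    return {
        (a, b): difflib.SequenceMatcher(
            None, cleaned[a], cleaned[b], autojunk=False
        ).ratio()
        for a, b in itertools.combinations(sorted(cleaned), 2)
    }


def parsers_disagree(extractions: dict[str, str], threshold: float = 0.9) -> bool:
    """True if any pair of outputs is less similar than the (assumed) threshold."""
    return any(score < threshold for score in pairwise_similarity(extractions).values())


# Hypothetical outputs from three parsers reading the same page.
sample = {
    "parser_a": "Total due: $1,200.00",
    "parser_b": "Total  due:  $1,200.00",         # differs only in spacing
    "parser_c": "Total due: $9,999.00 (hidden)",  # picked up a different value
}

for pair, score in pairwise_similarity(sample).items():
    print(pair, round(score, 3))
print("disagreement:", parsers_disagree(sample))
```

In this sample, `parser_a` and `parser_b` score as identical once whitespace is normalized, while `parser_c` falls below the threshold, so the file is flagged. A character-level ratio is a blunt instrument: it cannot tell a harmless reading-order change from a changed dollar amount, which is why the measurements listed above go beyond plain text comparison.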

Can I run it on my own corpus or at scale?

Yes. The scanner runs on individual files now, and the engine can be integrated or licensed for batch and pipeline use. Contact us for a demo or licensing.

Measure the gap before it reaches production.

For teams running RAG, document AI or LLM training at scale — let's talk about checking your corpus, integrating the engine, or licensing the technology.

Schedule a demo | Contact Allan directly | License the technology