Does the scanner send my file to external services?

No. All 47 engines run entirely on pqpdf.com infrastructure. The threat intelligence engine queries four local PostgreSQL databases (URLhaus, MalwareBazaar, ThreatFox, FeodoTracker + OpenPhish — 6.4M+ indicators) updated from public feeds — no hash, URL, or byte leaves the server. ClamAV runs against a local signature database. The AI report uses a self-hosted Qwen 2.5 model on a private GPU server. Zero external API calls.

How does the risk score work?

Every finding is classified onto one of four forensic axes — exploit (code execution / malware delivery), tampering (signature forgery, shadow documents), deception (content/semantic-determinism manipulation: value-vs-appearance divergence, glyph remapping, OCR poisoning), and structural/informational (neutral modern-PDF features, never counted as a threat). The headline Threat Score = exploit + tampering and drives the verdict; each indicator adds Critical (50 pts), High (25 pts), Medium (10 pts) or Low (3 pts), capped at 3 occurrences, plus Correlation Engine bonuses (35–100) for dangerous combinations. Because this is a full forensics tool, a confirmed deception finding grades the verdict on its own axis even at threat score zero. Bands: 0 = Clean, 1–29 = Low, 30–149 = Suspicious, 150–349 = High Risk, 350+ = Dangerous.

Can the scanner clean or sanitize a malicious PDF?

Yes. After scanning, the Sanitize panel offers 9 removal methods: flatten to images (removes all active content by rasterizing pages), strip JavaScript, remove embedded files, remove annotations, remove XFA forms, remove metadata, linearize, repair structure, and PDF/A conversion. The flatten method is the most aggressive — it produces a pixel-for-pixel image PDF with zero active content.

How does PQ PDF compare to Hybrid Analysis or ANY.RUN for PDF scanning?

Hybrid Analysis and ANY.RUN are general-purpose dynamic sandboxes — they observe runtime behavior and map it to MITRE ATT&CK. PQ PDF does both of those things too, using six independent PDF renderers with full syscall capture and every indicator mapped to ATT&CK technique IDs. Where PQ PDF goes significantly further: 46 additional engines that run in parallel, covering PDF structural forensics, JavaScript AST deobfuscation, XFA FormCalc emulation, AcroForm field exfiltration analysis, signature shadow attack detection, six-parser differential comparison, ML anomaly detection, AI ingestion integrity checks (reading order, OCR text layer, accessibility tree forensics), and 9-method sanitization — none of which those platforms perform. The only thing they offer that PQ PDF does not is a community reputation database for 'has this file been seen before.' PQ PDF also deletes submitted files immediately; Hybrid Analysis stores them.

No ads. No tracking. No data sold. Ever.

🛡️ Security Guide

PDF Malware Scanner
47 Forensic Engines. Free. Online.

Q: Can VirusTotal detect all malicious PDFs?

VirusTotal's 70+ antivirus engines are signature-based — they identify threats that have already been catalogued. A zero-day PDF exploit, a freshly obfuscated JavaScript shellcode loader, or a novel XFA FormCalc attack will not match any signature and will pass through clean. PQ PDF detects unknown threats through engines that don't require prior examples: the behavioral sandbox catches what the PDF does at runtime (network beaconing, shell spawning, executable memory allocation) regardless of whether the payload is known; the ML IsolationForest flags structurally anomalous files without needing a labelled training example; differential parsing exposes hidden objects by comparing six independent parsers — a discrepancy is suspicious whether or not the hidden content matches any signature. VirusTotal answers 'have we seen this before?' PQ PDF answers 'what does this PDF actually do?' — including for files no scanner has ever encountered.

Q: Is it safe to upload a potentially malicious PDF?

Yes. Every uploaded file is processed in a four-layer isolated environment: prlimit resource limits, AppArmor MAC policy (pqpdf-unshare profile), Linux user+mount+network+PID namespaces, and a private tmpfs mount. The behavioral sandbox runs inside an additional nested namespace. The file is deleted immediately after analysis — pqpdf.com retains no copy, hash, or metadata. No file data is sent to any external service.

Q: What types of PDF threats does the scanner detect?

The scanner detects: malicious JavaScript (eval+unescape, obfuscated payloads, AST-deobfuscated shellcode), CVE-specific byte patterns (CVE-2009-0658 JBIG2, CVE-2024-41869, CVE-2024-45112 and 20+ others), XFA FormCalc exploits, XRef Shadow Attacks (PDF signature forgery), AcroForm credential exfiltration, embedded PE/ELF executables, phishing pages with brand impersonation, OCG layer content cloaking, Unicode invisible text (RLO U+202E, rendering mode 3/7), heap spray patterns, polyglot files, steganographic payloads, and any behavioral anomaly detected by six independent PDF renderers including network beaconing, shell spawning, and anonymous executable memory.

Malware & exploits · document-integrity tampering · content-integrity & AI-ingestion attacks — graded across four forensic axes.

PDF files are one of the most common malware delivery vectors — used in phishing campaigns, APT attacks, and exploit kits for decades. Most scanners can only find threats they have already seen. PQ PDF's 47 forensic engines detect both known and unknown threats: behavioral sandbox execution catches what a PDF does regardless of whether it has a signature, ML anomaly detection flags structurally abnormal files even with no prior example, and differential parsing exposes hidden objects whether or not they match any known exploit pattern.

But this is full document forensics, not just a malware scanner. It also detects integrity tampering — signature forgery, shadow documents, post-signing injection — and content-integrity / semantic-determinism attacks, where a file deliberately shows one thing to a human and a different thing to a parser or AI/RAG pipeline: value-vs-appearance (V/AP) divergence, font glyph remapping, OCR text-layer poisoning, and /Alt & /ActualText prompt injection. The verdict is graded across all four forensic axes — so a document-integrity or AI-ingestion attack that carries no malware at all is still surfaced. All free, with zero data retention.

Forensic engines

6.4M+

Offline threat indicators

Sandbox renderers

Report tabs

Data retained

NIST Listed in the NIST Computer Forensics Tools & Techniques Catalog National Institute of Standards and Technology · toolcatalog.nist.gov ↗

🔬 Scan a PDF Now — Free

No account. No upload limit. File deleted immediately after analysis.

What PDF files can hide

PDF Threats You Can't See by Opening the File

A PDF that opens and looks normal can still be malicious. The PDF specification is complex enough that attack vectors are buried in layers of structure that no viewer surfaces to the reader. Emotet used password-protected PDF lures to deliver macro-laced Word droppers. MuddyWater (Iranian APT) relied on PDF first-stage attachments throughout 2022–2024 campaigns against government targets. APT28 (Fancy Bear) distributed CVE-2015-2545 EPS-exploit PDFs in spear-phishing operations against NATO targets. More recently, QakBot and IcedID campaigns shifted entirely to PDF delivery after Microsoft disabled Office macros by default. And beyond classic malware, a newer class of attack exploits the fact that a PDF can present different content to a human than to a parser or AI pipeline — no exploit code required. We measured how widespread this is across nearly 8,000 PDFs in our PDF security research (parser disagreement, V/AP divergence, reality drift, and AI-ingestion poisoning), then applied the full scanner to a real-world corpus in our forensic study of all 16,971 DOJ Epstein PDFs. These are the threat categories PQ PDF detects in malicious and deceptive PDFs:

⚡

Malicious JavaScript

PDF supports a full JavaScript engine. Attackers embed eval(unescape(...)) shellcode loaders, heap spray sequences, and multi-layer obfuscated scripts that execute silently when the file opens in a vulnerable viewer.

💥

CVE Exploit Patterns

Specific byte sequences trigger bugs in PDF renderers. CVE-2009-0658 (JBIG2 heap overflow), CVE-2024-41869 (use-after-free in Adobe Reader), CVE-2024-45112 (type confusion) — the scanner has 24 YARA rules targeting these and more.

📬

AcroForm Exfiltration

Interactive form fields can silently POST all field data — including typed passwords — to an attacker-controlled server via /SubmitForm actions. Hidden fields collect data without user knowledge.

🖼️

OCG Layer Cloaking

Optional Content Groups (layers) can hide malicious content — JavaScript, phishing text, embedded payloads — in layers that are invisible in normal view but present in the file structure and executed by the parser.

🔏

Signature Forgery (Shadow Attack)

A PDF can display a valid digital signature while containing content the signer never approved. The Shadow Attack exploits ByteRange gaps in the signature specification to hide malicious objects outside the signed byte range.

📦

Embedded Executables

PDF allows embedding arbitrary file attachments. Malicious PDFs routinely embed PE (.exe), ELF, OLE compound documents, VBA macro files, ZIP archives, and nested PDFs — activated by /Launch actions or /EmbeddedFile streams.

🎣

Phishing & Brand Impersonation

PDF is a common phishing delivery format. Fake login pages, QR codes pointing to credential-harvesting sites, and brand impersonation (Microsoft, DocuSign, Adobe) are embedded as interactive forms or URI actions.

👻

Invisible & Unicode-Obfuscated Text

Text with rendering mode 3 (invisible) or 7 (clip only) is present in the PDF but never drawn. Combined with RTL override characters (U+202E), attackers reverse filenames and URLs in a way that looks legitimate to a casual reader.

🧩

XFA FormCalc Exploits

XFA (XML Forms Architecture) is a complex XML-based alternative form system supported by Adobe Reader. It contains its own scripting language (FormCalc) and has been the vehicle for multiple critical RCE vulnerabilities rarely analysed by general-purpose scanners.

🎭

Value/Appearance (V/AP) Divergence

A form field can display one value while storing a different one in the signed and extracted data — “I agree to $10” on screen, “$10,000” in the data. Checkbox/radio /V-vs-/AS mismatch, blank or image-based appearance streams, and /NeedAppearances staleness all let the visible document disagree with its machine-readable value. No exploit code — pure document fraud.

🤖

AI-Ingestion & Semantic Poisoning

A document can show one thing to a human and another to an AI/RAG pipeline: font glyph remapping (rendered “$1,200”, extracted “$12,000” via a poisoned /ToUnicode map), hidden OCR/text-layer poisoning on scanned pages, and /Alt & /ActualText prompt injection — invisible to readers but consumed verbatim by LLMs. PQ PDF measures cross-extractor semantic determinism to catch what no malware scanner looks for.

Use cases

Who Should Scan a PDF Before Opening It?

PDF is the most common malware delivery format in targeted attacks. According to Verizon's DBIR, email attachments account for over 90% of malware delivery — and PDF is consistently in the top two formats alongside Office macros. These are the people who scan before they open:

🛡️

SOC Analysts

Triaging email attachments from phishing alerts. Need MITRE ATT&CK mapping, IOC extraction, and a structured verdict they can attach to a ticket — not just a pass/fail AV result.

💼

IT & Security Administrators

Checking vendor-supplied PDFs, software documentation, or procurement contracts before distributing inside the organisation. One malicious PDF forwarded internally becomes a lateral movement risk.

⚖️

Legal & Compliance Teams

Law firms and compliance officers routinely receive PDFs from opposing parties, regulators, and clients — including adversarial actors who know the recipient will open the file. Privileged documents cannot be uploaded to VirusTotal.

🏥

Healthcare & Finance

Insurance claims, billing statements, and financial reports in PDF format are a known targeting vector for ransomware groups (including LockBit and Cl0p campaigns). Regulations like HIPAA prohibit sharing patient data with cloud services — offline scanning is required.

🔬

Malware Researchers

Analysts studying Emotet, MuddyWater, APT28, and other threat actors that use PDF as a first-stage delivery mechanism. The full 47-engine output and AI forensic report provide the depth needed to document a campaign technically.

🏠

Remote Workers

Receiving an unsolicited PDF on a work laptop — a courier notification, a contract revision, an invoice from an unknown sender. The HR and finance departments are the most-targeted recipients of spear-phishing PDFs.

How the alternatives compare

PQ PDF vs. VirusTotal, Hybrid Analysis, Adobe & MetaDefender

Antivirus engines answer one question: "Have we seen this before?" If a threat has been catalogued, they find it. If it hasn't — a zero-day exploit, a freshly obfuscated payload, a novel XFA FormCalc attack, a new JS shellcode loader — it passes straight through. PQ PDF answers a different question: "What does this PDF actually do?" Behavioral execution, ML anomaly detection, structural differential analysis, entropy profiling, and AI ingestion integrity checks find dangerous files whether or not any signature for them exists anywhere. Here is how the tools compare:

Capability	PQ PDF Free · No account	VirusTotal Free (account) · Online	Hybrid Analysis Free (limited) · CrowdStrike	Adobe Acrobat Pro ~$23/month	MetaDefender OPSWAT · Paid
AV signature scanning	✓ ClamAV 700k+ sigs	✓ 70+ AV engines	✓ CrowdStrike + partners	✗ No AV scanning	✓ 30+ AV engines
YARA rules (PDF-specific)	✓ 24 custom PDF YARA rules	⚠ Community rules, generic	⚠ Generic YARA rules	✗ No	⚠ Limited, generic
Behavioral sandbox execution	✓ 6 PDF renderers, isolated namespaces, strace	⚠ General sandbox — not PDF-specific renderers	✓ Good dynamic analysis, general sandbox	✗ No sandbox	⚠ Basic sandbox, limited PDF renderer coverage
PDF structural analysis (XRef, objects, streams)	✓ 15 static engines built for PDF structure	✗ AV engines scan bytes, not PDF structure	✗ No structural PDF analysis	✗ No structural analysis	✗ No structural PDF analysis
JavaScript AST deobfuscation	✓ Full AST deobfuscator + Acrobat API emulation	✗ No	⚠ Runtime observation only	✗ No	✗ No
XFA FormCalc parsing	✓ Dedicated XFA parser engine	✗ No	✗ No	✗ No	✗ No
Signature forgery / Shadow Attack detection	✓ ByteRange forensics engine	✗ No	✗ No	✗ No	✗ No
AcroForm exfiltration / hidden field analysis	✓ Full field tree, SubmitForm targets, JS triggers	✗ No	✗ No	✗ No	✗ No
Six-parser differential comparison	✓ MuPDF, Poppler, GS, qpdf, pdfminer, pdf.js	✗ No	✗ No	✗ No	✗ No
Machine learning anomaly detection	✓ IsolationForest + RandomForest + LightGBM + SHAP	✗ No	✗ No	✗ No	✗ No
OCR vs. text layer divergence (hidden text poisoning)	✓ Tesseract OCR vs. embedded text layer — Jaccard similarity per page	✗ No	✗ No	✗ No	✗ No
Reading order & spatial ambiguity (AI ingestion)	✓ Multi-column layout detection, parser extraction order conflicts	✗ No	✗ No	✗ No	✗ No
Accessibility tree injection (/Alt, /ActualText)	✓ /StructTreeRoot forensics — prompt injection in semantic layer	✗ No	✗ No	✗ No	✗ No
MITRE ATT&CK technique mapping	✓ Every indicator mapped to technique IDs	⚠ Some detections, not systematic	✓ Good ATT&CK coverage	✗ No	⚠ Limited mapping
AI forensic narrative report	✓ Self-hosted Qwen 2.5 — structured verdict + findings	✗ No	✗ No	✗ No	✗ No
File privacy / zero data retention	✓ Deleted immediately, no external calls, no hashes shared	✗ Files stored; hashes and reports are community-shared	✗ Files stored; can be set private (paid only)	✓ Local processing, file stays on your machine	⚠ Enterprise tier offers private scanning
Offline threat intelligence	✓ 6.4M+ indicators in local databases — zero external calls	⚠ All queries sent to external services	⚠ Online lookups	✗ No threat intel	⚠ Cloud-based lookups
Sanitize / clean the PDF	✓ 9 methods: flatten-to-images, strip JS, remove XFA, PDF/A…	✗ No	✗ No	✓ "Sanitize Document" removes active content	⚠ Basic sanitization in some tiers
Cost	✓ Free — no account required	✓ Free with account (rate limited)	✓ Free tier (limited submissions/day)	✗ ~$23/month subscription	✗ Paid — enterprise pricing

The honest assessment: VirusTotal's 70+ AV engines are the best tool in existence for one specific question — "has this exact file been seen and named by the antivirus industry?" If you need community reputation across 70 vendors, use it. For everything else — detecting what a PDF does, finding zero-days, structural forensics, AI ingestion integrity, sanitization, MITRE ATT&CK mapping, and keeping your file private — PQ PDF does all of it, free, with no account required.

How the scanner works

All 47 Forensic Engines Explained

Every uploaded PDF passes through 47 independent analysis engines in a single request. Each engine is orthogonal — designed to catch a different class of threat that the others might miss. Results are correlated by the Correlation Engine (Engine 47) that maps compound indicators to MITRE ATT&CK techniques.

🔍

Static Analysis — Structure & Byte Level

Engines 1–15

ENGINE 1

Structure Validator

Validates the PDF header, version declaration, cross-reference table, trailer dictionary, and byte offsets. Malformed structures are a hallmark of exploit kits that deliberately break parsers to hide objects. Also detects linearized first-page object overrides: incremental updates that re-define an existing Page 1 object to inject JavaScript or actions that renderers fast-pathing the linearization hint table will not see on initial render. Records PDF 2.0 (ISO 32000-2) structures — the /DPartRoot document-part hierarchy (§14.12) and tagged-PDF /Namespaces (§14.7.4), neutral structure surfaced because namespaces form part of the accessibility/semantic layer reality-drift attacks target.

ENGINE 2

Pattern Scanner

Byte-level search for dangerous PDF keywords: /JavaScript, /JS, /Launch, /OpenAction, /AA, /EmbeddedFile, /RichMedia, /XFA, /AcroForm, heap spray constants, and shellcode sequences.

ENGINE 3

Stream Inspector

Decompresses and inspects every stream object in the PDF. Computes per-stream entropy — high-entropy streams hidden inside otherwise clean documents are a strong indicator of encrypted payloads or steganographic content.

ENGINE 4

Object Analyzer

Traverses the full PDF object tree. Maps parent-child relationships, counts suspicious object types, identifies cross-reference anomalies (duplicate object numbers, phantom free entries), and enumerates all dictionary keys.

ENGINE 5

URL Extractor

Extracts all URIs from the PDF including hex-encoded, percent-encoded, and split/obfuscated variants. Flags javascript:, data:, file://, and vbscript: schemes. All URLs are passed to the Threat Intelligence engine.

ENGINE 6

Metadata Analyzer

Examines XMP and Info dictionary metadata: Creator, Producer, Author, creation date, modification date, and custom metadata keys. Detects exploit-kit fingerprints (Metasploit, msfvenom, Canvas, Core Impact) in tool identifiers.

ENGINE 7

Font Analyzer

Inspects every font object for non-standard encoding, oversized /Widths arrays (historic heap-overflow vector), non-embedded fonts that trigger external font lookups, and suspicious glyph name mappings. Analyses ToUnicode CMap tables — the mapping from glyph IDs to Unicode codepoints — detecting remaps where a visually rendered ASCII character (e.g. A) resolves to a non-ASCII Unicode codepoint in the extracted text layer. These remaps make visible text differ from extracted text, corrupting entity extraction, compliance scanning, and AI embeddings without any visible change to the rendered document.

ENGINE 8

CVE Pattern Matcher

Checks for /JBIG2Decode (CVE-2009-0658), /JBIG2Globals exploit parameters, oversized /Widths arrays, and codec parameter combinations associated with heap-overflow and memory corruption CVEs in Adobe Reader and Foxit.

ENGINE 9

Structural Statistics

Computes structural ratios: JavaScript-to-page ratio, stream-to-object ratio, compression diversity index, average object size, and entropy distribution. Statistically anomalous documents are flagged even without specific rule matches.

ENGINE 10

ExifTool Metadata Forensics

Runs ExifTool for deep metadata extraction and cross-source reconciliation across all metadata channels: /Info dictionary, XMP metadata, embedded XML, and attachment timestamps. Detects desynchronization where these sources report conflicting dates — e.g. /Info creation date 2024 while XMP reports 2019 — a strong indicator of document manipulation, backdating, or incremental-update tampering. Also independently confirms XFA form presence, surfaces embedded attachment flags, and detects creator/producer strings from known exploit-generation toolkits.

ENGINE 11

qpdf Structural Integrity

Runs qpdf --check to validate cross-reference tables and trailer dictionaries from a second, independent parser. Intentionally malformed XRef tables are a hallmark of exploit kits designed to hide objects from basic parsers. Also flags PDF 2.0 unencrypted-wrapper / encrypted-payload documents (ISO 32000-2 §7.6.7) — a clear cover page carrying an /AF file with /AFRelationship /EncryptedPayload seals the real content inside an encrypted attachment no static engine can read; graded on the tampering axis as a content-hiding construct.

ENGINE 12

YARA Rule Engine

Applies 24 custom YARA rules: heap-spray patterns (%u9090, 0x0c0c), CVE-specific byte sequences (CVE-2009-0658, CVE-2024-41869, CVE-2024-45112), obfuscated JS loaders, XFA+script combos, Cobalt Strike beacon signatures, PowerShell encoded commands, and multi-layer encoder chains.

ENGINE 13

PeePDF Deep Analysis

Independent analysis using the PeePDF framework — a completely separate parser that builds its own object tree independently of PyMuPDF. Provides a full second-opinion parse, locating vulnerability patterns with exact object IDs and identifying suspicious elements invisible to the primary parser.

ENGINE 14

ClamAV Signature Scanner

Runs the local ClamAV daemon against the file — 700,000+ signatures including the Pdf.Exploit.* family covering CVE-2009-0927, CVE-2009-4324, and the Exploit.PDF-JS category. A ClamAV match means the file is a confirmed known threat.

ENGINE 15

Polyglot Detection

Detects files that are simultaneously valid in two or more formats. Two detection layers: (1) File-level polyglot — checks whether a recognised format signature (JPEG FF D8 FF, ZIP PK\x03\x04, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF- header. ISO 32000 §7.5.2 permits arbitrary bytes before %PDF-; attackers exploit this to create JPEG+PDF or ZIP+PDF polyglots that bypass format-based content filters — email gateways see a JPEG, the PDF payload executes. (2) Stream-level polyglot — scans every PDF stream (raw and decompressed) for embedded executable magic: ZIP, Windows PE, Linux ELF, Mach-O, Java class, OLE/CFBF, RAR, 7-Zip, WebAssembly, HTML, and PostScript. Polyglot files smuggle dropper payloads past content-type security controls.

🔥

Dynamic Behavioral Analysis

Engine 16

ENGINE 16

Dynamic Behavioral Sandbox

The only engine that actually executes the PDF. Renders through six independent engines — Ghostscript (PostScript + JS interpreter), MuPDF, Poppler, LibreOffice Draw, Chromium PDFium (Chrome's engine — the most common modern viewer), and pdf.js/Node (Firefox engine) — each inside isolated Linux namespaces with its own network stack, PID space, and mount point. All syscalls captured by strace. Detects: network beaconing, anonymous executable memory (shellcode), shell spawning, filesystem escape attempts, and process bombs. Static analysis sees structure; this engine sees what the PDF does.

🧠

Machine Learning & Differential Parsing

Engines 17–18

ENGINE 17

ML Intelligence Engine

Extracts a 38-feature vector from all preceding engine outputs and applies three layers: Bayesian contextual scoring (dampens known-benign creator tools, amplifies exploit-kit fingerprints), IsolationForest anomaly detection (unsupervised, active from the first scan), and RandomForest + LightGBM classifiers with SHAP explainability. Reports top contributing features for each scan so analysts understand the ML verdict, not just the score.

ENGINE 18

Differential Parsing

Runs six independent PDF parsers — MuPDF (mutool), Poppler, Ghostscript, qpdf, pdfminer, and pdf.js — and cross-compares eight structural dimensions: page count, object count, PDF version, JavaScript presence, encryption status, AcroForm presence, embedded file count, and OpenAction. Discrepancies mean the file exploits parser differences to hide objects — the signature of broken-xref exploit staging and incremental-update attacks. See the empirical parser disagreement tests for 11 reproducible examples with live scanner output.

🌐

Threat Intelligence, JavaScript & Campaign Attribution

Engines 19–24

ENGINE 19

JS AST Deobfuscation

Parses embedded JavaScript to its abstract syntax tree, then applies symbolic simplification to undo eval/unescape layers, string-split obfuscation, hexadecimal encoding, and multi-pass encoder chains. Surfaces the final deobfuscated payload for manual review.

ENGINE 20

Threat Intelligence

Queries four fully offline local databases: URLhaus, MalwareBazaar, ThreatFox, and FeodoTracker + OpenPhish — 6.4M+ indicators including URLs, IPs, domains, file hashes, and botnet C2 addresses. Zero external API calls. All extracted URLs and IPs from the PDF are cross-referenced.

ENGINE 21

Signature Forensics

Deep forensics on PDF digital signatures across six dimensions: ByteRange coverage integrity (per ISO 32000 §12.8.1, offsets are from the %PDF- header — o1 must be 0, both segments within file bounds, inner gap must contain only the /Contents blob, and o2+l2 must reach at least %%EOF); shadow document detection (unsigned bytes beyond the signed region containing execution vectors — CVE-2019-14980 class); full-save rewrite detection (when o2+l2 < %%EOF and the unsigned trailing region contains xref/trailer structure without execution vectors, a PDF viewer performed a complete file rewrite — this invalidates the cryptographic signature while the visual signature appearance remains, a pattern used by DocuSign and similar tools); /Contents blob structural validation (all-zero placeholders, sub-32-byte blobs, missing DER SEQUENCE header); SubFilter deprecation (SHA-1 collision risk, legacy no-chain variants, unknown formats); and weak digest algorithm detection (MD5/SHA-1 vulnerable to collision-assisted forgery).

ENGINE 22

Phishing Detection

Combines regex heuristics with NLP analysis to detect credential-harvesting forms, QR codes pointing to phishing domains, brand impersonation (Microsoft, Adobe, DocuSign, DHL, PayPal), urgency language patterns, and deceptive URI display vs. actual destination mismatches.

ENGINE 23

Embedded File Analysis

Enumerates all embedded attachments. Identifies PE executables, ELF binaries, OLE compound documents, VBA macro files, ZIP archives, nested PDFs, and JavaScript files. Flags dangerous /Launch actions that auto-execute embedded files on viewer interaction.

ENGINE 24

Campaign Attribution

Computes a TLSH fuzzy hash of the PDF and compares it against previously scanned samples. Clusters similar files into malware families and named campaigns, reporting a similarity score and any known cluster associations. Reveals whether a file is a variant of a known threat.

📄

PDF-Specific Deep Forensics

Engines 25–43

ENGINE 25

AcroForm Field Forensics

Enumerates every form field and analyses: JavaScript on /A and /AA field events (focus, blur, keystroke, validate), hidden NoExport fields, password-type fields (credential harvesting), /SubmitForm exfiltration targets, and calculation-order chain exploitation. Also performs Value / Appearance Stream (V/AP) divergence detection — flags /NeedAppearances true (stale AP, critical when signed), checkbox/radio /V vs /AS key mismatch (rendering-independent), text/listbox/combobox field AP stream text extraction with font encoding remap (resolves /Encoding /Differences tables so a font mapping byte 0x31 to glyph /nine is decoded correctly before comparison to /V — catches the custom-font evasion path), image-based AP stream detection (AP renders via Do image XObject with no text operators — /V is not visually verifiable without image recognition, flagged high severity), and blank AP streams that hide a signed value from the viewer.

ENGINE 26

Document Revision History

Splits the PDF at each %%EOF boundary and extracts per-revision metadata: author, producer, modification date, and changed/new/deleted object counts per revision. Detects author identity changes, execution vectors injected after original creation, and automated exploit staging via large final-revision object injections.

ENGINE 27

Annotation Forensics

Examines every annotation object for dangerous action dictionaries: javascript: URI schemes, JavaScript triggers on click/hover, /Launch actions that spawn programs, /GoToR remote links, and /SubmitForm in annotation actions — attack vectors completely invisible to byte-level scanners.

ENGINE 28

Named Tree Analysis

Catalogues the full PDF action infrastructure: the Named JavaScript Registry (/Names /JavaScript), /AA additional actions, /OpenAction type classification, and /Perms and /UR3 permission restriction exploitation. Deep DocMDP forensics — parses the /P permission level (1 = no changes, 2 = form fill-ins, 3 = annotations — the most exploitable), validates /TransformParams and /Reference structure, checks /SigFlags AppendOnly bit, detects incremental updates violating MDP constraints, and flags multiple /DocMDP entries (validator confusion attack). FieldMDP per-signature field lock (ISO 32000 §12.8.2.4, "File MDP") — distinct from DocMDP, FieldMDP locks specific named form fields per approval signature and can be selectively permissive: detects Action=Include with empty /Fields (locks nothing despite appearing to certify), Action=Exclude with named fields (those fields are explicitly unlocked), and incremental updates that modify form fields after a FieldMDP signature is in place. PDF 2.0 Associated Files (/AF) (ISO 32000-2 §7.11.4) — enumerates document- and page-level /AF file bindings and their /AFRelationship types (Source, Data, EncryptedPayload, Alternative, Supplement), a modern attachment surface a legacy /EF-only walk misses.

ENGINE 29

Content Stream Forensics

Inspects decompressed content streams for dangerous PostScript operators: exec (dynamic execution), run (file execution), token (string-to-code eval), setpagedevice (PostScript-to-system bridge). Also detects malformed /ICCBased color profiles of anomalous size — the CVE-2021-21017 class of heap buffer overflows.

ENGINE 30

Object Stream Analysis

PDF 1.5+ allows objects to be compressed into /ObjStm containers — invisible to byte scanners. This engine decompresses every object stream and re-scans the content for JavaScript, Launch actions, EmbeddedFile references, and high-entropy payloads (entropy >7.5 bits) suggesting hidden encrypted content.

ENGINE 31

PDF Token Obfuscation Decoder

Decodes hex-escaped PDF name tokens: /J#61vaScript → /JavaScript, whitespace-split token injection, and null-byte injection in name objects. These bypass simple pattern matchers while remaining valid to the PDF renderer — a classic evasion technique found in real-world exploit kits.

ENGINE 32

XFA FormCalc Parser

Extracts and parses XFA (XML Forms Architecture) streams including embedded FormCalc scripts. XFA-based attacks (CVE-2021 XFA class) are rarely covered by general-purpose scanners. Detects dangerous FormCalc function calls, script injection in XFA event handlers, and XFA-activated JavaScript triggers.

ENGINE 33

Action Dependency Graph

Maps the full chain of PDF actions: /OpenAction → /AA → field actions → annotation triggers → named actions. Visualises multi-hop execution chains where a seemingly innocent trigger leads through a chain of named actions to a final exploit — invisible when examining any single action in isolation.

ENGINE 34

OCG Layer Cloaking

Analyses Optional Content Groups (PDF layers). Malicious content — JavaScript, phishing text, embedded payloads, deceptive instructions — can be placed in a layer set to invisible by default, present in the file but never rendered to the user. This engine enumerates all layers and their visibility states, flagging hidden active content.

ENGINE 35

Unicode & Invisible Text Forensics

Detects text with rendering mode 3 (invisible — clips nothing, draws nothing) and mode 7 (clip only — used to silently position clickable areas). Flags RTL override characters (U+202E) that reverse displayed filenames and URLs, and zero-width joiners used to split and reassemble malicious keywords.

ENGINE 36

Trailer Chain Forensics

Analyses the chain of PDF trailer dictionaries across all incremental updates. Detects Shadow Attack variants where a second document is hidden in the byte gap between the end of the signed region and the actual EOF — allowing content replacement while preserving a valid digital signature.

ENGINE 37

Codec Parameter Validation

Validates stream filter parameters for exploit-relevant codecs: /JBIG2Decode + /JBIG2Globals combinations (CVE-2009-0658 class), abnormally large /Columns and /Rows values in CCITT streams, and unusual parameter combinations in /CCITTFaxDecode and /DCTDecode filters associated with historic heap overflow exploits.

ENGINE 38

Physical Entropy Topology

Maps byte-level entropy across the physical file in sliding windows, producing an entropy profile of the entire document. Locates encrypted or compressed regions at unexpected byte offsets, encrypted blobs appended after the nominal EOF, and entropy spikes that indicate hidden payload injection invisible to object-based analysis.

ENGINE 39

Image Steganography & Tracking Beacons

Runs LSB chi-square statistical analysis on raster images embedded in the PDF — elevated chi-square scores indicate LSB steganographic payload injection. Also detects tracking beacons: 1×1 pixel images with external URI references that phone home on document open, allowing attackers to confirm successful delivery without any explicit JavaScript.

ENGINE 40

PDF/A Compliance Fraud Detection

Checks whether a PDF claiming PDF/A or PDF/UA conformance (typically to pass corporate compliance filters) actually meets the standard. Documents falsely claiming archival conformance to bypass security gateways that whitelist "archival" formats are a known evasion technique.

ENGINE 41

JavaScript Behavioral Emulation

Executes embedded JavaScript in a sandboxed Acrobat API stub environment. Simulates the Acrobat object model (app, this, util) to reveal what JavaScript does without a real viewer — catching payload assembly that requires runtime evaluation to surface. doc.getField() returns the actual /V field values from the PDF (collected by Engine 25) so conditional exploitation chains are correctly evaluated and SUBMIT_FORM events carry real field content.

ENGINE 42

Font CharString Emulator

Emulates Type 1 and Type 2 font CharString programs (the bytecode embedded in font outlines) to detect seac operator abuse (out-of-bounds glyph lookup), stack exhaustion via deeply nested subroutine calls, and arithmetic overflow patterns in CharString arithmetic — a class of font-engine exploits affecting all major PDF viewers.

ENGINE 43

XRef Integrity Graph

Builds a complete cross-reference integrity graph and identifies: phantom objects (objects referenced in the XRef but absent from the file body), orphan sleepers (objects in the file body unreferenced by any XRef entry — hidden until a parser recovers them), and free-entry exploitation (objects with generation numbers manipulated to survive deletion).

🤖

AI Ingestion Integrity

Engines 44–46

ENGINE 44

Reading Order & Spatial Ambiguity

Analyses spatial text positioning to detect multi-column and non-linear layouts where extraction order is ambiguous. PDF has no native concept of paragraphs, reading order, or semantic structure — text is positioned drawing operations. When columns, tables, or complex spatial clusters are present, parsers reconstruct reading order heuristically and disagree. Flags pages where linear text extraction produces semantically incorrect content, creating hallucinated relationships, inverted meanings, and corrupted table structures in LLM and RAG ingestion pipelines that treat parser output as canonical truth.

ENGINE 45

OCR Text Layer Integrity

Compares the embedded PDF text extraction layer against Tesseract OCR output rendered from the page raster image. A significant mismatch between the two is a strong indicator of a hidden text layer attack: malicious instructions, prompt injection, or sensitive data placed in the invisible text stream — invisible to human readers looking at the rendered image, but fully ingested by LLMs and RAG pipelines that consume the text layer. The check targets genuinely scanned pages — a single image covering >60% of the page with substantial extracted text — and flags a page only when the rendered image OCRs to real text that conflicts with the embedded layer (Jaccard word-set overlap below 0.30). To avoid false positives at scale, a page whose image legitimately yields little or no OCR text is excluded rather than scored as a mismatch; the blank-overlay and invisible-text variants are caught by the dedicated invisible-text and prompt-injection engines.

ENGINE 46

Accessibility Tree Forensics

Parses the /StructTreeRoot accessibility structure and inspects all semantic elements: /Alt image descriptions, /ActualText character-level overrides, heading hierarchy, logical structure labels, and figure captions. These channels are increasingly preferred by AI document processors because they improve chunking quality — making them high-value injection targets. Detects prompt injection in /Alt and /ActualText attributes: semantic content that exists only in the accessibility tree, fully invisible in rendering but completely visible to LLM ingestion pipelines, tagged-PDF extractors, and screen-reader-style processors. Flags payloads containing instruction-override patterns ("ignore prior instructions", "system:", "INST") that would execute in downstream AI processing.

⚙️

Synthesis & AI Report

Engine 47 + AI

ENGINE 47

Correlation Engine

Evaluates 60+ compound patterns across all preceding engine outputs. Individual indicators may be low-risk in isolation — but /OpenAction + embedded JavaScript + obfuscated URL + non-embedded font is a dangerous combination. The Correlation Engine awards bonus risk points (35–100) for such combinations and maps each compound pattern to MITRE ATT&CK technique IDs.

AI REPORT

🤖 AI Forensic Report — Qwen 2.5

A self-hosted Qwen 2.5 model (running on a private GPU server — no third-party AI API) synthesises all 47 engine outputs into a structured verdict: threat classification, confidence level, key findings, MITRE ATT&CK technique grid, and recommended actions. Zero data leaves pqpdf.com infrastructure.

Risk assessment

How the Risk Score Works

Every finding is classified onto one of four forensic axes, and the headline verdict is graded by what each finding actually means — not by a single undifferentiated count. This is what lets a feature-rich but legitimate document (a government form with field JavaScript, an academic paper with hundreds of embedded-font objects) read as clean while a genuine attack still scores.

Exploit — code execution, memory corruption, malware/dropper delivery, confirmed-malicious (AV/threat-intel) hits.
Tampering — integrity/authenticity violations: signature forgery, shadow documents, post-signing injection.
Deception — content/semantic-determinism manipulation: value-vs-appearance (V/AP) divergence, font glyph remapping, OCR text-layer poisoning, /Alt & /ActualText prompt injection, homoglyphs.
Structural / informational — neutral modern-PDF capability & structure (object streams, incremental updates, capability presence). Reported for context; never counted as a threat.

The headline Threat Score = exploit + integrity-tampering, and it drives the verdict band below. Because this is a full forensics tool and not just a malware scanner, a confirmed deception finding (e.g. the displayed value ≠ the stored/extracted value) grades the verdict on its own axis even when the malware threat score is zero — a document that shows one thing to a human and a different thing to a parser or LLM is a first-class finding. Deception and structural scores are reported separately and never inflate the malware verdict.

Severity tiers: Critical (+50 pts) · High (+25 pts) · Medium (+10 pts) · Low (+3 pts) — capped at 3 occurrences per indicator. The Correlation Engine adds +35 to +100 bonus points for dangerous indicator combinations — a single /OpenAction is low-risk, but /OpenAction + obfuscated JavaScript + a known-malicious URL is definitively dangerous.

Clean

No exploit, tampering or content-deception findings.

1–29

Low

Minor findings, or active content present with no confirmed exploit. Review before use.

30–149

Suspicious

A real semantic/integrity finding or several indicators. Manual review recommended.

150–349

High Risk

Confirmed exploit, tampering or semantic-determinism attack. Sanitize before opening.

350+

Dangerous

Confirmed execution chain. Do not open in any reader. Sanitize or discard.

Privacy & isolation

Your File Never Leaves Our Server

Uploading a potentially malicious PDF to an online scanner is only sensible if the scanner's security model is trustworthy. PQ PDF is designed around the principle that the scanner must be as safe to use as the file is dangerous.

🗑️

Zero retention

The file is deleted from the server immediately after analysis completes. No copy, hash, or metadata is retained. No database entry of your file.

🔒

Four-layer isolation

Every analysis runs inside prlimit resource limits + AppArmor MAC policy + Linux user/mount/network/PID namespaces + private tmpfs. The file cannot escape its container.

📡

Offline threat intelligence

All 6.4M+ threat indicators are stored in local PostgreSQL databases. No hash, URL, or byte from your file is transmitted to URLhaus, VirusTotal, or any external service.

🤖

Self-hosted AI

The AI report uses a Qwen 2.5 model hosted on a private GPU server. No content is sent to OpenAI, Anthropic, Google, or any third-party AI API.

👤

No account required

No login, no email, no registration. There is no way to link a scan to a user identity because no identity is collected.

📊

No tracking

No Google Analytics, no ad pixels, no third-party scripts. The CSP policy explicitly blocks all external script sources and third-party connections.

Common questions

Frequently Asked Questions

Can VirusTotal detect all malicious PDFs?

Is it safe to upload a potentially malicious PDF?

Yes. Every file is processed in a four-layer isolated environment: prlimit resource limits, AppArmor MAC policy (pqpdf-unshare profile), Linux user + mount + network + PID namespaces, and a private tmpfs mount. The behavioral sandbox adds another nested namespace with its own isolated network stack. The file is deleted immediately after analysis — no copy, hash, or metadata is retained.

Does the scanner send my file to VirusTotal or any external service?

What types of PDF threats does the scanner detect?

Can the scanner clean a malicious PDF?

How does PQ PDF compare to Hybrid Analysis or ANY.RUN?

How long does a scan take?

Scan Your PDF Now — Free

No account. No file size limit. Results in under a minute.
File deleted immediately. Zero data retained.

🔬 Open PDF Forensics Scanner

🗂️ Office Document Scanner 🔬 Universal File Scanner 🔍 File Fingerprint Comparator ⬛ Redact PDF 🛡️ Protect PDF

PDF Malware Scanner47 Forensic Engines. Free. Online.