Inspects PDF header position, counts %%EOF markers (exploit PDFs often carry multiple), audits cross-reference table depth, linearisation flags, and excessive filter chains used for obfuscation.
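As a rough illustration of the header-position and %%EOF checks described above, the following Python sketch works on raw bytes only. The function name and the report keys are hypothetical, not the engine's actual interface.

```python
def structural_summary(data: bytes) -> dict:
    """Report where the %PDF- header sits and how many %%EOF markers exist."""
    header_offset = data.find(b"%PDF-")   # -1 means no header was found
    eof_count = data.count(b"%%EOF")
    return {
        "header_offset": header_offset,
        "eof_count": eof_count,
        # A header that is not at byte 0 means leading data precedes the PDF.
        "header_displaced": header_offset > 0,
        # More than one %%EOF means incremental updates or appended content.
        "multiple_eof": eof_count > 1,
    }


sample = (
    b"junk%PDF-1.4\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
    b"2 0 obj\n<<>>\nendobj\n%%EOF\n"
)
report = structural_summary(sample)
```

Multiple %%EOF markers also occur in benign documents that were saved incrementally, so this count is best treated as one input to scoring rather than a verdict.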
45+ byte-level signatures: /JavaScript /Launch /OpenAction /EmbeddedFile /JBIG2Decode /XFA /RichMedia, NOP sleds (%u9090 %u4141), heapspray fills, and dangerous JS APIs: eval() unescape() collab.getIcon() util.printf().
Decompresses every FlateDecode stream via PyMuPDF and re-scans the raw content — catching JavaScript and shellcode hidden inside compressed objects that raw-byte scanners miss entirely. Calculates Shannon entropy per stream; values above 7.2 bits flag encrypted or packed payloads.
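The decompress-then-rescan idea can be sketched with the standard library alone. This sketch uses zlib in place of PyMuPDF, assumes the entropy is measured on the decoded bytes, and uses an illustrative keyword list; all function names are hypothetical.

```python
import math
import zlib
from collections import Counter

ENTROPY_THRESHOLD = 7.2  # bits per byte, the cut-off quoted above


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def inspect_flate_stream(raw_stream: bytes) -> dict:
    """Inflate one FlateDecode stream body and re-scan the decoded bytes."""
    try:
        decoded = zlib.decompress(raw_stream)
    except zlib.error:
        # Not a valid zlib stream: report it rather than guessing.
        return {"decoded": False, "entropy": None, "high_entropy": False, "keywords": []}
    entropy = shannon_entropy(decoded)
    # Illustrative keyword list only; the real signature set is much larger.
    keywords = [k for k in (b"/JavaScript", b"eval(", b"unescape(") if k in decoded]
    return {
        "decoded": True,
        "entropy": entropy,
        "high_entropy": entropy > ENTROPY_THRESHOLD,
        "keywords": keywords,
    }


hidden = zlib.compress(b"<< /S /JavaScript /JS (eval(unescape('%u9090'))) >>" * 20)
packed = zlib.compress(bytes(range(256)) * 16)
hidden_report = inspect_flate_stream(hidden)
packed_report = inspect_flate_stream(packed)
```

The keywords in `hidden` do not appear in the compressed bytes at all, which is why a scanner that never inflates the stream cannot see them.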
Walks the full cross-reference object graph, resolving indirect references and checking every object dictionary for dangerous action-type combinations (/S /Launch, /S /JavaScript, /RichMedia, /XFA). Reports exact xref numbers of suspicious objects.
Extracts all HTTP/HTTPS URLs from raw bytes and decompressed streams, de-duplicates, and lists them so you can assess every domain the PDF attempts to contact — phoning home, tracking pixels, and C2 beaconing.
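A minimal sketch of URL extraction and de-duplication over several byte blobs (raw file bytes plus decoded streams). The regular expression is a simplifying assumption and will not match every valid URL.

```python
import re

# Stop at whitespace, brackets, and quotes, since PDF strings wrap URLs in ( ).
URL_RE = re.compile(rb"https?://[^\s<>()\[\]\"']+", re.IGNORECASE)


def extract_urls(*blobs: bytes) -> list[str]:
    """Return unique URLs in first-seen order across all supplied blobs."""
    seen: dict[str, None] = {}
    for blob in blobs:
        for match in URL_RE.finditer(blob):
            url = match.group().decode("latin-1").rstrip(".,;")
            seen.setdefault(url, None)
    return list(seen)


raw = b"/URI (http://a.example/x) /URI (https://b.example/y?z=1)"
decoded_stream = b"see http://a.example/x."
urls = extract_urls(raw, decoded_stream)
```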
Inspects Producer and Creator fields for known exploit-tool strings (Metasploit, Canvas, Core Impact), flags missing metadata — a hallmark of crafted exploits — and scans XMP streams for embedded script references.
Checks every font object for /JBIG2Decode usage (the codec exploited in CVE-2009-0658) and for abnormally large /Widths arrays used in historic heap-overflow attacks against Acrobat's font engine.
Matches against known exploit signatures: CVE-2009-0658, CVE-2009-0927, CVE-2009-4324, CVE-2008-2992, CVE-2007-5659, CVE-2007-5020, CVE-2010-0188, binary heapspray NOP sleds (0x0C / 0x0D fill patterns).
⑨
Structural Statistics
Collects page count, object count, encryption status, embedded file count, form fields, annotations, and link count via PyMuPDF's document model — providing the full structural picture for the summary dashboard.
⑩
ExifTool Metadata Forensics
Runs exiftool for deep metadata extraction complementing PyMuPDF's view. Detects exploit-kit fingerprints in Creator/Producer/Author fields (Metasploit, msfvenom, Canvas, Core Impact), independently confirms XFA forms, and surfaces embedded attachment flags visible only via EXIF/XMP metadata layers. Results feed into the Correlation Engine.
⑪
qpdf Structural Integrity
Runs qpdf --check to validate cross-reference tables, trailer dictionaries, and overall document structure. Intentionally malformed or "damaged" PDFs — where xref tables are deliberately broken — are a hallmark of exploit kits designed to hide objects from basic parsers while still rendering in vulnerable viewers. Results feed into the Correlation Engine.
⑫
YARA Rule Engine — 24 Rules
Applies 24 custom YARA rules targeting PDF-specific attack signatures: classic heap-spray patterns (%u9090, 0x0c0c fills), CVE-specific byte patterns (CVE-2009-0658, CVE-2008-2992, CVE-2010-1240, CVE-2018-4990, CVE-2021 XFA, CVE-2024-41869 use-after-free, CVE-2024-45112 type confusion), JavaScript shellcode loaders (eval+unescape), hex-obfuscated keywords, auto-open executable combos, XFA+script exploits, Cobalt Strike beacon signatures, PowerShell encoded commands, Unicode obfuscation, and multi-layer encoder chains. Provides byte-level corroboration independent of PyMuPDF parsing. Results feed into the Correlation Engine.
Independent PDF analysis using the PeePDF framework — a separate parser that builds its own object tree entirely independently of PyMuPDF. Identifies vulnerability patterns with exact object IDs, locates suspicious elements (/Launch, unescape, getIcon, printf, eval, /EmbeddedFile), and reports JavaScript object locations. Where our other engines parse bytes and structure, PeePDF provides a full second-opinion parse. Results feed into the Correlation Engine.
⑭
Dynamic Behavioral Sandbox
The only engine that actually executes the PDF. Renders through six independent engines — Ghostscript (PostScript + JS interpreter), MuPDF, Poppler, LibreOffice Draw (OLE/macro paths), Chromium PDFium (Chrome browser engine — the dominant modern PDF viewer), and pdf.js/Node (Firefox engine) — each inside a Linux process namespace with its own isolated network stack, PID space, and mount point. All syscalls are captured by strace. Detects: outbound network connections (beaconing in an isolated namespace is definitively malicious), anonymous executable memory mappings (the runtime signature of shellcode), unauthorised process spawning (shell execution from a CVE), filesystem escape attempts, and excessive fork/clone calls (process bombs). Static analysis sees the PDF's structure; this engine sees what it does.
⑮
ClamAV Signature Scanner
Runs the local ClamAV daemon against the file — matching 700,000+ signatures including the Pdf.Exploit.* family (CVE-2009-0927, CVE-2009-4324, Exploit.PDF-JS, and many more). Where the other engines use heuristics and structural analysis to catch zero-days, ClamAV provides authoritative signature intelligence for known samples. A match here means the file is a confirmed known threat.
⑯
ML Intelligence Engine
Sits above all 15 engines and applies three layers of intelligence: Bayesian contextual scoring adjusts risk based on document origin — a JasperReports or Microsoft Word creator is dampened; a Metasploit/msfvenom creator is amplified. IsolationForest provides unsupervised anomaly detection from the very first scan — flags documents whose 38-feature vector is statistically unusual compared to the scanned population. RandomForest + LightGBM classifiers activate once ≥10 labeled samples accumulate; bootstrap pseudo-labeling (unlabeled scans with raw_score ≤ 5 treated as pseudo-benign) enables supervised activation from the first real malicious scan. Explainable ML reports the top feature contributions via SHAP for each scan. Every scan is persisted to PostgreSQL. User feedback (false positive / confirmed threat) feeds directly into retraining.
Runs six independent PDF parsers — MuPDF (mutool), Poppler (pdfinfo/pdfdetach), Ghostscript, qpdf, pdfminer, and pdf.js — against the same file and cross-compares 8 structural dimensions: page count, object count, PDF version, JavaScript presence, encryption status, AcroForm presence, embedded file count, and OpenAction. Malicious PDFs abuse broken xref tables, hidden incremental updates, and duplicate object numbers so that one parser recovers hidden exploit objects that another ignores entirely. Seven discrepancy checks (Critical/High/Medium) flag page mismatches, JS visibility gaps, encryption oracle indicators, version header confusion attacks, hidden form action trees, and attachment count discrepancies — all invisible to single-parser scanners. A hard 30-second timeout guards the entire engine so no malformed PDF can stall the scan.
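The cross-comparison step can be pictured as a per-dimension disagreement check over each parser's report. In this sketch the report dictionaries are hypothetical stand-ins for whatever each tool's wrapper returns, and severity grading is left out.

```python
def find_discrepancies(reports: dict[str, dict]) -> dict[str, dict]:
    """Return every dimension on which parsers disagree.

    The result maps dimension -> {observed value: [parsers reporting it]}.
    """
    dimensions = sorted({key for report in reports.values() for key in report})
    disagreements: dict[str, dict] = {}
    for dim in dimensions:
        by_value: dict = {}
        for parser, report in reports.items():
            if dim in report:  # a parser that timed out may omit a dimension
                by_value.setdefault(report[dim], []).append(parser)
        if len(by_value) > 1:
            disagreements[dim] = by_value
    return disagreements


# Hypothetical parser outputs for one file.
reports = {
    "mutool": {"pages": 1, "javascript": False},
    "pdfminer": {"pages": 1, "javascript": True},
    "qpdf": {"pages": 1, "javascript": False},
}
gaps = find_discrepancies(reports)
```

Here only one parser sees JavaScript, which corresponds to the "JS visibility gap" case described above.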
Scans every PDF stream — both raw and decompressed — for file magic byte signatures: PK\x03\x04 (ZIP), MZ (Windows PE executable), \x7fELF (Linux ELF binary), \xcf\xfa\xed\xfe (Mach-O), \xca\xfe\xba\xbe (Java class), \xd0\xcf\x11\xe0 (OLE/CFBF — Office binary), RAR, 7-Zip, and embedded PostScript. Polyglot files simultaneously satisfy the format rules of two or more file types — the file appears as a valid PDF to viewers while also containing a self-extracting archive or executable dropper that activates when saved to disk and opened by a compatible application. This technique is used to smuggle payloads past content-type-based security controls that only inspect the file header.
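A minimal sketch of the magic-byte check, assuming the signature is tested at the start of each stream (short signatures such as MZ would otherwise match constantly inside arbitrary binary data). The table and function name are illustrative.

```python
MAGIC_SIGNATURES = [
    (b"PK\x03\x04", "ZIP archive"),
    (b"MZ", "Windows PE executable"),
    (b"\x7fELF", "Linux ELF binary"),
    (b"\xcf\xfa\xed\xfe", "Mach-O"),
    (b"\xca\xfe\xba\xbe", "Java class"),
    (b"\xd0\xcf\x11\xe0", "OLE/CFBF (Office binary)"),
    (b"Rar!\x1a\x07", "RAR archive"),
    (b"7z\xbc\xaf\x27\x1c", "7-Zip archive"),
    (b"%!PS", "PostScript"),
]


def identify_payload(stream: bytes):
    """Return a label if the stream begins with a known file signature."""
    for magic, label in MAGIC_SIGNATURES:
        if stream.startswith(magic):
            return label
    return None


pe_label = identify_payload(b"MZ\x90\x00\x03\x00\x00\x00")
zip_label = identify_payload(b"PK\x03\x04\x14\x00")
plain_label = identify_payload(b"BT /F1 12 Tf (hello) Tj ET")
```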
Extracts all JavaScript fragments from the PDF, both inline /JS strings and JavaScript-bearing compressed streams, then parses each through the Acorn parser to build a full Abstract Syntax Tree. Instead of scanning text for keywords, the AST walker detects meaning: eval() and execScript() dynamic execution entry points; String.fromCharCode() arrays that assemble shellcode from integer sequences at runtime; unescape() decode chains that deliver encoded payloads in two stages; numeric arrays of 150+ elements (the structural signature of heap spray); and new Function(string) dynamic code construction. Once split, renamed, or re-encoded, these patterns are easy for regex-based scanners to miss but remain plainly visible at the AST level.
Queries four fully offline local databases — URLhaus (abuse.ch malware payload hashes + URLs), MalwareBazaar (SHA-256 hash reputation + family labels), ThreatFox (IOC indicators), and FeodoTracker + OpenPhish (C2 IPs + phishing URLs) — totalling 6.4M+ indicators stored in local PostgreSQL tables. Zero external API calls per scan, zero rate-limit dependency, zero data exfiltration. Databases are updated from public feeds on a periodic schedule. A hash match raises a critical indicator and auto-labels the scan. URLs extracted from the PDF are checked against URLhaus URL and domain feeds and the OpenPhish phishing URL list. No hash, URL, or file byte is transmitted externally.
Deep forensics on PDF digital signatures via pyhanko. Computes ByteRange coverage: if the byte ranges declared in the signature do not cover the full file, the gap contains unsigned content — the classic shadow document attack where a malicious payload is appended after signing. Diffs the object inventory across every incremental update revision after the signature to detect execution vectors (/JavaScript, /Launch, /OpenAction, /EmbeddedFile) added post-signing. A signed-then-modified document with active content is treated as a critical finding.
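The ByteRange coverage arithmetic can be sketched directly on raw bytes. This is a simplified stand-in that uses a regular expression rather than pyhanko, handles only the first /ByteRange it finds, and uses hypothetical report keys.

```python
import re

BYTERANGE_RE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def byterange_report(data: bytes):
    """Compare the first /ByteRange [o1 l1 o2 l2] with the real file size."""
    match = BYTERANGE_RE.search(data)
    if match is None:
        return None
    o1, l1, o2, l2 = (int(g) for g in match.groups())
    return {
        "starts_at_zero": o1 == 0,
        # The hole between the two ranges normally holds the /Contents value.
        "contents_gap": (o1 + l1, o2),
        # Bytes after the second range are not covered by the signature.
        "unsigned_tail": len(data) - (o2 + l2),
    }


signed = b"%PDF-1.7\n/ByteRange [0 40 60 20]\n".ljust(80, b" ")
tampered = signed + b"X" * 25   # content appended after signing
clean_report = byterange_report(signed)
tampered_report = byterange_report(tampered)
```

A non-zero `unsigned_tail` is not malicious by itself, since a later legitimate signature or update also extends the file; the diff of what was added is what decides severity.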
Multi-vector phishing analysis: urgency/deception phrase detection (30+ patterns: "login required", "verify your account", "prize notification", "limited time"); brand impersonation keywords (Microsoft, Apple, Amazon, PayPal, DocuSign, Adobe, DHL, IRS, and others); credential harvesting via AcroForm — detects SubmitForm actions combined with password-type form fields, the structural signature of a phishing PDF form; QR code decoding — renders every PDF page as an image and extracts all embedded QR codes via zbarimg, then checks decoded URLs for suspicious domains (IP addresses, non-HTTPS schemes, URL shorteners). High urgency phrase density combined with brand impersonation is scored as a high-confidence phishing indicator.
㉓
Embedded File Analysis
Uses pdfdetach (Poppler) to extract every embedded file attachment, then forensically analyses each: magic byte detection for Windows PE (MZ), Linux ELF (\x7fELF), OLE/CFBF (\xd0\xcf — Office binary), OOXML archives, script files (.bat, .ps1, .vbs, .sh), and 7-Zip/RAR containers; VBA macro detection in OOXML — inspects xl/vbaProject.bin, word/vbaProject.bin, and ppt/vbaProject.bin entries inside Office attachments; strings extraction on executable payloads to surface suspicious API calls, IP addresses, or command-line arguments. A PDF carrying a PE executable is a confirmed dropper — scored critical regardless of other indicators.
Computes a TLSH (Trend Micro Locality Sensitive Hash) of the full PDF: a similarity-preserving hash where similar files produce similar hashes, which enables the fuzzy matching that SHA-256 cannot provide. A TLSH distance score <30 indicates near-identical files (same exploit kit generation); <100 indicates the same malware campaign family. The hash is compared against every confirmed-malicious PDF previously scanned and stored in the PostgreSQL database. A cluster match surfaces the associated campaign context. A structural fingerprint (object counts, stream sizes, action types) is also computed for samples too small for TLSH. Falls back to ssdeep if TLSH is unavailable.
㉕
AcroForm Field Forensics
Enumerates every AcroForm widget field across all pages and forensically analyses each: JavaScript on fields (/A and /AA actions — JS fires on focus, blur, keystroke, validate, or calculate events, invisible during static review); hidden fields (NoExport flag — fields not shown to the user but present in submitted data); password-type fields (credential harvesting indicators); SubmitForm exfiltration targets — the URL(s) to which all field data is POSTed; /AA additional-action JS triggers on field objects (secondary execution vector independent of /OpenAction); and calculation order (/CO) — adversaries reorder field calculations to chain JS evaluations across fields, enabling multi-step payload staging hidden inside form arithmetic. Results feed into the Correlation Engine.
㉖
Document Revision History
Splits the PDF at each %%EOF boundary and extracts per-revision metadata: author, producer, modification date, and new/modified/deleted object counts for each revision. Detects author identity changes between revisions, execution vectors injected after the original document was created, and large object injections in the final revision — the structural signature of automated exploit staging. Results feed into the Correlation Engine.
Enumerates every annotation object across all pages and forensically analyses each action dictionary: dangerous URI schemes (javascript:, data:, file://, vbscript:); JavaScript action triggers on annotation interaction; /Launch actions that spawn arbitrary programs; GoToR remote links that open external files; and SubmitForm actions that exfiltrate form data. Annotation-borne payloads are completely invisible to scanners that only analyse raw bytes or page content streams. Results feed into the Correlation Engine.
Catalogues the full PDF action infrastructure: Named JavaScript Registry (/Names /JavaScript subtree — persistent JS objects callable by name from any action); /AA Additional Actions (event-driven triggers on page open/close, print, save, field events); /OpenAction classification (JavaScript, Launch, GoToR, URI, GoTo); DocMDP modification prevention signatures that lock out sanitizers; /Perms cryptographic permission restrictions; and UR3 usage-rights signatures used to exploit extended viewer features. Results feed into the Correlation Engine.
㉙
Content Stream Forensics
Inspects all decompressed content streams for dangerous PostScript execution operators: exec (dynamic code execution), run (file execution), token (string-to-code eval), setpagedevice (PostScript-to-system passthrough — bridges to the PostScript interpreter from PDF context). Also detects ICC color profile abuse — malformed /ICCBased profiles of anomalous size exploit heap buffer overflows (CVE-2021-21017 class). Flags content bombs: streams exceeding 5 MB that may exhaust parser memory or conceal data in oversized objects. Results feed into the Correlation Engine.
㉚
Object Stream Analysis
PDF 1.5+ allows multiple objects to be compressed together in a single /ObjStm stream. Scanners that only search raw bytes will miss any object inside a compressed container. This engine decompresses every /ObjStm in the document and re-scans the decompressed content: JavaScript, /Launch actions, /EmbeddedFile references, and high-entropy payloads (entropy >7.5 bits) that suggest encrypted content hidden inside compressed object bundles. Complements the Stream Inspector (Engine 3) with object-container-specific forensics. Results feed into the Correlation Engine.
㉛
PDF Token Obfuscation
PDF name objects can embed hex escape sequences: /J#61vaScript is syntactically identical to /JavaScript to every PDF parser, but bypasses every raw-byte keyword scanner. This engine decodes all hex escapes in name tokens and checks whether the decoded name matches a dangerous keyword — JavaScript, Launch, OpenAction, EmbeddedFile, XFA, SubmitForm, JBIG2Decode, and 10 more. Also detects whitespace-split keywords (e.g. /Ja\nvaScript — valid PDF whitespace normalised by parsers but invisible to substring search), formfeed byte injection (0x0C — a valid PDF whitespace delimiter used instead of space to confuse tools that only accept 0x20/0x09/0x0A as delimiters), and null byte injection in header regions (0x00 bytes outside binary streams — evasion technique against C-string comparison scanners). Finds obfuscation that Engine 2 (byte patterns) misses because it searches for literal keyword bytes. Results feed into the Correlation Engine.
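The hex-escape decoding described above can be sketched as follows. The keyword set is a subset of the names listed in the text, and the function names are hypothetical.

```python
import re

DANGEROUS = {
    "JavaScript", "Launch", "OpenAction", "EmbeddedFile",
    "XFA", "SubmitForm", "JBIG2Decode",
}
# A name token runs from "/" up to the next whitespace or delimiter character.
NAME_RE = re.compile(rb"/([^\s/<>\[\]()%{}]+)")
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(raw: bytes) -> bytes:
    """Replace each #XX escape in a PDF name with the byte it encodes."""
    return HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def find_obfuscated_names(data: bytes) -> list[tuple[str, str]]:
    """Return (as written, decoded) pairs for escaped names that decode to a keyword."""
    hits = []
    for match in NAME_RE.finditer(data):
        raw = match.group(1)
        if b"#" not in raw:
            continue  # plain names are the byte-pattern engine's job
        decoded = decode_name(raw).decode("latin-1")
        if decoded in DANGEROUS:
            hits.append((raw.decode("latin-1"), decoded))
    return hits


hits = find_obfuscated_names(b"<< /S /J#61vaScript /JS (x) /Launch >>")
```

The unescaped /Launch in the sample is deliberately not reported here, because only names that needed decoding count as obfuscation.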
XFA embeds a complete XML application with its own scripting language (FormCalc), distinct from JavaScript. Parses the XFA XML stream using lxml, extracting FormCalc scripts (xfa.host.exec(), Url.resolve(), xfa.host.openURL()), submit actions (<submit action="http://..."> — direct form field exfiltration), initialize event auto-execution (fires on open without /OpenAction, bypassing standard detection), and remote template inheritance (mergeMode="matchTemplate" pulling malicious XDP from a remote server). Fills a genuine gap in the entire PDF forensics industry.
㉝
Action Dependency Graph
Every engine treats PDF actions as independent findings. This engine builds the directed graph of the entire action space and reasons about its topology. Graph analysis: cycles (A→B→A crash loops in vulnerable readers), depth >5 (deep action chains indicate automated exploit construction), fan-in (single JavaScript object triggered by >8 distinct events — trigger maximization for reliability), and dead action nodes (JavaScript/Launch objects with no inbound edges from /Root traversal — sleeper payloads reachable only through vulnerabilities). Computes graph density and edge count. Architecturally unique — no existing tool does this.
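The topology checks can be illustrated on a plain edge list. This sketch assumes edges have already been extracted as (source, target) pairs, measures depth as the shortest distance from a root, and treats a dead node as one unreachable from any root; the node labels are hypothetical.

```python
from collections import defaultdict, deque


def analyse_action_graph(edges, roots):
    """Compute cycle presence, depth from roots, fan-in, and unreachable nodes."""
    graph = defaultdict(list)
    fan_in = defaultdict(int)
    nodes = set(roots)
    for src, dst in edges:
        graph[src].append(dst)
        fan_in[dst] += 1
        nodes.update((src, dst))

    # Breadth-first search gives each reachable node its distance from a root.
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    # Depth-first search: meeting a node that is still in progress means a cycle.
    state = dict.fromkeys(nodes, "new")
    has_cycle = False

    def visit(node):
        nonlocal has_cycle
        state[node] = "active"
        for nxt in graph[node]:
            if state[nxt] == "active":
                has_cycle = True
            elif state[nxt] == "new":
                visit(nxt)
        state[node] = "done"

    for node in sorted(nodes):
        if state[node] == "new":
            visit(node)

    return {
        "has_cycle": has_cycle,
        "max_depth": max(depth.values(), default=0),
        "dead_nodes": sorted(nodes - set(depth)),
        "fan_in": dict(fan_in),
    }


edges = [("root", "a"), ("a", "b"), ("b", "a"), ("x", "y")]
result = analyse_action_graph(edges, roots=["root"])
```

The recursive visit is fine for small graphs; a document with thousands of chained actions would need an iterative traversal to avoid hitting Python's recursion limit.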
Optional Content Groups let a PDF show different content under different conditions, an industry-wide blind spot that extends to VMRay, which sandboxes the default view rather than OCG state variants. Uses PyMuPDF layer APIs to detect: default-OFF layers containing annotations with URI/JS actions (hidden until user interaction: a social engineering trap), Never-state layers (/View [/Never]: content invisible to all viewer UI but accessible to the PDF rendering engine), screen/print divergence (visible on screen but not in print, or vice versa, which hides content from PDF review while delivering it during printing), and viewer-conditional layers (activate only on specific app.viewerType or app.viewerVersion: a targeted attack indicator).
㉟
Unicode & Invisible Text
Completely absent from commercial PDF forensics tools. Uses page.get_text("rawdict") to access raw rendering data: text rendering mode 3 (neither filled nor stroked, so never painted on screen or in print, which hides machine-readable content from visual review), mode 7 (text added to the clipping path only, with nothing painted, yet still part of the accessibility tree, so a screen reader sees it and a human doesn't), white text on white background (visually invisible but present in copy-paste and accessibility tree). Unicode scan: U+202E RIGHT-TO-LEFT OVERRIDE (makes invoice_[U+202E]FDP.exe display as invoice_exe.PDF), U+200B ZERO WIDTH SPACE in field names breaking security tool matching, U+00AD SOFT HYPHEN in URLs evading pattern matchers, and homograph domain attacks using Cyrillic/Greek lookalike characters in URLs.
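The Unicode part of this scan can be sketched on already-extracted text. The character table covers only the three code points named above, and the mixed-script test is a deliberately crude assumption based on Unicode character names.

```python
import unicodedata

SUSPICIOUS = {
    "\u202e": "RIGHT-TO-LEFT OVERRIDE",
    "\u200b": "ZERO WIDTH SPACE",
    "\u00ad": "SOFT HYPHEN",
}


def scan_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for each suspicious character."""
    return [
        (i, f"U+{ord(ch):04X}", SUSPICIOUS[ch])
        for i, ch in enumerate(text)
        if ch in SUSPICIOUS
    ]


def mixed_script(label: str) -> bool:
    """True if a label mixes Latin letters with Cyrillic or Greek ones."""
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in label
        if ch.isalpha()
    }
    return "LATIN" in scripts and bool(scripts & {"CYRILLIC", "GREEK"})


rlo_hits = scan_invisible("invoice_\u202eFDP.exe")
spoofed = mixed_script("p\u0430ypal.com")   # U+0430 is Cyrillic "a"
genuine = mixed_script("paypal.com")
```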
㊱
Trailer Chain Forensics
Engine 26 extracts high-level revision metadata. This engine goes lower — walks the raw /Prev pointer chain at the binary level. For each revision boundary: /ID array mutation (document fingerprint changes mid-document = document substitution attack), /Encrypt changes (encryption added or removed after signing — not covered by Engine 21's byte-range check alone), /Root OID change (entire document catalog swapped between revisions while retaining a valid signature over the original — Shadow Attack / PDF signature bypass), /Size shrinkage (incremental update that reduces declared object count — objects being hidden), and post-%%EOF data (raw bytes after final %%EOF marker — invisible to all PDF-aware parsers, used for payload staging or steganography). No commercial tool does this at the raw trailer-chain level.
㊲
Codec Parameter Validation
Engines 2 and 29 detect filter names. This engine validates the actual parameters of each filter against known exploit profiles — completely absent from every static scanner. CCITTFaxDecode: K<-1 (OOB in some decoders), Columns/Rows >65535 (integer overflow). JBIG2Decode: JBIG2Globals presence (shared segment attack — CVE-2009-0658). DCTDecode: ColorTransform/ColorSpace mismatch (color channel confusion in decoders). LZWDecode EarlyChange=0: historically exploited code path in Acrobat's LZW implementation. RunLengthDecode decompression bomb: extreme compression ratio. Multi-filter chains (≥3 chained filters — legitimate PDFs essentially never need this, used purely to evade single-layer scanners). ASCIIHex+Flate double-encoding obfuscation. Crypt /Identity bypass — nominally encrypted but identity-transform, bypasses scanners that skip encrypted content.
㊳
Physical Entropy Topology
PDF-structure-aware entropy mapping — no forensics tool does this. Reads raw bytes, computes Shannon entropy in 512-byte windows with 256-byte stride, and maps each window to its structural region. Detects: post-%%EOF data (any bytes after the final %%EOF not part of a valid incremental update — raw executables appended are invisible to all PDF-aware parsers), high entropy in non-stream zones (>7.2 bits/byte in comment blocks or trailer area — indicates binary data hidden as PDF comments), entropy cliffs between revisions (sudden entropy jump at an incremental update boundary not corresponding to a new compressed stream = injected encrypted payload), and near-zero entropy in compressed streams (nearly all-zero FlateDecode input that expands to massive output — decompression bomb precursor). Zone averages computed for header, comment, stream, non-stream, and post-EOF regions.
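The windowed entropy map and the post-%%EOF check can be sketched with the standard library. Window and stride match the figures quoted above; mapping windows to structural zones is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def entropy_windows(data: bytes, window: int = 512, stride: int = 256):
    """Return (offset, entropy) for each sliding window over the file."""
    last_start = max(len(data) - window, 0)
    return [
        (offset, shannon_entropy(data[offset:offset + window]))
        for offset in range(0, last_start + 1, stride)
    ]


def post_eof_bytes(data: bytes) -> bytes:
    """Return whatever follows the final %%EOF, ignoring surrounding whitespace."""
    index = data.rfind(b"%%EOF")
    return b"" if index < 0 else data[index + len(b"%%EOF"):].strip()


blob = b"A" * 1024 + bytes(range(256)) * 4
profile = entropy_windows(blob)
trailing = post_eof_bytes(b"%PDF-1.4\n1 0 obj\nendobj\n%%EOF\nMZ\x90payload")
```

With 512-byte windows, uniformly random data averages roughly 7.6 bits per byte rather than 8, so a 7.2 threshold leaves less headroom than it would on larger samples.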
No commercial PDF scanner applies steganography detection to embedded image streams. Chi-square LSB test: in natural images, LSBs are near-uniform; in LSB stego they are exactly uniform (ratio <0.005 from 50/50) — flags stego candidates with statistical confidence. Tracking beacons: 1×2px images that are fetched when the PDF is rendered, revealing opener IP, timestamp, and user agent to the attacker. JPEG EXIF/COM anomalies: APP1/COM segments >512 bytes in simple images — common steganography carrier used by APT groups to embed C2 configuration in logo images. Colorspace mismatches: declared /DeviceGray but actual RGB (extra channels used for data hiding). Duplicate visual content with different raw bytes (per-recipient tracking watermarks). Requires ≥2 simultaneous indicators to flag, controlling false positive rate.
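The "ratio <0.005 from 50/50" criterion can be illustrated with a plain LSB balance check. This is a simplified stand-in for a full chi-square test, not a replacement for it, and it assumes the image samples are already decoded to one byte each.

```python
def lsb_ones_ratio(samples: bytes) -> float:
    """Fraction of samples whose least significant bit is 1."""
    if not samples:
        raise ValueError("no samples to analyse")
    return sum(value & 1 for value in samples) / len(samples)


def lsb_balance_suspicious(samples: bytes, tolerance: float = 0.005) -> bool:
    """True if the LSB plane is balanced to within the tolerance of 50/50."""
    return abs(lsb_ones_ratio(samples) - 0.5) < tolerance


balanced = bytes([0, 1] * 500)          # exactly half the LSBs are set
skewed = bytes([0] * 900 + [1] * 100)   # only 10% of the LSBs are set
balanced_flag = lsb_balance_suspicious(balanced)
skewed_flag = lsb_balance_suspicious(skewed)
```

Noisy photographs can also land close to 50/50, which is consistent with the requirement above that at least two indicators coincide before anything is flagged.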
㊵
Compliance Fraud Detection
PDFs claiming archival or print-exchange compliance are trusted by DLP systems, email gateways, and archival workflows without validation. Falsely claiming compliance is a documented bypass technique. Parses XMP metadata for pdfaid:conformance, pdfaid:part, GTS_PDFXVersion and cross-checks against PDF/A prohibitions: JavaScript (forbidden in all PDF/A variants), /Launch actions (forbidden), embedded files (forbidden in PDF/A-1), encryption (forbidden in PDF/A), URI actions (forbidden in PDF/A-1). Also detects fake compliance markers — raw strings claiming compliance without proper XMP metadata (string injection to confuse parsers). A finding of "Claims PDF/A-1b but contains JavaScript" is unambiguous evidence of deliberate DLP bypass.
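A minimal sketch of the claim-versus-content cross-check, assuming the XMP packet is available as text and that a raw-byte keyword search is an acceptable proxy for the forbidden features. The rule table covers only part of the prohibitions listed above, and the names are hypothetical.

```python
import re

# Matches both attribute form (pdfaid:part="1") and element form (<pdfaid:part>1).
PDFA_PART_RE = re.compile(r'pdfaid:part\s*(?:=\s*"|>)\s*(\d)')


def pdfa_conflicts(xmp: str, raw: bytes) -> list[str]:
    """List features that contradict the PDF/A part declared in XMP."""
    match = PDFA_PART_RE.search(xmp)
    if match is None:
        return []  # no PDF/A claim, so nothing to contradict
    part = int(match.group(1))
    forbidden = [
        (b"/JavaScript", "JavaScript"),
        (b"/Launch", "a Launch action"),
        (b"/Encrypt", "encryption"),
    ]
    if part == 1:
        forbidden.append((b"/EmbeddedFile", "an embedded file"))
    return [
        f"Claims PDF/A-{part} but contains {label}"
        for token, label in forbidden
        if token in raw
    ]


xmp_claim = '<rdf:Description pdfaid:part="1" pdfaid:conformance="B"/>'
findings = pdfa_conflicts(xmp_claim, b"<< /S /JavaScript /JS (app.alert(1)) >>")
```

Because the search runs on raw bytes, hex-escaped names or objects inside compressed streams would need to be decoded first for this check to see them.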
㊶
JS Behavioral Emulation
Middle ground between static analysis (Engine 19) and full OS sandbox (Engine 14). Executes extracted JavaScript in an isolated vm.runInNewContext with a complete PDF Acrobat API stub: app.launchURL(), doc.submitForm(), doc.getURL(), doc.mailDoc(), app.openDoc(), doc.getField() with realistic return values. Captures: LAUNCH_URL (C2 beacon URL decoded at runtime), SUBMIT_FORM (credential exfiltration target URL), GET_URL (network fetch), MAIL_DOC (email exfiltration target), DYNAMIC_FUNCTION (runtime code generation payload). Six-pass multi-eval resolution unwraps nested deobfuscation chains. Catches deobfuscated URLs and payloads invisible to static AST analysis, in under 1 second, with no OS fingerprinting evasion vectors.
㊷
Font CharString Emulator
Type 1 fonts contain programs in the CharString bytecode language — a stack machine with its own operators. Every PDF renderer (FreeType, Windows font engine, Acrobat) executes these. No static scanner emulates them. Decrypts CharString data using the standard linear congruential cipher (seed 4330), then emulates the stack machine to detect: subroutine call depth >10 (nested callsubr/callgsubr — relevant CVE classes: depth exhaustion), missing endchar (program terminates without proper end operator — parser overread), seac OOB (standard encoding accented character operator with character codes >127 — out-of-bounds indexing), othersubr argument overflow (>16 arguments), and oversized CharString programs (>2000 bytes — legitimate glyphs are almost always <500). Identifies the specific glyph name and OID for each finding.
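The decryption step uses the Type 1 CharString cipher with key 4330 and the published constants 52845 and 22719. The sketch below shows that step plus two of the cheap checks; it assumes the default lenIV of 4 leading bytes, does not emulate the stack machine, and includes an encrypt helper only so the round trip can be demonstrated.

```python
C1, C2 = 52845, 22719
CHARSTRING_KEY = 4330
ENDCHAR = 14  # Type 1 endchar opcode


def charstring_decrypt(cipher: bytes, len_iv: int = 4) -> bytes:
    """Decrypt a Type 1 CharString and drop its leading lenIV bytes."""
    r = CHARSTRING_KEY
    plain = bytearray()
    for c in cipher:
        plain.append(c ^ (r >> 8))
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(plain[len_iv:])


def charstring_encrypt(plain: bytes, len_iv: int = 4) -> bytes:
    """Inverse of charstring_decrypt, used here only to build a test sample."""
    r = CHARSTRING_KEY
    out = bytearray()
    for p in b"\x00" * len_iv + plain:
        c = p ^ (r >> 8)
        out.append(c)
        r = ((c + r) * C1 + C2) & 0xFFFF
    return bytes(out)


def quick_checks(program: bytes, max_size: int = 2000) -> dict:
    """Size check, plus a naive last-byte test for a terminating endchar."""
    return {
        "oversized": len(program) > max_size,
        "ends_with_endchar": program[-1:] == bytes([ENDCHAR]),
    }


glyph = bytes([139, 139, 13, ENDCHAR])   # 0 0 hsbw endchar
recovered = charstring_decrypt(charstring_encrypt(glyph))
checks = quick_checks(recovered)
```

The last-byte test can be fooled by a numeric operand that happens to end in 14, which is one reason proper emulation of the operator stream is needed for the remaining findings.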
Beyond qpdf structural validity — builds the full object reference graph via BFS from /Root and reasons about topology. Phantom objects: referenced by OID in another object but absent from xref — parser-confusion attack where some readers synthesize the missing object. Orphan objects: present in xref, not reachable from /Root — staged payload waiting for a vulnerability to make it reachable (especially dangerous when type is JavaScript, Action, or EmbeddedFile). Xref F-entry on live objects: object marked as free in xref but still referenced — exploits free-entry dereferencing bugs in vulnerable readers. OID type confusion: same OID used for different /Type objects across revisions (type confusion exploits object recycling). Stream /Length falsification: declared /Length doesn't match actual stream size — causes buffer over/underread in decoders (broad CVE class).
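The phantom and orphan definitions reduce to set arithmetic over a reference graph. This sketch assumes the xref entries and per-object references have already been parsed into plain Python containers; the object numbers are made up for illustration.

```python
from collections import deque


def classify_objects(xref_entries: set, references: dict, root: int) -> dict:
    """Find phantom (referenced, not in xref) and orphan (in xref, unreachable) objects."""
    reachable = set()
    queue = deque([root])
    while queue:  # breadth-first traversal from /Root
        oid = queue.popleft()
        if oid in reachable:
            continue
        reachable.add(oid)
        queue.extend(references.get(oid, ()))

    referenced = {target for targets in references.values() for target in targets}
    return {
        "phantom": sorted(referenced - xref_entries),
        "orphan": sorted(xref_entries - reachable),
    }


xref_entries = {1, 2, 3, 4}
references = {1: {2, 9}, 2: {3}, 4: {3}}   # object 9 is referenced but never defined
topology = classify_objects(xref_entries, references, root=1)
```

Object 4 is an orphan here even though it references a live object, because nothing reachable from the root points to it.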
Cross-references all 44 engine findings. Individual indicators are scored conservatively — this engine identifies dangerous combinations that are orders of magnitude more serious than their parts. Now incorporates findings from all new engines: XFA auto-execute + submit exfiltration, OCG hidden layer + active content, RLO + phishing indicators, trailer mutation + signature forensics, entropy cliff + high-risk content, behavioral emulation URL + threat intelligence match, compliance fraud + JavaScript (confirmed DLP bypass), stego beacon + network indicators, phantom objects + JavaScript reachability. Classic patterns retained: /OpenAction+JS, JBIG2+JS CVE, eval()+unescape() shellcode, heapspray+JS. Multi-engine vote weighting: indicators confirmed by multiple engines receive logarithmically scaled bonus scoring. 60+ compound patterns.
After all 44 analysis engines complete, a self-hosted Qwen 2.5 1.5B Instruct Q4_K_M LLM synthesises the structured scan output into a concise forensic verdict. The model outputs seven structured fields: threat_verdict (MALICIOUS / SUSPICIOUS / LIKELY_CLEAN / CLEAN), confidence, executive_summary (one sentence), key_findings (signal + severity + MITRE ID), observed_techniques (MITRE ATT&CK IDs and names), recommended_actions, and false_positive_note. Verdict is PHP-enforced against the engine-computed risk_score so it cannot contradict the quantitative analysis — the AI adds narrative and correlation, not classification override. Runs on an isolated remotellm node over an encrypted WireGuard VPN tunnel — zero third-party AI, no data leaves your infrastructure. Typical latency: ~15–25 s at ~13 tokens/s (CPU-only, no GPU). Served by llama-server (llama.cpp) as a systemd service.