The Complete Guide
PQ PDF — Every Tool, Explained
45 free PDF tools. No accounts. No ads. No file retention. Everything runs on our servers — your files are deleted the moment your download starts.
Most online PDF tools share a structural problem: they are built around cloud storage. A file uploaded to add a watermark travels to a third-party processing service, sits in object storage, passes through analytics pipelines, and is subject to retention policies that are vague at best.
PQ PDF is different. Every operation creates one isolated temporary directory, runs entirely inside it, streams the result back to your browser, and deletes the directory — while the download is still in flight. There is no retention window because there is no buffer.
Four specific gaps drove this project: no free tool offered a genuine zero-retention guarantee anywhere in the stack; no free tool ran multi-engine threat analysis on PDFs; no tool offered post-quantum cryptography in document workflows; and no tool was transparent about which engines were actually running — every operation here is described in the exact pipeline terms of the code.
Why privacy-first matters for PDFs
Your file is deleted while the download streams. cleanup() is called immediately after readfile(), not on a schedule and not on the next request. There is no temp-file cleanup job because nothing is left to clean.
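The isolate, stream, delete lifecycle can be sketched in a few lines. This is a minimal illustration in Python, not the site's server code: the function name process_and_stream, the operation callback, and the file names input.pdf and output.pdf are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def process_and_stream(
    upload: bytes,
    operation: Callable[[Path, Path], None],
    chunk_size: int = 64 * 1024,
) -> Iterator[bytes]:
    """Run one operation inside an isolated temp dir and stream the result.

    The directory is removed in the `finally` block, which runs once the
    last chunk has been yielded or the generator is closed early.
    """
    workdir = Path(tempfile.mkdtemp(prefix="pdfjob-"))  # hypothetical prefix
    try:
        src = workdir / "input.pdf"
        dst = workdir / "output.pdf"
        src.write_bytes(upload)
        operation(src, dst)  # the engine reads and writes only inside workdir
        with dst.open("rb") as fh:
            while chunk := fh.read(chunk_size):
                yield chunk
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Because the cleanup sits in the same generator that produces the response body, no separate sweeper or scheduled job is needed in this sketch.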
All 45 tools run on pqpdf.com's own servers. Ghostscript, LibreOffice, Tesseract, PyMuPDF, ClamAV — every engine runs locally. The AI features (forensic report, document analysis, redaction suggestions, change analysis) run on our own self-hosted Qwen 2.5 1.5B LLM via llama.cpp — no OpenAI, no Anthropic, no Google, no third-party AI API of any kind. No file data is ever sent outside our infrastructure.
Every page uses a per-request nonce-based Content Security Policy. No inline scripts, no unsafe-eval. All event handlers are registered via addEventListener() in external JS files, including on this page.
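A per-request nonce policy can be illustrated as follows. This is a hedged sketch: the directive list is an assumption chosen to match the description above (no unsafe-inline, no unsafe-eval), not the exact header this site sends, and both function names are hypothetical.

```python
import secrets


def new_nonce() -> str:
    """Return a fresh, unpredictable nonce; call once per HTTP response."""
    return secrets.token_urlsafe(16)


def build_csp_header(nonce: str) -> str:
    """Build a Content-Security-Policy value that only allows nonce'd scripts."""
    directives = [
        "default-src 'self'",
        f"script-src 'self' 'nonce-{nonce}'",  # no 'unsafe-inline', no 'unsafe-eval'
        "object-src 'none'",
        "base-uri 'self'",
    ]
    return "; ".join(directives)


nonce = new_nonce()
header = build_csp_header(nonce)
# The same nonce would be echoed in each <script nonce="..."> tag of that response.
```

Generating the nonce per response is what makes the policy useful: a value captured from one page load is of no use for injecting a script into the next.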
No analytics pixels. No advertising networks. No social-media trackers. Server access logs (IP, timestamp, path) are retained for 30 days for abuse prevention only, then permanently deleted.
The Protect PDF tool includes 31 post-quantum algorithms (NIST ML-KEM-1024, HQC, FN-DSA, and hybrid modes) running entirely client-side in your browser. The server receives only the encrypted bundle — your plaintext never crosses the network.
Every tool works without registration. There are no user accounts, no email addresses collected, no passwords stored. Rate limiting uses session cookies that exist only in your browser and are never transmitted to or stored on the server.
What the 45 tools cover
Five groups covering every common PDF workflow.
Merge, split, compress, rotate, reorder, delete pages, extract pages, flatten, repair, grayscale, N-up imposition, and auto-crop & deskew.
Explore core tools →
Convert between PDF and Word, Excel, PowerPoint, HTML, Images, Markdown, PDF/A, and PDF/X, in both directions.
Explore convert tools →
44-engine PDF forensics scanner, AES-256 + PQC encryption, unlock, permanent redaction, watermarking, and PAdES-compliant signing.
Explore security tools →
Full visual editor, form fill, PDF diff, OCR, accessibility audit, font inspection, colour profiling, table extraction, and more.
Explore annotation tools →
Chain multiple operations into a named workflow. Save, load, append, and export pipelines as JSON. Run on one or many PDFs in one click.
Explore automation →
Temp-dir isolation, CSP nonces, rate limiting, file validation, zero-retention architecture, and the full engine stack, all documented.
Read the architecture →
Side-by-side comparison with Adobe, Smallpdf, iLovePDF, PDF24, and Sejda: tools, limits, privacy, and pricing verified from official sources.
See the comparison →
How PQ PDF compares
Verified from official privacy policies, terms of service, and tool pages as of March 2026. Only publicly documented claims are listed.
| Feature | PQ PDF | Adobe Acrobat Online | Smallpdf | iLovePDF | PDF24 | Sejda |
|---|---|---|---|---|---|---|
| File retention after processing | ✔ Deleted during download cleanup() called inside send_file(); no retention window of any kind | Not disclosed Files deleted "after processing"; no specific time given on any public page | 1 hour Automatically deleted 1 hour after processing (stated on tool pages) | 2 hours Documented in Terms of Service §9.3 and the public FAQ | 1 hour Stated in Privacy Policy and Terms of Use (pdf24.org) | 2 hours Stated on tool pages and in the Sejda Privacy Policy |
| Free tier — what's included | ||||||
| Core / Organise | ✔ 12 tools Merge · Split · Compress (5 presets + custom DPI + live preview) · Rotate · Extract Pages · Delete Pages · Reorder · Repair · Flatten · Grayscale / B&W · N-up / Imposition · Auto-crop & Deskew (per-page interactive editor) | ✔ 7 tools Free Adobe ID required for most Merge · Split · Rotate · Delete pages · Reorder pages · Add page numbers · Add watermark | ✔ 7 tools 2 tasks/day cap Merge · Split · Compress · Rotate · Delete pages · Reorder pages · Extract pages | ✔ 6 tools Per-task caps apply Merge (25 files, 100 MB) · Split · Remove pages · Extract pages · Rotate · Compress | ✔ 6 tools All free · no caps · ad-supported Merge · Split · Compress · Rotate · Delete pages · Reorder pages | ✔ 13+ tools 3 tasks/hr rate limit Merge (4 variants: std, specific pages, alternate, resize) · Split (5 variants: pages, text, bookmarks, size, extract) · Rotate · Delete pages · Crop · Repair · Flatten · Header & footer · Bates numbering · Reverse PDF |
| Convert | ✔ 13 tools PDF ↔ Word · Excel · PowerPoint · HTML · Images · Markdown (pymupdf4llm) · PDF/A (1b/2b/3b) · PDF/X (X-1a/X-3/X-4) | Partial PDF → Office is paid Free (Adobe ID): Word / Excel / PPT / HTML / JPG → PDF · PDF → JPG (limited quality) Paid only: PDF → Word · Excel · PPT · PDF/A · OCR | Partial PDF → Office is Pro only Free: Word / Excel / PPT / HTML / JPG → PDF · PDF → JPG Pro only: PDF → Word · Excel · PPT · OCR · PDF/A | ✔ 10 tools File size caps per tool → PDF (5): Word · Excel · PPT · HTML · JPG (20 files, 100 MB each) PDF → (5): Word (10 MB) · Excel (10 MB) · PPT (10 MB) · JPG (100 MB) · HTML (10 MB) | ✔ 10 tools All free · no caps · ad-supported PDF ↔ Word · PDF ↔ Excel · PDF ↔ PowerPoint · Image ↔ PDF · HTML → PDF | ✔ 8 tools 3 tasks/hr rate limit PDF → (5): Word · Excel · PPT · JPG · HTML → PDF (3): Word · Excel · JPG |
| Security & Encryption | ✔ 6 tools 44-engine PDF forensics scanner (structural · dynamic sandbox · ML+SHAP · XFA FormCalc · action dependency graph · OCG cloaking · Unicode/invisible text · trailer chain forensics · codec exploit params · entropy topology · image stego · compliance fraud · JS behavioral emulation · font CharString emulator · XRef integrity graph · local threat intelligence 6.4M+ indicators · MITRE ATT&CK · signature forensics · phishing · campaign attribution) · AES-256 + 31-algorithm PQC encrypt · Unlock · Permanent redact · Watermark · PAdES-B sign | ✔ 2 tools free Most security features are paid Free (Adobe ID): Protect (encrypt) · Unlock Paid only: Redaction · Advanced watermark | ✔ 3 tools 2 tasks/day cap Protect (encrypt) · Unlock · Watermark Pro only: Redact · PDF/A compliance | ✔ 5 tools Per-task caps apply Protect (encrypt) · Unlock · E-sign (request) · Validate signature · Redact PDF No AES-256 PQC encryption, no threat scanning | ✔ 4 tools All free · no caps · ad-supported Protect (encrypt) · Unlock (decrypt) · Sign PDF · Compare PDFs | ✔ 5 tools 3 tasks/hr rate limit Protect (encrypt) · Unlock · Sign PDF (request) · Validate signature · Redact PDF |
| Edit, Annotate & Inspect | ✔ 11 tools Visual editor (16 annotation tools + AcroForm builder + bookmarks) · Fill forms · Compare / diff · Tesseract 5 LSTM OCR (searchable PDF + confidence) · Bookmarks editor · WCAG 2.1 accessibility checker · Font inspector · Colour / CMYK inspector · Tables → JSON · Extract text · PDF info | Partial Advanced edit, OCR & compare are paid Free: Add comments · Fill & Sign · Basic text edit Paid only: Advanced editing · OCR · Compare PDFs · Accessibility checker · Bates numbering | ✔ 4 tools 2 tasks/day cap Edit PDF · Add page numbers · Fill forms · Flatten Pro only: OCR · Redact · AI chat / summarise | ✔ 6 tools Per-task caps apply Edit PDF (50 MB) · Add page numbers · Fill forms · Sign PDF · OCR PDF · PDF Scanner Premium: AI tools, API, batch | ✔ 5 tools All free · no caps · ad-supported Edit PDF · Add page numbers · Fill forms · OCR PDF · PDF/A conversion | ✔ 9 tools 3 tasks/hr rate limit Edit PDF · Add text / image / shapes / links · Whiteout · Edit hyperlinks · Add page numbers · OCR PDF · PDF Scanner · Optimise for web · HTML → PDF |
| Automation / Workflow | ✔ 15-step workflow builder Visual builder — chain, save, append, export as JSON; run on multiple PDFs in one job; fully free, no account needed | ✘ None free Acrobat Actions (macro-style) — paid Acrobat Pro only; no visual builder | ✘ None Batch via API — Pro only; no visual workflow builder | ✘ None Batch — Premium & API only; no visual builder | ✘ None Tools run individually; no workflow builder | ✘ None Tools run individually; no workflow builder |
| Send for e-signature (multi-party) | ✔ Free, no caps Up to 10 signers · sequential or parallel order · unique secure link per signer · PAdES-B cryptographic option (sender can enforce it: require_crypto rejects submissions without a cryptographic signature) · free drag-and-drop signature placement · date stamp · add/remove signers from tracking page · 24-hr TTL · no account needed | ✔ 3 requests/month free Request e-sig from others via free Adobe ID; unlimited via Acrobat Sign (paid) | ✔ Limited free E-sign requests included in free 2-tasks/day tier; Pro removes cap | ✔ Free with caps E-sign: request sigs from others; 1-file per-task cap applies | ✔ Free Request-signature workflow; free, no cap, ad-supported | ✔ Free with rate limit E-sign: request sigs from others; 3 tasks/hr cap applies |
| PDF Scanner (camera to PDF) | ✔ Free — no caps Browser camera or photo upload · real-time edge detection · OpenCV perspective correction · CLAHE & B&W enhancement · Tesseract 5 OCR · multi-page · no app install · zero retention | ✔ Adobe Scan (app required) Dedicated free mobile app; scan to searchable PDF with OCR — requires install | ✘ Not available | ✔ Yes — free with caps PDF Scanner web tool; mobile camera to PDF with OCR; per-task file cap applies | ✔ PDF24 app (app required) Mobile app includes scan-to-PDF; free, ad-supported — requires install | ✔ Yes — rate limited PDF Scanner tool; 3 tasks/hr cap applies |
| Free tier limits | No account · No task caps · No daily limits · No ads · No upsells 50 MB / file · 200 MB total per request | 2 GB / file Most tools need free Adobe ID; aggressive upgrade prompt after 1–2 uses of most tools; full access requires paid plan | 5 GB / file 2 tasks/day hard cap — upgrade prompt after that; batch & API are Pro-only | 200 MB / task Per-tool file count & size caps; tighter on PDF → Office (10 MB); Premium for batch & larger files | 500 MB / file No task cap, no daily limit; all tools free with advertising; Premium removes ads | 50 MB / file 3 tasks/hr · 50 pages/task · 30 files/hr; Paid plan removes all rate limits |
| Malware & threat scanning | ✔ 44 independent engines Static heuristics · dynamic Linux namespace sandbox · ML anomaly detection · XFA FormCalc parser · PDF action dependency graph · OCG layer cloaking · Unicode/invisible text · trailer chain forensics · codec exploit params · entropy topology · image steganography · compliance fraud · JS behavioral emulation · font CharString emulator · XRef integrity graph · six-parser differential · JS AST deobfuscation · AcroForm forensics · signature forensics · campaign attribution · weighted correlation engine | ✘ No malware scanning CSAM content check only (Adobe Terms §2.2(C)); no PDF threat analysis disclosed | ✘ None disclosed | ✘ Explicitly none "We won't check, copy or analyze your files in any way" — iLovePDF FAQ | ✘ None disclosed | ✘ None disclosed |
| Post-quantum encryption | ✔ 31 algorithms, client-side NIST ML-KEM-1024/768/512, HQC-128/192/256, FN-DSA variants, and hybrid modes via @noble/post-quantum; server never sees plaintext | ✘ Not available Not disclosed on any public page | ✘ Not available | ✘ Not available | ✘ Not available | ✘ Not available |
| Processing on own servers | ✔ All engines run locally No file data sent to any third-party service — every engine runs on pqpdf.com's own server | Third party — not named "Trusted cloud infrastructure providers and CDNs" (Adobe Terms) — providers not disclosed | Not disclosed Privacy policy pages returned errors during verification; subprocessors not published | Third party — not named "Leading cloud data storage provider" cited on Security page — name not disclosed | EU servers confirmed; provider not named "All servers within the EU" — PDF24 Privacy Policy (geek software GmbH, Berlin) | DigitalOcean, Cloudflare, Fastly All three named as infrastructure providers in the Sejda Privacy Policy |
| AI analysis — self-hosted LLM | 🤖 Qwen 2.5 1.5B — self-hosted 4 AI features — all self-hosted on own hardware, zero third-party AI calls: 🤖 AI Forensic Report (synthesises all 44 engine outputs → verdict, confidence, executive summary, key findings, MITRE techniques, recommended actions, false_positive_note; MALICIOUS/CLEAN auto-labels for ML retraining), 🤖 AI Document Analysis (type classification with confidence, language, entities incl. locations, topics, reading level), 🤖 AI Redaction Suggestions (PII pattern proposals across 13 categories with example + reason), 🤖 AI Change Analysis (significance rating, change type, plain-English summary, per-change details array, recommendation) — Qwen 2.5 1.5B Instruct via llama.cpp, ~13 t/s on Ryzen 5 3550H. No OpenAI, no Anthropic, no Google. Your document text never leaves our infrastructure. | ⚠️ Adobe Sensei (cloud AI) Adobe AI routes content through Adobe's own cloud AI pipeline — separate from PDF processing infrastructure | ✘ None disclosed | ✘ None disclosed | ✘ None disclosed | ✘ None disclosed |
| Processing engines disclosed | ✔ All engines named Ghostscript, Poppler, LibreOffice, Tesseract 5, PyMuPDF, YARA, ClamAV, PeePDF, pikepdf, Acorn, scikit-learn, LightGBM, imagehash — every tool documented | ✘ None disclosed Described as proprietary; no library or engine names on any public page | ✘ None disclosed | ✘ None disclosed | ✘ None disclosed | Partial Tesseract named for the desktop app OCR feature only; web-side engines not disclosed |
| Max upload — free tier | 50 MB / file 200 MB total per request across all files | 2 GB Stated on the Compress tool page (acrobat.adobe.com) | Not stated publicly Pricing page not accessible; per-tool limits not published on accessible pages | 200 MB / task Varies by tool — lower limits on conversions (e.g. 15 MB for some); ilovepdf.com | 500 MB / file Stated on tool pages at tools.pdf24.org | 50 MB / file 200 MB / task; 3 tasks/hour; page cap 50 pages/task (sejda.com) |
| Open-source engines only | ✔ 100% open source Ghostscript, Poppler, LibreOffice, Tesseract 5, PyMuPDF, YARA, ClamAV, PeePDF, pikepdf, Acorn, scikit-learn, LightGBM, imagehash — every engine is named open-source software | ✘ No Proprietary Adobe processing pipeline; specific engines not disclosed on any public page | ✘ Not disclosed | ✘ Not disclosed | ✘ Not disclosed | Partial Tesseract named for desktop OCR only; web processing pipeline not disclosed |
| Cryptographic signing standard | ✔ PAdES-B (ETSI EN 319 102-1) Incremental CMS/PKCS#7 via pyhanko 0.34 — verifiable in Adobe Reader's Signatures panel. Draw/type/upload modes also available with embedded RSA-2048 cert | Acrobat standard signature Adobe Sign product available; standard acrobat.adobe.com e-sign. PAdES compliance not documented on public pages | Basic e-signature Sign tool available; signing standard not disclosed on accessible pages | Basic e-signature Sign tool available; signing standard not disclosed on accessible pages | Basic e-signature Sign tool available; signing standard not disclosed on accessible pages | Basic e-signature Sign tool available; signing standard not disclosed on accessible pages |
| Advertising / monetisation model | ✔ No ads, no upsells No advertising, no tracking pixels, no in-tool upgrade prompts, no affiliate links — tool is self-funded | Freemium — subscription upsells Free basic use; persistent prompts to upgrade to paid Acrobat plan | Freemium — subscription upsells Free tier with task limits; upgrade prompts throughout the tool flow | Freemium — subscription upsells Free tier with limits; Premium plan promoted within tools | Free with ads Web app displays advertising on the free tier; Premium plan available for ad-free experience | Freemium — task-limited Free tier capped at 3 tasks/hour; paid plan promoted in tool UI |
ℹ️ All competitor claims verified from official sources as of March 2026: Adobe Terms · iLovePDF Terms §9.3 · iLovePDF FAQ · PDF24 Privacy Policy · Sejda Privacy Policy · Smallpdf retention confirmed from tool pages (privacy policy pages unavailable at time of research). PQ PDF claims are derived from api.php, _tool_head.php, and tool source files on this server.
12 tools for everyday PDF manipulation. All processing is server-side; nothing is rasterised unless explicitly requested.
Combine up to 20 PDFs into one (200 MB total). Drag thumbnails to reorder before merging. Real-time upload progress percentage.
Try it →
Split by every page, a fixed interval, custom page ranges, or interactive cut-point selection. Output is a ZIP of individual PDFs.
Try it →
Five quality presets plus custom DPI slider (50–600). Optional metadata stripping, linearisation, and stream recompression. Live before/after split-canvas preview from page 1. Shows size reduction after download.
Try it →
Rotate all, odd, even, or a custom range. Supports 90°/180°/270° and arbitrary decimal angles. Live canvas preview of page 1.
Try it →
Click a thumbnail grid to select the pages you want to keep. Selections auto-compress to ranges (e.g. 1–3, 5, 7–9).
Try it →
Reconstructs corrupted or malformed PDFs via Ghostscript. On upload, PDF.js diagnoses the file client-side (checking the header, xref table, and content streams) and shows red error badges or a green "readable" confirmation before any server work runs.
Try it →
Permanently bakes form fields, annotations, and layers into the page content. Client-side pre-scan shows exactly what will be flattened (field counts, annotation types, layer names), with a green "already flat" badge if nothing is found.
Try it →
Convert a colour PDF to grayscale or pure black-and-white. Live before/after split-canvas preview: colour on the left, grayscale simulation on the right.
Try it →
Arrange multiple PDF pages on each output sheet: 2-up, 4-up, 6-up, 8-up, 9-up, or booklet (pages re-ordered for saddle-stitch binding). Uses PyMuPDF show_pdf_page() for vector output with no rasterisation. Page size and orientation selectable.
Remove excess white margins and correct page rotation. Three modes: crop only, fix rotation only, or both. Features a per-page interactive crop editor; see the deep dive below.
Try it →
Most deskew tools apply a single global correction. This one gives you a per-page interactive editor before anything is sent to the server.
How auto-detection works
After upload, each page is rendered via PDF.js. Text extraction and PyMuPDF's bounding-box analysis across text blocks, vector paths, and raster images together detect the tight content boundary, and a 20 pt safety margin is added on every side. The result is drawn as a draggable crop box over the rendered page.
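The boundary step amounts to taking the union of all content rectangles, padding it, and clamping it to the page. A minimal sketch of that arithmetic follows, assuming rectangles are (x0, y0, x1, y1) tuples in points; the function name content_crop_box is hypothetical, and the detection of the input boxes themselves is not shown.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def content_crop_box(boxes: Iterable[Rect], page: Rect, margin: float = 20.0) -> Rect:
    """Union all content boxes, add a safety margin, and clamp to the page."""
    boxes = list(boxes)
    if not boxes:
        return page  # nothing detected: leave the page uncropped
    x0 = min(b[0] for b in boxes) - margin
    y0 = min(b[1] for b in boxes) - margin
    x1 = max(b[2] for b in boxes) + margin
    y1 = max(b[3] for b in boxes) + margin
    # Never let the padded box extend beyond the page itself.
    return (max(x0, page[0]), max(y0, page[1]), min(x1, page[2]), min(y1, page[3]))


a4 = (0.0, 0.0, 595.0, 842.0)
text_and_image = [(100.0, 120.0, 300.0, 400.0), (80.0, 150.0, 500.0, 700.0)]
crop = content_crop_box(text_and_image, a4)
```

Clamping matters for pages whose content already runs close to an edge: without it the padded box could extend past the page and produce a crop larger than the original.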
The interactive crop editor
What gets sent to the server
Per-page overrides are sent as a JSON array of {page, x0, y0, x1, y1} in PDF display-space points. Pages without manual overrides continue to use server-side auto-detection. The rotation fix bakes the /Rotate flag into the content stream so output pages have rotation=0 in all viewers; aspect ratio and coordinate mapping are preserved for 90°/180°/270° pages via an offset target-rect approach.
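As an illustration of the override payload shape, the sketch below serialises a set of manual crop boxes into the {page, x0, y0, x1, y1} array described above. The function name is hypothetical, and 1-based page numbering is an assumption; the source does not say which base the real endpoint uses.

```python
import json
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in display-space points


def build_crop_overrides(edits: Dict[int, Rect]) -> str:
    """Serialise manual crop boxes; pages absent from `edits` are simply omitted."""
    payload = []
    for page in sorted(edits):
        x0, y0, x1, y1 = edits[page]
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"empty or inverted crop box on page {page}")
        # Assumption: pages are numbered from 1.
        payload.append({"page": page, "x0": x0, "y0": y0, "x1": x1, "y1": y1})
    return json.dumps(payload)


body = build_crop_overrides({3: (40.0, 50.0, 555.0, 790.0), 1: (36.0, 36.0, 559.0, 806.0)})
```

Omitting untouched pages keeps the request small and leaves them to the server-side auto-detection path.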
Compress PDF controls image DPI downsampling via Ghostscript and optionally applies stream-level recompression via qpdf. A live split-canvas preview renders page 1 as soon as you upload — original on the left, simulated compression on the right — so you can see the visual impact before committing to a download.
Quality presets
/prepress quality settings. Use for documents destined for a print shop or colour-critical workflows.
Advanced options
What gets compressed — and what doesn't
DPI settings affect raster images embedded in the PDF. Photographs, scanned pages, and screenshots will see the largest reductions — typically 40–90%. Vector content and text are not affected by DPI; they are resolution-independent and remain identical across all presets. A text-only document will see minimal reduction regardless of preset; in that case, enabling stream recompression and metadata stripping provides the most benefit.
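The relationship between a preset's DPI and an embedded image can be worked through numerically. The sketch below is arithmetic only, under the assumption that an image's effective resolution is its pixel width divided by its placed width in inches; it does not reproduce Ghostscript's own downsampling rules, and the function names are hypothetical.

```python
from typing import Tuple


def effective_dpi(pixel_width: int, placed_width_pt: float) -> float:
    """Resolution of an image as placed on the page (72 points per inch)."""
    return pixel_width / (placed_width_pt / 72.0)


def downsample_plan(
    pixel_width: int, pixel_height: int, placed_width_pt: float, target_dpi: float
) -> Tuple[int, int, float]:
    """Return new pixel size and the fraction of pixels removed."""
    current = effective_dpi(pixel_width, placed_width_pt)
    if current <= target_dpi:
        return pixel_width, pixel_height, 0.0  # already at or below target: untouched
    scale = target_dpi / current
    new_w = round(pixel_width * scale)
    new_h = round(pixel_height * scale)
    removed = 1.0 - (new_w * new_h) / (pixel_width * pixel_height)
    return new_w, new_h, removed


# A 2480 x 3508 px scan placed 595.2 pt wide is about 300 DPI; target 150 DPI.
plan = downsample_plan(2480, 3508, 595.2, 150)
```

Halving the resolution removes three quarters of the pixels, which is why scanned documents shrink so sharply. The pixel fraction is not the same as the final file-size saving, since that also depends on how each image is encoded.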
Split-canvas preview
As soon as a file is uploaded, page 1 is rendered at full resolution in the browser via PDF.js. The canvas is split vertically — left half shows the original, right half simulates the compressed output at the selected DPI. The preview updates live as you switch presets. This shows the visual quality trade-off before any server processing occurs.
Split PDF supports four distinct split strategies. All modes use Poppler at the binary level — pages are extracted without re-rendering, so fonts, images, and text layers are preserved exactly as they appear in the source.
Split modes
1-3, 5, 7-9). Each range becomes a separate PDF. Pages not covered by any range are discarded. Useful for extracting specific named sections from a longer document.
Output
Every Page, Interval, and Interactive modes output a ZIP archive. Custom ranges output a single PDF when one range is specified, or a ZIP for multiple ranges. No re-encoding occurs — output pages are binary-identical to the source pages.
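A custom range string such as "1-3, 5, 7-9" has to be parsed and validated before any pages are extracted. The following is a minimal sketch of such a parser, not the tool's own implementation; the function name and the choice to reject out-of-bounds ranges (rather than clip them) are assumptions.

```python
import re
from typing import List, Tuple


def parse_page_ranges(spec: str, page_count: int) -> List[Tuple[int, int]]:
    """Parse '1-3, 5, 7-9' into inclusive 1-based (first, last) pairs."""
    ranges: List[Tuple[int, int]] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # Accept both a hyphen and an en dash between the two numbers.
        match = re.fullmatch(r"(\d+)(?:\s*[-\u2013]\s*(\d+))?", part)
        if match is None:
            raise ValueError(f"unrecognised range: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"range {part!r} is outside pages 1..{page_count}")
        ranges.append((first, last))
    return ranges


ranges = parse_page_ranges("1-3, 5, 7-9", page_count=10)
```

Each resulting pair would map to one output PDF, which matches the rule above: one range yields a single file, several ranges yield a ZIP.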
N-up imposition places multiple source pages onto each output sheet. Unlike tools that rasterise pages before imposing them, this tool uses PyMuPDF's show_pdf_page() — pages are placed as live PDF content. Text remains selectable and searchable in the output; images are not re-compressed.
Layouts
Output options
Output page size is selectable (A4, Letter, Legal, A3, or original). Orientation (portrait/landscape) is configurable independently. Pages are auto-scaled to fit each cell while preserving aspect ratio.
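The cell layout and aspect-preserving fit can be sketched with plain geometry. The code below only computes rectangles: in a PyMuPDF-based pipeline each result would be the target rectangle handed to show_pdf_page(). The helper names, the row-major cell order, and the centring of pages within cells are assumptions.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def grid_cells(sheet_w: float, sheet_h: float, cols: int, rows: int) -> List[Rect]:
    """Split a sheet into equal cells, listed row by row from the top left."""
    cw, ch = sheet_w / cols, sheet_h / rows
    return [
        (c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
        for r in range(rows)
        for c in range(cols)
    ]


def fit_in_cell(page_w: float, page_h: float, cell: Rect) -> Rect:
    """Scale a page to fit inside a cell, keeping aspect ratio, and centre it."""
    x0, y0, x1, y1 = cell
    scale = min((x1 - x0) / page_w, (y1 - y0) / page_h)
    w, h = page_w * scale, page_h * scale
    left = x0 + ((x1 - x0) - w) / 2
    top = y0 + ((y1 - y0) - h) / 2
    return (left, top, left + w, top + h)


cells = grid_cells(595.0, 842.0, cols=2, rows=2)   # 4-up on an A4 portrait sheet
portrait = fit_in_cell(595.0, 842.0, cells[0])     # fills the cell exactly
landscape = fit_in_cell(842.0, 595.0, cells[0])    # limited by the cell width
```

Using the smaller of the two scale factors guarantees the page never overflows its cell, at the cost of blank space along one axis when the aspect ratios differ.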
13 tools converting between PDF and common document, spreadsheet, presentation, image, and web formats. All conversions run via locally-installed open-source engines — LibreOffice, Ghostscript, PyMuPDF, ImageMagick.
PDF to other formats
Export to .docx, .odt, .rtf, or .txt via LibreOffice. A format fidelity indicator shows star ratings (out of 4) for each output format before you convert.
Extracts tables to .xlsx via LibreOffice. Best suited to PDFs where table structure is preserved in the source document.
Renders pages to PNG or JPEG at 72–600 DPI. Select all pages or a custom range. JPEG quality slider available. Live DPI preview: page 1 is rendered in a canvas at the selected DPI immediately on upload, showing actual pixel dimensions before processing. Download as ZIP.
Try it →
Convert to PDF/A-1b, PDF/A-2b, or PDF/A-3b for long-term archival (ISO 19005). Fonts are embedded, transparency is flattened, colour profiles are attached.
Try it →
Convert to print-industry PDF/X (X-1a, X-3, or X-4) via Ghostscript with CMYK colour conversion, /prepress quality, and configurable render intent. All fonts embedded, colour data print-shop compliant.
Each page is rendered at 150 DPI via PyMuPDF and placed as a full-bleed image on its own slide using python-pptx. Slide dimensions match the original page aspect ratio.
Try it →
Converts pages to a styled HTML document using PyMuPDF page.get_text("html"), preserving font, size, and positioned text spans. Produces a single self-contained .html file with print-friendly styling.
Uses pymupdf4llm, the AI/LLM-optimised layout analysis engine built on PyMuPDF 1.27 + ONNX. Detects headings, paragraphs, tables, code blocks, and list structures. Produces clean .md ideal for RAG pipelines and LLM ingestion.
Other formats to PDF
Convert .doc / .docx / .odt / .rtf / .txt via LibreOffice. A fidelity indicator shows expected quality for the file type you upload.
Convert .xls / .xlsx / .ods / .csv via LibreOffice. A sheet selector fetches sheet names from the uploaded file so you can choose which sheets to convert.
Convert .ppt / .pptx / .odp via LibreOffice. A slide selector fetches slide titles from the uploaded file so you can choose which slides to include.
Pack JPEG / PNG / WebP / BMP / TIFF / GIF images into a single PDF via ImageMagick. Drag thumbnails to reorder before generating.
Try it →
Upload a .html / .htm file or enter any public URL. Converted via Playwright/Chromium: the full Chromium rendering engine captures modern CSS, web fonts, lazy-loaded images, and JavaScript-rendered content. Page size, orientation, and margins are configurable.
Most PDF-to-text converters flatten the document into a single stream of characters, destroying the layout information that makes it useful — headings become plain lines, tables become scrambled text, multi-column layouts interleave content from adjacent columns. pymupdf4llm analyses the document layout before extracting text.
Engine: pymupdf4llm + ONNX
pymupdf4llm is the AI/LLM-optimised extraction layer built on PyMuPDF 1.27 with an ONNX inference backend. It analyses bounding-box positions, font sizes, column boundaries, and text flow to infer document structure before generating Markdown — the same technique used by state-of-the-art document AI pipelines.
Structural elements detected
LLM and RAG use cases
Convert PDF pages to individual image files using Poppler's pdftoppm renderer. A live preview shows exactly what the output will look like at the selected DPI — including the actual output pixel dimensions — before any server processing starts.
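For orientation, a pdftoppm invocation for this kind of conversion can be assembled as below. The flags used (-r, -f, -l, -png, -jpeg, -jpegopt) are Poppler's documented options; the function name, the default values, and the way this site actually calls the renderer are assumptions. The sketch only builds the argument list and does not run anything.

```python
from typing import List, Optional


def pdftoppm_args(
    pdf_path: str,
    out_prefix: str,
    dpi: int = 150,
    fmt: str = "png",
    first: Optional[int] = None,
    last: Optional[int] = None,
    jpeg_quality: int = 85,
) -> List[str]:
    """Build an argument list for Poppler's pdftoppm (not executed here)."""
    if fmt not in ("png", "jpeg"):
        raise ValueError("fmt must be 'png' or 'jpeg'")
    args = ["pdftoppm", "-r", str(dpi)]
    if first is not None:
        args += ["-f", str(first)]   # first page to render
    if last is not None:
        args += ["-l", str(last)]    # last page to render
    args.append(f"-{fmt}")
    if fmt == "jpeg":
        args += ["-jpegopt", f"quality={jpeg_quality}"]
    # pdftoppm appends its own page-number suffix to the output prefix.
    args += [pdf_path, out_prefix]
    return args


cmd = pdftoppm_args("input.pdf", "page", dpi=300, fmt="jpeg", first=1, last=3)
```

Passing the list to subprocess.run without a shell avoids quoting problems with uploaded file names.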
DPI options
Output formats
PNG: Lossless. All detail preserved. Recommended for technical documents, forms, and anything with sharp edges or small text.
JPEG: Lossy with a configurable quality slider (50–100%, default 85%). Considerably smaller than PNG, even at quality 80+. Recommended for photo-heavy PDFs or when file size is critical.
Live DPI preview with file size estimate
As soon as a file is uploaded, page 1 is rendered in the browser at the selected DPI. The size estimate card shows the actual output pixel dimensions and an estimated file size per page — for example, "1240 × 1754 px per page — ~1.5 MB per page". PNG estimates use ~0.7 bytes/pixel (lossless document content). JPEG estimates scale with the quality slider: at quality 85 the multiplier is ~0.14 bytes/pixel — considerably smaller than PNG. Changing the DPI, format, or quality slider re-runs the estimate immediately so you know exactly how large the ZIP will be before any server processing starts.
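The estimate in the example above can be reproduced by hand. The sketch below assumes an A4 page of 595 × 842 pt and takes the bytes-per-pixel multiplier as an input, using the two figures quoted in the text (about 0.7 for PNG, about 0.14 for JPEG at quality 85). How the multiplier varies at other JPEG qualities is not stated, so it is not modelled; the function name is hypothetical.

```python
from typing import Tuple


def estimate_page_output(
    width_pt: float, height_pt: float, dpi: int, bytes_per_pixel: float
) -> Tuple[int, int, float]:
    """Return (pixel width, pixel height, estimated bytes) for one rendered page."""
    width_px = round(width_pt / 72.0 * dpi)    # 72 points per inch
    height_px = round(height_pt / 72.0 * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# A4 at 150 DPI with the PNG multiplier quoted above.
png_w, png_h, png_bytes = estimate_page_output(595, 842, 150, 0.7)
# Same page with the JPEG quality-85 multiplier.
_, _, jpeg_bytes = estimate_page_output(595, 842, 150, 0.14)
```

For this page the PNG figure works out to 1240 × 1754 px and roughly 1.5 MB, matching the example card, while the JPEG estimate is one fifth of that.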
Page selection and output
All pages or a custom range (e.g. 1-3, 5, 7-10). Output is a ZIP archive containing one image per page, named sequentially — page-001.png, page-002.png, etc.
PDF/X is the ISO standard for print exchange (ISO 15930). It constrains the PDF feature set to what is reliably reproducible by commercial presses: all fonts embedded, no device RGB in PDF/X-1a (X-3 and X-4 permit colour-managed RGB), and transparency flattened in X-1a and X-3. This tool converts via Ghostscript's /prepress quality settings with configurable render intent.
PDF/X standards
Render intent — controls RGB → CMYK mapping
What the conversion does
All RGB images are converted to DeviceCMYK. All fonts are embedded and subsetted. Transparency is flattened (PDF/X-1a and X-3). Ghostscript's /prepress output intent is applied. The resulting file meets the ISO 15930 constraint set for the selected variant and is accepted by commercial RIP workflows.
Six tools covering PDF forensics & analysis, encryption, decryption, permanent content removal, watermarking, and cryptographic signing. All run server-side on local engines — nothing is sent to a third-party service.
Forensic analysis across 44 independent engines — structural, behavioural, provenance, ML anomaly detection with SHAP, local threat intelligence (URLhaus · MalwareBazaar · ThreatFox — 6.4M+ indicators, no external APIs), AcroForm field forensics, PDF signature forensics, phishing detection, embedded file analysis, and TLSH campaign attribution. MITRE ATT&CK mapping on every indicator. Results across 24 analysis tabs including 🤖 AI Forensic Report — Qwen 2.5 1.5B Instruct synthesises all 44 engine outputs into a structured verdict, with semantic context from live engine data: actual phishing phrases, JavaScript call targets, embedded payload strings, FormCalc code, and SHAP feature explanations fed directly to the model. Verdict is exec-vector-aware (high score with no execution vector caps at LIKELY_CLEAN). MALICIOUS verdict auto-labels the record as 'malicious'; CLEAN/LIKELY_CLEAN as 'benign'; SUSPICIOUS is not labeled (ambiguous). Triggers ML retrain at threshold — no user input needed. 9-mode sanitize: flatten to images, strip active content, remove JavaScript, remove embedded files, remove XFA, remove rich media, normalize structure, flatten forms, or strip metadata. The most technically deep tool on the site — see the deep dive.
Try it →
Two modes: AES-256-CBC server-side with granular permissions, or client-side post-quantum encryption with 31 algorithms. In PQC mode the server never sees your plaintext. See the deep dive.
Try it →
Remove password protection (owner password required). Detects encryption type client-side before upload and shows an AES-256 or PQC badge. PQC bundles (.pqcpdf) are auto-detected and routed to the quantum-safe decryption panel.
Two modes: text-pattern redaction (multi-pattern list, case sensitivity, whole-word matching) or mouse-drawn region redaction on a canvas preview. Redaction is permanent: content is erased server-side, not just covered. Includes 🤖 AI Redaction Suggestions: Qwen 2.5 1.5B analyses extracted text and proposes redaction patterns by PII category (names, emails, IDs, financial data, and more) with one-click add to the redaction list.
Try it →
Stamp text watermarks with 8-position placement, opacity, rotation, font size, font style, and hex colour. Apply to all, odd, even, or custom page ranges. Live canvas preview updates in real time as you adjust settings.
Try it →
Four signature modes: draw, type, upload image, or invisible PAdES cryptographic signature. All modes support RSA-2048 certificates, auto-generated or your own .p12. See the deep dive.
PDF is the most abused document format for delivering malware. This forensics scanner runs 44 independent engines covering every investigative dimension: byte-level signatures, structural integrity, sliding-window entropy, provenance analysis, dynamic behavioural tracing, machine learning anomaly detection with SHAP explanations (IsolationForest + RandomForest + LightGBM), multi-parser differential analysis across six independent parsers, fully offline threat intelligence (URLhaus · MalwareBazaar · ThreatFox · FeodoTracker · OpenPhish; 6.4M+ indicators, zero external API calls), PDF digital signature forensics, phishing detection, AcroForm field forensics (JS triggers on field events, SubmitForm exfiltration targets, hidden fields, password fields, /AA hooks, calc-order chain exploitation), embedded file analysis (magic-byte classification, VBA macro detection, full ZIP archive content listing, nested PDF detection, PowerShell content analysis), and TLSH + pHash + JS-fingerprint campaign attribution. Every indicator is tagged with MITRE ATT&CK technique IDs.
Results are presented across 24 analysis tabs: Summary, Threats, Score, a per-engine two-panel browser (click any of the 44 engines for its full findings + structure fields), URLs, Streams, ML/SHAP, Sandbox, Threat Intel, MITRE, Differential Parsing, Polyglot, Phishing, Embedded Files, Signature Forensics, Revision History, Annotations, Metadata, XFA FormCalc, Action Graph, Deep Forensics (engines 34–43), 🤖 AI Forensic Report (Qwen 2.5 1.5B Instruct synthesises all 44 engine outputs into threat verdict, confidence rating, executive summary, key findings, MITRE technique grid, and recommended actions; fully local, structured JSON output, ~15–25 s on CPU), Raw JSON, and a Raw Forensics view showing decoded stream content, JavaScript sources, all indicator contexts, and the complete structure dump.
File bytes never leave the server: no hash or data is sent to any external service at any point.
Results are forensic-grade: each indicator is documented with engine source, severity, and contextual explanation. File size limit: 10 MB. Threat intelligence research (MalwareBazaar corpus, HP Wolf Security telemetry, Contagio malware archive) consistently shows that real-world malicious PDFs are typically under 5 MB: exploit-kit payloads average 200 KB–1 MB, phishing lures 300 KB–4 MB, and dropper PDFs reach up to 8 MB. The 10 MB cap is twice that typical 5 MB ceiling and still covers the largest known threat class. Scanning larger files requires enterprise deployment.
After a scan, a 9-mode sanitize panel appears. Basic: Flatten to Images (PyMuPDF raster rebuild — maximum safety, destroys all active content) · Strip Active Content (Ghostscript -dSAFER — moderate safety, text usually retained). Advanced — Surgical Cleaning: Remove JavaScript (/JS /AA nullified, layout preserved) · Remove Embedded Files (all /EmbeddedFile attachments) · Remove XFA Forms (/XFA definitions) · Remove Rich Media (/RichMedia /Movie /Sound) · Normalize Structure (qpdf rebuild — collapses incremental updates, disables object streams, decodes filter chains) · Flatten Forms (PyMuPDF bake() renders AcroForm widgets to static content) · Strip Metadata (/Info + XMP stream). All modes produce a new file; the original is never modified.
The 44 engines
%PDF- header position (flagged if beyond byte offset 1,024), %%EOF marker count (>2 indicates incremental update stacking or exploit layering), xref table depth (>3 flagged), obfuscation codec count (ASCIIHexDecode / ASCII85Decode / LZWDecode >3 flagged on non-image streams — image XObjects are excluded since they legitimately use these codecs as standard output from PDF generators such as ReportLab and Ghostscript), and excessive filter chains (>120 /Filter entries). Proportional incremental injection: flags if the final revision adds >10 new objects compared to prior revisions — a disproportionately large final update is a strong indicator of post-signing payload injection. Collects: PDF version, linearised flag, binary comment presence.
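The header-offset and %%EOF-count thresholds above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the returned keys are hypothetical and are not taken from the scanner's code.

```python
def header_and_eof_checks(data: bytes) -> dict:
    """Apply two of the structural thresholds described above to raw PDF bytes."""
    header_offset = data.find(b"%PDF-")  # -1 when no header is present
    eof_count = data.count(b"%%EOF")
    return {
        "header_offset": header_offset,
        "header_missing": header_offset == -1,
        "header_late": header_offset > 1024,  # header beyond byte offset 1,024
        "eof_count": eof_count,
        "eof_stacking": eof_count > 2,        # more than two %%EOF markers
    }
```

A file padded with junk before the header, or one carrying three or more %%EOF markers, would trip the corresponding flag.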
/JavaScript, /JS, /Launch, /OpenAction, /AA — remote & form actions: /GoToR, /SubmitForm, /ImportData, /Rendition, /Hide — embedded & rich content: /EmbeddedFile, /RichMedia, /XFA, /AcroForm — obfuscation: /ObjStm, /JBIG2Decode, /ASCIIHexDecode — dangerous JS APIs: unescape(), eval(), String.fromCharCode, collab.getIcon (CVE-2009-0927), util.printf (CVE-2008-2992), media.newPlayer (CVE-2009-4324), Collab.collectEmailInfo (CVE-2007-5659) — shellcode: %u9090 (Unicode NOP sled), %u4141, %u0c0c%u0c0c heap-fill patterns. Evasion patterns: /Trans with JavaScript (page-transition trigger used to execute JS while evading action-based detection); /OpenAction hidden inside an AcroForm /DR indirect reference (indirect variant bypasses naive dictionary-key scanners). Each match records a context snippet (20 bytes before, 60 bytes after) for the Threats tab.
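As an illustration of the context snippet recorded for each match, the sketch below collects 20 bytes before and 60 bytes after every occurrence of a keyword. It assumes the window is measured from the start and end of the match; the function name and result keys are hypothetical.

```python
import re

def find_with_context(data: bytes, pattern: bytes, before: int = 20, after: int = 60) -> list:
    """Return every literal match of `pattern` with its surrounding bytes."""
    hits = []
    for m in re.finditer(re.escape(pattern), data):
        start = max(0, m.start() - before)       # clamp at the start of the file
        end = min(len(data), m.end() + after)    # clamp at the end of the file
        hits.append({"offset": m.start(), "context": data[start:end]})
    return hits
```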
doc.xref_stream(xref) — catching JavaScript and shellcode hidden inside compressed objects that raw-byte scanners miss entirely. Calculates entropy using 512-byte sliding windows; any window exceeding 7.6 bits/byte on non-image streams flags encrypted, packed, or obfuscated payloads (detects shellcode splices that average out in whole-stream analysis). Decompression bomb detection flags streams with >500:1 compression ratio; image XObjects (/Subtype /Image, DCT, JPX, CCITT, JBIG2) are excluded from both the entropy check and the decompression bomb check — uniform-fill or solid-colour images legitimately achieve extreme compression ratios at near-zero entropy. Scans decompressed content for 14 JS/shellcode signatures. Returns up to 40 streams with xref number, entropy, type, and matched patterns.
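The sliding-window entropy check can be sketched as follows. The 512-byte window and the 7.6 bits/byte threshold come from the description above; the 256-byte step and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())

def high_entropy_windows(data: bytes, window: int = 512, step: int = 256,
                         threshold: float = 7.6) -> list:
    """Return (offset, entropy) for every window whose entropy exceeds the threshold."""
    flagged = []
    for start in range(0, max(len(data) - window, 0) + 1, step):
        h = shannon_entropy(data[start:start + window])
        if h > threshold:
            flagged.append((start, h))
    return flagged
```

Because each window is scored on its own, a short high-entropy splice inside an otherwise low-entropy stream is still reported, even though the whole-stream average would stay below the threshold.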
data: URI schemes (data:text/html, data:application/*) that deliver payloads without network requests, bypassing URL-reputation filters. Also detects hex-encoded URLs in JavaScript (\x68\x74\x74\x70 = "http") used to hide C2 addresses from static scanners.
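A minimal sketch of the hex-escape decoding step, assuming the JavaScript source is already available as text. It only looks at runs of four or more \xNN escapes and keeps those that decode to something starting with "http"; the names are hypothetical and the real engine may apply broader rules.

```python
import re

# Four or more consecutive \xNN escapes, e.g. \x68\x74\x74\x70
HEX_ESCAPE_RUN = re.compile(r"(?:\\x[0-9A-Fa-f]{2}){4,}")

def decode_hex_escaped_urls(js_source: str) -> list:
    """Decode \\xNN runs in JavaScript text and return those that look like URLs."""
    found = []
    for m in HEX_ESCAPE_RUN.finditer(js_source):
        decoded = bytes.fromhex(m.group(0).replace("\\x", "")).decode("latin-1")
        if decoded.lower().startswith("http"):
            found.append(decoded)
    return found
```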
CreationDate and ModDate indicates scripted, automated document generation — a common characteristic of malware factory pipelines.
Engine 7 (Font Analyzer): Unusual font names, encoding flags, and embedding status. Font objects are a common exploit carrier — malformed font tables trigger heap corruption in viewer rendering engines (e.g. CVE-2010-2883, Type1C font vulnerabilities). JBIG2 exploit detection follows indirect /FontFile* references: some exploits store the JBIG2-filtered stream on a separate object pointed to by the font dict rather than embedding the filter directly, and both forms are caught.
Engine 8 (CVE Pattern Matcher): Byte-level CVE signature matching — known exploit patterns for CVE-2009-0658 (JBIG2), CVE-2009-4324 (/OpenAction JS), CVE-2010-2883 (font), and other historically weaponised PDF CVEs.
Engine 9 (Structural Statistics): Object-to-page ratio heuristic — >50 objects per page is anomalous and flags potential exploit payload inflation. Zero-page detection: a PDF with 0 pages is a pure exploit payload with no legitimate document content (critical severity).
.yar rule files are loaded from a configured rules directory. Rules cover patterns not caught by byte-string matching alone.
/ObjStm), reconstructs the internal object graph, and analyses suspicious cross-references, duplicate object definitions, and object version stacking. Supplemented by pikepdf (a modern libqpdf-based Python parser) which independently extracts the JavaScript Names tree, counts embedded file attachments, detects per-page /AA triggers, and provides a second independent indicator set. Crash/timeout behaviour of each parser is tracked separately.
unshare --net --pid --mount with all syscalls captured by strace. The network namespace makes any connect() or sendto() syscall definitively malicious — there is no legitimate reason for a PDF renderer to initiate network contact in an isolated namespace. Detects: outbound C2 beacons, anonymous executable memory mappings (shellcode staging), unauthorised process spawning (code execution), filesystem escape attempts, DNS lookups, and fork-bomb patterns. PDFium (Playwright/Chromium) covers the Chrome browser attack surface — where most users now open PDFs. pdf.js/Node covers the Firefox/Mozilla rendering engine. LibreOffice Draw exposes OLE macro and embedded content paths. When all renderers complete without triggering, a confirmed clean result is explicitly surfaced so analysts know the sandbox ran successfully.
clamdscan daemon. The clamav user is a member of the www-data group so the daemon reads upload files directly — no --fdpass needed, no fallback to the slow single-process scanner. The only engine that makes external calls — and only for signature database updates via clamav.net, never for file analysis.
mutool), Poppler (pdfinfo/pdfdetach), Ghostscript, qpdf, pdfminer, and Node.js pdf.js across 8 structural dimensions simultaneously: page count, object count, JavaScript presence, PDF version, encryption status, AcroForm presence, embedded file count, and OpenAction. Seven distinct discrepancy checks (Critical/High/Medium) flag hidden objects, shadow object trees, or deliberate parser-confusion exploits. Page delta scoring is weighted by magnitude (up to +70/critical for >50 page delta). A hard 30-second SIGALRM wraps the engine; pdfminer runs in a subprocess with timeout 6 for guaranteed hard-kill.
\x00asm), and Python bytecode. Also performs mid-stream scanning at non-zero offsets to catch payloads prefixed by junk bytes. JAR files are detected via ZIP + META-INF/MANIFEST.MF. Detects polyglot files that embed executable droppers inside a valid PDF container.
/JS literals and keyword-bearing compressed streams (with Unicode \uXXXX pre-processing) is parsed into an AST via Acorn (Node.js, ECMAScript 2022) and walked for obfuscation constructs invisible to pattern-matching: eval() chains, String.fromCharCode() arrays (shellcode staging), unescape() decode pipelines, large numeric arrays (heap spray), new Function() dynamic construction, atob()/btoa() base64 decode chains, and property accessor obfuscation — including the split-string concatenation technique (window["ev"+"al"]) used to evade static keyword detection. Performs 6 iterative deobfuscation passes (each pass feeds its output into the next) to unravel multi-layer obfuscation chains. Also detects anti-sandbox patterns (app.platform, screen.width, navigator.*) and executes multi-stage eval chains in a Node.js VM sandbox to decode obfuscated payloads statically hidden from pattern-matching.
/JavaScript, /Launch, /OpenAction, /EmbeddedFile) added post-signing. A signed-then-modified document with active content is critical.
SubmitForm action + password-type field detection; QR code extraction and decoding via zbarimg with suspicious domain scoring. High urgency phrase density combined with brand impersonation scores as high-confidence phishing.
pdfdetach (Poppler) to extract every embedded file attachment. Inspects each for magic bytes: Windows PE (MZ), Linux ELF (\x7fELF), OLE/CFBF (\xd0\xcf), OOXML archives, script files (.bat, .ps1, .vbs, .sh), RAR, 7-Zip. Detects VBA macros in OOXML Office attachments (vbaProject.bin). Non-OOXML ZIP archives have their full contents listed (up to 50 entries) and are scanned for dangerous files (.exe, .dll, .ps1, .vbs, etc.) — flagged Critical when dropper files are present. Nested PDFs (embedded PDF documents) are detected and flagged — nested PDFs can carry independent malicious payloads processed outside outer-document defences. PowerShell .ps1 content analysis: embedded scripts are scanned for high-risk patterns including Invoke-Expression, DownloadString, and -ExecutionPolicy Bypass — all common stager and downloader primitives. Extracts readable strings from executables to surface suspicious API calls or IP addresses. A PDF carrying a PE executable is a confirmed dropper — scored critical.
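Magic-byte classification of an extracted attachment can be sketched like this. The signatures are the standard file-format magic numbers; the table layout, labels, and function name are illustrative assumptions rather than the scanner's own.

```python
# (magic prefix, label) pairs, checked in order
MAGIC_SIGNATURES = [
    (b"MZ", "Windows PE executable"),
    (b"\x7fELF", "Linux ELF executable"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE/CFBF compound file"),
    (b"PK\x03\x04", "ZIP archive (including OOXML)"),
    (b"Rar!\x1a\x07", "RAR archive"),
    (b"7z\xbc\xaf\x27\x1c", "7-Zip archive"),
    (b"%PDF-", "nested PDF"),
]

def classify_attachment(data: bytes) -> str:
    """Classify attachment bytes by their leading magic number."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Classifying by content rather than by file extension means a PE executable renamed to invoice.txt is still recognised.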
/A and /AA dictionaries — JS fires on focus, blur, keystroke, validate, or calculate events, invisible during static review but executing in any Acrobat-compatible viewer); hidden NoExport fields (present in submitted data but not displayed to the user); password-type fields (credential harvesting indicators); SubmitForm exfiltration targets — the URL(s) to which all form field data is POSTed; /AA additional-action JS triggers on field objects (a secondary execution vector independent of /OpenAction); and calculation order (/CO) exploitation — adversaries reorder field calculations to chain JS evaluations across fields, enabling multi-step payload staging hidden entirely within form arithmetic. SubmitForm target URLs are scanned and flagged for external HTTP destinations. Results feed into the Correlation Engine.
%%EOF boundary and extracts per-revision metadata: author, producer, modification date, and new/modified/deleted object counts for each incremental update. Detects author identity changes between revisions, execution vectors injected (/JavaScript, /Launch, /EmbeddedFile, /OpenAction) after the original document was created, and large late-stage object injections in the final revision — the structural signature of automated exploit staging. Injection depth (revision number) is recorded for each vector. Results feed into the Correlation Engine.
/Annot object across all pages and forensically analyses each action dictionary. Detects dangerous URI schemes (javascript:, data:, file://, vbscript:); JavaScript action triggers on annotation interaction; /Launch actions that spawn arbitrary programs; GoToR remote links that open external files; and SubmitForm actions that exfiltrate form data to external servers. Also inspects the /T (author/title) field of every annotation for XSS payloads — matching the CVE-2025-70401 attack vector in which PDF viewers pass the annotation author string through DOM reconciliation into innerHTML without sanitisation, executing injected scripts on every component re-render. Checks 15 patterns including <script>, onerror=, <svg><foreignObject> (the bypass used in the disclosed PoC), javascript:, and percent/unicode-encoded variants; handles both literal and hex-encoded /T values. Annotation-borne payloads are completely invisible to scanners that only analyse raw bytes or page content streams. Results feed into the Correlation Engine.
/Names /JavaScript subtree — persistent JS objects callable by name from any action); /AA Additional Actions count (event-driven triggers on page open/close, print, save, field events); /OpenAction type classification (JavaScript, Launch, GoToR, URI, GoTo); DocMDP modification prevention signatures that lock out sanitizers; /Perms cryptographic permission restrictions; and UR3 usage-rights signatures used to exploit extended viewer features. Results feed into the Correlation Engine.
exec (dynamic code execution), run (file execution — detected only as the sequence ") run", i.e. run immediately following the closing parenthesis of an explicit filename string argument, which avoids false positives from the English word "run" appearing in page content), token (string-to-code eval), setpagedevice (PostScript-to-system passthrough — bridges to the PostScript interpreter from PDF context), def. Also detects ICC color profile abuse — malformed /ICCBased profiles of anomalous size exploit heap buffer overflows (CVE-2021-21017 class). Flags content bombs: non-image streams exceeding 5 MB that may exhaust parser memory or conceal oversized payloads (image XObjects are excluded — large raster data is expected). Results feed into the Correlation Engine.
/ObjStm stream. Scanners that only search raw bytes will miss any object inside a compressed container. This engine decompresses every /ObjStm and re-scans the decompressed content for JavaScript, /Launch actions, /EmbeddedFile references, and high-entropy payloads (entropy >7.5 bits) that suggest encrypted content hidden inside compressed object bundles. Complements the Stream Inspector (Engine 3) with object-container-specific forensics. Results feed into the Correlation Engine.
/J#61vaScript → /JavaScript) and checks decoded names against a dangerous-keyword list: JavaScript, Launch, OpenAction, EmbeddedFile, AA, URI, SubmitForm, ImportData, GoToR, RichMedia, and others. Counts total hex-encoded name tokens, dangerous-keyword obfuscations, and unique obfuscated forms. Also detects whitespace-split keyword injection — byte sequences like /Java\nscript or /Lau\tch in the raw byte stream that evade simple string scanners; detection requires at least one actual whitespace character inside the keyword (a zero-width match would flag every normal /JavaScript token). Scans outside compressed stream bodies for formfeed byte injection (0x0C) and null bytes in the PDF header region — stream bodies are excluded since FlateDecode binary data naturally contains these bytes; both are classic evasion markers when found in the structural token layer. Excessive hex-encoded name tokens are flagged at a threshold of >500 tokens (benign PDF generators such as ReportLab routinely hex-encode colour names and resource keys, so only counts far beyond normal generator output are reported, at low severity). Every obfuscated dangerous keyword triggers a Critical indicator. Results feed into the Correlation Engine.
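The #XX name-decoding step can be sketched as below. The keyword set is the subset named above (the full list is longer), and the regular expressions and function names are assumptions made for the example.

```python
import re

DANGEROUS_NAMES = {
    b"JavaScript", b"Launch", b"OpenAction", b"EmbeddedFile", b"AA",
    b"URI", b"SubmitForm", b"ImportData", b"GoToR", b"RichMedia",
}

# A name token: "/" followed by characters that are neither whitespace nor PDF delimiters
NAME_TOKEN = re.compile(rb"/([^\s/\[\]<>(){}%]+)")
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")

def decode_name(raw: bytes) -> bytes:
    """Replace every #XX escape in a PDF name with the byte it encodes."""
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

def obfuscated_dangerous_names(data: bytes) -> list:
    """Return (raw, decoded) pairs for hex-encoded names that decode to a dangerous keyword."""
    hits = []
    for m in NAME_TOKEN.finditer(data):
        raw = m.group(1)
        if b"#" in raw:
            decoded = decode_name(raw)
            if decoded in DANGEROUS_NAMES:
                hits.append((raw, decoded))
    return hits
```

Only names that actually contain a # escape are reported, so an ordinary /JavaScript token does not count as obfuscation in this sketch.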
/Next action pointer is followed to map the full execution sequence. Detects circular action cycles (infinite loops); deep chains exceeding 10 hops (overflows parser stack depth in hardened viewers); high fan-in nodes — single action objects referenced from many triggers simultaneously (covert shared-execution points); and sleeper nodes — actions present in the graph but unreachable from the nominal entry points, planted for deferred detonation via a separate trigger. The graph is serialised and available for raw forensic inspection. Results feed into the Correlation Engine.
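Cycle and depth detection over /Next pointers can be illustrated with a simple walk. The sketch assumes each action has at most one /Next target, modelled as a plain dictionary from action id to next action id; real PDFs also allow an array of next actions, which this example does not handle.

```python
def walk_action_chain(next_of: dict, start, max_hops: int = 10) -> dict:
    """Follow /Next pointers from `start`, reporting cycles and over-long chains."""
    path = []
    node = start
    cycle = False
    while node is not None:
        if node in path:      # revisiting an action means the chain loops
            cycle = True
            break
        path.append(node)
        node = next_of.get(node)
    hops = len(path) - 1
    return {"path": path, "cycle": cycle, "too_deep": hops > max_hops}
```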
/OCG) layer defined in the /OCProperties dictionary. Detects layers configured as never-visible (display-state forced off in all circumstances) — a technique for hiding malicious content from visual review; screen/print divergence (content visible on screen but suppressed in print, or vice versa — used in watermarking and DLP-evasion attacks); and hidden clickable links inside invisible layers, which are fully interactive in Acrobat despite being visually absent. Results feed into the Correlation Engine.
Tr (text rendering mode) operator — PyMuPDF's span flags field encodes font flags (bold/italic/serif) rather than the rendering mode and cannot be used for this purpose. Flags homograph domains using Cyrillic/Greek/Armenian lookalike characters (confusable with ASCII). Results feed into the Correlation Engine.
/Prev byte-offset pointers without relying on any PDF library's repair logic. For each trailer, records the /ID array pair, the /Root reference, and the /Prev offset, building a chronological chain of all incremental updates. Detects Document ID mutation across updates (both entries of the /ID array should be stable after creation — mutation is a structural anomaly); /Root reference swaps between trailer versions (the Shadow Document Attack — a signed PDF whose signed version and visible version have different catalog roots); and malformed /Prev pointers that would confuse incremental-update-aware parsers. Results feed into the Correlation Engine.
Columns and Rows against the stream length — out-of-bounds values trigger heap overflows in multiple decoders. JBIG2Decode: checks for a /JBIG2Globals reference (required for CVE-2009-0658 / Pwn2Own 2009 Adobe Reader exploit). DCTDecode: validates that the declared stream length is plausible for the claimed image dimensions. Multi-filter chains: flags streams using 3+ stacked decoders (a classic technique to slow forensic analysis and trigger parser differential vulnerabilities — each decoder in the chain may parse the preceding output differently). Results feed into the Correlation Engine.
%%EOF marker (invisible to all structure-respecting parsers); entropy cliffs — sudden sharp transitions between low-entropy and high-entropy regions that indicate injection boundaries; header entropy anomalies — unexpected compression or encryption in the first 256 bytes of the file; and under-entropy in compressed streams — near-zero entropy (<1.5 bits) in a compressed region that should be random (consistent with a decompression bomb). Image XObjects (/Subtype /Image) are excluded from the under-entropy check — solid-colour or uniform-fill images produce near-zero entropy in their compressed stream by design and are not suspicious. Uses the PDF's object offset table to partition the entropy map into structural regions (header, objects, streams, trailer, post-EOF). Results feed into the Correlation Engine.
pdfaid:conformance and pdfaid:part XMP metadata) and, if so, validates that the document actually conforms to the declared standard. PDF/A forbids JavaScript, embedded executables, non-embedded fonts, encryption, and external references — all of which are attack vectors. Detecting a PDF that claims PDF/A but contains active content is a reliable indicator of a document engineered to bypass DLP systems and email gateways that whitelist PDF/A. Also checks for conformance level mismatch (e.g. claiming PDF/A-1a but using features only in PDF/A-2). Results feed into the Correlation Engine.
vm context with a full stub of the Acrobat JavaScript API — app, this, event, util, console, Doc, Field, and others. Intercepts and records all calls to dangerous methods: app.launchURL(), this.submitForm(), app.openDoc(), app.execMenuItem(), util.printd(). Detects obfuscated eval() and string-concatenation assembly of dangerous payloads at runtime. Records the full call log: function name, argument list, and execution timestamp. This engine catches JavaScript payloads that static AST analysis (Engine 16) cannot — obfuscated strings that are only assembled and evaluated at runtime. Results feed into the Correlation Engine.
seac (seac/accented-character — calls two other glyphs by name, enabling recursive execution that overflows the call stack in vulnerable renderers, used in exploits targeting Adobe Reader ≤9); excessive stack depth (CharString programs that push ≥200 values onto the stack, triggering stack exhaustion in strict interpreters); and abnormal subroutine depth (recursion deeper than 10 levels in the subr/globalsubr call chain). Flags obfuscated font binaries with unusually high entropy in the eexec-encrypted region. Results feed into the Correlation Engine.
/XRef objects, PDF 1.5+). Cross-references every declared object against actual byte positions in the file. Detects phantom objects — entries in the XRef table that point to byte offsets with no valid object header; orphan sleepers — objects present at valid byte offsets but absent from every XRef table (reachable only through raw parsing, not through standard readers); free-entry exploitation — free-list entries (f type) whose generation numbers deviate from standard increments (a technique for hiding objects that become reachable after a use-after-free in the parser); and object length fraud — stream objects whose declared /Length diverges from the actual byte count between stream markers. Reachability BFS starts from doc.pdf_catalog() — the authoritative PDF Catalog xref returned by the parser — rather than assuming OID 1 is always the root (which produces large false-positive orphan lists in non-standard PDFs). Orphaned Action objects are classified by subtype: execution subtypes (JavaScript, Launch, GoToR, ImportData, SubmitForm, GoToE) are flagged as dangerous; navigational subtypes (URI, GoTo, Named, Sound, Movie) are treated as benign and not flagged. Results feed into the Correlation Engine.
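The reachability check can be sketched as a breadth-first search over an object reference graph. Here the graph is a plain dictionary from object id to the ids it references, which is a simplification; the function name is hypothetical.

```python
from collections import deque

def find_orphans(references: dict, catalog: int) -> set:
    """Return ids of objects that cannot be reached from the catalog object."""
    reachable = {catalog}
    queue = deque([catalog])
    while queue:
        current = queue.popleft()
        for target in references.get(current, ()):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    return set(references) - reachable
```

Starting the search from the wrong root shows why the catalog matters: with the catalog at object 5, a search from object 1 would report the catalog itself and everything only it references as orphans.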
/OpenAction + high entropy = +100 bonus; JavaScript + /Launch = +75 bonus. Cross-engine: YARA heap-spray + JS, PeePDF vuln + JS, qpdf structural damage + active content, ExifTool exploit-kit fingerprint + execution. Dynamic sandbox: live network beacon + JS, runtime shellcode + heap spray, dynamic shell spawn + trigger. Form patterns: AcroForm JS field + SubmitForm exfiltration target, /AA keystroke trigger + credential field, calc-order chain + JS payload. New-engine patterns: token obfuscation + JS keyword, annotation JS trigger + auto-exec, post-signature revision injection + execution vector, object stream concealment + active content, named JS registry + OpenAction, DocMDP bypass + content modification, XFA exec + auto-fire, action cycle + JS node, OCG hidden link + JS, trailer /Root swap + execution, codec OOB + active content, post-EOF entropy + execution, steganography + exfiltration target, PDF/A claim fraud + active content, JS emulation live call + obfuscated eval, font seac OOB + JS, XRef phantom object + orphan sleeper. TI + sandbox + YARA triple-confirmation. TI domain match + active content: a domain from the PDF's links matching threat intelligence databases combined with JavaScript or auto-execute content raises a high-confidence combined indicator. 60+ compound patterns. Multi-engine JS confirmation bonus: when 3+ independent engines confirm JavaScript presence, score is amplified. Final score capped at 999.
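A compound-pattern bonus can be modelled as a set of findings that must all be present. The sketch below uses only the two bonuses and the 999 cap stated above; the finding labels, table structure, and function name are assumptions for illustration.

```python
# Each entry: (findings that must all be present, bonus points)
COMPOUND_BONUSES = [
    (frozenset({"openaction", "high_entropy"}), 100),
    (frozenset({"javascript", "launch"}), 75),
]
SCORE_CAP = 999

def apply_correlation(base_score: int, findings: set) -> int:
    """Add the bonus for every compound pattern fully contained in `findings`, then cap."""
    bonus = sum(points for combo, points in COMPOUND_BONUSES if combo <= findings)
    return min(base_score + bonus, SCORE_CAP)
```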
Risk scoring
Each indicator contributes base points multiplied by min(occurrence_count, 3) — capped at 3 occurrences per finding type to prevent artificial inflation from a single pattern appearing many times. The Correlation Engine adds weighted bonus points on top for dangerous combinations.
| Risk level | Base points per occurrence |
|---|---|
| Critical | 50 |
| High | 25 |
| Medium | 10 |
| Low | 3 |
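The base-score rule can be written out directly from the table. This is a sketch of the arithmetic only, before any Correlation Engine bonus; the function names and the (severity, count) input shape are hypothetical.

```python
BASE_POINTS = {"critical": 50, "high": 25, "medium": 10, "low": 3}
MAX_COUNTED_OCCURRENCES = 3  # occurrences beyond the third add nothing

def indicator_points(severity: str, occurrence_count: int) -> int:
    """Base points for one finding type: base x min(occurrence_count, 3)."""
    return BASE_POINTS[severity] * min(occurrence_count, MAX_COUNTED_OCCURRENCES)

def base_score(indicators: list) -> int:
    """Sum the points of (severity, occurrence_count) pairs."""
    return sum(indicator_points(severity, count) for severity, count in indicators)
```

For example, a High indicator seen ten times contributes 25 × 3 = 75 points, the same as one seen three times.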
Forensic Console
During the scan a live terminal-style event log streams timestamped events to the browser — upload confirmation, per-engine START/DONE lines, and the final risk verdict. Section dividers separate Upload, Engines, and Results phases. The console can be collapsed or cleared without affecting the scan.
Result banner and risk levels
When all 44 engines complete, a full-width banner appears at the top of the Summary tab showing the risk level, a text explanation, and a score meter bar (0–999):
Statistics grid — 15 fields
Below the banner a 15-cell grid shows key structural stats at a glance. Three cells are clickable and jump directly to the relevant tab. Cells turn red when values exceed safe thresholds:
Scan report — 24 tabs
Results are rendered across 24 tabs. Each tab is independently navigable. Dynamic badges on several tabs update live (threat count, ML %, MITRE technique count, phishing signal score, embedded file count).
Sanitize panel
After every scan (including clean results) a 9-mode sanitize panel appears below the result. Selecting a method sends the session token to the server, produces a new file, and reveals a Download Sanitized PDF button and a Scan the Sanitized File button to re-run the full 44-engine scan on the cleaned output. The original file is never modified.
ML data policy
The ML engine stores a 38-dimensional feature vector per scan (structural statistics: byte counts, entropy values, object type flags, parser discrepancy counts, sandbox syscall anomalies). No file content, no filename, no hash, no IP address, and no PII is stored. Feature vectors are used to retrain the IsolationForest, RandomForest, and LightGBM models every 30 minutes. Model drift detection reports if models have not been retrained in >30 days. Retained indefinitely — not subject to GDPR Article 17 as no personal data is involved. Full details on the Security page.
Standard mode — AES-256-CBC
The password is transmitted over TLS, used by Ghostscript to encrypt the file with AES-256-CBC, and never stored. Granular permission flags are configurable: print, copy, modify, annotate, form fill, accessibility, and assembly.
PQC mode — client-side quantum-safe encryption
In PQC mode the encryption happens in your browser before the file is uploaded.
Key generation uses @noble/post-quantum — a local JavaScript library.
The server receives only the encrypted .pqcpdf bundle.
Your plaintext file never crosses the network unencrypted.
Available algorithms (31 total, 29 quantum-resistant)
Organised by category. NIST = NIST-standardised primitive.
@noble/post-quantum.
Signature modes
Visual placement controls (Draw / Type / Upload modes)
First / last / all / custom page selector. Two placement modes:
- Snap grid — 3×3 position grid (left/center/right × top/middle/bottom) for one-click alignment.
- Free placement — drag the signature to any position on the page. Coordinates are transmitted as fractional page offsets (pos_x_pct, pos_y_pct, range 0.0–1.0) and applied with sub-point precision regardless of page dimensions.
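Converting the fractional offsets into page coordinates is a simple multiplication. The sketch below assumes the offsets describe the signature's anchor point measured from the page's top-left corner, with the page size given in points; the function name is hypothetical.

```python
def placement_in_points(pos_x_pct: float, pos_y_pct: float,
                        page_width_pt: float, page_height_pt: float) -> tuple:
    """Convert fractional page offsets (0.0 to 1.0) into coordinates in points."""
    if not (0.0 <= pos_x_pct <= 1.0 and 0.0 <= pos_y_pct <= 1.0):
        raise ValueError("fractional offsets must be between 0.0 and 1.0")
    return (pos_x_pct * page_width_pt, pos_y_pct * page_height_pt)
```

Because the offsets are fractions of the page, the same pair lands at the same relative position on a US Letter page (612 × 792 pt) and on an A4 page.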
A size slider (40–300 pt). Live placement preview composites the signature image onto a rendered page 1 canvas in real time as position and size are adjusted.
Date stamp — an optional date string (up to 30 characters) can be rendered in small text directly below the signature image. Accepts any alphanumeric format, separators, and common date punctuation.
Certificate options
All modes embed a cryptographic digital signature. Certificate source is either
an auto-generated ephemeral RSA-2048 self-signed certificate (created per-request,
never stored) or a user-supplied .p12 / .pfx file.
Signer name (required), email, reason, and location metadata are embedded in the
CMS/PKCS#7 signature block.
/tools/pades.php 301-redirects to
/tools/sign.php?tab=pades — existing links and bookmarks continue
to work.
Workflow
The initiator uploads a PDF, adds up to 10 signers (name + optional email),
and chooses a signing order. The server creates an ephemeral workspace
(/tmp/esign_{32hex}/, mode 0700) and generates a unique
256-bit secure token per signer. Each token produces a signing URL that
can be shared directly — no account is required on either side. The
initiator's tracking page polls status every 5 seconds and provides
a download link once all signers have completed.
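Workspace and token creation along these lines could look like the sketch below. It uses Python's secrets module purely for illustration; the function name, the base_dir parameter, and the list-of-pairs return shape are assumptions, and the actual server code may differ.

```python
import os
import secrets

MAX_SIGNERS = 10

def create_workspace(base_dir: str, signer_names: list) -> tuple:
    """Create an esign_{32hex} directory (mode 0700) and one 256-bit token per signer."""
    if not 1 <= len(signer_names) <= MAX_SIGNERS:
        raise ValueError("a workflow needs between 1 and 10 signers")
    workspace = os.path.join(base_dir, "esign_" + secrets.token_hex(16))  # 16 bytes = 32 hex chars
    os.mkdir(workspace, mode=0o700)
    # 32 random bytes = 256 bits, rendered as 64 hex characters
    tokens = [(name, secrets.token_hex(32)) for name in signer_names]
    return workspace, tokens
```

Tokens are kept as a list of pairs rather than a dictionary so that two signers with the same name still receive distinct tokens.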
Signing order
Signature placement
Each signer sees a page-1 thumbnail and can place their signature using
the same three input modes as the solo Sign PDF tool (draw canvas, typed
name, uploaded image). Placement supports the full 3×3 snap grid
and free drag-and-drop positioning via fractional page
coordinates (pos_x_pct, pos_y_pct).
An optional date stamp can be rendered below the signature image.
Cryptographic enforcement — require_crypto
When the document creator enables require_crypto at creation
time, signers who attempt to submit without enabling the PAdES-B
cryptographic layer receive an error response:
"A PAdES-B cryptographic signature is required for this document."
This lets initiators mandate that every signature in the workflow is
cryptographically verifiable in Adobe Reader's Signatures panel — not just
a visual stamp. The certificate source is the signer's own
.p12/.pfx or an auto-generated ephemeral
RSA-2048 self-signed certificate created per request and never stored.
Workflow management (from the tracking page)
- Add signer — append a new signer to an in-progress workflow; a fresh token and signing URL are generated immediately.
- Remove signer — remove a signer who has not yet signed; their token is invalidated.
- Cancel request — terminate the entire workflow; all tokens are invalidated and the workspace is scheduled for cleanup.
- Return URL / copy link — the initiator can copy a resume link to return to the tracking page from any device.
Storage & retention
All state is stored in the ephemeral temp directory — no database writes, no cloud storage. The workspace has a 24-hour TTL; expired workspaces are purged when the TTL lapses and again during the cleanup pass that runs whenever a new workflow is created. The final signed PDF is never stored beyond the TTL window. Zero retention applies to the e-sign workflow exactly as it does to all other tools.
Watermark renders directly to the PDF content stream via PyMuPDF — not as a separate annotation layer. The text is permanently embedded; it cannot be removed by deleting an annotation. A live canvas preview composites your watermark text over page 1 in real time as you adjust any setting.
Placement positions (8)
Style controls
#cccccc neutral grey, #ff0000 red for CONFIDENTIAL, #0000ff blue for DRAFT.
Page targeting
Apply to All pages, Odd pages (recto-only in duplex documents), Even pages, or a custom range (comma-separated, e.g. 1-3, 5, 8-10).
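A custom range such as 1-3, 5, 8-10 can be expanded into page numbers as sketched below. The function name and the choice to reject out-of-bounds ranges with an error (rather than clamping them) are assumptions for the example.

```python
def parse_page_range(spec: str, page_count: int) -> list:
    """Expand a comma-separated range like '1-3, 5, 8-10' into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

Overlapping entries are merged, so "1-3, 2-4" yields pages 1 to 4 once each.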
Redaction is not the same as drawing a black box over text. A black rectangle drawn on top of text leaves the original text in the PDF file — it can be selected, copied, and searched by anyone who removes or moves the rectangle. Genuine redaction removes the underlying content from the PDF's data structures. This tool uses PyMuPDF's native redaction API, which permanently erases content at the structural level.
page.add_redact_annot() marks regions, then page.apply_redactions() removes the content from the page's content streams — text, images, and vector graphics within the region are erased, not covered.
Mode 1 — Text pattern redaction
Enter search patterns and the tool finds every matching text occurrence across the document and permanently removes it.
Mode 2 — Canvas region redaction
Draw rectangular redaction areas directly on a rendered preview of each PDF page.
Fill colour and page targeting
Black fill (standard) produces the visible redaction box. White fill is invisible on white backgrounds — useful when removing content without leaving a visible mark, such as stripping header metadata. Page targeting: All pages, Odd, Even, or a custom range.
11 tools for editing, filling, comparing, reading, and inspecting PDF documents. From a full visual editor to a font-embedding checker to table extraction.
Full page-by-page visual editor: 16 annotation tools, an interactive AcroForm builder, and a bookmark editor. All edits are permanently flattened server-side. See the deep dive below.
Detect and fill all interactive AcroForm fields — text inputs, checkboxes, radio buttons, dropdowns, and list boxes. Values are written server-side via PyMuPDF. Optional flatten-after-fill bakes values into static content.
Visual pixel-level diff of two PDFs. Configurable DPI (72–300) and sensitivity. Side-by-side previews render immediately when files are selected. Output is a highlighted diff PDF with change regions marked. Includes 🤖 AI Change Analysis: Qwen 2.5 1.5B returns a change significance rating (MAJOR/MODERATE/MINOR/NONE), a change type, a plain-English change_summary, a details array (per-change breakdown), and a recommendation.
Export all text to .txt with optional layout preservation,
text encoding selection, and custom page range.
Includes 🤖 AI Document Analysis: Qwen 2.5 1.5B classifies document type (13 categories) with classification_confidence, and reports language, key entities (people, organisations, locations, dates, amounts), topics, and reading level.
Full metadata inspection: title, author, subject, keywords, creator, producer, page count, dimensions, PDF version, encryption status, form type, tagged flag, fast web view, permission flags, and creation/modification dates. Shows a canvas preview of page 1 alongside the data.
Optical character recognition for scanned and image-based PDFs via Tesseract 5 LSTM. Three output formats, DPI control, four page segmentation modes, up to 100 pages per job. Returns OCR confidence score, word count, character count, and a live text preview tab. See the deep dive.
Load a PDF's existing table of contents. Add, rename, reorder,
delete, and set the level (1–4) of each entry. Each row has a page-number input
validated against the actual page count. Reads and writes via PyMuPDF
get_toc() / set_toc().
WCAG 2.1 / PDF/UA compliance audit via PyMuPDF. 8 checks: document title (2.4.2), language metadata (3.1.1), tagged structure (PDF/UA §7.1), image alt-text (1.1.1), reading order (1.3.2), font embedding (PDF/UA §7.21), bookmark navigation (2.4.5), and page-size consistency. Returns pass/fail with WCAG criterion references and overall A–F grade.
Lists every font across every page: name, type (Type1, TrueType,
CIDFont, etc.), encoding, embedded status, subset flag (presence of +
prefix in BaseFont name), and the pages each font appears on. Non-embedded fonts
are flagged in red (critical for print and PDF/UA compliance).
Comprehensive colour audit across all PDF content — raster images,
vector paths, shapes, and text. Detects DeviceRGB, DeviceCMYK, DeviceGray, Spot, ICC, Lab,
and more. Flags overprint, transparency, and Total Ink Coverage over 300%. Ghostscript
inkcov gives structured per-page CMYK percentages.
Extracts all tables from a PDF using pdfplumber with
lines_strict strategy (explicit table borders from PDF path operators),
falling back to text-position heuristics. First row becomes column headers.
Output: {table_count, page_count, tables:[{id, page, rows, cols, headers, data}]}.
16 annotation tools
Interactive form builder
Draw AcroForm widgets directly onto the PDF canvas. Supported field types:
Each field has configurable field name, tooltip, required/read-only flags, font size, and text colour. Fields are written as native AcroForm annotations.
Additional features
set_toc().
How edits are committed
All annotation data (positions, colours, text, field definitions) is collected client-side and sent to the server as structured JSON alongside the original PDF. PyMuPDF applies every annotation and permanently flattens them into the page content. The output PDF has no interactive annotation layer — all edits are baked in.
Engine
Tesseract 5 with the LSTM neural network engine (OEM mode 1). The LSTM engine significantly outperforms the older pattern-matching engine on low-quality scans, handwriting, and non-standard fonts.
Output formats
.txt and the searchable PDF together.
Controls
DPI: 150 / 200 / 300. Higher DPI improves accuracy on dense text but increases processing time. Page segmentation modes (PSM): auto-detect, single column, single block, and sparse text — important for forms and tables where the default auto-detect makes wrong assumptions. Custom page range: up to 100 pages per job.
What comes back
Along with the output file, the response includes an OCR confidence score (per-word Tesseract TSV confidence averaged across all pages), word count, and character count. A live text preview tab in the browser lets you read extracted text without downloading the file.
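The returned statistics can be approximated from Tesseract's TSV output, as in the sketch below. It assumes the standard TSV column names (`conf`, `text`), that rows with a confidence of -1 are layout rows rather than words, and that the character count excludes whitespace. The function name `ocr_stats` and the result keys are hypothetical.

```python
import csv
import io


def ocr_stats(tsv_text: str) -> dict:
    """Average per-word confidence plus word and character counts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confidences = []
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        # Page, block, paragraph, and line rows carry conf = -1 and no text
        if conf < 0 or not text:
            continue
        confidences.append(conf)
        words.append(text)
    mean = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "confidence": round(mean, 1),
        "word_count": len(words),
        "char_count": sum(len(word) for word in words),
    }
```

Averaging over words rather than pages means a dense page influences the score more than a nearly empty one, which is one reasonable reading of "averaged across all pages".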
Compare PDFs performs a page-by-page pixel-level diff between two documents. Because it operates on rendered pixels rather than the text layer, it works equally on text-based PDFs and scanned documents — any visual change is detected, including font substitutions, layout shifts, and image replacements that text-diff tools would miss.
Resolution
Both documents are rendered at the selected DPI before comparison. Higher DPI catches smaller visual differences but increases processing time and output file size.
Sensitivity threshold
Controls the minimum per-pixel difference required to flag a change. Lower values catch more (including compression artefacts); higher values ignore minor differences.
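To make the threshold concrete, here is a small sketch over grayscale pixel grids. It assumes both pages were already rendered to equally sized 2D lists of 0-255 values and that a pixel is flagged when its absolute difference is at least the threshold. The names `changed_mask` and `bounding_box` are illustrative, not the tool's own.

```python
from typing import Optional


def changed_mask(page_a: list, page_b: list, threshold: int) -> list:
    """Return a 2D grid of booleans: True where the pages differ enough."""
    if len(page_a) != len(page_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(page_a, page_b)
    ):
        raise ValueError("pages must be rendered at the same size")
    return [
        [abs(a - b) >= threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(page_a, page_b)
    ]


def bounding_box(mask: list) -> Optional[tuple]:
    """Smallest (x0, y0, x1, y1) box covering every flagged pixel, or None."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, flagged in enumerate(row) if flagged]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

With a low threshold, a 4-level difference caused by compression noise is flagged; raising the threshold leaves only the large change.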
Change map colour coding
Preview and output
Side-by-side canvas previews of both documents render immediately when each file is selected — no upload required for the preview. Output is a single diff PDF with change regions overlaid on every compared page pair.
On upload, the tool reads the PDF's AcroForm dictionary and generates a matching input form in the browser — one input per field, typed to match the field's widget type. Fill the form in the browser, then submit: PyMuPDF writes the values server-side and returns the filled PDF.
Supported AcroForm field types
Flatten-after-fill
When the flatten option is enabled, field values are baked into the page content stream after writing — the output PDF has no interactive form layer. The filled values appear as static text. This is the correct format for archiving, printing, or sharing a completed form — interactive fields in a shared PDF can otherwise be re-edited by any recipient.
No-fields detection
If the uploaded PDF has no AcroForm dictionary, the tool shows a "No interactive fields found" notice immediately rather than presenting an empty form. For PDFs that need form fields added, use the Edit PDF tool's form builder.
The accessibility checker audits a PDF against WCAG 2.1 and PDF/UA-1 requirements. It returns a pass/fail result for each criterion, an impact level (Critical / High / Medium), and an overall letter grade (A–F) based on a weighted score.
The 8 checks
en-US) must be set in the document's XMP metadata. Screen readers use this to select the correct speech synthesis voice.
/MarkInfo /Marked true). Tagged PDFs expose heading levels, paragraphs, lists, and tables to assistive technology. Untagged PDFs are effectively inaccessible to screen readers.
/Alt) or be marked decorative (/Artifact). Images without alt text are invisible to screen readers.
Grading
Each check carries a weight corresponding to its accessibility impact. The weighted pass rate maps to letter grades: A (all critical + high pass), B (minor failures only), down to F. The report lists each check's pass/fail status, the specific WCAG criterion, and the impact level — giving developers and document authors a clear remediation priority order.
Font Inspector
Enumerates every font used across every page. For each font the report shows:
+ prefix in BaseFont name (e.g. ABCDEF+Helvetica)
Pages — list of pages where this font appears
Why non-embedded fonts fail print: When a font is not embedded, the viewer or RIP must substitute it. Substitution changes glyph widths, reflows text, and breaks any layout that depends on exact positioning. PDF/X and PDF/UA compliance both require every font to be embedded (subset embedding is acceptable). Non-embedded fonts are flagged in red.
Subset embedding: A + prefix means only the glyphs actually used in the document are included — reducing file size while remaining fully compliant with PDF/X and PDF/UA standards.
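The subset flag can be detected with a short pattern match: a subset tag is six uppercase letters followed by a plus sign. A minimal sketch, with `split_subset_tag` as a hypothetical helper name:

```python
import re

# Subset tag: exactly six uppercase ASCII letters, then "+", e.g. "ABCDEF+Helvetica"
_SUBSET_TAG = re.compile(r"[A-Z]{6}\+")


def split_subset_tag(base_font: str) -> tuple:
    """Return (is_subset, font_name_without_tag) for a BaseFont value."""
    match = _SUBSET_TAG.match(base_font)
    if match:
        return True, base_font[match.end():]
    return False, base_font
```

For example, `split_subset_tag("ABCDEF+Helvetica")` yields `(True, "Helvetica")`, while a plain `"Helvetica"` is reported as not subset.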
Colour Inspector
Audits colour space usage for print-readiness using five detection layers — covering every type of PDF colour content:
extract_image())
Checks every embedded image's colour space via component count: 1 = DeviceGray, 3 = DeviceRGB, 4 = DeviceCMYK. RGB images are flagged — commercial presses expect CMYK, and RGB requires conversion during RIP processing, which can produce unexpected colour shifts.
get_drawings())
PyMuPDF preserves the original colour space in drawing colour tuples: 1-component = Gray, 3-component = RGB, 4-component = CMYK. Catches all filled and stroked paths, shapes, and borders.
rg/RG (DeviceRGB), k/K (DeviceCMYK), g/G (DeviceGray), and cs/CS for named colour spaces. This layer catches text colours and inline images that neither image extraction nor drawing analysis would detect.
/OP true) and transparency (/ca, /CA, /BM) flags.
inkcov device to compute per-page C/M/Y/K percentages. Calculates Total Ink Coverage (TIC = C+M+Y+K) per page and flags any page over 300% — a common press limit beyond which wet ink can cause trapping, drying, and registration problems.
The overall verdict — Print-ready (CMYK only, no RGB) or Requires conversion — is shown at the top of the report alongside the per-page breakdown and structured ink coverage table.
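Two of the rules above reduce to simple arithmetic: the component-count mapping and the Total Ink Coverage check. The sketch below assumes per-page C/M/Y/K values are already available as percentages. The function and key names are illustrative.

```python
# Component count to device colour space, as described for image extraction
COLOUR_SPACE_BY_COMPONENTS = {1: "DeviceGray", 3: "DeviceRGB", 4: "DeviceCMYK"}


def flag_ink_coverage(pages: list, limit: float = 300.0) -> list:
    """pages holds one (c, m, y, k) percentage tuple per page, in page order."""
    report = []
    for number, (c, m, y, k) in enumerate(pages, start=1):
        tic = c + m + y + k  # Total Ink Coverage = C + M + Y + K
        report.append({
            "page": number,
            "tic": round(tic, 2),
            "over_limit": tic > limit,
        })
    return report
```

A page at 80/70/70/90 sums to 310% and is flagged; one at 60/50/40/100 sums to 250% and passes.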
PDFs do not store tables as data structures — they store text characters at absolute positions and path objects that may or may not form visible borders. pdfplumber reconstructs table structure from these primitives using two strategies in sequence.
Detection strategies
l, re commands in the content stream). If the PDF was generated from software that draws explicit table borders — Word, Excel, LibreOffice, InDesign — this strategy reliably reconstructs cell boundaries. Applied first; if no tables are found, the fallback runs.
JSON output schema
Output is a single .json file. The first row of each detected table is treated as column headers; subsequent rows become an array of objects keyed by those headers. Multiple tables per page are each represented as separate entries.
table_count — total number of tables found
page_count — total pages in the document
tables[].id — sequential table number
tables[].page — page the table appears on
tables[].rows / .cols — dimensions
tables[].headers — array of column header strings
tables[].data — array of row objects keyed by header
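The schema above can be assembled from raw rows as in this sketch. It assumes each detected table arrives as a list of rows (lists of cell values) paired with its page number, and that `rows` counts data rows excluding the header row; the source does not say which convention is used. `build_table_report` is a hypothetical name.

```python
import json


def build_table_report(found: list, page_count: int) -> dict:
    """found holds (page_number, rows) pairs; the first row of each is the header."""
    tables = []
    for page, rows in found:
        if not rows:
            continue
        headers = ["" if cell is None else str(cell).strip() for cell in rows[0]]
        # Note: duplicate header names would collapse into one key here
        data = [dict(zip(headers, row)) for row in rows[1:]]
        tables.append({
            "id": len(tables) + 1,
            "page": page,
            "rows": len(data),
            "cols": len(headers),
            "headers": headers,
            "data": data,
        })
    return {"table_count": len(tables), "page_count": page_count, "tables": tables}


if __name__ == "__main__":
    sample = [(2, [["Item", "Qty"], ["Bolt", "4"], ["Nut", "8"]])]
    print(json.dumps(build_table_report(sample, page_count=3), indent=2))
```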
Limitations
Scanned PDFs (image-based, no text layer) are not supported; run OCR first, then extract tables. Tables spanning multiple pages are detected as separate tables. Merged cells are flattened into a plain row/column grid.
The Workflow Builder chains multiple PDF operations into a single automated pipeline. Build once, run on any PDF.
How it works
Add steps from the step picker, configure per-step parameters, and drag to reorder. Upload one or more PDFs and run the full pipeline in one click. Each step processes the output of the previous step.
Supported pipeline steps
Saving and composing workflows
Chain 15 operations, save named pipelines to localStorage, export/import as JSON, and run on multiple PDFs in a single job.
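The chaining model, where each step consumes the previous step's output, can be sketched as a loop over a step list. The JSON shape (`steps`, `op`, `params`) and the registry of handlers are assumptions made for this example; the real export format is not documented here.

```python
import json
from typing import Callable, Dict

# A step handler takes the current PDF bytes and its parameters, returns new bytes
StepHandler = Callable[[bytes, dict], bytes]


def run_pipeline(pdf: bytes, workflow: dict, registry: Dict[str, StepHandler]) -> bytes:
    """Apply each configured step in order to the output of the previous one."""
    for step in workflow["steps"]:
        handler = registry[step["op"]]  # KeyError signals an unknown step
        pdf = handler(pdf, step.get("params", {}))
    return pdf


def export_workflow(workflow: dict) -> str:
    """Serialise a named pipeline so it can be saved or shared."""
    return json.dumps(workflow, sort_keys=True)


def import_workflow(text: str) -> dict:
    workflow = json.loads(text)
    if not isinstance(workflow.get("steps"), list):
        raise ValueError("workflow must contain a list of steps")
    return workflow
```

Because the pipeline is plain data, the same definition can be run against any number of input files.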
How every request is handled, from upload to download to deletion.
Request lifecycle
%PDF for PDF operations.
Size checked: 50 MB per file, 200 MB total.
sys_get_temp_dir() . '/pdftool_' . bin2hex(random_bytes(12))
with permissions 0700. No other process can access it.
escapeshellarg().
A 120-second timeout wraps every external process. At most 4 heavy jobs run simultaneously.
readfile($path) begins streaming the output
to your browser over the existing HTTP connection.
cleanup() is called immediately after
readfile() returns. The temp directory and all its contents are
deleted while your download is still in flight. There is no retention
window.
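The lifecycle above is implemented in PHP. The following is a Python analogue for illustration only, showing the same shape: a private temporary directory, processing inside it, streaming the result, and deletion as soon as streaming ends. `stream_job` and the `process` callback are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def stream_job(upload: bytes,
               process: Callable[[Path, Path], Path],
               chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield the processed file in chunks, then delete the job directory."""
    # mkdtemp creates a randomly named directory accessible only to its owner (0700)
    job_dir = Path(tempfile.mkdtemp(prefix="pdftool_"))
    try:
        source = job_dir / "input.pdf"
        source.write_bytes(upload)
        result = process(source, job_dir)
        with result.open("rb") as handle:
            while chunk := handle.read(chunk_size):
                yield chunk
    finally:
        # Runs when the last chunk has been handed off or the client disconnects
        shutil.rmtree(job_dir, ignore_errors=True)
```

Putting the cleanup in `finally` means the directory is removed on success, on error, and on an aborted download alike.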
Security controls
Every page generates two fresh random nonces per request.
script-src allows only 'nonce-{ext}' and
'nonce-{inline}'. No unsafe-inline, no
unsafe-eval. style-src 'self' — no inline styles anywhere
in any HTML, including this page.
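A per-request policy of this kind might be generated as follows. This is a sketch under the assumption that only the two directives described above matter; the real policy probably contains more directives, and `build_csp` is a hypothetical name.

```python
import secrets


def build_csp() -> tuple:
    """Return (header_value, external_nonce, inline_nonce) for one request."""
    # Two independent nonces, regenerated on every request
    ext = secrets.token_urlsafe(16)
    inline = secrets.token_urlsafe(16)
    header = (
        f"script-src 'nonce-{ext}' 'nonce-{inline}'; "
        "style-src 'self'"
    )
    return header, ext, inline
```

Each `<script>` tag in the rendered page would then carry the matching `nonce` attribute; a script injected without it is refused by the browser.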
Two independent rate-limiting layers run on every request.
Session-based: 10 operations per 5-minute sliding window per browser session.
IP-based: 30 operations per 5-minute window per source IP — generous enough
for shared NAT networks but still bounds individual abusers. Both are backed by Redis with
filesystem fallback so limits are always enforced. Polling and keepalive operations
(edit-ping, pdf-scan-poll, esign-status) are exempt
from both limits to avoid blocking live progress UIs. Returns HTTP 429 when either limit
is exceeded.
A third layer enforces server-wide concurrency: at most 4 heavy operations execute simultaneously. When that limit is reached the server returns HTTP 503 ("Server is busy — please try again shortly"). Lightweight operations and status polls are exempt. All limit breaches are recorded as structured security events.
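The two sliding-window layers can be sketched with an in-memory limiter. The production system is described as Redis-backed with a filesystem fallback; this example keeps state in a dictionary purely to show the logic. Class and function names are illustrative.

```python
import time
from collections import defaultdict, deque

# Polling and keepalive operations listed above as exempt from both limits
EXEMPT_OPERATIONS = {"edit-ping", "pdf-scan-poll", "esign-status"}


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)

    def is_limited(self, key: str) -> bool:
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have slid out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        return len(hits) >= self.limit

    def record(self, key: str) -> None:
        self.hits[key].append(self.clock())


def check_request(op: str, session_id: str, ip: str,
                  session_limiter: SlidingWindowLimiter,
                  ip_limiter: SlidingWindowLimiter) -> int:
    """Return the HTTP status the rate-limit layers would produce."""
    if op in EXEMPT_OPERATIONS:
        return 200
    if session_limiter.is_limited(session_id) or ip_limiter.is_limited(ip):
        return 429
    session_limiter.record(session_id)
    ip_limiter.record(ip)
    return 200
```

Checking both limits before recording avoids charging a request against one counter when the other rejects it.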
Two-step validation before any processing:
(1) magic-byte check — first 4 bytes must be %PDF;
(2) secondary structural parse via pdfinfo — the file must be
parseable as a valid PDF cross-reference table. Both checks must pass; a file
that starts with %PDF but contains no valid PDF structure is
rejected. Repeated failures within a session are counted — three consecutive
failures trigger a security event log entry. MIME type validated against
allowlist. No user-controlled string reaches the shell without
escapeshellarg(). Page range inputs are validated against
/^\d+$/ before any integer conversion.
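The first-stage checks are cheap enough to show in a few lines. This sketch covers only the magic-byte test, the size limits, and the digits-only rule; the secondary structural parse via pdfinfo is not reproduced. All function names are hypothetical.

```python
import re

PDF_MAGIC = b"%PDF"
MAX_FILE_BYTES = 50 * 1024 * 1024    # 50 MB per file
MAX_TOTAL_BYTES = 200 * 1024 * 1024  # 200 MB per request
_DIGITS = re.compile(r"\d+", re.ASCII)


def has_pdf_magic(data: bytes) -> bool:
    """First 4 bytes must be %PDF."""
    return data[:4] == PDF_MAGIC


def sizes_ok(file_sizes: list) -> bool:
    """Enforce the per-file and total upload limits."""
    return (all(size <= MAX_FILE_BYTES for size in file_sizes)
            and sum(file_sizes) <= MAX_TOTAL_BYTES)


def parse_page_number(raw: str) -> int:
    """Accept only ASCII digits before converting to an integer."""
    if not _DIGITS.fullmatch(raw):
        raise ValueError(f"malformed page number: {raw!r}")
    return int(raw)
```

Validating the string before conversion means inputs such as `"-1"`, `"1e3"`, or `"7; rm"` never reach integer parsing or a command line.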
Every invocation of a heavy external tool — Ghostscript, Python, LibreOffice,
Playwright, ImageMagick — passes through a mandatory four-layer sandbox chain. The architecture is
sandbox-by-default: new tools are sandboxed automatically; an explicit opt-out is
required to exempt a tool (only four read-only helpers are exempt: pdfinfo, qpdf,
pdfseparate, pdftotext).
Layer 1 — prlimit: kernel-enforced resource caps applied before any process image loads:
1.5 GB virtual memory (RLIMIT_AS), 512 MB max file write (RLIMIT_FSIZE),
256 processes (RLIMIT_NPROC), 512 open file descriptors (RLIMIT_NOFILE).
Layer 2 — AppArmor aa-exec: transitions the process into the pqpdf-unshare
mandatory-access-control profile. Required on Ubuntu 24.04+ where user namespace creation is gated
behind the AppArmor userns permission. The profile grants only what unshare and the sandbox
script need; all other filesystem writes are denied.
Layer 3 — unshare (Linux namespaces): creates isolated kernel namespaces.
--user --map-root-user — the process believes it is root but holds no real capabilities.
--net — private network stack with no interfaces; the tool cannot connect to the internet or
any internal service; any connect() syscall fails.
--pid --fork — isolated PID tree; child processes cannot escape to the host.
--ipc — private shared memory and message queues.
--mount — private mount namespace so bind-mounts are invisible to the host.
Layer 4 — pqpdf-sandbox script: runs inside the new namespaces, mounts a 512 MB
tmpfs as scratch space so all I/O happens in-memory and vanishes when the namespace exits, bind-mounts
the job directory into the scratch tmpfs, applies a CPU time limit via ulimit -t
(enforced after the PID namespace fork, avoiding a kernel sigprocmask conflict), then execs
the real tool binary. No shell remains after exec.
SANDBOX_MIN_LEVEL = 'full' in production — if any layer is
unavailable the operation fails rather than running unsandboxed. Degraded execution is always logged
as a security event.
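One way to picture the four layers is as a single argument vector in which each wrapper executes the next. The sketch below only builds that vector and does not run it. The option names are the standard `prlimit`, `aa-exec`, and `unshare` flags; the way `pqpdf-sandbox` receives the job directory is an assumption, since its interface is not documented here.

```python
def sandbox_argv(tool_argv: list, job_dir: str) -> list:
    """Compose the wrapper chain around a tool command line."""
    prlimit = [
        "prlimit",
        f"--as={1536 * 1024 * 1024}",    # 1.5 GB virtual memory
        f"--fsize={512 * 1024 * 1024}",  # 512 MB max file write
        "--nproc=256",
        "--nofile=512",
    ]
    aa_exec = ["aa-exec", "-p", "pqpdf-unshare", "--"]
    unshare = [
        "unshare", "--user", "--map-root-user",
        "--net", "--pid", "--fork", "--ipc", "--mount",
    ]
    # Hypothetical calling convention: job directory first, then the tool
    sandbox = ["pqpdf-sandbox", job_dir]
    return prlimit + aa_exec + unshare + sandbox + list(tool_argv)
```

Passing the result to a process launcher as a list, rather than as a shell string, also keeps user-influenced arguments out of shell parsing.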
HTTP/3 over QUIC v1 (RFC 9000) — primary protocol. TLS 1.3 only;
TLS 1.0, 1.1, and 1.2 disabled — no downgrade possible. Key exchange uses
X25519MLKEM768 hybrid post-quantum cryptography (NIST FIPS 203). Cipher suite:
TLS_AES_256_GCM_SHA384. Certificate: Let's Encrypt ECDSA + SHA-384, CT-logged.
HSTS preload eligible (max-age=31536000; includeSubDomains; preload).
Full transport details →
Security-relevant events are written as structured
NDJSON to /var/log/pqpdf/security.ndjson — one event per line,
ingestible by Elasticsearch, Loki, Datadog, or jq. Events logged:
invalid HTTP method, unknown operation, session rate limit breach, IP rate limit breach,
concurrency limit reached, file size exceeded,
total upload size exceeded, repeated PDF validation failures (threshold:
3 consecutive), malformed page range input. Every entry carries a
hashed session token (first 12 hex chars of
sha256(session_id()) — stable but cannot be used to hijack the
session), IP address, operation name, and sanitised user-agent string. Falls
back to error_log() if the log file is not writable so no event
is silently dropped. A live Security Dashboard
at /security-dashboard.php presents aggregated telemetry — event timeline, activity heatmap,
top source IPs, and a filterable event log table with CSV/JSON export. The dashboard
is token-gated via the PQPDF_DASHBOARD_TOKEN environment variable.
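A single log line in this style could be produced as shown below. The hashed session token follows the description above (first 12 hex characters of a SHA-256 digest). The JSON field names, the timestamp format, and the user-agent sanitising rule are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def session_token(session_id: str) -> str:
    """Stable 12-character identifier that does not reveal the session ID."""
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:12]


def sanitise_user_agent(user_agent: str, max_len: int = 200) -> str:
    """Keep printable ASCII only so a header cannot inject extra log lines."""
    printable = "".join(ch for ch in user_agent if 32 <= ord(ch) < 127)
    return printable[:max_len]


def format_event(event: str, session_id: str, ip: str,
                 operation: str, user_agent: str) -> str:
    """Return one NDJSON line (no trailing newline)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "session": session_token(session_id),
        "ip": ip,
        "op": operation,
        "ua": sanitise_user_agent(user_agent),
    }
    return json.dumps(record, separators=(",", ":"))
```

Because every record is a single JSON object on a single line, the file can be filtered with `jq` or shipped to a log indexer without a custom parser.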
The contact form layers four independent defences:
(1) AI behavioural verification — client-side analysis of interaction
patterns before the submit button is enabled;
(2) honeypot fields — two hidden inputs invisible to humans are sent with
every submission; any non-empty value causes the server to reject the request
via SpamException;
(3) server-side spam pattern matching — pharmaceutical keywords, excessive
capitalisation, disposable email domains, and common bot phrases;
(4) IP-based rate limit — maximum 5 submissions per hour per IP,
enforced in PostgreSQL before any email is dispatched.
Engine stack
All engines run locally. No file data is ever sent to a third-party service.
| Engine | Used by | External calls? |
|---|---|---|
| Ghostscript | Compress, watermark, rotate, protect, unlock, flatten, grayscale, repair, PDF/X | None |
| Poppler | Merge, split, extract text, to-images, PDF info | None |
| qpdf | Protect/unlock, structural analysis (scanner) | None |
| LibreOffice | All Office ↔ PDF conversions (Word, Excel, PowerPoint, ODT, ODS, ODP) | None |
| Playwright / Chromium | HTML → PDF (URL and file modes, JavaScript rendering, lazy-load, web fonts) | None (sandboxed) |
| ImageMagick | Images → PDF, typed signature rendering | None |
| Tesseract 5 LSTM | OCR PDF | None |
| PyMuPDF 1.27 | Edit, fill, nup, deskew, outline, a11y, font/colour inspect, PDF info, scanner engines 1–9 | None |
| pymupdf4llm | PDF → Markdown | None |
| python-pptx | PDF → PowerPoint | None |
| pdfplumber | Tables to JSON | None |
| pyhanko 0.34 | PAdES / Sign PDF (incremental CMS/PKCS#7) | None |
| endesive | Visual + crypto sign modes | None |
| ExifTool 12 | Scanner engine 10 | None |
| YARA 4.5 | Scanner engine 12 (24 custom rules + external .yar support) | None |
| ClamAV 1.4+ | Scanner engine 15 (700k+ signatures) | Signature updates only (clamav.net) |
| PeePDF 0.4 | Scanner engine 13 | None |
| prlimit + AppArmor aa-exec + unshare + pqpdf-sandbox | Four-layer process isolation sandbox — wraps every heavy tool invocation; also used explicitly by Scanner engine 14 for the dynamic behavioral sandbox with strace syscall tracing | None (network namespace isolates all tools) |
| pikepdf | Scanner engine 13 (supplemental PDF parser — JS Names tree, EmbeddedFiles, per-page AA) | None |
| scikit-learn + LightGBM | Scanner engine 16 (IsolationForest + RandomForest + LightGBM ensemble, model drift detection) | None |
| Acorn (Node.js) | Scanner engine 19 (JS AST deobfuscation, ECMAScript 2022, 6 iterative deobfuscation passes) | None |
| imagehash | Scanner engines 24, 39 (pHash perceptual similarity for campaign attribution · LSB chi-square steganalysis and tracking beacon detection) | None |
| Node.js vm | Scanner engine 41 (JS behavioral emulation — sandboxed Acrobat API stub, runtime call interception: app.launchURL, this.submitForm, app.openDoc) | None |
| python-tlsh | Scanner engine 24 (TLSH locality-sensitive hash for campaign clustering) | None |
| @noble/post-quantum | Protect PDF — PQC mode (runs in browser) | None |
PQ PDF runs behind PQCrypta Proxy — a Rust-based QUIC proxy (built on quinn) that provides HTTP/3, WebTransport, and post-quantum hybrid TLS at the network layer. Every connection uses TLS 1.3 with X25519MLKEM768 hybrid key exchange — the same algorithm now deployed by Chrome, Firefox, and Cloudflare. TLS 1.0, 1.1, and 1.2 are disabled entirely.
Post-quantum hybrid key exchange
X25519MLKEM768 combines a classical algorithm with a post-quantum algorithm. Both must be broken simultaneously for the key exchange to be compromised.
The post-quantum half. ML-KEM-768 (formerly Kyber-768) is a lattice-based key encapsulation mechanism standardised by NIST as FIPS 203 in August 2024. It targets NIST security category 3, roughly comparable to the effort of breaking AES-192, and no known classical or quantum attack recovers the key at practical cost.
The classical half. X25519 (Curve25519 Diffie-Hellman) is one of the fastest and most widely audited elliptic-curve key exchanges in production use. Its design lends itself to constant-time implementations, which closes off timing side-channels. It is secure against all known classical attacks.
The final session key is derived from the output of both components. An adversary must break both X25519 and ML-KEM-768 to recover the key. If either algorithm is later found broken, the connection is still protected by the other — forward security is maintained at two independent levels.
Nation-state actors are widely believed to archive encrypted traffic today, intending to decrypt it once a sufficiently powerful quantum computer exists. X25519MLKEM768 is designed to defeat that strategy: no known quantum algorithm can reconstruct the session key from captured ciphertext.
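The "both must be broken" property comes from feeding both shared secrets into one key derivation. The sketch below is a simplification: it concatenates two 32-byte secrets (ML-KEM part first, which is the ordering assumed here for X25519MLKEM768) and runs a single HKDF-Extract step with SHA-384. The real TLS 1.3 key schedule has more stages, and the function names are hypothetical.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, input_key_material: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) instantiated with HMAC-SHA-384."""
    return hmac.new(salt, input_key_material, hashlib.sha384).digest()


def combine_shared_secrets(mlkem_secret: bytes, x25519_secret: bytes) -> bytes:
    """Derive one secret that depends on both key exchange outputs."""
    if len(mlkem_secret) != 32 or len(x25519_secret) != 32:
        raise ValueError("both shared secrets are expected to be 32 bytes")
    # An attacker who learns only one input still cannot compute the output
    return hkdf_extract(b"\x00" * 48, mlkem_secret + x25519_secret)
```

Changing either input changes the derived secret, so recovering it requires both the classical and the post-quantum shared secret.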
Check whether your server negotiates X25519MLKEM768 or another PQC hybrid key exchange — and whether TLS 1.2 is still reachable.
Protocol stack
Clients negotiate the highest protocol they support. HTTP/3, HTTP/2, and HTTP/1.1 are negotiated via ALPN, with HTTP/3 also advertised via Alt-Svc; WebTransport runs over the HTTP/3 connection. All enforce TLS 1.3, so there is no downgrade path to TLS 1.2.
| Protocol | Transport | TLS | ALPN | Multiplexing | HOL blocking |
|---|---|---|---|---|---|
| HTTP/3 (primary) | QUIC v1 (UDP) | TLS 1.3 (QUIC built-in) | h3 | ✔ Stream-level | ✔ Eliminated |
| WebTransport | QUIC v1 (UDP) | TLS 1.3 (QUIC built-in) | h3 | ✔ Bidirectional streams + datagrams | ✔ Eliminated |
| HTTP/2 (fallback) | TCP | TLS 1.3 | h2 | ✔ Stream-level | ⚠️ TCP-level HOL remains |
| HTTP/1.1 (legacy) | TCP | TLS 1.3 | http/1.1 | ✘ None | ⚠️ Request + TCP HOL |
The PQCrypta scanner checks HTTP/3 negotiation, QUIC version, WebTransport availability, Alt-Svc headers, and 0-RTT configuration.
QUIC v1 — security hardening
RFC 9000 defines a suite of anti-abuse mechanisms. All are enabled.
Zero-RTT session resumption is intentionally off. While 0-RTT reduces latency for repeat connections, it opens a replay window where an adversary can re-submit captured early data. Disabling it eliminates this class of attack. All connections use full 1-RTT handshakes.
Before allocating connection state, the proxy issues a RETRY packet with a server-generated token. The client must echo this token in its next Initial, proving it controls the claimed source address. Prevents IP spoofing and connection-state exhaustion attacks.
The server sends no more than 3× the bytes received from an unvalidated client address. This prevents the QUIC handshake from being weaponised as a UDP amplification vector — a significant concern for protocols that can send large responses to small initial packets.
If the proxy loses connection state (e.g. after a restart), it sends a STATELESS_RESET to the client, cleanly terminating the connection rather than leaving the client retransmitting into a broken session indefinitely.
QUIC connections are identified by a Connection ID, not the IP/port 4-tuple. A client that switches from Wi-Fi to mobile data mid-upload continues the same logical connection without starting over. Connection migration is enabled with address re-validation on path change.
The proxy sends GREASE (Generate Random Extensions And Sustain Extensibility, RFC 8701) values in TLS extension slots. This discourages middleboxes and TLS stacks from hardcoding assumptions about which extension IDs are valid, keeping the protocol extensible as new standards are adopted.
QUIC-native loss recovery with CUBIC CC. Initial congestion window: 12,000 bytes. Path MTU Discovery (PMTUD) enabled — the proxy probes for the optimal UDP payload size (measured MTU 1,452 bytes, UDP MTU 1,200 bytes, datagram payload 1,162 bytes).
Encrypted Client Hello (ECH) would encrypt the SNI field, hiding the target hostname from network observers. It is not yet supported by PQCrypta Proxy. TLS 1.3 encrypts the handshake messages that follow the ServerHello, but the ClientHello, including the SNI, remains visible to on-path observers.
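Of the mechanisms in this section, the anti-amplification limit is the easiest to express in code: until the client's address is validated, the server may send at most three times the bytes it has received from that address. A minimal sketch with a hypothetical `AmplificationBudget` class:

```python
class AmplificationBudget:
    """Tracks how much a server may send to a not-yet-validated address."""

    FACTOR = 3  # RFC 9000 limit for unvalidated addresses

    def __init__(self) -> None:
        self.received = 0
        self.sent = 0
        self.validated = False

    def on_receive(self, size: int) -> None:
        self.received += size

    def can_send(self, size: int) -> bool:
        # The limit disappears once the address has been validated
        return self.validated or self.sent + size <= self.FACTOR * self.received

    def on_send(self, size: int) -> None:
        if not self.can_send(size):
            raise RuntimeError("would exceed the anti-amplification limit")
        self.sent += size
```

After a single 1,200-byte Initial from the client, the server may send up to 3,600 bytes and must then wait for more data or for address validation to complete.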
WebTransport
Available on port 443 (path /) and port 4433. WebTransport runs over HTTP/3 and exposes QUIC streams and unreliable datagrams directly to browser code.
Unlike WebSockets (which typically share a single TCP connection), WebTransport streams are independent QUIC streams with no head-of-line blocking between them. Multiple large file operations can transfer in parallel, so one slow stream does not stall others.
Beyond streams, WebTransport supports fire-and-forget datagrams (max 1,162 bytes each). Ideal for low-latency signals — live progress events, cancellation, real-time preview requests — where retransmitting stale data would add unnecessary latency.
Every WebTransport session shares the underlying QUIC connection's TLS 1.3 encryption, X25519MLKEM768 key exchange, address validation, and anti-amplification hardening. No separate security layer to configure.
TLS 1.3 — cipher & certificate
| Parameter | Value | Notes |
|---|---|---|
| Cipher suite | TLS_AES_256_GCM_SHA384 | AES-256-GCM authenticated encryption; 256-bit key; SHA-384 transcript hash |
| Key exchange | X25519MLKEM768 | Hybrid: X25519 classical ECDH + ML-KEM-768 post-quantum (NIST FIPS 203) |
| Signature | ecdsa-with-SHA384 | Certificate signed with ECDSA + SHA-384; P-384 curve |
| Certificate issuer | Let's Encrypt E8 | Free public CA; certificate transparency logged; 90-day auto-renewal |
| ALPN | h3 | HTTP/3 negotiated via TLS ALPN extension |
| TLS versions | TLS 1.3 only | TLS 1.0, 1.1, and 1.2 explicitly disabled — no downgrade possible |
| GREASE | Enabled | Random extension values injected to prevent middlebox ossification (RFC 8701) |
| Handshake RTTs | 1-RTT only | 0-RTT disabled; full handshake on every new connection — no replay window |
| ECH | Not yet supported | SNI remains visible to on-path observers; ECH support is planned |
Connection & performance metrics
Measured by PQCrypta scanner against pqpdf.com, March 2026.
| Metric | Value | Notes |
|---|---|---|
| TLS handshake | 48 ms | Full 1-RTT QUIC Initial + Handshake packet exchange |
| TTFB | 3 ms | Time to first byte from proxy after handshake |
| RTT | 0 ms | Sub-millisecond measured round-trip time |
| Packet loss | 0.00% | 0 of 16 packets lost during scan |
| Congestion control | CUBIC | RFC 9002 QUIC loss recovery + CUBIC CC algorithm |
| Initial CWND | 12,000 bytes | 10 datagrams at the 1,200-byte UDP MTU (about 8 at the 1,452-byte MTU) before ACK feedback is required |
| Max stream data | 12,000 bytes | Initial per-stream flow control window |
| MTU / UDP MTU | 1,452 / 1,200 bytes | PMTUD enabled; max datagram payload 1,162 bytes |
| Idle timeout | 20 s | Server-initiated close after 20 s of inactivity |
| Proxy processing | 2.64 ms | PQCProxy internal duration (Server-Timing: proxy;dur=2.64) |
HTTP/3 response headers
Headers sent on every HTTP/3 response that convey transport metadata, observability, and client hint negotiation.
| Header | Value / Purpose |
|---|---|
| Alt-Svc | h3=":443"; ma=86400, h3=":4434"; ma=86400 (advertises HTTP/3 on ports 443 and 4434; browsers cache for 24 hours) |
| Server-Timing | proxy;dur=2.64;desc="PQCProxy Processing", quic;desc="QUIC v1" |
| Priority | u=3 (RFC 9218 Extensible Prioritisation Scheme; urgency 3, the default) |
| Accept-CH | DPR · Viewport-Width · Width · ECT · RTT · Downlink · Sec-CH-UA-Platform · Sec-CH-UA-Mobile (client hints for adaptive responses) |
| NEL | Network Error Logging configured; browsers report transport failures to the Report-To endpoint |
| Report-To | Reporting API endpoint for NEL, CSP violation, and COOP violation reports |
| 103 Early Hints | Supported; the server can push Link: preload hints before the full response is ready |
Implementation
PQCrypta Proxy v0.2.1: a purpose-built Rust proxy using the quinn library, a widely used Rust QUIC implementation. Rated A++ / HTTP/3 Ultimate by the PQCrypta scanner with 95% confidence: "Post-Quantum ready, Rust QUIC (quinn), HTTP/3 RFC 9114, Standard port (443)".