The Complete Guide

PQ PDF — Every Tool, Explained

45 free PDF tools. No accounts. No ads. No file retention. Everything runs on our servers — your files are deleted the moment your download starts.

45 Free Tools
44 Scan Engines
31 PQC Algorithms
0 Files Stored
50 MB Max File Size
✅ Zero retention ✅ No accounts ✅ No tracking ✅ No third-party cloud ✅ All engines local 🤖 Self-hosted AI — no OpenAI/Anthropic/Google

Most online PDF tools share a structural problem: they are built around cloud storage. A file uploaded to add a watermark travels to a third-party processing service, sits in object storage, passes through analytics pipelines, and is subject to retention policies that are vague at best.

PQ PDF is different. Every operation creates one isolated temporary directory, runs entirely inside it, streams the result back to your browser, and deletes the directory — while the download is still in flight. There is no retention window because there is no buffer.
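The lifecycle described above (create one directory, work inside it, stream the result, delete the directory) can be sketched in a few lines. This is an illustration in Python only, not the site's own code: `process_and_stream` and the `operation` callback are hypothetical names, and the example simply shows how a generator's `finally` clause removes the working directory as soon as the last chunk has been handed to the caller.

```python
import os
import shutil
import tempfile


def process_and_stream(data: bytes, operation, chunk_size: int = 64 * 1024):
    """Run `operation` inside one isolated temp directory and yield the result.

    `operation(src_path, dst_path)` is a hypothetical callback standing in
    for any PDF tool. The directory is removed in `finally`, so nothing is
    left behind once streaming ends (or if the operation fails).
    """
    workdir = tempfile.mkdtemp(prefix="job-")
    try:
        src = os.path.join(workdir, "input.pdf")
        dst = os.path.join(workdir, "output.pdf")
        with open(src, "wb") as fh:
            fh.write(data)
        operation(src, dst)
        with open(dst, "rb") as fh:
            while chunk := fh.read(chunk_size):
                yield chunk
    finally:
        # Runs when the generator is exhausted, closed, or errors out.
        shutil.rmtree(workdir, ignore_errors=True)
```

Because cleanup is tied to the end of the stream rather than to a timer, there is no scheduled job to forget and no window in which an abandoned file can linger.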

Four specific gaps drove this project: no free tool offered a genuine zero-retention guarantee anywhere in the stack; no free tool ran multi-engine threat analysis on PDFs; no tool offered post-quantum cryptography in document workflows; and no tool was transparent about which engines were actually running — every operation here is described in the exact pipeline terms of the code.

Why privacy-first matters for PDFs

🗄️
Zero file retention

Your file is deleted while the download streams. cleanup() is called immediately after readfile() — not on a schedule, not on the next request. There is no temp-file cleanup job because nothing is left to clean.

🚫
No third-party cloud — including AI

All 45 tools run on pqpdf.com's own servers. Ghostscript, LibreOffice, Tesseract, PyMuPDF, ClamAV — every engine runs locally. The AI features (forensic report, document analysis, redaction suggestions, change analysis) run on our own self-hosted Qwen 2.5 1.5B LLM via llama.cpp — no OpenAI, no Anthropic, no Google, no third-party AI API of any kind. No file data is ever sent outside our infrastructure.

🔒
Strict CSP — no unsafe-inline

Every page uses a per-request nonce-based Content Security Policy. No inline scripts, no unsafe-eval. All event handlers are registered via addEventListener() in external JS files. Including this page.

👁️
No tracking, ever

No analytics pixels. No advertising networks. No social-media trackers. Server access logs (IP, timestamp, path) are retained for 30 days for abuse prevention only, then permanently deleted.

Post-quantum ready

The Protect PDF tool includes 31 post-quantum algorithms (NIST ML-KEM-1024, HQC, FN-DSA, and hybrid modes) running entirely client-side in your browser. The server receives only the encrypted bundle — your plaintext never crosses the network.

🧪
No accounts required

Every tool works without registration. There are no user accounts, no email addresses collected, no passwords stored. Rate limiting uses session cookies that exist only in your browser and are never transmitted to or stored on the server.
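The per-request nonce policy mentioned in the "Strict CSP" card above can be illustrated with a short sketch. It is written in Python for illustration and is not taken from the site: the helper names and the exact directive list are assumptions. The idea is that a fresh random nonce is generated for every response, placed in the `Content-Security-Policy` header, and repeated on each allowed `<script>` tag.

```python
import base64
import secrets


def new_nonce(num_bytes: int = 16) -> str:
    """Return a fresh, unpredictable base64 nonce for one response."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def csp_header(nonce: str) -> str:
    """Build a CSP value with no 'unsafe-inline' and no 'unsafe-eval'.

    The directive set below is an assumed example, not the site's policy.
    """
    directives = [
        "default-src 'self'",
        f"script-src 'self' 'nonce-{nonce}'",
        "object-src 'none'",
        "base-uri 'self'",
    ]
    return "; ".join(directives)


def script_tag(src: str, nonce: str) -> str:
    """Render an external script tag carrying the same nonce."""
    return f'<script src="{src}" nonce="{nonce}"></script>'
```

A script tag whose nonce does not match the header for that response is blocked by the browser, which is why the value has to be regenerated per request rather than hard-coded.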

What the 45 tools cover

Five groups covering every common PDF workflow.

⚙️
Core Manipulation — 12 tools

Merge, split, compress, rotate, reorder, delete pages, extract pages, flatten, repair, grayscale, N-up imposition, and auto-crop & deskew.

Explore core tools →
📄
Format Conversion — 13 tools

Convert between PDF and Word, Excel, PowerPoint, HTML, and Images in both directions, plus one-way export from PDF to Markdown, PDF/A, and PDF/X.

Explore convert tools →
🛡️
Security & Privacy — 6 tools

44-engine PDF forensics scanner, AES-256 + PQC encryption, unlock, permanent redaction, watermarking, and PAdES-compliant signing.

Explore security tools →
✏️
Annotate & Inspect — 11 tools

Full visual editor, form fill, PDF diff, OCR, accessibility audit, font inspection, colour profiling, table extraction, and more.

Explore annotation tools →
🔄
Automation — 1 tool

Chain multiple operations into a named workflow. Save, load, append, and export pipelines as JSON. Run on one or many PDFs in one click.

Explore automation →
🔧
How it works

Temp-dir isolation, CSP nonces, rate limiting, file validation, zero-retention architecture, and the full engine stack — all documented.

Read the architecture →
📊
How PQ PDF compares

Side-by-side comparison with Adobe, Smallpdf, iLovePDF, PDF24, and Sejda — tools, limits, privacy, and pricing verified from official sources.

See the comparison →

How PQ PDF compares

Verified from official privacy policies, terms of service, and tool pages as of March 2026. Only publicly documented claims are listed.

Feature · PQ PDF · Adobe Acrobat Online · Smallpdf · iLovePDF · PDF24 · Sejda
File retention after processing ✔ Deleted during download cleanup() called inside send_file() — no retention window of any kind Not disclosed Files deleted "after processing" — no specific time given on any public page 1 hour Automatically deleted 1 hour after processing (stated on tool pages) 2 hours Documented in Terms of Service §9.3 and the public FAQ 1 hour Stated in Privacy Policy and Terms of Use (pdf24.org) 2 hours Stated on tool pages and in the Sejda Privacy Policy
Free tier — what's included
Core / Organise ✔ 12 tools Merge · Split · Compress (5 presets + custom DPI + live preview) · Rotate · Extract Pages · Delete Pages · Reorder · Repair · Flatten · Grayscale / B&W · N-up / Imposition · Auto-crop & Deskew (per-page interactive editor) ✔ 7 tools Free Adobe ID required for most Merge · Split · Rotate · Delete pages · Reorder pages · Add page numbers · Add watermark ✔ 7 tools 2 tasks/day cap Merge · Split · Compress · Rotate · Delete pages · Reorder pages · Extract pages ✔ 6 tools Per-task caps apply Merge (25 files, 100 MB) · Split · Remove pages · Extract pages · Rotate · Compress ✔ 6 tools All free · no caps · ad-supported Merge · Split · Compress · Rotate · Delete pages · Reorder pages ✔ 13+ tools 3 tasks/hr rate limit Merge (4 variants: std, specific pages, alternate, resize) · Split (5 variants: pages, text, bookmarks, size, extract) · Rotate · Delete pages · Crop · Repair · Flatten · Header & footer · Bates numbering · Reverse PDF
Convert ✔ 13 tools PDF ↔ Word · Excel · PowerPoint · HTML · Images · Markdown (pymupdf4llm) · PDF/A (1b/2b/3b) · PDF/X (X-1a/X-3/X-4) Partial PDF → Office is paid Free (Adobe ID): Word / Excel / PPT / HTML / JPG → PDF · PDF → JPG (limited quality)
Paid only: PDF → Word · Excel · PPT · PDF/A · OCR
Partial PDF → Office is Pro only Free: Word / Excel / PPT / HTML / JPG → PDF · PDF → JPG
Pro only: PDF → Word · Excel · PPT · OCR · PDF/A
✔ 10 tools File size caps per tool → PDF (5): Word · Excel · PPT · HTML · JPG (20 files, 100 MB each)
PDF → (5): Word (10 MB) · Excel (10 MB) · PPT (10 MB) · JPG (100 MB) · HTML (10 MB)
✔ 10 tools All free · no caps · ad-supported PDF ↔ Word · PDF ↔ Excel · PDF ↔ PowerPoint · Image ↔ PDF · HTML → PDF ✔ 8 tools 3 tasks/hr rate limit PDF → (5): Word · Excel · PPT · JPG · HTML
→ PDF (3): Word · Excel · JPG
Security & Encryption ✔ 6 tools 44-engine PDF forensics scanner (structural · dynamic sandbox · ML+SHAP · XFA FormCalc · action dependency graph · OCG cloaking · Unicode/invisible text · trailer chain forensics · codec exploit params · entropy topology · image stego · compliance fraud · JS behavioral emulation · font CharString emulator · XRef integrity graph · local threat intelligence 6.4M+ indicators · MITRE ATT&CK · signature forensics · phishing · campaign attribution) · AES-256 + 31-algorithm PQC encrypt · Unlock · Permanent redact · Watermark · PAdES-B sign ✔ 2 tools free Most security features are paid Free (Adobe ID): Protect (encrypt) · Unlock
Paid only: Redaction · Advanced watermark
✔ 3 tools 2 tasks/day cap Protect (encrypt) · Unlock · Watermark
Pro only: Redact · PDF/A compliance
✔ 5 tools Per-task caps apply Protect (encrypt) · Unlock · E-sign (request) · Validate signature · Redact PDF
No AES-256 PQC encryption, no threat scanning
✔ 4 tools All free · no caps · ad-supported Protect (encrypt) · Unlock (decrypt) · Sign PDF · Compare PDFs ✔ 5 tools 3 tasks/hr rate limit Protect (encrypt) · Unlock · Sign PDF (request) · Validate signature · Redact PDF
Edit, Annotate & Inspect ✔ 11 tools Visual editor (16 annotation tools + AcroForm builder + bookmarks) · Fill forms · Compare / diff · Tesseract 5 LSTM OCR (searchable PDF + confidence) · Bookmarks editor · WCAG 2.1 accessibility checker · Font inspector · Colour / CMYK inspector · Tables → JSON · Extract text · PDF info Partial Advanced edit, OCR & compare are paid Free: Add comments · Fill & Sign · Basic text edit
Paid only: Advanced editing · OCR · Compare PDFs · Accessibility checker · Bates numbering
✔ 4 tools 2 tasks/day cap Edit PDF · Add page numbers · Fill forms · Flatten
Pro only: OCR · Redact · AI chat / summarise
✔ 6 tools Per-task caps apply Edit PDF (50 MB) · Add page numbers · Fill forms · Sign PDF · OCR PDF · PDF Scanner
Premium: AI tools, API, batch
✔ 5 tools All free · no caps · ad-supported Edit PDF · Add page numbers · Fill forms · OCR PDF · PDF/A conversion ✔ 9 tools 3 tasks/hr rate limit Edit PDF · Add text / image / shapes / links · Whiteout · Edit hyperlinks · Add page numbers · OCR PDF · PDF Scanner · Optimise for web · HTML → PDF
Automation / Workflow ✔ 15-step workflow builder Visual builder — chain, save, append, export as JSON; run on multiple PDFs in one job; fully free, no account needed ✘ None free Acrobat Actions (macro-style) — paid Acrobat Pro only; no visual builder ✘ None Batch via API — Pro only; no visual workflow builder ✘ None Batch — Premium & API only; no visual builder ✘ None Tools run individually; no workflow builder ✘ None Tools run individually; no workflow builder
Send for e-signature (multi-party) ✔ Free — no caps Up to 10 signers · sequential or parallel order · unique secure link per signer · PAdES-B cryptographic option (sender can enforce it — require_crypto rejects submissions without a cryptographic signature) · free drag-and-drop signature placement · date stamp · add/remove signers from tracking page · 24-hr TTL · no account needed ✔ 3 requests/month free Request e-sig from others via free Adobe ID; unlimited via Acrobat Sign (paid) ✔ Limited free E-sign requests included in free 2-tasks/day tier; Pro removes cap ✔ Free with caps E-sign — request sigs from others; 1-file per-task cap applies ✔ Free Request-signature workflow; free, no cap, ad-supported ✔ Free with rate limit E-sign — request sigs from others; 3 tasks/hr cap applies
PDF Scanner (camera to PDF) ✔ Free — no caps Browser camera or photo upload · real-time edge detection · OpenCV perspective correction · CLAHE & B&W enhancement · Tesseract 5 OCR · multi-page · no app install · zero retention ✔ Adobe Scan (app required) Dedicated free mobile app; scan to searchable PDF with OCR — requires install ✘ Not available ✔ Yes — free with caps PDF Scanner web tool; mobile camera to PDF with OCR; per-task file cap applies ✔ PDF24 app (app required) Mobile app includes scan-to-PDF; free, ad-supported — requires install ✔ Yes — rate limited PDF Scanner tool; 3 tasks/hr cap applies
Free tier limits No account  ·  No task caps  ·  No daily limits  ·  No ads  ·  No upsells 50 MB / file  ·  200 MB total per request 2 GB / file Most tools need free Adobe ID; aggressive upgrade prompt after 1–2 uses of most tools; full access requires paid plan 5 GB / file 2 tasks/day hard cap — upgrade prompt after that; batch & API are Pro-only 200 MB / task Per-tool file count & size caps; tighter on PDF → Office (10 MB); Premium for batch & larger files 500 MB / file No task cap, no daily limit; all tools free with advertising; Premium removes ads 50 MB / file 3 tasks/hr · 50 pages/task · 30 files/hr; Paid plan removes all rate limits
Malware & threat scanning ✔ 44 independent engines Static heuristics · dynamic Linux namespace sandbox · ML anomaly detection · XFA FormCalc parser · PDF action dependency graph · OCG layer cloaking · Unicode/invisible text · trailer chain forensics · codec exploit params · entropy topology · image steganography · compliance fraud · JS behavioral emulation · font CharString emulator · XRef integrity graph · six-parser differential · JS AST deobfuscation · AcroForm forensics · signature forensics · campaign attribution · weighted correlation engine ✘ No malware scanning CSAM content check only (Adobe Terms §2.2(C)); no PDF threat analysis disclosed ✘ None disclosed ✘ Explicitly none "We won't check, copy or analyze your files in any way" — iLovePDF FAQ ✘ None disclosed ✘ None disclosed
Post-quantum encryption ✔ 31 algorithms — client-side NIST ML-KEM-1024/768/512, HQC-128/192/256, FN-DSA variants, and hybrid modes via @noble/post-quantum; server never sees plaintext ✘ Not available Not disclosed on any public page ✘ Not available ✘ Not available ✘ Not available ✘ Not available
Processing on own servers ✔ All engines run locally No file data sent to any third-party service — every engine runs on pqpdf.com's own server Third party — not named "Trusted cloud infrastructure providers and CDNs" (Adobe Terms) — providers not disclosed Not disclosed Privacy policy pages returned errors during verification; subprocessors not published Third party — not named "Leading cloud data storage provider" cited on Security page — name not disclosed EU servers confirmed; provider not named "All servers within the EU" — PDF24 Privacy Policy (geek software GmbH, Berlin) DigitalOcean, Cloudflare, Fastly All three named as infrastructure providers in the Sejda Privacy Policy
AI analysis — self-hosted LLM 🤖 Qwen 2.5 1.5B — self-hosted 4 AI features — all self-hosted on own hardware, zero third-party AI calls: 🤖 AI Forensic Report (synthesises all 44 engine outputs → verdict, confidence, executive summary, key findings, MITRE techniques, recommended actions, false_positive_note; MALICIOUS/CLEAN auto-labels for ML retraining), 🤖 AI Document Analysis (type classification with confidence, language, entities incl. locations, topics, reading level), 🤖 AI Redaction Suggestions (PII pattern proposals across 13 categories with example + reason), 🤖 AI Change Analysis (significance rating, change type, plain-English summary, per-change details array, recommendation) — Qwen 2.5 1.5B Instruct via llama.cpp, ~13 t/s on Ryzen 5 3550H. No OpenAI, no Anthropic, no Google. Your document text never leaves our infrastructure. ⚠️ Adobe Sensei (cloud AI) Adobe AI routes content through Adobe's own cloud AI pipeline — separate from PDF processing infrastructure ✘ None disclosed ✘ None disclosed ✘ None disclosed ✘ None disclosed
Processing engines disclosed ✔ All engines named Ghostscript, Poppler, LibreOffice, Tesseract 5, PyMuPDF, YARA, ClamAV, PeePDF, pikepdf, Acorn, scikit-learn, LightGBM, imagehash — every tool documented ✘ None disclosed Described as proprietary; no library or engine names on any public page ✘ None disclosed ✘ None disclosed ✘ None disclosed Partial Tesseract named for the desktop app OCR feature only; web-side engines not disclosed
Max upload — free tier 50 MB / file 200 MB total per request across all files 2 GB Stated on the Compress tool page (acrobat.adobe.com) Not stated publicly Pricing page not accessible; per-tool limits not published on accessible pages 200 MB / task Varies by tool — lower limits on conversions (e.g. 15 MB for some); ilovepdf.com 500 MB / file Stated on tool pages at tools.pdf24.org 50 MB / file 200 MB / task; 3 tasks/hour; page cap 50 pages/task (sejda.com)
Open-source engines only ✔ 100% open source Ghostscript, Poppler, LibreOffice, Tesseract 5, PyMuPDF, YARA, ClamAV, PeePDF, pikepdf, Acorn, scikit-learn, LightGBM, imagehash — every engine is named open-source software ✘ No Proprietary Adobe processing pipeline; specific engines not disclosed on any public page ✘ Not disclosed ✘ Not disclosed ✘ Not disclosed Partial Tesseract named for desktop OCR only; web processing pipeline not disclosed
Cryptographic signing standard ✔ PAdES-B (ETSI EN 319 142-1) Incremental CMS/PKCS#7 via pyhanko 0.34 — verifiable in Adobe Reader's Signatures panel. Draw/type/upload modes also available with embedded RSA-2048 cert Acrobat standard signature Adobe Sign product available; standard acrobat.adobe.com e-sign. PAdES compliance not documented on public pages Basic e-signature Sign tool available; signing standard not disclosed on accessible pages Basic e-signature Sign tool available; signing standard not disclosed on accessible pages Basic e-signature Sign tool available; signing standard not disclosed on accessible pages Basic e-signature Sign tool available; signing standard not disclosed on accessible pages
Advertising / monetisation model ✔ No ads, no upsells No advertising, no tracking pixels, no in-tool upgrade prompts, no affiliate links — tool is self-funded Freemium — subscription upsells Free basic use; persistent prompts to upgrade to paid Acrobat plan Freemium — subscription upsells Free tier with task limits; upgrade prompts throughout the tool flow Freemium — subscription upsells Free tier with limits; Premium plan promoted within tools Free with ads Web app displays advertising on the free tier; Premium plan available for ad-free experience Freemium — task-limited Free tier capped at 3 tasks/hour; paid plan promoted in tool UI

ℹ️ All competitor claims verified from official sources as of March 2026: Adobe Terms · iLovePDF Terms §9.3 · iLovePDF FAQ · PDF24 Privacy Policy · Sejda Privacy Policy · Smallpdf retention confirmed from tool pages (privacy policy pages unavailable at time of research). PQ PDF claims are derived from api.php, _tool_head.php, and tool source files on this server.

12 tools for everyday PDF manipulation. All processing is server-side; nothing is rasterised unless explicitly requested.

📎
Merge PDFs

Combine up to 20 PDFs into one (200 MB total). Drag thumbnails to reorder before merging. Real-time upload progress percentage.

Try it →
✂️
Split PDF

Split by every page, a fixed interval, custom page ranges, or interactive cut-point selection. Output is a ZIP of individual PDFs.

Try it →
🗜️
Compress PDF

Five quality presets plus custom DPI slider (50–600). Optional metadata stripping, linearisation, and stream recompression. Live before/after split-canvas preview from page 1. Shows size reduction after download.

Try it →
🔄
Rotate Pages

Rotate all, odd, even, or a custom range. Supports 90°/180°/270° and arbitrary decimal angles. Live canvas preview of page 1.

Try it →
📃
Extract Pages

Click a thumbnail grid to select the pages you want to keep. Selections auto-compress to ranges (e.g. 1–3, 5, 7–9).

Try it →
🗑️
Delete Pages

Click the thumbnail grid to mark pages for removal. Everything else is kept.

Try it →
🔀
Reorder Pages

Drag-and-drop page thumbnails to rearrange, then export the reordered PDF.

Try it →
🔧
Repair PDF

Reconstructs corrupted or malformed PDFs via Ghostscript. On upload, PDF.js diagnoses the file client-side — checking the header, xref table, and content streams — and shows red error badges or a green "readable" confirmation before any server work runs.

Try it →
📋
Flatten PDF

Permanently bakes form fields, annotations, and layers into the page content. Client-side pre-scan shows exactly what will be flattened — field counts, annotation types, layer names — with a green "already flat" badge if nothing is found.

Try it →
🎨
Grayscale / B&W

Convert a colour PDF to grayscale or pure black-and-white. Live before/after split-canvas preview: colour on the left, grayscale simulation on the right.

Try it →
📋
N-up / Imposition

Arrange multiple PDF pages on each output sheet: 2-up, 4-up, 6-up, 8-up, 9-up, or booklet (pages re-ordered for saddle-stitch binding). Uses PyMuPDF show_pdf_page() — vector output, no rasterisation. Page size and orientation selectable.

Try it →
📐
Auto-crop & Deskew

Remove excess white margins and correct page rotation. Three modes: crop only, fix rotation only, or both. Features a per-page interactive crop editor — see the deep dive below.

Try it →

Most deskew tools apply a single global correction. This one gives you a per-page interactive editor before anything is sent to the server.

How auto-detection works

After upload, each page is rendered via PDF.js. Text extraction and PyMuPDF's bounding-box analysis across text blocks, vector paths, and raster images together detect the tight content boundary, and a 20pt safety margin is added. The result is drawn as a draggable crop box over the rendered page.

The interactive crop editor

8 drag handles: 4 corners + 4 edge midpoints. Resize the keep area in any direction.
Pan by dragging inside the box: Move the crop region without resizing it.
🔁
Apply to all pages: Normalises the current crop as proportional fractions and applies it to every page in the document — useful when all pages have consistent margins.
🔄
Reset page: Re-runs auto-detection on the current page, discarding your manual adjustment.

What gets sent to the server

Per-page overrides are sent as a JSON array of {page, x0, y0, x1, y1} in PDF display-space points. Pages without manual overrides continue to use server-side auto-detection. The rotation fix bakes the /Rotate flag into the content stream so output pages have rotation=0 in all viewers — aspect ratio and coordinate mapping are preserved for 90°/180°/270° pages via an offset target-rect approach.
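The margin, "apply to all pages", and override-payload steps above involve a little coordinate arithmetic that is easier to see in code. The sketch below is illustrative Python, not the site's implementation: the `Crop` class and all function names are hypothetical, page numbers are assumed to be 1-based, and only the `{page, x0, y0, x1, y1}` field names and the 20pt margin come from the text.

```python
import json
from dataclasses import asdict, dataclass

SAFETY_MARGIN_PT = 20.0  # margin quoted in the text


@dataclass
class Crop:
    page: int  # assumed 1-based
    x0: float
    y0: float
    x1: float
    y1: float


def with_margin(page, bbox, page_w, page_h, margin=SAFETY_MARGIN_PT):
    """Expand a detected content box by the margin, clamped to the page."""
    x0, y0, x1, y1 = bbox
    return Crop(page, max(0.0, x0 - margin), max(0.0, y0 - margin),
                min(page_w, x1 + margin), min(page_h, y1 + margin))


def to_fractions(crop, page_w, page_h):
    """Express a crop as proportions of its page, so it can be reused."""
    return (crop.x0 / page_w, crop.y0 / page_h,
            crop.x1 / page_w, crop.y1 / page_h)


def apply_to_all(crop, page_sizes):
    """Scale one page's crop onto every page; page_sizes maps page -> (w, h)."""
    src_w, src_h = page_sizes[crop.page]
    fx0, fy0, fx1, fy1 = to_fractions(crop, src_w, src_h)
    return [Crop(p, fx0 * w, fy0 * h, fx1 * w, fy1 * h)
            for p, (w, h) in sorted(page_sizes.items())]


def overrides_payload(crops):
    """Serialise crops as the JSON array of per-page overrides."""
    return json.dumps([asdict(c) for c in crops])
```

Storing the crop as fractions is what lets one manual adjustment carry over to pages of a different physical size without distorting the kept region.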

Compress PDF controls image DPI downsampling via Ghostscript and optionally applies stream-level recompression via qpdf. A live split-canvas preview renders page 1 as soon as you upload — original on the left, simulated compression on the right — so you can see the visual impact before committing to a download.

Quality presets

📱
Screen — 72 DPI: Optimised for on-screen reading. Minimum file size. Suitable for email attachments where print quality is not required.
📖
eBook — 150 DPI: Recommended preset. Balanced quality and file size — images remain sharp on screen and in basic print. Best choice for most documents.
🖨️
Printer — 300 DPI: High-quality output suitable for desktop printing. Noticeably larger than eBook but retains photographic detail at print scale.
🎨
Prepress — 300 DPI with colour profiles preserved: Maximum colour fidelity. Preserves embedded ICC profiles and applies Ghostscript's /prepress quality settings. Use for documents destined for a print shop or colour-critical workflows.
🎯
Custom — 50–600 DPI (10 DPI steps): Slider-controlled DPI for precise control. Useful when you know the exact output requirement — for example, 120 DPI for a mobile-only document or 600 DPI for archival-quality images.

Advanced options

🗑️
Strip metadata: Removes the PDF Info dictionary — title, author, subject, keywords, creator, producer, and creation/modification dates. Useful when distributing documents publicly and you want to remove any authoring footprint.
Web-optimise (linearise): Restructures the PDF so the first page is available before the full file downloads — enabling browsers to render it progressively. Adds minor overhead to file size but significantly improves perceived load time for large documents hosted online.
📦
Recompress streams (qpdf): Runs qpdf's maximum Flate (Deflate) compression across all internal PDF data streams after Ghostscript finishes. Adds 2–10% additional savings on top of image resampling. Most effective on documents with many uncompressed object streams.

What gets compressed — and what doesn't

DPI settings affect raster images embedded in the PDF. Photographs, scanned pages, and screenshots will see the largest reductions — typically 40–90%. Vector content and text are not affected by DPI; they are resolution-independent and remain identical across all presets. A text-only document will see minimal reduction regardless of preset; in that case, enabling stream recompression and metadata stripping provides the most benefit.
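As a rough illustration of how the presets and advanced options above could translate into command lines, the sketch below builds argument lists for Ghostscript and qpdf without running them. The function names are hypothetical and the exact flags the site passes are not documented here, so treat this as one plausible mapping. It relies on standard Ghostscript `pdfwrite` parameters (`-dPDFSETTINGS`, the `*ImageResolution` family) and standard qpdf options.

```python
PRESETS = {
    "screen": "/screen",      # 72 DPI
    "ebook": "/ebook",        # 150 DPI
    "printer": "/printer",    # 300 DPI
    "prepress": "/prepress",  # 300 DPI, colour profiles preserved
}


def ghostscript_args(src, dst, preset="ebook", custom_dpi=None):
    """Build a Ghostscript argv for one compression run (not executed)."""
    args = ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dQUIET"]
    if custom_dpi is not None:
        # Range and step size follow the slider described in the text.
        if not 50 <= custom_dpi <= 600 or custom_dpi % 10:
            raise ValueError("custom DPI must be 50-600 in steps of 10")
        for kind in ("Color", "Gray", "Mono"):
            args.append(f"-dDownsample{kind}Images=true")
            args.append(f"-d{kind}ImageResolution={custom_dpi}")
    else:
        args.append(f"-dPDFSETTINGS={PRESETS[preset]}")
    args += [f"-sOutputFile={dst}", src]
    return args


def qpdf_args(src, dst, linearize=False, recompress=False):
    """Build a qpdf argv for the optional post-processing pass."""
    args = ["qpdf"]
    if linearize:
        args.append("--linearize")
    if recompress:
        args += ["--recompress-flate", "--compression-level=9",
                 "--object-streams=generate"]
    args += [src, dst]
    return args
```

Returning a list rather than a shell string keeps file names out of shell parsing, which matters when the input name comes from an upload.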

Split-canvas preview

As soon as a file is uploaded, page 1 is rendered at full resolution in the browser via PDF.js. The canvas is split vertically — left half shows the original, right half simulates the compressed output at the selected DPI. The preview updates live as you switch presets. This shows the visual quality trade-off before any server processing occurs.

Split PDF supports four distinct split strategies. All modes use Poppler at the binary level — pages are extracted without re-rendering, so fonts, images, and text layers are preserved exactly as they appear in the source.

Split modes

📄
Every page: Produces one PDF per page, packaged as a ZIP. A 40-page document becomes 40 individual PDFs. Useful for splitting scanned documents into individual records or for batch processing single-page files.
🔢
Every N pages (interval): Specify a chunk size (e.g. 10) and the document divides into equal-sized pieces — with a smaller final chunk if the total is not divisible. A 45-page document split every 10 produces four 10-page PDFs and one 5-page PDF.
✂️
Interactive cut points: After upload, all pages render as thumbnails. Scissors (✂) icons appear between each pair of pages. Click a scissors icon to mark a split point — click again to remove it. Multiple split points produce multiple PDFs. A counter shows how many pieces the document will become before you submit.
📋
Custom page ranges: Enter comma-separated ranges (e.g. 1-3, 5, 7-9). Each range becomes a separate PDF. Pages not covered by any range are discarded. Useful for extracting specific named sections from a longer document.

Output

Every Page, Interval, and Interactive modes output a ZIP archive. Custom ranges output a single PDF when one range is specified, or a ZIP for multiple ranges. No re-encoding occurs — output pages are binary-identical to the source pages.
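The interval, cut-point, and custom-range modes above all reduce to the same thing: a list of inclusive page ranges, one per output PDF. The following sketch shows that planning step only. It is illustrative Python with hypothetical function names and assumes 1-based page numbers; the extraction itself is out of scope here.

```python
def interval_chunks(total_pages, n):
    """Every-N-pages mode: equal chunks with a shorter final one if needed."""
    return [(start, min(start + n - 1, total_pages))
            for start in range(1, total_pages + 1, n)]


def cut_points_to_pieces(cuts, total_pages):
    """Interactive mode: `cuts` holds page numbers after which to split."""
    pieces, start = [], 1
    for cut in sorted(set(cuts)):
        if not 1 <= cut < total_pages:
            raise ValueError(f"cut point {cut} is outside the document")
        pieces.append((start, cut))
        start = cut + 1
    pieces.append((start, total_pages))
    return pieces


def parse_ranges(spec, total_pages):
    """Custom mode: parse '1-3, 5, 7-9' into inclusive (first, last) pairs."""
    ranges = []
    for part in spec.replace("\u2013", "-").split(","):  # accept en dashes too
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"range {part!r} is outside the document")
        ranges.append((start, end))
    return ranges
```

For the 45-page example in the text, `interval_chunks(45, 10)` yields four 10-page ranges followed by the range 41 to 45.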

N-up imposition places multiple source pages onto each output sheet. Unlike tools that rasterise pages before imposing them, this tool uses PyMuPDF's show_pdf_page() — pages are placed as live PDF content. Text remains selectable and searchable in the output; images are not re-compressed.

Layouts

📄
2-up (2×1): Two source pages side by side on one landscape sheet.
📋
4-up (2×2): Four source pages in a 2-column, 2-row grid.
📋
6-up (2×3): Six source pages in a 2-column, 3-row grid.
📋
8-up (2×4): Eight source pages in a 2-column, 4-row grid.
📋
9-up (3×3): Nine source pages in a 3-column, 3-row grid.
📖
Booklet (saddle-stitch): Pages are re-ordered and paired for saddle-stitch binding — when folded and stapled, they read in correct sequence. A 16-page document becomes 4 sheets: sheet 1 has pages 16&1 on the outside and 2&15 on the inside, and so on. Output is 2-up on landscape sheets, ready for duplex printing and folding.
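The booklet re-ordering follows a simple pattern that can be written down directly. The sketch below is illustrative Python with a hypothetical function name. Padding short documents with blank pages up to a multiple of four is an assumption about how such a tool would typically behave, not something stated above.

```python
def booklet_order(page_count):
    """Return saddle-stitch sheets as ((front_left, front_right),
    (back_left, back_right)) tuples of 1-based page numbers.

    Pages beyond `page_count` (padding up to a multiple of 4) are None,
    meaning a blank side. The padding rule is an assumption.
    """
    padded = page_count + (-page_count % 4)

    def page(n):
        return n if n <= page_count else None

    sheets = []
    for i in range(padded // 4):
        front = (page(padded - 2 * i), page(2 * i + 1))
        back = (page(2 * i + 2), page(padded - 2 * i - 1))
        sheets.append((front, back))
    return sheets
```

For a 16-page document this gives four sheets, the first carrying pages 16 and 1 on the outside and 2 and 15 on the inside, matching the example in the text.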

Output options

Output page size is selectable (A4, Letter, Legal, A3, or original). Orientation (portrait/landscape) is configurable independently. Pages are auto-scaled to fit each cell while preserving aspect ratio.
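Auto-scaling a page into a grid cell while preserving aspect ratio comes down to taking the smaller of the two scale factors and centring the result. The sketch below shows that geometry in plain Python; the function names are hypothetical, the grids use the columns × rows figures from the layout list, and the actual placement call is left out.

```python
# n-up value -> (columns, rows), as listed in the layouts above
GRID = {2: (2, 1), 4: (2, 2), 6: (2, 3), 8: (2, 4), 9: (3, 3)}


def cell_rects(sheet_w, sheet_h, n_up):
    """Split a sheet into equal cells, row by row, as (x0, y0, x1, y1)."""
    cols, rows = GRID[n_up]
    cw, ch = sheet_w / cols, sheet_h / rows
    return [(c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
            for r in range(rows) for c in range(cols)]


def fit_in_cell(page_w, page_h, cell):
    """Scale a page to fit inside a cell, keep its aspect ratio, centre it."""
    x0, y0, x1, y1 = cell
    scale = min((x1 - x0) / page_w, (y1 - y0) / page_h)
    w, h = page_w * scale, page_h * scale
    ox = x0 + ((x1 - x0) - w) / 2
    oy = y0 + ((y1 - y0) - h) / 2
    return (ox, oy, ox + w, oy + h)
```

The fitted rectangle is the kind of target area a vector placement call would receive, so the source page is scaled geometrically rather than resampled.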

13 tools converting between PDF and common document, spreadsheet, presentation, image, and web formats. All conversions run via locally-installed open-source engines — LibreOffice, Ghostscript, PyMuPDF, ImageMagick.

PDF to other formats

📝
PDF → Word

Export to .docx, .odt, .rtf, or .txt via LibreOffice. A format fidelity indicator shows star ratings (out of 4) for each output format before you convert.

Try it →
📊
PDF → Excel

Extracts tables to .xlsx via LibreOffice. Best suited to PDFs where table structure is preserved in the source document.

Try it →
🖼️
PDF → Images

Renders pages to PNG or JPEG at 72–600 DPI. Select all pages or a custom range. JPEG quality slider available. Live DPI preview: page 1 is rendered in a canvas at the selected DPI immediately on upload, showing actual pixel dimensions before processing. Download as ZIP.

Try it →
🗂️
PDF → PDF/A

Convert to PDF/A-1b, PDF/A-2b, or PDF/A-3b for long-term archival (ISO 19005). Fonts are embedded, transparency is flattened, colour profiles are attached.

Try it →
🖨️
PDF → PDF/X

Convert to print-industry PDF/X (X-1a, X-3, or X-4) via Ghostscript with CMYK colour conversion, /prepress quality, and configurable render intent. All fonts embedded, colour data print-shop compliant.

Try it →
💽
PDF → PowerPoint

Each page is rendered at 150 DPI via PyMuPDF and placed as a full-bleed image on its own slide using python-pptx. Slide dimensions match the original page aspect ratio.

Try it →
🌐
PDF → HTML

Converts pages to a styled HTML document using PyMuPDF page.get_text("html"), preserving font, size, and positioned text spans. Produces a single self-contained .html file with print-friendly styling.

Try it →
📄
PDF → Markdown

Uses pymupdf4llm — the AI/LLM-optimised layout analysis engine built on PyMuPDF 1.27 + ONNX. Detects headings, paragraphs, tables, code blocks, and list structures. Produces clean .md ideal for RAG pipelines and LLM ingestion.

Try it →

Other formats to PDF

📝
Word → PDF

Convert .doc / .docx / .odt / .rtf / .txt via LibreOffice. A fidelity indicator shows expected quality for the file type you upload.

Try it →
📊
Excel → PDF

Convert .xls / .xlsx / .ods / .csv via LibreOffice. A sheet selector fetches sheet names from the uploaded file so you can choose which sheets to convert.

Try it →
💽
PowerPoint → PDF

Convert .ppt / .pptx / .odp via LibreOffice. A slide selector fetches slide titles from the uploaded file so you can choose which slides to include.

Try it →
🖼️
Images → PDF

Pack JPEG / PNG / WebP / BMP / TIFF / GIF images into a single PDF via ImageMagick. Drag thumbnails to reorder before generating.

Try it →
🌐
HTML → PDF

Upload a .html / .htm file or enter any public URL. Converted via Playwright/Chromium — full Chromium rendering engine captures modern CSS, web fonts, lazy-loaded images, and JavaScript-rendered content. Page size, orientation, and margins are configurable.

Try it →

Most PDF-to-text converters flatten the document into a single stream of characters, destroying the layout information that makes it useful — headings become plain lines, tables become scrambled text, multi-column layouts interleave content from adjacent columns. pymupdf4llm analyses the document layout before extracting text.

Engine: pymupdf4llm + ONNX

pymupdf4llm is the AI/LLM-optimised extraction layer built on PyMuPDF 1.27 with an ONNX inference backend. It analyses bounding-box positions, font sizes, column boundaries, and text flow to infer document structure before generating Markdown — the same technique used by state-of-the-art document AI pipelines.

Structural elements detected

# Headings — H1–H4 inferred from font size hierarchy
📝 Paragraphs — reflowed, not line-broken
📊 Tables — GitHub-flavored pipe syntax
💻 Code blocks — monospace font detection, triple-backtick fencing
• Bullet and numbered lists
📰 Multi-column layout handling

LLM and RAG use cases

🧠
RAG pipelines: Chunk structured Markdown by heading for higher-quality retrieval than chunking raw PDF text. LangChain and LlamaIndex consume Markdown natively with heading-aware splitters.
💬
Direct LLM ingestion: Structured Markdown preserves the relationships between headings, sub-sections, and tables that a flat text dump destroys, which can reduce hallucinations when models reason over document content.
📚
Documentation and knowledge bases: Convert internal PDF documentation to Markdown for version control, wiki import, or static site generation.
When it works best: PDFs with a native text layer (not scanned). For scanned documents, run OCR first to generate a searchable PDF, then convert to Markdown.

Convert PDF pages to individual image files using Poppler's pdftoppm renderer. A live preview shows exactly what the output will look like at the selected DPI — including the actual output pixel dimensions — before any server processing starts.

DPI options

📱
72 DPI — Screen: Smallest files. Suitable for thumbnails, web previews, or email where exact reproduction is not needed.
🖥️
96 DPI — Standard screen: Common web resolution. Matches the historical CSS reference pixel for 1:1 display on most monitors.
📸
150 DPI — Good quality (default): Recommended for most uses. Sharp enough for document review, form scanning, and presentation slides.
🖨️
300 DPI — Print quality: Full print resolution. Use when images will be printed or when fine detail — small text, thin rules — must be preserved. ZIP files will be proportionally larger.

Output formats

PNG: Lossless. All detail preserved. Recommended for technical documents, forms, and anything with sharp edges or small text.
JPEG: Lossy with a configurable quality slider (50–100%, default 85%). Considerably smaller than PNG, even at quality 80 and above. Recommended for photo-heavy PDFs or when file size is critical.

Live DPI preview with file size estimate

As soon as a file is uploaded, page 1 is rendered in the browser at the selected DPI. The size estimate card shows the actual output pixel dimensions and an estimated file size per page — for example, "1240 × 1754 px per page — ~1.5 MB per page". PNG estimates use ~0.7 bytes/pixel (lossless document content). JPEG estimates scale with the quality slider: at quality 85 the multiplier is ~0.14 bytes/pixel — considerably smaller than PNG. Changing the DPI, format, or quality slider re-runs the estimate immediately so you know exactly how large the ZIP will be before any server processing starts.
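The arithmetic behind the estimate can be sketched as follows. The bytes-per-pixel figures (0.7 for PNG, 0.14 for JPEG at quality 85) come from the description above; the linear scaling of the JPEG multiplier with quality and the function name `estimate_page` are assumptions for illustration, since the exact curve the preview uses is not stated.

```python
from typing import Tuple

PNG_BYTES_PER_PIXEL = 0.7        # from the text: lossless document content
JPEG_BYTES_PER_PIXEL_Q85 = 0.14  # from the text: multiplier at quality 85


def estimate_page(width_pt: float, height_pt: float, dpi: int,
                  fmt: str = "png", quality: int = 85) -> Tuple[int, int, int]:
    """Return (width_px, height_px, estimated_bytes) for one rendered page."""
    # PDF page sizes are in points; 72 points make one inch.
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    pixels = width_px * height_px
    if fmt == "png":
        per_pixel = PNG_BYTES_PER_PIXEL
    elif fmt == "jpeg":
        # Assumption: the multiplier scales linearly with the quality slider.
        per_pixel = JPEG_BYTES_PER_PIXEL_Q85 * quality / 85
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return width_px, height_px, round(pixels * per_pixel)


# An A4 page (595.276 x 841.89 pt) at 150 DPI renders to 1240 x 1754 px.
a4_png = estimate_page(595.276, 841.89, 150, "png")
a4_jpeg = estimate_page(595.276, 841.89, 150, "jpeg", quality=85)
```

For A4 at 150 DPI the PNG figure comes out at roughly 1.5 MB, which matches the example shown in the size estimate card.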

Page selection and output

All pages or a custom range (e.g. 1-3, 5, 7-10). Output is a ZIP archive containing one image per page, named sequentially — page-001.png, page-002.png, etc.
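A custom range such as "1-3, 5, 7-10" can be expanded into page numbers along these lines. This is a sketch, not the site's parser: `parse_page_range` and `image_name` are hypothetical helpers, and naming files by their original page number is an assumption about what "sequentially" means.

```python
from typing import List


def parse_page_range(spec: str, page_count: int) -> List[int]:
    """Expand a spec like '1-3, 5, 7-10' into a sorted list of 1-based pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def image_name(page: int, extension: str) -> str:
    """Zero-padded file name in the page-001.png style."""
    return f"page-{page:03d}.{extension}"
```

Rejecting out-of-bounds or reversed ranges up front avoids starting a render job that cannot complete.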

PDF/X is the ISO standard for print exchange (ISO 15930). It constrains the PDF feature set to what is reliably reproducible by commercial presses: no unmanaged RGB content (and no RGB at all in PDF/X-1a), all fonts embedded, transparency flattened in most variants. This tool converts via Ghostscript's /prepress quality settings with configurable render intent.

PDF/X standards

🖨️
PDF/X-3 (recommended)
ICC colour management allowed. RGB content is converted to DeviceCMYK using the selected render intent. Widest acceptance among print shops, and the default for most commercial print submissions.
🖨️
PDF/X-1a
CMYK and spot colours only: no ICC profiles, no RGB content of any kind. The strictest standard. Required by some newspaper and magazine publishers for predictable ink coverage.
🖨️
PDF/X-4
Extends PDF/X-3 to allow live transparency and layers. Modern print workflows that support PDF/X-4 handle transparency natively, preserving edge quality on gradients and drop-shadows without flattening artefacts.

Render intent — controls RGB → CMYK mapping

📋
Relative Colorimetric (default)
Standard press intent. Clips out-of-gamut colours to the nearest reproducible value. White-point adapted to the output profile. Best for most business documents.
📸
Perceptual
Compresses the entire colour gamut to fit within CMYK, preserving relative colour relationships. Out-of-gamut colours are not clipped; instead the whole image shifts slightly to maintain harmony. Best for photographs.
📊
Saturation
Prioritises vivid, saturated colours over accuracy. Best for business graphics, charts, and presentations where colour impact matters more than fidelity.
🧪
Absolute Colorimetric
No white-point adaptation: reproduces colours exactly as defined in the source profile, including paper-white simulation. Used for proofing and colour matching against a specific reference.

What the conversion does

All RGB images are converted to DeviceCMYK. All fonts are embedded and subsetted. Transparency is flattened (PDF/X-1a and X-3). Ghostscript's /prepress preset is applied. The resulting file meets the ISO 15930 constraint set for the selected variant and is accepted by commercial RIP workflows.
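For orientation, a conversion of this kind could be driven by a Ghostscript command assembled roughly as below. This is a sketch under assumptions: the exact flags this tool passes are not documented here, `build_pdfx_command` is a hypothetical helper, and `PDFX_def.ps` stands for a PDF/X definition file that would need to be prepared separately. The numeric values follow Ghostscript's `-dRenderIntent` option (0 perceptual, 1 relative colorimetric, 2 saturation, 3 absolute colorimetric).

```python
from typing import List

# Render intent names used in this guide, mapped to Ghostscript's numbering.
RENDER_INTENTS = {
    "perceptual": 0,
    "relative": 1,
    "saturation": 2,
    "absolute": 3,
}


def build_pdfx_command(src: str, dst: str, intent: str = "relative",
                       pdfx_def: str = "PDFX_def.ps") -> List[str]:
    """Build (but do not run) a Ghostscript argv for a CMYK PDF/X conversion."""
    if intent not in RENDER_INTENTS:
        raise ValueError(f"unknown render intent: {intent}")
    return [
        "gs", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFX",                              # request PDF/X output
        "-dPDFSETTINGS=/prepress",             # the /prepress preset
        "-sColorConversionStrategy=CMYK",      # RGB content becomes DeviceCMYK
        f"-dRenderIntent={RENDER_INTENTS[intent]}",
        f"-sOutputFile={dst}",
        pdfx_def,                              # assumed definition file
        src,
    ]
```

The list form can be handed to `subprocess.run` without shell quoting; available options differ between Ghostscript releases, so check the installed version's documentation before relying on any single flag.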

Six tools covering PDF forensics & analysis, encryption, decryption, permanent content removal, watermarking, and cryptographic signing. All run server-side on local engines — nothing is sent to a third-party service.

🔬
PDF Forensics Scanner

Forensic analysis across 44 independent engines — structural, behavioural, provenance, ML anomaly detection with SHAP, local threat intelligence (URLhaus · MalwareBazaar · ThreatFox — 6.4M+ indicators, no external APIs), AcroForm field forensics, PDF signature forensics, phishing detection, embedded file analysis, and TLSH campaign attribution. MITRE ATT&CK mapping on every indicator. Results across 24 analysis tabs including 🤖 AI Forensic Report — Qwen 2.5 1.5B Instruct synthesises all 44 engine outputs into a structured verdict, with semantic context from live engine data: actual phishing phrases, JavaScript call targets, embedded payload strings, FormCalc code, and SHAP feature explanations fed directly to the model. Verdict is exec-vector-aware (high score with no execution vector caps at LIKELY_CLEAN). MALICIOUS verdict auto-labels the record as 'malicious'; CLEAN/LIKELY_CLEAN as 'benign'; SUSPICIOUS is not labeled (ambiguous). Triggers ML retrain at threshold — no user input needed. 9-mode sanitize: flatten to images, strip active content, remove JavaScript, remove embedded files, remove XFA, remove rich media, normalize structure, flatten forms, or strip metadata. The most technically deep tool on the site — see the deep dive.

🛡️
Protect PDF

Two modes: AES-256-CBC server-side with granular permissions, or client-side post-quantum encryption with 31 algorithms. In PQC mode the server never sees your plaintext. See the deep dive.

🔓
Unlock PDF

Remove password protection (owner password required). Detects encryption type client-side before upload — shows AES-256 or PQC badge. PQC bundles (.pqcpdf) are auto-detected and routed to the quantum-safe decryption panel.

Redact PDF

Two modes: text-pattern redaction (multi-pattern list, case sensitivity, whole-word matching) or mouse-drawn region redaction on a canvas preview. Redaction is permanent — content is erased server-side, not just covered. Includes 🤖 AI Redaction Suggestions — Qwen 2.5 1.5B analyses extracted text and proposes redaction patterns by PII category (names, emails, IDs, financial data, and more) with one-click add to the redaction list.

💧
Add Watermark

Stamp text watermarks with 8-position placement, opacity, rotation, font size, font style, and hex colour. Apply to all, odd, even, or custom page ranges. Live canvas preview updates in real time as you adjust settings.

✍️
Sign PDF & PAdES

Four signature modes: draw, type, upload image, or invisible PAdES cryptographic signature. All modes support RSA-2048 certificates — auto-generated or your own .p12. See the deep dive.


PDF is the most abused document format for delivering malware. This forensics scanner runs 44 independent engines covering every investigative dimension: byte-level signatures, structural integrity, sliding-window entropy, provenance analysis, dynamic behavioural tracing, machine learning anomaly detection with SHAP explanations (IsolationForest + RandomForest + LightGBM), multi-parser differential analysis across six independent parsers, fully offline threat intelligence (URLhaus · MalwareBazaar · ThreatFox · FeodoTracker · OpenPhish; 6.4M+ indicators, zero external API calls), PDF digital signature forensics, phishing detection, AcroForm field forensics (JS triggers on field events, SubmitForm exfiltration targets, hidden fields, password fields, /AA hooks, calc-order chain exploitation), embedded file analysis (magic-byte classification, VBA macro detection, full ZIP archive content listing, nested PDF detection, PowerShell content analysis), and TLSH + pHash + JS-fingerprint campaign attribution. Every indicator is tagged with MITRE ATT&CK technique IDs.

Results are presented across 24 analysis tabs: Summary, Threats, Score, a per-engine two-panel browser (click any of the 44 engines for its full findings + structure fields), URLs, Streams, ML/SHAP, Sandbox, Threat Intel, MITRE, Differential Parsing, Polyglot, Phishing, Embedded Files, Signature Forensics, Revision History, Annotations, Metadata, XFA FormCalc, Action Graph, Deep Forensics (engines 34–43), 🤖 AI Forensic Report, Raw JSON, and a Raw Forensics view showing decoded stream content, JavaScript sources, all indicator contexts, and the complete structure dump.

In the AI Forensic Report, Qwen 2.5 1.5B Instruct synthesises all 44 engine outputs into threat verdict, confidence rating, executive summary, key findings, MITRE technique grid, and recommended actions. It runs fully locally, emits structured JSON output, and takes ~15–25 s on CPU.

File bytes never leave the server: no hash or data is sent to any external service at any point.
Results are forensic-grade: each indicator is documented with engine source, severity, and contextual explanation. File size limit: 10 MB. Threat intelligence research (MalwareBazaar corpus, HP Wolf Security telemetry, Contagio malware archive) consistently shows real-world malicious PDFs are under 5 MB — exploit-kit payloads average 200 KB–1 MB, phishing lures 300 KB–4 MB, dropper PDFs up to 8 MB. The 10 MB cap covers every known threat class with 2× headroom. Scanning larger files requires enterprise deployment.

After a scan, a 9-mode sanitize panel appears. Basic: Flatten to Images (PyMuPDF raster rebuild — maximum safety, destroys all active content) · Strip Active Content (Ghostscript -dSAFER — moderate safety, text usually retained). Advanced — Surgical Cleaning: Remove JavaScript (/JS /AA nullified, layout preserved) · Remove Embedded Files (all /EmbeddedFile attachments) · Remove XFA Forms (/XFA definitions) · Remove Rich Media (/RichMedia /Movie /Sound) · Normalize Structure (qpdf rebuild — collapses incremental updates, disables object streams, decodes filter chains) · Flatten Forms (PyMuPDF bake() renders AcroForm widgets to static content) · Strip Metadata (/Info + XMP stream). All modes produce a new file; the original is never modified.

The 44 engines

Engine 1 Structure Validator Validates fundamental file structure before any content analysis: %PDF- header position (flagged if beyond byte offset 1,024), %%EOF marker count (>2 indicates incremental update stacking or exploit layering), xref table depth (>3 flagged), obfuscation codec count (ASCIIHexDecode / ASCII85Decode / LZWDecode >3 flagged on non-image streams — image XObjects are excluded since they legitimately use these codecs as standard output from PDF generators such as ReportLab and Ghostscript), and excessive filter chains (>120 /Filter entries). Proportional incremental injection: flags if the final revision adds >10 new objects compared to prior revisions — a disproportionately large final update is a strong indicator of post-signing payload injection. Collects: PDF version, linearised flag, binary comment presence.
Engine 2 Raw Pattern Scanner Scans raw file bytes for 45+ known-malicious byte sequences in six categories. JavaScript execution: /JavaScript, /JS, /Launch, /OpenAction, /AA. Remote & form actions: /GoToR, /SubmitForm, /ImportData, /Rendition, /Hide. Embedded & rich content: /EmbeddedFile, /RichMedia, /XFA, /AcroForm. Obfuscation: /ObjStm, /JBIG2Decode, /ASCIIHexDecode. Dangerous JS APIs: unescape(), eval(), String.fromCharCode, collab.getIcon (CVE-2009-0927), util.printf (CVE-2008-2992), media.newPlayer (CVE-2009-4324), Collab.collectEmailInfo (CVE-2007-5659). Shellcode: %u9090 (Unicode NOP sled), %u4141, %u0c0c%u0c0c heap-fill patterns. Evasion patterns: /Trans with JavaScript (page-transition trigger used to execute JS while evading action-based detection); /OpenAction hidden inside an AcroForm /DR indirect reference (indirect variant bypasses naive dictionary-key scanners). Each match records a context snippet (20 bytes before, 60 bytes after) for the Threats tab.
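The context-snippet idea can be illustrated with a few lines of Python. This is a sketch rather than the engine's code: `find_with_context` is a hypothetical name, and counting the 60 trailing bytes from the end of the match (not its start) is an assumption.

```python
from typing import List, Tuple


def find_with_context(data: bytes, pattern: bytes,
                      before: int = 20, after: int = 60) -> List[Tuple[int, bytes]]:
    """Return (offset, snippet) for every occurrence of pattern in data."""
    hits = []
    start = data.find(pattern)
    while start != -1:
        lo = max(0, start - before)                        # clamp at file start
        hi = min(len(data), start + len(pattern) + after)  # clamp at file end
        hits.append((start, data[lo:hi]))
        start = data.find(pattern, start + 1)
    return hits
```

Clamping at both ends keeps matches near the start or end of the file from raising or wrapping around.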
Engine 3 Stream Decompressor & Content Inspector Opens every object in the xref graph (up to 6,000 objects) via PyMuPDF and decompresses each stream via doc.xref_stream(xref) — catching JavaScript and shellcode hidden inside compressed objects that raw-byte scanners miss entirely. Calculates entropy using 512-byte sliding windows; any window exceeding 7.6 bits/byte on non-image streams flags encrypted, packed, or obfuscated payloads (detects shellcode splices that average out in whole-stream analysis). Decompression bomb detection flags streams with >500:1 compression ratio; image XObjects (/Subtype /Image, DCT, JPX, CCITT, JBIG2) are excluded from both the entropy check and the decompression bomb check — uniform-fill or solid-colour images legitimately achieve extreme compression ratios at near-zero entropy. Scans decompressed content for 14 JS/shellcode signatures. Returns up to 40 streams with xref number, entropy, type, and matched patterns.
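To show why a sliding window catches what a whole-stream average hides, here is a small sketch of windowed Shannon entropy. The 512-byte window and the 7.6 bits/byte threshold come from the description above; the 256-byte step and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum(n / total * math.log2(n / total) for n in Counter(chunk).values())


def max_window_entropy(data: bytes, window: int = 512, step: int = 256) -> float:
    """Highest entropy seen in any window; short inputs are scored whole."""
    last_start = max(1, len(data) - window + 1)
    return max(
        (shannon_entropy(data[i:i + window]) for i in range(0, last_start, step)),
        default=0.0,
    )


# A 512-byte high-entropy splice buried in 8 KB of zero padding.
splice = bytes(range(256)) * 2
stream = b"\x00" * 4096 + splice + b"\x00" * 4096
flagged = max_window_entropy(stream) > 7.6
```

Scored as a whole, the sample stream looks almost empty, while the window that lands on the splice exceeds the threshold.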
Engine 4 Object Graph Traversal Maps parent/child object relationships across the xref graph and flags abnormal nesting depth, circular references, and shadow object trees.
Engine 5 URL Extractor URL extraction from all object streams — flags known malicious domains and suspicious URL patterns. Detects data: URI schemes (data:text/html, data:application/*) that deliver payloads without network requests, bypassing URL-reputation filters. Also detects hex-encoded URLs in JavaScript (\x68\x74\x74\x70 = "http") used to hide C2 addresses from static scanners.
Engine 6 Metadata Analyzer Extracts and cross-validates all PDF Info dictionary and XMP metadata fields. Creation vs. modification timestamp delta analysis: a gap of 0–5 seconds between CreationDate and ModDate indicates scripted, automated document generation, a common characteristic of malware factory pipelines.
Engine 7 Font Analyzer Checks unusual font names, encoding flags, and embedding status. Font objects are a common exploit carrier: malformed font tables trigger heap corruption in viewer rendering engines (e.g. CVE-2010-2883, Type1C font vulnerabilities). JBIG2 exploit detection follows indirect /FontFile* references: some exploits store the JBIG2-filtered stream on a separate object pointed to by the font dict rather than embedding the filter directly, and both forms are caught.
Engine 8 CVE Pattern Matcher Byte-level CVE signature matching: known exploit patterns for CVE-2009-0658 (JBIG2), CVE-2009-4324 (/OpenAction JS), CVE-2010-2883 (font), and other historically weaponised PDF CVEs.
Engine 9 Structural Statistics Object-to-page ratio heuristic: >50 objects per page is anomalous and flags potential exploit payload inflation. Zero-page detection: a PDF with 0 pages is a pure exploit payload with no legitimate document content (critical severity).
Engine 10 ExifTool Forensics Deep EXIF/XMP metadata forensics via ExifTool 12. Detects metadata inconsistencies, hidden authoring tool footprints, GPS data, and fields that conflict with PDF structure — useful for provenance analysis.
Engine 11 — qpdf Structural Integrity qpdf binary-level structural analysis — detects object stream corruption, incorrect xref table offsets, overlapping object definitions, and linearisation anomalies that indicate deliberate file manipulation.
Engine 12 — Signatures YARA Rule Matching (YARA 4.5) 24 custom YARA rules targeting PDF-specific exploit patterns, obfuscated JavaScript payloads, embedded binary signatures, CVE-specific byte patterns (CVE-2009-0658, CVE-2008-2992, CVE-2010-1240, CVE-2018-4990, CVE-2021 XFA, CVE-2024-41869 UAF, CVE-2024-45112 type confusion), PowerShell stager patterns, Cobalt Strike beacon signatures, and multi-stage dropper structures. External .yar rule files are loaded from a configured rules directory. Rules cover patterns not caught by byte-string matching alone.
Engine 13 — Deep Object PeePDF + pikepdf Analysis PeePDF (v0.4) deep object analysis — decodes compressed object streams (/ObjStm), reconstructs the internal object graph, and analyses suspicious cross-references, duplicate object definitions, and object version stacking. Supplemented by pikepdf (a modern libqpdf-based Python parser) which independently extracts the JavaScript Names tree, counts embedded file attachments, detects per-page /AA triggers, and provides a second independent indicator set. Crash/timeout behaviour of each parser is tracked separately.
Engine 14 — Sandbox Dynamic Behavioural Sandbox — 6 Renderers The PDF is rendered through six independent engines — Ghostscript, MuPDF, Poppler, LibreOffice Draw, Chromium PDFium, and pdf.js/Node — each inside an isolated Linux namespace via unshare --net --pid --mount with all syscalls captured by strace. The network namespace makes any connect() or sendto() syscall definitively malicious — there is no legitimate reason for a PDF renderer to initiate network contact in an isolated namespace. Detects: outbound C2 beacons, anonymous executable memory mappings (shellcode staging), unauthorised process spawning (code execution), filesystem escape attempts, DNS lookups, and fork-bomb patterns. PDFium (Playwright/Chromium) covers the Chrome browser attack surface — where most users now open PDFs. pdf.js/Node covers the Firefox/Mozilla rendering engine. LibreOffice Draw exposes OLE macro and embedded content paths. When all renderers complete without triggering, a confirmed clean result is explicitly surfaced so analysts know the sandbox ran successfully.
Engine 15 — Signatures ClamAV 1.4+ ClamAV signature scanning against 700,000+ malware signatures via local clamdscan daemon. The clamav user is a member of the www-data group so the daemon reads upload files directly — no --fdpass needed, no fallback to the slow single-process scanner. The only engine that makes external calls — and only for signature database updates via clamav.net, never for file analysis.
Engine 16 — ML ML Intelligence Engine Extracts a 38-dimensional feature vector from all preceding engine outputs. Applies four models: IsolationForest (unsupervised anomaly detection — works from scan 1, no labelled data required), RandomForest classifier (supervised — activates at ≥10 labelled samples; bootstrap pseudo-labeling supplements the set when below threshold but ≥1 malicious label exists), LightGBM (gradient-boosted ensemble with class-imbalance weighting, RF+LightGBM scores are averaged), and Bayesian contextual scoring. SHAP explanations use TreeExplainer for RandomForest/LightGBM and KernelExplainer (nsamples=50) for IsolationForest. Model drift detection warns when models have not been retrained in >30 days. Feature vectors and auto-inferred labels are persisted to PostgreSQL. Models retrain every 30 minutes via cron. No file content, filename, hash, or PII stored.
Engine 17 — Differential Multi-Parser Comparison (6 parsers, 8 dimensions) Runs MuPDF (mutool), Poppler (pdfinfo/pdfdetach), Ghostscript, qpdf, pdfminer, and Node.js pdf.js across 8 structural dimensions simultaneously: page count, object count, JavaScript presence, PDF version, encryption status, AcroForm presence, embedded file count, and OpenAction. Seven distinct discrepancy checks (Critical/High/Medium) flag hidden objects, shadow object trees, or deliberate parser-confusion exploits. Page delta scoring is weighted by magnitude (up to +70/critical for >50 page delta). A hard 30-second SIGALRM wraps the engine; pdfminer runs in a subprocess with timeout 6 for guaranteed hard-kill.
Engine 18 Polyglot / Binary Detector Scans every stream (raw and decompressed) for file magic byte signatures: ZIP, Windows PE, Linux ELF, Mach-O, Java class, OLE/CFBF, RAR, 7-Zip, embedded PostScript, HTML/XHTML, WebAssembly (\x00asm), and Python bytecode. Also performs mid-stream scanning at non-zero offsets to catch payloads prefixed by junk bytes. JAR files are detected via ZIP + META-INF/MANIFEST.MF. Detects polyglot files that embed executable droppers inside a valid PDF container.
Engine 19 — AST JavaScript AST Deobfuscator (Acorn) JavaScript extracted from /JS literals and keyword-bearing compressed streams (with Unicode \uXXXX pre-processing) is parsed into an AST via Acorn (Node.js, ECMAScript 2022) and walked for obfuscation constructs invisible to pattern-matching: eval() chains, String.fromCharCode() arrays (shellcode staging), unescape() decode pipelines, large numeric arrays (heap spray), new Function() dynamic construction, atob()/btoa() base64 decode chains, and property accessor obfuscation — including the split-string concatenation technique (window["ev"+"al"]) used to evade static keyword detection. Performs 6 iterative deobfuscation passes (each pass feeds its output into the next) to unravel multi-layer obfuscation chains. Also detects anti-sandbox patterns (app.platform, screen.width, navigator.*) and executes multi-stage eval chains in a Node.js VM sandbox to decode obfuscated payloads statically hidden from pattern-matching.
Engine 20 — TI Threat Intelligence (URLhaus · MalwareBazaar · ThreatFox — local, no external APIs) Queries four local PostgreSQL databases — no external API calls per scan, no rate limits, sub-millisecond lookups. URLhaus hashes (5M+ SHA-256 malware payload hashes), URLhaus URLs (70K+ malicious URLs, refreshed every 30 min), MalwareBazaar (1M+ confirmed malware samples with family labels), ThreatFox IOCs (176K+ hashes, URLs, and domains with malware families). All four feeds are downloaded in bulk and kept current by cron. SHA-256 hash matches are treated as definitive: they raise a critical indicator, auto-label the scan as malicious, and feed ML retraining. Domain-level matches (URL / C2 host lookups) raise a high indicator but do not auto-label — major trusted hosting platforms (GitHub, Google, Microsoft, Dropbox, etc.) are allowlisted to prevent false positives from PDFs that legitimately link to those domains. Every indicator is mapped to a MITRE ATT&CK technique ID.
Engine 21 — SigForensics PDF Signature Forensics (pyhanko) Deep forensics on PDF digital signatures. Computes ByteRange coverage — if byte ranges declared in the signature don't cover the full file, the gap contains unsigned content (shadow document attack, CVE-2019-14980 class). ByteRange gap size analysis: a gap exceeding 20% of the total file size indicates a substantial block of content hidden outside the signed region, raising the severity to critical. Diffs the object inventory across every incremental update revision after the signature to detect execution vectors (/JavaScript, /Launch, /OpenAction, /EmbeddedFile) added post-signing. A signed-then-modified document with active content is critical.
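The ByteRange arithmetic can be sketched as below. A signature's /ByteRange holds two (offset, length) pairs; whatever lies outside them is not covered by the digest. The names `parse_byte_range` and `coverage_report` are hypothetical, and treating the "gap" as all uncovered bytes (the hole that holds the signature value plus any trailing bytes) is an assumption about how the engine measures it.

```python
import re
from typing import Dict, Optional, Tuple

BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def parse_byte_range(data: bytes) -> Optional[Tuple[int, ...]]:
    """Return the first /ByteRange array found as four integers, or None."""
    match = BYTE_RANGE.search(data)
    return tuple(int(g) for g in match.groups()) if match else None


def coverage_report(byte_range: Tuple[int, ...], file_size: int,
                    critical_fraction: float = 0.20) -> Dict[str, object]:
    """Measure how much of the file falls outside the signed byte ranges."""
    first_start, first_len, second_start, second_len = byte_range
    # Bytes between the two ranges: normally just the signature value itself.
    signature_hole = second_start - (first_start + first_len)
    # Bytes after the second range: content appended after signing.
    trailing = file_size - (second_start + second_len)
    fraction = (signature_hole + trailing) / file_size
    return {
        "signature_hole": signature_hole,
        "trailing_unsigned": trailing,
        "uncovered_fraction": fraction,
        "critical": fraction > critical_fraction,
    }
```

A non-zero `trailing_unsigned` value is the usual sign of an incremental update added after the signature was applied.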
Engine 22 — Phishing Phishing Detection (urgency · brand impersonation · credential harvesting · QR codes) Multi-vector phishing analysis: 30+ urgency/deception phrases; brand impersonation keywords (Microsoft, Apple, PayPal, DocuSign, Adobe, DHL, IRS, and others); AcroForm credential harvesting — SubmitForm action + password-type field detection; QR code extraction and decoding via zbarimg with suspicious domain scoring. High urgency phrase density combined with brand impersonation scores as high-confidence phishing.
Engine 23 — EmbeddedFiles Embedded File Analysis (pdfdetach · magic bytes · VBA macros · ZIP content listing · nested PDF detection) Uses pdfdetach (Poppler) to extract every embedded file attachment. Inspects each for magic bytes: Windows PE (MZ), Linux ELF (\x7fELF), OLE/CFBF (\xd0\xcf), OOXML archives, script files (.bat, .ps1, .vbs, .sh), RAR, 7-Zip. Detects VBA macros in OOXML Office attachments (vbaProject.bin). Non-OOXML ZIP archives have their full contents listed (up to 50 entries) and are scanned for dangerous files (.exe, .dll, .ps1, .vbs, etc.) — flagged Critical when dropper files are present. Nested PDFs (embedded PDF documents) are detected and flagged — nested PDFs can carry independent malicious payloads processed outside outer-document defences. PowerShell .ps1 content analysis: embedded scripts are scanned for high-risk patterns including Invoke-Expression, DownloadString, and -ExecutionPolicy Bypass — all common stager and downloader primitives. Extracts readable strings from executables to surface suspicious API calls or IP addresses. A PDF carrying a PE executable is a confirmed dropper — scored critical.
Engine 24 — Campaign Campaign Attribution (TLSH fuzzy hash) Three similarity fingerprints are computed and compared against confirmed-malicious history in PostgreSQL: TLSH (full-PDF locality-sensitive hash — score <30 = near-identical, <100 = same campaign), pHash (perceptual hash of each page thumbnail via imagehash — hamming distance ≤8 = visual match, detects rebranded or re-formatted copies), and JS fingerprint (MD5 of sorted, normalised JavaScript fragments — catches code-reuse across campaigns). Self-matches are excluded: TLSH distance=0 (identical content, same file previously labeled malicious) is skipped to prevent a file from being flagged solely because it was scanned before. Campaign name is surfaced from MalwareBazaar family labels when a cluster match is found. Falls back to structural fingerprint for small files or when TLSH is unavailable.
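Two of the three fingerprints are simple enough to sketch with the standard library. The JS fingerprint below hashes sorted, normalised fragments with MD5 as described; stripping all whitespace as the normalisation step is an assumption, as are the function names. The Hamming helper shows how two perceptual hashes, taken here as plain integers, would be compared against the ≤8 threshold.

```python
import hashlib
import re
from typing import Iterable


def js_fingerprint(fragments: Iterable[str]) -> str:
    """MD5 over sorted, whitespace-stripped JavaScript fragments."""
    normalised = sorted(re.sub(r"\s+", "", f) for f in fragments if f.strip())
    return hashlib.md5("\n".join(normalised).encode("utf-8")).hexdigest()


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_visual_match(hash_a: int, hash_b: int, limit: int = 8) -> bool:
    """Apply the 'distance of 8 or less' rule from the text."""
    return hamming_distance(hash_a, hash_b) <= limit
```

Sorting before hashing makes the fingerprint independent of the order in which fragments were extracted, so the same script set produces the same value across files.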
Engine 25 — AcroForm AcroForm Field Forensics Deep analysis of interactive form fields across all pages via PyMuPDF widget enumeration. Detects JavaScript on field objects (/A and /AA dictionaries — JS fires on focus, blur, keystroke, validate, or calculate events, invisible during static review but executing in any Acrobat-compatible viewer); hidden NoExport fields (present in submitted data but not displayed to the user); password-type fields (credential harvesting indicators); SubmitForm exfiltration targets — the URL(s) to which all form field data is POSTed; /AA additional-action JS triggers on field objects (a secondary execution vector independent of /OpenAction); and calculation order (/CO) exploitation — adversaries reorder field calculations to chain JS evaluations across fields, enabling multi-step payload staging hidden entirely within form arithmetic. SubmitForm target URLs are scanned and flagged for external HTTP destinations. Results feed into the Correlation Engine.
Engine 26 — RevHistory Document Revision History Splits the PDF at each %%EOF boundary and extracts per-revision metadata: author, producer, modification date, and new/modified/deleted object counts for each incremental update. Detects author identity changes between revisions, execution vectors injected (/JavaScript, /Launch, /EmbeddedFile, /OpenAction) after the original document was created, and large late-stage object injections in the final revision — the structural signature of automated exploit staging. Injection depth (revision number) is recorded for each vector. Results feed into the Correlation Engine.
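Splitting at %%EOF boundaries can be illustrated with a short sketch. It is deliberately naive: a real implementation would need to ignore %%EOF sequences that appear inside stream data, and the helper names here are hypothetical.

```python
import re
from typing import List, Tuple

EOF_MARKER = re.compile(rb"%%EOF[ \t]*\r?\n?")
VECTORS = (b"/JavaScript", b"/Launch", b"/EmbeddedFile", b"/OpenAction")


def split_revisions(data: bytes) -> List[bytes]:
    """Cut the file after each %%EOF marker; each piece is one revision."""
    revisions = []
    start = 0
    for match in EOF_MARKER.finditer(data):
        revisions.append(data[start:match.end()])
        start = match.end()
    return revisions


def injected_vectors(data: bytes) -> List[Tuple[int, str]]:
    """List (revision_number, keyword) for vectors added after revision 1."""
    findings = []
    for number, revision in enumerate(split_revisions(data), start=1):
        if number == 1:
            continue  # the original document is the baseline, not an injection
        for keyword in VECTORS:
            if keyword in revision:
                findings.append((number, keyword.decode("ascii")))
    return findings
```

The revision number returned with each keyword corresponds to the injection depth described above.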
Engine 27 — Annotations Annotation Forensics Enumerates every /Annot object across all pages and forensically analyses each action dictionary. Detects dangerous URI schemes (javascript:, data:, file://, vbscript:); JavaScript action triggers on annotation interaction; /Launch actions that spawn arbitrary programs; GoToR remote links that open external files; and SubmitForm actions that exfiltrate form data to external servers. Also inspects the /T (author/title) field of every annotation for XSS payloads — matching the CVE-2025-70401 attack vector in which PDF viewers pass the annotation author string through DOM reconciliation into innerHTML without sanitisation, executing injected scripts on every component re-render. Checks 15 patterns including <script>, onerror=, <svg><foreignObject> (the bypass used in the disclosed PoC), javascript:, and percent/unicode-encoded variants; handles both literal and hex-encoded /T values. Annotation-borne payloads are completely invisible to scanners that only analyse raw bytes or page content streams. Results feed into the Correlation Engine.
Engine 28 — NamedTree Named Tree Analysis Catalogues the full PDF action infrastructure: Named JavaScript Registry (/Names /JavaScript subtree — persistent JS objects callable by name from any action); /AA Additional Actions count (event-driven triggers on page open/close, print, save, field events); /OpenAction type classification (JavaScript, Launch, GoToR, URI, GoTo); DocMDP modification prevention signatures that lock out sanitizers; /Perms cryptographic permission restrictions; and UR3 usage-rights signatures used to exploit extended viewer features. Results feed into the Correlation Engine.
Engine 29 — ContentStream Content Stream Forensics Inspects all decompressed content streams for dangerous PostScript execution operators: exec (dynamic code execution), run (file execution, detected as the sequence ") run", that is, a closing string parenthesis followed by run, so an explicit filename string argument is required and the English word "run" appearing in page content does not cause false positives), token (string-to-code eval), setpagedevice (PostScript-to-system passthrough that bridges to the PostScript interpreter from PDF context), def. Also detects ICC color profile abuse: malformed /ICCBased profiles of anomalous size exploit heap buffer overflows (CVE-2021-21017 class). Flags content bombs: non-image streams exceeding 5 MB that may exhaust parser memory or conceal oversized payloads (image XObjects are excluded because large raster data is expected). Results feed into the Correlation Engine.
Engine 30 — ObjStm Object Stream Analysis PDF 1.5+ allows multiple objects to be compressed together in a single /ObjStm stream. Scanners that only search raw bytes will miss any object inside a compressed container. This engine decompresses every /ObjStm and re-scans the decompressed content for JavaScript, /Launch actions, /EmbeddedFile references, and high-entropy payloads (entropy >7.5 bits) that suggest encrypted content hidden inside compressed object bundles. Complements the Stream Inspector (Engine 3) with object-container-specific forensics. Results feed into the Correlation Engine.
Engine 31 — TokObf PDF Token Obfuscation Detector Decodes all PDF name token hex-escape sequences (/J#61vaScript → /JavaScript) and checks decoded names against a dangerous-keyword list: JavaScript, Launch, OpenAction, EmbeddedFile, AA, URI, SubmitForm, ImportData, GoToR, RichMedia, and others. Counts total hex-encoded name tokens, dangerous-keyword obfuscations, and unique obfuscated forms. Also detects whitespace-split keyword injection: byte sequences like /Java\nscript or /Lau\tnch in the raw byte stream that evade simple string scanners; detection requires at least one actual whitespace character inside the keyword (a zero-width match would flag every normal /JavaScript token). Scans outside compressed stream bodies for formfeed byte injection (0x0C) and null bytes in the PDF header region. Stream bodies are excluded since FlateDecode binary data naturally contains these bytes; both are classic evasion markers when found in the structural token layer. Excessive hex-encoded name tokens are flagged at a threshold of >500 tokens (benign PDF generators such as ReportLab routinely hex-encode colour names and resource keys, so only counts far beyond normal generator output are reported, at low severity). Every obfuscated dangerous keyword triggers a Critical indicator. Results feed into the Correlation Engine.
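Name-token decoding is mechanical: in a PDF name, "#" followed by two hex digits stands for a single byte. The sketch below decodes such names and reports the ones that resolve to a dangerous keyword. The keyword set is a subset of the list above, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# A name runs from '/' up to the next whitespace or PDF delimiter character.
NAME = re.compile(rb"/([^\s/\[\]()<>{}%]+)")
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")
DANGEROUS = {b"JavaScript", b"JS", b"Launch", b"OpenAction", b"EmbeddedFile",
             b"AA", b"URI", b"SubmitForm", b"ImportData", b"GoToR", b"RichMedia"}


def decode_name(raw: bytes) -> bytes:
    """Replace every #XX escape in a name token with the byte it encodes."""
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def obfuscated_dangerous_names(data: bytes) -> List[Tuple[str, str]]:
    """Return (raw, decoded) pairs for hex-escaped names that hide a keyword."""
    hits = []
    for match in NAME.finditer(data):
        raw = match.group(1)
        if b"#" not in raw:
            continue  # plain names are handled by ordinary pattern scanning
        decoded = decode_name(raw)
        if decoded in DANGEROUS:
            hits.append((raw.decode("latin-1"), decoded.decode("latin-1")))
    return hits
```

Only names containing an escape are reported, which keeps ordinary /JavaScript tokens out of this particular indicator.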
Engine 32 — XFA XFA FormCalc Parser Extracts and decompresses the XFA (XML Forms Architecture) data stream — an XML-based form description that supports an embedded scripting language called FormCalc. Detects auto-execute initialise/ready events and openURL / submit calls that silently exfiltrate data or fetch remote resources on form load. Flags exec() calls that pass strings to a FormCalc eval-style function and JavaScript snippets embedded within the XFA XML wrapper — a technique that bypasses AcroForm-specific scanners. Results feed into the Correlation Engine.
Engine 33 — ActGraph PDF Action Dependency Graph Constructs a directed graph of the complete PDF action chain: every /Next action pointer is followed to map the full execution sequence. Detects circular action cycles (infinite loops); deep chains exceeding 10 hops (overflows parser stack depth in hardened viewers); high fan-in nodes — single action objects referenced from many triggers simultaneously (covert shared-execution points); and sleeper nodes — actions present in the graph but unreachable from the nominal entry points, planted for deferred detonation via a separate trigger. The graph is serialised and available for raw forensic inspection. Results feed into the Correlation Engine.
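The graph checks can be sketched on a plain adjacency map. In this illustration `next_map` maps each action to the actions its /Next entry points at, and `entry_points` are the triggers the walk starts from; both names, and the result keys, are assumptions for the example. Fan-in scoring is left out because no threshold is stated.

```python
from typing import Dict, Iterable, List


def analyse_action_graph(next_map: Dict[str, List[str]],
                         entry_points: Iterable[str],
                         max_hops: int = 10) -> dict:
    """Walk /Next chains from the entry points and summarise what was found."""
    reachable = set()
    state = {"cycle": False, "deepest": 0}

    def walk(node: str, trail: List[str]) -> None:
        reachable.add(node)
        state["deepest"] = max(state["deepest"], len(trail))  # hops from entry
        for nxt in next_map.get(node, []):
            if nxt == node or nxt in trail:
                state["cycle"] = True     # chain loops back on itself
                continue
            walk(nxt, trail + [node])

    for entry in entry_points:
        walk(entry, [])

    all_nodes = set(next_map)
    for targets in next_map.values():
        all_nodes.update(targets)
    return {
        "has_cycle": state["cycle"],
        "deepest_chain": state["deepest"],
        "too_deep": state["deepest"] > max_hops,
        # Present in the graph but never reached from an entry point.
        "sleepers": sorted(all_nodes - reachable),
    }
```

The recursive walk is fine for small graphs; a scanner handling untrusted files would want an iterative traversal with a node budget.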
Engine 34 — OCG OCG Layer Cloaking Enumerates every Optional Content Group (/OCG) layer defined in the /OCProperties dictionary. Detects layers configured as never-visible (display-state forced off in all circumstances) — a technique for hiding malicious content from visual review; screen/print divergence (content visible on screen but suppressed in print, or vice versa — used in watermarking and DLP-evasion attacks); and hidden clickable links inside invisible layers, which are fully interactive in Acrobat despite being visually absent. Results feed into the Correlation Engine.
Engine 35 — Unicode Unicode & Invisible Text Forensics Scans for Unicode bidirectional control characters (U+202E RLO, U+200F RLM, U+202D LRO, U+200E LRM, U+2066–U+2069 isolate markers) in text streams and document strings. This is the class of injection behind the Trojan Source attack (CVE-2021-42574) and filename-spoofing attacks. Detects rendering mode 3 (invisible text, used by some phishing kits to embed a machine-readable payload over visible decoy text) and rendering mode 7 (clip mode, a more advanced form of invisibility). Invisible text detection works by directly parsing all content streams for the PDF Tr (text rendering mode) operator; PyMuPDF's span flags field encodes font flags (bold/italic/serif) rather than the rendering mode and cannot be used for this purpose. Flags homograph domains that use Cyrillic, Greek, or Armenian lookalike characters (confusable with ASCII). Results feed into the Correlation Engine.
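Two of the Engine 35 checks are simple enough to sketch with the standard library. The helper names and the regular expression below are assumptions for illustration; the Tr scan is deliberately naive and would also match operator-like text inside string literals, which a real content-stream parser must skip.

```python
import re

# Bidirectional controls named above, with their Unicode short names.
BIDI_CONTROLS = {
    "\u202e": "RLO", "\u202d": "LRO", "\u200f": "RLM", "\u200e": "LRM",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

# "<digit> Tr" where the digit is not the tail of a longer number.
TR_RE = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")


def find_bidi_controls(text: str) -> list:
    """Return (index, short name) for every bidi control in the text."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(text)
            if ch in BIDI_CONTROLS]


def invisible_text_modes(content_stream: bytes) -> list:
    """Return Tr operands equal to 3 (invisible) or 7 (clip) in a decoded stream."""
    return [int(m.group(1)) for m in TR_RE.finditer(content_stream)
            if m.group(1) in (b"3", b"7")]


spoofed = find_bidi_controls("invoice\u202efdp.exe")
modes = invisible_text_modes(b"BT /F1 12 Tf 3 Tr (hidden) Tj 0 Tr (shown) Tj ET")
```

The first call reports an RLO override at index 7, the kind of character that makes a viewer display the tail of that name reversed.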
Engine 36 — Trailer Trailer Chain Forensics Walks the raw trailer chain via /Prev byte-offset pointers without relying on any PDF library's repair logic. For each trailer, records the /ID array pair, the /Root reference, and the /Prev offset, building a chronological chain of all incremental updates. Detects Document ID mutation across updates (both entries of the /ID array should be stable after creation — mutation is a structural anomaly); /Root reference swaps between trailer versions (the Shadow Document Attack — a signed PDF whose signed version and visible version have different catalog roots); and malformed /Prev pointers that would confuse incremental-update-aware parsers. Results feed into the Correlation Engine.
Engine 37 — Codec Codec Exploit Parameter Validation Audits every compressed stream's filter parameters for known exploit patterns. CCITTFaxDecode: validates Columns and Rows against the stream length, because out-of-bounds values trigger heap overflows in multiple decoders. JBIG2Decode: checks for a /JBIG2Globals reference (associated with the CVE-2009-0658 Adobe Reader JBIG2 exploit seen in the wild in 2009). DCTDecode: validates that the declared stream length is plausible for the claimed image dimensions. Multi-filter chains: flags streams using 3+ stacked decoders, a classic technique to slow forensic analysis and trigger parser differential vulnerabilities, since each decoder in the chain may parse the preceding output differently. Results feed into the Correlation Engine.
Engine 38 — Entropy Physical Entropy Topology Computes per-256-byte sliding-window Shannon entropy across the raw file bytes, producing a high-resolution entropy map with structural awareness. Detects post-EOF high-entropy regions — encrypted payloads appended after the last %%EOF marker (invisible to all structure-respecting parsers); entropy cliffs — sudden sharp transitions between low-entropy and high-entropy regions that indicate injection boundaries; header entropy anomalies — unexpected compression or encryption in the first 256 bytes of the file; and under-entropy in compressed streams — near-zero entropy (<1.5 bits) in a compressed region that should be random (consistent with a decompression bomb). Image XObjects (/Subtype /Image) are excluded from the under-entropy check — solid-colour or uniform-fill images produce near-zero entropy in their compressed stream by design and are not suspicious. Uses the PDF's object offset table to partition the entropy map into structural regions (header, objects, streams, trailer, post-EOF). Results feed into the Correlation Engine.
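As a rough illustration of the Engine 38 measurements, the sketch below computes Shannon entropy over 256-byte windows and derives two of the signals described above. The step size, the `jump` threshold for a cliff, and the function names are assumptions; the real engine also partitions the map by object offsets, which is omitted here.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Entropy in bits per byte (0.0 for uniform data, 8.0 for all 256 values equally)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chunk).values())


def entropy_map(data: bytes, window: int = 256, step: int = 256) -> list:
    """Return (offset, entropy) for each window across the raw bytes."""
    last_start = max(len(data) - window + 1, 1)
    return [(off, shannon_entropy(data[off:off + window]))
            for off in range(0, last_start, step)]


def entropy_cliffs(emap: list, jump: float = 4.0) -> list:
    """Offsets where entropy changes by at least `jump` bits between windows."""
    return [b_off for (_, a), (b_off, b) in zip(emap, emap[1:])
            if abs(b - a) >= jump]


def post_eof_entropy(data: bytes) -> tuple:
    """Size and entropy of whatever follows the last %%EOF marker."""
    idx = data.rfind(b"%%EOF")
    tail = data[idx + 5:].strip() if idx != -1 else b""
    return len(tail), shannon_entropy(tail)


sample = b"A" * 256 + bytes(range(256))
emap = entropy_map(sample)
```

Here the first window is a run of one byte value and the second contains every byte value once, so the map shows a cliff at offset 256.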
Engine 39 — Stego Image Steganography & Tracking Beacons Extracts all embedded images (JPEG, PNG, BMP) via PyMuPDF and applies statistical steganalysis. LSB chi-square analysis: computes a chi-square statistic on the least-significant bits of each colour channel; a score above threshold indicates an LSB distribution consistent with LSB steganography (Steghide, OpenStego, etc.). Tracking beacons: flags 1×1 or sub-10px images that are loaded from HTTP/HTTPS URIs (invisible tracker pixels that phone home when the PDF is opened in a connected viewer). JPEG EXIF anomalies: parses EXIF metadata from all extracted JPEG images and flags maker notes, GPS tags, and unusual tag combinations that may fingerprint the author or embed covert data in EXIF fields. Results feed into the Correlation Engine.
Engine 40 — PDFA PDF/A Compliance Fraud Detector Checks whether a PDF claims PDF/A conformance (pdfaid:conformance and pdfaid:part XMP metadata) and, if so, validates that the document actually conforms to the declared standard. PDF/A forbids JavaScript, embedded executables, non-embedded fonts, encryption, and external references — all of which are attack vectors. Detecting a PDF that claims PDF/A but contains active content is a reliable indicator of a document engineered to bypass DLP systems and email gateways that whitelist PDF/A. Also checks for conformance level mismatch (e.g. claiming PDF/A-1a but using features only in PDF/A-2). Results feed into the Correlation Engine.
Engine 41 — JSEmul JavaScript Behavioral Emulation Executes extracted JavaScript in a sandboxed Node.js vm context with a full stub of the Acrobat JavaScript API — app, this, event, util, console, Doc, Field, and others. Intercepts and records all calls to dangerous methods: app.launchURL(), this.submitForm(), app.openDoc(), app.execMenuItem(), util.printd(). Detects obfuscated eval() and string-concatenation assembly of dangerous payloads at runtime. Records the full call log: function name, argument list, and execution timestamp. This engine catches JavaScript payloads that static AST analysis (Engine 16) cannot — obfuscated strings that are only assembled and evaluated at runtime. Results feed into the Correlation Engine.
Engine 42 — Font Font CharString Emulator Decrypts and emulates Type 1 font CharString programs using the eexec and charstring decryption algorithms. The Type 1 CharString format is a stack-based bytecode language, and the emulator flags three dangerous patterns: the seac operator (standard encoding accented character, which composes a glyph from two other glyphs and enables recursive execution that overflows the call stack in vulnerable renderers, used in exploits targeting Adobe Reader ≤9); excessive stack depth (CharString programs that push ≥200 values onto the stack, triggering stack exhaustion in strict interpreters); and abnormal subroutine depth (recursion deeper than 10 levels in the subr/globalsubr call chain). Also flags obfuscated font binaries with unusually high entropy in the eexec-encrypted region. Results feed into the Correlation Engine.
Engine 43 — XRef XRef Integrity Graph Builds a complete cross-reference graph by parsing both traditional XRef tables and compressed XRef streams (/XRef objects, PDF 1.5+). Cross-references every declared object against actual byte positions in the file. Detects phantom objects — entries in the XRef table that point to byte offsets with no valid object header; orphan sleepers — objects present at valid byte offsets but absent from every XRef table (reachable only through raw parsing, not through standard readers); free-entry exploitation — free-list entries (f type) whose generation numbers deviate from standard increments (a technique for hiding objects that become reachable after a use-after-free in the parser); and object length fraud — stream objects whose declared /Length diverges from the actual byte count between stream markers. Reachability BFS starts from doc.pdf_catalog() — the authoritative PDF Catalog xref returned by the parser — rather than assuming OID 1 is always the root (which produces large false-positive orphan lists in non-standard PDFs). Orphaned Action objects are classified by subtype: execution subtypes (JavaScript, Launch, GoToR, ImportData, SubmitForm, GoToE) are flagged as dangerous; navigational subtypes (URI, GoTo, Named, Sound, Movie) are treated as benign and not flagged. Results feed into the Correlation Engine.
Engine 45 — AI Synthesis 🤖 AI Forensic Report After all 44 forensic analysis engines complete, a Qwen 2.5 1.5B Instruct Q4_K_M LLM synthesises the structured scan output into a human-readable forensic report. The model runs on dedicated private hardware (Ryzen 5 3550H · 12 GB RAM · llama.cpp CPU-only) over an encrypted WireGuard tunnel — no OpenAI, no Anthropic, no Google, no third-party AI call of any kind. Your document data never leaves pqpdf.com infrastructure. Input to the model is a compact JSON object (~250–350 tokens) containing: risk score, critical/high indicator signal names, MITRE technique IDs, sandbox hit flag, structural stats, and threat intelligence match status — never raw binary PDF bytes. Output is a structured JSON object with seven fields: threat_verdict (MALICIOUS / SUSPICIOUS / LIKELY_CLEAN / CLEAN), confidence (HIGH / MEDIUM / LOW), executive_summary (one-sentence plain-English verdict), key_findings (array of {signal, severity, mitre_id} objects), observed_techniques (array of MITRE ATT&CK {id, name} pairs drawn only from IDs present in the scan), recommended_actions (array of strings), and false_positive_note (null or string). All enum fields (verdict, confidence, severity) are validated and normalised server-side — a fuzzy-match fallback corrects any model drift. Inference configuration: temperature 0.1 (near-deterministic), max 220 output tokens, json_object response format. Typical latency: ~15–25 s (CPU-only inference at ~13 tokens/s, no GPU required). Results appear in the dedicated 🤖 AI Forensic Report tab and as a compact verdict widget on the Summary tab.
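The server-side enum normalisation mentioned for the AI report could look roughly like the following. This is a sketch under stated assumptions: the use of `difflib`, the 0.6 cutoff, and the fallback defaults are illustrative choices, not the documented implementation. Only the allowed values come from the description above.

```python
import difflib

VERDICTS = ("MALICIOUS", "SUSPICIOUS", "LIKELY_CLEAN", "CLEAN")
CONFIDENCES = ("HIGH", "MEDIUM", "LOW")


def normalise_enum(value, allowed, default):
    """Map model output onto an allowed value, or fall back to `default`."""
    if not isinstance(value, str):
        return default
    cleaned = value.strip().upper().replace(" ", "_").replace("-", "_")
    if cleaned in allowed:
        return cleaned
    # Fuzzy fallback for small spelling drift in the model output.
    match = difflib.get_close_matches(cleaned, allowed, n=1, cutoff=0.6)
    return match[0] if match else default


def normalise_report(raw: dict) -> dict:
    """Return a copy of the report with its enum fields validated."""
    report = dict(raw)
    # Hypothetical defaults: the least alarming wrong answer is avoided.
    report["threat_verdict"] = normalise_enum(
        raw.get("threat_verdict"), VERDICTS, "SUSPICIOUS")
    report["confidence"] = normalise_enum(
        raw.get("confidence"), CONFIDENCES, "LOW")
    return report


fixed = normalise_report({"threat_verdict": "Malicous", "confidence": "medium "})
```

Choosing a cautious default means an unparseable verdict is never silently shown as clean.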
Engine 44 — Correlation Correlation Engine Cross-references all 43 prior engine findings and adds weighted bonus points (35–100) for dangerous combinations. Classic: JavaScript + /OpenAction + high entropy = +100 bonus; JavaScript + /Launch = +75 bonus. Cross-engine: YARA heap-spray + JS, PeePDF vuln + JS, qpdf structural damage + active content, ExifTool exploit-kit fingerprint + execution. Dynamic sandbox: live network beacon + JS, runtime shellcode + heap spray, dynamic shell spawn + trigger. Form patterns: AcroForm JS field + SubmitForm exfiltration target, /AA keystroke trigger + credential field, calc-order chain + JS payload. New-engine patterns: token obfuscation + JS keyword, annotation JS trigger + auto-exec, post-signature revision injection + execution vector, object stream concealment + active content, named JS registry + OpenAction, DocMDP bypass + content modification, XFA exec + auto-fire, action cycle + JS node, OCG hidden link + JS, trailer /Root swap + execution, codec OOB + active content, post-EOF entropy + execution, steganography + exfiltration target, PDF/A claim fraud + active content, JS emulation live call + obfuscated eval, font seac OOB + JS, XRef phantom object + orphan sleeper. TI + sandbox + YARA triple-confirmation. TI domain match + active content: a domain from the PDF's links matching threat intelligence databases combined with JavaScript or auto-execute content raises a high-confidence combined indicator. 60+ compound patterns. Multi-engine JS confirmation bonus: when 3+ independent engines confirm JavaScript presence, score is amplified. Final score capped at 999.

Risk scoring

Each indicator contributes base points multiplied by min(occurrence_count, 3) — capped at 3 occurrences per finding type to prevent artificial inflation from a single pattern appearing many times. The Correlation Engine adds weighted bonus points on top for dangerous combinations.

Risk level · Base points per occurrence
Critical · 50
High · 25
Medium · 10
Low · 3

Overall rating · Score range
Clean · 0
Low · 1–14
Suspicious · 15–54
Dangerous · 55–999
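The scoring rule above reduces to a short calculation. The sketch below assumes indicators arrive as (risk level, occurrence count) pairs and that the Correlation Engine bonus is passed in as a single number; both are illustrative choices, while the base points, the 3-occurrence cap, and the 999 ceiling come from the text.

```python
BASE_POINTS = {"critical": 50, "high": 25, "medium": 10, "low": 3}
MAX_COUNTED = 3    # occurrences counted per finding type
MAX_SCORE = 999    # final score ceiling


def risk_score(indicators, correlation_bonus=0):
    """indicators: iterable of (risk_level, occurrence_count) pairs."""
    base = sum(BASE_POINTS[level] * min(count, MAX_COUNTED)
               for level, count in indicators)
    return min(base + correlation_bonus, MAX_SCORE)


# One Critical finding seen 5 times counts as 3 x 50; one Medium adds 10.
score = risk_score([("critical", 5), ("medium", 1)])
with_bonus = risk_score([("critical", 5), ("medium", 1)], correlation_bonus=100)
```

With these inputs `score` is 160 and `with_bonus` is 260; a pile of findings that would exceed the ceiling is clipped to 999.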

Forensic Console

During the scan a live terminal-style event log streams timestamped events to the browser — upload confirmation, per-engine START/DONE lines, and the final risk verdict. Section dividers separate Upload, Engines, and Results phases. The console can be collapsed or cleared without affecting the scan.

Result banner and risk levels

When all 45 engines complete, a full-width banner appears at the top of the Summary tab showing the risk level, a text explanation, and a score meter bar (0–999):

Clean: score 0, green
🟡 Low Risk: score 1–99, yellow
🟠 Suspicious: score 100–299, orange
⚠️ High Risk: score 300–599, red
🔴 Dangerous: score 600–999, dark red

Statistics grid — 15 fields

Below the banner a 15-cell grid shows key structural stats at a glance. Four cells are clickable and jump directly to the relevant tab. Cells turn red when values exceed safe thresholds:

Pages · Objects · File Size · PDF Version · Encrypted · Embedded Files (clickable → Embedded tab, red if > 0) · Form Fields · Annotations · Links · %%EOF Markers (red if > 2) · XRef Tables (red if > 3) · Total Streams (clickable → Streams tab) · High-Entropy Streams (red if > 0) · URLs Found (clickable → URLs tab, red if > 0) · Threats Found (clickable → Threats tab, red if > 0)

Scan report — 24 tabs

Results are rendered across 24 tabs. Each tab is independently navigable. Dynamic badges on several tabs update live (threat count, ML %, MITRE technique count, phishing signal score, embedded file count).

📊 Summary: risk banner + score meter, 15-cell stats grid, engines-completed pill strip (✓ {name} for all 45 that ran), ML probability bar + SHAP feature bars + false-positive/confirm-threat feedback buttons
⚠️ Threats: all indicators grouped Critical → High → Medium → Low; each card shows risk badge, engine label, count pill, key, description, byte-context snippet
📈 Score: score gauge (0–999), per-engine contribution bars, full per-indicator table (engine / indicator / risk / base pts / count / total pts)
⚙️ Engines: two-panel browser with a sidebar (45 engines, status dot, findings pill) and a right panel showing full indicator cards and engine-specific data (stream table ③, URL list ⑤, SHAP bars ⑯, differential table ⑰, certificate chain ㉑, correlation bonuses + Per-Engine Indicator Counts + Final Risk Assessment ㉕)
🌐 URLs: all unique HTTP/HTTPS URLs from raw bytes and decompressed streams, per-URL copy button
📦 Streams: top 40 streams with XRef# · type · decompressed size · Shannon entropy bar (red if > 7.2) · status (OK / High Entropy / Patterns Found) · matched patterns. Suspicious rows amber, high-entropy rows orange.
🧠 ML: malicious probability bar, SHAP bar chart (red=malicious / green=benign per feature), feature importance bars, false-positive / confirm-threat feedback buttons (trains next model update)
🔬 Sandbox: 7-cell metrics grid (Behavioral Score · Network Attempts · Exec Attempts · Process Forks · FS Escape · Anon Exec Memory · Timeout, cells red at critical thresholds), renderer list, threat indicators, matched YARA rules
🌍 Threat Intel: confirmed-malware banner (if SHA-256 matches), per-database results (URLhaus · MalwareBazaar · ThreatFox · FeodoTracker · OpenPhish), domain-level TI matches, campaign attribution (TLSH · pHash · JS fingerprint), similar malicious samples with similarity %
🎯 MITRE: ATT&CK technique IDs mapped from indicators, grouped by tactic, indicator rows per technique
🧬 Parsing: 6 parser cards (MuPDF · Poppler · Ghostscript · qpdf · pdfminer · pdf.js), per-dimension comparison (pages · objects · JS · encryption · AcroForm · embedded files · linearised · OpenAction), mismatch severity badges
🧬 Polyglot: Engine ⑱ magic-byte hits (type + risk badge) + Engine ⑲ JS AST deobfuscation findings (eval · fromCharCode · unescape · large numeric arrays · new Function)
🎣 Phishing: signal score meter, urgency phrase tags, brand keyword tags, credential-harvesting detection, QR code decodes, OCR-extracted text from images
📎 Embedded: per-file cards with magic-byte type · size · VBA macro detection · ZIP content listing (50 entries, dangerous-extension flags) · PE import table · suspicious strings · nested PDF detection
✍️ Signature: signature count, ByteRange coverage gap (shadow-document indicator), post-signing revision diff, unsigned JS/launch actions, per-certificate cards (subject · issuer · dates · algorithm · self-signed · expired)
📜 History: per-revision timeline (engine 26) with %%EOF count, per-revision author/producer/date, new/modified/deleted object counts per update, execution vectors injected post-creation, large late-stage injection alerts
📌 Annotations: per-page annotation cards (engine 27) with type, action dictionary, dangerous URI scheme flags, JS/Launch/GoToR/SubmitForm action detection, risk badge per annotation
📋 XFA: XFA FormCalc findings (engine 32) covering auto-execute events, openURL/submit calls, exec() calls, embedded JavaScript in XFA XML
🗺️ Action Graph: PDF action dependency graph (engine 33) with full action chain visualisation, cycle detection, deep chain alerts, high fan-in nodes, sleeper/orphan action nodes
🧪 Deep Forensics: findings from engines 34–43, namely OCG layer cloaking (engine 34) · Unicode/invisible text (35) · trailer chain forensics (36) · codec exploit parameters (37) · physical entropy topology with post-EOF detection (38) · image steganography & tracking beacons (39) · PDF/A compliance fraud (40) · JS behavioral emulation call log (41) · font CharString emulator findings (42) · XRef integrity graph anomalies (43)
🤖 AI Forensic Report: Qwen 2.5 1.5B Instruct (self-hosted, no third-party AI) synthesises all 44 engine outputs into threat verdict (MALICIOUS / SUSPICIOUS / LIKELY_CLEAN / CLEAN) · confidence rating · executive summary · key findings table with MITRE technique IDs and severity badges · observed MITRE ATT&CK technique grid · recommended actions · false-positive note. Structured JSON output, ~15–25 s CPU inference, near-deterministic (temperature 0.1). Compact AI verdict widget also shown inline on the Summary tab.
🏷️ Metadata: document metadata KV table · structure info KV table · full 44-engine structure dump
📋 Raw JSON: complete scan result JSON with syntax highlighting (strings · keys · booleans · nulls · numbers) and one-click copy
🔍 Raw Forensics: JS source code from streams · JS AST deobfuscation contexts · decoded stream content (3 KB preview) · every indicator context snippet · complete sorted KV dump from all 45 engines

Sanitize panel

After every scan (including clean results) a 9-mode sanitize panel appears below the result. Selecting a method sends the session token to the server, produces a new file, and reveals a Download Sanitized PDF button and a Scan the Sanitized File button to re-run the full 45-engine scan on the cleaned output. The original file is never modified.

ML data policy

The ML engine stores a 38-dimensional feature vector per scan (structural statistics: byte counts, entropy values, object type flags, parser discrepancy counts, sandbox syscall anomalies). No file content, no filename, no hash, no IP address, and no PII is stored. Feature vectors are used to retrain the IsolationForest, RandomForest, and LightGBM models every 30 minutes. Model drift detection reports if models have not been retrained in >30 days. Feature vectors are retained indefinitely; because they contain no personal data, they are not subject to GDPR Article 17. Full details on the Security page.

Standard mode — AES-256-CBC

Password is transmitted over TLS, used to encrypt via Ghostscript with AES-256-CBC, and never stored. Granular permission flags are configurable: print, copy, modify, annotate, form fill, accessibility, and assembly.

PQC mode — client-side quantum-safe encryption

In PQC mode the encryption happens in your browser before the file is uploaded. Key generation uses @noble/post-quantum — a local JavaScript library. The server receives only the encrypted .pqcpdf bundle. Your plaintext file never crosses the network unencrypted.

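Why this matters: the classical public-key schemes that usually protect key exchange and signatures (RSA, elliptic-curve Diffie-Hellman, ECDSA) are vulnerable to Shor's algorithm on a sufficiently powerful quantum computer. Symmetric AES-256 is affected only by Grover's algorithm, which roughly halves its effective key strength, so the quantum exposure lies in how keys are established rather than in the cipher itself. NIST standardised post-quantum key encapsulation in 2024 (ML-KEM, FIPS 203). PQ PDF is the only free online PDF tool that implements post-quantum key encapsulation.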

Available algorithms (31 total, 29 quantum-resistant)

Organised by category. NIST = NIST-standardised primitive.

Classical
X25519 / Ed25519 / AES-256-GCM
Core PQ
Hybrid (Classical + Post-Quantum) · Post-Quantum (NIST Standardised) · ML-KEM-1024 (Pure KEM)
Multi-Layer
Multi-Algorithm (Triple Layer) · Multi-KEM (Classical + PQ KEM) · Multi-KEM Triple (3× KEM Redundancy) · Quad-Layer (4-Layer Redundancy) · Lattice + Code (Mathematical Diversity) · PQ3-Stack (Forward Secrecy)
HQC — Code-Based (NIST 2025)
HQC-128 (128-bit security) · HQC-192 (192-bit security) · HQC-256 (256-bit security)
FN-DSA (Falcon) — Lattice Signatures
FN-DSA 512 Compact (666B sigs) · FN-DSA 1024 High-Security (1.3KB sigs) · FN-DSA Floating-Point Hardened · FN-DSA Dual Signature Redundancy · FN-DSA Transition Stack (Hybrid TLS) · FN-DSA + ZK Stack (Privacy-First)
Max Secure
PQ Lightweight (Embedded / IoT) · Pure PQ (High Assurance) · Hybrid Transition (NIST 5 + Classical) · Stateless (Hash-Based / Firmware) · Crypto-Agile Stack (Runtime Switching) · PQC + ZK Stack (Zero-Knowledge)
Experimental
Quantum-Inspired Lattice Fusion · Post-ZK Homomorphic Stack · Quantum-Resistant Consensus · Entropy-Orchestrated PQ Stack · AI-Synthesized Crypto-Agile
Primitives used: Key encapsulation — ML-KEM-1024 (FIPS 203), HQC (NIST 2025 backup KEM), X25519. Signatures — ML-DSA-87 (FIPS 204), FN-DSA/Falcon (FIPS 206), SLH-DSA/SPHINCS+ (FIPS 205), Ed25519. Symmetric — AES-256-GCM, ChaCha20-Poly1305, Ascon-128a (NIST LWC). All key generation runs in your browser via @noble/post-quantum.

Signature modes

✏️ Draw: Freehand on a canvas with full touch support. The drawn signature is composited onto the PDF page at your chosen position and size.
🔤 Type: Your name is rendered as a signature image via ImageMagick using the DejaVu-Sans-Oblique font.
📷 Upload: Use your own PNG or JPEG as the signature image. PNG transparency is preserved.
🔐 PAdES / Crypto Only: An invisible cryptographic signature with no image drawn on the page. Verifiable in Adobe Reader's Signatures panel. Compliant with the PAdES baseline profile (PAdES-B, ETSI EN 319 142-1) via pyhanko 0.34. The signature is written as an incremental update, so the original content stream is never modified.

Visual placement controls (Draw / Type / Upload modes)

First / last / all / custom page selector. Two placement modes:

  • Snap grid — 3×3 position grid (left/center/right × top/middle/bottom) for one-click alignment.
  • Free placement — drag the signature to any position on the page. Coordinates are transmitted as fractional page offsets (pos_x_pct, pos_y_pct, range 0.0–1.0) and applied with sub-point precision regardless of page dimensions.

A size slider sets the signature size (40–300 pt). A live placement preview composites the signature image onto a rendered canvas of page 1 in real time as position and size are adjusted.
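Because the coordinates are fractions of the page, the same pos_x_pct / pos_y_pct pair lands in the same relative spot on any page size. The sketch below shows one way the conversion to points might work. It assumes the fractions locate the top-left corner of the signature box, that the origin is the top-left of the page, and that the box is clamped to stay on the page; none of these details are stated in the text, and `place_signature` is a hypothetical name.

```python
def place_signature(page_w, page_h, pos_x_pct, pos_y_pct, sig_w, sig_h):
    """Return the signature rectangle (x0, y0, x1, y1) in points."""
    if not (0.0 <= pos_x_pct <= 1.0 and 0.0 <= pos_y_pct <= 1.0):
        raise ValueError("fractional offsets must be within 0.0-1.0")
    # Scale the fractions by the page size, then clamp so the box stays on the page.
    x0 = max(0.0, min(pos_x_pct * page_w, page_w - sig_w))
    y0 = max(0.0, min(pos_y_pct * page_h, page_h - sig_h))
    return (x0, y0, x0 + sig_w, y0 + sig_h)


# Same fractions, two page sizes (A4 and US Letter, in points).
a4_rect = place_signature(595, 842, 0.5, 0.25, 120, 40)
letter_rect = place_signature(612, 792, 0.5, 0.25, 120, 40)
```

The A4 result starts at (297.5, 210.5) and the Letter result at (306.0, 198.0): different absolute points, same relative position.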

Date stamp — an optional date string (up to 30 characters) can be rendered in small text directly below the signature image. Accepts any alphanumeric format, separators, and common date punctuation.

Certificate options

All modes embed a cryptographic digital signature. Certificate source is either an auto-generated ephemeral RSA-2048 self-signed certificate (created per-request, never stored) or a user-supplied .p12 / .pfx file. Signer name (required), email, reason, and location metadata are embedded in the CMS/PKCS#7 signature block.

Note: /tools/pades.php 301-redirects to /tools/sign.php?tab=pades — existing links and bookmarks continue to work.

Workflow

The initiator uploads a PDF, adds up to 10 signers (name + optional email), and chooses a signing order. The server creates an ephemeral workspace (/tmp/esign_{32hex}/, mode 0700) and generates a unique 256-bit secure token per signer. Each token produces a signing URL that can be shared directly — no account is required on either side. The initiator's tracking page polls status every 5 seconds and provides a download link once all signers have completed.
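The workspace and token scheme can be sketched with Python's `secrets` module. This is an illustration only: the function names, the URL shape, and the base URL are hypothetical, and the sketch is written in Python purely for demonstration. The 32-hex workspace suffix, the 0700 mode, the 256-bit tokens, and the 10-signer limit come from the paragraph above.

```python
import os
import re
import secrets
import tempfile

WORKSPACE_RE = re.compile(r"^esign_[0-9a-f]{32}$")


def create_workspace(base_dir: str) -> str:
    """Create esign_{32hex} under base_dir, readable by the owner only."""
    path = os.path.join(base_dir, f"esign_{secrets.token_hex(16)}")  # 16 bytes -> 32 hex chars
    os.mkdir(path, mode=0o700)
    return path


def issue_signing_links(signers, base_url):
    """Return one record per signer with a fresh 256-bit token."""
    if not 1 <= len(signers) <= 10:
        raise ValueError("a workflow takes between 1 and 10 signers")
    links = []
    for name in signers:
        token = secrets.token_hex(32)  # 32 bytes = 256 bits
        # Hypothetical URL layout, for illustration only.
        links.append({"name": name, "token": token,
                      "url": f"{base_url}?token={token}"})
    return links


with tempfile.TemporaryDirectory() as base:
    workspace_name = os.path.basename(create_workspace(base))
links = issue_signing_links(["Alice", "Bob"], "https://example.invalid/sign")
```

Tokens drawn from `secrets` are unguessable, which is what lets a bare link stand in for an account.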

Signing order

Sequential (chain): Each signer's completed PDF becomes the next signer's input. The output accumulates all signatures in order. Signer 2 cannot sign until Signer 1 has completed.
Parallel (all-at-once): All signers receive their link simultaneously and sign independently. The server merges signatures when all parties complete.

Signature placement

Each signer sees a page-1 thumbnail and can place their signature using the same three input modes as the solo Sign PDF tool (draw canvas, typed name, uploaded image). Placement supports the full 3×3 snap grid and free drag-and-drop positioning via fractional page coordinates (pos_x_pct, pos_y_pct). An optional date stamp can be rendered below the signature image.

Cryptographic enforcement — require_crypto

When the document creator enables require_crypto at creation time, signers who attempt to submit without enabling the PAdES-B cryptographic layer receive an error response: "A PAdES-B cryptographic signature is required for this document." This lets initiators mandate that every signature in the workflow is cryptographically verifiable in Adobe Reader's Signatures panel — not just a visual stamp. The certificate source is the signer's own .p12/.pfx or an auto-generated ephemeral RSA-2048 self-signed certificate created per request and never stored.

Workflow management (from the tracking page)

  • Add signer — append a new signer to an in-progress workflow; a fresh token and signing URL are generated immediately.
  • Remove signer — remove a signer who has not yet signed; their token is invalidated.
  • Cancel request — terminate the entire workflow; all tokens are invalidated and the workspace is scheduled for cleanup.
  • Return URL / copy link — the initiator can copy a resume link to return to the tracking page from any device.

Storage & retention

All state is stored in the ephemeral temp directory, with no database writes and no cloud storage. The workspace has a 24-hour TTL: it is purged on expiry, and expired workspaces are also swept by the cleanup that runs when a new workflow is created. The final signed PDF is never stored beyond the TTL window. Outside that window, zero retention applies to the e-sign workflow exactly as it does to all other tools.

Watermark renders directly to the PDF content stream via PyMuPDF — not as a separate annotation layer. The text is permanently embedded; it cannot be removed by deleting an annotation. A live canvas preview composites your watermark text over page 1 in real time as you adjust any setting.

Placement positions (8)

Diagonal (full page), default: The watermark spans the full page at 45° from bottom-left to top-right. The most common choice, because it makes it visually unambiguous that the document is marked.
Center: Horizontal text centred on the page. Prominent without the angle.
📍 Top-Left / Top-Right / Bottom-Left / Bottom-Right: Corner placements. Useful for company name, document classification, or "DRAFT" in a corner without obscuring the main content area.
Header / Footer: Full-width centred text at the top or bottom of the page. Suitable for document titles, classification banners, or page footers.

Style controls

💧 Opacity (5% to 100%, default 30%): Lower values produce a subtle ghost watermark that does not obscure content. Higher values produce an opaque stamp. The live preview renders the exact opacity as PyMuPDF will apply it.
🔤 Font size (12 pt to 96 pt, default 44 pt): The preview updates immediately so you can confirm the text fits without truncation at the selected size.
📝 Font style (Bold, Regular, Italic, Bold Italic): Bold is the default for legibility at lower opacities. Italic suits signature-style watermarks.
🎨 Colour (hex picker, default #cccccc): Any hex colour. Common choices: #cccccc neutral grey, #ff0000 red for CONFIDENTIAL, #0000ff blue for DRAFT.

Page targeting

Apply to All pages, Odd pages (recto-only in duplex documents), Even pages, or a custom range (comma-separated, e.g. 1-3, 5, 8-10).
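A page-targeting option like this typically resolves to a list of 1-based page numbers. The sketch below is one possible interpretation: the mode strings, the function name, and the choice to silently clip ranges that run past the last page are assumptions, not documented behaviour.

```python
def select_pages(mode: str, page_count: int, custom: str = "") -> list:
    """Resolve a targeting mode to a sorted list of 1-based page numbers."""
    if mode == "all":
        return list(range(1, page_count + 1))
    if mode == "odd":
        return list(range(1, page_count + 1, 2))
    if mode == "even":
        return list(range(2, page_count + 1, 2))
    if mode != "custom":
        raise ValueError(f"unknown mode: {mode!r}")

    pages = set()
    for part in custom.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first
        if first < 1 or last < first:
            raise ValueError(f"bad range: {part!r}")
        # Clip to the document length (an assumption of this sketch).
        pages.update(range(first, min(last, page_count) + 1))
    return sorted(pages)


targets = select_pages("custom", 12, "1-3, 5, 8-10")
```

For a 12-page document the example range resolves to pages 1, 2, 3, 5, 8, 9, and 10.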

Redaction is not the same as drawing a black box over text. A black rectangle drawn on top of text leaves the original text in the PDF file — it can be selected, copied, and searched by anyone who removes or moves the rectangle. Genuine redaction removes the underlying content from the PDF's data structures. This tool uses PyMuPDF's native redaction API, which permanently erases content at the structural level.

How it works: page.add_redact_annot() marks regions, then page.apply_redactions() removes the content from the page's content streams — text, images, and vector graphics within the region are erased, not covered.

Mode 1 — Text pattern redaction

Enter search patterns and the tool finds every matching text occurrence across the document and permanently removes it.

📋 Multi-pattern list: Add multiple patterns in one job, such as names, ID numbers, phone numbers, and email addresses. All occurrences of all patterns are redacted in a single pass.
🔤 Case-sensitive matching: Toggle on to distinguish "CONFIDENTIAL" from "confidential". Off by default, so any case variant matches.
🔍 Whole-word matching: When enabled, "John" will not match "Johnson". Prevents partial-word false positives in names and technical terms.
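The three matching options combine naturally into a single compiled pattern. The sketch below assumes the patterns are literal strings rather than regular expressions, and `build_matcher` is a hypothetical helper; it shows only the matching step, not the PyMuPDF redaction that follows.

```python
import re


def build_matcher(patterns, case_sensitive=False, whole_word=False):
    """Compile a list of literal search patterns into one regex."""
    # Longest first, so "Johnson" wins over "John" when both are listed.
    parts = [re.escape(p) for p in sorted(patterns, key=len, reverse=True) if p]
    if not parts:
        raise ValueError("at least one non-empty pattern is required")
    body = "|".join(parts)
    if whole_word:
        # Lookarounds instead of \b so patterns like "+44" still work.
        body = rf"(?<!\w)(?:{body})(?!\w)"
    return re.compile(body, 0 if case_sensitive else re.IGNORECASE)


text = "John met Johnson and john."
loose = build_matcher(["John"]).findall(text)
whole = build_matcher(["John"], whole_word=True).findall(text)
exact = build_matcher(["John"], case_sensitive=True, whole_word=True).findall(text)
```

With the defaults the pattern also hits the "John" inside "Johnson"; enabling whole-word matching drops that hit, and adding case sensitivity leaves only the capitalised name.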

Mode 2 — Canvas region redaction

Draw rectangular redaction areas directly on a rendered preview of each PDF page.

📷 Click-and-drag to draw regions: Each region is drawn as a rectangle on the canvas. Multiple regions per page. Coordinates are captured in PDF display-space points and sent to the server for precise structural erasure.
📄 Multi-page navigation: Navigate through all pages and mark regions on each. A per-page region list shows how many areas are marked.
🗑️ Clear page: Remove all regions from the current page without affecting others.

Fill colour and page targeting

Black fill (standard) produces the visible redaction box. White fill is invisible on white backgrounds — useful when removing content without leaving a visible mark, such as stripping header metadata. Page targeting: All pages, Odd, Even, or a custom range.

11 tools for editing, filling, comparing, reading, and inspecting PDF documents. From a full visual editor to a font-embedding checker to table extraction.

✏️
Edit PDF

Full page-by-page visual editor: 16 annotation tools, an interactive AcroForm builder, and a bookmark editor. All edits are permanently flattened server-side. See the deep dive below.

📝
Fill PDF Form

Detect and fill all interactive AcroForm fields — text inputs, checkboxes, radio buttons, dropdowns, and list boxes. Values are written server-side via PyMuPDF. Optional flatten-after-fill bakes values into static content.

🔍
Compare PDFs

Visual pixel-level diff of two PDFs. Configurable DPI (72–300) and sensitivity. Side-by-side previews render immediately when files are selected. Output is a highlighted diff PDF with change regions marked. Includes 🤖 AI Change Analysis: Qwen 2.5 1.5B returns a change significance rating (MAJOR/MODERATE/MINOR/NONE), the change type, a plain-English change_summary, a details array (per-change breakdown), and a recommendation.

📄
Extract Text

Export all text to .txt with optional layout preservation, text encoding selection, and custom page range. Includes 🤖 AI Document Analysis: Qwen 2.5 1.5B returns the document type (13 categories) with classification_confidence, the language, key entities (people, organisations, locations, dates, amounts), topics, and reading level.

ℹ️
PDF Info

Full metadata inspection: title, author, subject, keywords, creator, producer, page count, dimensions, PDF version, encryption status, form type, tagged flag, fast web view, permission flags, and creation/modification dates. Shows a canvas preview of page 1 alongside the data.

🔎
OCR PDF

Optical character recognition for scanned and image-based PDFs via Tesseract 5 LSTM. Three output formats, DPI control, four page segmentation modes, up to 100 pages per job. Returns OCR confidence score, word count, character count, and a live text preview tab. See the deep dive.

Try it →
🔖
Outline / Bookmarks

Load a PDF's existing table of contents. Add, rename, reorder, delete, and set the level (1–4) of each entry. Each row has a page-number input validated against the actual page count. Reads and writes via PyMuPDF get_toc() / set_toc().

Try it →
Accessibility Checker

WCAG 2.1 / PDF/UA compliance audit via PyMuPDF. 8 checks: document title (2.4.2), language metadata (3.1.1), tagged structure (PDF/UA §7.1), image alt-text (1.1.1), reading order (1.3.2), font embedding (PDF/UA §7.21), bookmark navigation (2.4.5), and page-size consistency. Returns pass/fail with WCAG criterion references and overall A–F grade.

Try it →
🔤
Font Inspector

Lists every font across every page: name, type (Type1, TrueType, CIDFont, etc.), encoding, embedded status, subset flag (presence of + prefix in BaseFont name), and the pages each font appears on. Non-embedded fonts flagged in red — critical for print and PDF/UA compliance.

Try it →
🎨
Colour Inspector

Comprehensive colour audit across all PDF content — raster images, vector paths, shapes, and text. Detects DeviceRGB, DeviceCMYK, DeviceGray, Spot, ICC, Lab, and more. Flags overprint, transparency, and Total Ink Coverage over 300%. Ghostscript inkcov gives structured per-page CMYK percentages.

Try it →
📊
Tables to JSON

Extracts all tables from a PDF using pdfplumber with lines_strict strategy (explicit table borders from PDF path operators), falling back to text-position heuristics. First row becomes column headers. Output: {table_count, page_count, tables:[{id, page, rows, cols, headers, data}]}.

Try it →

16 annotation tools

🔤 Text (bold / italic / alignment / font / size / colour)
🖊️ Freehand Draw
🧹 Eraser
Line
➡️ Arrow
▭ Rectangle (+ fill colour)
○ Ellipse (+ fill colour)
🟡 Highlight
■ Whiteout
Strikethrough
📏 Underline
🖼️ Image Insert
✍️ Signature
📷 QR Code
📌 Stamps (12 built-in + custom)
📌 Sticky Notes

Interactive form builder

Draw AcroForm widgets directly onto the PDF canvas. Supported field types:

Text field CheckBox RadioButton ListBox ComboBox Signature field PushButton

Each field has configurable field name, tooltip, required/read-only flags, font size, and text colour. Fields are written as native AcroForm annotations.

Additional features

🔖
Bookmark editor Build a navigable table of contents written via PyMuPDF set_toc().
🔢
Page numbers & headers/footers Auto-number pages; add custom header and footer text with font/size/position control.
🔄
Per-page rotation & blank page insertion Rotate individual pages or insert blank pages anywhere in the document.
↩️
Undo / redo Full undo/redo stack for all annotation operations within the session.

How edits are committed

All annotation data (positions, colours, text, field definitions) is collected client-side and sent to the server as structured JSON alongside the original PDF. PyMuPDF applies every annotation and permanently flattens them into the page content. The output PDF has no interactive annotation layer — all edits are baked in.

Engine

Tesseract 5 with the LSTM neural network engine (OEM mode 1). The LSTM engine significantly outperforms the older pattern-matching engine on low-quality scans, handwriting, and non-standard fonts.

Output formats

📄
Plain text (.txt) All recognised text extracted as a flat text file.
🔍
Searchable PDF Original page images preserved with an invisible text layer overlaid. The document becomes copyable and searchable in any PDF viewer.
📦
Both (ZIP) A ZIP containing the .txt and the searchable PDF together.

Controls

DPI: 150 / 200 / 300. Higher DPI improves accuracy on dense text but increases processing time. Page segmentation modes (PSM): auto-detect, single column, single block, and sparse text — important for forms and tables where the default auto-detect makes wrong assumptions. Custom page range: up to 100 pages per job.

What comes back

Along with the output file, the response includes an OCR confidence score (per-word Tesseract TSV confidence averaged across all pages), word count, and character count. A live text preview tab in the browser lets you read extracted text without downloading the file.
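The confidence figure described above can be sketched as follows. This is a minimal illustration, assuming the score is a flat average over every recognised word in Tesseract's TSV output; the function name and the sample data are hypothetical, not the service's actual code.

```python
import csv
import io


def mean_word_confidence(tsv_text: str) -> float:
    """Average Tesseract's per-word confidence, skipping non-word rows."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    confidences = []
    for row in reader:
        conf = float(row["conf"] or -1)
        text = (row["text"] or "").strip()
        # Block, paragraph, and line rows carry conf = -1 and no text.
        if conf >= 0 and text:
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0


# Hypothetical three-row sample: one line row, two word rows.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t20\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t90\t20\t96.5\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t110\t10\t90\t20\t83.5\ttotal\n"
)

print(mean_word_confidence(SAMPLE_TSV))
```

Averaging per word rather than per page keeps a short, clean page from outweighing a long, noisy one.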

Compare PDFs performs a page-by-page pixel-level diff between two documents. Because it operates on rendered pixels rather than the text layer, it works equally on text-based PDFs and scanned documents — any visual change is detected, including font substitutions, layout shifts, and image replacements that text-diff tools would miss.

Resolution

Both documents are rendered at the selected DPI before comparison. Higher DPI catches smaller visual differences but increases processing time and output file size.

100 DPI (Fast): Suitable for detecting large-scale changes: paragraph additions, section moves, image replacements.
⚖️
150 DPI (Balanced, default): Catches most meaningful changes including single-word edits, font changes, and minor layout shifts.
🔍
200 DPI (Detailed): Detects subtle rendering differences, anti-aliasing changes, and minor typographic adjustments. Use when documents are visually similar and small changes are critical.

Sensitivity threshold

Controls the minimum per-pixel difference required to flag a change. Lower values catch more (including compression artefacts); higher values ignore minor differences.

📊
Low (threshold: 5): Detects nearly every pixel difference. Use when comparing documents known to be visually identical and you want to confirm that with precision.
⚖️
Medium (threshold: 15, default): Ignores minor rendering differences and JPEG artefacts. Flags meaningful content changes. The right choice for most document review workflows.
🔎
High (threshold: 30): Only flags substantial changes. Useful when comparing a scanned document against a digital version where scanner noise would otherwise produce false positives across the whole page.
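To make the three thresholds concrete, here is a minimal sketch of a per-pixel comparison. It assumes both pages are already rendered to equal-sized grids of 0–255 grayscale values and that a pixel is flagged when its absolute difference exceeds the threshold; the function name and sample grids are hypothetical.

```python
def changed_pixels(page_a, page_b, threshold=15):
    """Return a boolean mask: True where the pixel difference exceeds threshold."""
    same_shape = len(page_a) == len(page_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(page_a, page_b)
    )
    if not same_shape:
        raise ValueError("both pages must be rendered at the same size")
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(page_a, page_b)
    ]


# Two tiny 2x3 "pages": one faint difference (10) and one strong difference (55).
page_a = [[255, 255, 0], [255, 250, 255]]
page_b = [[255, 245, 0], [255, 250, 200]]

for threshold in (5, 15, 30):
    mask = changed_pixels(page_a, page_b, threshold)
    flagged = sum(cell for row in mask for cell in row)
    print(threshold, flagged)
```

With these grids the faint difference is only flagged at threshold 5, which is the behaviour the Low setting describes.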

Change map colour coding

🟥
Red (only in Document A, original): Content that has been removed or replaced in the revised version.
🟩
Green (only in Document B, revised): Content that has been added or changed in the revised version.
Gray (unchanged): Identical content in both documents, rendered at reduced opacity so changed regions stand out.

Preview and output

Side-by-side canvas previews of both documents render immediately when each file is selected — no upload required for the preview. Output is a single diff PDF with change regions overlaid on every compared page pair.

On upload, the tool reads the PDF's AcroForm dictionary and generates a matching input form in the browser — one input per field, typed to match the field's widget type. Fill the form in the browser, then submit: PyMuPDF writes the values server-side and returns the filled PDF.

Supported AcroForm field types

📝 Text field — single and multi-line ✅ Checkbox — true / false toggle 🔘 Radio button group 📋 Combo box / drop-down 📋 List box — single or multi-select

Flatten-after-fill

When the flatten option is enabled, field values are baked into the page content stream after writing — the output PDF has no interactive form layer. The filled values appear as static text. This is the correct format for archiving, printing, or sharing a completed form — interactive fields in a shared PDF can otherwise be re-edited by any recipient.

No-fields detection

If the uploaded PDF has no AcroForm dictionary, the tool shows a "No interactive fields found" notice immediately rather than presenting an empty form. For PDFs that need form fields added, use the Edit PDF tool's form builder.

The accessibility checker audits a PDF against WCAG 2.1 and PDF/UA-1 requirements. It returns a pass/fail result for each criterion, an impact level (Critical / High / Medium), and an overall letter grade (A–F) based on a weighted score.

The 8 checks

Check 1 — WCAG 2.4.2 Document Title The PDF's Info dictionary must contain a non-empty Title field. Screen readers announce the document title when the file opens.
Check 2 — WCAG 3.1.1 Language Metadata A language identifier (e.g. en-US) must be set in the document's XMP metadata. Screen readers use this to select the correct speech synthesis voice.
Check 3 — PDF/UA §7.1 Tagged PDF Structure The PDF must contain a logical structure tree (/MarkInfo /Marked true). Tagged PDFs expose heading levels, paragraphs, lists, and tables to assistive technology. Untagged PDFs are effectively inaccessible to screen readers.
Check 4 — WCAG 1.1.1 Image Alt Text Every image in the structure tree must have an alternative text description (/Alt) or be marked decorative (/Artifact). Images without alt text are invisible to screen readers.
Check 5 — WCAG 1.3.2 Reading Order The logical reading order in the structure tree must match the visual reading order. Multi-column PDFs and complex layouts are common sources of reading-order failures — screen readers follow the structure tree, not visual position.
Check 6 — PDF/UA §7.21 Font Embedding All fonts must be fully embedded (or be one of the 14 standard PDF fonts). Non-embedded fonts depend on the viewer's substitution, which can change character rendering and disrupt the character-to-glyph mapping required by assistive technology.
Check 7 — WCAG 2.4.5 Bookmark Navigation Multi-page documents (more than 9 pages) should have a bookmark outline. Bookmarks allow screen reader users and keyboard navigators to jump directly to sections without reading through the entire document.
Check 8 — Consistency Page Size Consistency All pages should have consistent dimensions. Mixed-size documents can cause assistive technology to misinterpret layout, and may indicate inadvertent page imports from different source documents.

Grading

Each check carries a weight corresponding to its accessibility impact. The weighted pass rate maps to letter grades: A (all critical + high pass), B (minor failures only), down to F. The report lists each check's pass/fail status, the specific WCAG criterion, and the impact level — giving developers and document authors a clear remediation priority order.

Font Inspector

Enumerates every font used across every page. For each font the report shows:

Font name
Type: Type1, TrueType, CIDFont, OpenType, etc.
Encoding
Embedded: Yes / No (non-embedded flagged red)
Subset: + prefix in BaseFont name (e.g. ABCDEF+Helvetica)
Pages: list of pages where this font appears

Why non-embedded fonts fail print: When a font is not embedded, the viewer or RIP must substitute it. Substitution changes glyph widths, reflows text, and breaks any layout that depends on exact positioning. PDF/X and PDF/UA compliance both require fonts to be embedded. Non-embedded fonts are flagged in red.

Subset embedding: A six-letter uppercase tag followed by + at the start of the BaseFont name (e.g. ABCDEF+Helvetica) means only the glyphs actually used in the document are included. This reduces file size while remaining compliant with PDF/X and PDF/UA.
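The subset check can be sketched as a single pattern match. This is a minimal example, assuming the BaseFont value has already been read as a plain string without the leading slash; the function name is hypothetical.

```python
import re

# A subset tag is six uppercase letters followed by a plus sign.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def describe_font(base_font: str) -> dict:
    """Split a BaseFont name into its display name and subset flag."""
    match = SUBSET_TAG.match(base_font)
    return {
        "name": base_font[match.end():] if match else base_font,
        "subset": match is not None,
    }


print(describe_font("ABCDEF+Helvetica"))
print(describe_font("Helvetica"))
```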

Colour Inspector

Audits colour space usage for print-readiness using five detection layers — covering every type of PDF colour content:

🖼️
Raster images (PyMuPDF extract_image()) Checks every embedded image's colour space via component count: 1 = DeviceGray, 3 = DeviceRGB, 4 = DeviceCMYK. RGB images are flagged — commercial presses expect CMYK, and RGB requires conversion during RIP processing, which can produce unexpected colour shifts.
📐
Vector drawings (PyMuPDF get_drawings()) PyMuPDF preserves the original colour space in drawing colour tuples: 1-component = Gray, 3-component = RGB, 4-component = CMYK. Catches all filled and stroked paths, shapes, and borders.
📝
Content-stream operator analysis Tokenises the raw PDF content stream to detect colour operators: rg/RG (DeviceRGB), k/K (DeviceCMYK), g/G (DeviceGray), and cs/CS for named colour spaces. This layer catches text colours and inline images that neither image extraction nor drawing analysis would detect.
🎨
Resource dictionary traversal Follows the page → Resources → ColorSpace/ExtGState reference chain in the raw PDF object tree to detect Separation (spot), DeviceN, ICCBased, Lab, CalRGB, and CalGray colour spaces, plus overprint (/OP true) and transparency (/ca, /CA, /BM) flags.
🖨️
Ghostscript ink coverage + Total Ink Coverage (TIC) Runs Ghostscript's inkcov device to compute per-page C/M/Y/K percentages. Calculates Total Ink Coverage (TIC = C+M+Y+K) per page and flags any page over 300% — a common press limit beyond which wet ink can cause trapping, drying, and registration problems.
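The Total Ink Coverage arithmetic can be sketched as below. This assumes the inkcov device prints one line per page with four fractional values followed by the word CMYK; the function name, the sample output, and the rounding are illustrative choices, not the service's code.

```python
def total_ink_coverage(inkcov_output: str, limit_percent: float = 300.0):
    """Parse per-page C/M/Y/K fractions and flag pages whose sum exceeds the limit."""
    pages = []
    for line in inkcov_output.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[4] != "CMYK":
            continue  # skip banner lines and anything that is not a coverage row
        c, m, y, k = (float(value) * 100 for value in parts[:4])
        tic = round(c + m + y + k, 2)
        pages.append(
            {"page": len(pages) + 1, "tic": tic, "over_limit": tic > limit_percent}
        )
    return pages


# Hypothetical two-page output.
SAMPLE = """\
 0.80000  0.90000  0.95000  0.50000 CMYK OK
 0.10000  0.10000  0.10000  0.05000 CMYK OK
"""

for page in total_ink_coverage(SAMPLE):
    print(page)
```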

The overall verdict — Print-ready (CMYK only, no RGB) or Requires conversion — is shown at the top of the report alongside the per-page breakdown and structured ink coverage table.
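The content-stream layer can be illustrated with a deliberately naive tokeniser. This sketch splits on whitespace only, so it would miscount operators that appear inside string literals or inline image data; a production parser needs to handle both. The mapping and function name are hypothetical.

```python
from collections import Counter

# Colour-setting operators and the colour space each one implies.
COLOUR_OPERATORS = {
    "rg": "DeviceRGB", "RG": "DeviceRGB",
    "k": "DeviceCMYK", "K": "DeviceCMYK",
    "g": "DeviceGray", "G": "DeviceGray",
    "cs": "named colour space", "CS": "named colour space",
}


def colour_spaces_used(content_stream: bytes) -> Counter:
    """Count colour operators in a decoded (uncompressed) content stream."""
    tokens = content_stream.decode("latin-1").split()
    return Counter(COLOUR_OPERATORS[t] for t in tokens if t in COLOUR_OPERATORS)


stream = b"1 0 0 rg 10 10 100 50 re f 0 0 0 1 k BT /F1 12 Tf (Total) Tj ET 0.5 G"
print(colour_spaces_used(stream))
```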

PDFs do not store tables as data structures — they store text characters at absolute positions and path objects that may or may not form visible borders. pdfplumber reconstructs table structure from these primitives using two strategies in sequence.

Detection strategies

Strategy 1: lines_strict (explicit borders) Detects tables by finding horizontal and vertical line segments drawn by PDF path operators (the l operator in the content stream). In pdfplumber, lines_strict uses only explicit line objects and ignores the sides of rectangles drawn with re. If the PDF was generated from software that draws explicit table borders (Word, Excel, LibreOffice, InDesign), this strategy reliably reconstructs cell boundaries. Applied first; if no tables are found, the fallback runs.
📝
Strategy 2 — Text-position heuristics (fallback) For borderless tables (where structure is implied by text alignment rather than drawn lines), pdfplumber infers columns and rows from the statistical distribution of text bounding boxes. Works on tables from PDF export pipelines that omit explicit borders.

JSON output schema

Output is a single .json file. The first row of each detected table is treated as column headers; subsequent rows become an array of objects keyed by those headers. Multiple tables per page are each represented as separate entries.

table_count: total number of tables found
page_count: total pages in the document
tables[].id: sequential table number
tables[].page: page the table appears on
tables[].rows / .cols: dimensions
tables[].headers: array of column header strings
tables[].data: array of row objects keyed by header
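The schema above can be assembled from raw extracted rows roughly as follows. This is a sketch under two assumptions: the input is a list of (page number, tables) pairs where each table is a list of rows, and the rows count includes the header row. The function names are hypothetical.

```python
import json


def table_to_entry(table_id: int, page: int, rows: list) -> dict:
    """Turn one extracted table into the documented entry shape."""
    headers = [str(cell or "") for cell in rows[0]]
    data = [dict(zip(headers, row)) for row in rows[1:]]
    return {
        "id": table_id,
        "page": page,
        "rows": len(rows),  # assumption: header row is counted
        "cols": len(headers),
        "headers": headers,
        "data": data,
    }


def build_report(page_count: int, tables_by_page: list) -> dict:
    tables = []
    for page, page_tables in tables_by_page:
        for rows in page_tables:
            if rows:
                tables.append(table_to_entry(len(tables) + 1, page, rows))
    return {"table_count": len(tables), "page_count": page_count, "tables": tables}


extracted = [(2, [[["Item", "Qty"], ["Widget", "4"], ["Bolt", "12"]]])]
report = build_report(page_count=3, tables_by_page=extracted)
print(json.dumps(report, indent=2))
```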

Limitations

Scanned PDFs (image-based, no text layer) are not supported — use OCR first, then extract tables. Tables spanning multiple pages are detected as separate tables. Merged cells are flattened to the flat row/column structure.

The Workflow Builder chains multiple PDF operations into a single automated pipeline. Build once, run on any PDF.

How it works

Add steps from the step picker, configure per-step parameters, and drag to reorder. Upload one or more PDFs and run the full pipeline in one click. Each step processes the output of the previous step.
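The chaining behaviour can be sketched as a simple fold over a list of steps. The step functions here are hypothetical stand-ins that only tag the bytes so the order is visible; the workflow shape (a list of objects with op and params keys) is also an assumption, not the tool's actual JSON format.

```python
import json
from typing import Callable, Dict, List


def rotate(pdf: bytes, params: dict) -> bytes:
    return pdf + f"|rotate:{params['angle']}".encode()


def grayscale(pdf: bytes, params: dict) -> bytes:
    return pdf + b"|grayscale"


STEPS: Dict[str, Callable[[bytes, dict], bytes]] = {
    "rotate": rotate,
    "grayscale": grayscale,
}


def run_workflow(pdf: bytes, workflow: List[dict]) -> bytes:
    """Feed each step the output of the previous one."""
    for step in workflow:
        pdf = STEPS[step["op"]](pdf, step.get("params", {}))
    return pdf


workflow = [{"op": "rotate", "params": {"angle": 90}}, {"op": "grayscale"}]
restored = json.loads(json.dumps(workflow))  # export, then import
print(run_workflow(b"%PDF", restored))
```

Because a workflow is plain data, reordering steps or appending a saved workflow is just list manipulation.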

Supported pipeline steps

🔄
Rotate: Rotate all / odd / even / custom range by 90°/180°/270°.
🗜️
Compress: Any of the five quality presets.
💧
Watermark: Text watermark with all placement, opacity, and style parameters.
🛡️
Protect: AES-256 password protection with permission flags.
🔓
Unlock: Remove password protection.
🎨
Grayscale: Convert to grayscale or black-and-white.
📋
Flatten: Bake all form fields and annotations into page content.
🔧
Repair: Reconstruct corrupted or malformed PDFs.
📃
Extract Pages: Keep only the selected pages from the document.
🗑️
Delete Pages: Remove selected pages from the document.
🔀
Reorder Pages: Rearrange pages into a custom order.
🗂️
Convert to PDF/A: Archive-compliant conversion (PDF/A-1b, 2b, or 3b).
✍️
Sign: Three modes: typed visual only; typed visual + digital certificate; digital-only with auto self-signed RSA-2048 cert.
Redact: Permanent text-pattern removal with case-sensitive option and black or white fill.
✂️
Split every N pages Terminal step — outputs a ZIP of equal-sized PDF chunks. Useful for batch-scanned documents or splitting large reports into individual sections.

Saving and composing workflows

💾
Save named workflows to localStorage Saved locally in your browser — no server, no account, no sync.
📥
Load vs. Append Load replaces the current workflow. + Append joins a saved workflow onto the end of the current one — letting you compose complex pipelines from saved building blocks.
📤
Export / import as JSON Share workflows with colleagues or version-control them alongside your documents.
⚙️
Workflow Builder

Chain 15 operations, save named pipelines to localStorage, export/import as JSON, and run on multiple PDFs in a single job.

Try it →

How every request is handled — from upload to download to deletion.

Request lifecycle

1 Upload: File arrives over HTTPS/TLS 1.3+ with HSTS preload. Magic-byte check: first 4 bytes must be %PDF for PDF operations. Size checked: 50 MB per file, 200 MB total.
2 Isolation: A private temp directory is created: sys_get_temp_dir() . '/pdftool_' . bin2hex(random_bytes(12)) with permissions 0700, so no other user account on the server can read or write it.
3 Processing: The appropriate engine runs inside the temp directory, wrapped in a four-layer process isolation sandbox (see below). Every shell command receives paths via escapeshellarg(). A 120-second timeout wraps every external process. At most 4 heavy jobs run simultaneously.
4 Stream: readfile($path) begins streaming the output to your browser over the existing HTTP connection.
5 Delete: cleanup() is called immediately after readfile() returns. The temp directory and all its contents are deleted while your download is still in flight. There is no retention window.
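The isolate, process, delete pattern can be sketched in Python as a context manager. The service itself is described in PHP terms, so this is an analogue rather than its code; job_directory is a hypothetical name, and the example assumes a POSIX filesystem.

```python
import os
import secrets
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def job_directory(prefix: str = "pdftool_"):
    """Create a private per-request directory and always remove it afterwards."""
    # 12 random bytes become 24 hex characters, like bin2hex(random_bytes(12)).
    path = os.path.join(tempfile.gettempdir(), prefix + secrets.token_hex(12))
    os.mkdir(path, mode=0o700)  # owner-only access
    try:
        yield path
    finally:
        # Runs on success and on error, so nothing is left for a cleanup job.
        shutil.rmtree(path, ignore_errors=True)


with job_directory() as workdir:
    output = os.path.join(workdir, "result.pdf")
    with open(output, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    print(os.path.exists(output))

print(os.path.exists(workdir))
```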

Security controls

🛡️
Strict Content Security Policy

Every page generates two fresh random nonces per request. script-src allows only 'nonce-{ext}' and 'nonce-{inline}'. No unsafe-inline, no unsafe-eval. style-src 'self' — no inline styles anywhere in any HTML, including this page.
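A per-request nonce policy of this shape could be generated as in the sketch below. The function name and the 16-byte nonce length are assumptions; only the directive structure follows the description above.

```python
import secrets


def build_csp():
    """Return a CSP header value plus the two nonces to stamp onto script tags."""
    ext_nonce = secrets.token_urlsafe(16)     # for external script tags
    inline_nonce = secrets.token_urlsafe(16)  # for inline script blocks
    policy = (
        f"script-src 'nonce-{ext_nonce}' 'nonce-{inline_nonce}'; "
        "style-src 'self'"
    )
    return policy, ext_nonce, inline_nonce


policy, ext_nonce, inline_nonce = build_csp()
print(policy)
```

Because the nonces change on every request, an injected script cannot guess a valid value ahead of time.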

⏱️
Rate Limiting & Concurrency Control

Two independent rate-limiting layers run on every request. Session-based: 10 operations per 5-minute sliding window per browser session. IP-based: 30 operations per 5-minute window per source IP — generous enough for shared NAT networks but still bounds individual abusers. Both are backed by Redis with filesystem fallback so limits are always enforced. Polling and keepalive operations (edit-ping, pdf-scan-poll, esign-status) are exempt from both limits to avoid blocking live progress UIs. Returns HTTP 429 when either limit is exceeded.
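The sliding-window behaviour can be sketched with an in-memory limiter. The real system is described as Redis-backed with a filesystem fallback, so this class is only an illustration of the counting logic; the class and function names are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` operations per key within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


session_limiter = SlidingWindowLimiter(limit=10, window_seconds=300)
ip_limiter = SlidingWindowLimiter(limit=30, window_seconds=300)


def status_for(session_id: str, ip: str, now: float = None) -> int:
    """Return 200 if both layers allow the operation, otherwise 429."""
    allowed = session_limiter.allow(session_id, now) and ip_limiter.allow(ip, now)
    return 200 if allowed else 429


print([status_for("session-a", "203.0.113.7", now=t) for t in range(11)])
```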

A third layer enforces server-wide concurrency: at most 4 heavy operations execute simultaneously. When that limit is reached the server returns HTTP 503 ("Server is busy — please try again shortly"). Lightweight operations and status polls are exempt. All limit breaches are recorded as structured security events.

📁
File Validation

Two-step validation before any processing: (1) magic-byte check: the first 4 bytes must be %PDF; (2) secondary structural parse via pdfinfo, which must be able to read the file's cross-reference table and recognise a valid PDF. Both checks must pass; a file that starts with %PDF but contains no valid PDF structure is rejected. Repeated failures within a session are counted, and three consecutive failures trigger a security event log entry. MIME type validated against allowlist. No user-controlled string reaches the shell without escapeshellarg(). Page range inputs are validated against /^\d+$/ before any integer conversion.
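The magic-byte check and the digits-only rule can be sketched as follows. The upper-bound check against the page count and both function names are assumptions added for illustration; the structural pdfinfo step is not shown.

```python
import re

# Digits only; ASCII flag so that non-Latin digit characters are rejected too.
PAGE_NUMBER = re.compile(r"\d+", re.ASCII)


def has_pdf_magic(data: bytes) -> bool:
    """First of the two checks: the file must start with %PDF."""
    return data[:4] == b"%PDF"


def parse_page_number(raw: str, page_count: int) -> int:
    """Validate the string form before converting it to an integer."""
    if not PAGE_NUMBER.fullmatch(raw):
        raise ValueError("page number must contain digits only")
    page = int(raw)
    if not 1 <= page <= page_count:
        raise ValueError("page number is outside the document")
    return page


print(has_pdf_magic(b"%PDF-1.7\n"), has_pdf_magic(b"PK\x03\x04"))
print(parse_page_number("3", page_count=10))
```

Rejecting anything but digits before conversion means a value such as "3; rm -rf /" never reaches integer parsing, let alone a shell.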

🛡️
Four-Layer Process Isolation Sandbox

Every invocation of a heavy external tool — Ghostscript, Python, LibreOffice, Playwright, ImageMagick — passes through a mandatory four-layer sandbox chain. The architecture is sandbox-by-default: new tools are sandboxed automatically; an explicit opt-out is required to exempt a tool (only four read-only helpers are exempt: pdfinfo, qpdf, pdfseparate, pdftotext).

Layer 1 — prlimit: kernel-enforced resource caps applied before any process image loads: 1.5 GB virtual memory (RLIMIT_AS), 512 MB max file write (RLIMIT_FSIZE), 256 processes (RLIMIT_NPROC), 512 open file descriptors (RLIMIT_NOFILE).

Layer 2 — AppArmor aa-exec: transitions the process into the pqpdf-unshare mandatory-access-control profile. Required on Ubuntu 24.04+ where user namespace creation is gated behind the AppArmor userns permission. The profile grants only what unshare and the sandbox script need; all other filesystem writes are denied.

Layer 3 — unshare (Linux namespaces): creates isolated kernel namespaces. --user --map-root-user — the process believes it is root but holds no real capabilities. --net — private network stack with no interfaces; the tool cannot connect to the internet or any internal service; any connect() syscall fails. --pid --fork — isolated PID tree; child processes cannot escape to the host. --ipc — private shared memory and message queues. --mount — private mount namespace so bind-mounts are invisible to the host.

Layer 4 — pqpdf-sandbox script: runs inside the new namespaces, mounts a 512 MB tmpfs as scratch space so all I/O happens in-memory and vanishes when the namespace exits, bind-mounts the job directory into the scratch tmpfs, applies a CPU time limit via ulimit -t (enforced after the PID namespace fork, avoiding a kernel sigprocmask conflict), then execs the real tool binary. No shell remains after exec.

SANDBOX_MIN_LEVEL = 'full' in production — if any layer is unavailable the operation fails rather than running unsandboxed. Degraded execution is always logged as a security event.
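The order of the four layers can be pictured as one nested command line. The sketch below only builds the argument list; the prlimit, aa-exec, and unshare options are standard ones, while the pqpdf-sandbox arguments, the byte values chosen for 1.5 GB and 512 MB, and the function name are assumptions.

```python
def sandbox_argv(tool_argv: list, job_dir: str) -> list:
    """Wrap a tool invocation in the four layers, outermost first."""
    mib = 1024 * 1024
    return [
        # Layer 1: kernel resource caps.
        "prlimit",
        f"--as={1536 * mib}",
        f"--fsize={512 * mib}",
        "--nproc=256",
        "--nofile=512",
        # Layer 2: AppArmor profile transition.
        "aa-exec", "-p", "pqpdf-unshare", "--",
        # Layer 3: fresh user, network, PID, IPC, and mount namespaces.
        "unshare", "--user", "--map-root-user", "--net",
        "--pid", "--fork", "--ipc", "--mount",
        # Layer 4: sandbox script (hypothetical argument layout), then the tool.
        "pqpdf-sandbox", job_dir, "--",
        *tool_argv,
    ]


argv = sandbox_argv(["gs", "-dBATCH", "-dNOPAUSE", "in.pdf"], "/tmp/pdftool_example")
print(" ".join(argv))
```

Passing the result to a process launcher as a list, rather than as a single shell string, keeps user-controlled paths out of shell parsing.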

🔐
Transport Security

HTTP/3 over QUIC v1 (RFC 9000) — primary protocol. TLS 1.3 only; TLS 1.0, 1.1, and 1.2 disabled — no downgrade possible. Key exchange uses X25519MLKEM768 hybrid post-quantum cryptography (NIST FIPS 203). Cipher suite: TLS_AES_256_GCM_SHA384. Certificate: Let's Encrypt ECDSA + SHA-384, CT-logged. HSTS preload eligible (max-age=31536000; includeSubDomains; preload). Full transport details →

📋
Security Event Logging

Security-relevant events are written as structured NDJSON to /var/log/pqpdf/security.ndjson — one event per line, ingestible by Elasticsearch, Loki, Datadog, or jq. Events logged: invalid HTTP method, unknown operation, session rate limit breach, IP rate limit breach, concurrency limit reached, file size exceeded, total upload size exceeded, repeated PDF validation failures (threshold: 3 consecutive), malformed page range input. Every entry carries a hashed session token (first 12 hex chars of sha256(session_id()) — stable but cannot be used to hijack the session), IP address, operation name, and sanitised user-agent string. Falls back to error_log() if the log file is not writable so no event is silently dropped. A live Security Dashboard at /security-dashboard.php presents aggregated telemetry — event timeline, activity heatmap, top source IPs, and a filterable event log table with CSV/JSON export. The dashboard is token-gated via the PQPDF_DASHBOARD_TOKEN environment variable.
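One NDJSON entry with a hashed session token might be produced as in this sketch. The field names, the user-agent sanitisation rule, and the function name are assumptions; only the 12-character SHA-256 prefix follows the description above.

```python
import hashlib
import json
import time


def security_event(event: str, session_id: str, ip: str,
                   operation: str, user_agent: str) -> str:
    """Serialise one security event as a single NDJSON line."""
    # Stable per session, but useless for hijacking the session itself.
    token = hashlib.sha256(session_id.encode()).hexdigest()[:12]
    # Keep printable characters only and cap the length.
    clean_agent = "".join(ch for ch in user_agent if ch.isprintable())[:200]
    record = {
        "ts": int(time.time()),
        "event": event,
        "session": token,
        "ip": ip,
        "operation": operation,
        "user_agent": clean_agent,
    }
    return json.dumps(record, separators=(",", ":"))


line = security_event(
    "session_rate_limit", "abc123", "203.0.113.7", "compress", "curl/8.5\n"
)
print(line)
```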

🚫
Spam & Bot Protection

The contact form layers four independent defences: (1) AI behavioural verification — client-side analysis of interaction patterns before the submit button is enabled; (2) honeypot fields — two hidden inputs invisible to humans are sent with every submission; any non-empty value causes the server to reject the request via SpamException; (3) server-side spam pattern matching — pharmaceutical keywords, excessive capitalisation, disposable email domains, and common bot phrases; (4) IP-based rate limit — maximum 5 submissions per hour per IP, enforced in PostgreSQL before any email is dispatched.

Engine stack

All engines run locally. No file data is ever sent to a third-party service.

Engine | Used by | External calls?
Ghostscript | Compress, watermark, rotate, protect, unlock, flatten, grayscale, repair, PDF/X | None
Poppler | Merge, split, extract text, to-images, PDF info | None
qpdf | Protect/unlock, structural analysis (scanner) | None
LibreOffice | All Office ↔ PDF conversions (Word, Excel, PowerPoint, ODT, ODS, ODP) | None
Playwright / Chromium | HTML → PDF (URL and file modes, JavaScript rendering, lazy-load, web fonts) | None (sandboxed)
ImageMagick | Images → PDF, typed signature rendering | None
Tesseract 5 LSTM | OCR PDF | None
PyMuPDF 1.27 | Edit, fill, nup, deskew, outline, a11y, font/colour inspect, PDF info, scanner engines 1–9 | None
pymupdf4llm | PDF → Markdown | None
python-pptx | PDF → PowerPoint | None
pdfplumber | Tables to JSON | None
pyhanko 0.34 | PAdES / Sign PDF (incremental CMS/PKCS#7) | None
endesive | Visual + crypto sign modes | None
ExifTool 12 | Scanner engine 10 | None
YARA 4.5 | Scanner engine 12 (24 custom rules + external .yar support) | None
ClamAV 1.4+ | Scanner engine 15 (700k+ signatures) | Signature updates only (clamav.net)
PeePDF 0.4 | Scanner engine 13 | None
prlimit + AppArmor aa-exec + unshare + pqpdf-sandbox | Four-layer process isolation sandbox: wraps every heavy tool invocation; also used explicitly by Scanner engine 14 for the dynamic behavioral sandbox with strace syscall tracing | None (network namespace isolates all tools)
pikepdf | Scanner engine 13 (supplemental PDF parser: JS Names tree, EmbeddedFiles, per-page AA) | None
scikit-learn + LightGBM | Scanner engine 16 (IsolationForest + RandomForest + LightGBM ensemble, model drift detection) | None
Acorn (Node.js) | Scanner engine 19 (JS AST deobfuscation, ECMAScript 2022, 6 iterative deobfuscation passes) | None
imagehash | Scanner engines 24, 39 (pHash perceptual similarity for campaign attribution · LSB chi-square steganalysis and tracking beacon detection) | None
Node.js vm | Scanner engine 41 (JS behavioral emulation: sandboxed Acrobat API stub, runtime call interception: app.launchURL, this.submitForm, app.openDoc) | None
python-tlsh | Scanner engine 24 (TLSH locality-sensitive hash for campaign clustering) | None
@noble/post-quantum | Protect PDF, PQC mode (runs in browser) | None
Full technical details — temp-dir lifecycle, TLS configuration, CSP nonce implementation, ML data policy, and vulnerability reporting contact: legal/security.php

PQ PDF runs behind PQCrypta Proxy — a Rust-based QUIC proxy (built on quinn) that provides HTTP/3, WebTransport, and post-quantum hybrid TLS at the network layer. Every connection uses TLS 1.3 with X25519MLKEM768 hybrid key exchange — the same algorithm now deployed by Chrome, Firefox, and Cloudflare. TLS 1.0, 1.1, and 1.2 are disabled entirely.

HTTP/3 Primary protocol
QUIC v1 RFC 9000
TLS 1.3 Only — 1.0/1.1/1.2 off
X25519MLKEM768 PQ hybrid key exchange
48 ms TLS handshake
3 ms TTFB
0.00% Packet loss

Post-quantum hybrid key exchange

X25519MLKEM768 combines a classical algorithm with a post-quantum algorithm. Both must be broken simultaneously for the key exchange to be compromised.

🧬
ML-KEM-768 NIST FIPS 203

The post-quantum half. ML-KEM-768 (formerly Kyber-768) is a lattice-based key encapsulation mechanism standardised by NIST in August 2024. It targets NIST security category 3, roughly comparable to AES-192, and is designed to resist key recovery by both classical computers and cryptographically relevant quantum computers.

🔑
X25519 Classical ECDH

The classical half. X25519 (Curve25519 Diffie-Hellman) is one of the fastest and most widely audited elliptic-curve key exchanges in production use. Constant-time arithmetic eliminates timing side-channels. No practical classical attack against it is known.

🧷
Hybrid binding

The final session key is derived from the output of both components. An adversary must break both X25519 and ML-KEM-768 to recover the key. If either algorithm is later found broken, the connection is still protected by the other — forward security is maintained at two independent levels.
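The binding idea can be illustrated by deriving one key from both shared secrets. This sketch uses a single HMAC-SHA-384 extraction over their concatenation; it is not the TLS 1.3 key schedule, and the concatenation order, the all-zero salt, and the function name are assumptions for illustration.

```python
import hashlib
import hmac


def combine_shared_secrets(mlkem_secret: bytes, x25519_secret: bytes,
                           salt: bytes = b"\x00" * 48) -> bytes:
    """Derive one key that depends on both component secrets."""
    combined = mlkem_secret + x25519_secret  # order is an illustrative assumption
    return hmac.new(salt, combined, hashlib.sha384).digest()


# Placeholder 32-byte secrets; real ones come from the two key exchanges.
mlkem = bytes(range(32))
x25519 = bytes(range(32, 64))

key = combine_shared_secrets(mlkem, x25519)
print(key.hex())
```

An attacker who learns only one of the two inputs still faces an unknown 32-byte value inside the hash input, which is the property the hybrid construction relies on.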

"Harvest now, decrypt later" protection

Nation-state actors are known to archive encrypted traffic today, intending to decrypt it once a sufficiently powerful quantum computer exists. X25519MLKEM768 is designed to defeat that strategy: recovering the session key from captured ciphertext would require breaking ML-KEM-768 as well as X25519.

🧪
Is your site post-quantum ready?

Check whether your server negotiates X25519MLKEM768 or another PQC hybrid key exchange — and whether TLS 1.2 is still reachable.

Test PQC readiness →

Protocol stack

Clients negotiate the highest protocol they support. HTTP/3, HTTP/2, and HTTP/1.1 are advertised via ALPN and Alt-Svc, and WebTransport runs over the HTTP/3 connection. All enforce TLS 1.3; there is no downgrade path to TLS 1.2.

Protocol | Transport | TLS | ALPN | Multiplexing | HOL blocking
HTTP/3 (Primary) | QUIC v1 (UDP) | TLS 1.3 (QUIC built-in) | h3 | ✔ Stream-level | ✔ Eliminated
WebTransport | QUIC v1 (UDP) | TLS 1.3 (QUIC built-in) | h3 | ✔ Bidirectional streams + datagrams | ✔ Eliminated
HTTP/2 (Fallback) | TCP | TLS 1.3 | h2 | ✔ Stream-level | ⚠️ TCP-level HOL remains
HTTP/1.1 (Legacy) | TCP | TLS 1.3 | http/1.1 | ✘ None | ⚠️ Request + TCP HOL
Does your site support HTTP/3, QUIC, and WebTransport?

The PQCrypta scanner checks HTTP/3 negotiation, QUIC version, WebTransport availability, Alt-Svc headers, and 0-RTT configuration.

Scan HTTP/3 & QUIC →

QUIC v1 — security hardening

RFC 9000 defines a suite of anti-abuse mechanisms. All are enabled.

🚫
0-RTT disabled Secure

Zero-RTT session resumption is intentionally off. While 0-RTT reduces latency for repeat connections, it opens a replay window where an adversary can re-submit captured early data. Disabling it eliminates this class of attack. All connections use full 1-RTT handshakes.

📝
Address validation & Retry tokens

Before allocating connection state, the proxy issues a RETRY packet with a server-generated token. The client must echo this token in its next Initial, proving it controls the claimed source address. Prevents IP spoofing and connection-state exhaustion attacks.

📡
Anti-amplification limit RFC 9000 §8.1

The server sends no more than 3× the bytes received from an unvalidated client address. This prevents the QUIC handshake from being weaponised as a UDP amplification vector — a significant concern for protocols that can send large responses to small initial packets.
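The 3× rule amounts to a small byte budget per unvalidated address, sketched below. The class and method names are hypothetical; only the factor of three comes from the text above.

```python
class UnvalidatedPath:
    """Track how many bytes may be sent to an address that is not yet validated."""

    AMPLIFICATION_FACTOR = 3

    def __init__(self):
        self.received = 0
        self.sent = 0

    def on_receive(self, size: int) -> None:
        self.received += size

    def sendable(self) -> int:
        return max(0, self.AMPLIFICATION_FACTOR * self.received - self.sent)

    def try_send(self, size: int) -> bool:
        """Send only if the budget allows it; otherwise wait for more client data."""
        if size > self.sendable():
            return False
        self.sent += size
        return True


path = UnvalidatedPath()
path.on_receive(1200)        # one client Initial datagram
print(path.sendable())
print(path.try_send(3000))
print(path.try_send(1000))
```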

🔄
Stateless reset

If the proxy loses connection state (e.g. after a restart), it sends a STATELESS_RESET to the client, cleanly terminating the connection rather than leaving the client retransmitting into a broken session indefinitely.

🗺️
Connection migration

QUIC connections are identified by a Connection ID, not the IP/port 4-tuple. A client that switches from Wi-Fi to mobile data mid-upload continues the same logical connection without starting over. Connection migration is enabled with address re-validation on path change.

🧪
GREASE RFC 8701

The proxy sends Generate Random Extensions And Sustain Extensibility values in TLS extension slots. This prevents middleboxes and TLS stacks from hardcoding assumptions about which extension IDs are valid — keeping the protocol extensible as new standards are adopted.

CUBIC congestion control RFC 9002

QUIC-native loss recovery (RFC 9002) with the CUBIC congestion controller, which is itself specified in RFC 9438. Initial congestion window: 12,000 bytes. Path MTU Discovery (PMTUD) is enabled: the proxy probes for the optimal UDP payload size (measured MTU 1,452 bytes, UDP MTU 1,200 bytes, datagram payload 1,162 bytes).

ECH — not yet supported

Encrypted Client Hello (developed in the IETF TLS working group as draft-ietf-tls-esni) would encrypt the SNI field, hiding the target hostname from network observers. It is not yet supported by PQCrypta Proxy. TLS 1.3 encrypts the handshake messages that follow the ServerHello, but the ClientHello, including the SNI, remains visible to on-path observers.

WebTransport

Available on port 443 (path /) and port 4433. WebTransport runs over HTTP/3 and exposes QUIC streams and unreliable datagrams directly to browser code.

📦
Multiplexed streams

Unlike WebSockets (which layer over HTTP/1.1 TCP), WebTransport streams are independent QUIC streams with no head-of-line blocking. Multiple large file operations can transfer in parallel — one slow stream does not stall others.

Unreliable datagrams

Beyond streams, WebTransport supports fire-and-forget datagrams (max 1,162 bytes each). Ideal for low-latency signals — live progress events, cancellation, real-time preview requests — where retransmitting stale data would add unnecessary latency.

🔐
Inherits QUIC security

Every WebTransport session shares the underlying QUIC connection's TLS 1.3 encryption, X25519MLKEM768 key exchange, address validation, and anti-amplification hardening. No separate security layer to configure.

TLS 1.3 — cipher & certificate

| Parameter | Value | Notes |
| --- | --- | --- |
| Cipher suite | TLS_AES_256_GCM_SHA384 | AES-256-GCM authenticated encryption; 256-bit key; SHA-384 transcript hash |
| Key exchange | X25519MLKEM768 | Hybrid: X25519 classical ECDH + ML-KEM-768 post-quantum (NIST FIPS 203) |
| Signature | ecdsa-with-SHA384 | Certificate signed with ECDSA + SHA-384; P-384 curve |
| Certificate issuer | Let's Encrypt E8 | Free public CA; certificate transparency logged; 90-day auto-renewal |
| ALPN | h3 | HTTP/3 negotiated via TLS ALPN extension |
| TLS versions | TLS 1.3 only | TLS 1.0, 1.1, and 1.2 explicitly disabled; no downgrade possible |
| GREASE | Enabled | Random extension values injected to prevent middlebox ossification (RFC 8701) |
| Handshake RTTs | 1-RTT only | 0-RTT disabled; full handshake on every new connection, so there is no replay window |
| ECH | Not yet supported | SNI remains visible to on-path observers; ECH support is planned |
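A client can mirror the "TLS 1.3 only" row on its own side. The sketch below uses Python's standard `ssl` module, which speaks TLS over TCP only; it has no QUIC support and does not negotiate the h3 ALPN or the hybrid key exchange listed above, so it illustrates the version policy rather than a way to reach the proxy over HTTP/3. The function name is hypothetical.

```python
import ssl


def tls13_only_context() -> ssl.SSLContext:
    """Client context that refuses anything older than TLS 1.3."""
    ctx = ssl.create_default_context()  # certificate and hostname checks on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


ctx = tls13_only_context()
```

With the minimum version pinned, a handshake against a server that offers only TLS 1.2 or older fails instead of silently downgrading.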

Connection & performance metrics

Measured by PQCrypta scanner against pqpdf.com, March 2026.

| Metric | Value | Notes |
| --- | --- | --- |
| TLS handshake | 48 ms | Full 1-RTT QUIC Initial + Handshake packet exchange |
| TTFB | 3 ms | Time to first byte from proxy after handshake |
| RTT | 0 ms | Sub-millisecond measured round-trip time |
| Packet loss | 0.00% | 0 of 16 packets lost during scan |
| Congestion control | CUBIC | RFC 9002 QUIC loss recovery + CUBIC CC algorithm |
| Initial CWND | 12,000 bytes | ~8 QUIC packets before ACK feedback required |
| Max stream data | 12,000 bytes | Initial per-stream flow control window |
| MTU / UDP MTU | 1,452 / 1,200 bytes | PMTUD enabled; max datagram payload 1,162 bytes |
| Idle timeout | 20 s | Server-initiated close after 20 s of inactivity |
| Proxy processing | 2.64 ms | PQCProxy internal duration (Server-Timing: proxy;dur=2.64) |

HTTP/3 response headers

Headers sent on every HTTP/3 response that convey transport metadata, observability, and client hint negotiation.

| Header | Value / Purpose |
| --- | --- |
| Alt-Svc | h3=":443"; ma=86400, h3=":4434"; ma=86400 (advertises HTTP/3 on ports 443 and 4434; browsers cache for 24 hours) |
| Server-Timing | proxy;dur=2.64;desc="PQCProxy Processing", quic;desc="QUIC v1" |
| Priority | u=3 (RFC 9218 Extensible Prioritisation Scheme; urgency 3, the default) |
| Accept-CH | DPR · Viewport-Width · Width · ECT · RTT · Downlink · Sec-CH-UA-Platform · Sec-CH-UA-Mobile (client hints for adaptive responses) |
| NEL | Network Error Logging configured; browsers report transport failures to the Report-To endpoint |
| Report-To | Reporting API endpoint for NEL, CSP violation, and COOP violation reports |
| 103 Early Hints | Supported (an informational status code rather than a header); the server can send Link: preload hints before the full response is ready |
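The Alt-Svc and Server-Timing values shown above are easy to read programmatically. The sketch below is a deliberately simple parser for values of this shape: it splits on commas and semicolons, so it assumes no quoted string contains either character. The function names are hypothetical.

```python
from typing import Dict, List


def parse_alt_svc(value: str) -> List[dict]:
    """Split an Alt-Svc value into protocol, authority, and max-age entries."""
    entries = []
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        protocol, _, authority = parts[0].partition("=")
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        entries.append({
            "protocol": protocol,
            "authority": authority.strip('"'),
            "max_age": int(params["ma"]) if "ma" in params else None,
        })
    return entries


def parse_server_timing(value: str) -> Dict[str, dict]:
    """Map each Server-Timing metric name to its dur/desc fields."""
    metrics: Dict[str, dict] = {}
    for item in value.split(","):
        name, *params = [p.strip() for p in item.split(";")]
        fields = {}
        for param in params:
            key, _, raw = param.partition("=")
            raw = raw.strip('"')
            fields[key] = float(raw) if key == "dur" else raw
        metrics[name] = fields
    return metrics


alt = parse_alt_svc('h3=":443"; ma=86400, h3=":4434"; ma=86400')
timing = parse_server_timing(
    'proxy;dur=2.64;desc="PQCProxy Processing", quic;desc="QUIC v1"'
)
hours_cached = alt[0]["max_age"] / 3600  # 24.0
```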

Implementation

PQCrypta Proxy v0.2.1 is a purpose-built Rust proxy using the quinn library, a widely used pure-Rust QUIC implementation. Rated A++ / HTTP/3 Ultimate by the PQCrypta scanner with 95% confidence: "Post-Quantum ready, Rust QUIC (quinn), HTTP/3 RFC 9114, Standard port (443)".