What is OCR and when do I need it?

OCR (Optical Character Recognition) converts images of text into machine-readable text. Use it when your PDF is scanned or image-based, meaning it contains pictures of text rather than selectable text. If you can already highlight and copy text in your PDF, use Extract Text instead.

What output formats does this OCR tool produce?

Three formats: Plain Text (.txt) containing all extracted text; a Searchable PDF, in which an invisible text layer is added over your original scanned images so the document becomes searchable and copyable; or a ZIP archive containing both.

What DPI should I choose for best OCR accuracy?

200 DPI is the recommended balance of speed and accuracy. 150 DPI is faster and works well for clear, high-contrast scans. 300 DPI gives the best accuracy for small text, faded scans, or low-quality originals, but takes significantly longer. Most modern scanned PDFs are already stored at 300 DPI internally, so 200 DPI rendering is usually sufficient.
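To see why higher DPI costs more time, it helps to work out how many pixels each setting produces. The sketch below assumes a US Letter page (612 x 792 PDF points, at 72 points per inch); the function name `rendered_size` is illustrative and not part of this tool.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 72 per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Return the (width, height) in pixels of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# Assumed example page: US Letter, 612 x 792 points (8.5 x 11 inches).
for dpi in (150, 200, 300):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the pixel count, so a 300 DPI render holds four times as many pixels as a 150 DPI render of the same page.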

What is Page Segmentation Mode (PSM)?

PSM tells Tesseract how to interpret the layout of each page. Auto works for most documents with mixed content, including multi-column layouts such as newspapers. Single Column treats the page as one column of text in varying sizes. Single Block works best for pages of pure, uniform text. Sparse Text finds text anywhere on the page regardless of layout, which makes it a good fit for forms, receipts, and pages with scattered text.
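As a rough illustration, the four modes plausibly correspond to the following Tesseract `--psm` values. The numeric values and their descriptions come from Tesseract's own documentation, but the mapping to this tool's mode names, the dictionary `PSM_BY_MODE`, and the helper `tesseract_config` are assumptions, not the tool's actual code.

```python
# Assumed mapping from the tool's mode names to Tesseract --psm values.
PSM_BY_MODE = {
    "auto": 3,           # fully automatic page segmentation, no orientation detection
    "single_column": 4,  # a single column of text of variable sizes
    "single_block": 6,   # a single uniform block of text
    "sparse_text": 11,   # find as much text as possible, in no particular order
}


def tesseract_config(mode: str = "auto") -> str:
    """Build a Tesseract option string for the chosen segmentation mode."""
    if mode not in PSM_BY_MODE:
        raise ValueError(f"unknown segmentation mode: {mode!r}")
    # --oem 1 selects the LSTM neural network engine only.
    return f"--oem 1 --psm {PSM_BY_MODE[mode]}"


print(tesseract_config("sparse_text"))
```

A string like this is what a wrapper such as pytesseract accepts through its `config` argument.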

OCR PDF — Free Online PDF Text Recognition

🔎

About OCR PDF

This tool runs Tesseract 5 LSTM neural network OCR on your PDF. Each page is rendered to a high-resolution image, then OCR'd to extract recognised text. Best for scanned documents, photographed pages, and image-based PDFs. For PDFs that already have a selectable text layer, use Extract Text instead — it is faster and more accurate for those files.

⏱️

Processing Time

OCR is compute-intensive. Expect 2–8 seconds per page depending on DPI and page complexity. A 10-page document at 200 DPI typically completes in under 60 seconds. Processing runs entirely server-side — your files are deleted immediately after download.
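The per-page figure above gives a quick way to estimate a job's duration. This is simple arithmetic on the quoted 2 to 8 second range, not a measurement; the function name `estimate_ocr_seconds` is illustrative.

```python
def estimate_ocr_seconds(
    pages: int, low_per_page: float = 2.0, high_per_page: float = 8.0
) -> tuple[float, float]:
    """Return a (best case, worst case) duration in seconds for a job."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return pages * low_per_page, pages * high_per_page


low, high = estimate_ocr_seconds(10)
print(f"10 pages: roughly {low:.0f} to {high:.0f} seconds")
```

The upper bound assumes every page hits the slowest case, which is more likely at 300 DPI or on complex layouts than at the default 200 DPI.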

🔎

Drop your scanned PDF here or click to browse

Scanned PDFs, image-based PDFs, photographed documents — up to 50 MB

Scan Resolution (DPI)

150 DPI (Fast): clear, high-contrast scans
200 DPI (Balanced): recommended
300 DPI (Best Quality): small text, faded or poor scans

Page Segmentation Mode

Auto works for most documents. Use Sparse Text for forms, receipts, or mixed-content pages.

Output Format

Plain Text (.txt): raw extracted text, previewable in the browser
Searchable PDF: original images plus an invisible text layer (copy, search, select text)
Both: ZIP archive containing the .txt file and the searchable .pdf

Pages to OCR

All pages
Custom page range

Page Range

🧠

Tesseract 5 LSTM Engine

State-of-the-art neural network OCR — trained on millions of document samples for high character recognition accuracy.

📄

Searchable PDF Output

Adds an invisible text layer to your scanned images — the original appearance is preserved while text becomes copyable and searchable.

🔬

150 / 200 / 300 DPI Control

Match rendering resolution to your scan quality. 200 DPI is the recommended balance; 300 DPI maximises accuracy for small or faded text.

📐

4 Page Segmentation Modes

Auto, single column, single block, and sparse text — choose how Tesseract reads your page layout for better results on forms, receipts, and columns.

👁️

In-Browser Text Preview

Read the extracted text directly in the results panel without downloading — see immediately whether OCR succeeded before saving the file.

📊

Confidence Score & Word Count

Every job returns the mean of Tesseract's per-word confidence scores across all pages, plus the word count and character count, so you can judge how well OCR performed.
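A minimal sketch of how such a summary could be computed from per-word results. It assumes the words and confidences arrive as two parallel lists, in the shape of Tesseract's tabular output where layout-only rows carry a confidence of -1 and no text. The function name `summarise_ocr` and the sample data are hypothetical, and counting only recognised words towards the character count is an assumption about the tool.

```python
def summarise_ocr(words: list[str], confidences: list[float]) -> dict:
    """Average per-word confidence and count words and characters."""
    # Keep only real words: non-empty text with a non-negative confidence.
    kept = [(w, c) for w, c in zip(words, confidences) if w.strip() and c >= 0]
    word_count = len(kept)
    char_count = sum(len(w) for w, _ in kept)
    mean_conf = sum(c for _, c in kept) / word_count if word_count else 0.0
    return {
        "mean_confidence": mean_conf,
        "word_count": word_count,
        "char_count": char_count,
    }


# Hypothetical sample: the empty entry stands for a layout row with conf -1.
summary = summarise_ocr(["Invoice", "", "Total", "42"], [96, -1, 88, 71])
print(summary)
```

Skipping the -1 rows matters: including them would drag the average down without reflecting any recognised text.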

✂️

Custom Page Ranges

Target specific pages (e.g. 1–3, 5, 8–12) rather than the entire document — saves time on long scanned books where you only need certain pages.
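A range string in this style can be expanded into individual page numbers along these lines. This is a sketch under stated assumptions: the function name `parse_page_ranges`, the acceptance of both hyphens and en dashes, and the validation rules are illustrative rather than the tool's actual behaviour.

```python
import re


def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as '1-3, 5, 8-12' into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        # A single page, or two pages joined by a hyphen or an en dash.
        match = re.fullmatch(r"(\d+)(?:\s*[-\u2013]\s*(\d+))?", part)
        if match is None:
            raise ValueError(f"invalid page range: {part!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_ranges("1-3, 5, 8-12", page_count=12))
```

Collecting pages in a set means overlapping ranges such as "1-3, 2-4" are processed once per page rather than twice.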

📚

Up to 100 Pages Per Job

Processing runs entirely server-side, so browser memory limits do not apply. Pages are handled one at a time to prevent disk exhaustion on large documents.

🔒

Zero Retention

Your file and all OCR output are deleted from the server immediately after the download begins — nothing is stored, logged, or retained.