Extract Text from a PDF (and Why Scans Fail)

Sometimes copying text out of a PDF works perfectly, sometimes it produces scrambled columns, and sometimes nothing selects at all. All three behaviors have one explanation: what a PDF calls “text” is not text the way a document editor means it, and some PDFs contain no text whatsoever, just photographs of pages. This guide explains what an extractor can honestly pull out, why scans fail, and how to tell which kind of file you have in five seconds. Our free PDF text extractor does the pulling, in your browser, with no upload.

What the “text layer” actually is

A PDF page does not store sentences; it stores drawing instructions: place this glyph at these coordinates, in this font, at this size. The “text layer” is the sum of those instructions plus a mapping table that says which Unicode character each glyph represents. Extraction walks the instructions, applies the mapping, and reassembles characters into lines. When the producing software wrote a clean mapping, extraction is essentially perfect. When it did not, you get the famous symptoms: missing spaces (the page never drew a space character, just left a gap), odd characters where a ligature such as “fi” was drawn as a single glyph, and gibberish from fonts whose mapping is broken or missing. The extractor reports what the file actually contains, which is sometimes worse than what your eyes reconstruct from the picture.
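To make the idea concrete, here is a minimal sketch of that walk-and-map step. It is a toy model, not a PDF parser: the GlyphOp record, the glyph numbers, and the TO_UNICODE table are all hypothetical stand-ins for what a real file stores in its content stream and font mapping.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class GlyphOp:
    """One simplified drawing instruction: draw glyph `glyph_id` at (x, y)."""
    glyph_id: int
    x: float
    y: float

# Hypothetical mapping table for one font. Glyph 9 is the "fi" ligature,
# mapped to the single character U+FB01.
TO_UNICODE = {1: "o", 2: "f", 3: "c", 4: "e", 9: "\ufb01"}

def extract_line(ops, to_unicode):
    """Walk the instructions left to right and map each glyph to a character.
    Glyphs with no mapping become U+FFFD, the replacement character."""
    ordered = sorted(ops, key=lambda op: op.x)
    return "".join(to_unicode.get(op.glyph_id, "\ufffd") for op in ordered)

# The word "office" drawn as five glyphs: o, f, fi-ligature, c, e.
OPS = [GlyphOp(1, 0, 700), GlyphOp(2, 6, 700), GlyphOp(9, 10, 700),
       GlyphOp(3, 16, 700), GlyphOp(4, 21, 700)]

raw = extract_line(OPS, TO_UNICODE)           # still contains the ligature character
folded = unicodedata.normalize("NFKC", raw)   # compatibility folding splits it into "f" + "i"
broken = extract_line(OPS, {})                # missing mapping: every glyph is unknown
```

With a clean table the only oddity is the ligature character, which Unicode compatibility normalization (NFKC) folds back to two letters. With an empty table the same five instructions yield five replacement characters, which is the “gibberish” case.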

The five-second test

Open the PDF in any viewer and try to select a sentence with the mouse.

  • Text highlights line by line: a real text layer exists; extraction will work.
  • Your selection draws a rectangle over the page image: the page is a picture, almost certainly a scan; there is no text to extract, only pixels.
  • Text selects but search cannot find words you can see: the glyph-to-character mapping is broken; extraction will produce the same garbage that search is choking on.

This single test predicts the extractor’s output better than any file property, and it costs nothing.
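If you already have extracted text in hand, a rough programmatic cousin of the test might look like the sketch below. It is a heuristic under stated assumptions, not a substitute for the manual check: the function names, the three labels, and the 0.2 threshold are all invented here for illustration, and it takes the extracted pages as plain strings rather than reading a PDF itself.

```python
import unicodedata

def suspicious(ch):
    """True for characters that rarely appear in cleanly mapped text:
    the replacement character, private-use, control, and unassigned code points."""
    return ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cc", "Cn")

def triage(pages, threshold=0.2):
    """Classify extracted text, given one string per page.
    `threshold` is an assumed cutoff for the share of suspicious characters."""
    chars = [ch for page in pages for ch in page if not ch.isspace()]
    if not chars:
        return "no text layer"      # nothing came out: likely a scan
    bad = sum(suspicious(ch) for ch in chars)
    if bad / len(chars) > threshold:
        return "broken mapping"     # text exists but maps to junk
    return "text layer"
```

For example, triage(["", "  "]) reports no text layer, while a page made mostly of private-use characters reports a broken mapping.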

Why extracted text comes out scrambled

The instructions on a page are stored in drawing order, not reading order, and nothing in the format requires the two to match. The classic cases:

  • Multi-column layouts. A two-column article may be drawn left column then right, or interleaved by vertical position, so naive line-joining shuffles the columns together. Good extractors reconstruct by coordinates, but ambiguous layouts stay ambiguous.
  • Tables. Cells are positioned text with no “this is a table” marker, so extraction flattens them into word soup; expect to rebuild structure by hand or from the source data.
  • Headers, footers, and page numbers. These are ordinary positioned text with nothing marking them as separate from the body, so they land in the middle of the extracted flow on every page.
  • Hyphenation. Words broken across lines arrive broken, because the hyphen is genuinely in the file.

None of this is the extractor misbehaving; it is the format telling the truth about itself. PDF was designed to print faithfully, not to be read back, and every extraction tool negotiates with that design.
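The multi-column case can be sketched in a few lines. This is an illustration of reconstructing by coordinates, assuming a simple two-column page whose dividing x position is already known; the Fragment record and the column_split parameter are hypothetical, and real extractors have to guess the column boundary rather than be told it.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline, PDF-style: larger y is higher on the page

def stored_order(fragments):
    """Naive joining: keep fragments in the order the file drew them."""
    return [f.text for f in fragments]

def reading_order(fragments, column_split):
    """Order fragments column by column, top to bottom within each column.
    `column_split` is the assumed x position dividing the two columns."""
    def key(f):
        column = 0 if f.x < column_split else 1
        return (column, -f.y, f.x)
    return [f.text for f in sorted(fragments, key=key)]

# Drawing order interleaves the two columns by vertical position.
FRAGMENTS = [
    Fragment("left 1", 50, 700), Fragment("right 1", 320, 700),
    Fragment("left 2", 50, 680), Fragment("right 2", 320, 680),
]
```

Here stored_order(FRAGMENTS) shuffles the columns together, while reading_order(FRAGMENTS, column_split=300) reads the left column before the right. A full-width heading or a figure that straddles both columns is exactly the kind of layout this simple key cannot resolve.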

Scans: when there is no text to extract

A scanned PDF is a stack of page images, and an extractor honestly returns nothing, because nothing is there. Turning those pictures into text is OCR (optical character recognition), a genuinely different and heavier job involving image analysis; it is the main item on the “needs heavier tools” list in our PDF pillar. Two notes keep expectations straight. First, many official documents are hybrids: a scanned image with an invisible OCR text layer already added by the scanning software. That is why some scans select and search just fine, although the layer carries OCR-grade errors. Second, if you control the scanner, the cheapest fix is upstream: scan to “searchable PDF” and the text layer arrives with the file.

A clean extraction workflow

  1. Run the five-second test to learn which file you have.
  2. Use the extractor rather than copy-paste: it processes every page at once and is not fooled by viewer selection quirks.
  3. For long documents, scope first. Check the size of the job with the page counter, and pull only the needed pages out with a split if the document is huge.
  4. Clean once, top to bottom: fix hyphenation with a find-replace, strip repeated headers, rebuild tables from the source if you have it. Cleaning is usually faster than re-extracting with different settings, because the artifacts come from the file, not the settings.
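The cleaning pass in step 4 can be sketched as two small functions. This is one possible approach, not the extractor's own behavior: the function names and the 0.8 cutoff are assumptions made for illustration, and the input is assumed to be one plain string per page.

```python
import re
from collections import Counter

def join_hyphenated(text):
    """Rejoin words that were split with a hyphen at a line end.
    This also merges genuine compounds that happened to break at the
    hyphen, so the result deserves a quick review."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def strip_repeated_lines(pages, min_share=0.8):
    """Drop lines that appear on at least `min_share` of the pages,
    which is how running headers and footers typically show up."""
    if len(pages) < 2:
        return list(pages)  # a single page gives no basis for "repeated"
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in counts.items() if n / len(pages) >= min_share}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]
```

Page numbers change from page to page, so an exact-match pass like this one leaves them in place; they need a separate pattern.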

Frequently asked questions

Why are there no spaces in my extracted text?

The file probably never drew space characters; it positioned words with coordinate gaps instead, which is legal and common. Extractors infer spaces from gap widths, and unusual typography defeats the inference. A find-replace pass usually repairs the worst of it.
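A minimal sketch of that inference, assuming each glyph arrives with a left edge and an advance width. The Glyph record and the gap_ratio parameter are hypothetical, and the 0.25 default is only a plausible starting point, not a value taken from any particular extractor.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge
    width: float  # advance width

def join_with_spaces(glyphs, font_size, gap_ratio=0.25):
    """Insert a space wherever the gap between neighboring glyphs
    exceeds `gap_ratio` times the font size."""
    out = []
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and g.x - prev_end > gap_ratio * font_size:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)

# "no gap" set at size 10, with a 4-unit gap and no space glyph between the words.
GLYPHS = [Glyph("n", 0, 5), Glyph("o", 5, 5),
          Glyph("g", 14, 5), Glyph("a", 19, 5), Glyph("p", 24, 5)]
```

With the default ratio the 4-unit gap counts as a space. Raise gap_ratio to 0.5 and the same glyphs come out as one word, which is the failure that tight or unusual typography triggers.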

Why do quotes and dashes come out as weird symbols?

Typographic characters live higher in Unicode than their keyboard cousins, and a broken glyph mapping mangles them first. If curly quotes arrive as junk while plain letters survive, the mapping is the culprit, and the file, not the extractor, owns the problem.
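One quick way to see what actually arrived is to list every character outside the ASCII range along with its Unicode name. The helper below is a hypothetical diagnostic, not part of the extractor; it only inspects a string you have already extracted.

```python
import unicodedata
from collections import Counter

def non_ascii_report(text):
    """List each non-ASCII character as (code point, Unicode name, count),
    most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "unnamed"), n)
        for ch, n in counts.most_common()
    ]
```

If the report shows named quotation marks and dashes, the mapping survived. If it shows replacement characters or unnamed private-use code points where the quotes should be, the mapping is the culprit.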

Can I extract text from just one page?

Split that page into its own file first, then extract; both steps run in the browser. It is also the polite workflow for sharing: extract from, and send, only what is needed.

Does extraction preserve bold, italics, and headings?

No; plain text extraction returns characters, not formatting. Styling in a PDF is font and position information, and “this is a heading” exists only visually. If you need structured output, you are in document-conversion territory, not extraction.

Written by Nick (ATV Team)
