How To Copy Text From A Pdf And Keep The Formatting

When you copy text from a PDF, you want to keep the formatting. This means keeping the headings, lists, and other elements in their original order. You can also use the same font and size for all text. To copy text from a PDF, open the PDF file in your favorite word processor or editor. Then paste the text into a new document or page.

PDF, the ubiquitous document format, is great for sharing documents while preserving fonts, images, and the general layout across platforms. Is there an easy way, however, to preserve that very formatting when copying and pasting text out of the document?

Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

The Question

SuperUser reader Colen is searching for a way to extract text from PDFs while preserving the formatting:

Is there a quick and easy way for Colen (and the rest of us) to get grab text without sacrificing the formatting?

Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to ” and ‘, and line breaks done properly. Is there any way to do this?

The Answer

SuperUser contributor Frabjous offers a solution combined with a heavy dose of caution:

If you are having trouble deciding which tool to start with, Calibre is a veritable document Swiss Army knife. You can also use it to convert PDF files for use on your ebook reader and organize your ebook/document library.

(A few recent PDFs do store some information about this stuff, but that’s a new technology, and you’d be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)

Anyway, it’s up to your software to implement some kind of “artificial intelligence” to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it’s also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.

The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.

There is free software that can be used to extract text from PDFs with some of formatting intact, but again, don’t expect perfect results. See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow, or the AbiWord word processor (with all import/export plugins enabled). There’s also a PDF import plugin for OpenOffice.

But please don’t expect perfection with any of these results. You’re going against the grain here. PDF just is not meant as an editable input format.

Have something to add to the explanation? Sound off in the the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.

The Question#

The Answer#

Related Video#

The Question

The Answer

Related Video