Super User is a question and answer site for computer enthusiasts and power users.

Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to straight " and ', and line breaks done properly. Is there any way to do this?

Word 2013 can open PDFs.

Firstly, you have to understand what a PDF is. It is essentially a record of where each character is drawn on the page; it generally does not store what is a word, a line, or a paragraph. A few recent PDFs do store some of this structural information, but that’s a new technology, and you’d be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it. Otherwise, it’s up to your software to implement some kind of “artificial intelligence” to work out, merely from the locations of individual characters, what is a word, what is a paragraph, and so on. Some software does this better than others, and the results also depend on how the PDF was made. Having the output PDF is not the same as having the source document.
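To make the guessing concrete, here is a minimal Python sketch of the kind of heuristic an extractor might apply. It is not how any particular viewer works: the `Glyph` record, the `glyphs_to_text` function, and the two thresholds are all made up for illustration, and real PDFs add complications (columns, rotated text, kerning) that this ignores.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One character as a PDF roughly describes it: a shape at a position."""
    char: str
    x: float      # left edge of the character
    y: float      # baseline; in this sketch, larger means further down the page
    width: float  # horizontal advance of the character


def glyphs_to_text(glyphs, line_tol=2.0, gap_factor=0.3):
    """Guess lines and words purely from glyph positions (hypothetical helper)."""
    # Walk the glyphs from the top of the page to the bottom.
    ordered = sorted(glyphs, key=lambda g: (g.y, g.x))

    # Glyphs whose baselines are within line_tol of each other share a line.
    lines = []
    for g in ordered:
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        # Within a line, read left to right.
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap wider than a fraction of the previous
            # character is assumed to be a space between words.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


if __name__ == "__main__":
    # Glyphs arrive in no particular order, with slightly uneven baselines.
    sample = [
        Glyph("o", 0, 112, 5), Glyph("k", 5, 112, 5),
        Glyph("H", 0, 100, 5), Glyph("i", 5, 100.4, 5),
        Glyph("t", 13, 100, 5), Glyph("h", 18, 100, 5),
        Glyph("e", 23, 100, 5), Glyph("r", 28, 100, 5),
        Glyph("e", 33, 100, 5),
    ]
    print(glyphs_to_text(sample))
```

Both thresholds are guesses. Set `gap_factor` too low and tightly spaced letters split into separate words; set it too high and words run together. That is why different programs give different results on the same file.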

It is far better to obtain the source document if you can. Even that will not give perfect results. There is free software that can extract text from PDFs with some of the formatting intact, but again, don’t expect perfection. You’re going against the grain here.

PDF just is not meant as an editable input format.

On some PDFs I tried, it gave better results than all the above software.

Then you can ‘Save As’ and choose . That will preserve all the formatting. Dunno whether you can do the same in Adobe because I stopped using it a while ago when I converted to Foxit.

“Save as Text” worked for me with several free PDF viewers.

I use Foxit and just tried it; I wouldn’t say it preserved formatting. And all I wanted was decent line endings and each paragraph as a paragraph.

You can use Adobe Acrobat Pro for this.
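If the goal is only decent line endings, one paragraph per paragraph, and straight quotes instead of smart quotes, a little post-processing of the copied or saved text may be enough. Below is a minimal Python sketch rather than a finished tool. The name `tidy_pdf_text` is made up for illustration. It assumes that blank lines separate paragraphs and that a trailing hyphen always marks a word split across lines, and neither holds for every PDF.

```python
import re

# Map typographic ("smart") quotes to their plain ASCII equivalents.
QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
})


def tidy_pdf_text(raw):
    """Straighten quotes and rejoin hard-wrapped lines into paragraphs."""
    text = raw.translate(QUOTES).replace("\r\n", "\n")
    paragraphs = []
    # Assumption: one or more blank lines mark a paragraph break.
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if not lines:
            continue
        joined = lines[0]
        for ln in lines[1:]:
            if joined.endswith("-"):
                # Naive de-hyphenation: this also merges genuinely
                # hyphenated words that happen to break at the hyphen.
                joined = joined[:-1] + ln
            else:
                joined += " " + ln
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    pasted = (
        "\u201cSmart quotes\u201d and hard line\n"
        "breaks are the usual prob-\n"
        "lems.\n"
        "\n"
        "A second paragraph\n"
        "follows.\n"
    )
    print(tidy_pdf_text(pasted))
```

This does nothing about bold, italics, or headings. Plain text saved from a viewer no longer carries them, so converting formatting to HTML still needs a tool that reads the PDF itself.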

