How Can I Copy Text from a PDF While Preserving the Formatting?
PDF, the ubiquitous document format, is great for sharing documents while preserving fonts, images, and the general layout across platforms. Is there an easy way, however, to preserve that very formatting when copying and pasting text out of the document?
Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.
The Question
SuperUser reader Colen is searching for a way to extract text from PDFs while preserving the formatting:
When I copy text out of a PDF file and into a text editor, it ends up mangled in a variety of ways. Formatting like bold and italics is lost; soft line breaks within a paragraph of text are converted to hard line breaks; dashes to break a word over two lines are preserved even when they shouldn’t be; and single and double quotes are replaced with ? signs.
Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to " and ', and line breaks done properly. Is there any way to do this?
Is there a quick and easy way for Colen (and the rest of us) to grab text without sacrificing the formatting?
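Some of the symptoms in the question (hyphens left at line ends, hard line breaks inside paragraphs, smart quotes) can be partly patched up after copying with ordinary text processing. The following is a minimal sketch, not something from the original thread: the function name `clean_copied_text` is made up for illustration, and it assumes that real paragraph breaks survive as blank lines, which will not hold for every PDF. It cannot bring back bold or italics, and it will wrongly join genuinely hyphenated words such as "well-known" when they happen to break across lines.

```python
import re

# Map typographic ("smart") quotes to their plain ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}


def clean_copied_text(raw):
    """Heuristically tidy text pasted from a PDF (hypothetical helper)."""
    # 1. Replace smart quotes with straight quotes.
    text = raw.translate(str.maketrans(SMART_QUOTES))
    # 2. Rejoin words split by a hyphen at a line end: "con-\nverted" -> "converted".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 3. Turn single newlines into spaces; blank lines (assumed to mark
    #    paragraph breaks) are left alone.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


sample = "\u201cSoft\u201d breaks are con-\nverted to hard\nbreaks.\n\nNext paragraph."
print(clean_copied_text(sample))
```

Converting the result to HTML would be a further step; this sketch only normalizes plain text.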
The Answer
SuperUser contributor Frabjous offers a solution combined with a heavy dose of caution:
Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. A PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.
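To make the "map of character locations" idea concrete, here is a small illustrative sketch (not part of the original answer). It models a page as a list of positioned glyphs with no space characters at all and guesses word boundaries from horizontal gaps. The `Glyph` class, the `glyphs_to_text` function, the `gap_ratio` threshold, and the coordinates are all assumptions invented for this example; real PDF content streams are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # a single drawn character
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points


def glyphs_to_text(glyphs, gap_ratio=0.3):
    """Rebuild one line of text, inserting a space wherever the gap
    between neighbouring glyphs exceeds gap_ratio * previous glyph width."""
    ordered = sorted(glyphs, key=lambda g: g.x)  # left to right
    out = []
    prev = None
    for g in ordered:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")  # wide gap: assume a word boundary
        out.append(g.char)
        prev = g
    return "".join(out)


# Hypothetical positions for the glyphs of "Hi you"; no space glyph is stored.
line = [
    Glyph("H", 10.0, 700.0, 7.0),
    Glyph("i", 17.0, 700.0, 3.0),
    Glyph("y", 24.0, 700.0, 6.0),
    Glyph("o", 30.0, 700.0, 6.0),
    Glyph("u", 36.0, 700.0, 6.0),
]
print(glyphs_to_text(line))  # prints "Hi you"
```

The space in the output exists only because of the guessed threshold. With tighter letter spacing or a different threshold the same glyphs could come out as "Hiyou", which is the kind of ambiguity the answer describes.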
(A few recent PDFs do store some information about this stuff, but that’s a new technology, and you’d be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)
Anyway, it’s up to your software to implement some kind of “artificial intelligence” to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it’s also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.
無論如何,要由軟件來實現某種“人工智能”,以僅從單個字符的位置提取什么是單詞,什么是段落等。 不同的軟件將比其他軟件做得更好,而且還取決于PDF的制作方式。 無論如何,您永遠都不應期望獲得完美的結果。 具有輸出PDF與具有源文檔是不同的。 如果可以的話,嘗試獲得更好的選擇。
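As a rough illustration of the guesswork involved (an invented example, not how any particular viewer works), the sketch below decides where paragraphs begin purely from the vertical positions of lines. The `Line` class, the `lines_to_paragraphs` function, the 14-point `leading`, and the `slack` factor are assumptions; coordinates are taken to increase upward from the bottom of the page.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    y: float  # baseline position in points, measured from the page bottom


def lines_to_paragraphs(lines, leading=14.0, slack=1.5):
    """Join lines into paragraphs; a vertical jump larger than
    slack * leading is treated as a paragraph (hard) break."""
    ordered = sorted(lines, key=lambda ln: -ln.y)  # top of page first
    paragraphs = []
    current = []
    prev_y = None
    for ln in ordered:
        if prev_y is not None and (prev_y - ln.y) > slack * leading:
            paragraphs.append(" ".join(current))  # close the paragraph
            current = []
        current.append(ln.text.strip())
        prev_y = ln.y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


page = [
    Line("PDFs are designed to mimic", 700.0),
    Line("a printed page.", 686.0),
    Line("Different software will", 658.0),  # 28 pt jump: new paragraph
    Line("guess differently.", 644.0),
]
for paragraph in lines_to_paragraphs(page):
    print(paragraph)
```

A document that separates paragraphs with first-line indents instead of extra spacing would defeat this heuristic entirely, which is one reason results vary with how the PDF was made.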
The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.
There is free software that can be used to extract text from PDFs with some of the formatting intact, but again, don’t expect perfect results. See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow, or the AbiWord word processor (with all import/export plugins enabled). There’s also a PDF import plugin for OpenOffice.
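For scripted use, calibre ships a command-line converter, `ebook-convert`, which takes an input file and an output file and picks the output format from the file extension. The wrapper below is a minimal sketch that assumes calibre is installed and `ebook-convert` is on the PATH; the helper names `build_convert_command` and `convert_pdf` and the file names are hypothetical, and no extra conversion options are shown.

```python
import shutil
import subprocess


def build_convert_command(pdf_path, out_path):
    """Return the argument list for calibre's ebook-convert tool."""
    return ["ebook-convert", pdf_path, out_path]


def convert_pdf(pdf_path, out_path):
    """Run the conversion, failing early if calibre is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(pdf_path, out_path), check=True)


# Example call (hypothetical file names):
# convert_pdf("report.pdf", "report.rtf")
```

As with every tool mentioned here, the converted file should be checked by hand for lost formatting.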
But please don’t expect perfection with any of these results. You’re going against the grain here. PDF just is not meant as an editable input format.
If you are having trouble deciding which tool to start with, Calibre is a veritable document Swiss Army knife. You can also use it to convert PDF files for use on your ebook reader and organize your ebook/document library.
Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.
Translated from: https://www.howtogeek.com/136698/how-can-i-copy-text-from-a-pdf-while-preserving-the-formatting/