Reading PDF Documents with Python

Published: 2023/12/20

The flowchart of the process looks a little convoluted, so let's break it down step by step (learned from a MOOC course).

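First, use open or urlopen to open a local or online document and create the document object. Working from a local copy is the usual choice, because a large document also puts a heavy load on the web server. The snippet below also includes, commented out, an existing online link.

# Get the document object
pdf0 = open('sampleFORtest.pdf', 'rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')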
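
One detail matters for the network case: the parser needs to seek within the file, while the response returned by urlopen is a forward-only stream. The sketch below shows one way to handle both sources. The helper name open_pdf_source is hypothetical (it is not part of pdfminer), and buffering the whole download in memory is an assumption that only suits reasonably small files.

```python
import io
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a seekable binary file object for a local path or an http(s) URL.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    if source.startswith(('http://', 'https://')):
        # Download the whole document and wrap it so the parser can seek in it.
        with urlopen(source) as response:
            return io.BytesIO(response.read())
    # Local files opened in binary mode are already seekable.
    return open(source, 'rb')
```

With this helper, pdf0 = open_pdf_source('sampleFORtest.pdf') behaves like the open call above, and passing a URL instead returns an in-memory copy of the downloaded file.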

Next, create a document parser and a PDF document object, and link the two to each other.

# Create a parser associated with the file
parser = PDFParser(pdf0)
# Create a PDF document object
doc = PDFDocument()
# Link the two
parser.set_document(doc)
doc.set_parser(parser)

Initialize the PDF document object. If the document is encrypted, pass its password as the argument; otherwise pass an empty string.

# Initialize the document
doc.initialize('')

Then create a PDF resource manager and a layout parameter object (LAParams).

# Create the PDF resource manager
resources = PDFResourceManager()
# Create the layout parameter object
laparam = LAParams()

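Next, create an aggregator, which takes the PDF resource manager and the layout parameters as arguments.

# Create an aggregator from the resource manager and the layout parameters
device = PDFPageAggregator(resources, laparams=laparam)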

Finally, create a page interpreter, passing it the PDF resource manager and the aggregator.

# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)

The page interpreter can now decode the content of the PDF document and turn it into a form that Python can work with.

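To extract the text, call the PDF document object's get_pages() method to obtain the pages, and let the page interpreter process them one at a time. After each page, call the aggregator's get_result() method to obtain that page's layout, then iterate over the objects in the layout and call get_text() on each one that has it to get the page's text.

for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Fetch the page content from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, 'get_text'):  # a page holds more than just text
            pprint(out.get_text())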

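Note that a PDF document contains more than text; it may also hold images and other objects. To avoid errors, check that each object has a get_text() method before calling it.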
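
The hasattr check can be tried without a PDF at hand. In the sketch below, FakeTextBox and FakeFigure are hypothetical stand-ins for the objects found in a layout (they are not pdfminer classes), and collect_text is an illustrative helper name.

```python
class FakeTextBox:
    """Hypothetical stand-in for a layout object that carries text."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    """Hypothetical stand-in for a layout object without text, such as an image."""


def collect_text(layout):
    """Return the text of every object in the layout that offers get_text()."""
    return [obj.get_text() for obj in layout if hasattr(obj, 'get_text')]


# A page mixing text objects with a non-text object.
page = [FakeTextBox('Abstract\n'), FakeFigure(), FakeTextBox('good results.\n')]
print(collect_text(page))
```

The figure is skipped silently instead of raising AttributeError, which is the behavior the loop above relies on.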

The complete code

# encoding:utf-8
'''
@author:
@time:
'''
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pprint import pprint
from urllib.request import urlopen  # only needed for the commented-out network variant

# Get the document object
pdf0 = open('sampleFORtest.pdf', 'rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')

# Create a parser associated with the file
parser = PDFParser(pdf0)
# Create a PDF document object
doc = PDFDocument()
# Link the two
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document (pass the password here if the file is encrypted)
doc.initialize('')

# Create the PDF resource manager
resources = PDFResourceManager()
# Create the layout parameter object
laparam = LAParams()
# Create an aggregator from the resource manager and the layout parameters
device = PDFPageAggregator(resources, laparams=laparam)
# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)

# Get the pages from the document object
for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Fetch the page content from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, 'get_text'):  # a page holds more than just text
            pprint(out.get_text())
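
The same steps can also be wrapped in a function, so that the file is closed reliably and the password becomes a parameter. This is only a sketch: the function name extract_text_blocks is hypothetical, and it assumes the same pdfminer version as the imports above, that is, one whose pdfminer.pdfparser module exposes PDFDocument with set_parser(), initialize() and get_pages(). The imports sit inside the function so that defining it does not require the package.

```python
def extract_text_blocks(path, password=''):
    """Yield the text of each text-bearing layout object, page by page."""
    # Imported lazily; assumes the same pdfminer API as the script above.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    with open(path, 'rb') as fp:
        # Parser and document object, linked to each other.
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize(password)

        # Resource manager, aggregator and page interpreter.
        resources = PDFResourceManager()
        device = PDFPageAggregator(resources, laparams=LAParams())
        interpreter = PDFPageInterpreter(resources, device)

        for page in doc.get_pages():
            interpreter.process_page(page)
            for obj in device.get_result():
                # Skip images and other objects that carry no text.
                if hasattr(obj, 'get_text'):
                    yield obj.get_text()
```

Because it is a generator, a caller can loop over extract_text_blocks('sampleFORtest.pdf') and stop early without processing the remaining pages.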

The sample PDF used here is one of the officially provided samples.

Output:

'Preemptive Information Extraction using Unrestricted Relation Discovery\n'
'Yusuke Shinyama\n'
'Satoshi Sekine\n'
'New York University\n715, Broadway, 7th Floor\nNew York, NY, 10003\n'
'{yusuke,sekine}@cs.nyu.edu\n'
'Abstract\n'
('We are trying to extend the boundary of\n'
 'Information Extraction (IE) systems. Ex-\n'
 'isting IE systems require a lot of time and\n'
 'human effort to tune for a new scenario.\n'
 'Preemptive Information Extraction is an\n'
 'attempt to automatically create all feasible\n'
 'IE systems in advance without human in-\n'
 'tervention. We propose a technique called\n'
 'Unrestricted Relation Discovery that dis-\n'
 'covers all possible relations from texts and\n'
 'presents them as tables. We present a pre-\n'
 'liminary system that obtains reasonably\n'
 'good results.\n')
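
The extracted text keeps the line breaks of the PDF layout, including words split by a hyphen at the end of a line ("Ex-" / "isting"). If running prose is wanted, the lines can be rejoined afterwards. The sketch below is one simple way to do so; join_wrapped_lines is a hypothetical helper name, and the rule it applies (drop a line-ending hyphen when a lowercase letter follows) is an assumption that will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re


def join_wrapped_lines(text):
    """Undo line wrapping: merge words split by a trailing hyphen, then join lines."""
    # 'Ex-\nisting' -> 'Existing' (only when a lowercase letter follows the break).
    merged = re.sub(r'-\n(?=[a-z])', '', text)
    # Remaining newlines and repeated whitespace become single spaces.
    return ' '.join(merged.split())


sample = ('Information Extraction (IE) systems. Ex-\n'
          'isting IE systems require a lot of time and\n'
          'human effort to tune for a new scenario.\n')
print(join_wrapped_lines(sample))
```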
