Converting docx, ppt, and pdf to txt

Published: 2024/1/1

Document format conversion

I recently needed document format conversion for my graduation project, so I collected some code:

Doc and Docx to txt

# -*- coding: utf-8 -*-
from win32com import client as wc

# Start Word through COM (Windows with Microsoft Word installed)
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('H:\\a.docx')
doc.SaveAs('H:\\a.pdf', 17)  # 17 is wdFormatPDF in Word's WdSaveFormat enumeration
doc.SaveAs('H:\\a.txt', 2)   # 2 is wdFormatText in Word's WdSaveFormat enumeration
doc.Close()
word.Quit()

PDF to TXT

# -*- coding: utf-8 -*-
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# Note: paths containing Chinese characters are still an unsolved problem
fp = open('E:\\Final_design\\a.pdf', 'rb')
# Create a PDF parser for the file
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure
document = PDFDocument(parser)
# Check whether the file allows text extraction
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager to store shared resources
    rsrcmgr = PDFResourceManager()
    # Set the layout analysis parameters
    laparams = LAParams()
    # Create a PDF device object that aggregates each page's layout
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open('a.txt', 'a', encoding='utf-8') as f:
        # Process each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    f.write(x.get_text() + '\n')
fp.close()

PPT to TXT

# -*- coding: utf-8 -*-
import win32com.client

# Start PowerPoint through COM (Windows with Microsoft PowerPoint installed)
ppt = win32com.client.Dispatch('PowerPoint.Application')
ppt.Visible = 1
pptSel = ppt.Presentations.Open("H:\\b.pptx", ReadOnly=1, Untitled=0, WithWindow=0)
# The output is written as gbk, which handles Chinese text well
f = open("H:\\b.txt", "w", encoding="gbk")
# Get the number of slides in the presentation
slide_count = pptSel.Slides.Count
for i in range(1, slide_count + 1):
    shape_count = pptSel.Slides(i).Shapes.Count
    print(shape_count)
    for j in range(1, shape_count + 1):
        # Only shapes with a text frame carry text
        if pptSel.Slides(i).Shapes(j).HasTextFrame:
            s = pptSel.Slides(i).Shapes(j).TextFrame.TextRange.Text
            f.write(s + "\n")
f.close()
ppt.Quit()

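The Word and PowerPoint conversions above drive Office through Python's win32com library, so the Office application is launched at run time (Word does not open visibly, but with the PPT conversion you can clearly see PowerPoint open and then close). The pdfminer example parses the PDF itself and does not start Office. Opening files in Office is a security concern: malicious code hidden in a file through a file-format vulnerability may run when the file is opened.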

A safer approach is therefore to parse the file format directly, extract the text, and save it as txt.
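The article only applies this parsing approach to docx. As a rough sketch of the same idea for .pptx (not the author's code), the example below uses only the standard library: a .pptx file is a ZIP archive whose slides are stored as XML parts named ppt/slides/slideN.xml, with the text in DrawingML `<a:t>` runs. The function name `pptx_to_text` is hypothetical, and the sketch assumes that sorting slides by their numeric file suffix is an acceptable order (the real presentation order is defined elsewhere in the package and may differ). It does not handle the old binary .ppt format.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by the <a:p> and <a:t> elements inside slide XML
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Matches slide parts such as ppt/slides/slide3.xml (but not their .rels files)
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_text(path):
    """Return the text of a .pptx file, one non-empty paragraph per line.

    `path` may be a filename or a binary file-like object.
    Hypothetical helper: slides are ordered by file number, which is an
    assumption and may not match the order shown in PowerPoint.
    """
    lines = []
    with zipfile.ZipFile(path) as zf:
        # Collect the slide parts together with their numeric suffix
        slides = []
        for name in zf.namelist():
            m = SLIDE_RE.match(name)
            if m:
                slides.append((int(m.group(1)), name))
        for _, name in sorted(slides):
            root = ET.fromstring(zf.read(name))
            # Each <a:p> is a paragraph; its <a:t> descendants hold the text runs
            for para in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text:
                    lines.append(text)
    return "\n".join(lines)
```

A call such as `pptx_to_text("H:\\b.pptx")` returns a string that can then be written to a txt file with `open(..., "w", encoding="utf-8")`. No Office application is started, so nothing in the file is executed, although the XML parser itself still processes untrusted input.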

Parsing docx documents and extracting text

This uses Python's docx library. When installing it, do not use:

pip install docx

because with that package, from docx import Document fails with: cannot import name Document
Install this instead:

pip install python-docx

# -*- coding: utf-8 -*-
from docx import Document

# Open the document
document = Document('H:\\a.docx')
# Read the text of each paragraph
l = [paragraph.text for paragraph in document.paragraphs]
# Print and inspect the result; the text can also be processed in other ways
for i in l:
    print(i)
# Read the tables and print their contents
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            # Cells of one row are separated by tabs
            print(cell.text, end='\t')
        print()
    print('\n')
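If installing python-docx is not an option, the same parsing idea can be sketched with only the standard library, since a .docx file is a ZIP archive whose main text lives in word/document.xml. This is a minimal sketch under stated assumptions, not the author's code: the function names `docx_to_text` and `save_docx_as_txt` are hypothetical, paragraphs inside tables are emitted in document order together with body paragraphs, and nested content such as text boxes may be reported more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:p> and <w:t> elements
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one paragraph per line.

    `path` may be a filename or a binary file-like object.
    Hypothetical helper: it reads only word/document.xml, so headers,
    footers, and footnotes are not included.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    # Each <w:p> is a paragraph (including those inside table cells);
    # its <w:t> descendants hold the text runs
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def save_docx_as_txt(docx_path, txt_path):
    """Write the extracted text to txt_path as UTF-8 (hypothetical helper)."""
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(docx_to_text(docx_path))
```

For example, `save_docx_as_txt("H:\\a.docx", "H:\\a.txt")` would produce a text file without launching Word. UTF-8 is used here as an assumption; pass a different encoding to `open` if the output must be gbk or gb2312.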
