Processing Local HTML Files with Python 3 and BeautifulSoup4

Published: 2023/12/14

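Contents

- The problem
- The raw text to process
- Some common regular expressions for search and replace
- Using beautifulsoup4 in Python 3
  - What is beautifulsoup4?
  - Installing beautifulsoup4
  - Getting started with beautifulsoup4
- Other small details
  - Joining a list into a string in Python 3
- The final code (Python 3)
- References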

My blog: https://hxd.red
Original post: https://hxd.red/2019/08/06/python3-beautifulsoup4-html-190805/
My WeChat official account: 不淡定的實驗室 (hxdred)

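The problem

While building my third WeChat mini program, a French vocabulary memorization helper, I needed to process a large amount of word data. To settle word definitions, example sentences, and related questions once and for all, I decided to extract the data from an mdx dictionary, turn the data for every word in it into a data table, and upload that to the mini program's cloud development backend. That way my other mini program, a French verb conjugation helper, can share the result.

Being lazy, I was never going to process this much data by hand: the extracted mdx has 600,000 lines, and after removing the verb conjugation data, which is useless to me, there are still 150,000 lines covering more than 12,000 words. So I decided to use Python and Beautiful Soup (sometimes abbreviated BS below). To quote the official documentation: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

The raw text to process

The raw text looks like this; only the entries for two words are shown as examples: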

<zidingyi> abandonner <h1 class="Adresse" >abandonner</h1><br /><span class="CategorieGrammaticale" >verbe transitif </span><br /> <span class="Indicateur">(déserter) </span><br /> <div class="Traductionchinois" >擅離</div> <span class="Locution2" id="48" >abandonner son poste</span> <div class="Traduction2chinois" >擅離職守</div Traduction2> </td></tr></table> <span class="Indicateur">(laisser) </span><br /> <div class="Traductionchinois" >拋棄</div> <span class="Locution2" id="49" >abandonner un animal</span> <div class="Traduction2chinois" >丟棄一只動物</div Traduction2><span class="Locution2" id="50" >partir en abandonnant femme et enfants</span> <div class="Traduction2chinois" >拋棄妻子和孩子出走</div Traduction2> </td></tr></table> <span class="Indicateur">(renoncer à) </span><br /> <div class="Traductionchinois" >放棄</div> <span class="Locution2" id="51" >abandonner ses études</span> <div class="Traduction2chinois" >放棄自己的學業</div Traduction2> </td></tr></table> <span class="Indicateur">(se retirer de) </span><br /> <div class="Traductionchinois" >棄權</div> <span class="Locution2" id="52" >il a abandonné la course</span> <span class="Traduction2chinois" >他在這次賽跑中棄權</span></td></tr></table> <br /><br /> <h1 class="Adresse" >abandonner</h1><br /><span class="CategorieGrammaticale" >verbe intransitif </span><br /> <div class="Traductionchinois" >退出比賽</div> <span class="Locution2" id="53" >après sa chute, le cycliste a abandonné</span> <span class="Traduction2chinois" >這個自行車運動員摔倒后就退出了比賽</span> </zidingyi>
<zidingyi> abat-jour <h1 class="Adresse" >abat-jour</h1> <br /><span class="CategorieGrammaticale" >nom masculin invariable</span><br /> <div class="Traductionchinois" >燈罩</div> </zidingyi>

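Some common regular expressions for search and replace

The original document contains a great many useless tags that need to be deleted. If a tag is a fixed string, an ordinary search and replace handles it in bulk. But when a tag carries an id that changes in a regular way, or when the text between the tags varies, you need a regular expression to find it. In practice, the pattern I used most boils down to this: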

<a[^>]*>(.*?)</a>

Two cases came up. First, <span class="Traduction_py"> wraps irregular text, and I need to replace every <span class="Traduction_py"></span Traduction_py> pair together with the text between the tags. Second, the <span class="Locution2" id="12"> tag carries an id number, and I need to replace all such tags whatever their id is.
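The two cases can be sketched with Python's re module. This is a minimal sketch, not the patterns from the original post: the sample string and its pinyin text are made up for illustration, and replacing each Locution2 tag with an id-less version is an assumption about what the replacement should be.

```python
import re

# Hypothetical sample in the same shape as the dictionary markup.
sample = (
    '<span class="Traduction_py">shàn lí</span Traduction_py>'
    '<span class="Locution2" id="48" >abandonner son poste</span>'
)

# Case 1: drop the Traduction_py span together with the text it wraps.
# .*? is non-greedy, so each match stops at the nearest closing tag.
no_pinyin = re.sub(
    r'<span class="Traduction_py">.*?</span Traduction_py>', '', sample
)

# Case 2: match every Locution2 opening tag regardless of its id
# and replace it with a version that has no id (an assumed target).
no_ids = re.sub(
    r'<span class="Locution2" id="\d+"\s*>',
    '<span class="Locution2">',
    no_pinyin,
)

print(no_ids)
```

The same patterns should also work in an editor's regex search and replace, such as the one in Notepad++, though the escaping rules can differ slightly between tools.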

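Using beautifulsoup4 in Python 3

What is beautifulsoup4?

To quote the official documentation: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Installing beautifulsoup4

From this part on we need Python. How do you get up and running with Python quickly from zero? I may write a separate post about that; consider this a promise for now. In short, you need the following:

First download Anaconda (just search for it; the installer is foolproof).
After installation, find Anaconda Prompt among the installed programs and open it.
Enter the command below to install beautifulsoup4: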

$ pip install beautifulsoup4

Then find Anaconda Navigator among the installed programs, choose Spyder, and paste in the code from this post, adjusted to your own setup, to run it.

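Getting started with beautifulsoup4

First we need to open the HTML file and tell the program where it lives. Change path to your own file path. Where does the HTML file come from? Take the text from the section on the raw text to process, paste it into Notepad++, and save it as an .html file to start experimenting. The lines after path open the HTML file and read its contents.

path = 'D:/WORKS/larousse_original_test1.html'
# Read the whole file as UTF-8 text; the with block closes it afterwards
with open(path, 'r', encoding='utf-8') as htmlfile:
    htmlhandle = htmlfile.read()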

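The next step is to call Beautiful Soup's parsing function, with lxml as the parser, and to import the pandas package to store the target data. Mind the capitalization of BeautifulSoup, otherwise the import raises an error.

from bs4 import BeautifulSoup
import pandas as pd
# Parse the text read from the file with the lxml parser
soup = BeautifulSoup(htmlhandle, 'lxml')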
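To see what the parsed soup gives you before running it on the full file, a small experiment like the following may help. It is only a sketch: the two-entry snippet is a shortened, hypothetical version of the sample text, and it uses 'html.parser', which ships with Python, instead of lxml so that it runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry snippet in the same shape as the sample text.
html = (
    '<zidingyi> abat-jour <h1 class="Adresse">abat-jour</h1>'
    '<div class="Traductionchinois">燈罩</div></zidingyi>'
    '<zidingyi> abandonner <h1 class="Adresse">abandonner</h1></zidingyi>'
)

# 'html.parser' is built in; 'lxml' is usually faster but must be installed.
soup = BeautifulSoup(html, 'html.parser')

# Custom tag names such as <zidingyi> can be searched like any other tag.
entries = soup.find_all('zidingyi')
headwords = [entry.h1.get_text(strip=True) for entry in entries]

print(len(entries), headwords)
```

Different parsers can repair malformed markup differently, so it is worth checking a few real entries with the parser you actually plan to use.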

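Create a counter, then create result, which will hold all the data. When you open the Excel sheet later you will see columns such as 'word' and 'word_cixing', and the data grows row by row under those columns. new is a one-row template that is filled in for each word.

count = 0
columns = ['word', 'word_cixing', 'word_jieshi_fr', 'word_jieshi_cn', 'word_liju_fr', 'word_liju_cn']
# result starts with the columns only; one row is added per word
result = pd.DataFrame(columns=columns)
# new is a separate one-row frame, so filling it in does not change result
new = pd.DataFrame({col: '' for col in columns}, index=[0])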
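A common alternative to growing a DataFrame inside a loop is to collect plain dicts and build the frame once at the end. The sketch below shows that approach under a few assumptions: the name COLUMNS is introduced here for illustration, and the two sample rows are typed in by hand from the sample text rather than parsed.

```python
import pandas as pd

COLUMNS = ['word', 'word_cixing', 'word_jieshi_fr',
           'word_jieshi_cn', 'word_liju_fr', 'word_liju_cn']

# Hand-written stand-ins for values that the parsing loop would produce.
samples = [
    ('abandonner', 'verbe transitif'),
    ('abat-jour', 'nom masculin invariable'),
]

rows = []
for word, cixing in samples:
    row = dict.fromkeys(COLUMNS, '')  # every column starts out empty
    row['word'] = word
    row['word_cixing'] = cixing
    rows.append(row)

# Build the DataFrame in one step instead of appending row by row.
result = pd.DataFrame(rows, columns=COLUMNS)
print(result.shape)
```

Building the frame once avoids copying the accumulated data on every iteration, which can matter with tens of thousands of rows.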

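Now set up a loop. In the initial HTML I replaced the </> separators of the original mdx with <zidingyi></zidingyi>. In other words, every word is wrapped in an outer <zidingyi></zidingyi> pair, and each pair contains everything about that word.

First, find_all() returns every <zidingyi></zidingyi> tag, and a loop walks through them. Each one is stored in item, and a BS CSS selector then picks out the h1 tags, which hold the word itself. The result is a list that has to be converted to a string; the next section covers that.

A BeautifulSoup object represents the whole document. Most of the time you can treat it as a Tag object; it supports most of the methods described under navigating the tree and searching the tree in the documentation. get_text() then reads out all the text inside the tags, which is stored in the 'word' field of new and appended to result, ready for the final output.

Only 'word' is shown here. The other fields correspond to other classes or tags; the official Beautiful Soup documentation has the details.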

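for item in soup.find_all('zidingyi'):
    # The headword: every <h1> directly under the current <zidingyi>
    word = item.select("zidingyi > h1")
    word = ';'.join(str(e) for e in word)
    # Parse the joined markup again and keep only its text
    word = BeautifulSoup(word, 'lxml').get_text()
    new['word'] = word
    count += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    result = pd.concat([result, new], ignore_index=True)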
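The same idea extends to the other fields. The sketch below is one possible way to do it, not the code from this post: the SELECTORS mapping and the extract_entry helper are names introduced here, it calls get_text() on each matched tag directly instead of joining the markup and parsing it again, it selects 'h1' rather than "zidingyi > h1", and it uses the built-in 'html.parser' so that it runs without lxml. Note that strip=True trims surrounding whitespace, which the loop above does not do.

```python
from bs4 import BeautifulSoup

# CSS selector for each output column; class names come from the sample text.
SELECTORS = {
    'word': 'h1',
    'word_cixing': '.CategorieGrammaticale',
    'word_jieshi_fr': '.Indicateur',
    'word_jieshi_cn': '.Traductionchinois',
    'word_liju_fr': '.Locution2',
    'word_liju_cn': '.Traduction2chinois',
}


def extract_entry(item):
    """Return one dict for a <zidingyi> tag, joining multiple matches with ';'."""
    return {
        column: ';'.join(tag.get_text(strip=True) for tag in item.select(selector))
        for column, selector in SELECTORS.items()
    }


# Shortened copy of the second sample entry.
html = (
    '<zidingyi> abat-jour <h1 class="Adresse">abat-jour</h1><br />'
    '<span class="CategorieGrammaticale">nom masculin invariable</span><br />'
    '<div class="Traductionchinois">燈罩</div></zidingyi>'
)
soup = BeautifulSoup(html, 'html.parser')
rows = [extract_entry(item) for item in soup.find_all('zidingyi')]
print(rows[0])
```

Fields with no match in an entry come back as empty strings, so every row has the same six keys.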

And that is it: save all the data to an Excel sheet. (Change the path and the Excel file name to suit your own needs.)

result.to_excel('d:result.xlsx')
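If writing Excel files is not an option, for example because no Excel writer package is installed, CSV is a possible substitute. This is a different technique from the to_excel call above, shown only as a sketch: it uses the standard csv module, and the rows list, COLUMNS, and the rows_to_csv helper are hypothetical names with hand-written sample data.

```python
import csv
import io

COLUMNS = ['word', 'word_cixing', 'word_jieshi_fr',
           'word_jieshi_cn', 'word_liju_fr', 'word_liju_cn']

# Hypothetical row, shaped like the data collected for each word.
rows = [
    {'word': 'abat-jour', 'word_cixing': 'nom masculin invariable',
     'word_jieshi_cn': '燈罩'},
]


def rows_to_csv(rows, stream):
    """Write rows as CSV; columns missing from a row are left empty."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, restval='')
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here so the example leaves no file behind.
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())

# For a real file, 'utf-8-sig' adds a byte order mark that helps Excel
# detect the encoding of the Chinese text:
# with open('result.csv', 'w', encoding='utf-8-sig', newline='') as f:
#     rows_to_csv(rows, f)
```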

Other small details

Joining a list into a string in Python 3

Use ''.join; put whatever separator you need between the quotes.

list1 = ['1', '2', '3']
str1 = ''.join(list1)

If the list holds numbers or anything else that is not a string, convert the elements before joining.

list1 = [1, 2, 3]
str1 = ''.join(str(e) for e in list1)

The final code (Python 3)

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 4 14:13:54 2019
@author: https://hxd.red
"""
from bs4 import BeautifulSoup
import pandas as pd

path = 'D:/WORKS/larousse_original_test1.html'
with open(path, 'r', encoding='utf-8') as htmlfile:
    htmlhandle = htmlfile.read()
soup = BeautifulSoup(htmlhandle, 'lxml')

count = 0
columns = ['word', 'word_cixing', 'word_jieshi_fr', 'word_jieshi_cn', 'word_liju_fr', 'word_liju_cn']
# result starts with the columns only; new is a separate one-row template
result = pd.DataFrame(columns=columns)
new = pd.DataFrame({col: '' for col in columns}, index=[0])

for item in soup.find_all('zidingyi'):
    print(item)
    # Select each field and join multiple matches with ';'
    word = item.select("zidingyi > h1")
    word = ';'.join(str(e) for e in word)
    print(word)
    word_cixing = item.select(".CategorieGrammaticale")
    word_cixing = ';'.join(str(e) for e in word_cixing)
    print(word_cixing)
    word_jieshi_fr = item.select(".Indicateur")
    word_jieshi_fr = ';'.join(str(e) for e in word_jieshi_fr)
    print(word_jieshi_fr)
    word_jieshi_cn = item.select(".Traductionchinois")
    word_jieshi_cn = ';'.join(str(e) for e in word_jieshi_cn)
    print(word_jieshi_cn)
    word_liju_fr = item.select(".Locution2")
    word_liju_fr = ';'.join(str(e) for e in word_liju_fr)
    print(word_liju_fr)
    word_liju_cn = item.select(".Traduction2chinois")
    word_liju_cn = ';'.join(str(e) for e in word_liju_cn)
    print(word_liju_cn)
    # Parse the joined markup again and keep only the text
    word = BeautifulSoup(word, 'lxml').get_text()
    word_cixing = BeautifulSoup(word_cixing, 'lxml').get_text()
    word_jieshi_fr = BeautifulSoup(word_jieshi_fr, 'lxml').get_text()
    word_jieshi_cn = BeautifulSoup(word_jieshi_cn, 'lxml').get_text()
    word_liju_fr = BeautifulSoup(word_liju_fr, 'lxml').get_text()
    word_liju_cn = BeautifulSoup(word_liju_cn, 'lxml').get_text()
    new['word'] = word
    new['word_cixing'] = word_cixing
    new['word_jieshi_fr'] = word_jieshi_fr
    new['word_jieshi_cn'] = word_jieshi_cn
    new['word_liju_fr'] = word_liju_fr
    new['word_liju_cn'] = word_liju_cn
    count += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    result = pd.concat([result, new], ignore_index=True)

result.to_excel('d:result.xlsx')

References

https://stackoverflow.com/questions/5618878/how-to-convert-list-to-string

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#contents-children

https://blog.csdn.net/fwj_ntu/article/details/78843872
