Web Scraping with Python: Regular Expressions

Published: 2025/3/16 · Category: python

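If you have used the urllib2 library in Python 2.x, you may notice that urllib2 and urllib differ somewhat. In Python 3.x, urllib2 was renamed urllib and split into several submodules: urllib.request, urllib.parse, and urllib.error. Most function names are the same as before, but when using the new urllib you need to check which functions have moved into which submodule.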
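
As a rough map of the Python 3 layout, the sketch below imports one or two names from each submodule and exercises the ones that need no network access. The relative image path is only an illustrative input, and `urljoin` is a standard-library helper that the notes themselves do not use.

```python
from urllib.request import urlopen             # fetching lives in urllib.request
from urllib.parse import urlparse, urljoin     # URL handling lives in urllib.parse
from urllib.error import HTTPError, URLError   # exceptions live in urllib.error

# Split a URL into its components (no request is sent).
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)
print(parts.path)

# Resolve a relative link against the page it was found on.
absolute = urljoin("http://pythonscraping.com/pages/page3.html",
                   "../img/gifts/img1.jpg")
print(absolute)

# HTTPError is a subclass of URLError, so catch HTTPError first if you catch both.
print(issubclass(HTTPError, URLError))
```

`urlopen` is imported here only to show where it lives; nothing is downloaded.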

urlopen opens and reads a remote object retrieved over the network.

    from urllib.request import urlopen

    html = urlopen("http://pythonscraping.com/pages/page1.html")
    print(html.read())

BeautifulSoup formats and organizes messy web content by locating HTML tags, and presents the XML structure to us as easy-to-use Python objects.

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup


    def getTitle(url):
        # Return the first <h1> in the page body, or None if anything is missing.
        try:
            html = urlopen(url)
        except HTTPError as e:
            print(e)
            return None
        try:
            bsObj = BeautifulSoup(html.read(), "lxml")
            title = bsObj.body.h1
        except AttributeError as e:
            print(e)
            return None
        return title


    title = getTitle("http://pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found!")
    else:
        print(title)

Advanced HTML parsing

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    try:
        html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
    except HTTPError as e:
        print(e)
    else:
        # Parse only after a successful download, so html is always defined here.
        bsObj = BeautifulSoup(html, "lxml")
        # Collect every <span class="green"> and print its text.
        nameList = bsObj.findAll("span", {"class": "green"})
        for name in nameList:
            print(name.get_text())

findAll() and find()

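findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

tag: a tag name, or a list of tag names.

attributes: a Python dictionary of attributes and the values to match. For example: .findAll("span", {"class":{"green", "red"}})

recursive: a Boolean, True by default, so findAll searches the children of the tag as well as the children's children. When set to False, findAll looks only at the top-level tags.

text: matches on the text content of tags rather than on their attributes.

limit: caps the number of results, so it applies only to findAll; find is equivalent to findAll with limit=1. When limit is set, the first limit results are returned in the order they appear on the page.

keywords: selects tags that have a given attribute. This is a redundant BeautifulSoup feature: anything it does can be done another way, and it occasionally causes problems. For example, bsObj.findAll(class="green") raises a syntax error because class is a reserved word in Python.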
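
The parameters above can be tried out offline. The sketch below is a minimal illustration on a made-up HTML snippet (the snippet and variable names are assumptions, not part of the original notes). It uses `find_all`, the current bs4 spelling of `findAll`, the built-in `html.parser` so that lxml is not required, and `class_`, which BeautifulSoup accepts in place of the reserved word `class`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page.
html = """
<html><body>
<h1>Title</h1>
<div>
  <span class="green">Anna</span>
  <span class="red">said</span>
  <span class="green">Pierre</span>
</div>
<h2>Subtitle</h2>
</body></html>
"""

bsObj = BeautifulSoup(html, "html.parser")

# tag + attributes: every green span, in document order.
green = [s.get_text() for s in bsObj.find_all("span", {"class": "green"})]

# tag given as a list: match any of several tag names.
headings = [h.name for h in bsObj.find_all(["h1", "h2"])]

# limit: stop after the first N matches; find() behaves like limit=1.
first_two = bsObj.find_all("span", limit=2)
first = bsObj.find("span")

# recursive=False: look only at direct children of the tag being searched.
direct_in_body = bsObj.body.find_all("span", recursive=False)
direct_in_div = bsObj.div.find_all("span", recursive=False)

# keyword form: class is reserved in Python, so bs4 accepts class_ instead.
green_kw = bsObj.find_all(class_="green")

print(green, headings)
print(len(first_two), first.get_text())
print(len(direct_in_body), len(direct_in_div), len(green_kw))
```

The spans sit inside the div, not directly under body, which is why the two recursive=False searches return different counts.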

get_text()

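.get_text() strips every tag from the HTML document you are working with and returns a string containing only the text. If you are working with a large block of source code that contains many hyperlinks, paragraphs, and other tags, .get_text() removes all of them and leaves a single run of tag-free text.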
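
A small sketch of this behaviour, using a made-up one-line fragment and the built-in `html.parser` (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing a link and a bold tag.
html = '<p>Read the <a href="/docs">docs</a> and <b>enjoy</b>.</p>'
bsObj = BeautifulSoup(html, "html.parser")

# The tags are dropped; only the text between them is kept.
text = bsObj.get_text()
print(text)

# The result is a plain str, so tag searches are no longer possible on it.
print(type(text).__name__)

# The parsed object still has its structure, e.g. the link target.
print(bsObj.a["href"])
```

Because the returned string has lost the document structure, it usually makes sense to call .get_text() as the last step, after any searching is done.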

Navigating the tree
Children and descendants

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    try:
        html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    except HTTPError as e:
        print(e)
    else:
        bsObj = BeautifulSoup(html, "lxml")
        # .children yields only the direct children of the giftList table.
        for child in bsObj.find("table", {"id": "giftList"}).children:
            print(child)

Dealing with siblings

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    try:
        html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    except HTTPError as e:
        print(e)
    else:
        bsObj = BeautifulSoup(html, "lxml")
        # next_siblings yields everything after the first row, not the row itself.
        for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
            print(sibling)

Dealing with parents

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    try:
        html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    except HTTPError as e:
        print(e)
    else:
        bsObj = BeautifulSoup(html, "lxml")
        # From the image, step up to its parent cell, then back to the previous cell.
        img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
        print(img.parent.previous_sibling.get_text())
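
The three navigation examples above depend on a live page. As an offline sketch of the same ideas, the block below builds a small made-up table (its contents and the short image paths are assumptions) and walks it with .children, .descendants, .next_siblings, and .parent. The markup deliberately contains no whitespace between tags, because whitespace would show up as extra text nodes among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical, whitespace-free table so that no text nodes sit between rows.
html = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th><th>Image</th></tr>"
    '<tr><td>Vase</td><td>$15.00</td><td><img src="img1.jpg"></td></tr>'
    '<tr><td>Mug</td><td>$5.00</td><td><img src="img2.jpg"></td></tr>'
    "</table>"
)
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

children = list(table.children)        # direct children only: the three rows
descendants = list(table.descendants)  # rows, cells, images, and text nodes

# next_siblings starts after the tag itself, so the header row is left out.
data_rows = list(table.tr.next_siblings)
items = [row.td.get_text() for row in data_rows]

# img -> parent <td> -> previous <td>, which holds the price in this table.
price = bsObj.find("img", {"src": "img1.jpg"}).parent.previous_sibling.get_text()

print(len(children), len(descendants))
print(items, price)
```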

Regular expressions

Email address: [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)
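
To see what this pattern accepts and rejects, the sketch below runs it against a few made-up addresses with Python's re module. The sample strings are assumptions chosen for illustration; `fullmatch` is used so that the whole string has to match.

```python
import re

# The email pattern from the notes, written as a raw string.
EMAIL = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)")

samples = [
    "ryan_m+test@example.com",  # letters, digits, ., _ and + are allowed before @
    "first.last@school.edu",
    "no-at-sign.com",           # no @, and - is not in the character class
    "user@example.io",          # .io is not one of the four listed endings
    "user@mail.example.com",    # only one domain label is allowed before the ending
]

results = {s: bool(EMAIL.fullmatch(s)) for s in samples}
for address, ok in results.items():
    print(address, ok)
```

The last two cases show that the pattern is intentionally simple: it covers common addresses but not every valid one.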

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup
    import re

    try:
        html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    except HTTPError as e:
        print(e)
    else:
        bsObj = BeautifulSoup(html, "lxml")
        # A raw string keeps the backslashes intact; "/" needs no escaping in Python.
        images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
        for img in images:
            print(img["src"])

Accessing attributes
For any tag, myTag.attrs returns all of its attributes.
myTag.attrs["src"] is the value of myTag's src attribute.
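
A minimal sketch of attribute access on a made-up img tag (the alt text and the parser choice are assumptions). It also shows the shorter `myTag["src"]` form used in the regular expression example and the `.get()` method, which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical tag; the src value mirrors the paths used earlier in the notes.
html = '<img src="../img/gifts/img1.jpg" alt="gift">'
myTag = BeautifulSoup(html, "html.parser").img

# .attrs is an ordinary dictionary of attribute names and values.
print(myTag.attrs)

# Two equivalent ways to read one attribute.
print(myTag.attrs["src"])
print(myTag["src"])

# .get() avoids a KeyError when the attribute may be absent.
print(myTag.get("title"))
```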
