
Analysis of a simple Python crawler: [Python topic study] Developing a simple crawler in Python

Published: 2025/10/17 python 21 豆豆


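Learn to build a lightweight crawler. The example here scrapes static web pages that need no login. Topics: an introduction to crawlers, a simple crawler architecture, the URL manager, the web page downloader (urllib2), and the web page parser (BeautifulSoup).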

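I. Crawler overview and its technical value

1. What a crawler is

Crawler: a program that automatically fetches information from the Internet.

Put another way, it visits web pages automatically and extracts data from them.

2. The value of a crawler

Internet data, put to your own use!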

II. Simple crawler architecture

Run flow:
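
As a rough sketch of one common run flow, assuming a scheduler loop that takes a URL from the pending set, downloads it, parses it, stores the data, and queues any new links: the `download` and `parse` functions and the `FAKE_PAGES` table below are hypothetical stand-ins so the loop can run without network access.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (title, outgoing links).
FAKE_PAGES = {
    "/a": ("Page A", ["/b", "/c"]),
    "/b": ("Page B", ["/a"]),
    "/c": ("Page C", []),
}


def download(url):
    # Downloader stand-in: returns the stored page, or None if it is missing.
    return FAKE_PAGES.get(url)


def parse(page):
    # Parser stand-in: splits a page into its data and its outgoing links.
    title, links = page
    return title, links


def crawl(root_url, max_pages=1000):
    pending = deque([root_url])  # URLs waiting to be crawled
    seen = {root_url}            # every URL ever queued, to avoid repeats and loops
    collected = []
    while pending and len(collected) < max_pages:
        url = pending.popleft()
        page = download(url)
        if page is None:
            continue  # skip pages that failed to download
        title, links = parse(page)
        collected.append((url, title))
        for link in links:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return collected


results = crawl("/a")
print(results)
```

The `seen` set is what keeps the loop from revisiting "/a" when "/b" links back to it.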

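III. URL manager and how to implement it

1. URL manager

URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled, which prevents both repeated crawling and circular crawling.

2. Implementation options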
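
One possible implementation, assuming everything fits in memory, keeps the two collections as Python sets. The class and method names below are illustrative choices, not taken from the original article.

```python
class UrlManager:
    """In-memory URL manager backed by two sets (an assumed design)."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Ignore empty values and anything seen before,
        # so a page is never queued twice.
        if not url:
            return
        if url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate, ignored
])
first = manager.get_new_url()
manager.add_new_url(first)  # already crawled, so it is ignored
print(len(manager.new_urls), len(manager.old_urls))
```

Set membership tests are fast, which is why a set is a natural fit for the "have I seen this URL?" check; a larger crawl might keep the same two collections in a database instead.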

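IV. Web page downloader and the urllib2 module

1. Web page downloader

A tool that downloads the web page behind a URL from the Internet to the local machine.

Which web page downloaders does Python offer?

2. Three ways to download a page with urllib2

a. urllib2 method 1: the simplest approach

b. urllib2 method 2: add data and an HTTP header

c. urllib2 method 3: add handlers for special scenarios

3. urllib2 example code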

Since I am using Python 3.x here, the module to import is urllib.request rather than urllib2.

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

# Method 1: the simplest call, passing the URL straight to urlopen
print('Method 1')
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# Method 2: build a Request object so an HTTP header can be added
print('Method 2')
request = urllib.request.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# Method 3: install an opener that carries a cookie handler
print('Method 3')
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(cj)
print(response3.read())

Output:

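V. Web page parsers and the third-party BeautifulSoup module

1. About web page parsers

A tool that extracts valuable data from a web page.

Which web page parsers does Python offer?

Structured parsing: the DOM (Document Object Model) tree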

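2. BeautifulSoup: introduction and installation

Install and test BeautifulSoup4. To install: pip install beautifulsoup4

Even though this install succeeded, PyCharm still could not import the module. I then downloaded the package from the official site, unpacked it, and installed it that way, yet the error No module named 'bs4' persisted.

In the end it only worked after installing the package from within PyCharm, as follows.

Open the following window.

Click "Install Package" to install; the following message indicates success.

After a successful install, reopening the window shows the installed version and related details, as below.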

3. BeautifulSoup syntax

4. BeautifulSoup example

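from bs4 import BeautifulSoup
import re

# Sample HTML, following the "three sisters" example in the Beautiful Soup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
# In Python 3 a str is already Unicode, so from_encoding="utf-8" is ignored; leave it out.
soup = BeautifulSoup(html_doc, 'html.parser')

print("All links")
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print("The link for Lacie")
link_node = soup.find('a', href="http://example.com/lacie")
print(link_node.name, link_node['href'], link_node.get_text())

print("Regular expression match")
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print("Text of the p paragraph")
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())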

Output:

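VI. Hands-on: crawling data from 1000 Baidu Baike pages

1. Analyzing the target

Target: the Baidu Baike Python entry and the entry pages related to it; collect the title and summary of each.

URL format: entry page URLs look like /item/計算機程序設計語言/7073760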
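
Because entry links come in this relative /item/... form, they have to be filtered and turned into absolute URLs before they can be downloaded. A minimal sketch, assuming https://baike.baidu.com as the base address and a hypothetical helper named `to_entry_urls` (neither appears in the original); the "/help" and "/item/Python" paths are made-up sample inputs.

```python
import re
from urllib.parse import urljoin

# Assumed base address; the source only gives the relative /item/... form.
BASE_URL = "https://baike.baidu.com"

# Entry pages are recognized by a path that starts with /item/.
ITEM_PATH = re.compile(r"^/item/")


def to_entry_urls(page_url, hrefs):
    """Return absolute URLs for the hrefs that look like entry pages."""
    urls = []
    for href in hrefs:
        if ITEM_PATH.match(href):
            # urljoin resolves the relative path against the current page URL.
            urls.append(urljoin(page_url, href))
    return urls


hrefs = ["/item/計算機程序設計語言/7073760", "/help", "/item/Python"]
entry_urls = to_entry_urls(BASE_URL + "/item/Python", hrefs)
print(entry_urls)
```

With BeautifulSoup, the same pattern could be passed directly as `href=re.compile(r"^/item/")` to `find_all('a', ...)`, in the same way the regular expression example earlier matches on `href`.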

Data format:

Title:

***

Summary:

***
