
Web Scraping Series, Part 5: Scraping the Douban Books Top 250 with requests and BeautifulSoup

Published: 2024/7/5


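In the previous post we covered the basics of BeautifulSoup. In this post we use it to scrape the Douban Books Top 250.

1. Page Analysis

The page we scrape is at https://book.douban.com/top250?icn=index-book250-all.

It looks much like the Douban Movies Top 250: scroll to the very bottom of the page and you will find a pagination list.

The URL of each page also advances in steps of 25 (the start parameter goes 0, 25, 50, and so on), so the scraping approach is the same as for the Douban Movies Top 250.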
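The pagination rule can be sketched as follows. This is a minimal sketch that assumes the start query parameter used in the URLs later in this post and 10 pages of 25 books each; the helper name build_page_urls is made up for illustration.

```python
BASE_URL = 'https://book.douban.com/top250'


def build_page_urls(pages=10, page_size=25):
    """Return one URL per page; `start` grows by page_size on each page."""
    return [f'{BASE_URL}?start={i * page_size}' for i in range(pages)]


if __name__ == '__main__':
    for url in build_page_urls():
        print(url)
```

Keeping the page count and page size as parameters makes the 25-per-page assumption visible in one place.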

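2. Scraping Targets

The fields we want to scrape in this post are the title, author, publisher, price, rating, and one-line recommendation.

3. Scraping the First Page

Fetching the page source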

    import requests


    def get_html(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
        html = requests.get(url, headers=headers)
        return html.text


    if __name__ == '__main__':
        url = 'https://book.douban.com/top250?start=0'
        html = get_html(url)
        print(html)

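Running this prints the HTML source of the page.

Parsing and extracting the information we need

Looking at the page source, we can see that the information for each book sits in its own <table> tag. We can therefore first select the <table> nodes with a CSS selector and then loop over them to extract the information for each book. The code that selects the <table> nodes is: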

    from bs4 import BeautifulSoup


    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        books = soup.select('div.article div.indent table')
        print(books)

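Running it shows that the output is a list, and that its first element contains the information for 《追风筝的人》 (The Kite Runner). Here we use BeautifulSoup's select() method with a CSS selector to extract the <table> nodes. The selector passed in is div.article div.indent table. div.article selects <div> tags whose class attribute includes article. The space that follows expresses nesting (a descendant relationship), so div.indent then selects the <div> tags with class="indent" inside that first <div>. The second space expresses nesting again, selecting the <table> tags inside the second <div>.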
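The selector can be tried on a small hand-written fragment. This is only a sketch: the HTML below imitates the nesting just described and is not the real Douban markup, the function name select_book_tables is made up, and it uses the standard-library 'html.parser' instead of 'lxml' so that it runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that only imitates the nesting described above.
SAMPLE_HTML = """
<div class="article">
  <div class="indent">
    <table><tr><td>book 1</td></tr></table>
    <table><tr><td>book 2</td></tr></table>
  </div>
</div>
<div class="aside">
  <table><tr><td>not a book</td></tr></table>
</div>
"""


def select_book_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    # A space between selectors means "descendant of", so only tables
    # inside div.article > ... > div.indent are matched.
    return soup.select('div.article div.indent table')


if __name__ == '__main__':
    tables = select_book_tables(SAMPLE_HTML)
    print(len(tables))
    print([table.td.string for table in tables])
```

The table inside div.aside is not selected, because it does not sit under a div.article and div.indent chain.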

Next, we loop over the selected <table> tags and extract the book information from each one. We start with the title. Here is a first attempt:

    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        books = soup.select('div.article div.indent table')
        for book in books:
            title = book.div.a.string
            print(title)

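The output shows that we do get titles, but for some books no title is found and None is returned, which is not good enough. First, a word on how the title is obtained here. We use nested selection with node selectors: we select the <div> tag and then the <a> tag beneath it. Why not select the <a> tag directly? Because the <a> tag that carries the title is the second <a> tag under the <table> node, and selecting <a> directly would return only the first one, which does not contain the title. Now let us look into why the string attribute fails to return the title for some books. We start by printing the <a> tags themselves.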

    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        books = soup.select('div.article div.indent table')
        for book in books:
            title = book.div.a
            print(title)

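Then we look at the entry for a book that returned None.

The <a> tag for 《三体》 (The Three-Body Problem) has a different internal structure from the others, and that is why the string attribute returns None: string is only defined when a tag has a single child, and it is None when the tag contains more than one. Notice, however, that every <a> tag has a title attribute. Can we get the title from that attribute instead?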

    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        books = soup.select('div.article div.indent table')
        for book in books:
            title = book.div.a['title']
            print(title)

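The output shows that we now get the information we want, and the data is cleaner as well. We have spent this long on the title to make one point: there are usually several ways to get the same piece of information, and when one of them gives poor results it is worth trying another.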
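The difference between the two approaches can be reproduced on two hand-written tags. This is a sketch under assumptions: the fragments only imitate the two shapes of <a> tag discussed above (they are not copied from Douban), the function names are made up, and 'html.parser' is used so that the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hand-written tags imitating the two shapes discussed above.
SIMPLE = '<a href="#" title="Book A">Book A</a>'
NESTED = '<a href="#" title="Book B">Book B<span>: subtitle</span></a>'


def title_by_string(fragment):
    """Read the title from the tag's text via the string attribute."""
    a = BeautifulSoup(fragment, 'html.parser').a
    # string is None when the tag has more than one child.
    return a.string


def title_by_attribute(fragment):
    """Read the title from the tag's title attribute."""
    a = BeautifulSoup(fragment, 'html.parser').a
    return a['title']


if __name__ == '__main__':
    for fragment in (SIMPLE, NESTED):
        print(title_by_string(fragment), '|', title_by_attribute(fragment))
```

For the nested tag the string attribute gives None, while the title attribute still returns the title.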

Next, we extract all of the fields in one pass.

    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        tables = soup.select('div.article div.indent table')
        books = []
        for table in tables:
            title = table.div.a['title']
            # information packs several fields into one string. Some books do
            # not follow the usual format, so indexing the split list can
            # easily raise IndexError. We handle it to keep the crawler robust.
            information = table.p.string
            informations = information.split('/')
            # If there are more than four fields, drop the extras after the author.
            while len(informations) > 4:
                del informations[1]
            try:
                author = informations[0]
                press = informations[1]
                date = informations[2]
                price = informations[3]
            except IndexError:
                continue
            # Some entries are structured differently from the rest, so a node
            # found on most entries may be missing here. find() then returns
            # None and reading .string raises AttributeError, which we handle.
            try:
                score = table.find(attrs={'class': 'rating_nums'}).string
                recommendation = table.find(attrs={'class': 'inq'}).string
            except AttributeError:
                continue
            book = {'title': title, 'author': author, 'publisher': press,
                    'date': date, 'price': price, 'rating': score,
                    'recommendation': recommendation}
            books.append(book)
        return books

The result is a list of dictionaries. That format is convenient for storage and later use, but it is not pleasant to read when printed to the screen, so we write a separate function just for output:

    def print_(books):
        for book in books:
            print('*' * 50)
            for key, value in book.items():
                print(f'{key}:{value}')
            print('*' * 50)
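The field-splitting step inside parse_html can also be tried on its own. The sketch below is not the author's code: split_info is a made-up helper, the sample strings are invented for illustration rather than scraped, and stripping whitespace around each field is an addition that parse_html does not perform.

```python
def split_info(information):
    """Split 'author / [extras /] publisher / date / price' into four fields.

    Returns None when fewer than four fields are present.
    """
    # Stripping whitespace is an addition; parse_html keeps the raw pieces.
    fields = [part.strip() for part in information.split('/')]
    # Same rule as parse_html: drop extra fields right after the author
    # until only four remain.
    while len(fields) > 4:
        del fields[1]
    if len(fields) < 4:
        return None
    author, press, date, price = fields
    return {'author': author, 'publisher': press, 'date': date, 'price': price}


if __name__ == '__main__':
    # Invented sample strings, not real page data.
    print(split_info('Author A / Translator B / Press C / 2006-5 / 29.00'))
    print(split_info('Press C / 2006-5'))
```

Returning None for short inputs plays the same role as the except IndexError: continue branch: the caller can skip entries that do not fit the expected shape.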

Now we put all of the first-page code together and look at the output.

    import requests
    from bs4 import BeautifulSoup


    def get_html(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
        html = requests.get(url, headers=headers)
        return html.text


    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        tables = soup.select('div.article div.indent table')
        books = []
        for table in tables:
            title = table.div.a['title']
            # information packs several fields into one string. Some books do
            # not follow the usual format, so indexing the split list can
            # easily raise IndexError. We handle it to keep the crawler robust.
            information = table.p.string
            informations = information.split('/')
            # If there are more than four fields, drop the extras after the author.
            while len(informations) > 4:
                del informations[1]
            try:
                author = informations[0]
                press = informations[1]
                date = informations[2]
                price = informations[3]
            except IndexError:
                continue
            # Some entries are structured differently from the rest, so a node
            # found on most entries may be missing here. find() then returns
            # None and reading .string raises AttributeError, which we handle.
            try:
                score = table.find(attrs={'class': 'rating_nums'}).string
                recommendation = table.find(attrs={'class': 'inq'}).string
            except AttributeError:
                continue
            book = {'title': title, 'author': author, 'publisher': press,
                    'date': date, 'price': price, 'rating': score,
                    'recommendation': recommendation}
            books.append(book)
        return books


    def print_(books):
        for book in books:
            print('*' * 50)
            for key, value in book.items():
                print(f'{key}:{value}')
            print('*' * 50)


    if __name__ == '__main__':
        url = 'https://book.douban.com/top250?start=0'
        html = get_html(url)
        books = parse_html(html)
        print_(books)

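Running this prints the fields of each book between two rows of asterisks.

Note that we have not written any storage function here, because this post is only meant to demonstrate BeautifulSoup. If you need to store the data, see Part 3 of this series, which scrapes the Douban Movies Top 250 with requests and regular expressions.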
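For readers who want something self-contained at this point, one possible way to store the result is sketched below. It is not the method from Part 3. It assumes the list of dictionaries returned by parse_html and writes it as JSON using only the standard library; the function name save_books and the default file name are hypothetical.

```python
import json


def save_books(books, path='books.json'):
    """Write a list of book dictionaries to a UTF-8 JSON file."""
    # str() turns BeautifulSoup string objects into plain Python strings.
    plain = [{key: str(value) for key, value in book.items()} for book in books]
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese text readable in the file.
        json.dump(plain, f, ensure_ascii=False, indent=2)
    return path
```

It would be called as save_books(parse_html(html)) after the parsing step.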

4. Scraping the Entire Douban Books Top 250

As in the earlier crawler examples, we build a list of URLs and loop over it. The complete code is as follows:

    import requests
    from bs4 import BeautifulSoup


    def get_html(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
        html = requests.get(url, headers=headers)
        return html.text


    def parse_html(html):
        soup = BeautifulSoup(html, 'lxml')
        tables = soup.select('div.article div.indent table')
        books = []
        for table in tables:
            title = table.div.a['title']
            # information packs several fields into one string. Some books do
            # not follow the usual format, so indexing the split list can
            # easily raise IndexError. We handle it to keep the crawler robust.
            information = table.p.string
            informations = information.split('/')
            # If there are more than four fields, drop the extras after the author.
            while len(informations) > 4:
                del informations[1]
            try:
                author = informations[0]
                press = informations[1]
                date = informations[2]
                price = informations[3]
            except IndexError:
                continue
            # Some entries are structured differently from the rest, so a node
            # found on most entries may be missing here. find() then returns
            # None and reading .string raises AttributeError, which we handle.
            try:
                score = table.find(attrs={'class': 'rating_nums'}).string
                recommendation = table.find(attrs={'class': 'inq'}).string
            except AttributeError:
                continue
            book = {'title': title, 'author': author, 'publisher': press,
                    'date': date, 'price': price, 'rating': score,
                    'recommendation': recommendation}
            books.append(book)
        return books


    def print_(books):
        for book in books:
            print('*' * 50)
            for key, value in book.items():
                print(f'{key}:{value}')
            print('*' * 50)


    if __name__ == '__main__':
        urls = [f'https://book.douban.com/top250?start={i * 25}' for i in range(0, 10)]
        for url in urls:
            html = get_html(url)
            books = parse_html(html)
            print_(books)
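A variant of the loop above, offered only as a sketch: instead of printing page by page, it gathers the books from every page into one list. The name crawl_all and its fetch and parse parameters are made up for illustration. In the real script they would be get_html and parse_html, and passing them in lets the sketch run without network access.

```python
def crawl_all(urls, fetch, parse):
    """Fetch and parse every URL, returning all books in a single list."""
    all_books = []
    for url in urls:
        html = fetch(url)
        all_books.extend(parse(html))
    return all_books


if __name__ == '__main__':
    urls = [f'https://book.douban.com/top250?start={i * 25}' for i in range(10)]

    # Offline stand-ins for get_html and parse_html.
    def fake_fetch(url):
        return url

    def fake_parse(html):
        return [{'source': html}]

    print(len(crawl_all(urls, fake_fetch, fake_parse)))
```

Collecting everything first makes it easy to print or store the full list in one step afterwards.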

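5. Summary

After working through this post, readers should focus on the following:

Flexible use of BeautifulSoup's three ways of selecting nodes
Handling the exceptions that may occur (exceptions in the request step generally do not need to be handled)
Comparing the strengths and weaknesses of this approach with those of regular expressions
Combining regular expressions with BeautifulSoup where that helps

A word on why exceptions in the request step generally do not need to be handled: such an exception usually means that the URL is wrong, the network connection has a problem, or something similar. These are problems we have to fix ourselves rather than skip with a try...except statement, because until they are fixed the crawler cannot continue.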

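If you found this post useful, you are welcome to follow my web scraping tutorial series on the official account 【痕风雨】, where we can learn and exchange ideas together.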
