
Writing a general-purpose Douban crawler in Python, with visual analysis

Published: 2025/3/20 python 13 豆豆


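Original technical WeChat public account: bigsai. This article was published on 1024 (Programmers' Day); reply "bigsai" there to get an architect-level PDF resource. Happy holiday to everyone, and may your wishes come true. If you appreciate the good wishes, a like and a share in return would be lovely!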

Article structure

- Preface
- Login
- Crawling
- Storage
- Visual analysis

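Preface

In one of my courses, the teacher gave each group a task: introduce and demonstrate a small module or tool. My group happened to draw the Python crawler topic.

I thought, isn't this easy? What's the fuss? Since I probably wouldn't have the time and energy to learn something brand new, I decided to simply work on the comments (short reviews) of Douban movies.

I wrote a similar article about Ne Zha before, but this one is meant to be as detailed as I can make it. It implements scraping and visual analysis of the (popular) short reviews of any movie. In other words, you only provide the link and some basic information, and it can do the scraping and analysis for you.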

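Analysis

For a Douban crawler, what should we consider, and how do we analyze the site? Start from the Douban movie homepage.

The first step is simply to try it: open any movie, here Jiang Ziya (姜子牙) as the example. You will find that the page is not dynamically rendered. It uses traditional server-side rendering, so requesting the URL directly returns the data. But as you page through the comments you will notice that users who are not logged in can only reach a limited number of pages; only logged-in users are allowed to see the later ones.

So the flow should be: login -> crawl -> store -> visual analysis.

A note on the environment and the packages to install: the environment is Python 3, and the code runs on Windows and Linux. If it does not run on mac or Linux, or you hit garbled-font problems, send me a message. The packages installed with pip are listed below; use the Tsinghua mirror, otherwise the download is very slow (thoughtful, right?).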

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install wordcloud -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

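Login

Start from Douban's login page.

Once there you will see a password-login tab. We need to work out what happens during login. The F12 console alone may not be enough, so a packet-capture tool such as Fiddler is also useful.

Open the F12 console and click login. After a few attempts it turns out the login endpoint is simple:

Inspecting the request parameters shows an ordinary, unencrypted form request. You can of course capture it with Fiddler as well; here I simply tested it with a wrong password. If login fails for you, try logging in manually in the browser, logging out, and then running the program again.

The login module can then be written like this: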

import urllib.parse

import requests

url = 'https://accounts.douban.com/j/mobile/login/basic'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
    'Origin': 'https://accounts.douban.com',
    'content-Type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'connection': 'keep-alive',
    'Host': 'accounts.douban.com',
}
# Template of the form fields the login request carries
data = {'ck': '', 'name': '', 'password': '', 'remember': 'false', 'ticket': ''}


def login(username, password):
    # Copy the template so the module-level dict is never overwritten
    # (the original reassigned the global to a string, breaking a second call).
    form = dict(data, name=username, password=password)
    payload = urllib.parse.urlencode(form)
    print(payload)
    # verify=False skips TLS certificate verification, as in the original code
    req = requests.post(url, headers=header, data=payload, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies
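Before sending real credentials, it can help to look at the form body the login request would carry. The following is a minimal sketch using only the standard library. `build_login_body` is a hypothetical helper name, the field names are taken from the `data` dict above, and whether Douban still accepts exactly these fields is an assumption.

```python
from urllib.parse import urlencode, parse_qs


def build_login_body(username, password):
    """Return the URL-encoded form body for the login POST (hypothetical helper)."""
    fields = {
        'ck': '',
        'name': username,
        'password': password,
        'remember': 'false',
        'ticket': '',
    }
    return urlencode(fields)


# Made-up credentials, used only to show the encoding
body = build_login_body('user@example.com', 'p@ss word')
print(body)

# Decoding the body again recovers the original values, special characters included
decoded = parse_qs(body, keep_blank_values=True)
print(decoded['name'], decoded['password'])
```

Because `urlencode` percent-encodes characters such as `@` and spaces, a password containing them still arrives at the server as a single field.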

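With this part figured out, the overall execution flow looks roughly like this:

Crawling

After a successful login, we can carry the login information with each request and crawl the site as we like. Although the site uses traditional page interaction, you will notice an ajax request every time you switch comment pages.

That endpoint returns the comment section directly, so there is no need to request the whole page and then extract this part. The URL follows the same pattern as analyzed earlier: only `start`, the offset of the first comment on the page, changes, so we can simply build the URLs ourselves.

In other words, keep building URLs in a loop until a request no longer succeeds.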

https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&(other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=20&(other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=40&(other parameters omitted)
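The URL pattern above can be sketched as a small generator. This is only an illustration: `comment_page_urls` is a hypothetical name, and the query parameters are the ones that appear in the request URLs used later in this article, not a documented API.

```python
from urllib.parse import urlencode

BASE = 'https://movie.douban.com/subject/{movie_id}/comments'


def comment_page_urls(movie_id, pages, page_size=20):
    """Yield the comment URLs for the first `pages` pages of one movie."""
    for page in range(pages):
        query = urlencode({
            'start': page * page_size,  # offset of the first comment on the page
            'limit': page_size,
            'sort': 'new_score',
            'status': 'P',
            'comments_only': 1,
        })
        yield BASE.format(movie_id=movie_id) + '?' + query


for url in comment_page_urls(25907124, 3):
    print(url)
```

Only `start` differs between the printed URLs: 0, 20, 40.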

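How do we extract the information after requesting each URL?
We filter the data with CSS selectors: every comment uses the same styling, so in the HTML the comments look just like the items of a list.

Looking again, the data returned by the ajax endpoint is exactly the region marked in red below, so we can search by class, split the result into one group per comment, and process each group.

For the implementation, we use requests to send the request and BeautifulSoup to parse the returned HTML.
The fields we need are easy to locate in the corresponding parts of the markup.

The code is: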

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&limit=20&status=P&sort=new_score&comments_only=1&ck=C7di'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
req = requests.get(url, headers=header, verify=False)
res = req.json()  # the response is JSON
res = res['html']  # the comment markup sits under the 'html' key
soup = BeautifulSoup(res, 'lxml')
node = soup.select('.comment-item')  # one node per comment
for va in node:
    name = va.a.get('title')  # reviewer name
    # second-to-last character of the rating span's first class name
    star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]
    comment = va.select_one('.short').text  # comment text
    votes = va.select_one('.votes').text  # number of votes
    print(name, star, votes, comment)
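The `[0][-2]` indexing above takes the second-to-last character of a class name, which only gives a digit when the span really is a rating. A slightly more defensive sketch is shown below; it assumes the rating class has the form `allstarN0` (for example `allstar40` for four stars), which is inferred from that indexing rather than from any documentation, and `parse_star` is a hypothetical helper.

```python
import re

# Assumed class-name pattern: 'allstar' followed by the star digit and a zero
STAR_CLASS = re.compile(r'^allstar(\d)0$')


def parse_star(classes):
    """Return the star rating found in a list of CSS class names, or None."""
    for name in classes:
        match = STAR_CLASS.match(name)
        if match:
            return int(match.group(1))
    return None


print(parse_star(['allstar40', 'rating']))  # 4
print(parse_star(['comment-time']))         # None
```

Returning `None` for a comment without a rating keeps stray characters out of the stored star column.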

The result of this test run is:

Storage

Once the data is crawled we have to think about storage. Here we save it to an Excel (.xls) file.

xlwt writes data into an Excel file. A basic xlwt example:

import xlwt

# Create a writable workbook object
workbook = xlwt.Workbook(encoding='utf-8')
# Create a worksheet
worksheet = workbook.add_sheet('sheet1')
# Write a cell: first argument is the row, second the column, third the content
worksheet.write(0, 0, 'bigsai')
# Save as test.xls (xlwt writes the legacy .xls format, not .xlsx)
workbook.save('test.xls')

xlrd reads data from an Excel file. A basic xlrd example for this case:

import xlrd

# Open the file named test.xls
workbook = xlrd.open_workbook('test.xls')
# Get the first sheet
table = workbook.sheets()[0]
# Each row is returned as a list of cell values
nrows = table.nrows
for i in range(nrows):
    print(table.row_values(i))  # print every row
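If a plain-text format is preferred, the same rows could also be stored as CSV with only the standard library. This is an alternative sketch, not what the rest of the article uses: the file name, column names, and the two sample rows are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up sample rows: index, reviewer name, stars, votes, comment text
rows = [
    [1, 'user_a', 5, 120, 'Great film'],
    [2, 'user_b', 3, 8, '一般般'],
]

path = os.path.join(tempfile.mkdtemp(), 'comments.csv')

# utf-8-sig writes a BOM so that Excel detects the encoding of Chinese text
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['index', 'name', 'star', 'votes', 'comment'])
    writer.writerows(rows)

# Reading back: every cell comes back as a string
with open(path, newline='', encoding='utf-8-sig') as f:
    loaded = list(csv.reader(f))

print(loaded[0])
print(loaded[2])
```

Note that CSV does not keep types, so numeric columns have to be converted with `int()` when they are read back.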

At this point the login, crawling and storage modules together are enough to save the data locally. The combined code is:

import urllib.parse

import requests
import xlwt
from bs4 import BeautifulSoup


# Log in with account name and password
def login(username, password):
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
        'Origin': 'https://accounts.douban.com',
        'content-Type': 'application/x-www-form-urlencoded',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Host': 'accounts.douban.com',
    }
    # Parameters the login request has to carry
    data = {'ck': '', 'name': '', 'password': '', 'remember': 'false', 'ticket': ''}
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies


# cookies: cookies of a successful login (the server identifies the user by them)
# mvid: the id of the movie
def getcomment(cookies, mvid):
    start = 0
    w = xlwt.Workbook(encoding='utf-8')  # create a writable workbook object
    ws = w.add_sheet('sheet1')  # create a worksheet
    index = 1  # row number to write next in the xls file
    while True:
        # Send the request with a browser-like header
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        }
        # try/except: any error is taken to mean the crawl is finished
        try:
            # Build the url; start grows by 20 each time
            url = 'https://movie.douban.com/subject/' + str(mvid) + '/comments?start=' + str(start) + '&limit=20&sort=new_score&status=P&comments_only=1'
            start += 20
            # Send the request
            req = requests.get(url, cookies=cookies, headers=header)
            # The response is a JSON string; req.json() parses it
            res = req.json()
            res = res['html']  # the data we need is under the `html` key
            soup = BeautifulSoup(res, 'lxml')  # parse the html to extract information
            # Every comment has class comment-item, giving 20 records per url
            node = soup.select('.comment-item')
            if not node:  # an empty page raises no error, so stop explicitly
                break
            for va in node:  # iterate over the comments
                name = va.a.get('title')  # reviewer name
                star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]  # star rating
                votes = va.select_one('.votes').text  # number of votes
                comment = va.select_one('.short').text  # comment text
                print(name, star, votes, comment)
                ws.write(index, 0, index)  # row index, column 0: index
                ws.write(index, 1, name)  # row index, column 1: reviewer
                ws.write(index, 2, star)  # row index, column 2: stars
                ws.write(index, 3, votes)  # row index, column 3: votes
                ws.write(index, 4, comment)  # row index, column 4: comment text
                index += 1
        except Exception as e:  # leave the loop on any exception
            print(e)
            break
    w.save('test.xls')  # save as test.xls


if __name__ == '__main__':
    username = input('Account: ')
    password = input('Password: ')
    cookies = login(username, password)
    mvid = input('Movie id: ')
    getcomment(cookies, mvid)
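The loop above keeps requesting pages until something goes wrong. The stopping logic can be looked at on its own, without any network access. The sketch below is an illustration under stated assumptions: `crawl_all` and `fake_fetch` are hypothetical names, the page cap and the delay between requests are additions that the original program does not have, and the fake data stands in for real responses.

```python
import time


def crawl_all(fetch_page, page_size=20, max_pages=50, delay=0.0):
    """Collect items page by page until a page comes back empty.

    fetch_page(start) is a caller-supplied function that returns the list
    of comments beginning at offset `start`.
    """
    collected = []
    for page in range(max_pages):
        try:
            items = fetch_page(page * page_size)
        except Exception as exc:  # network or parsing failure: stop cleanly
            print('stopping after error:', exc)
            break
        if not items:  # empty page: nothing left to read
            break
        collected.extend(items)
        time.sleep(delay)  # pause between requests
    return collected


# A stand-in for the real request: 45 fake comments served 20 at a time
fake_comments = ['comment %d' % i for i in range(45)]


def fake_fetch(start):
    return fake_comments[start:start + 20]


result = crawl_all(fake_fetch)
print(len(result))  # 45
```

Passing the fetch function in as an argument means the same loop can be tested with fake data and later run against the real endpoint with a non-zero `delay`.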

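After running it, the data is stored successfully:

Visual analysis

We want rating statistics and word-frequency statistics, and we also want to generate word clouds. The libraries for this are matplotlib and WordCloud.

The logic: read the xls file, segment the comments into words and count word frequencies, then turn the most frequent words into a bar chart and a word cloud. The star ratings are shown as a pie chart. The main parts of the code are commented. The code is: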

import matplotlib
import matplotlib.pyplot as plt
import jieba
import xlrd
import numpy as np
from wordcloud import WordCloud
from collections import Counter

# Set a font that can render Chinese; some Linux systems have font problems
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False


# comment is a list of rows such as
# [['1', 'name', 'star', 'votes', 'comment text'], ['2', 'name', 'star', 'votes', 'comment text']]
def anylasescore(comment):
    score = [0, 0, 0, 0, 0, 0]  # occurrences of 0, 1, 2, 3, 4, 5 stars
    count = 0  # total number of ratings
    for va in comment:  # each row: ['1', 'name', 'star', 'votes', 'comment text']
        try:
            score[int(va[2])] += 1  # column 3 is the star value; cast it to int
            count += 1
        except Exception:
            continue
    print(score)
    if count == 0:  # nothing to plot, and avoids dividing by zero
        return
    label = '1分', '2分', '3分', '4分', '5分'
    color = 'blue', 'orange', 'yellow', 'green', 'red'  # colour of each slice
    size = [0, 0, 0, 0, 0]  # percentages
    explode = [0, 0, 0, 0, 0]  # distance of each slice from the centre
    for i in range(1, 6):  # stars 1-5 go to slots 0-4 (the original loop skipped 5 stars)
        size[i - 1] = score[i] * 100 / count
        explode[i - 1] = score[i] / count / 10
    pie = plt.pie(size, colors=color, explode=explode, labels=label, shadow=True, autopct='%1.1f%%')
    for font in pie[1]:
        font.set_size(8)
    for digit in pie[2]:
        digit.set_size(8)
    plt.axis('equal')  # make the pie a circle
    plt.title('各个评分占比', fontsize=12)  # title
    plt.legend(loc=0, bbox_to_anchor=(0.82, 1))  # legend
    # Set the legend font size
    leg = plt.gca().get_legend()
    ltext = leg.get_texts()
    plt.setp(ltext, fontsize=6)
    plt.savefig("score.png")
    # Show the figure
    plt.show()


def getzhifang(counter):  # the bar chart needs x and y values
    x = []
    y = []
    for k, v in counter.most_common(15):  # the 15 most frequent words
        x.append(k)
        y.append(v)
    Xi = np.array(x)  # convert to numpy arrays
    Yi = np.array(y)
    width = 0.6
    plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
    plt.figure(figsize=(8, 6))  # figure size 8:6
    plt.bar(Xi, Yi, width, color='blue', label='热门词频统计', alpha=0.8)
    plt.xlabel("词频")
    plt.ylabel("次数")
    plt.savefig('zhifang.png')
    plt.show()


def getciyun_most(counter):  # build the word clouds
    x = []  # the words
    y = []  # their counts
    for k, v in counter.most_common(300):  # among the 300 most frequent words
        x.append(k)
        y.append(v)
    xi = ' '.join(x[0:150])  # first 150 words, space-separated as WordCloud expects
    print(xi)
    # backgroud_Image = plt.imread('')  # for a custom-shaped word cloud
    # Size, font and other basic settings. font_path is required for Chinese,
    # otherwise the cloud shows a pile of small boxes.
    wc = WordCloud(background_color="white",
                   width=1500, height=1200,
                   # min_font_size=40,
                   # mask=backgroud_Image,
                   font_path="simhei.ttf",  # SimHei
                   max_font_size=150,  # maximum font size
                   random_state=50,  # seed for layout and colours
                   )
    my_wordcloud = wc.generate(xi)  # cloud of the first 150 words
    plt.imshow(my_wordcloud)  # display
    my_wordcloud.to_file("img.jpg")  # save
    xi = ' '.join(x[150:300])  # second cloud from the next 150 words
    my_wordcloud = wc.generate(xi)
    my_wordcloud.to_file("img2.jpg")
    plt.axis("off")


def anylaseword(comment):
    # Filter words: these carry no meaning and are dropped
    stopwords = ['这个', '一个', '不少', '起来', '没有', '就是', '不是', '那个', '还是', '剧情', '这样', '那样', '这种', '那种', '故事', '人物', '什么']
    print(stopwords)
    c = Counter()  # maps each word to its count
    for va in comment:
        seg_list = jieba.cut(va[4], cut_all=False)  # jieba word segmentation
        for x in seg_list:
            if len(x) > 1 and x != '\r\n':  # skip single characters and line breaks
                c[x] += 1  # one more occurrence of this word
    for (k, v) in c.most_common():  # drop words seen fewer than 5 times
        if v < 5 or k in stopwords:
            c.pop(k)
    print(len(c), c)
    getzhifang(c)  # bar chart from these counts
    getciyun_most(c)  # word clouds


def anylase():
    data = xlrd.open_workbook('test.xls')  # open the xls file
    table = data.sheets()[0]  # first sheet
    nrows = table.nrows  # number of rows
    comment = []
    for i in range(nrows):
        comment.append(table.row_values(i))  # add the row to the list
    anylasescore(comment)
    anylaseword(comment)


if __name__ == '__main__':
    anylase()
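The numbers behind the pie chart and the bar chart can be computed separately from the plotting, which makes them easy to check. The following is a minimal sketch with only the standard library: `rating_shares` and `frequent_words` are hypothetical helpers, the input tokens are assumed to come from a segmenter such as jieba, and the sample values are made up.

```python
from collections import Counter


def rating_shares(stars):
    """Map each star value 1-5 to its percentage of all valid ratings."""
    counts = Counter(s for s in stars if s in (1, 2, 3, 4, 5))
    total = sum(counts.values())
    if total == 0:
        return {star: 0.0 for star in range(1, 6)}
    return {star: counts[star] * 100 / total for star in range(1, 6)}


def frequent_words(token_lists, stopwords, min_count=5):
    """Count tokens longer than one character, dropping stopwords and rare words."""
    counter = Counter(
        token
        for tokens in token_lists
        for token in tokens
        if len(token) > 1 and token not in stopwords
    )
    return Counter({w: n for w, n in counter.items() if n >= min_count})


# Made-up ratings; None stands for a comment without a rating
shares = rating_shares([5, 5, 4, 3, None, 5])
print(shares)

# Made-up, already segmented comments
words = frequent_words(
    [['宫崎骏', '的', '电影'], ['宫崎骏', '这个', '电影']],
    stopwords={'这个'},
    min_count=2,
)
print(words.most_common())
```

Keeping the counting apart from matplotlib and WordCloud also means the statistics can be verified on a machine without a Chinese font installed.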

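Now let's look at the results:

Here I picked data for two films, Jiang Ziya (姜子牙) and Spirited Away (千与千寻). The rating proportions of the two compare as follows:

The ratings show that Spirited Away is clearly better received: most people are willing to give it five stars. It is basically one of the best animated films around. Next, the word-frequency bar charts:

Clearly the creator of Spirited Away is the more famous one, and influential enough that everyone keeps mentioning him. Finally, the word clouds of the two films:

Hayao Miyazaki, Haku, Granny: so many memories. That's all from me; if you have anything to add, you are welcome to discuss!

If you liked this, please give it a like and a share. Original WeChat public account: bigsai, sharing knowledge and practical material!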
