A Step-by-Step Guide to Crawling a Zhihu Influencer's Follower List with Scrapy

Published: 2025/3/15

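Introduction: Starting from one Zhihu influencer (a "大V"), we fetch that user's followee list and follower list, then fetch the detailed profile of the influencer and of every user on both lists. Each of those users has followee and follower lists of their own, so repeating the same steps recursively, layer by layer, eventually collects profile data for a large number of users.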
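The layer-by-layer traversal described above can be sketched without any network access. This is only an illustration under stated assumptions: the `FOLLOWEES` and `FOLLOWERS` dictionaries, the user names, and the `crawl_users` helper are all hypothetical stand-ins for the lists a real crawl would download.

```python
from collections import deque

# Hypothetical in-memory stand-in for the follow graph; a real crawl would
# fetch these lists from the site instead.
FOLLOWEES = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": [],
}
FOLLOWERS = {
    "alice": ["carol"],
    "bob": ["alice"],
    "carol": ["alice", "bob", "dave"],
    "dave": [],
}


def crawl_users(start_user, max_users=100):
    """Visit every user reachable through either list, each exactly once."""
    seen = {start_user}          # users already queued, to avoid revisiting
    queue = deque([start_user])  # users whose lists still need expanding
    visited = []
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)
        for neighbour in FOLLOWEES.get(user, []) + FOLLOWERS.get(user, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return visited


print(crawl_users("alice"))
```

In the Scrapy project itself no explicit `seen` set is needed, because Scrapy's scheduler filters duplicate requests by default.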

Authors: Zhao Guosheng (趙國生), Wang Jian (王健)

Source: 大數據DT (ID: hzdashuju)

Create a new Scrapy project with scrapy startproject zhihuuser, then change into the new directory with cd zhihuuser. Generate the spider with scrapy genspider zhihu zhihu.com.

01 Defining the spider.py file

This file defines the URLs to crawl, the crawl rules, and so on.

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = 'zhihu'
    allowed_domains = ['zhihu.com']
    start_urls = ['http://zhihu.com/']

    # Custom crawl URLs
    start_user = 'excited-vczh'
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    # Fixed: the original listing pointed this URL at /followees as well,
    # so the follower list was never actually requested.
    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    # Request the start user's profile, followee list and follower list
    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), callback=self.parseUser)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20), callback=self.parseFollows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20), callback=self.parseFollowers)

    # Parse a user's detailed profile
    def parseUser(self, response):
        result = json.loads(response.text)
        item = UserItem()
        # Copy only the fields that UserItem declares
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
        # Request this user's followees and followers in turn, so the crawl
        # keeps expanding layer by layer
        yield Request(self.follows_url.format(user=result.get('url_token'), include=self.follows_query, offset=0, limit=20), callback=self.parseFollows)
        yield Request(self.followers_url.format(user=result.get('url_token'), include=self.followers_query, offset=0, limit=20), callback=self.parseFollowers)

    # Parse one page of the followee list (users this user follows)
    def parseFollows(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query), callback=self.parseUser)
        # Follow the pagination link until the API reports the last page
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parseFollows)

    # Parse one page of the follower list (users who follow this user)
    def parseFollowers(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query), callback=self.parseUser)
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parseFollowers)
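The two list callbacks share the same logic: read the `url_token` of every entry under `data`, then follow `paging.next` unless `paging.is_end` says the list is finished. The sketch below isolates that logic as a plain function so it can be tried without Scrapy. `SAMPLE_PAGE` is a hypothetical, trimmed-down response body and `parse_follow_page` is an illustrative helper, not part of the original project.

```python
import json

# Hypothetical, trimmed-down response body; the real API returns more fields.
SAMPLE_PAGE = json.dumps({
    "data": [{"url_token": "user-a"}, {"url_token": "user-b"}],
    "paging": {
        "is_end": False,
        "next": "https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20",
    },
})


def parse_follow_page(text):
    """Return (url_tokens, next_page_url); next_page_url is None on the last page."""
    results = json.loads(text)
    # Collect the token of every listed user, skipping entries without one
    tokens = [
        entry["url_token"]
        for entry in results.get("data", [])
        if entry.get("url_token")
    ]
    paging = results.get("paging", {})
    # Only follow the link when the API explicitly says more pages remain
    next_page = paging.get("next") if paging.get("is_end") is False else None
    return tokens, next_page


tokens, next_page = parse_follow_page(SAMPLE_PAGE)
print(tokens, next_page)
```

In the spider, each token becomes a profile request handled by `parseUser`, and a non-empty `next_page` becomes a request handled by the same list callback.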

02 Defining the items.py file

This file declares which fields of the crawled data are kept, so that every stored record has a regular shape.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class UserItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    allow_message = Field()
    answer_count = Field()
    articles_count = Field()
    avatar_url = Field()
    avatar_url_template = Field()
    badge = Field()
    employments = Field()
    follower_count = Field()
    gender = Field()
    headline = Field()
    id = Field()
    name = Field()
    type = Field()
    url = Field()
    url_token = Field()
    user_type = Field()
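The item's declared fields act as a whitelist: `parseUser` copies a key from the API response only if `UserItem` declares it, and silently drops the rest. A minimal sketch of that filtering with plain dictionaries follows. `ITEM_FIELDS` is an assumed subset of the real field list, and both `pick_item_fields` and the sample response are hypothetical.

```python
# Assumed subset of the fields declared on UserItem.
ITEM_FIELDS = ("name", "url_token", "follower_count", "gender")


def pick_item_fields(result, fields=ITEM_FIELDS):
    """Keep only the keys that the item declares, dropping everything else."""
    return {field: result[field] for field in fields if field in result}


# Hypothetical API response with one undeclared key ("is_vip")
# and two declared keys missing ("follower_count", "gender").
response = {"name": "Example User", "url_token": "example-user", "is_vip": True}
print(pick_item_fields(response))
```

Declared fields that are absent from the response are simply left unset, which matches the `if field in result.keys()` check in the spider.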

03 Defining the pipelines.py file

This file stores the crawled data in MongoDB.

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


# Store items in MongoDB
class MongoPipeline(object):
    collection_name = 'users'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Deduplicate: upsert keyed on url_token, so each user is stored once.
        # The original called Collection.update(filter, doc, True), which was
        # removed in PyMongo 4; replace_one with upsert=True is the equivalent.
        self.db[self.collection_name].replace_one({'url_token': item['url_token']}, dict(item), upsert=True)
        return item
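The deduplication in `process_item` relies on upsert semantics: a record whose `url_token` is new gets inserted, and a record whose `url_token` already exists overwrites the stored copy instead of creating a second one. The sketch below imitates that behaviour with a dictionary so it can be run without a database. `InMemoryUserStore` is a hypothetical stand-in and not a PyMongo class.

```python
class InMemoryUserStore:
    """Hypothetical stand-in for the 'users' collection, keyed by url_token."""

    def __init__(self):
        self._docs = {}

    def upsert(self, item):
        """Insert the item, or overwrite the stored copy with the same url_token.

        Returns True when the item was new and False when it replaced a copy.
        """
        is_new = item["url_token"] not in self._docs
        self._docs[item["url_token"]] = dict(item)
        return is_new

    def get(self, url_token):
        return self._docs.get(url_token)

    def __len__(self):
        return len(self._docs)


store = InMemoryUserStore()
store.upsert({"url_token": "example-user", "follower_count": 10})
# The same user reached again through another follow list, with fresher data
store.upsert({"url_token": "example-user", "follower_count": 12})
print(len(store), store.get("example-user"))
```

Because the same user is reachable through many follow lists, this keying is what keeps the collection at one document per user.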

04 Defining the settings.py file

This file enables the MongoDB pipeline, defines the request headers, and turns off obedience to robots.txt rules.

# -*- coding: utf-8 -*-

BOT_NAME = 'zhihuuser'

SPIDER_MODULES = ['zhihuuser.spiders']

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # Whether to obey robots.txt rules, which restrict what may be crawled

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}

MONGO_URI = 'localhost'
MONGO_DATABASE = 'zhihu'

Start the crawl with scrapy crawl zhihu. Part of the output produced during the crawl is shown in Figure 8-4.

▲ Figure 8-4: Part of the output produced during the crawl

Part of the data stored in MongoDB is shown in Figure 8-5.

▲ Figure 8-5: Part of the data stored in MongoDB

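About the author: Zhao Guosheng is a professor at Harbin Normal University, holds a doctorate in engineering, supervises master's students, and is recognized as a special talent in network security technology in Heilongjiang Province. His teaching and research focus on trusted networks, intrusion tolerance, cognitive computing, and Internet of Things security.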

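This article is excerpted from the book 《Python網絡爬蟲技術與實戰》 (Python Web Crawler Technology and Practice) and is published with the publisher's authorization.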

Further reading: 《Python網絡爬蟲技術與實戰》 (Python Web Crawler Technology and Practice)

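Recommendation: This book is a systematic, comprehensive, hands-on guide to Python web crawlers. Drawing on extensive engineering experience and closely tied to demonstration cases, it covers almost all of the core technologies involved in web crawling. Its chapters dissect the concepts and principles behind the algorithms step by step and provide a large amount of concise code, helping you start from zero and implement deep learning algorithms in code.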
