Python 3 Web Crawlers (8): Installing and Using the Scrapy Framework

Published: 2024/4/11 | python | 33 | 豆豆


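I. Installing Scrapy on Windows

① Install Python 3.6. Open the official site in a browser and download the build that matches your operating system. In the installer, "Customize installation" lets you choose the install path; remember to tick pip so that it is installed as well.

② Install pywin32. URL: https://sourceforge.net/projects/pywin32/files/pywin32/
Download the .exe that matches your Python version and run it.

③ Install lxml: pip3 install lxml

④ Install pyOpenSSL: pip3 install pyOpenSSL

⑤ Install Twisted. URL: https://www.lfd.uci.edu/~gohlke/pythonlibs/
Find and download the wheel that matches your operating system.
Then open a command prompt, change to the directory that contains the downloaded file, and run pip3 install Twisted-17.9.0-cp36-cp36m-win_amd64.whl. Here 17.9.0 is the Twisted version and 36 stands for the Python version (3.6).

⑥ Install Scrapy. Change to the Python directory and run pip3 install scrapy
After it installs, reopen the command prompt and type scrapy. If Scrapy prints its version and the list of available commands, the installation succeeded.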
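As an optional check after step ⑥, the following sketch uses only the standard library to report which of the packages from the steps above are installed, and at which version. The helper name installed_versions is made up for this illustration, pywin32 is left out of the list only to keep it short, and importlib.metadata needs Python 3.8 or newer, which is later than the 3.6 used in this post.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the steps above.
PACKAGES = ["lxml", "pyOpenSSL", "Twisted", "Scrapy"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:10} {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED points back to the step that installs it.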

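II. Why use Scrapy

1. Scrapy is built on the asynchronous framework Twisted; high concurrency and performance are its biggest strengths.

2. Scrapy is easy to extend and ships with many built-in features.

3. Scrapy's built-in CSS and XPath selectors are very convenient and considerably faster than BeautifulSoup.

4. URL deduplication is built in, so the same page is not crawled more than once. (By default Scrapy keeps a set of request fingerprints rather than a Bloom filter; a Bloom filter can be plugged in as a custom duplicate filter.)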
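Point 4 can be illustrated with a small standard-library sketch. Scrapy's default duplicate filter works on request fingerprints kept in a set; the code below mimics that idea and is not Scrapy's actual implementation. The class name SeenFilter and the canonicalization rule (sort the query parameters, drop the fragment) are assumptions made for this example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash the method plus a canonical form of the URL (query sorted, fragment dropped)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1(f"{method.upper()} {canonical}".encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember fingerprints in a set and report repeats."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, method, url):
        """Return True if an equivalent request was seen before, else record it."""
        fp = fingerprint(method, url)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

Two URLs that differ only in the order of their query parameters produce the same fingerprint here, so the second one is reported as already seen.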

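It has drawbacks too:

1. No built-in support for distributed deployment.

2. No native support for pages rendered by JavaScript; you have to work out the underlying JS requests by hand.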

III. Basic Scrapy usage

Create a project: scrapy startproject quote

Create a spider file: scrapy genspider quotes quotes.toscrape.com

Run the spider: scrapy crawl quotes

You can also open the Scrapy project in PyCharm to run and debug it. This needs an entry script such as main.py; running or debugging main.py in PyCharm then runs or debugs the whole project.

from scrapy.cmdline import execute
import os
import sys

# Add the project's absolute path to sys.path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Call Scrapy's built-in execute(), which is equivalent to running
# "scrapy crawl zhihu --nolog" on the command line.
# The third element, 'zhihu', is the name of the spider to run.
execute(['scrapy', 'crawl', 'zhihu', '--nolog'])

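(In the code above, zhihu is the name of the spider, that is, the first argument to scrapy genspider, which is quotes in scrapy genspider quotes quotes.toscrape.com. I was lazy and left it as zhihu; for the quotes project, use quotes instead.)

The scraped content can be saved to a file such as JSON or XML. The command is:

scrapy crawl quotes -o quotes.json

A simple example spider (the site crawled is quotes.toscrape.com):

https://pan.baidu.com/s/1N3b5NXRJWZZuVV7mMD2j7A

Extraction code: bhhf

The code is not pasted here because there is quite a lot of it.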

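IV. Scraping Zhihu user information with Scrapy

Everyone knows the well-known site Zhihu. How many users does it really have, how many of them are active, and is it possible to scrape information about all of them? We can try with Scrapy. First, where do we start? Zhihu has a follower mechanism: users have followers of their own and also follow other users (figure omitted).

The more active a user is, the more followers they have. Pick any reasonably active user and scrape all of their followers; say there are 2,000. Each of those 2,000 users has followers too; if each has 20, that already gives 40,000 users. Continue recursively and, in theory, every user can be reached.

At this point you might object: if two users follow each other, or A follows B, B follows C, C follows D and D follows A, won't this loop forever? No need to worry. Scrapy has a parameter that controls deduplication (dont_filter on Request; leaving it at False keeps the duplicate filter on), so the same page is not fetched a second time.

Then you might think of another case: A follows both B and C, so A shows up when we collect B's followers and again when we collect C's. These are different pages, so the same user is still collected more than once. That is fine; we remove the duplicates when writing to the database.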
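The reasoning above can be tried on a toy graph. The sketch below is a plain breadth-first traversal and does not use Scrapy; the follower graph is invented, and a single seen set stands in for both safeguards described above (the duplicate filter for cycles and the database-side deduplication for users who appear in several lists).

```python
from collections import deque

# Hypothetical follower graph: user -> list of that user's followers.
# It contains a cycle (A -> B -> C -> D -> A), and E follows both B and C.
FOLLOWERS = {
    "A": ["B"],
    "B": ["C", "E"],
    "C": ["D", "E"],
    "D": ["A"],
    "E": [],
}


def crawl(start, followers):
    """Visit every user reachable from start exactly once, breadth first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        for follower in followers.get(user, []):
            if follower not in seen:  # skip users that are already scheduled
                seen.add(follower)
                queue.append(follower)
    return order


if __name__ == "__main__":
    print(crawl("A", FOLLOWERS))  # ['A', 'B', 'C', 'E', 'D']
```

The traversal ends even though the graph has a cycle, and E is visited only once despite appearing in two follower lists.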

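Let's look at how the follower list is fetched.

Press F12 to open the developer tools and click the second page of the follower list. The list is fetched with an XHR request, that is, loaded via AJAX. It is a GET request to: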

https://www.zhihu.com/api/v4/members/excited-vczh/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=60&limit=20

After a quick look, we only need to care about:

excited-vczh: the current user

offset=60&limit=20: the start position and the number of records to fetch
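A short sketch of how these parts fit together, using only the standard library. The URL template and the include string are taken from the captured request; the helper names followers_url, page_urls and describe are made up for this illustration.

```python
from urllib.parse import parse_qs, unquote, urlsplit

API = ("https://www.zhihu.com/api/v4/members/{user}/followers"
       "?{include}&offset={offset}&limit={limit}")
INCLUDE = ("include=data[*].answer_count,articles_count,gender,follower_count,"
           "is_followed,is_following,badge[?(type=best_answerer)].topics")

# The request captured in the developer tools (percent-encoded).
CAPTURED = ("https://www.zhihu.com/api/v4/members/excited-vczh/followers"
            "?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender"
            "%2Cfollower_count%2Cis_followed%2Cis_following"
            "%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=60&limit=20")


def followers_url(user, offset=0, limit=20):
    """Build the follower-list URL for one page."""
    return API.format(user=user, include=INCLUDE, offset=offset, limit=limit)


def page_urls(user, total, limit=20):
    """URLs needed to walk `total` followers in pages of `limit`."""
    return [followers_url(user, offset, limit) for offset in range(0, total, limit)]


def describe(url):
    """Pull the user token, offset and limit back out of a followers URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    user = parts.path.split("/")[-2]
    return user, int(query["offset"][0]), int(query["limit"][0])


if __name__ == "__main__":
    print(describe(CAPTURED))  # ('excited-vczh', 60, 20)
    print(unquote(CAPTURED) == followers_url("excited-vczh", 60, 20))  # True
```

offset=60 with limit=20 corresponds to the fourth page of 20 followers; stepping offset by limit walks the whole list.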

The scraped data looks roughly like this (figure omitted).

The code is as follows:

zhihu.py

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    start_user = 'excited-vczh'
    user_url = ''
    user_query = ''
    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?{include}&offset={offset}&limit={limit}'
    followers_query = 'include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    followings_url = ''
    followings_query = ''

    def start_requests(self):
        # Start from the first page of the start user's followers
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20),
                      callback=self.parse_followers, dont_filter=False)

    # Followers of the current user
    def parse_followers(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for data in results.get('data'):
                # Copy every field declared on UserItem that is present in the response
                item = UserItem()
                for field in item.fields:
                    if field in data.keys():
                        item[field] = data.get(field)
                yield item
                print('Scraped user:', item.get('name'))
                # Recurse: request the first page of this follower's own followers
                yield Request(self.followers_url.format(user=data.get('url_token'), include=self.followers_query, offset=0, limit=20),
                              callback=self.parse_followers, dont_filter=False)
        if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
            # The "next" link lacks the api/v4/ prefix, so insert it after the host
            next_page = results.get('paging').get('next')
            pos = next_page.find('www.zhihu.com/') + len('www.zhihu.com/')
            next_page = next_page[:pos] + 'api/v4/' + next_page[pos:]
            yield Request(next_page, self.parse_followers, dont_filter=False)
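The paging step in parse_followers can be tried without Scrapy or network access. The sketch below is a standalone illustration, not part of the project: the response body is invented (only the key names follow the spider code), and the helper names fix_next_url and parse_page are hypothetical. Unlike the spider, it inserts api/v4/ only when the link does not already contain it.

```python
import json

# Hypothetical, heavily trimmed response body; the key names follow the
# spider code, the values are invented for illustration.
SAMPLE = json.dumps({
    "paging": {
        "is_end": False,
        "next": "https://www.zhihu.com/members/excited-vczh/followers?offset=20&limit=20",
    },
    "data": [
        {"name": "user one", "url_token": "user-one", "follower_count": 3},
        {"name": "user two", "url_token": "user-two", "follower_count": 0},
    ],
})

HOST = "www.zhihu.com/"


def fix_next_url(url):
    """Insert 'api/v4/' right after the host unless it is already there."""
    if HOST not in url:
        return url
    pos = url.find(HOST) + len(HOST)
    if url.startswith("api/v4/", pos):
        return url
    return url[:pos] + "api/v4/" + url[pos:]


def parse_page(text):
    """Return (url_tokens on this page, URL of the next page or None)."""
    results = json.loads(text)
    tokens = [user["url_token"] for user in results.get("data", []) if "url_token" in user]
    paging = results.get("paging", {})
    next_url = None
    if paging.get("is_end") is False and paging.get("next"):
        next_url = fix_next_url(paging["next"])
    return tokens, next_url
```

Each returned url_token would become the user of a new follower request, and the returned URL, when not None, is the next page for the current user.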

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


class UserItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = Field()
    name = Field()
    avatar_url = Field()
    url_token = Field()
    headline = Field()
    is_vip = Field()
    answer_count = Field()
    articles_count = Field()
    follower_count = Field()

pipeline.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DATABASE'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db['user3'].insert_one(dict(item))
        # Upsert keyed on url_token so a user collected twice is stored once.
        # update_one replaces Collection.update, which was removed in PyMongo 4.
        self.db['user'].update_one({'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
        return item

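Download the whole project:

https://pan.baidu.com/s/1dWaIJhdK-nSooxdKpj1-NQ

Extraction code: ns5d

I ran it for about 5 hours and collected about 160,000 users, all stored in MongoDB. Loading those 160,000 records took 56 seconds and used 1.6 GB of memory. Zhihu undoubtedly has at least tens of millions of users; this run was only a simple test and learning exercise. Later I will look at distributed crawling.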
