
These Days, Learning Web Scraping Means Knowing Some Scrapy

Published: 2024/5/7


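Hello, I'm Alex 007. Why 007? Because too many people are called Alex, and on top of that my daily life runs on a "007" schedule (around the clock, seven days a week), so Alex 007 was born.

I've been practicing driving these past few days, so I can only write blog posts during the breaks. Sadly, I signed up last year and still haven't gotten my license. Of course, driving is only a side skill; the main skill is still coding, and only continuous learning keeps you from being left behind.

I've recently been learning the Scrapy framework for web crawling. I played with crawlers in GoLang before but never went very deep, so this time I'm going to learn it properly with Python.
All the code from my crawler studies is on GitHub: https://github.com/koking0/Spider
My knowledge is limited; if you find any mistakes, please point them out.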

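Contents

I. A First Look at Scrapy
- 1. Installing Scrapy
- 2. The first Scrapy project
II. Basic Operations
- 1. Persistent storage
  - (1) Persistent storage via terminal commands
  - (2) Persistent storage via pipelines
- 2. Crawling a whole site
  - Passing data between requests
- 3. Downloading images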

I. A First Look at Scrapy

Let's start with the definition from the official site:

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

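If you have never studied crawlers or frameworks, that definition may leave you confused. If you don't know what a crawler is, take two minutes to read the article "Web Robots: Crawlers" first. If you don't know what a framework is, I'll explain briefly; experts can scroll down 200px.

1. What is a framework? As the name suggests, a framework is a project template that is highly general and bundles many features, so it can be applied to different project requirements. In other words, a framework is a wheel someone else has already built, a half-finished project: we only need to take it and fill in our own business logic.
2. How do you learn a framework? For beginners, learning a new framework only requires understanding what it is for and how to use each of its features. Put plainly, being able to use it is enough; the underlying implementation and principles can be explored gradually as you advance.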
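The "fill in the blanks" idea can be sketched in a few lines of plain Python. This is only an illustration: MiniFramework and my_business_logic are hypothetical names invented here and have nothing to do with Scrapy's real classes. The point is that the framework owns the loop and calls the user's code.

```python
# A toy "framework": it owns the control flow and calls back into user code.
# MiniFramework and its handler argument are hypothetical, for illustration only.
class MiniFramework:
    def __init__(self, handler):
        self.handler = handler  # the "blank" that the user fills in

    def run(self, inputs):
        results = []
        for value in inputs:  # the framework decides when user code is called
            results.append(self.handler(value))
        return results


def my_business_logic(text):
    # The only part the framework user has to write.
    return text.strip().upper()


app = MiniFramework(my_business_logic)
print(app.run([" scrapy ", "twisted"]))
```

Scrapy follows the same pattern at a larger scale: it schedules and downloads pages, and calls the spider's parse method when a response is ready.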

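Scrapy is both very well known and very powerful in the crawler world. It was built for extracting structured data from websites and integrates features such as high-performance asynchronous downloading, request queues, and persistence, with distributed crawling available through extensions such as scrapy-redis. It is a first-rate crawling tool.

1. Installing Scrapy

Windows

Three commands, copy and paste, simple and direct.

pip install twisted
pip install pywin32
pip install scrapy

If installation is too slow, you can use the Alibaba Cloud mirror.

pip install -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com twisted
pip install -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com pywin32
pip install -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com scrapy

A quick explanation of the two packages besides scrapy:

twisted
Twisted is an event-driven networking engine written in Python. It provides ways to start operations that would otherwise block without blocking the rest of the code, which makes it well suited to asynchronous programs.

For a comparison of Twisted's asynchronous model with multithreading, see: Scrapy and Twisted

pywin32
pywin32 is a library that lets Python developers call the Windows API conveniently.

In other words, twisted and pywin32 work together to make Scrapy's asynchronous crawling silky smooth. Ha, just kidding: in my setup (Scrapy 2.0.1 on Windows) Scrapy would not install without these two libraries, and none of the three steps may report an error.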
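To make "waiting without blocking" concrete, here is a minimal sketch. It uses asyncio from the standard library as a stand-in, not Twisted itself, and the function names and the 0.1 second delay are assumptions chosen for the demo. Three simulated downloads wait at the same time on a single thread.

```python
import asyncio
import time


async def fake_download(url, delay):
    # asyncio.sleep stands in for network latency; it hands control back
    # to the event loop instead of blocking the whole program.
    await asyncio.sleep(delay)
    return f"done: {url}"


async def crawl(urls, delay=0.1):
    # All simulated downloads wait concurrently on one thread.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


start = time.perf_counter()
results = asyncio.run(crawl(["page-1", "page-2", "page-3"]))
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f}s (the three 0.1s waits overlap)")
```

Run sequentially, the three waits would add up to about 0.3 seconds; with an event loop they overlap, which is the property Scrapy relies on when it downloads many pages at once.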

Linux

pip install scrapy

macOS

pip install scrapy

After installation you can check the result: type scrapy in the terminal, and if it runs without errors the installation succeeded:

(venv) G:\Python\Spider>scrapy
Scrapy 2.0.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
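The same check can be done from Python. The sketch below assumes Python 3.8 or later (for importlib.metadata); installed_version is a hypothetical helper name, and the three package names are simply the ones installed above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("scrapy", "twisted", "pywin32"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

On Linux or macOS, pywin32 is expected to show up as not installed, since it is a Windows-only package.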

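2. The first Scrapy project

Using Scrapy can be broken down into roughly five steps. This is not about writing the code itself; these are the five steps from creating a project to running it:

Create the project

scrapy startproject firstScrapy

Enter the project directory

cd firstScrapy

Create the spider file

scrapy genspider baiDu www.baidu.com

Write the code
Here is a simple spider that scrapes the top menu of the Baidu home page.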

# baiDu.py
# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    # Spider name, used by "scrapy crawl"
    name = 'baiDu'
    # Allowed domains; URLs outside these domains are not crawled
    allowed_domains = ['www.baidu.com']
    # URLs the crawl starts from
    start_urls = ['http://www.baidu.com/']

    # The response for each start URL is passed in as `response`;
    # the return value must be an iterable or None
    def parse(self, response):
        # Response body as a string
        # print(response.text)
        # Response body as bytes
        # print(response.body)
        # xpath is a method of response, so XPath expressions can be used directly
        aList = response.xpath('//*[@id="u1"]/a')
        for item in aList:
            # extract_first() returns None instead of raising IndexError
            # when a link has no text or no href
            name = item.xpath('.//text()').extract_first()
            url = item.xpath('./@href').extract_first()
            print(name, url)

# settings.py
from fake_useragent import UserAgent

# ......

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set a global spoofed User-Agent
user_agent = UserAgent()
USER_AGENT = user_agent.random

# Obey robots.txt rules
# Ignore the robots protocol
ROBOTSTXT_OBEY = False

Run the project

scrapy crawl baiDu

If your code logic is correct, you will see output like the following:

(venv) G:\Python\Spider\6.scrapy框架\firstScrapy>scrapy crawl baiDu
2020-04-09 21:48:46 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: firstScrapy)
2020-04-09 21:48:46 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Windows-10-10.0.18362-SP0
2020-04-09 21:48:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-09 21:48:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'firstScrapy',
 'EDITOR': '~/AppData/Roaming/GitPad/GitPad.exe',
 'NEWSPIDER_MODULE': 'firstScrapy.spiders',
 'SPIDER_MODULES': ['firstScrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'}
2020-04-09 21:48:46 [scrapy.extensions.telnet] INFO: Telnet Password: 2ca333daee7184fb
2020-04-09 21:48:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-04-09 21:48:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-09 21:48:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-09 21:48:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-09 21:48:47 [scrapy.core.engine] INFO: Spider opened
2020-04-09 21:48:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-09 21:48:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-09 21:48:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.baidu.com/> from <GET http://www.baidu.com/>
2020-04-09 21:48:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/> (referer: None)
抗擊肺炎 https://voice.baidu.com/act/newpneumonia/newpneumonia/?from=osari_pc_1
新聞 http://news.baidu.com
hao123 https://www.hao123.com
地圖 http://map.baidu.com
視頻 http://v.baidu.com
貼吧 http://tieba.baidu.com
學術 http://xueshu.baidu.com
登錄 https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
設置 http://www.baidu.com/gaoji/preferences.html
更多產品 http://www.baidu.com/more/
2020-04-09 21:48:47 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-09 21:48:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 732,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 53325,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 0.491685,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 4, 9, 13, 48, 47, 901362),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 4, 9, 13, 48, 47, 409677)}
2020-04-09 21:48:47 [scrapy.core.engine] INFO: Spider closed (finished)
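The XPath logic in parse can be tried without running a crawl. The sketch below uses xml.etree.ElementTree from the standard library as a stand-in for response.xpath; ElementTree supports only a subset of XPath, and the SAMPLE markup is a simplified, made-up version of the menu, not Baidu's real page.

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for the menu markup (made up for this demo).
SAMPLE = """
<html><body>
  <div id="u1">
    <a href="http://news.baidu.com">News</a>
    <a href="http://map.baidu.com">Map</a>
  </div>
</body></html>
"""


def extract_menu(markup):
    root = ET.fromstring(markup)
    # Mirrors the spider's '//*[@id="u1"]/a' expression in ElementTree's dialect.
    links = root.findall(".//*[@id='u1']/a")
    # itertext() plays the role of './/text()', get('href') the role of './@href'.
    return [("".join(a.itertext()).strip(), a.get("href")) for a in links]


for name, url in extract_menu(SAMPLE):
    print(name, url)
```

For real pages, which are rarely well-formed XML, the interactive scrapy shell command listed in the help output is the more practical way to test expressions.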

Scrapy printed a great deal of output. Our printed results sit in the middle; everything else is log information that Scrapy generates automatically. If it bothers you, the following setting in settings.py keeps only error messages:

LOG_LEVEL = 'ERROR'
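The effect of a log level can be seen with Python's own logging module, which Scrapy's logging is built on. This is a standalone sketch: the logger name demo_spider and the messages are invented for the demo.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo_spider")  # hypothetical logger name
logger.propagate = False                   # keep the demo output self-contained
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)             # same idea as LOG_LEVEL = 'ERROR'

logger.info("Spider opened")               # dropped: below ERROR
logger.debug("Crawled (200) <GET ...>")    # dropped: below ERROR
logger.error("Request failed")             # kept

print(stream.getvalue(), end="")
```

Only the error line survives, which is why the INFO and DEBUG lines disappear from the crawl output once the level is raised.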

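II. Basic Operations

Next, let's look at some basic operations in the Scrapy framework, such as persisting scraped data, crawling a whole site, and downloading images.

1. Persistent storage

Scraped data is only really yours once it is saved on your local machine; otherwise it lives in memory and is gone when the program ends.

(1) Persistent storage via terminal commands

In the warm-up above we saw the results printed on the console. Persistent storage via terminal commands does not redirect that console output; instead, Scrapy takes the data returned by the spider and exports it to a local file.

To use it, the spider's parse method must return (or yield) the scraped data, usually a dict or a list of dicts.

Let's upgrade the parse method of the spider that scrapes Baidu's top menu bar: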

def parse(self, response):
    # xpath is a method of response, so XPath expressions can be used directly
    aList = response.xpath('//*[@id="u1"]/a')
    data = {}
    for item in aList:
        name = item.xpath('.//text()')[0].extract()
        url = item.xpath('./@href')[0].extract()
        data[name] = url
    return data

Then add the export-encoding setting to settings.py so that the output is written as UTF-8:

FEED_EXPORT_ENCODING = 'UTF8'
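Why the encoding setting matters for JSON can be shown with the standard json module. As far as I know, when the setting is left unset Scrapy's JSON export writes non-ASCII characters as \uXXXX escapes; the sketch below reproduces that difference with a made-up item and is not Scrapy's own exporter code.

```python
import json

items = [{"name": "新闻", "url": "http://news.baidu.com"}]

escaped = json.dumps(items)                       # ASCII-only, uses \uXXXX escapes
readable = json.dumps(items, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Writing the readable form safely requires an explicit file encoding,
# e.g. open("baidu.json", "w", encoding="utf-8") (file name is illustrative).
```

Both forms decode to the same data; the second is simply the one a person can read when opening the file.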

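Next, start the project with the following command:

scrapy crawl baiDu -o baidu.json

This persists the scraped results to the baidu.json file.

Other export formats work the same way, for example:

scrapy crawl spiderName -o xxxx.xml
scrapy crawl spiderName -o xxxx.csv

Note that the file extension must be one of the feed formats Scrapy supports out of the box, such as json, jsonlines, csv, or xml; a .txt extension is not among them.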

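(2) Persistent storage via pipelines

Saving files from the terminal seems less common on Windows, although it is routine on Linux.

Scrapy integrates efficient, convenient persistence, and when the project is created it also generates the relevant files for us:

1. items.py: the data structure template, which defines the fields of the stored data
2. pipelines.py: the pipeline file, which receives the data (items) and persists it

The pipeline-based persistence flow:

Wrap the data scraped by the spider file in an item object

items.py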

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # stores the menu name
    url = scrapy.Field()   # stores the menu URL

Use yield to submit the item object to the pipeline for persistent storage

baiDu.py

from firstScrapy.items import FirstscrapyItem

def parse(self, response):
    # xpath is a method of response, so XPath expressions can be used directly
    aList = response.xpath('//*[@id="u1"]/a')
    for data in aList:
        # Wrap the parsed data in an item object
        item = FirstscrapyItem()
        item["name"] = data.xpath('.//text()')[0].extract()
        item["url"] = data.xpath('./@href')[0].extract()
        yield item

The process_item method in the pipeline file receives and handles the item objects submitted by the spider file

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class FirstscrapyPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        """Runs once when the spider starts"""
        print("Spider started!")
        # Explicit encoding so the Chinese menu names are written correctly on any platform
        self.fp = open("data.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(f'{item["name"]}:{item["url"]}\n')
        return item  # Note: return item is required so later pipelines receive it

    def close_spider(self, spider):
        """Runs once when the spider finishes"""
        self.fp.close()
        print("Spider finished!")

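Enable the pipeline in the settings.py configuration file

Just uncomment the ITEM_PIPELINES lines. The trailing 300 is the priority: the smaller the value, the higher the priority, so pipelines with lower numbers run first:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'firstScrapy.pipelines.FirstscrapyPipeline': 300,
}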
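The ordering rule can be illustrated by sorting a dict of the same shape. This is a sketch of the rule only, not Scrapy's internal code; the CleanupPipeline and ArchivePipeline entries are hypothetical and exist only to give the sort something to order.

```python
# Same shape as ITEM_PIPELINES; two of the three entries are hypothetical.
PIPELINES = {
    "firstScrapy.pipelines.ArchivePipeline": 800,      # hypothetical
    "firstScrapy.pipelines.FirstscrapyPipeline": 300,
    "firstScrapy.pipelines.CleanupPipeline": 100,      # hypothetical
}


def execution_order(pipelines):
    """Return pipeline paths sorted from lowest number (runs first) to highest."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for position, path in enumerate(execution_order(PIPELINES), start=1):
    print(position, path)
```

The order in which the entries are written in the dict does not matter; only the numbers do.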

With all this in place, when we run the Scrapy project again:

(venv) G:\Python\Spider\6.scrapy框架\firstScrapy>scrapy crawl baiDu
Spider started!
Spider finished!

a data.txt file is generated.
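The call order of the three pipeline hooks can be reproduced without Scrapy. In the sketch below, FilePipeline copies the shape of the pipeline class above, while run_pipeline is a hypothetical driver standing in for the framework; the two items and the temporary output path are made up.

```python
import os
import tempfile


class FilePipeline:
    """Same three hooks as the pipeline above, with no Scrapy dependency."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(f'{item["name"]}:{item["url"]}\n')
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.fp.close()


def run_pipeline(pipeline, items):
    # Hypothetical driver imitating the call order:
    # open once, process each item, close once.
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)


out_path = os.path.join(tempfile.mkdtemp(), "data.txt")
run_pipeline(FilePipeline(out_path), [
    {"name": "News", "url": "http://news.baidu.com"},
    {"name": "Map", "url": "http://map.baidu.com"},
])
with open(out_path, encoding="utf-8") as fp:
    print(fp.read(), end="")
```

Opening the file once in open_spider, rather than once per item, is what keeps the pipeline cheap when thousands of items pass through it.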

If you want two copies of the scraped data, one stored in a file on disk and one in a database, you need to define another pipeline class in pipelines.py that writes to the database:

import pymysql


class DataBasePipeline(object):
    def __init__(self):
        self.connect, self.cursor = None, None

    def open_spider(self, spider):
        self.connect = pymysql.Connect(host="127.0.0.1", port=3306, user="root", password="20001001", db="test", charset="utf8")
        # Create the cursor once and reuse it for every item
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            # Let the driver substitute the values; building the SQL with
            # string formatting breaks on quotes and allows SQL injection
            sql = 'INSERT INTO scrapy1 VALUES (%s, %s);'
            self.cursor.execute(sql, (item["name"], item["url"]))
            self.connect.commit()
        except Exception as e:
            print(e)
            self.connect.rollback()
        return item  # Note: return item is required so later pipelines receive it

    def close_spider(self, spider):
        # Close the cursor before the connection it belongs to
        self.cursor.close()
        self.connect.close()

Then register the class in settings.py:

ITEM_PIPELINES = {
    'firstScrapy.pipelines.FirstscrapyPipeline': 300,
    'firstScrapy.pipelines.DataBasePipeline': 301,
}

The scraped data is now stored in the database as well.
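A database pipeline can be exercised locally without a MySQL server. The sketch below swaps in sqlite3 from the standard library, so it is a stand-in and not the pymysql pipeline itself; SqlitePipeline is a hypothetical class, and note that sqlite3 uses ? placeholders where pymysql uses %s. It also shows why passing values separately from the SQL text is safer than formatting them into the string.

```python
import sqlite3


class SqlitePipeline:
    """Stand-in database pipeline; the table name scrapy1 follows the text above."""

    def open_spider(self, spider):
        self.connect = sqlite3.connect(":memory:")
        self.connect.execute("CREATE TABLE scrapy1 (name TEXT, url TEXT)")

    def process_item(self, item, spider):
        try:
            # Placeholders let the driver escape the values, so quotes in a
            # name cannot break or alter the statement.
            self.connect.execute(
                "INSERT INTO scrapy1 VALUES (?, ?)", (item["name"], item["url"])
            )
            self.connect.commit()
        except sqlite3.Error as exc:
            print(exc)
            self.connect.rollback()
        return item

    def close_spider(self, spider):
        self.connect.close()


pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": 'He said "hi"', "url": "http://example.com"}, None)
rows = pipeline.connect.execute("SELECT name, url FROM scrapy1").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

The name containing double quotes is stored intact, whereas an INSERT built with string formatting around "%s" would produce invalid SQL for the same value.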

2. Crawling a whole site

Most websites today paginate the data they display, so crawling the page behind every page number has become a common requirement. Scrapy supports this kind of whole-site crawling as well.

Taking my own CSDN blog as an example, I now want to scrape the title and abstract of every post I have written:

Elements analysis

XPath expression for each post's box: //*[@id="mainBox"]/main/div[2]/div

XPath expression for the post title: ./h4/a/text()

XPath expression for the post abstract: ./p/a/text()

Page analysis

Page 1: https://alex007.blog.csdn.net/article/list/1

Page 2: https://alex007.blog.csdn.net/article/list/2

Page 3: https://alex007.blog.csdn.net/article/list/3

……

Page n: https://alex007.blog.csdn.net/article/list/n
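The page pattern above reduces to a URL template with one number in it. A minimal sketch, where page_urls is a hypothetical helper name:

```python
PAGE_URL = "https://alex007.blog.csdn.net/article/list/%d"


def page_urls(last_page):
    """Build the list-page URLs for pages 1..last_page (inclusive)."""
    return [PAGE_URL % number for number in range(1, last_page + 1)]


for url in page_urls(3):
    print(url)
```

The same template, filled in with an incrementing counter, is all a spider needs in order to walk through the pages.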

In Scrapy you can use scrapy.Request to issue a request for each page manually.

import scrapy
from myBlog.items import MyblogItem


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    start_urls = ['https://alex007.blog.csdn.net/']
    # Wait roughly 1 second between requests; this replaces time.sleep(1),
    # which would block Scrapy's event loop
    custom_settings = {'DOWNLOAD_DELAY': 1}
    pageNumber = 1
    pageUrl = 'https://alex007.blog.csdn.net/article/list/%d'

    def parse(self, response):
        print(f"Crawling page {self.pageNumber}, url={self.pageUrl % self.pageNumber}.")
        divList = response.xpath('//*[@id="mainBox"]/main/div[2]/div')
        for div in divList:
            item = MyblogItem()
            item["name"] = "".join(div.xpath('.//h4/a/text()').extract()).strip()
            item["content"] = "".join(div.xpath('.//p/a/text()').extract()).strip()
            yield item
        if self.pageNumber < 21:
            self.pageNumber += 1
            url = self.pageUrl % self.pageNumber
            # Crawl recursively; the callback parameter is the function that handles the response
            yield scrapy.Request(url=url, callback=self.parse)

This way the titles and abstracts of all my blog posts are scraped.
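The "recursive" flow, in which a callback yields both items and the request for the next page, can be imitated with a small queue. Everything in this sketch is a stand-in: FAKE_PAGES is made-up data, the ("request", n) tuple plays the role of scrapy.Request, and run is a hypothetical mini engine, not how Scrapy is actually implemented.

```python
from collections import deque

# Fake site: page number -> titles on that page (made-up data).
FAKE_PAGES = {1: ["post A", "post B"], 2: ["post C"], 3: ["post D"]}
LAST_PAGE = 3


def parse(page):
    """Yield items for this page, then a 'request' for the next page."""
    for title in FAKE_PAGES[page]:
        yield {"name": title}
    if page < LAST_PAGE:
        yield ("request", page + 1)  # stands in for scrapy.Request(...)


def run(start_page):
    # Hypothetical mini engine: items are collected, requests are queued.
    queue, items = deque([start_page]), []
    while queue:
        for result in parse(queue.popleft()):
            if isinstance(result, tuple):
                queue.append(result[1])
            else:
                items.append(result)
    return items


print(run(1))
```

Nothing here is recursive in the call-stack sense: the callback merely hands back the next request, and the engine decides when to run it. That is why a Scrapy spider can follow many pages without hitting Python's recursion limit.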

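Passing data between requests

Suppose the requirement is upgraded: for each post I no longer want the abstract but the full text of the article. In that case we have to take the URL from the first-level (list) page and visit the second-level (article) page, which is where passing data between requests comes in.

In other words, when the data we want to scrape does not all live on the same page, we need to pass data along with the request.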

import scrapy
from myBlog.items import MyblogItem


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    start_urls = ['https://alex007.blog.csdn.net/']
    # Wait roughly 2 seconds between requests; this replaces time.sleep(2),
    # which would block Scrapy's event loop
    custom_settings = {'DOWNLOAD_DELAY': 2}
    pageNumber = 1
    pageUrl = 'https://alex007.blog.csdn.net/article/list/%d'

    def parse(self, response):
        print(f"Crawling page {self.pageNumber}, url={self.pageUrl % self.pageNumber}.")
        divList = response.xpath('//*[@id="mainBox"]/main/div[2]/div')
        for div in divList:
            item = MyblogItem()
            item["name"] = "".join(div.xpath('.//h4/a/text()').extract()).strip()
            contentUrl = div.xpath('.//h4/a/@href').extract_first()
            if not contentUrl:
                continue  # skip boxes that contain no article link
            print(f"Crawling article {item['name']}, url={contentUrl}.")
            # meta carries the half-filled item to the callback for the article page
            yield scrapy.Request(url=contentUrl, callback=self.parseContent, meta={'item': item})
        if self.pageNumber < 2:
            self.pageNumber += 1
            url = self.pageUrl % self.pageNumber
            # Crawl recursively; the callback parameter is the function that handles the response
            yield scrapy.Request(url=url, callback=self.parse)

    def parseContent(self, response):
        item = response.meta["item"]
        item["content"] = "".join(response.xpath('//*[@id="content_views"]//text()').extract())
        yield item

We pass the item to the function that handles the second-level page through meta={'item': item}, and that function simply yields the item to hand the result to the pipeline.
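The way meta carries a half-filled item from one callback to the next can be sketched without Scrapy. FakeRequest is a hypothetical stand-in for scrapy.Request, and the page data is made up; in real Scrapy the second callback receives a response whose meta comes from the request, whereas this sketch passes the request object itself for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

# Made-up pages: one list page linking to two detail pages.
LIST_PAGE = [("post A", "/a"), ("post B", "/b")]
DETAIL_PAGES = {"/a": "full text of A", "/b": "full text of B"}


@dataclass
class FakeRequest:
    """Minimal stand-in for scrapy.Request: a URL, a callback, and meta."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


def parse(_response=None):
    for name, url in LIST_PAGE:
        item = {"name": name}  # half-filled on the list page
        yield FakeRequest(url, parse_content, meta={"item": item})


def parse_content(request):
    item = request.meta["item"]                  # the same dict travels along
    item["content"] = DETAIL_PAGES[request.url]  # completed on the detail page
    yield item


items = [item for req in parse() for item in req.callback(req)]
print(items)
```

Scrapy 1.7 and later also offer a cb_kwargs argument on Request for the same purpose, which passes the values to the callback as keyword arguments instead of through meta.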

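3. Downloading images

Downloading images is another basic crawling need, so Scrapy naturally ships a pipeline class dedicated to requesting images and persisting them: ImagesPipeline. It requires the Pillow library to be installed.

1. Parse the image URLs in the spider file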

# -*- coding: utf-8 -*-
import scrapy
from beauty.items import BeautyItem


class ImagesSpider(scrapy.Spider):
    name = 'images'
    start_urls = ['http://wuming3175.lofter.com/']
    pageNumber = 1
    pageUrl = "http://wuming3175.lofter.com/?page=%d"

    def parse(self, response):
        divList = response.xpath('/html/body/div[3]/div')
        for div in divList:
            item = BeautyItem()
            imageSrc = div.xpath('.//div[2]/div[1]/div[1]/a/img/@src').extract_first()
            if imageSrc:
                # Drop the query string to get the plain image URL
                item["image_urls"] = imageSrc.split("?")[0]
                yield item
        if self.pageNumber < 22:
            self.pageNumber += 1
            url = self.pageUrl % self.pageNumber
            yield scrapy.Request(url=url, callback=self.parse)

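2. Use the ImagesPipeline class

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BeautyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        """Issue the download request for the item's image"""
        print(f'Start downloading {item["image_urls"]}')
        yield scrapy.Request(url=item["image_urls"])

    # The keyword-only item argument is accepted for newer Scrapy versions
    def file_path(self, request, response=None, info=None, *, item=None):
        """Return the file's storage path, relative to IMAGES_STORE"""
        return request.url.split('/')[-1]

    def item_completed(self, results, item, info):
        return item  # this return value is passed to the next pipeline class to run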

3. Configure the pipeline and the image storage path in settings.py

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
# ......
IMAGES_STORE = "./images"
ITEM_PIPELINES = {
    'beauty.pipelines.BeautyImagesPipeline': 300,
}
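Two small string operations carry the image example: stripping the query string from the image URL, and turning the URL into a file name. The sketch below shows both with the standard library, plus a hashed alternative. The sample URL and the three helper names are made up; as far as I know, Scrapy's default ImagesPipeline also names files by a SHA-1 hash of the URL, which is the idea the last helper imitates.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, like imageSrc.split('?')[0]."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def name_from_url(url):
    """Use the last path segment as the file name."""
    return PurePosixPath(urlsplit(url).path).name


def hashed_name(url):
    """Alternative: hash the URL so equal basenames on different paths cannot collide."""
    suffix = PurePosixPath(urlsplit(url).path).suffix
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + suffix


src = "http://img.example.com/photos/2020/cat.jpg?imageView&thumbnail=500x0"
clean = strip_query(src)
print(clean)
print(name_from_url(clean))
print(hashed_name(clean))
```

Using the last path segment keeps file names readable, but two images called cat.jpg in different folders would overwrite each other; the hashed form trades readability for uniqueness.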

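That's it for this introduction to Scrapy. Any more and it would be tiring even to read; if you'd like to see more, feel free to like, bookmark, and follow.

Final words:
My knowledge is limited; if you find any mistakes, please point them out.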
