Web Crawlers with Scrapy

Published: 2023/12/31

Q2Day81

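Performance

When writing a crawler, most of the time is spent on I/O requests. In single-process, single-threaded mode every URL request blocks until the response arrives, so the total run time grows with each request.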

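    import requests


    def fetch_async(url):
        response = requests.get(url)
        return response


    url_list = ['http://www.github.com', 'http://www.bing.com']

    # Requests run one after another; each call blocks until it returns.
    for url in url_list:
        fetch_async(url)

Further variants (code not shown):
2. Multithreaded execution
2. Multithreaded execution with callbacks
3. Multiprocess execution
3. Multiprocess execution with callbacks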
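The multithreaded variant with a callback might look like the minimal sketch below. It uses the standard library's `concurrent.futures.ThreadPoolExecutor`. The names `fetch`, `fetch_all` and `on_done` are illustrative assumptions rather than code from the original post, and `urllib.request` stands in for `requests` so the block has no third-party dependency.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Blocking download; runs inside a worker thread (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(url_list, fetcher=fetch, max_workers=5):
    """Submit every URL to a thread pool and collect results via callbacks."""
    results = {}

    def make_callback(url):
        def on_done(future):
            # Runs once the download for this URL has finished or failed.
            exc = future.exception()
            results[url] = exc if exc is not None else future.result()
        return on_done

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url in url_list:
            future = pool.submit(fetcher, url)
            future.add_done_callback(make_callback(url))
    # Leaving the "with" block waits for all worker threads to finish.
    return results


# Usage (performs real network requests):
# pages = fetch_all(['http://www.github.com', 'http://www.bing.com'])
```

Passing `fetcher` as a parameter is a design choice of this sketch: it keeps the pool logic separate from the download call, so the same function works with `requests.get` or a test stub.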

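All of the approaches above improve request throughput. The drawback of multithreading and multiprocessing is that threads and processes sit idle while blocked on I/O, so asynchronous I/O is the preferred choice: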

Examples (code not shown):
1. asyncio example 1
1. asyncio example 2
2. asyncio + aiohttp
3. asyncio + requests
4. gevent + requests
5. grequests
6. Twisted example
7. Tornado

    from twisted.internet import reactor
    from twisted.web.client import getPage
    import urllib.parse


    def one_done(arg):
        # Print the result (or the failure) and stop the event loop.
        print(arg)
        reactor.stop()


    post_data = urllib.parse.urlencode({'check_data': 'adf'})
    post_data = bytes(post_data, encoding='utf8')
    headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
    # Note: getPage is deprecated in newer Twisted releases.
    response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                       method=bytes('POST', encoding='utf8'),
                       postdata=post_data,
                       cookies={},
                       headers=headers)
    response.addBoth(one_done)

    reactor.run()
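For the asyncio entries in the list, a minimal sketch of the idea follows. It assumes `asyncio.sleep` as a stand-in for real network I/O, and the names `fetch`, `crawl` and `run` are hypothetical. The point it illustrates is that one thread can wait on several requests at once, so a slow request does not hold up a fast one.

```python
import asyncio


async def fetch(url, delay, finished):
    """Pretend to download `url`; asyncio.sleep stands in for network I/O."""
    await asyncio.sleep(delay)
    finished.append(url)          # record the order in which requests complete
    return '%s done' % url


async def crawl(jobs, finished):
    """Run all downloads concurrently; gather keeps results in input order."""
    tasks = [fetch(url, delay, finished) for url, delay in jobs]
    return await asyncio.gather(*tasks)


def run(jobs):
    """Return (results in input order, URLs in completion order)."""
    finished = []
    results = asyncio.run(crawl(jobs, finished))
    return results, finished
```

Calling `run([('http://www.github.com', 0.2), ('http://www.bing.com', 0.05)])` returns the results in input order, while the completion list shows that the second, shorter request finished first.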

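The examples above all use asynchronous I/O modules from Python's standard library or from third-party packages. They are easy to use and improve efficiency considerably. Underneath, an asynchronous I/O request comes down to [non-blocking sockets] + [I/O multiplexing]: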

    import select
    import socket
    import time


    class AsyncTimeoutException(TimeoutError):
        """Raised when a request times out."""

        def __init__(self, msg):
            self.msg = msg
            super(AsyncTimeoutException, self).__init__(msg)


    class HttpContext(object):
        """Holds the basic data of one request and its response."""

        def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
            """
            sock: client socket used for the request
            host: host name to request
            port: port to request
            method: request method
            url: URL to request
            data: data sent in the request body
            callback: function called when the request completes
            timeout: request timeout in seconds
            """
            self.sock = sock
            self.callback = callback
            self.host = host
            self.port = port
            self.method = method
            self.url = url
            self.data = data
            self.timeout = timeout

            self.__start_time = time.time()
            self.__buffer = []

        def is_timeout(self):
            """Return True if this request has already timed out."""
            current_time = time.time()
            return (self.__start_time + self.timeout) < current_time

        def fileno(self):
            """File descriptor of the request socket, used by select."""
            return self.sock.fileno()

        def write(self, data):
            """Append response content to the buffer."""
            self.__buffer.append(data)

        def finish(self, exc=None):
            """The response is complete (or failed): run the request callback."""
            if not exc:
                response = b''.join(self.__buffer)
                self.callback(self, response, exc)
            else:
                self.callback(self, None, exc)

        def send_request_data(self):
            content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
                self.method.upper(), self.url, self.host, self.data,)
            return content.encode(encoding='utf8')


    class AsyncRequest(object):
        def __init__(self):
            self.fds = []
            self.connections = []

        def add_request(self, host, port, method, url, data, callback, timeout):
            """Create a new request."""
            client = socket.socket()
            client.setblocking(False)
            try:
                client.connect((host, port))
            except BlockingIOError as e:
                # Expected for a non-blocking socket: the connection
                # request has been sent and completes later.
                pass
            req = HttpContext(client, host, port, method, url, data, callback, timeout)
            self.connections.append(req)
            self.fds.append(req)

        def check_conn_timeout(self):
            """Check every pending connection and abort those that timed out."""
            timeout_list = []
            for context in self.connections:
                if context.is_timeout():
                    timeout_list.append(context)
            for context in timeout_list:
                context.finish(AsyncTimeoutException('request timed out'))
                self.fds.remove(context)
                self.connections.remove(context)

        def running(self):
            """Event loop: detect which request sockets are ready and act on them."""
            while True:
                # Stop once every request has finished. This is checked before
                # select() because select() with empty lists fails on some platforms.
                if not self.fds:
                    return
                r, w, e = select.select(self.fds, self.connections, self.fds, 0.05)

                for context in r:
                    sock = context.sock
                    while True:
                        try:
                            data = sock.recv(8096)
                            if not data:
                                # Empty read: the server closed the connection.
                                self.fds.remove(context)
                                context.finish()
                                break
                            else:
                                context.write(data)
                        except BlockingIOError as e:
                            # No more data available right now.
                            break
                        except TimeoutError as e:
                            self.fds.remove(context)
                            if context in self.connections:
                                self.connections.remove(context)
                            context.finish(e)
                            break

                for context in w:
                    # Connected to the remote server: send the request data.
                    if context in self.fds:
                        data = context.send_request_data()
                        context.sock.sendall(data)
                        self.connections.remove(context)

                self.check_conn_timeout()


    if __name__ == '__main__':
        def callback_func(context, response, ex):
            """
            :param context: HttpContext object holding the request information
            :param response: response content
            :param ex: the exception object if an error occurred, otherwise None
            :return:
            """
            print(context, response, ex)

        obj = AsyncRequest()
        url_list = [
            {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
             'callback': callback_func},
            {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
             'callback': callback_func},
            {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
             'callback': callback_func},
        ]
        for item in url_list:
            print(item)
            obj.add_request(**item)

        obj.running()
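The same two ingredients can be tried without any network access. The sketch below is an assumption-laden miniature, not part of the original post: it uses the standard library's `selectors` module (a wrapper around `select` and its relatives) and `socket.socketpair()` as a local stand-in for remote servers, and `read_all` is a hypothetical helper name.

```python
import selectors
import socket


def read_all(payloads):
    """Send each payload over a local socket pair and read it back using
    non-blocking sockets and an I/O multiplexing loop."""
    sel = selectors.DefaultSelector()
    received = {}
    for name, payload in payloads.items():
        sender, receiver = socket.socketpair()
        receiver.setblocking(False)            # non-blocking socket
        sel.register(receiver, selectors.EVENT_READ, data=name)
        sender.sendall(payload)
        sender.close()                         # EOF tells the reader we are done
        received[name] = []

    pending = len(payloads)
    while pending:
        # Wait until at least one socket is readable (I/O multiplexing).
        for key, _events in sel.select(timeout=1):
            chunk = key.fileobj.recv(4096)
            if chunk:
                received[key.data].append(chunk)
            else:
                # Empty read: the peer closed the connection.
                sel.unregister(key.fileobj)
                key.fileobj.close()
                pending -= 1
    sel.close()
    return {name: b''.join(chunks) for name, chunks in received.items()}
```

One loop serves every socket, and no thread ever blocks on a single connection, which is the property the larger example relies on.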

Scrapy

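Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has many uses, including data mining, monitoring and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication.

Scrapy consists mainly of the following components: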

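Engine (Scrapy Engine)
Controls the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and also removes duplicate URLs.
Downloader
Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).
Spiders
Spiders do the main work: they extract the information you need, the so-called items (Item), from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them and removing unwanted information. After a page has been parsed by a spider, its items are sent to the item pipeline and processed in a defined order.
Downloader Middlewares
A layer between the Scrapy engine and the downloader that processes the requests and responses passed between them.
Spider Middlewares
A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
Scheduler Middlewares
Middleware between the Scrapy engine and the scheduler that processes the requests and responses passed between them.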

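A Scrapy run proceeds roughly as follows:

The engine takes a link (URL) from the scheduler for the next fetch.

The engine wraps the URL in a request (Request) and passes it to the downloader.

The downloader downloads the resource and wraps it in a response (Response).

The spider parses the Response.

If an item (Item) is parsed out, it is handed to the item pipeline for further processing.

If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling.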
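The flow above can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented: the real engine is asynchronous and its scheduler is a priority queue. The names `run_engine`, `SITE`, `download` and `parse`, and the first-in-first-out queue, are assumptions made for illustration.

```python
from collections import deque

# A made-up three-page site: each path maps to the links found on that page.
SITE = {'/': ['/a', '/b'], '/a': ['/b'], '/b': ['/']}


def download(url):
    """Stand-in for the downloader: return the 'response' for a URL."""
    return SITE[url]


def parse(url, response):
    """Stand-in for a spider: yield one item, then every link on the page."""
    yield {'url': url}
    for link in response:
        yield link


def run_engine(start_urls, download, parse):
    queue = deque(start_urls)      # scheduler queue
    seen = set(start_urls)         # the scheduler also filters duplicates
    items = []                     # stands in for the item pipeline
    while queue:
        url = queue.popleft()                  # engine takes a URL
        response = download(url)               # downloader returns a response
        for result in parse(url, response):    # spider parses the response
            if isinstance(result, dict):
                items.append(result)           # items go to the pipeline
            elif result not in seen:
                seen.add(result)               # new URLs return to the scheduler
                queue.append(result)
    return items
```

Running `run_engine(['/'], download, parse)` visits each of the three pages exactly once, even though the pages link to each other in a cycle.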

I. Installation

Linux
    pip3 install scrapy

Windows
    a. pip3 install wheel
    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. Change to the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    d. pip3 install scrapy
    e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/

II. Basic Usage

1. Basic commands

1. scrapy startproject <project name>
   - Creates a project directory in the current directory (similar to Django)

2. scrapy genspider [-t template] <name> <domain>
   - Creates a spider
   For example:
      scrapy genspider -t basic oldboy oldboy.com
      scrapy genspider -t xmlfeed autohome autohome.com.cn
   PS:
      List the available templates: scrapy genspider -l
      Show a template: scrapy genspider -d <template name>

3. scrapy list
   - Lists the spiders in the project

4. scrapy crawl <spider name>
   - Runs a single spider

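2. Project structure and spider overview

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py

File descriptions:

scrapy.cfg    Main project configuration. (The settings that actually affect crawling are in settings.py.)
items.py      Data storage templates used to structure data, similar to Django's Model
pipelines.py  Data processing behavior, e.g. persisting structured data
settings.py   Configuration file, e.g. recursion depth, concurrency, download delay
spiders       Spider directory, e.g. create files here and write the crawling rules

Note: spider files are usually named after the website's domain.

Examples (code not shown): spider1.py, notes on Windows encoding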

3. A first try

    import scrapy
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http.request import Request


    class DigSpider(scrapy.Spider):
        # Name of the spider; the crawl command starts it by this name
        name = "dig"

        # Allowed domains
        allowed_domains = ["chouti.com"]

        # Start URLs
        start_urls = [
            'http://dig.chouti.com/',
        ]

        has_request_set = {}

        def parse(self, response):
            print(response.url)

            # HtmlXPathSelector and select() are legacy APIs; newer Scrapy
            # releases use Selector(response=response).xpath(...) instead.
            hxs = HtmlXPathSelector(response)
            page_list = hxs.select(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
            for page in page_list:
                page_url = 'http://dig.chouti.com%s' % page
                key = self.md5(page_url)
                if key in self.has_request_set:
                    # Already requested: skip it
                    pass
                else:
                    self.has_request_set[key] = page_url
                    obj = Request(url=page_url, method='GET', callback=self.parse)
                    yield obj

        @staticmethod
        def md5(val):
            import hashlib
            ha = hashlib.md5()
            ha.update(bytes(val, encoding='utf-8'))
            key = ha.hexdigest()
            return key

To run this spider, go to the project directory in a terminal and run:

    scrapy crawl dig --nolog

The important points in the code above are:

Request is a class that wraps a user request; yielding a Request object from the callback tells Scrapy to go on and visit that URL.
HtmlXPathSelector structures the HTML and provides selector functionality.

4. Selectors

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # HtmlXPathSelector is a legacy class; Selector replaces it in newer Scrapy releases.
    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    html = """<!DOCTYPE html>
    <html>
        <head lang="en">
            <meta charset="UTF-8">
            <title></title>
        </head>
        <body>
            <ul>
                <li class="item-"><a id='i1' href="link.html">first item</a></li>
                <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
                <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
            </ul>
            <div><a href="llink2.html">second item</a></div>
        </body>
    </html>
    """
    response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

    # Uncomment one query at a time to see its result.
    # hxs = HtmlXPathSelector(response)
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[2]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[@id]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[@id="i1"]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
    # print(hxs)

    # ul_list = Selector(response=response).xpath('//body/ul/li')
    # for item in ul_list:
    #     v = item.xpath('./a/span')
    #     # or
    #     # v = item.xpath('a/span')
    #     # or
    #     # v = item.xpath('*/a/span')
    #     print(v)
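To experiment with a few of these path expressions without installing Scrapy, the sketch below swaps in the standard library's `xml.etree.ElementTree`. This is a stand-in, not Scrapy's `Selector`: ElementTree supports only a small XPath subset (no `contains`, `starts-with` or `re:test`), needs well-formed XML, and returns elements rather than selector objects. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the HTML used above (ElementTree needs valid XML).
DOC = """<html>
  <body>
    <ul>
      <li class="item-"><a id="i1" href="link.html">first item</a></li>
      <li class="item-0"><a id="i2" href="llink.html">first item</a></li>
      <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html">second item</a></div>
  </body>
</html>"""

root = ET.fromstring(DOC)   # root is the <html> element

# //a : every <a> element in the document
all_hrefs = [a.get('href') for a in root.findall('.//a')]

# //a[@id] : only <a> elements that carry an id attribute
ids = [a.get('id') for a in root.findall('.//a[@id]')]

# //a[@href="link.html"][@id="i1"] : both predicates must match
first = root.find(".//a[@href='link.html'][@id='i1']")

# /html/body/ul/li/a/@href : hrefs of the links inside the list only
list_hrefs = [a.get('href') for a in root.findall('./body/ul/li/a')]

# ./a/span relative to each <li>, as in the loop at the end of the code above
spans = [li.find('./a/span') for li in root.findall('./body/ul/li')]
span_texts = [s.text for s in spans if s is not None]
```

Only the third list item contains a `<span>`, so the relative query finds a match for just one of the three `<li>` elements.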

Example (code not shown): log in to Chouti automatically and upvote

Note: set DEPTH_LIMIT = 1 in settings.py to limit the depth of the "recursion".

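5. Structuring the data

The examples above only do simple processing, so everything is handled directly in the parse method. To collect and process more data, use Scrapy's items to give the data a structure and then hand it to the pipelines for processing.

Examples (code not shown): spiders/xiahuar.py, items, pipelines, settings

Pipelines can do more than that, as in the following:

Example (code not shown): custom pipeline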
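A custom pipeline along these lines might look like the sketch below. The hook names (`from_crawler`, `open_spider`, `process_item`, `close_spider`) are the ones Scrapy calls on a pipeline. The class name `JsonLinesPipeline` and the setting name `ITEMS_PATH` are made up for this illustration and do not come from the original post.

```python
import json


class JsonLinesPipeline:
    """Write every item as one JSON line (hypothetical example pipeline)."""

    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, if present, to build the pipeline.
        # ITEMS_PATH is a made-up setting name used only in this sketch.
        return cls(crawler.settings.get('ITEMS_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item   # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

To enable such a class, it would be listed in the ITEM_PIPELINES dictionary in settings.py, with its import path as the key and an integer priority as the value.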

6. Middleware

Examples (code not shown): spider middleware, downloader middleware
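As a rough illustration of a downloader middleware, the sketch below sets a User-Agent header on each outgoing request. `process_request` and `process_response` are the hook names Scrapy calls on a downloader middleware. The class name and the header values are assumptions made for this example.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a User-Agent on each request."""

    # Example values only; they are not taken from the original article.
    USER_AGENTS = ['crawler-sketch/1.0', 'crawler-sketch/2.0']

    def process_request(self, request, spider):
        # Called for each request on its way to the downloader.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None   # None tells Scrapy to keep processing this request

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the engine.
        return response   # pass the response on unchanged
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES dictionary in settings.py, again with the import path as the key and an integer order as the value.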

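7. Custom commands

Create a directory with any name, e.g. commands, at the same level as spiders.
In it, create a file crawlall.py (the file name becomes the name of the custom command).
Example (code not shown): crawlall.py
In settings.py, add the setting COMMANDS_MODULE = '<project name>.<directory name>'
In the project directory, run: scrapy crawlall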

8. Custom extensions

A custom extension uses signals to register the operations it wants to run at specific points of the crawl.

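9. Avoiding duplicate requests

By default Scrapy uses scrapy.dupefilters.RFPDupeFilter to remove duplicates (the module was named scrapy.dupefilter in old releases). The related settings are:

    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
    DUPEFILTER_DEBUG = False
    JOBDIR = "path where the record of visited requests is stored, e.g. /root/"  # final path: /root/requests.seen

Example (code not shown): custom URL deduplication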
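A custom deduplication class might look roughly like the sketch below. The method names follow the interface of Scrapy's base dupefilter as commonly documented (`from_settings`, `request_seen`, `open`, `close`, `log`). Treat them as assumptions to check against your Scrapy version, since newer releases prefer `from_crawler`. The class name `UrlDupeFilter` is hypothetical, and the MD5 digest mirrors the helper used in the spider earlier in this article.

```python
import hashlib


class UrlDupeFilter:
    """Dedup sketch: remember an MD5 digest of each URL already requested."""

    def __init__(self):
        self.visited = set()

    @classmethod
    def from_settings(cls, settings):
        # Hook used to build the filter from the project settings.
        return cls()

    def request_seen(self, request):
        """Return True if the URL was seen before, so it should be skipped."""
        key = hashlib.md5(request.url.encode('utf-8')).hexdigest()
        if key in self.visited:
            return True
        self.visited.add(key)
        return False

    def open(self):
        pass   # called when the crawl starts

    def close(self, reason):
        pass   # called when the crawl ends

    def log(self, request, spider):
        pass   # called for each request that was filtered out
```

Pointing DUPEFILTER_CLASS in settings.py at the import path of such a class would replace the default filter.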

10. Other

Example (code not shown): settings

11. TinyScrapy

Examples (code not shown): Twisted example 1, Twisted example 2, Twisted example 3, simulating the Scrapy framework, reference version

More documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Reposted from: https://www.cnblogs.com/xc1234/p/8645901.html
