
9.2 Scrapy Crawler Framework [Advanced]: Using Spiders

Published: 2023/12/16


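1. Spiders

The Spider class defines how a site (or a group of sites) is crawled: how to perform the crawl (for example, whether to follow links) and how to extract structured data (items) from the pages.

===> In other words, a Spider is where you define the crawling behavior and the parsing logic for particular pages.

For a spider, the crawl loop goes through the following steps: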

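First, the spider generates the initial requests for the first URLs. Once a request has been downloaded, a response is produced and passed to the callback specified for that request. By default, start_requests() generates a Request for each URL in start_urls and sets the parse method as its callback. [Note: you can override start_requests() yourself.]
In the callback, parse the Response (the web page) and return dicts with the extracted data, Item objects, Request objects, or an iterable of these. The returned requests also carry a callback (possibly the same one); Scrapy downloads them and then passes their responses to the specified callback.
In the callback, page content is usually parsed with selectors (Beautiful Soup or lxml also work), and items are built from the parsed data.
Finally, the items returned by the spider are typically persisted to a database (in an Item Pipeline) or written to a file using Feed exports.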

This loop applies to every kind of spider. Scrapy ships several default spiders for different needs.
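The loop described above can be sketched as a toy model in plain Python. This is not Scrapy code: ToyRequest, FAKE_WEB and crawl are hypothetical names invented for illustration, and the "download" is just a dictionary lookup. It only shows how requests, callbacks and items feed each other.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://www.example.com/1.html": ("Page one", ["http://www.example.com/2.html"]),
    "http://www.example.com/2.html": ("Page two", ["http://www.example.com/1.html"]),
}


class ToyRequest:
    """Stand-in for a request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(response):
    """Callback: yield one item for the page, then one request per link."""
    url, (title, links) = response
    yield {"url": url, "title": title}
    for link in links:
        yield ToyRequest(link, parse)


def crawl(start_urls):
    """Run the loop: schedule, 'download', call back, route the results."""
    queue = deque(ToyRequest(url, parse) for url in start_urls)
    seen = set(start_urls)
    items = []
    while queue:
        request = queue.popleft()
        response = (request.url, FAKE_WEB[request.url])  # fake download
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                if result.url not in seen:  # skip URLs already scheduled
                    seen.add(result.url)
                    queue.append(result)
            else:
                items.append(result)  # a real project would pass this to a pipeline
    return items


items = crawl(["http://www.example.com/1.html"])
print(items)
```

Each callback result is routed by type: request-like objects go back into the queue, everything else is collected as an item. Scrapy does the same routing, with a scheduler, a downloader and item pipelines in place of the queue, the dictionary and the list.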

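2. Class scrapy.spiders.Spider

The simplest spider, and the one every other spider must inherit from (including the spiders bundled with Scrapy as well as those you write yourself).
It provides no special functionality.
It only provides a default start_requests() implementation that sends requests for the URLs in the spider's start_urls attribute and calls the spider's parse method on each resulting response.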

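Its commonly used attributes and methods are:

(1) name: the spider's name. It is required, because it is what you pass when running the spider: scrapy crawl name

(2) allowed_domains: optional. A list of the domains this spider is allowed to crawl.

(3) start_urls: a list of URLs. When no particular URLs are specified, the spider starts crawling from this list.

    So the first pages crawled will be the URLs listed here. Subsequent URLs are generated successively from the data contained in the pages of the start URLs.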

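(4) start_requests(): this method must return an iterable containing the first Requests the spider will crawl.

Scrapy calls this method (start_requests) when the spider is opened and no particular URLs are specified.
If particular URLs are specified, make_requests_from_url() is used instead to create the Request objects. The method is called only once, so it is safe to implement it as a generator. The default implementation generates a Request for each URL in start_urls.
To change the Requests used to start crawling a site, override start_requests.
- For example, to log in with a POST request at startup: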

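import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # Replace the default GET requests with a single login POST
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # Here you could extract links to follow and return a Request
        # with a callback for each of them
        pass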

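(5) make_requests_from_url(url): takes a URL and returns a Request object to crawl. It is called by start_requests() when building the initial requests, and is also used to convert URLs into requests. Unless overridden, it returns Requests with parse() as the callback and dont_filter enabled. Note: this method was deprecated in Scrapy 1.4 and is no longer available in recent releases; override start_requests() instead.

(6) parse(response): the default callback Scrapy uses to process downloaded responses when their requests do not specify a callback. The parse method is responsible for processing the response and returning the scraped data and/or more URLs to follow. The same requirements apply to any other Request callback of the Spider.

    Parameter: response (Response) - the response to parse

(7) log(message[, level, component]): sends a log message through the Spider's logger; kept for backward compatibility.

(8) closed(reason): called when the spider closes. It is a shortcut for calling signals.connect() to listen for the spider_closed signal.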
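Returning to allowed_domains from item (2): Scrapy's offsite filtering treats subdomains of an allowed domain as allowed, and an empty list as "allow everything". The check can be approximated in plain Python as below. The function name is_allowed is hypothetical and this is a sketch of the behavior, not Scrapy's implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    if not allowed_domains:
        return True  # no restriction configured
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["example.com"]
print(is_allowed("http://www.example.com/1.html", allowed))  # subdomain of example.com
print(is_allowed("http://example.com/", allowed))            # the domain itself
print(is_allowed("http://notexample.com/", allowed))         # a different domain
```

Comparing against "." + domain rather than the bare suffix is what keeps notexample.com from passing as example.com.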

Supplement: three examples for class scrapy.spiders.Spider

Example 1: the basics

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # Log every URL that was crawled
        self.logger.info('Crawled URL: {}'.format(response.url))

Example 2: returning multiple Requests and items from a single callback

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # One item (a plain dict) per <h3> element
        for h3 in response.xpath('//h3').extract():
            yield {'title': h3}

        # One new Request per link; urljoin makes relative hrefs absolute
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)

Example 3: instead of start_urls, you can use start_requests() directly, and use Items to give the data more structure

import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        # Build the initial requests explicitly instead of listing start_urls
        yield scrapy.Request(url='http://www.example.com/1.html', callback=self.parse)
        yield scrapy.Request(url='http://www.example.com/2.html', callback=self.parse)
        yield scrapy.Request(url='http://www.example.com/3.html', callback=self.parse)

    def parse(self, response):
        # One MyItem per <h3> element
        for h3 in response.xpath('//h3').extract():
            item = MyItem()
            item['title'] = h3
            yield item

        # One new Request per link; urljoin makes relative hrefs absolute
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
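A note on the links extracted in examples 2 and 3: href values are often relative, while a Request needs an absolute URL with a scheme. Scrapy's response.urljoin() resolves an href against the page URL. The standard-library function it builds on can be tried on its own, as in this small sketch (the base URL and hrefs are made-up sample values):

```python
from urllib.parse import urljoin

# Hypothetical page URL that the links were extracted from.
base = "http://www.example.com/dir/1.html"

hrefs = ["2.html", "/3.html", "../4.html", "http://www.example.com/5.html"]

# Resolve every href against the page URL.
resolved = [urljoin(base, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(href, "->", url)
```

Absolute URLs pass through unchanged, so it is safe to apply the join to every extracted link.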

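3. Class scrapy.spiders.CrawlSpider

The most commonly used spider for crawling regular websites, because it provides a convenient mechanism for following links by defining a set of rules.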

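(1) In addition to the attributes inherited from Spider (which you must still specify), this class supports one new attribute: rules

It is a list of one (or more) Rule objects. Each Rule defines a particular behavior for crawling the site.
Rule objects are described below. If several rules match the same link, the first one is used, according to the order in which they are defined in this attribute.

(2) This spider also exposes an overridable method: parse_start_url(response)

This method is called for the responses of the URLs in start_urls. It allows the initial responses to be parsed, and must return an Item object, a Request object, or an iterable containing any of them.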

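(3) Crawling rules

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

(3-1) link_extractor is a Link Extractor object that defines how links are extracted from each crawled page.

(3-2) callback is a callable or a string (in which case the spider method with that name is used). It is called with the response of each link extracted by link_extractor. The callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

Warning: when writing crawl spider rules, avoid using parse as the callback, because CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work.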
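Two points from this section, the first-match ordering of rules and the string form of callback, can be illustrated with a toy model. ToySpider, dispatch and parse_other are hypothetical names, the rules are reduced to (pattern, callback) pairs, and the callbacks receive a URL instead of a response. It is a sketch of the idea, not how CrawlSpider is implemented.

```python
import re


class ToySpider:
    """Hypothetical stand-in for a CrawlSpider; not Scrapy code."""

    # (pattern, callback) pairs, checked in order. The callback may be a
    # method name (string), a callable, or None for "follow only".
    rules = [
        (r"category\.php", None),
        (r"item\.php", "parse_item"),
        (r"\.php", "parse_other"),
    ]

    def parse_item(self, url):
        return "item:" + url

    def parse_other(self, url):
        return "other:" + url

    def dispatch(self, url):
        """Apply the first matching rule and call its callback, if any."""
        for pattern, callback in self.rules:
            if re.search(pattern, url):
                if isinstance(callback, str):
                    callback = getattr(self, callback)  # name -> bound method
                return callback(url) if callback else None
        return None  # no rule matched


spider = ToySpider()
print(spider.dispatch("http://www.example.com/item.php?id=1"))
print(spider.dispatch("http://www.example.com/category.php?id=2"))
print(spider.dispatch("http://www.example.com/about.php"))
```

The item.php URL also matches the broader \.php pattern, but the earlier rule wins, which is why more specific rules are usually listed first.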

Supplement: an example for class scrapy.spiders.CrawlSpider

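import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TeamItem(scrapy.Item):
    # Hypothetical item class: fields must be declared before item['...'] can be assigned
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Follow links matching 'category.php' but not 'subsection.php'
        # (no callback is given, so these pages are only followed)
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
        # Parse links matching 'item.php' with the parse_item method
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Currently crawling: {}'.format(response.url))
        item = TeamItem()
        team_id = response.xpath('//td[@id="team_id"]/text()').extract()[0]
        item['id'] = re.findall(r'(\d+)', team_id)[0]
        item['name'] = response.xpath('//td[@id="team_name"]/text()').extract()[0]
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()[0]
        return item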

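Explanation: this spider starts crawling from the example.com home page, collects category links and item links, and parses the latter with the parse_item method.

    For each item response, some data is extracted from the HTML using XPath and filled into an Item.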
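The allow and deny patterns of the first rule in the example can be tried outside Scrapy. The sketch below assumes the usual link-extractor semantics: a URL is kept if it matches any allow pattern (or allow is empty) and matches no deny pattern. The function link_allowed and the sample URLs are hypothetical.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Keep a URL that matches any allow pattern (if given) and no deny pattern."""
    if allow and not any(re.search(p, url) for p in allow):
        return False
    return not any(re.search(p, url) for p in deny)


urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/subsection.php?id=2",
    "http://www.example.com/category.php?next=subsection.php",
    "http://www.example.com/item.php?id=3",
]

# Patterns taken from the first rule of the example above.
followed = [
    u for u in urls
    if link_allowed(u, allow=(r"category\.php",), deny=(r"subsection\.php",))
]
print(followed)
```

The third URL shows that deny takes precedence: it matches the allow pattern but is still dropped because it also matches the deny pattern.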

Reposted from: https://my.oschina.net/pansy0425/blog/3092968
