Scrapy Tutorial

Published: 2024/8/1

Basics

1. Create a project

    scrapy startproject mySpider

2. Create a spider

    scrapy genspider spiders itcast.cn

Then fill in the generated spider:

    import scrapy

    class SpidersSpider(scrapy.Spider):
        name = 'spiders'                    # spider name
        allowed_domains = ['itcast.cn']     # domains the spider is allowed to crawl
        start_urls = ['http://itcast.cn/']  # first URL to request

        def parse(self, response):
            # handle the response for the URLs in start_urls
            li_list = response.xpath('//div[@class="tea_con"]')
            for li in li_list:
                item = {}
                item['name'] = li.xpath(".//h3/text()").extract_first()
                item['title'] = li.xpath(".//h4/text()").extract_first()
                # res = response.xpath('//div[@class="tea_con"]//h3/text()').extract_first()
                # print(res)
                yield item  # pass the item on to the pipelines

Note: if an XPath expression is wrong and matches nothing, extract_first() returns None by default.

3. Run the spider

    scrapy crawl spiders

In settings.py, the following setting makes the console show only messages at WARNING level or above:

    LOG_LEVEL = "WARNING"

4. Processing items in pipelines

In parse():

    yield item  # pass the item on to the pipelines

First, uncomment the ITEM_PIPELINES block in settings.py. The module path must match the project package name:

    ITEM_PIPELINES = {
        'mySpider.pipelines.MyspiderPipeline': 300,   # the lower the number, the earlier it runs
        'mySpider.pipelines.MyspiderPipeline1': 301,
    }

Then define the two pipeline classes in pipelines.py:

    class MyspiderPipeline:
        def process_item(self, item, spider):
            print(item)
            item["hello"] = 'word'
            return item

    class MyspiderPipeline1:
        def process_item(self, item, spider):
            print(item)
            return item

Output:

    {'name': '黃老師', 'title': '高級講師'}
    {'name': '黃老師', 'title': '高級講師', 'hello': 'word'}
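The ordering rule can be tried out without running a crawl. The sketch below is a simplified stand-in for what Scrapy does with ITEM_PIPELINES, not its real engine; run_pipelines, AddGreetingPipeline, and RecordingPipeline are hypothetical names made up for this illustration.

```python
class AddGreetingPipeline:
    def process_item(self, item, spider):
        item["hello"] = "word"
        return item


class RecordingPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        # Keep a copy of whatever reaches this pipeline.
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines, spider=None):
    """Call process_item on each pipeline, lowest priority number first."""
    for pipeline, _priority in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = pipeline.process_item(item, spider)
    return item


recorder = RecordingPipeline()
# Declared in the "wrong" order on purpose: the numbers decide, not the position.
pipelines = {recorder: 301, AddGreetingPipeline(): 300}
result = run_pipelines({"name": "demo"}, pipelines)
```

Because 300 sorts before 301, the recording pipeline already sees the "hello" key added by the first one.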

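5. Telling the items of several spiders apart in a pipeline

Option 1: tag each item in the spider and check the tag in the pipeline.

    def parse(self, response):
        item = {}
        item["come_from"] = 'itcast'
        yield item

    class MyspiderPipeline:
        def process_item(self, item, spider):
            if item['come_from'] == 'itcast':
                print(item)
                item["hello"] = 'word'
            return item

Option 2 (recommended): check spider.name in the pipeline.

    class MyspiderPipeline:
        def process_item(self, item, spider):
            if spider.name == 'itcast':
                print(item)
                item["hello"] = 'word'
            return item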

6. Logging output

    import logging

    logger = logging.getLogger(__name__)  # log records show which module emitted them

In settings.py:

    LOG_LEVEL = "WARNING"
    LOG_FILE = "./log.log"  # save the log to a local file

7. Logging outside Scrapy

    import logging

    logging.basicConfig(
        level=logging.DEBUG,
        format='%(levelname)s %(filename)s %(message)s - %(asctime)s',
        datefmt='[%d/%b/%Y %H:%M:%S]',
        filename='./loggmsg.log',
        filemode="a",
    )
    logger = logging.getLogger(__name__)
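To see what a named logger and a level threshold do in practice, here is a small standard-library sketch. It writes to an in-memory stream instead of a file so the result can be inspected; the logger name "demo.spider" and the messages are assumptions made for the example.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("demo.spider")
logger.setLevel(logging.WARNING)   # same effect as LOG_LEVEL = "WARNING"
logger.addHandler(handler)
logger.propagate = False           # keep the demo output out of the root logger

logger.debug("not shown, below the WARNING threshold")
logger.warning("page %s returned no items", 3)

output = stream.getvalue()
```

Only the warning ends up in the stream, and the record carries the logger's name, which is why getLogger(__name__) makes it easy to tell where a message came from.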

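8. Pagination

    # '下一頁' is the text of the "next page" link on the target site
    next_page_url = response.xpath("//a[text()='下一頁']/@href").extract_first()
    if next_page_url is not None:
        # urljoin turns a relative href into an absolute URL
        yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)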
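The href of a "next page" link is often relative, so it has to be resolved against the URL of the current page before it can be requested. Scrapy responses offer response.urljoin(href) for this; the sketch below shows the same resolution with the standard library. The example.com URLs are made up.

```python
from urllib.parse import urljoin

page_url = "http://example.com/list/page1.html"

relative = urljoin(page_url, "page2.html")
root_relative = urljoin(page_url, "/list/page2.html")
absolute = urljoin(page_url, "http://example.com/other/page9.html")
```

All three forms of href end up as absolute URLs that can be passed to scrapy.Request.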

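9. Using yield scrapy.Request()

    yield scrapy.Request(
        url,
        callback=self.parse,   # the parse function that handles the response for this URL
        method='GET',          # HTTP method; pass 'POST' etc. when needed
        headers=headers,
        body=body,
        cookies=cookies,
        meta=meta,             # passes data between parse functions; Scrapy also stores some of its own keys in meta
        dont_filter=False,
    )

Notes on the parameters above:

meta: passes data between different parse functions. Scrapy also puts some information of its own into meta. Example:

    yield scrapy.Request(
        next_page_url,
        callback=self.parse1,
        meta={'item': item},
    )

    def parse1(self, response):
        item = response.meta['item']  # the item passed in by the request above

dont_filter: Scrapy de-duplicates URLs by default, and with dont_filter=False (the default) a request for an already-seen URL is dropped. Pass dont_filter=True so that the duplicate filter does not drop the request; this matters for URLs that have to be requested repeatedly.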
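The effect of dont_filter can be pictured with a toy scheduler. This is a simplified model written for illustration, not Scrapy's scheduler: TinyScheduler is a hypothetical class, and it compares bare URL strings, whereas Scrapy fingerprints whole requests.

```python
class TinyScheduler:
    """Toy request queue with a duplicate filter."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, url, dont_filter=False):
        # Drop the request if the URL was seen before, unless told not to filter.
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = TinyScheduler()
first = scheduler.enqueue("http://example.com/status")
second = scheduler.enqueue("http://example.com/status")
forced = scheduler.enqueue("http://example.com/status", dont_filter=True)
```

The second call is rejected as a duplicate, while the third one goes through because the filter is bypassed.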

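10. Cleaning data (removing whitespace such as \s and \t, and empty strings)

List comprehension:

    item['url'] = ['http://url' + i for i in item['url']]

In pipelines.py:

    import re

    def process_content(self, content):
        # remove non-breaking spaces and all other whitespace
        content = [re.sub(r"\xa0|\s|\t", "", i) for i in content]
        # drop the strings that are now empty
        content = [i for i in content if len(i) > 0]
        return content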
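A standalone version of the cleaning step makes it easy to check what the regular expression removes. clean_content is a hypothetical helper name and the sample strings are made up.

```python
import re


def clean_content(content):
    """Remove all whitespace inside each string, then drop empty strings."""
    stripped = [re.sub(r"\xa0|\s", "", text) for text in content]
    return [text for text in stripped if text]


cleaned = clean_content(["  hello\t", "\xa0", "\n", "wor ld"])
```

Note that the pattern removes whitespace everywhere, including spaces inside a string, so "wor ld" becomes "world". If inner spaces should survive, str.strip() is the gentler choice.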

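11. scrapy shell

    scrapy shell https://www.baidu.com

The shell lets you inspect the request and the response in more detail:

    response.url              URL of the current response
    response.request.url      URL of the request that produced this response
    response.headers          response headers
    response.body             response body
    response.request.headers  headers of the request that produced this response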

12. More on settings

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Download delay: wait three seconds between downloads
    #DOWNLOAD_DELAY = 3

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #    'Accept-Language': 'en',
    #}

    # Automatic throttling
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True

    # HTTP cache
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Reading a setting from within the spider:

    self.settings.get('HTTPCACHE_ENABLED')

13. Using a pipeline

    import json

    class JsonWriterPipeline(object):
        # runs once, when the spider is opened
        def open_spider(self, spider):
            self.file = open(spider.settings.get("SAVE_FILE", "./item.json"), "w")

        # runs once, when the spider is closed
        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            # without this return, the next pipeline (the one with the larger
            # priority number) does not receive the item
            return item
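Since a pipeline is an ordinary class, its open/process/close life cycle can be exercised by hand. The sketch below does so with a stand-in spider object; the SimpleNamespace spider, the temporary file path, and the sample items are assumptions for the demo, and a plain dict plays the role of spider.settings because only .get() is used.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonWriterPipeline:
    def open_spider(self, spider):
        path = spider.settings.get("SAVE_FILE", "./item.json")
        self.file = open(path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


path = os.path.join(tempfile.mkdtemp(), "item.json")
spider = SimpleNamespace(settings={"SAVE_FILE": path})  # stand-in for a real spider

pipeline = JsonWriterPipeline()
pipeline.open_spider(spider)                 # once, at start
for item in ({"name": "a"}, {"name": "b"}):
    pipeline.process_item(item, spider)      # once per item
pipeline.close_spider(spider)                # once, at the end

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

The file ends up with one JSON object per line, in the order the items were processed.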

Example: crawling Qiushibaike

1. In settings.py:

    # Obey robots.txt rules (False: do not follow the site's robots.txt)
    ROBOTSTXT_OBEY = False

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1',
    }

2. Saving the data to duanzi.json

    import json

    class QsbkPipeline:
        def __init__(self):
            self.fp = open('duanzi.json', 'w', encoding='utf-8')

        # open_spider is called when the spider starts
        def open_spider(self, spider):
            print('Spider started.')

        def process_item(self, item, spider):
            item_json = json.dumps(item, ensure_ascii=False)
            self.fp.write(item_json + '\n')
            return item

        # close_spider is called when the spider finishes
        def close_spider(self, spider):
            self.fp.close()
            print('Spider finished.')

Note: enable the pipeline in settings.py:

    ITEM_PIPELINES = {
        'qsbk.pipelines.QsbkPipeline': 300,
    }

3. A better way to save the data

In items.py, define the two fields:

    import scrapy

    class QsbkItem(scrapy.Item):
        author = scrapy.Field()
        content = scrapy.Field()

In the spider:

    from qsbk.items import QsbkItem

    text_dict = QsbkItem(author=author, content=content)
    yield text_dict

In pipelines.py, convert the item to a dict before serializing it:

    item_json = json.dumps(dict(item), ensure_ascii=False)
    self.fp.write(item_json + '\n')
    return item

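Notes:

1. response is a 'scrapy.http.response.html.HtmlResponse' object, so 'xpath' and 'css' queries can be run on it directly to extract data.
2. Extracted data comes back as a Selector or SelectorList object. To get the strings out of it, call 'get()' or 'getall()'.
3. getall(): extracts the text of every match in the 'Selector' and returns a list.
4. get(): extracts the text of the first match in the 'Selector' and returns a str.
5. To hand parsed data to a pipeline, either yield each item, or define an empty list up front, collect the items in it, and return the list at the end.
6. item: it is recommended to define the data model in 'items.py'.
7. pipeline: the component dedicated to saving data. Three of its methods are used most often:
   'open_spider(self, spider)': runs when the spider is opened.
   'process_item(self, item, spider)': called each time the spider passes an item over.
   'close_spider(self, spider)': called when the spider is closed.
   To activate a pipeline, enable it under 'ITEM_PIPELINES' in 'settings.py'.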

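4. A better way to save the data, part 2

    from scrapy.exporters import JsonLinesItemExporter

    class QsbkPipeline:
        def __init__(self):
            self.fp = open('duanzi.json', 'wb')
            self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

        # open_spider is called when the spider starts
        def open_spider(self, spider):
            print('Spider started.')

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

        # close_spider is called when the spider finishes
        def close_spider(self, spider):
            self.fp.close()
            print('Spider finished.')

Notes:

'JsonItemExporter' and 'JsonLinesItemExporter' both simplify saving data as JSON.

1. 'JsonItemExporter': writes all items as one JSON array, opened by start_exporting() and closed by finish_exporting(). The advantage is that the stored file is valid JSON. The disadvantage is that with a large amount of data the file usually has to be loaded in full to be parsed, which costs memory.
2. 'JsonLinesItemExporter': stores each item on disk as soon as 'export_item' is called. The disadvantage is that every dict sits on its own line, so the file as a whole is not valid JSON. The advantage is that each item goes straight to disk as it is processed, so memory use stays low and the data is safer.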
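The difference between the two file layouts can be shown with the standard json module alone. This sketch does not use the Scrapy exporters; it only reproduces the shape of their output, and the sample items are made up.

```python
import json

items = [
    {"author": "a", "content": "first"},
    {"author": "b", "content": "second"},
]

# One JSON array: the whole file is a single valid JSON document.
json_array = json.dumps(items)

# JSON Lines: one JSON object per line, written item by item.
json_lines = "\n".join(json.dumps(item) for item in items)

parsed_array = json.loads(json_array)
parsed_lines = [json.loads(line) for line in json_lines.splitlines()]

# Parsing the JSON Lines text as one document fails.
try:
    json.loads(json_lines)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

Both layouts carry the same data, but the line-based one has to be read line by line, which is also what makes it cheap to process in a stream.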

5. Pagination on Qiushibaike

    next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
    if not next_url:
        return
    else:
        print('https://www.qiushibaike.com/' + next_url)
        yield scrapy.Request('https://www.qiushibaike.com/' + next_url, callback=self.parse)

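The Request object

A Request object is created whenever the spider sends a request. Commonly used parameters:

1. 'url': the URL to request.
2. 'callback': the callback run after the downloader has fetched the response.
3. 'method': the HTTP method, GET by default.
4. 'headers': the request headers. Headers that never change can be set once in 'settings.py'; headers that vary are passed when the request is sent.
5. 'meta': used frequently, to pass data between requests.
6. 'encoding': the encoding, utf-8 by default.
7. 'dont_filter': tells the scheduler not to filter this request; used mostly when the same request is sent several times. If a request for this URL has been sent before, the scheduler refuses to send it again by default, so set this to True in that case.
8. 'errback': the function run when an error occurs.

POST requests are sent with FormRequest, a subclass of Request.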

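The Response object

1. 'meta': the meta dict passed along from the request; it keeps data connected across several requests.
2. 'encoding': the encoding used to encode and decode the response.
3. 'text': the response data as a Unicode string.
4. 'body': the response data as a bytes string.
5. 'xpath': the XPath selector.
6. 'css': the CSS selector.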

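Sending POST requests:

POST requests are sent with FormRequest, a subclass of Request. To send a POST request as the very first request of a spider, override start_requests(self) in the spider class; the URLs in start_urls are then no longer requested.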
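What a form POST carries can be previewed with the standard library: the form fields are URL-encoded into the request body, which is the encoding a form request sends with the content type application/x-www-form-urlencoded. The field values below are made up for the example.

```python
from urllib.parse import parse_qs, urlencode

formdata = {"email": "user@example.com", "password": "p@ss word"}

# URL-encode the fields the way an HTML form submission does.
body = urlencode(formdata)

# The receiving side decodes the body back into the original fields.
decoded = parse_qs(body)
```

Special characters such as "@" and spaces are escaped in the body, so the values passed in formdata should be the plain, unescaped strings.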

Example: logging in to Renren

    import scrapy

    class RenrenSpider(scrapy.Spider):
        name = 'renren'
        allowed_domains = ['renren.com']
        start_urls = ['http://renren.com']

        def start_requests(self):
            url = 'http://www.renren.com/PLogin.do'
            data = {'email': '', 'password': ''}
            request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
            # yield the request so that Scrapy sends it
            yield request

        def parse_page(self, response):
            # request a page that is only reachable after a successful login
            request = scrapy.Request(
                url='http://www.renren.com/123456/profile',
                callback=self.parse_profile,
            )
            yield request

        def parse_profile(self, response):
            # response.text is a str, so open the file in text mode
            with open('dp.html', 'w', encoding='utf-8') as f:
                f.write(response.text)

A second example logs in to Douban and asks the user to type in the captcha:

    import scrapy
    from PIL import Image
    from urllib import request

    class DoubanSpider(scrapy.Spider):
        name = 'douban'
        allowed_domains = ['douban.com']
        start_urls = ['https://accounts.douban.com/login']
        profile_url = 'https://www.douban.com/people/123456'

        def parse(self, response):
            formdata = {
                'source': 'None',
                'redir': 'https://www.douban.com',
                'form_email': '',
                'form_password': '',
                'login': '登錄',
            }
            captcha_url = response.css('img#captcha_image::attr(src)').get()
            # only needed when the login page shows a captcha
            if captcha_url:
                captcha = self.recognize_captcha(captcha_url)
                formdata['captcha-solution'] = captcha
                captcha_id = response.xpath('//input[@name="captcha_id"]/@value').get()
                formdata['captcha_id'] = captcha_id
            yield scrapy.FormRequest(
                url='',  # fill in the URL the login form posts to
                formdata=formdata,
                callback=self.parse_after_login,
            )

        def recognize_captcha(self, image_url):
            # download the captcha image, show it, and let the user type it in
            request.urlretrieve(image_url, 'captcha.png')
            image = Image.open('captcha.png')
            image.show()
            captcha = input('Enter the captcha: ')
            return captcha

        def parse_after_login(self, response):
            # check whether the login succeeded
            if response.url == 'https://www.douban.com':
                yield scrapy.Request(url=self.profile_url, callback=self.parse_profile)
                print('Login succeeded')
            else:
                print('Login failed')

        def parse_profile(self, response):
            print(response.url)
            if response.url == self.profile_url:
                ck = response.xpath('//input[@name="ck"]/@value').get()
                formdata = {'ck': ck, 'signatures': 'the new personal signature'}
                print('Reached the profile page')
            else:
                print('Did not reach the profile page')

Note: if the last request is given no callback, Scrapy runs parse() on its response automatically, which causes an unwanted extra callback.

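Downloading images

Why use Scrapy's built-in support for downloading files:

1. It avoids re-downloading data that was downloaded recently.
2. The storage path for the files is easy to specify.
3. Downloaded images can be converted to a common format (JPG).
4. Thumbnails are easy to generate.
5. Image width and height are easy to check, so that images meet a minimum size.
6. Downloads are asynchronous and very efficient.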

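Downloading files with the Files Pipeline:

To download files with the Files Pipeline, follow these steps:

1. Define an Item with two fields, file_urls and files. file_urls holds the URLs of the files to download and must be a list.
2. When a download finishes, information about it is stored in the item's files field, such as the download path, the source URL, and the file checksum.
3. In settings.py, set FILES_STORE, the path the downloaded files are saved to.
4. Enable the pipeline: add 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.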

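Downloading images with the Images Pipeline:

To download images with the Images Pipeline, follow these steps:

1. Define an Item with two fields, image_urls and images. image_urls holds the URLs of the images to download and must be a list.
2. When a download finishes, information about it is stored in the item's images field, such as the download path, the source URL, and the image checksum.
3. In settings.py, set IMAGES_STORE, the path the downloaded images are saved to.
4. Enable the pipeline: add 'scrapy.pipelines.images.ImagesPipeline' to ITEM_PIPELINES.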
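Put together, the steps for the Images Pipeline could look like the settings fragment below. The setting names are Scrapy's own; the folder name, the size limits, the thumbnail sizes, and the example URL are assumptions chosen for illustration. The Images Pipeline also requires the Pillow library to be installed.

```python
import os

# settings.py fragment
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = os.path.join(os.getcwd(), "images")  # where downloaded images are saved

# Optional: generate thumbnails and skip images that are too small.
IMAGES_THUMBS = {"small": (50, 50), "big": (270, 270)}
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

# In the spider: an item only needs the two agreed fields.
# image_urls must be a list; images is filled in by the pipeline.
item = {
    "image_urls": ["http://example.com/pictures/cat.jpg"],
    "images": [],
}
```

A plain dict is used for the item here to keep the fragment free of imports; a scrapy.Item subclass with the same two fields works the same way.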

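Downloader middleware

In a downloader middleware we can set proxies and swap request headers to get around anti-crawler measures. A downloader middleware can implement two methods: process_request(self, request, spider), which runs before the request is sent, and process_response(self, request, response, spider), which runs before the downloaded response is handed to the engine.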

Random request headers

In middlewares.py:

    import random

    class UserAgentDownloadMiddleware(object):
        user_agents = [
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
            "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
            "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "UCWEB7.0.2.37/28/999",
            "NOKIA5700/ UCWEB7.0.2.37/28/999",
            "Openwave/ UCWEB7.0.2.37/28/999",
            "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
            # iPhone 6:
            "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
        ]

        def process_request(self, request, spider):
            # pick a random User-Agent for every outgoing request
            user_agent = random.choice(self.user_agents)
            request.headers['User-Agent'] = user_agent
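A middleware only takes effect once it is listed in DOWNLOADER_MIDDLEWARES in settings.py. The sketch below shows a shortened version of the middleware, the settings entry, and a dry run against a stand-in request object. The package name "myproject", the class name RandomUserAgentMiddleware, the priority 543, and the SimpleNamespace request are all assumptions made for this illustration.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    # Two sample strings only; a real pool would be longer.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # returning None lets the request continue as usual


# settings.py fragment; "myproject" is a hypothetical package name.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 543,
}

# Dry run with a stand-in request that only has a headers dict.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
chosen = fake_request.headers["User-Agent"]
```

Running the middleware by hand like this is a quick way to catch misspelled attribute or header names before starting a crawl.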

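IP proxy pool middleware

1. Using IPs extracted from a link

In middlewares.py:

    import random

    class IPProxyDownloadMiddleware(object):
        def __init__(self):
            self.proxys = ['http://127.0.0.1:800']

        def process_request(self, request, spider):
            # route each request through a randomly chosen proxy
            proxy = random.choice(self.proxys)
            request.meta['proxy'] = proxy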

2. A dedicated proxy from Kuaidaili

In middlewares.py:

    import base64

    class IPProxyDownloadMiddleware(object):
        def process_request(self, request, spider):
            proxy = 'http://121.1999.6.124:16816'
            user_password = 'username:password'
            request.meta['proxy'] = proxy
            # b64encode returns bytes, so decode before joining it to a str
            b64_user_pwd = base64.b64encode(user_password.encode('utf-8'))
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_pwd.decode('utf-8')
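The header value itself can be computed and checked in isolation. basic_proxy_auth is a hypothetical helper name, and "username" and "password" are the placeholder credentials from the example above.

```python
import base64


def basic_proxy_auth(username, password):
    """Build the value of a Proxy-Authorization header for Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    # b64encode works on bytes and returns bytes; decode to get a str.
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_proxy_auth("username", "password")
```

Keeping the encode and decode steps explicit avoids the TypeError that comes from concatenating a str with the bytes returned by b64encode.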
