

【Python】Installing and Using Scrapy


Installing Scrapy

Target site: budejie.com (百思不得姐)
Installation is a single command:

pip install scrapy

The download can be slow depending on your network; just be patient.
On Windows you also need the win32 bindings. I once wasted a whole day on constant errors simply because this library was missing:

pip install pypiwin32
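
To double-check that the install worked, you can ask Scrapy to report its version from the same terminal:

scrapy version

If that prints a version number instead of an ImportError, you are good to go.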

Using Scrapy

cd to your desktop (or any other folder you can find easily) and enter the commands one line at a time:

1. scrapy startproject bdj
2. scrapy genspider bdj_spider budejie.com
3. scrapy crawl bdj_spider

In general terms:

1. scrapy startproject <project name>
2. scrapy genspider <spider name> <domain> (without the www prefix)
3. scrapy crawl <spider name> runs the spider; before running it, edit the spider's parse function so it does what you want.
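
For reference, after steps 1 and 2 the generated project should look roughly like this (the standard Scrapy template; only the spider file name comes from the genspider argument, and step 2 is normally run from inside the new project directory so the spider lands in the spiders package):

bdj/
├── scrapy.cfg
└── bdj/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── bdj_spider.py

All the files edited later in this post (bdj_spider.py, pipelines.py, settings.py) live inside this tree.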

Taking the project above as the example:
bdj_spider.py is the most important file; edit its parse method to extract whatever you are after.

Here is the code for grabbing the usernames on the first page of budejie (百思不得姐):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
from scrapy.selector.unified import SelectorList


class BdjSpiderSpider(scrapy.Spider):
    name = 'bdj_spider'
    allowed_domains = ['budejie.com']
    start_urls = ['http://budejie.com/']

    def parse(self, response):
        print('=' * 100)
        print('=' * 100)
        print('=' * 100)
        words = response.xpath("//div[@class='j-list-user']/div")
        for word in words:
            author = word.xpath(".//a/text()").get()
            print(author)
        print('=' * 100)
        print('=' * 100)
        print('=' * 100)
        print('=' * 100)

The rows of equals signs are only there to make the extracted content easier to spot among Scrapy's log output.
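
If you want to test the XPath without running a full crawl, Scrapy's Selector can be fed raw HTML directly. The snippet below is a sketch against made-up HTML that merely imitates the assumed page structure; it is not copied from budejie.com:

from scrapy.selector import Selector

# Made-up HTML that imitates the structure the spider's XPath expects
html = """
<div class="j-list-user">
    <div><a href="/u/1">user_one</a></div>
    <div><a href="/u/2">user_two</a></div>
</div>
"""

sel = Selector(text=html)
for div in sel.xpath("//div[@class='j-list-user']/div"):
    # Same extraction as in parse(): the text of the first <a> under each div
    print(div.xpath(".//a/text()").get())

Running this should print user_one and user_two, which mirrors what parse() does against the live response.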

Going further

Next, store the scraped data in a file.
I hit a huge pitfall here and wasted a lot of time hunting back and forth for the bug, all because I had not read the tutorial carefully the first time.

Files to modify from the setup above:

1. bdj_spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
from scrapy.selector.unified import SelectorList


class BdjSpiderSpider(scrapy.Spider):
    name = 'bdj_spider'
    allowed_domains = ['budejie.com']
    start_urls = ['http://budejie.com/']

    def parse(self, response):
        print('=' * 100)
        print('=' * 100)
        print('=' * 100)
        words = response.xpath("//div[@class='j-r-list-c']/div")
        for word in words:
            author = word.xpath(".//a/text()").get()
            author = "".join(author).strip()
            duanzi = {"author": author}
            print(duanzi)
            yield duanzi


Remember: the yield must actually emit the item, and it must be indented inside the for loop. If it is not, the pipeline's process_item function is never called at all, and no amount of fiddling with settings.py will fix it.
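
As a side note, instead of yielding a plain dict you can declare the structure once in items.py and yield an Item. This is optional, and the DuanziItem class and its field below are my own illustration rather than part of the original project:

import scrapy


class DuanziItem(scrapy.Item):
    # One declared field per key the spider yields
    author = scrapy.Field()

The spider would then yield DuanziItem(author=author). Be aware that an Item is not a plain dict, so the hand-written pipeline in the next step would have to call json.dumps(dict(item), ensure_ascii=False) rather than passing the item directly.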

2. pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class BdjPipeline(object):
    def __init__(self):
        self.fp = open("duanzi.json", "w", encoding='utf-8')
        self.fp.write("Hello")

    def open_spider(self, spider):
        print('The spider has started hahaha.....')

    def process_item(self, item, spider):
        item_json = json.dumps(item, ensure_ascii=False)
        print("###" * 100)
        print(item_json)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('The spider has finished hahaha.....')
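
As an aside, Scrapy also ships item exporters that take care of the JSON encoding for you. Below is a sketch of an equivalent pipeline built on JsonLinesItemExporter; the class name BdjExportPipeline is my own, and note that the exporter wants the file opened in binary mode:

from scrapy.exporters import JsonLinesItemExporter


class BdjExportPipeline(object):
    def open_spider(self, spider):
        # JsonLinesItemExporter writes bytes, so open the file in binary mode
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()

If you go this route, point ITEM_PIPELINES at this class instead of BdjPipeline. For a quick one-off dump you can even skip custom pipelines entirely and run scrapy crawl bdj_spider -o duanzi.json, which uses Scrapy's built-in feed export.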

3. settings.py
Uncomment the ITEM_PIPELINES block (this file also sets ROBOTSTXT_OBEY = False), or simply copy the code below over the whole file:

# -*- coding: utf-8 -*-

# Scrapy settings for bdj project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bdj'

SPIDER_MODULES = ['bdj.spiders']
NEWSPIDER_MODULE = 'bdj.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bdj (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bdj.middlewares.BdjSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'bdj.middlewares.BdjDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bdj.pipelines.BdjPipeline': 300
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
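
One detail worth knowing about the ITEM_PIPELINES block: the number (300 here) is the pipeline's order, and Scrapy runs enabled pipelines from the lowest number to the highest within the 0-1000 range. A hypothetical second pipeline would be registered like this:

ITEM_PIPELINES = {
    'bdj.pipelines.BdjPipeline': 300,
    # Hypothetical second pipeline; the higher number means it runs after BdjPipeline
    'bdj.pipelines.BdjExportPipeline': 400,
}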

那么看看爬到的數據吧

成功

Summary

That is everything in this post on installing and using Scrapy; I hope it helps you solve whatever problem brought you here.