

Scraping Guokr Q&A with Scrapy's CrawlSpider


Approach
1. Start from the Guokr featured Q&A listing: url = https://www.guokr.com/ask/highlight/
2. Automatically collect the URLs of the remaining listing pages
3. Automatically collect the URL of every Q&A detail page on each listing page
4. Parse each detail page with CSS selectors, extracting the question title plus the text and image URLs of the top-ranked answer (a quick scrapy shell check of these selectors is sketched below)
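Before writing the spider, the CSS selectors from step 4 can be sanity-checked interactively. A minimal sketch with scrapy shell; the detail-page URL is a placeholder, substitute any question link copied from the featured listing:

scrapy shell "https://www.guokr.com/question/.../"
>>> response.css('#articleTitle::text').extract_first()      # question title
>>> response.css('.answer-txt p::text').extract()            # top answer text
>>> response.css('.answer-txt img::attr(src)').extract()     # top answer image URLs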

1. Setup

創(chuàng)建一個(gè)scrapy project:

scrapy startproject GUOKE
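This produces the standard Scrapy project layout; the files filled in below (items.py, middlewares.py, pipelines.py, settings.py) all live in the inner GUOKE package:

GUOKE/
    scrapy.cfg
    GUOKE/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py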

創(chuàng)建crawspider file

scrapy genspider -t crawl guoke guokr.com
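This drops a guoke.py skeleton into GUOKE/spiders/. The generated template looks roughly like the sketch below (details vary by Scrapy version); section 2 replaces the placeholder rule and callback with the real ones:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GuokeSpider(CrawlSpider):
    name = 'guoke'
    allowed_domains = ['guokr.com']
    start_urls = ['http://guokr.com/']

    rules = (
        # Placeholder rule generated by the template
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item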

2. Building the framework

(1) items.py: declare the item fields

import scrapy


class GuokeItem(scrapy.Item):
    question = scrapy.Field()
    answer = scrapy.Field()
    img = scrapy.Field()

(2) spider.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from GUOKE.items import GuokeItem


class GuokeSpider(CrawlSpider):
    name = 'guoke'
    allowed_domains = ['guokr.com']
    start_urls = ['https://www.guokr.com/ask/highlight/']

    rules = (
        # Follow every listing-page URL; nothing to parse on these pages.
        # When callback is None, follow defaults to True.
        Rule(LinkExtractor(allow=r'page=\d+')),
        # Extract the Q&A detail-page URLs on each listing page and parse them;
        # no further links need to be followed from there.
        # Testing turned up one unrelated URL, which deny filters out.
        Rule(LinkExtractor(allow='question', deny=('new',)),
             follow=False, callback='parse_item'),
    )

    # Do not name the callback "parse": CrawlSpider defines parse internally,
    # and overriding it breaks the crawl.
    def parse_item(self, response):
        # print(response.url)
        item = GuokeItem()
        # Question title on the detail page, via a CSS selector
        item['question'] = response.css('#articleTitle::text').extract_first().strip()
        # Text and image URLs of the top-ranked answer
        item['answer'] = '\n'.join(response.css('.answer-txt p::text').extract())
        item['img'] = '\n'.join(response.css('.answer-txt img::attr(src)').extract())
        yield item

(3) middlewares.py
Set a random User-Agent on each request through a downloader middleware.

import random


class Ugdownloadmiddleware(object):
    def __init__(self):
        # Pool of User-Agent strings to rotate through
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        ]

    def process_request(self, request, spider):
        # Pick a User-Agent at random
        ug = random.choice(self.user_agent_list)
        # Attach it to the outgoing request
        request.headers['User-Agent'] = ug
        return None

    def process_response(self, request, response, spider):
        # Confirm the header was actually set
        print('User-Agent used:', request.headers['User-Agent'])
        return response
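A note on the priority used later in settings.py: in common Scrapy releases the built-in UserAgentMiddleware is registered at 500 in DOWNLOADER_MIDDLEWARES_BASE, so enabling this middleware at 543 means its process_request runs after the built-in one and the randomly chosen User-Agent is the header that actually goes out.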

(4) pipelines.py
Set up an item pipeline that connects to MongoDB and saves the scraped data.

import pymongo


class GuokePipeline(object):
    def open_spider(self, spider):
        # Connect to MongoDB; host and port are left at their defaults
        client = pymongo.MongoClient()
        # Select (or create) the "test" database
        mydb = client['test']
        # Select (or create) the "guoke" collection
        self.collection = mydb['guoke']

    def process_item(self, item, spider):
        # Convert the item into a plain dict
        data = dict(item)
        # Insert the document
        self.collection.insert_one(data)
        return item
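Once a crawl has run, the stored documents can also be spot-checked from Python. A minimal sketch, assuming MongoDB is on the default localhost:27017 and the pipeline above has written into the test database's guoke collection:

import pymongo

client = pymongo.MongoClient()            # default localhost:27017
collection = client['test']['guoke']
print(collection.count_documents({}))     # number of stored items
for doc in collection.find().limit(3):    # peek at a few questions
    print(doc.get('question'))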

(5) settings.py
As a rule, enable the corresponding settings as soon as each piece of code is written, so nothing gets forgotten.

# Write the log to a file; show only WARNING and above
LOG_FILE = 'guoke.log'
LOG_LEVEL = 'WARNING'
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Download delay; crawling too fast triggers anti-scraping measures
DOWNLOAD_DELAY = 3
# Enable the downloader middleware; the dotted path must match the class defined above
DOWNLOADER_MIDDLEWARES = {
    'GUOKE.middlewares.Ugdownloadmiddleware': 543,
}
# Enable the item pipeline
ITEM_PIPELINES = {
    'GUOKE.pipelines.GuokePipeline': 300,
}

3. Running the spider
(1) Start the MongoDB server, then open the mongo client in a second terminal:

sudo mongod
mongo

(2) Run the spider:

scrapy crawl guoke

4. Checking the database

The scraped questions, answer text, and image URLs should now be in the guoke collection.
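A minimal check from the mongo shell, using the database and collection names set in the pipeline (test and guoke):

> use test
> db.guoke.find().pretty()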
