

Python Crawler Tutorial 32-100: Scraping Bilibili Boruto Comment Data with Scrapy


1. Introduction to Scraping Bilibili Boruto Comment Data

I spent half the day today not knowing what to scrape. While watching dance videos on Bilibili I noticed the comment section, so let's grab some Bilibili comment data. With so many videos and anime to pick from, I settled on Boruto, the Naruto spin-off. URL: https://www.bilibili.com/bangumi/media/md5978/?from=search&seid=16013388136765436883#short The page shows 18,560 short reviews, a manageable amount of data, and we will once again use scrapy.

2. Bilibili Boruto Comment Data Case --- Getting the Link

From the browser developer tools you can easily pick out the API link below. Once you have the link, the rest is routine; I won't repeat how to create a scrapy project here, so let's get straight to the point.

In the spider's parse function I use two yields: one returns items, the other returns follow-up requests. On top of that we implement a new feature, switching the User-Agent on every request, which requires Scrapy's downloader middleware mechanism.

import json

import scrapy

from borenzhuan.items import BorenzhuanItem


class BorenSpider(scrapy.Spider):
    # Paginated short-review API; cursor points at the record to continue from
    BASE_URL = "https://bangumi.bilibili.com/review/web_api/short/list?media_id=5978&folded=0&page_size=20&sort=0&cursor={}"

    name = 'Boren'
    allowed_domains = ['bangumi.bilibili.com']
    start_urls = [BASE_URL.format("76742479839522")]

    def parse(self, response):
        print(response.url)
        resdata = json.loads(response.body_as_unicode())  # response.text on newer Scrapy versions

        if resdata["code"] == 0:
            # Grab the cursor of the last record to build the next page's URL
            if len(resdata["result"]["list"]) > 0:
                data = resdata["result"]["list"]
                cursor = data[-1]["cursor"]

                for one in data:
                    item = BorenzhuanItem()
                    item["author"] = one["author"]["uname"]
                    item["content"] = one["content"]
                    item["ctime"] = one["ctime"]
                    item["disliked"] = one["disliked"]
                    item["liked"] = one["liked"]
                    item["likes"] = one["likes"]
                    item["user_season"] = one["user_season"]["last_ep_index"] if "user_season" in one else ""
                    item["score"] = one["user_rating"]["score"]
                    yield item

                yield scrapy.Request(self.BASE_URL.format(cursor), callback=self.parse)
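With the spider in place, it can be started from the project root with the standard Scrapy CLI; the spider name comes from the name attribute above:

scrapy crawl Boren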

3. Bilibili Boruto Comment Data Case --- Implementing Random User-Agents

Step 1: add some User-Agent strings in the settings file. I collected these from around the internet:

USER_AGENT_LIST=[

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",

"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",

"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",

"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",

"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",

"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",

"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",

"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",

"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",

"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",

"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",

"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

]

Step 2: enable the middleware through DOWNLOADER_MIDDLEWARES in the settings file.

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {
    # 'borenzhuan.middlewares.BorenzhuanDownloaderMiddleware': 543,
    'borenzhuan.middlewares.RandomUserAgentMiddleware': 400,
}

Step 3: in middlewares.py, import the USER_AGENT_LIST variable from the settings module and write the middleware:

import random

from borenzhuan.settings import USER_AGENT_LIST  # the UA pool defined in step 1


class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        # Pick a random UA for each outgoing request
        rand_use = random.choice(USER_AGENT_LIST)
        if rand_use:
            # setdefault only sets the header if it is not already present
            request.headers.setdefault('User-Agent', rand_use)

With that, random UAs are in place. Because the middleware is registered at priority 400 it runs before Scrapy's built-in UserAgentMiddleware (priority 500), whose own setdefault then leaves our header untouched. To verify, add the following line to the parse function:

print(response.request.headers)

4. Bilibili Boruto Comment Data --- Completing the Item

This step is simple; these fields are exactly the data we want to save. The item class in items.py:

import scrapy


class BorenzhuanItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
    disliked = scrapy.Field()
    liked = scrapy.Field()
    likes = scrapy.Field()
    score = scrapy.Field()
    user_season = scrapy.Field()

5. Bilibili Boruto Comment Data Case --- Speeding Up the Crawl

Set the following parameters in settings.py:

# Configure maximum concurrent requests performed by Scrapy (default: 16)

CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

DOWNLOAD_DELAY = 1

# The download delay setting will honor only one of:

CONCURRENT_REQUESTS_PER_DOMAIN = 16

CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

COOKIES_ENABLED = False

Explanation

1. Lower the download delay

DOWNLOAD_DELAY = 0

Setting the download delay to 0 removes the pause between requests entirely (the block above keeps a gentler value of 1), but it calls for anti-ban countermeasures. The usual one is User-Agent rotation: build a pool of user agents and pick one at random for each request, which is exactly what our middleware does.

2. Raise the concurrency

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

Scrapy's networking is built on Twisted, which services many requests concurrently on an asynchronous event loop; raising these concurrency settings lets Scrapy keep more requests in flight at the same time, which speeds up the crawl.

3. Disable cookies

COOKIES_ENABLED = False
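Since the comment API requires no login state, dropping cookie handling saves a little overhead. Separately, the autothrottle docs referenced in the settings comment above are worth a look: instead of a fixed delay, Scrapy's built-in AutoThrottle extension adapts the delay to the server's observed latency. A minimal settings.py sketch (the values are illustrative starting points, not from the original tutorial):

AUTOTHROTTLE_ENABLED = True
# Initial delay, used until AutoThrottle has latency samples to work with
AUTOTHROTTLE_START_DELAY = 1.0
# Average number of requests to keep in flight against the site
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0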

6. Bilibili Boruto Comment Data Case --- Saving the Data

Finally, write the save logic in the pipelines.py file:

import csv
import os


class BorenzhuanPipeline(object):

    def __init__(self):
        # Append next to the spiders package; newline="" prevents blank rows in the CSV
        store_file = os.path.dirname(__file__) + '/spiders/bore.csv'
        self.file = open(store_file, "a+", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["author"],
                item["content"],
                item["ctime"],
                item["disliked"],
                item["liked"],
                item["likes"],
                item["score"],
                item["user_season"]
            ))
        except Exception as e:
            print(e.args)
        return item  # hand the item on so any later pipelines still see it

    def close_spider(self, spider):
        self.file.close()
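One detail the walkthrough skips: a pipeline only runs once it is registered in settings.py. A minimal sketch, assuming the default project layout (300 is just the conventional priority):

ITEM_PIPELINES = {
    'borenzhuan.pipelines.BorenzhuanPipeline': 300,
}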

After running the code for a while, it stopped with an error.

I took a look, and it turned out the data had simply been scraped to the end!
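To confirm how much actually landed on disk, counting rows in the output file is a quick sanity check; a small sketch, assuming the bore.csv path used by the pipeline above:

import csv

with open('borenzhuan/spiders/bore.csv', encoding='utf-8') as f:
    # Each row is one short review written by the pipeline
    print(sum(1 for _ in csv.reader(f)))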
