Scrapy Crawler: Crawling CSDN Blogs (Part 1)
The previous post gave a brief introduction to Scrapy and the basic structure of a Scrapy crawler, but Scrapy can do far more than that. This post explores more advanced usage, taking CSDN blog posts as the crawl target, and covers structuring the data, persisting it, locating DOM elements with XPath, and writing a custom User-Agent middleware.
First, modify the spider file from the previous post (which crawled the Baidu homepage). The modified code looks like this:
import scrapy


class Spider_CSDN(scrapy.Spider):
    """Spider for crawling CSDN blog pages"""

    # The spider's name. Every new spider must have one; it is the only
    # instance attribute that must be defined.
    name = "spider_CSDN"
    # Domains the spider is allowed to crawl. If empty, any URL may be crawled.
    allowed_domains = ["blog.csdn.net"]
    # Start URLs: the entry points from which the spider begins crawling.
    start_urls = ["https://blog.csdn.net/csd_ct/article/details/109305242"]

    # constructor
    def __init__(self, *args, **kwargs):
        super(Spider_CSDN, self).__init__(*args, **kwargs)

    # parse callback, invoked for every fetched response
    def parse(self, response):
        self.logger.info(f"response is:{response.text}")

1. Structuring data: using the items.py file
The items.py file is usually generated automatically when a crawler project is created from the template (it can also be given a different name). It acts as a data model: it maps the crawled page content onto a Python class, so that the data becomes structured. Suppose we want the following fields from a CSDN blog post: ['author', 'authorLevel', 'authorRank', 'category', 'collectionsCount', 'commentsCount', 'content', 'fansCount', 'keyWords', 'likesCount', 'publishTime', 'requestUrl', 'title', 'viewsCount', 'webPortal']. The items.py file then looks like this:
import scrapy


class CsdnBlogItem(scrapy.Item):
    # define the fields for your item here like:
    # blog title
    title = scrapy.Field()
    # source portal site
    webPortal = scrapy.Field()
    # request URL
    webUrl = scrapy.Field()
    # author
    author = scrapy.Field()
    # keywords
    keyWords = scrapy.Field()
    # blog content
    content = scrapy.Field()
    # number of followers (fans)
    fansCount = scrapy.Field()
    # number of likes
    likesCount = scrapy.Field()
    # number of comments
    commentsCount = scrapy.Field()
    # category
    category = scrapy.Field()
    # number of bookmarks (collections)
    collectionsCount = scrapy.Field()
    # number of views
    viewsCount = scrapy.Field()
    # publish time
    publishTime = scrapy.Field()
    # author's overall rank
    authorRank = scrapy.Field()
    # author's level
    authorLevel = scrapy.Field()
    # crawl task id
    crawlTask_id = scrapy.Field()
    # crawl time
    crawlTime = scrapy.Field()

2. Locating DOM elements with XPath
Once a page has been fetched, we use XPath to locate elements in it and read out the values for each field defined in items.py. XPath was originally designed for quickly finding information in XML documents, but it works just as well for searching HTML. For a detailed introduction and usage, see 【爬蟲(chóng)利器XPath,看這一篇就夠了】. Here we only need two simple pieces of syntax: locating an element and reading its value. The scrapy shell command mentioned earlier can be used to try out XPath expressions interactively; we will use https://blog.csdn.net/csd_ct/article/details/109305242 as the example page.
Type the following on the command line:
scrapy shell https://blog.csdn.net/csd_ct/article/details/109305242

This fetches the page and drops into an interactive shell with a response object available for querying.
First, let's try extracting the blog title with XPath. Open the browser's dev tools (press F12), inspect the page, and locate the element holding the blog title.
The title's DOM node is an h1 element with class title-article and id articleContentId; selecting by the id attribute is the most convenient. In the open shell session, type:
response.xpath("//h1[@id='articleContentId']").extract()返回結(jié)果是:
This selects the h1 title node. Note that the result is a list: because we selected by id, which is unique, the list contains exactly one element. To get the element's text value rather than the whole node, append the text() function to the XPath expression, as in the short sketch below.
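Since the shell screenshots are not reproduced here, a minimal sketch of the session (the XPath paths follow the page structure described above, and extract_first() is the standard Scrapy shortcut for "first match or None"):

# select the whole <h1> node (returns a list of matching node strings)
response.xpath("//h1[@id='articleContentId']").extract()

# append text() to get only the text content of the node
response.xpath("//h1[@id='articleContentId']/text()").extract()

# extract_first() returns the first match directly instead of a list
response.xpath("//h1[@id='articleContentId']/text()").extract_first()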
If you select by class or another attribute instead, the result may contain several elements. For example, to select the elements whose class is text-center,
type response.xpath("//dl[@class='text-center']").extract(), which returns a list of all the matching dl elements.
The selected content is the author information block in the top-left corner: level, rank, number of comments, number of bookmarks and so on. The first dl tag contains an a tag, and extracting the href URL of an a tag is a very common operation: simply append @href to the XPath expression that selects the a tag, as in the sketch below.
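A hedged sketch of that operation (assuming, as described above, that the first dl in the author info block wraps a link; @attr is equivalent to the attribute::attr syntax used in the parse code later):

# read the href attribute of the <a> tag inside the matching dl elements
response.xpath("//dl[@class='text-center']//a/@href").extract_first()

# the title attribute of each dl holds the raw number (level, rank, views, ...)
response.xpath("//dl[@class='text-center']/@title").extract()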
3. Persisting the data
Combining the data model defined in items.py with the XPath selectors, the spider's parse function selects each value in turn and assigns it to the corresponding item field. The code is as follows:
# imports required by the parse method below
import json
import logging
import re
import traceback
from datetime import datetime

def csdnParseBlog(self, response):
    '''Parse the content of a CSDN blog page.'''
    contentItem = CsdnBlogItem()
    # request URL
    contentItem['webUrl'] = response.url.strip().split('?')[0]
    # article title
    try:
        contentItem['title'] = response.xpath("//h1[@id='articleContentId'][1]/text()").extract()[0].strip()
    except Exception as identifier:
        contentItem['title'] = ''
    # author info block: the title attributes hold level, rank, views, comments, collections, fans, ...
    pInfo = response.xpath("//div[@class='data-info d-flex item-tiling']//dl[@class='text-center']/attribute::title").extract()
    # portal site name
    contentItem['webPortal'] = "CSDN"
    # blog author
    try:
        contentItem['author'] = response.xpath("//div[@class='bar-content']//a[@class='follow-nickName']/text()").extract()[0].strip()[0:45]
    except Exception as identifier:
        contentItem['author'] = ''
        self.log(traceback.format_exc(), logging.ERROR)
    # publish time
    try:
        contentItem['publishTime'] = response.xpath("//div[@class='bar-content']//span[@class='time']/text()").extract()[0].strip()
    except Exception as identifier:
        contentItem['publishTime'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        self.log(traceback.format_exc(), logging.ERROR)
    # views count
    try:
        tmp = response.xpath("//div[@class='bar-content']//span[@class='read-count']/text()").extract()[0].strip()
        contentItem['viewsCount'] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['viewsCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # collections (bookmarks) count
    try:
        tmp = response.xpath("//div[@class='bar-content']//span[@class='get-collection']/text()").extract()[0].replace(" ", '').strip()
        contentItem["collectionsCount"] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['collectionsCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # likes count
    try:
        tmp = response.xpath("//span[@id='spanCount']/text()").extract()[0].strip()
        contentItem['likesCount'] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['likesCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # comments count
    try:
        tmp = response.xpath("//li[@class='tool-item tool-active tool-item-comment']//span[@class='count']/text()").extract()[0].strip()
        contentItem['commentsCount'] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['commentsCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # fans count
    try:
        contentItem['fansCount'] = pInfo[6].strip()
    except Exception as identifier:
        contentItem['fansCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # overall rank
    try:
        contentItem['authorRank'] = pInfo[2].strip()
    except Exception as identifier:
        contentItem['authorRank'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # author level
    try:
        contentItem['authorLevel'] = re.findall(r"\d+", pInfo[4])[0].strip()
    except Exception as identifier:
        contentItem['authorLevel'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # article content
    try:
        contentItem['content'] = response.xpath("//div[@id='content_views']").extract()[0].strip()
    except Exception as identifier:
        contentItem['content'] = ""
        self.log(traceback.format_exc(), logging.ERROR)
    # blog category
    try:
        contentItem['category'] = response.xpath('//div[@class="tags-box artic-tag-box"]//a/text()').extract()[0].replace(" ", '').strip()[0:18]
    except Exception as identifier:
        contentItem['category'] = ''
        self.log(traceback.format_exc(), logging.ERROR)
    # keywords
    try:
        contentItem['keyWords'] = response.xpath('//div[@class="tags-box artic-tag-box"]//a/text()').extract()[1].strip()[0:18]
    except Exception as identifier:
        contentItem['keyWords'] = ''
        self.log(traceback.format_exc(), logging.ERROR)
    if len(contentItem['keyWords']) == 0:
        contentItem['keyWords'] = json.loads(self.customArgs.get('filterArg', '{}')).get('keyWords', '')
    # task id
    taskId = self.args.get('_job')
    contentItem['crawlTask_id'] = taskId
    # crawl time
    contentItem['crawlTime'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    self.logger.info(f"item data is:{contentItem}")
    yield contentItem

Do not forget the yield contentItem statement: it is what hands contentItem over to the item pipeline. It is essential for the later processing of the data; without it, none of the processing defined in the pipeline will run.
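The parse method only yields the item; the persistence itself happens in pipelines.py. As a minimal sketch (the class name CsdnBlogPipeline, the output file csdn_blogs.jl, and the JSON-lines format are illustrative assumptions, not taken from the original project), a pipeline that writes every yielded item to disk could look like this:

# pipelines.py -- minimal sketch; names and output format are illustrative assumptions
import json

class CsdnBlogPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('csdn_blogs.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each yielded item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

To activate it, register it in settings.py, for example ITEM_PIPELINES = {'your_project.pipelines.CsdnBlogPipeline': 300} (the project name here is a placeholder).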
Summary

In this post we extended the basic spider from the previous article: we defined a structured data model in items.py, used the scrapy shell and XPath to locate and extract DOM elements from a CSDN blog page, and yielded the populated item so that the pipeline can persist it.