Scrapy Crawler: Crawling CSDN Blogs (Part 1)
The previous post gave a brief introduction to Scrapy and the basic structure of a Scrapy crawler, but Scrapy can do far more than that. This post explores more advanced usage, taking CSDN blog posts as the crawl target, and covers structuring the data, persisting it, locating DOM elements with XPath, and writing a custom User-Agent middleware.
First, modify the spider file from the previous post (which crawled the Baidu homepage). The modified code looks like this:
import scrapy


class Spider_CSDN(scrapy.Spider):
    """Spider for crawling CSDN blog pages"""

    # The spider's name. Every new spider must have one; it is the only
    # instance attribute that must be defined.
    name = "spider_CSDN"
    # Domains the spider is allowed to crawl. If empty, any URL may be crawled.
    allowed_domains = ["blog.csdn.net"]
    # Start URLs: the entry points from which the spider begins crawling.
    start_urls = ["https://blog.csdn.net/csd_ct/article/details/109305242"]

    # constructor
    def __init__(self, *args, **kwargs):
        super(Spider_CSDN, self).__init__(*args, **kwargs)

    # parse callback, invoked for every fetched response
    def parse(self, response):
        self.logger.info(f"response is:{response.text}")

1. Structuring data: using the items.py file
The items.py file is usually generated automatically when a crawler project is created from the template (it can also be given a different name). It acts as a data model: it maps the crawled page content onto a Python class, so that the data becomes structured. Suppose we want the following fields from a CSDN blog post: ['author', 'authorLevel', 'authorRank', 'category', 'collectionsCount', 'commentsCount', 'content', 'fansCount', 'keyWords', 'likesCount', 'publishTime', 'requestUrl', 'title', 'viewsCount', 'webPortal']. The items.py file then looks like this:
import scrapy


class CsdnBlogItem(scrapy.Item):
    # define the fields for your item here like:
    # blog title
    title = scrapy.Field()
    # source portal site
    webPortal = scrapy.Field()
    # request URL
    webUrl = scrapy.Field()
    # author
    author = scrapy.Field()
    # keywords
    keyWords = scrapy.Field()
    # blog content
    content = scrapy.Field()
    # number of followers (fans)
    fansCount = scrapy.Field()
    # number of likes
    likesCount = scrapy.Field()
    # number of comments
    commentsCount = scrapy.Field()
    # category
    category = scrapy.Field()
    # number of bookmarks (collections)
    collectionsCount = scrapy.Field()
    # number of views
    viewsCount = scrapy.Field()
    # publish time
    publishTime = scrapy.Field()
    # author's overall rank
    authorRank = scrapy.Field()
    # author's level
    authorLevel = scrapy.Field()
    # crawl task id
    crawlTask_id = scrapy.Field()
    # crawl time
    crawlTime = scrapy.Field()

2. Locating DOM elements with XPath
Once a page has been fetched, we use XPath to locate elements in it and read out the values for each field defined in items.py. XPath was originally designed for quickly finding information in XML documents, but it works just as well for searching HTML. For a detailed introduction and usage, see 【爬蟲(chóng)利器XPath,看這一篇就夠了】. Here we only need two simple pieces of syntax: locating an element and reading its value. The scrapy shell command mentioned earlier can be used to try out XPath expressions interactively; we will use https://blog.csdn.net/csd_ct/article/details/109305242 as the example page.
Type the following on the command line:
scrapy shell https://blog.csdn.net/csd_ct/article/details/109305242

This fetches the page and drops into an interactive shell with a response object available for querying.
First, let's try extracting the blog title with XPath. Open the browser's dev tools (press F12), inspect the page, and locate the element holding the blog title.
The title's DOM node is an h1 element with class title-article and id articleContentId; selecting by the id attribute is the most convenient. In the open shell session, type:
response.xpath("//h1[@id='articleContentId']").extract()返回結(jié)果是:
This selects the h1 title node. Note that the result is a list: because we selected by id, which is unique, the list contains exactly one element. To get the element's text value rather than the whole node, append the text() function to the XPath expression, as in the short sketch below.
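Since the shell screenshots are not reproduced here, a minimal sketch of the session (the XPath paths follow the page structure described above, and extract_first() is the standard Scrapy shortcut for "first match or None"):

# select the whole <h1> node (returns a list of matching node strings)
response.xpath("//h1[@id='articleContentId']").extract()

# append text() to get only the text content of the node
response.xpath("//h1[@id='articleContentId']/text()").extract()

# extract_first() returns the first match directly instead of a list
response.xpath("//h1[@id='articleContentId']/text()").extract_first()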
If you select by class or another attribute instead, the result may contain several elements. For example, to select the elements whose class is text-center,
type response.xpath("//dl[@class='text-center']").extract(), which returns a list of all the matching dl elements.
The selected content is the author information block in the top-left corner: level, rank, number of comments, number of bookmarks and so on. The first dl tag contains an a tag, and extracting the href URL of an a tag is a very common operation: simply append @href to the XPath expression that selects the a tag, as in the sketch below.
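A hedged sketch of that operation (assuming, as described above, that the first dl in the author info block wraps a link; @attr is equivalent to the attribute::attr syntax used in the parse code later):

# read the href attribute of the <a> tag inside the matching dl elements
response.xpath("//dl[@class='text-center']//a/@href").extract_first()

# the title attribute of each dl holds the raw number (level, rank, views, ...)
response.xpath("//dl[@class='text-center']/@title").extract()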
3. Persisting the data
Combining the data model defined in items.py with the XPath selectors, the spider's parse function selects each value in turn and assigns it to the corresponding item field. The code is as follows:
# imports required by the parse method below
import json
import logging
import re
import traceback
from datetime import datetime

def csdnParseBlog(self, response):
    '''Parse the content of a CSDN blog page.'''
    contentItem = CsdnBlogItem()
    # request URL
    contentItem['webUrl'] = response.url.strip().split('?')[0]
    # article title
    try:
        contentItem['title'] = response.xpath("//h1[@id='articleContentId'][1]/text()").extract()[0].strip()
    except Exception as identifier:
        contentItem['title'] = ''
    # author info block: the title attributes hold level, rank, views, comments, collections, fans, ...
    pInfo = response.xpath("//div[@class='data-info d-flex item-tiling']//dl[@class='text-center']/attribute::title").extract()
    # portal site name
    contentItem['webPortal'] = "CSDN"
    # blog author
    try:
        contentItem['author'] = response.xpath("//div[@class='bar-content']//a[@class='follow-nickName']/text()").extract()[0].strip()[0:45]
    except Exception as identifier:
        contentItem['author'] = ''
        self.log(traceback.format_exc(), logging.ERROR)
    # publish time
    try:
        contentItem['publishTime'] = response.xpath("//div[@class='bar-content']//span[@class='time']/text()").extract()[0].strip()
    except Exception as identifier:
        contentItem['publishTime'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        self.log(traceback.format_exc(), logging.ERROR)
    # views count
    try:
        tmp = response.xpath("//div[@class='bar-content']//span[@class='read-count']/text()").extract()[0].strip()
        contentItem['viewsCount'] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['viewsCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # collections (bookmarks) count
    try:
        tmp = response.xpath("//div[@class='bar-content']//span[@class='get-collection']/text()").extract()[0].replace(" ", '').strip()
        contentItem["collectionsCount"] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['collectionsCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # likes count
    try:
        tmp = response.xpath("//span[@id='spanCount']/text()").extract()[0].strip()
        contentItem['likesCount'] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['likesCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # comments count
    try:
        tmp = response.xpath("//li[@class='tool-item tool-active tool-item-comment']//span[@class='count']/text()").extract()[0].strip()
        contentItem['commentsCount'] = 0 if len(tmp) <= 0 else tmp
    except Exception as identifier:
        contentItem['commentsCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # fans count
    try:
        contentItem['fansCount'] = pInfo[6].strip()
    except Exception as identifier:
        contentItem['fansCount'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # overall rank
    try:
        contentItem['authorRank'] = pInfo[2].strip()
    except Exception as identifier:
        contentItem['authorRank'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # author level
    try:
        contentItem['authorLevel'] = re.findall(r"\d+", pInfo[4])[0].strip()
    except Exception as identifier:
        contentItem['authorLevel'] = 0
        self.log(traceback.format_exc(), logging.ERROR)
    # article content
    try:
        contentItem['content'] = response.xpath("//div[@id='content_views']").extract()[0].strip()
    except Exception as identifier:
        contentItem['content'] = ""
        self.log(traceback.format_exc(), logging.ERROR)
    # blog category
    try:
        contentItem['category'] = response.xpath('//div[@class="tags-box artic-tag-box"]//a/text()').extract()[0].replace(" ", '').strip()[0:18]
    except Exception as identifier:
        contentItem['category'] = ''
        self.log(traceback.format_exc(), logging.ERROR)
    # keywords
    try:
        contentItem['keyWords'] = response.xpath('//div[@class="tags-box artic-tag-box"]//a/text()').extract()[1].strip()[0:18]
    except Exception as identifier:
        contentItem['keyWords'] = ''
        self.log(traceback.format_exc(), logging.ERROR)
    if len(contentItem['keyWords']) == 0:
        contentItem['keyWords'] = json.loads(self.customArgs.get('filterArg', '{}')).get('keyWords', '')
    # task id
    taskId = self.args.get('_job')
    contentItem['crawlTask_id'] = taskId
    # crawl time
    contentItem['crawlTime'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    self.logger.info(f"item data is:{contentItem}")
    yield contentItem

Do not forget the yield contentItem statement: it is what hands contentItem over to the item pipeline. It is essential for the later processing of the data; without it, none of the processing defined in the pipeline will run.
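The parse method only yields the item; the persistence itself happens in pipelines.py. As a minimal sketch (the class name CsdnBlogPipeline, the output file csdn_blogs.jl, and the JSON-lines format are illustrative assumptions, not taken from the original project), a pipeline that writes every yielded item to disk could look like this:

# pipelines.py -- minimal sketch; names and output format are illustrative assumptions
import json

class CsdnBlogPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('csdn_blogs.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each yielded item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

To activate it, register it in settings.py, for example ITEM_PIPELINES = {'your_project.pipelines.CsdnBlogPipeline': 300} (the project name here is a placeholder).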
Summary

In this post we extended the basic spider from the previous article: we defined a structured data model in items.py, used the scrapy shell and XPath to locate and extract DOM elements from a CSDN blog page, and yielded the populated item so that the pipeline can persist it.