11_Jianshu Business Analysis
Table of Contents
- Jianshu Structure Analysis
- Creating the Jianshu Crawler Project
- Creating the crawl Spider
- Configuring the Jianshu Download Format
Companion video course for this post: From Zero to AI in 24 Hours
Jianshu Structure Analysis

Jianshu article pages share a common URL layout: https://www.jianshu.com/p/ followed by a 12-character id of digits and lowercase letters, for example https://www.jianshu.com/p/df7cad4eb8d8. This is the pattern the crawl rules configured later in this post will match.
Creating the Jianshu Crawler Project
```
C:\Users\Administrator\Desktop>scrapy startproject jianshu
New Scrapy project 'jianshu', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\jianshu

You can start your first spider with:
    cd jianshu
    scrapy genspider example example.com
```

Creating the crawl Spider
The spiders created in earlier posts all used the basic template. This crawler needs to download Jianshu articles, which means matching article URLs with regular expressions, so it is better to create this spider from the crawl template instead.
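The article-URL pattern that the Rule later in this post uses can be checked on its own with Python's built-in re module. A minimal sketch: the first two sample URLs are taken from the comments in the spider source, the query string on the second one is an illustrative assumption, and the third is a hypothetical non-article URL for contrast.

```python
import re

# Same pattern as the LinkExtractor's allow= argument in the spider below.
article_url = re.compile(r'https://www.jianshu.com/p/[0-9a-z]{12}.*')

samples = [
    'https://www.jianshu.com/p/df7cad4eb8d8',        # article page: matches
    'https://www.jianshu.com/p/07b0456cbadb?utm=1',  # query string (hypothetical): still matches
    'https://www.jianshu.com/u/df7cad4eb8d8',        # user page, not /p/: no match
]

for url in samples:
    print(url, '->', bool(article_url.match(url)))
```

Because the Rule sets follow=True, every page the spider visits is scanned for more links matching this pattern, which is how it spreads out from the start URL to individual articles.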
```
C:\Users\Administrator\Desktop>cd jianshu

C:\Users\Administrator\Desktop\jianshu>scrapy genspider -t crawl jianshu_spider jianshu.com
Created spider 'jianshu_spider' using template 'crawl' in module:
    jianshu.spiders.jianshu_spider
```

Configuring the Jianshu Download Format
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JianshuSpiderSpider(CrawlSpider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    # Rules tell the spider which links to crawl; regular expressions are supported.
    # Sample article URLs:
    # https://www.jianshu.com/p/df7cad4eb8d8
    # https://www.jianshu.com/p/07b0456cbadb?*****
    # https://www.jianshu.com/p/.*
    rules = (
        Rule(LinkExtractor(allow=r'https://www.jianshu.com/p/[0-9a-z]{12}.*'),
             callback='parse_item', follow=True),
    )

    # name = title = url = collection = scrapy.Field()
    def parse_item(self, response):
        print(response.text)
```

Summary
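To wrap up: the spider now discovers article pages, and parse_item simply prints the raw HTML. A natural next step is extracting fields from each page. The sketch below is stdlib-only for illustration; the extract_title helper and the assumption that the title sits in an &lt;h1&gt; tag are mine, and in a real spider you would use response.css or response.xpath instead of raw regular expressions.

```python
import re

def extract_title(html):
    # Hypothetical helper: assumes the article title is in the first <h1> tag.
    match = re.search(r'<h1[^>]*>(.*?)</h1>', html, re.S)
    return match.group(1).strip() if match else None

sample = '<html><body><h1 class="title">My Jianshu Post</h1></body></html>'
print(extract_title(sample))  # → My Jianshu Post
```

Extracted values like this would then be stored in the Item fields hinted at by the commented-out line in the spider (title, url, collection) and yielded from parse_item instead of printed.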