Getting Started with Python Web Scraping
Target URL: https://www.23hh.com/book/0/189/

Goal: retrieve the novel's chapter index and the content of each chapter.

Libraries needed: requests, BeautifulSoup, and re. requests issues the HTTP requests; BeautifulSoup and re parse the responses and extract the data.

Install the third-party packages with `pip install requests` and `pip install beautifulsoup4` (re ships with Python).
After analyzing the page's HTML source, set up the project:
1. Create a testcraw package
2. Create craw_site.py to fetch the chapter index and the link to each chapter
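The post does not reproduce craw_site.py itself, but subpage.py below imports `result` from it and iterates over `(chapter_title, chapter_url)` pairs. A minimal stdlib-only sketch of that interface (using a regex instead of BeautifulSoup for self-containment; the link pattern and the site's GBK encoding are assumptions, and the real markup on 23hh.com may differ):

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen


def parse_chapters(html, base_url):
    """Extract (chapter_title, chapter_url) pairs from index-page HTML.

    ASSUMPTION: chapter links look like <a href="12345.html">Title</a>;
    adjust the pattern to the actual markup of the index page.
    """
    pairs = []
    for href, title in re.findall(r'<a href="(\d+\.html)">([^<]+)</a>', html):
        pairs.append((title, urljoin(base_url, href)))
    return pairs


def result(website):
    """Fetch the index page and return (chapter, link) pairs."""
    with urlopen(website, timeout=60) as resp:
        # ASSUMPTION: the site serves GBK-encoded pages.
        html = resp.read().decode('gbk', errors='ignore')
    return parse_chapters(html, website)
```

Keeping the parsing in its own `parse_chapters` function makes it easy to test against a saved HTML snippet without hitting the network.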
3. Create mysql_helper.py to save the data
```python
import pymysql


class MysqlTool(object):
    """Small helper that wraps pymysql connection/cursor handling."""

    def getConn(self):
        conn = None
        try:
            conn = pymysql.connect(host='localhost', user='root',
                                   password='5180', port=3306, db='fictions')
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))  # print the error in red
        return conn

    def closeConn(self, conn):
        try:
            if conn is not None:
                conn.commit()
                conn.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def getCursor(self, conn):
        cur = None
        try:
            if conn is not None:
                cur = conn.cursor()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))
        return cur

    def closeCursor(self, cur):
        try:
            if cur is not None:
                cur.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def insert(self, cur, chapter='', content=''):
        sql = 'insert into perfect_world(chapter, content) values(%s, %s);'
        count = cur.execute(sql, (chapter, content))
        if count > 0:
            print('{} scraped successfully'.format(chapter))
```

Before running, create the fictions database and a perfect_world table with chapter and content columns.
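The original post showed the table structure only as a screenshot. Based on the SQL in `insert()`, a definition along these lines would work (the id column and the column types are assumptions):

```sql
CREATE DATABASE IF NOT EXISTS fictions DEFAULT CHARACTER SET utf8mb4;

USE fictions;

CREATE TABLE IF NOT EXISTS perfect_world (
    id      INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
    chapter VARCHAR(255) NOT NULL,           -- chapter title
    content LONGTEXT                         -- chapter body text
);
```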
4. Create subpage.py to fetch each chapter's body text
```python
import re

import requests
from bs4 import BeautifulSoup

from testcraw.craw_site import result
from testcraw.mysql_helper import MysqlTool


def test(website):
    for chapter, site in result(website):
        content = ''
        res = requests.get(url=site, timeout=60)
        res.raise_for_status()
        res.encoding = res.apparent_encoding  # avoid mojibake from the page encoding
        soup = BeautifulSoup(res.text, 'html.parser')
        # The chapter body lives in the element with id="content".
        for block in soup.find_all(attrs={'id': 'content'}):
            for line in block.stripped_strings:
                content += line + '\n'
        # Strip the site's embedded ad line (first occurrence only).
        content = re.sub(pattern='純文字在線閱讀本站域名手機同步閱讀請訪問',
                         repl='', string=content, count=1)
        mt = MysqlTool()
        conn = mt.getConn()
        cur = mt.getCursor(conn)
        mt.insert(cur, chapter, content)
        mt.closeCursor(cur)
        mt.closeConn(conn)


test('https://www.23hh.com/book/0/189/')
```

Part of the scraped content:
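The `re.sub` call above removes the site's embedded ad line from the text assembled out of `stripped_strings`. That cleanup step can be illustrated in isolation (the ad string is the de-garbled version used above; the exact wording on the live site may differ):

```python
import re

# De-garbled ad line that 23hh.com embeds in chapter pages (assumed wording).
AD_LINE = '純文字在線閱讀本站域名手機同步閱讀請訪問'


def clean_chapter(lines):
    """Join extracted text lines and drop the first ad occurrence,
    mirroring the stripped_strings loop and re.sub call in subpage.py."""
    content = ''
    for line in lines:
        content += line + '\n'
    # count=1 removes only the first occurrence, matching the scraper.
    return re.sub(pattern=AD_LINE, repl='', string=content, count=1)
```

Because `count=1` is passed, a chapter that legitimately repeated the phrase later in its body would keep the later occurrences intact.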