Getting Started with Python Web Scraping
Target URL: https://www.23hh.com/book/0/189/

Goal: retrieve the novel's chapter index and the content of each chapter.

Libraries needed: requests, BeautifulSoup, and re. requests issues the HTTP requests; BeautifulSoup and re parse the responses and extract the data.

Install the third-party packages with `pip install requests` and `pip install beautifulsoup4` (re ships with Python).
After analyzing the page's HTML source, set up the project:
1. Create a testcraw package
2. Create craw_site.py to fetch the chapter index and the link to each chapter
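The post does not reproduce craw_site.py itself, but subpage.py below imports `result` from it and iterates over `(chapter_title, chapter_url)` pairs. A minimal stdlib-only sketch of that interface (using a regex instead of BeautifulSoup for self-containment; the link pattern and the site's GBK encoding are assumptions, and the real markup on 23hh.com may differ):

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen


def parse_chapters(html, base_url):
    """Extract (chapter_title, chapter_url) pairs from index-page HTML.

    ASSUMPTION: chapter links look like <a href="12345.html">Title</a>;
    adjust the pattern to the actual markup of the index page.
    """
    pairs = []
    for href, title in re.findall(r'<a href="(\d+\.html)">([^<]+)</a>', html):
        pairs.append((title, urljoin(base_url, href)))
    return pairs


def result(website):
    """Fetch the index page and return (chapter, link) pairs."""
    with urlopen(website, timeout=60) as resp:
        # ASSUMPTION: the site serves GBK-encoded pages.
        html = resp.read().decode('gbk', errors='ignore')
    return parse_chapters(html, website)
```

Keeping the parsing in its own `parse_chapters` function makes it easy to test against a saved HTML snippet without hitting the network.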
3. Create mysql_helper.py to save the data
```python
import pymysql


class MysqlTool(object):
    """Small helper that wraps pymysql connection/cursor handling."""

    def getConn(self):
        conn = None
        try:
            conn = pymysql.connect(host='localhost', user='root',
                                   password='5180', port=3306, db='fictions')
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))  # print the error in red
        return conn

    def closeConn(self, conn):
        try:
            if conn is not None:
                conn.commit()
                conn.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def getCursor(self, conn):
        cur = None
        try:
            if conn is not None:
                cur = conn.cursor()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))
        return cur

    def closeCursor(self, cur):
        try:
            if cur is not None:
                cur.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def insert(self, cur, chapter='', content=''):
        sql = 'insert into perfect_world(chapter, content) values(%s, %s);'
        count = cur.execute(sql, (chapter, content))
        if count > 0:
            print('{} scraped successfully'.format(chapter))
```

Before running, create the fictions database and a perfect_world table with chapter and content columns.
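The original post showed the table structure only as a screenshot. Based on the SQL in `insert()`, a definition along these lines would work (the id column and the column types are assumptions):

```sql
CREATE DATABASE IF NOT EXISTS fictions DEFAULT CHARACTER SET utf8mb4;

USE fictions;

CREATE TABLE IF NOT EXISTS perfect_world (
    id      INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
    chapter VARCHAR(255) NOT NULL,           -- chapter title
    content LONGTEXT                         -- chapter body text
);
```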
4. Create subpage.py to fetch each chapter's body text
```python
import re

import requests
from bs4 import BeautifulSoup

from testcraw.craw_site import result
from testcraw.mysql_helper import MysqlTool


def test(website):
    for chapter, site in result(website):
        content = ''
        res = requests.get(url=site, timeout=60)
        res.raise_for_status()
        res.encoding = res.apparent_encoding  # avoid mojibake from the page encoding
        soup = BeautifulSoup(res.text, 'html.parser')
        # The chapter body lives in the element with id="content".
        for block in soup.find_all(attrs={'id': 'content'}):
            for line in block.stripped_strings:
                content += line + '\n'
        # Strip the site's embedded ad line (first occurrence only).
        content = re.sub(pattern='純文字在線閱讀本站域名手機同步閱讀請訪問',
                         repl='', string=content, count=1)
        mt = MysqlTool()
        conn = mt.getConn()
        cur = mt.getCursor(conn)
        mt.insert(cur, chapter, content)
        mt.closeCursor(cur)
        mt.closeConn(conn)


test('https://www.23hh.com/book/0/189/')
```

Part of the scraped content:
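The `re.sub` call above removes the site's embedded ad line from the text assembled out of `stripped_strings`. That cleanup step can be illustrated in isolation (the ad string is the de-garbled version used above; the exact wording on the live site may differ):

```python
import re

# De-garbled ad line that 23hh.com embeds in chapter pages (assumed wording).
AD_LINE = '純文字在線閱讀本站域名手機同步閱讀請訪問'


def clean_chapter(lines):
    """Join extracted text lines and drop the first ad occurrence,
    mirroring the stripped_strings loop and re.sub call in subpage.py."""
    content = ''
    for line in lines:
        content += line + '\n'
    # count=1 removes only the first occurrence, matching the scraper.
    return re.sub(pattern=AD_LINE, repl='', string=content, count=1)
```

Because `count=1` is passed, a chapter that legitimately repeated the phrase later in its body would keep the later occurrences intact.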