Scraping novels from Qidian with Python: a simple crawler for 起点中文网 (for learning only)

Published: 2023/12/14 · python · 豆豆

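Preface

During my internship I taught myself VBA. Now I am picking up the Python I learned in class again, and this post records my progress.

This article is for learning purposes only. Please do not use it commercially.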

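I. Crawling approach

Pages that do not require a login need only a simple crawler: fetch the novel's table of contents, then follow the links in it to fetch the text of each chapter.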

II. Steps

1. Import the libraries

The code is as follows (example):

import sys

import requests
from bs4 import BeautifulSoup

2. Read the page

The code is as follows (example):

target = 'https://book.qidian.com/info/1024995653#Catalog'
req = requests.get(url=target)

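To catch failed requests and avoid garbled text, add the following two lines. raise_for_status() raises an exception on a 4xx/5xx response, and assigning apparent_encoding makes the text decode with the encoding detected from the response body:

req.raise_for_status()
req.encoding = req.apparent_encoding

The page's HTML is now available: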

html = req.text
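The three steps above (request, status check, encoding fix) can be bundled into one helper. This is only a sketch under my own assumptions: the name fetch_html, the 10-second timeout, and the User-Agent header are illustrative and do not come from the original code.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Fetch a page and return its decoded HTML text.

    `session` may be any object with a requests-style `get` method;
    a new requests.Session is created when none is passed.
    """
    if session is None:
        session = requests.Session()
    # The header and timeout values are assumptions, not requirements of the site.
    resp = session.get(url, timeout=timeout,
                       headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()                 # raise on 4xx/5xx responses
    resp.encoding = resp.apparent_encoding  # decode with the detected encoding
    return resp.text
```

Passing the session in makes it possible to reuse one connection for many chapters, and to swap in a stub object when trying the helper without network access.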

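3. Analyze the HTML

In the HTML we need to find the text and link of each table-of-contents entry, along with the tags that hold them.

Press F12 on the novel's catalog page and inspect the HTML. The table of contents sits under a div tag with class='catalog-content-wrap' and id='j-catalogWrap'. Looking further, child tags such as volume-wrap and volume act as containers for the entries.

Following these down to the a tags that carry the links locates the target, and the analysis is done.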

bf = BeautifulSoup(html,"html.parser")
catalogDiv = bf.find('div',class_='catalog-content-wrap',id='j-catalogWrap')
volumeWrapDiv = catalogDiv.find('div',class_='volume-wrap')
volumeDivs = volumeWrapDiv.find_all('div',class_='volume')
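The same lookup can be tried offline against a small hand-written page. The sample HTML below is my own assumption: it only mirrors the class and id names mentioned above, and the real page will contain more markup. The helper name find_volumes is also hypothetical. Since find() returns None when nothing matches, the sketch checks each step instead of chaining blindly.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the catalog page (an assumption, not real site output).
SAMPLE_HTML = """
<div class="catalog-content-wrap" id="j-catalogWrap">
  <div class="volume-wrap">
    <div class="volume">
      <ul>
        <li><a href="//read.qidian.com/chapter/aaa">Chapter 1</a></li>
        <li><a href="//read.qidian.com/chapter/bbb">Chapter 2</a></li>
      </ul>
    </div>
    <div class="volume">
      <ul>
        <li><a href="//read.qidian.com/chapter/ccc">Chapter 3</a></li>
      </ul>
    </div>
  </div>
</div>
"""


def find_volumes(html):
    """Return the list of volume divs, or an empty list if the layout differs."""
    soup = BeautifulSoup(html, "html.parser")
    catalog = soup.find("div", class_="catalog-content-wrap", id="j-catalogWrap")
    if catalog is None:
        return []
    wrap = catalog.find("div", class_="volume-wrap")
    if wrap is None:
        return []
    return wrap.find_all("div", class_="volume")


print(len(find_volumes(SAMPLE_HTML)))  # number of volume containers found
```

Returning an empty list when a container is missing keeps a layout change on the site from surfacing as an AttributeError on None.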

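4. Extract the information from the tags

Again using BeautifulSoup, take every a tag out of each volume and keep its text together with the corresponding href.

for volumeDiv in volumeDivs:
    aList = volumeDiv.find_all('a')
    for a in aList:
        chapterName = a.string
        chapterHref = a.get('href')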

The whole table of contents has now been collected. Next, use the hrefs to fetch the chapter text.
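The complete code at the end of this post prepends 'https:' to every href, which suggests the catalog links come without a scheme. As an alternative sketch, urljoin from the standard library resolves such links against the catalog URL and also copes with ordinary relative paths. The helper name absolute_url and the sample href are hypothetical.

```python
from urllib.parse import urljoin

CATALOG_URL = "https://book.qidian.com/info/1024995653"


def absolute_url(href, base=CATALOG_URL):
    """Resolve a scheme-less or relative href against the catalog page URL."""
    return urljoin(base, href)


# The href below is a made-up example of a scheme-less link.
print(absolute_url("//read.qidian.com/chapter/example"))
# -> https://read.qidian.com/chapter/example
```

Plain string concatenation is fine as long as every link starts with '//'. urljoin is the safer choice if the site ever mixes absolute and relative links.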

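5. Fetch the chapter text

Open any one of the links and inspect the HTML of the chapter body.

There turn out to be two formats: in one the text is wrapped directly in p tags, and in the other each p contains a span with class=content-wrap that holds the text.

Either way, the text is always inside the div with class='read-content j_readContent', so locate that directly: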

# the catalog hrefs appear to omit the scheme, so prepend it
req = requests.get(url='https:' + chapterHref)
req.raise_for_status()
req.encoding = req.apparent_encoding
html = req.text
bf = BeautifulSoup(html,"html.parser")
mainTextWrapDiv = bf.find('div',class_='main-text-wrap')
readContentDiv = mainTextWrapDiv.find('div',class_='read-content j_readContent')
readContent = readContentDiv.find_all('span',class_='content-wrap')

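At this point the chapter body is available, still wrapped in tags. Because the markup differs from link to link, a conditional separates the two cases:

textContent = ''
if not readContent:
    # plain <p> layout: join the paragraph text with CRLF
    textContent = readContentDiv.get_text('\r\n', strip=True)
else:
    for content in readContent:
        # .string is None when a span holds more than one child node
        if not content.string:
            print('error format')
        else:
            textContent += content.string + '\r\n'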

The chapter text has now been extracted.
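Both layouts can be checked offline with two small hand-written fragments. The sample HTML and the helper name extract_text are my own assumptions. They follow the two formats described above and are not copied from the site.

```python
from bs4 import BeautifulSoup

# Layout 1: text directly inside <p> tags (hand-written sample).
PLAIN = ('<div class="read-content j_readContent">'
         '<p>First line.</p><p>Second line.</p></div>')

# Layout 2: each <p> wraps a span with class content-wrap (hand-written sample).
WRAPPED = ('<div class="read-content j_readContent">'
           '<p><span class="content-wrap">First line.</span></p>'
           '<p><span class="content-wrap">Second line.</span></p></div>')


def extract_text(html):
    """Return the chapter text with one CRLF-separated line per paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="read-content")
    if div is None:
        return ""
    spans = div.find_all("span", class_="content-wrap")
    # Prefer the spans when present, otherwise fall back to the paragraphs.
    nodes = spans if spans else div.find_all("p")
    lines = [node.get_text(strip=True) for node in nodes]
    return "\r\n".join(line for line in lines if line)


print(extract_text(PLAIN) == extract_text(WRAPPED))  # both layouts give the same text
```

Using get_text() instead of .string avoids the None result that .string gives when a tag has more than one child.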

Now all that remains is to loop over the chapters to get the whole novel!
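The loop itself might look like the sketch below. It is written against two callables so that it runs without network access. The names download_all, fetch and write, and the one-second pause between requests, are assumptions of mine and not part of the original script.

```python
import time


def download_all(chapters, fetch, write, delay=1.0):
    """Download every chapter in order and return the ones that failed.

    chapters: iterable of (name, url) pairs
    fetch:    callable taking a url and returning the chapter text
    write:    callable taking (name, text) and storing them
    delay:    seconds to wait between requests (an assumed courtesy pause)
    """
    failed = []
    for name, url in chapters:
        try:
            write(name, fetch(url))
        except Exception as exc:
            # Record the failure and move on to the next chapter.
            failed.append((name, str(exc)))
        time.sleep(delay)
    return failed
```

Returning the failed chapters, instead of only printing a message, makes it possible to retry just those afterwards.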

Summary

The complete code follows:

#!/usr/bin/env python3
# coding=utf-8
# author:sakuyo
#----------------------------------
import sys

import requests
from bs4 import BeautifulSoup


class downloader(object):
    def __init__(self, target):  # initialize state
        self.target = target
        self.chapterNames = []
        self.chapterHrefs = []
        self.chapterNum = 0
        self.session = requests.Session()

    def GetChapterInfo(self):  # collect chapter names and links
        req = self.session.get(url=self.target)
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        html = req.text
        bf = BeautifulSoup(html, "html.parser")
        catalogDiv = bf.find('div', class_='catalog-content-wrap', id='j-catalogWrap')
        volumeWrapDiv = catalogDiv.find('div', class_='volume-wrap')
        volumeDivs = volumeWrapDiv.find_all('div', class_='volume')
        for volumeDiv in volumeDivs:
            aList = volumeDiv.find_all('a')
            for a in aList:
                chapterName = a.string
                chapterHref = a.get('href')
                self.chapterNames.append(chapterName)
                # the hrefs appear to omit the scheme, so prepend it
                self.chapterHrefs.append('https:' + chapterHref)
            self.chapterNum += len(aList)

    def GetChapterContent(self, chapterHref):  # fetch the text of one chapter
        req = self.session.get(url=chapterHref)
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        html = req.text
        bf = BeautifulSoup(html, "html.parser")
        mainTextWrapDiv = bf.find('div', class_='main-text-wrap')
        readContentDiv = mainTextWrapDiv.find('div', class_='read-content j_readContent')
        readContent = readContentDiv.find_all('span', class_='content-wrap')
        textContent = ''
        if not readContent:
            # plain <p> layout: join the paragraph text with CRLF
            textContent = readContentDiv.get_text('\r\n', strip=True)
        else:
            for content in readContent:
                # .string is None when a span holds more than one child node
                if not content.string:
                    print('error format')
                else:
                    textContent += content.string + '\r\n'
        return textContent

    def writer(self, path, name='', content=''):
        # mode 'a' appends to the end of an existing file with the same name
        with open(path, 'a', encoding='utf-8') as f:
            if name is None:
                name = ''
            if content is None:
                content = ''
            f.write(name + '\r\n')
            f.write(content)
            f.write('\r\n')


if __name__ == '__main__':  # entry point
    target = 'https://book.qidian.com/info/1024995653#Catalog'
    dlObj = downloader(target)
    dlObj.GetChapterInfo()
    print('Starting download:')
    for i in range(dlObj.chapterNum):
        try:
            dlObj.writer('test.txt', dlObj.chapterNames[i], dlObj.GetChapterContent(dlObj.chapterHrefs[i]))
        except Exception:
            print('Download failed, skipped')
        # show progress as a percentage of chapters processed so far
        sys.stdout.write(" Downloaded: %.3f%%" % (100 * (i + 1) / dlObj.chapterNum) + '\r')
        sys.stdout.flush()
    print('Download complete')

Original link: https://blog.csdn.net/weixin_47190827/article/details/113087316
