Crawling novels with Python + bs4 and downloading them locally

Published: 2023/12/16

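Preface

With some idle time at work, I wanted to explore Python's bs4 module and try writing a crawler. The company blocks all entertainment sites, but it turned out that a novel site was still reachable, so that became the target. Let's get started.

I used to think this kind of thing was rather low-end; pulling data off web pages never seemed interesting to me. Today, though, I actually started to feel a sense of accomplishment.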

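I. Imports

This crawler mainly uses two libraries:

import requests
from bs4 import BeautifulSoup

The requests module sends the HTTP requests and fetches the response pages; the bs4 module parses those pages and makes it easy to get at their tags.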

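II. The proxy problem

I wanted to test the waters first with a few lines of code to see whether the page could be reached at all. The very first attempt produced this:

Traceback (most recent call last):
  File "D:/PycharmProjects/NovelCrawling/novel_crawling.py", line 109, in pre_op
    book_info = search_by_kewords(keword)
  File "D:/PycharmProjects/NovelCrawling/novel_crawling.py", line 89, in search_by_kewords
    soup = BeautifulSoup(result_html, 'lxml')
  File "D:\python\lib\site-packages\bs4\__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
HTTPSConnectionPool(host='www.13800100.com', port=443): Max retries exceeded with url: /index.php?s=index/search&name=%E6%96%97%E7%BD%97%E5%A4%A7%E9%99%86&page=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002029189FD30>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Very strange: the browser could open the site normally, and a proxy was configured on the company machine, so why did the request fail? With that question in mind I searched Baidu, and the most common answer looked like this:

response = requests.get(url, verify=False)

That is, pass verify=False, which turns off TLS certificate verification. I did not really understand it at the time, so I could only try it. Sadly, it made no difference. (In hindsight it could not have helped: "getaddrinfo failed" means the host name could not be resolved, which happens before any TLS handshake.)
Other people said this error is often caused by a misspelled address, but I checked carefully and the address was fine.

Frustrating. After half an hour of searching I still had no answer and was ready to give up; being bored is fine, and playing on my phone is not so bad either. Then it occurred to me: does the requests module need its own proxy setting?

I searched again with that question, and requests really can be given a proxy. That gave me hope, and in the end it solved the problem. The code:

proxies = {
    'http': 'http://xxx.xxx.xxx.xxx:xx',   # proxy used for http requests
    'https': 'http://xxx.xxx.xxx.xxx:xx',  # proxy used for https requests
}
response = requests.get(url, verify=False, proxies=proxies)

I picked up one more point here. The configuration above means that http requests go through the 'http' proxy and https requests go through the 'https' proxy, but that does not mean the proxy for https must itself be an https address. Both of my entries point to an http proxy address. It sounds a little convoluted, but put simply: when the request is https, it also goes through the http proxy I configured.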
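The lookup rule described above can be mimicked in a few lines. This is only a sketch of the idea, not how requests is implemented internally (requests also takes proxy environment variables into account, for example). `pick_proxy` is a hypothetical helper, and the proxy addresses are placeholders from a documentation-only address range.

```python
from urllib.parse import urlsplit

# Placeholder proxy addresses (assumption); replace with your own proxy.
proxies = {
    'http': 'http://192.0.2.10:8080',
    'https': 'http://192.0.2.10:8080',  # an http proxy serving https requests
}


def pick_proxy(url, proxies):
    """Return the proxy configured for the scheme of `url`, or None."""
    scheme = urlsplit(url).scheme.lower()
    return proxies.get(scheme)


# The key is chosen by the scheme of the *target* URL,
# not by the scheme of the proxy address itself.
print(pick_proxy('https://www.13800100.com', proxies))
print(pick_proxy('http://www.13800100.com', proxies))
```

The dictionary keys describe the request being made, while the values describe where to send it, which is why both values can legitimately start with `http://`.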

Of course, if you run this at home you will not have to deal with any of these annoyances.

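III. The crawling process

The site crawled this time is 138看書網 (https://www.13800100.com). I will not record the analysis phase here; it was mostly a matter of inspecting the pages to locate the elements that hold the needed information. What follows is the general approach.

It actually feels quite simple. Overall there are three steps:

1. Get the chapter list and work out the page URL of each chapter
2. Parse each chapter page to get the chapter text
3. Save the text to a text file

1. Getting the chapter list
Start by inspecting the novel's table-of-contents page.

[Screenshot: the chapter list page]

Using the browser's F12 developer tools, work out where the chapter elements sit. The goal is to extract the reading URL of every chapter, which can be done by locating the links with CSS selectors:

# Extract the chapter links from the page content
def get_download_page_urls(htmlContent):
    # Build a soup object for easier processing
    soup = BeautifulSoup(htmlContent, 'lxml')
    # All chapter <a> tags
    li_as = soup.select('.bd>ul>.cont-li>a')
    # Novel title
    text_name = soup.select('.cate-tit>h2')[0].text
    # Chapter URLs
    dowload_urls = []
    for a in li_as:
        dowload_urls.append(f"{base_url}{a['href']}")
    return [text_name, dowload_urls]

This returns the novel's title together with the links to all of its chapters.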
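The function above builds each chapter URL by concatenating `base_url` and the `href`, which works as long as every `href` is a site-relative path starting with `/`. A slightly more defensive sketch, assuming nothing about the site beyond that, uses `urllib.parse.urljoin`. The helper name `absolute_url` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://www.13800100.com'


def absolute_url(href, base=base_url):
    """Resolve an href against the site root, whatever form it takes."""
    return urljoin(base + '/', href)


# Hypothetical href values, for illustration only.
print(absolute_url('/list/123/1.html'))             # site-relative path
print(absolute_url('list/123/1.html'))              # path without leading slash
print(absolute_url('https://example.com/1.html'))   # already absolute
```

With plain concatenation the second form would produce a malformed host name and the third would produce a doubled URL; `urljoin` handles all three.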

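2. Getting the text of each chapter
With the chapter links in hand, all that remains is to parse each reading page and save its content to a file.

[Screenshot: a chapter reading page]

The code:

# Download chapter by chapter (needs "import os")
def download_by_chapter(article_name, url, index):
    '''
    article_name: novel title
    url: chapter reading URL
    index: chapter number
    '''
    content = get_content(url)
    soup = BeautifulSoup(content, 'lxml')
    # Chapter title
    title = soup.select('.chapter-page h1')[0].text
    # Author line
    author = soup.select('.chapter-page .author')[0].text.replace('\n', '')
    # Chapter body: serialize to HTML, drop <br/> tags and newlines
    txt = soup.select('.chapter-page .note')[0].prettify().replace('<br/>', '') \
        .replace('\n', '')
    # Cut off the opening tag and the closing </div>. removesuffix (Python 3.9+)
    # replaces the original rstrip('</div>'), which strips a set of characters
    # rather than a suffix.
    txt = txt[txt.find('>') + 1:].removesuffix('</div>').replace(' ', '\n').strip()
    path = os.path.join(article_name, f"{'%04d' % index}_{title}.txt")
    with open(path, mode='w', encoding='utf-8') as txt_file:
        # Assumption: the character replaced here was a non-breaking space
        # ('\xa0'); it was lost when this page was extracted.
        txt_file.write(f'{title}\n\n{author}\n\n{txt}'.replace('\xa0', ' '))

For the chapter body I could have used .text to get just the text content and processed that, but in practice it was hard to split into paragraphs and left a lot of odd characters. After several attempts I settled on prettify(), which formats the element as an HTML string that can then be cleaned up.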
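For comparison, and only as a sketch of an alternative that the code above does not use: Beautiful Soup's `stripped_strings` generator yields each text fragment with the surrounding whitespace removed, which gives one fragment per paragraph when paragraphs are separated by `<br/>`. The markup below is made up for illustration (the real `.note` element may be structured differently), and `html.parser` is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating a chapter body (assumption, not the real page).
html = '''
<div class="note">
  First paragraph.<br/>
  Second paragraph.<br/>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
note = soup.select('.note')[0]

# Each text node between <br/> tags becomes one stripped string;
# whitespace-only fragments are skipped.
paragraphs = list(note.stripped_strings)
text = '\n'.join(paragraphs)
print(text)
```

Whether this is cleaner than the `prettify()` approach depends on the actual markup; if a paragraph contains inline tags, it would be split into several fragments.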
One thing I learned here is how to format numbers:

'%04d' % index  # format index as a four-digit number, zero-padded on the left
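A few equivalent ways to get the same zero padding are shown below as a small illustration. Padding matters here because it makes the chapter files sort in reading order when listed by name. The title value is a placeholder.

```python
index = 7

old_style = '%04d' % index      # printf-style formatting, as used in this article
new_style = f'{index:04d}'      # the same format spec inside an f-string
zfilled = str(index).zfill(4)   # string method that pads with zeros

print(old_style, new_style, zfilled)  # → 0007 0007 0007

# The width is a minimum, not a limit: longer numbers are not truncated.
print('%04d' % 12345)  # → 12345

# Building a chapter file name ('Chapter title' is a placeholder value).
title = 'Chapter title'
print(f'{index:04d}_{title}.txt')  # → 0007_Chapter title.txt
```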

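Sometimes one file per chapter is not what we want, so I added a function that downloads everything into a single file:

# Download into a single file
def download_one_book(txt_file, url, index):
    '''
    txt_file: open text file object to write to
    url: chapter reading URL
    index: chapter number
    '''
    content = get_content(url)
    soup = BeautifulSoup(content, 'lxml')
    # Chapter title
    title = soup.select('.chapter-page h1')[0].text
    # Author line
    author = soup.select('.chapter-page .author')[0].text.replace('\n', '')
    # Chapter body: serialize to HTML, drop <br/> tags and newlines
    txt = soup.select('.chapter-page .note')[0].prettify().replace('<br/>', '') \
        .replace('\n', '')
    # removesuffix (Python 3.9+) replaces the original rstrip('</div>'),
    # which strips a set of characters rather than a suffix.
    txt = txt[txt.find('>') + 1:].removesuffix('</div>').replace(' ', '\n').strip()
    # Assumption: '\xa0' (non-breaking space) stands in for a character
    # that was lost when this page was extracted.
    txt_file.write(f'{title}\n\n'.replace('\xa0', ' '))
    # Write the author only for the first chapter
    if index == 0:
        txt_file.write(f'{author}\n\n'.replace('\xa0', ' '))
    txt_file.write(f'{txt}\n\n'.replace('\xa0', ' '))
    txt_file.flush()

IV. Extending the code

Clearly the program is a bit rigid: the user has to go to 138看書網 by hand and find the novel's table-of-contents URL before this program can download it. So after analysing the pages, I wrote a function that searches the site for novels:

# Search for books by keyword
def search_by_kewords(keyword, page=1):
    '''
    keyword: search keyword
    page: page number to start from
    '''
    book_info = {}
    while True:
        search_url = f'{base_url}/index.php?s=index/search&name={keyword}&page={page}'
        result_html = get_content(search_url)
        soup = BeautifulSoup(result_html, 'lxml')
        books_a = soup.select('.s-b-list .secd-rank-list dd>a')
        for book_a in books_a:
            # Number the books continuously across pages; numbering from 1 on
            # every page would let later pages overwrite earlier entries.
            book_info[f'{len(book_info) + 1}'] = [
                book_a.text.replace('\n', '').replace('\t', ''),
                f"{base_url}{book_a['href']}".replace('book', 'list'),
            ]
        if len(books_a) == 0:
            break
        page += 1
    # Note: page is now the first page that returned no results, so the
    # page count printed here is one more than the number of pages with books.
    print(f'Found {page} pages, {len(book_info.keys())} books.')
    print('--------------------------')
    return book_info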
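The search URL above puts the raw keyword straight into the query string. requests percent-encodes it before sending, which is why the URL in the traceback earlier contains `%E6%96%97...`. As a sketch, the encoding can be made explicit with `urllib.parse.quote`; the helper name `build_search_url` is hypothetical, and the URL pattern is the one used in this article.

```python
from urllib.parse import quote

base_url = 'https://www.13800100.com'


def build_search_url(keyword, page=1):
    """Build the search URL with the keyword percent-encoded as UTF-8."""
    return f'{base_url}/index.php?s=index/search&name={quote(keyword)}&page={page}'


url = build_search_url('斗罗大陆')
print(url)
# → https://www.13800100.com/index.php?s=index/search&name=%E6%96%97%E7%BD%97%E5%A4%A7%E9%99%86&page=1
```

Encoding explicitly also protects against keywords that contain characters such as `&` or `#`, which would otherwise change the meaning of the URL.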

After tidying up a few details, this is the complete program:

import os
import sys
import time
import traceback
import warnings

import requests
from bs4 import BeautifulSoup

# Ignore warnings (verify=False triggers an insecure-request warning)
warnings.filterwarnings('ignore')

# Proxy configuration
proxies = {
    'http': 'http://xxx.xxx.xxx.xxx:xx',   # proxy used for http requests
    'https': 'http://xxx.xxx.xxx.xxx:xx',  # proxy used for https requests
}

# Address of 138看書網
base_url = 'https://www.13800100.com'


# Fetch the content of a page
def get_content(url):
    try:
        # user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36"
        response = requests.get(url, verify=False, proxies=proxies)  # when a proxy is needed
        # response = requests.get(url, verify=False)  # when no proxy is needed
        response.raise_for_status()  # raise an exception for 4xx/5xx status codes
        return response.content.decode(encoding=response.encoding)  # decoded page content
    except requests.exceptions.ConnectionError as ex:
        # Falls through and returns None; callers then fail inside
        # BeautifulSoup(None, ...), as in the traceback shown earlier.
        print(ex)


# Extract the chapter links from the page content
def get_download_page_urls(htmlContent):
    # Build a soup object for easier processing
    soup = BeautifulSoup(htmlContent, 'lxml')
    li_as = soup.select('.bd>ul>.cont-li>a')
    text_name = soup.select('.cate-tit>h2')[0].text
    dowload_urls = []
    for a in li_as:
        dowload_urls.append(f"{base_url}{a['href']}")
    return [text_name, dowload_urls]


# Download chapter by chapter
def download_by_chapter(article_name, url, index):
    content = get_content(url)
    soup = BeautifulSoup(content, 'lxml')
    # Chapter title
    title = soup.select('.chapter-page h1')[0].text
    # Author line
    author = soup.select('.chapter-page .author')[0].text.replace('\n', '')
    # Chapter body: serialize to HTML, drop <br/> tags and newlines
    txt = soup.select('.chapter-page .note')[0].prettify().replace('<br/>', '') \
        .replace('\n', '')
    # removesuffix (Python 3.9+) replaces the original rstrip('</div>'),
    # which strips a set of characters rather than a suffix.
    txt = txt[txt.find('>') + 1:].removesuffix('</div>').replace(' ', '\n').strip()
    path = os.path.join(article_name, f"{'%04d' % index}_{title}.txt")
    with open(path, mode='w', encoding='utf-8') as txt_file:
        # Assumption: '\xa0' (non-breaking space) stands in for a character
        # that was lost when this page was extracted.
        txt_file.write(f'{title}\n\n{author}\n\n{txt}'.replace('\xa0', ' '))


# Download into a single file
def download_one_book(txt_file, url, index):
    content = get_content(url)
    soup = BeautifulSoup(content, 'lxml')
    # Chapter title
    title = soup.select('.chapter-page h1')[0].text
    # Author line
    author = soup.select('.chapter-page .author')[0].text.replace('\n', '')
    # Chapter body
    txt = soup.select('.chapter-page .note')[0].prettify().replace('<br/>', '') \
        .replace('\n', '')
    txt = txt[txt.find('>') + 1:].removesuffix('</div>').replace(' ', '\n').strip()
    txt_file.write(f'{title}\n\n'.replace('\xa0', ' '))
    # Write the author only for the first chapter
    if index == 0:
        txt_file.write(f'{author}\n\n'.replace('\xa0', ' '))
    txt_file.write(f'{txt}\n\n'.replace('\xa0', ' '))
    txt_file.flush()


# Search for books by keyword
def search_by_kewords(keyword, page=1):
    book_info = {}
    while True:
        search_url = f'{base_url}/index.php?s=index/search&name={keyword}&page={page}'
        result_html = get_content(search_url)
        soup = BeautifulSoup(result_html, 'lxml')
        books_a = soup.select('.s-b-list .secd-rank-list dd>a')
        for book_a in books_a:
            # Number the books continuously across pages; numbering from 1 on
            # every page would let later pages overwrite earlier entries.
            book_info[f'{len(book_info) + 1}'] = [
                book_a.text.replace('\n', '').replace('\t', ''),
                f"{base_url}{book_a['href']}".replace('book', 'list'),
            ]
        if len(books_a) == 0:
            break
        page += 1
    # Note: page is now the first page that returned no results, so the
    # page count printed here is one more than the number of pages with books.
    print(f'Found {page} pages, {len(book_info.keys())} books.')
    print('--------------------------')
    return book_info


# Main program
def pre_op():
    start = time.perf_counter()
    try:
        print(f'This program downloads novels from 138看書網: {base_url}')
        print('Enter a keyword to search for books:')
        keword = input()
        print('Searching...')
        book_info = search_by_kewords(keword)
        print('Choose the number of the book to download:')
        print('**********************')
        for index in book_info.keys():
            print(f"{index}: {book_info[index][0]}")
        print('**********************')
        c = input('Your choice: ')
        result_html = get_content(book_info[c][1])
        dowload_urls = get_download_page_urls(result_html)
        print('Download links collected!')
        print('----------------------')
        print('Choose a download mode:')
        print('1. One file per chapter')
        print('2. Whole book in one file')
        c = input()
        # None rather than '' so that close() is only called on a real file
        txt_file = None
        if not os.path.exists(fr'./{dowload_urls[0]}'):
            os.mkdir(fr'./{dowload_urls[0]}')
        if c == '2':
            txt_file = open(os.path.join(dowload_urls[0], f'{dowload_urls[0]}_book.txt'),
                            mode='a+', encoding='utf-8')
            # Write the book title once, not before every chapter
            txt_file.write(f'{dowload_urls[0]}\n\n')
        for index, dowload_url in enumerate(dowload_urls[1]):
            if c == '1':
                download_by_chapter(dowload_urls[0], dowload_url, index + 1)
            else:
                download_one_book(txt_file, dowload_url, index)
            percent = str(round(float(index + 1) * 100 / float(len(dowload_urls[1])), 2))
            # '\r' returns to the start of the line so the progress overwrites itself
            sys.stdout.write(f'\rDownloading... {percent} %')
            sys.stdout.flush()
        if txt_file is not None:
            txt_file.close()
        print(f'\nDownload complete! {len(dowload_urls[1])} chapters in total. '
              f'Time taken: {str(round(time.perf_counter() - start, 2))}s.')
        print('=======================================================')
    except Exception:
        traceback.print_exc()


if __name__ == '__main__':
    pre_op()
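One fragile spot in the program is that `get_content` prints a connection error and then implicitly returns `None`, so the failure only surfaces later as the confusing `TypeError` from `BeautifulSoup` seen in the traceback earlier. A minimal sketch of failing earlier with a clearer message is shown below. `require_html` is a hypothetical helper, and the fetch function is passed in so the example runs without any network access.

```python
from typing import Callable, Optional


def require_html(fetch: Callable[[str], Optional[str]], url: str) -> str:
    """Call fetch(url) and raise a clear error if it returned nothing."""
    html = fetch(url)
    if html is None:
        raise RuntimeError(f'no content returned for {url}')
    return html


# Stand-in fetchers for illustration; in the program above, `fetch`
# would be get_content.
def fake_ok(url):
    return '<p>ok</p>'


def fake_fail(url):
    return None


print(require_html(fake_ok, 'https://example.com'))

try:
    require_html(fake_fail, 'https://example.com')
except RuntimeError as ex:
    print(f'caught: {ex}')
```

In the program itself this would amount to calling `require_html(get_content, url)` wherever a page is fetched before it is handed to `BeautifulSoup`.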

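Here is the final result:

This program downloads novels from 138看書網: https://www.13800100.com
Enter a keyword to search for books:
斗羅大陸
Searching...
Found 4 pages, 20 books.
--------------------------
Choose the number of the book to download:
**********************
1: 斗羅大陸之冰凰斗羅
2: 斗羅大陸之我本藍顏
3: 斗羅大陸之劍決天下
4: 斗羅大陸之極限
5: 斗羅大陸III龍王傳說（龍王傳說）
6: 斗羅大陸之青蓮劍帝姬
7: 斗羅大陸3龍王傳說
8: 斗羅大陸之神圣龍斗羅
9: 斗羅大陸之昊天傳說
10: 斗羅大陸之白鳳傳奇
11: 斗羅大陸之紅顏系統
12: 斗羅大陸之仙神紀
13: 斗羅大陸之時崎狂三
14: 斗羅大陸lll龍之御塵
15: 斗羅大陸之焰門傳奇
16: 斗羅大陸國服達摩玉小剛
17: 斗羅大陸的魔法師
18: 斗羅大陸一劍傾世
19: 斗羅大陸之靈魂手筆
20: 斗羅大陸之傾盡天下
**********************
Your choice: 11
Download links collected!
----------------------
Choose a download mode:
1. One file per chapter
2. Whole book in one file
2
Downloading... 100.0 %
Download complete! 153 chapters in total. Time taken: 220.17s.
=======================================================

That is what I learned today. I had hoped to get over my weakness with multithreading and use it to speed up the download, but my grasp of multithreading is really shaky and I could not get it working after a fair amount of fiddling. If anyone can help me turn this into a multithreaded version, I would be very grateful.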
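One possible direction for the multithreading mentioned above, offered only as a sketch and not as a tested replacement for the program: the standard library's `concurrent.futures.ThreadPoolExecutor` can fetch several chapters at once, and its `map` method returns results in the same order as the input URLs, so chapters can still be written out in sequence. `fetch_chapter` below is a stand-in that does no network access, `download_all` is a hypothetical helper, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chapter(url):
    """Stand-in for the real work (fetch the page and parse it).

    Returns a (title, text) pair. In the real program this is where
    get_content and the BeautifulSoup parsing would go.
    """
    return f'Chapter from {url}', 'chapter text'


def download_all(urls, fetch=fetch_chapter, max_workers=8):
    """Fetch all chapters concurrently and return them in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of `urls`, even when the
        # underlying calls finish out of order.
        return list(pool.map(fetch, urls))


# Placeholder URLs (assumption); the real list comes from
# get_download_page_urls.
urls = [f'https://example.com/chapter/{n}' for n in range(1, 6)]
chapters = download_all(urls)

for title, text in chapters:
    print(title)
```

Threads suit this job because the time is spent waiting on the network rather than computing. Keeping `max_workers` small is also kinder to the site being crawled.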

Finally, feel free to leave a comment and discuss. Please point out anything that is wrong.
