
Python crawler example: scraping Douban book information and saving it

  • Prerequisites (a minimal sketch of how these libraries fit together follows this list)
    • Using the requests library
    • Using the BeautifulSoup library
    • Using the re library and basic regular expressions
    • Using the tqdm (progress bar) library
    • Creating a DataFrame with pandas and saving it as a CSV file
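As referenced above, here is a minimal sketch (not part of the original post) of how the five libraries combine for this kind of task: requests fetches one Douban listing page, BeautifulSoup and re pull a few fields out of it, tqdm wraps the loop with a progress bar, and pandas writes the rows to a CSV file. The listing URL and browser-like header mirror the full script below; the "subject-item", "rating_nums" and "pub" class names are assumptions about Douban's current markup and may need adjusting.

# A minimal sketch under the assumptions noted above: fetch one listing page,
# collect a few fields per book, and save them with pandas.
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

headers = {"User-Agent": "Mozilla/5.0"}  # Douban tends to reject requests without a browser-like User-Agent
listing_url = "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T"

soup = BeautifulSoup(requests.get(listing_url, headers=headers).text, "html.parser")
rows = []
# One <li class="subject-item"> per book is an assumption about the page structure.
for item in tqdm(soup.find_all("li", attrs={"class": "subject-item"})):
    title_tag = item.find("a", attrs={"title": True})
    score_tag = item.find("span", attrs={"class": "rating_nums"})
    pub_tag = item.find("div", attrs={"class": "pub"})
    pub_text = pub_tag.get_text(strip=True) if pub_tag else ""
    author_match = re.match(r"^(.*?)/", pub_text)  # the author is the first "/"-separated field
    rows.append({
        "name": title_tag["title"] if title_tag else None,
        "writer": author_match.group(1).strip() if author_match else pub_text,
        "score": score_tag.get_text() if score_tag else None,
    })

pd.DataFrame(rows).to_csv("douban_sample.csv", index=False, encoding="utf_8_sig")

The full class below follows the same request → parse → collect → save pattern, adding paging, author/country handling and per-book page counts.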

The full code follows; the comments are fairly detailed.

from bs4 import BeautifulSoup
import requests
import re
# import threading
# import want2url
import pandas as pd
from tqdm import tqdm

url = "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=0&type=T"


class douban_crawler():
    send_headers = {
        "Host": "book.douban.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Connection": "close"
    }

    def __init__(self, url, pages):
        """
        :param url: the initial listing page; it determines which category of books is crawled
        :param pages: the number of listing pages to crawl; Douban shows 20 books per page,
                      so this determines how much data is collected
        """
        self.url = url
        self.pages = [20 * i for i in range(pages)]
        self.book_class = ""
        self.book_names = []
        self.book_nations = []
        self.book_writers = []
        self.book_scores = []
        self.book_comments = []
        self.book_sites = []
        self.book_pages = []

    def generate_urls(self):
        idx_urls = []
        # Regular expression that keeps everything up to and including the "?".
        page_key = re.compile(r"(\S*\?)")
        # Extract the fixed part of the URL; it is joined below with the paging variable
        # to build the full list of URLs to crawl.
        # Note that findall() returns a list, so the value usually has to be taken out of it.
        page_main = page_key.findall(self.url)[0]
        # "Assemble" the URL list: Douban puts 20 books on each page and uses the
        # "start" query parameter to control which page is shown.
        for i in self.pages:
            g_url = page_main + "start=" + str(i) + "&type=T"
            idx_urls.append(g_url)
        return idx_urls

    def open_url(self, url=None):
        # If no URL is given, open the initial page supplied when the object was created.
        if url is None:
            url = self.url
        # Send a GET request, take the text of the response,
        # parse it with BeautifulSoup and return the parsed page.
        resp = requests.get(url, headers=self.send_headers)
        return BeautifulSoup(resp.text, "html.parser")

    def get_main_info(self, url):
        """
        Collect the main information available on the listing page itself, without
        opening each book's own page: category, author's country, book title,
        each book's detail-page URL, author, rating and short description.
        :return: lists holding each piece of information
        """
        # Regular expressions for the category, country, author and description.
        book_class_key = re.compile(r": (\D*)")
        book_nation_key = re.compile(r"\[(\D*?)\]")
        book_writer_key1 = re.compile(r"^(\D*?)/")
        book_writer_key2 = re.compile(r"](\D*)$")
        book_comment_key = re.compile(r"<p>(\S*)</p>")
        # Lists that hold the main information. The category is fixed (one listing page
        # is one category), so it only needs to be returned once and needs no list.
        book_names = []
        book_pages = []
        book_nations = []
        book_writers = []
        book_comments = []
        book_scores = []
        # To keep the coupling low, this function only handles a single page;
        # the caller iterates over the URL list (see begin_crawl).
        resp = requests.get(url, headers=self.send_headers)  # send a GET request to the page
        resp_text = resp.text                                # take the text of the response
        soup = BeautifulSoup(resp_text, "html.parser")       # parse the HTML with BeautifulSoup
        # Get the book category.
        book_class = soup.find("h1").get_text(strip=True)
        book_class = book_class_key.findall(book_class)
        # Get the book titles.
        for a in soup.find_all("a"):
            try:
                res = a.get("title")      # the book title
                res_url = a.get("href")   # the book's own detail-page URL
                if res is not None:
                    book_names.append(res)
                    book_pages.append(res_url)
            except:
                pass
        """
        Get the author and the author's country. Non-Chinese authors are listed as
        [country] author, while Chinese authors have no brackets, so two regular
        expressions are used. A few foreign authors are also listed without brackets;
        that is treated as dirty data. To limit its impact, an author with no brackets
        and a name shorter than five characters is assumed to be Chinese.
        """
        for nation in soup.find_all("div", attrs={"class": "pub"}):
            nn = nation.get_text().strip()
            # print(nn)
            book_writer = book_writer_key1.findall(nn)[0]
            if ']' in book_writer:
                book_writers.append(book_writer_key2.findall(book_writer)[0].strip())
            else:
                book_writers.append(book_writer)
            try:
                bn = book_nation_key.findall(nn)
                if bn == [] and len(book_writer) < 5:  # the rule for Chinese authors
                    book_nations.append("中")
                elif bn != []:
                    # print(bn)
                    book_nations.append(bn[0])
                else:
                    book_nations.append("日")
            except:
                book_nations.append("中")
        # Get the short descriptions.
        for comment in soup.find_all("div", attrs={"class": "info"}):
            if comment.find_all("p") == []:
                book_comments.append("无简介")
            else:
                book_comments.append(comment.find_all("p")[0].get_text())
        # Get the ratings.
        for score in soup.find_all("span", attrs={"class": "rating_nums"}):
            book_scores.append(score.get_text())
        # The category is repeated 20 times because each listing page holds 20 books.
        return book_names, book_pages, book_class * 20, book_writers, book_nations, book_comments, book_scores

    def get_page_numbers(self, urls):
        """
        Collect data from each book's own page; currently only the page count.
        :param urls: the detail-page URLs produced by get_main_info
        :return: the page counts of the corresponding books
        """
        book_pagesnumber = []
        print("**** Collecting page counts ****")
        for url in tqdm(urls):
            rrr = requests.get(url, headers=self.send_headers)
            rtext = rrr.text
            in_soup = BeautifulSoup(rtext, 'html.parser')
            # print(in_soup.text)
            # "页数" is the page-count label on the Douban book page.
            page_num = re.compile(r"页数: (\d*)").findall(in_soup.text)
            # Some books have no page-count information; use 0 in that case.
            if page_num == []:
                book_pagesnumber.append(0)
            else:
                book_pagesnumber.extend(page_num)
        return book_pagesnumber

    def begin_crawl(self):
        """
        The "main function" of the class: running it performs the whole crawl.
        :return: all of the collected information lists
        """
        sum_book_names = []
        sum_book_urls = []
        sum_book_class = []
        sum_book_writers = []
        sum_book_nations = []
        sum_book_comments = []
        sum_book_scores = []
        sum_book_pages = []
        urls = self.generate_urls()  # build the URLs of all listing pages to crawl
        print("**** Crawling ****")
        for url in tqdm(urls):
            book_names, book_urls, book_class, book_writers, book_nations, book_comments, book_scores = self.get_main_info(url)
            book_pages = self.get_page_numbers(book_urls)
            sum_book_names.extend(book_names)
            sum_book_urls.extend(book_urls)
            sum_book_class.extend(book_class)
            sum_book_writers.extend(book_writers)
            sum_book_nations.extend(book_nations)
            sum_book_comments.extend(book_comments)
            sum_book_scores.extend(book_scores)
            sum_book_pages.extend(book_pages)
        return sum_book_names, sum_book_urls, sum_book_class, sum_book_writers, sum_book_nations, sum_book_comments, sum_book_scores, sum_book_pages

    def write2csv(self):
        """
        Write the crawled results into a CSV file.
        :return: nothing
        """
        name, url, book_class, writer, nation, comment, score, pages = self.begin_crawl()
        info_df = pd.DataFrame(columns=["name", "url", "class", "writer", "nation", "comment", "score", "pages"])
        info_df["name"] = name
        info_df["url"] = url
        info_df["class"] = book_class
        info_df["writer"] = writer
        info_df["nation"] = nation
        info_df["comment"] = comment
        info_df["score"] = score
        info_df["pages"] = pages
        # The file is named after the captured category; header=None writes no header row.
        info_df.to_csv(f"{book_class[0]}.csv", header=None, encoding="utf_8_sig")


if __name__ == '__main__':
    db_crawler = douban_crawler(url, 5)
    db_crawler.write2csv()
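To crawl a different category, construct douban_crawler with a different Douban tag URL. One note on the output, added here as a usage hint rather than part of the original post: write2csv calls to_csv with header=None, so the file has no header row and its first column holds the saved DataFrame index. The small sketch below shows one way to load the result back; it assumes the default run produced a file named 小说.csv, since the file is named after the captured category.

import pandas as pd

# Restore column names when loading the crawler's output, which was written without a header row.
cols = ["name", "url", "class", "writer", "nation", "comment", "score", "pages"]
df = pd.read_csv("小说.csv", header=None)  # assumed output filename from the default run
df = df.iloc[:, 1:]   # drop the first column, which holds the saved DataFrame index
df.columns = cols
print(df[["name", "writer", "score", "pages"]].head())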

Recommended reading:
Usage of the requests library
Usage of the BeautifulSoup library
About the tqdm library
Getting started with pandas
