
Scraping product information from Dangdang (当当网) with a Python crawler

Published: 2024/3/7 · python · 28 · 豆豆


I. Environment setup
II. Overview
III. Analyzing the Dangdang pages
- 1. URL patterns
- 2. Parsing the HTML
    - Parsing book product pages
    - Parsing other product pages
IV. Code implementation

I. Environment setup

Environment used:

Python 3.8.0
requests library
re library
bs4 library
PyCharm

II. Overview

The code takes a keyword, builds the URLs of the matching search-result pages, and then scrapes the product information from those pages in bulk. I chose Dangdang because its product information is not loaded dynamically, so it can be scraped straight from the HTML. Sites such as JD.com and Pinduoduo load their pages dynamically, and I have not yet learned how to parse dynamically loaded pages 😅

III. Analyzing the Dangdang pages

1. URL patterns

First, work out the URL of a Dangdang product search. Open the Dangdang home page in a browser and type any product keyword into the search box. The URL of the page that opens has the following form:

url = 'http://search.dangdang.com/?key={}&act=input'.format(keyword)

Paging through the results then shows that each page of results for that keyword has a URL of this form:

url = "http://search.dangdang.com/?key={}&input&page_index={}".format(keyword,page_count)

With these patterns known, the URL of each page can be generated automatically from the keyword, and a request can then be sent to the server to obtain the HTML of each page.
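The two URL patterns above can be wrapped in a small helper. This is only a sketch: the name build_search_url and the percent-encoding step are my additions, not part of the original code, and I have not verified which character encoding Dangdang expects for non-ASCII keywords.

```python
from urllib.parse import quote

BASE_URL = "http://search.dangdang.com/"


def build_search_url(keyword, page_index=None):
    """Return the search URL for a keyword, optionally for one results page."""
    # Percent-encode the keyword (assumption: the site accepts UTF-8 encoding)
    encoded = quote(keyword)
    if page_index is None:
        # First search page, as seen after typing a keyword into the search box
        return "{}?key={}&act=input".format(BASE_URL, encoded)
    # Later pages follow the page_index pattern observed when paging
    return "{}?key={}&input&page_index={}".format(BASE_URL, encoded, page_index)


print(build_search_url("python"))
print(build_search_url("python", 2))
```

For plain ASCII keywords such as python the result is identical to the format strings shown above.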

2. Parsing the HTML

Open any results page in a browser and inspect it with the developer tools. Dangdang's product pages turn out to come in two kinds (oddly enough): pages for books, and pages for all other products.

Parsing book product pages

Inspection shows that all product information sits in li tags under a ul tag, with one li block per product. Individual fields such as name, price, and author are stored in the corresponding p tags under the li, so parsing the attributes of each p tag yields that field. Note that on book pages the ul tag's class attribute is "bigimg".

Parsing other product pages

Other products store their information in essentially the same tag structure as books. The only difference is that the ul tag's class attribute is "bigimg cloth_shoplist". Parsing is similar, except that the code has to check which class the ul tag carries.
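To illustrate the distinction between the two page kinds, here is a minimal sketch that classifies a page by the class attribute of its product list. It deliberately uses only the standard library's html.parser so it runs without extra installs, whereas the article's own code uses BeautifulSoup. The names ListingTypeParser and detect_listing_type are hypothetical, and the HTML strings in the usage lines are made-up fragments, not real Dangdang markup.

```python
from html.parser import HTMLParser


class ListingTypeParser(HTMLParser):
    """Record the classes of the first <ul> whose class list contains 'bigimg'."""

    def __init__(self):
        super().__init__()
        self.listing_classes = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <ul>
        if tag != "ul" or self.listing_classes is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        if "bigimg" in classes:
            self.listing_classes = classes


def detect_listing_type(html_text):
    """Return 'book', 'other', or None when no product list is found."""
    parser = ListingTypeParser()
    parser.feed(html_text)
    if parser.listing_classes is None:
        return None
    # Non-book pages carry the extra class 'cloth_shoplist'
    return "other" if "cloth_shoplist" in parser.listing_classes else "book"


print(detect_listing_type('<ul class="bigimg"><li></li></ul>'))
print(detect_listing_type('<ul class="bigimg cloth_shoplist"><li></li></ul>'))
```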

IV. Code implementation

The code consists of four main functions:

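import re

import requests
from bs4 import BeautifulSoup


def getMaxPageCount(keyword):
    # Fetch the first results page and return the largest page number found
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
    max_page_count = 0
    url = 'http://search.dangdang.com/?key={}&act=input'.format(keyword)
    try:
        rspon = requests.get(url, headers=user_agent)
        rspon.encoding = rspon.apparent_encoding
        rspon.raise_for_status()
    except requests.RequestException as err:
        # rspon may not exist if the request itself failed, so report the exception
        print("Failed to fetch the page count:", err)
        return max_page_count
    html = BeautifulSoup(rspon.text, "html.parser")
    # The pagination entries sit under <ul name="Fy">
    ul_tag = html.find('ul', {"name": "Fy"})
    if ul_tag is None:
        # No pagination list on the page
        return max_page_count
    # Walk the direct child tags only (text nodes are skipped)
    for child in ul_tag.find_all(True, recursive=False):
        # Keep entries whose text starts with a number and track the largest
        match = re.match(r"\d+", str(child.string))
        if match:
            temp_num = int(match.group(0))
            if temp_num > max_page_count:
                max_page_count = temp_num
    return max_page_count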

getMaxPageCount(keyword) fetches the search page for the given product keyword and returns the largest page number available for that kind of product.
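The number-picking step inside getMaxPageCount can be looked at in isolation. The sketch below applies the same re.match(r"\d+") test to a list of pagination labels. The helper name max_page_from_labels and the sample labels are hypothetical, chosen only to show how non-numeric entries are skipped.

```python
import re


def max_page_from_labels(labels):
    """Return the largest leading integer among the labels, or 0 if none."""
    max_page = 0
    for label in labels:
        # re.match anchors at the start, so "Next" and "..." do not match
        match = re.match(r"\d+", label.strip())
        if match:
            max_page = max(max_page, int(match.group(0)))
    return max_page


# Hypothetical pagination labels as they might appear in the page bar
print(max_page_from_labels(["Prev", "1", "2", "3", "...", "100", "Next"]))
```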

def getOnePageMsg(product_list, url):
    # Scrape the product data from one results page into product_list
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
    try:
        rspon = requests.get(url, headers=user_agent)
        rspon.encoding = rspon.apparent_encoding
        rspon.raise_for_status()
    except requests.RequestException as err:
        # Stop here: without a response there is nothing to parse
        print("Failed to fetch the page:", err)
        return
    html = BeautifulSoup(rspon.text, "html.parser")
    # Non-book pages use class "bigimg cloth_shoplist"; book pages use "bigimg"
    ul_tag = html.find('ul', {"class": "bigimg cloth_shoplist"})
    if ul_tag:
        search_type = 2
    else:
        ul_tag = html.find('ul', {"class": "bigimg"})
        if ul_tag:
            search_type = 1
        else:
            return
    if search_type == 1:
        # Book page: one direct child tag per book
        for child in ul_tag.find_all(True, recursive=False):
            temp_list = []
            # Book title
            tag_name = child.find('p', {"class": "name"}).find('a')
            temp_list.append(str(tag_name.attrs["title"]))
            # Price
            tag_price = child.find('p', {"class": "price"}).find("span", {"class": "search_now_price"})
            temp_list.append(str(tag_price.string))
            # Author
            tag_author = child.find('p', {"class": "search_book_author"}).find('a', {"name": "itemlist-author"})
            if tag_author:
                temp_list.append(str(tag_author.string))
            else:
                temp_list.append("NULL")
            # Publisher
            tag_pub = child.find('p', {"class": "search_book_author"}).find('a', {"name": "P_cbs"})
            if tag_pub:
                temp_list.append(str(tag_pub.string))
            else:
                temp_list.append("NULL")
            # Review count
            tag_comment = child.find('p', {"class": "search_star_line"}).find('a')
            temp_list.append(str(tag_comment.string))
            product_list.append(temp_list)
    else:
        # Other products: one direct child tag per product
        for child in ul_tag.find_all(True, recursive=False):
            temp_list = []
            # Product name
            tag_name = child.find('p', {"class": "name"}).find('a')
            temp_list.append(str(tag_name.attrs["title"]))
            # Price
            tag_price = child.find('p', {"class": "price"}).find("span")
            temp_list.append(str(tag_price.string))
            # Review count
            tag_comment = child.find('p', {"class": "star"}).find('a')
            temp_list.append(str(tag_comment.string))
            product_list.append(temp_list)

getOnePageMsg(product_list, url) fetches the HTML for one URL and parses it to extract the product information. It relies mainly on BeautifulSoup from the bs4 library to parse the page structure, and it appends the data to the product_list list.

def getAllPageMsg(keyword, product_list):
    # Scrape every results page for the keyword
    page_count = getMaxPageCount(keyword)
    print("find ", page_count, " pages...")
    for i in range(1, page_count + 1):
        url = "http://search.dangdang.com/?key={}&input&page_index={}".format(keyword, i)
        getOnePageMsg(product_list, url)
        print(">> page", i, " import successfully...")

getAllPageMsg(keyword, product_list) first obtains the largest page number for the keyword, then generates the URL of every page automatically and calls getOnePageMsg on each one, filling the product_list data list.

def writeMsgToFile(path, keyword, product_list):
    # Save the scraped data to <path><keyword>.txt, one product per line
    file_name = "{}.txt".format(keyword)
    # The with block closes the file even if a write fails
    with open(path + file_name, "w", encoding='utf-8') as file:
        for i in product_list:
            for j in i:
                file.write(j)
                file.write(' ')
            file.write("\n")

writeMsgToFile(path,keyword,product_list) writes the list data to a file, and also builds the file name from the path and keyword passed in.
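Because writeMsgToFile separates fields with a single space, a product name that itself contains spaces cannot be split back into fields reliably. As one possible alternative (a swapped-in technique, not what the original code does), the rows could be written with the standard csv module. The function name write_products_csv and the sample row are hypothetical.

```python
import csv
import io


def write_products_csv(file_obj, product_list):
    """Write each product (a list of strings) as one CSV row."""
    writer = csv.writer(file_obj)
    # Fields containing commas or quotes are quoted automatically
    writer.writerows(product_list)


# Demonstration with an in-memory buffer and a made-up product row
buffer = io.StringIO()
write_products_csv(buffer, [["Learning Python", "89.00", "120 reviews"]])
print(buffer.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.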

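def main():
    # Entry point: set the parameters, scrape all pages, then save the result
    keyword = 'python'
    path = "E://"
    product_list = []
    getAllPageMsg(keyword, product_list)
    writeMsgToFile(path, keyword, product_list)


if __name__ == "__main__":
    main()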

Finally, main() is the entry point and sets up the initial parameters.
Running main() scrapes the pages. With python as the keyword, fetching and parsing all the pages took about 3 minutes.

The resulting txt file has 6000 lines in total, and each line holds the information for one product.
