Parsing and plotting Maoyan Top 100 data with Python

Published: 2023/12/14

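Parsing the Maoyan Top 100 data

This post follows the previous one on scraping the data. It covers parsing that data and tries several more ways to extract and store it. The previous post is linked here: link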

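1. Extraction methods

1. Parsing with regular expressions

import re

# get_large_thumb, get_release_time and get_release_area are helper functions
# from the previous post; they are not defined in this section.
def parse_one_page(html):
    # Each lazy group is closed by the tag that follows it (</i> or </p>);
    # without that anchor a lazy (.*?) would match the empty string.
    pattern = (
        r'<dd>.*?board-index.*?">(\d+)</i>.*?data-src="(.*?)".*?/>'
        r'.*?movie-item-info.*?title="(.*?)".*?star">(.*?)</p>'
        r'.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>'
        r'.*?fraction">(\d+)</i>.*?</dd>'
    )
    # re.S lets "." match newlines too, so one match can span several lines
    regex = re.compile(pattern, re.S)
    for item in regex.findall(html):
        yield {
            'index': item[0],
            'thumb': get_large_thumb(item[1]),
            'title': item[2],
            'actors': item[3].strip()[3:],
            'release_time': get_release_time(item[4].strip()[5:]),
            'area': get_release_area(item[4].strip()[5:]),
            'score': item[5] + item[6],
        }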
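
The three helpers called above come from the previous post and are not shown here. The block below is only a hypothetical sketch of what they might look like, not the author's versions. It assumes that thumbnail URLs carry a resize suffix after an `@` (for example `@160w_220h_1e_1c`) and that the release text, once the label is sliced off, looks like `1993-01-01(中国香港)` with ASCII parentheses.

```python
import re


def get_large_thumb(url):
    """Drop an '@...' resize suffix from the image URL, if there is one (assumed URL shape)."""
    return url.split('@', 1)[0]


def get_release_time(text):
    """Return the leading date part, e.g. '1993-01-01' from '1993-01-01(中国香港)'."""
    match = re.match(r'[\d-]+', text)
    return match.group() if match else ''


def get_release_area(text):
    """Return the text inside the first pair of parentheses, or '' when there is none."""
    match = re.search(r'\((.*?)\)', text)
    return match.group(1) if match else ''


# Usage: the parsers slice off the 5-character label before calling the helpers
raw = '上映时间：1993-01-01(中国香港)'
print(get_release_time(raw[5:]), get_release_area(raw[5:]))
```

Returning an empty string when nothing matches keeps one film without an area from stopping the whole page; whether the original helpers behave this way is an assumption.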

2. Parsing with XPath from lxml

from lxml import etree

def parse_one_page2(html):
    parse = etree.HTML(html)
    items = parse.xpath("//*[@id='app']//div//dd")
    for item in items:
        # The release date and the area are both taken from the
        # <p class="releasetime"> text, as in methods 1 and 4.
        release = item.xpath(".//p[@class='releasetime']/text()")[0].strip()[5:]
        yield {
            'index': item.xpath("./i/text()")[0],
            'thumb': get_large_thumb(str(item.xpath("./a/img[2]/@data-src")[0].strip())),
            'name': item.xpath("./a/@title")[0],
            'star': item.xpath(".//p[@class='star']/text()")[0].strip(),
            'time': get_release_time(release),
            'area': get_release_area(release),
            'score': item.xpath(".//p[@class='score']/i[1]/text()")[0]
                     + item.xpath(".//p[@class='score']/i[2]/text()")[0],
        }

XPath works best on regularly structured markup, which makes it a handy tool for parsing pages and for extracting information in crawlers.

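3. The soup.select method from bs4

from bs4 import BeautifulSoup

def parse_one_page3(html):
    soup = BeautifulSoup(html, 'lxml')
    # Run each selector once rather than once per film
    indexes = soup.select("dd i.board-index")
    thumbs = soup.select("a > img.board-img")
    names = soup.select(".name a")
    stars = soup.select(".star")
    # The release date and the area are parsed from the same element,
    # as in methods 1 and 4.
    releases = soup.select(".releasetime")
    integers = soup.select(".integer")
    fractions = soup.select(".fraction")
    for i in range(10):  # the code assumes 10 films per page
        release = releases[i].string.strip()[5:]
        yield {
            'index': indexes[i].string,
            'thumb': get_large_thumb(thumbs[i]['data-src']),
            'name': names[i].string,
            'star': stars[i].string.strip()[3:],
            'time': get_release_time(release),
            'area': get_release_area(release),
            'score': integers[i].string + fractions[i].string,
        }

This method extracts the fields with BeautifulSoup plus CSS selectors.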

4. The BeautifulSoup API: find_all

from bs4 import BeautifulSoup

def parse_one_page4(html):
    soup = BeautifulSoup(html, 'lxml')
    # Run each lookup once rather than once per film
    indexes = soup.find_all(class_="board-index")
    thumbs = soup.find_all(class_="board-img")
    names = soup.find_all(name='p', attrs={'class': "name"})
    stars = soup.find_all(name='p', attrs={'class': "star"})
    releases = soup.find_all(class_='releasetime')
    integers = soup.find_all(name='i', attrs={'class': "integer"})
    fractions = soup.find_all(name='i', attrs={'class': "fraction"})
    for i in range(10):  # the code assumes 10 films per page
        release = releases[i].string.strip()[5:]
        yield {
            'index': indexes[i].string,
            'thumb': get_large_thumb(thumbs[i].attrs['data-src']),
            'name': names[i].string,
            'star': stars[i].string.strip()[3:],
            'time': get_release_time(release),
            'area': get_release_area(release),
            'score': integers[i].string.strip() + fractions[i].string.strip(),
        }

Besides pairing it with CSS selectors, you can also extract data with BeautifulSoup's own find_all function, as shown above.

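2. Storage methods

1. Dictionary storage, one JSON string per line

import json

def write_to_file(items):
    # 'a' opens the file in append mode; utf_8_sig keeps Simplified Chinese
    # text from coming out garbled
    with open('save.csv', 'a', encoding='utf_8_sig') as f:
        f.write(json.dumps(items, ensure_ascii=False) + '\n')
        print('Finished scraping film no. %s' % items["index"])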
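
Because each line written this way is a standalone JSON document, the file can be read back line by line. The reader below is a minimal sketch; the function name `read_json_lines` is hypothetical and does not appear in the original post. It opens the file with `utf_8_sig` so that the byte-order mark written at the start of the file is stripped before `json.loads` sees the first line.

```python
import json


def read_json_lines(path):
    """Yield one dict per non-empty line of a file written by write_to_file."""
    # utf_8_sig removes the byte-order mark that the same codec wrote
    with open(path, encoding='utf_8_sig') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Calling `list(read_json_lines('save.csv'))` would then give back the film dictionaries in the order they were appended.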

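2. CSV storage by field name

import csv

def write_to_file2(items):
    with open('save2.csv', 'a', encoding='utf_8_sig', newline='') as f:
        # items must use exactly these keys (as yielded by methods 2 to 4);
        # DictWriter raises ValueError for keys that are not in fieldnames
        fieldnames = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writerow(items)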

3. Values only

import csv

def write_to_file3(items):
    with open('save.csv', 'a', encoding='utf_8_sig', newline='') as f:
        w = csv.writer(f)
        # Columns follow the order of the keys in the dict
        w.writerow(items.values())
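
Neither CSV writer above emits a header row, so whoever reads the file has to know the column order. One possible variation, sketched here under the assumption that a header is wanted, writes the header only when the file is new or empty and reads the rows back with `csv.DictReader`. The names `append_row`, `read_rows`, and `FIELDNAMES` are hypothetical and are not part of the original code.

```python
import csv
import os

FIELDNAMES = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']


def append_row(path, item):
    """Append one film dict, writing the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf_8_sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        writer.writerow(item)


def read_rows(path):
    """Read the file back as a list of dicts keyed by the header row."""
    with open(path, encoding='utf_8_sig', newline='') as f:
        return list(csv.DictReader(f))
```

A file written this way carries its own column names, so it would be loaded with the reader's default header handling rather than with an explicit list of names.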

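3. Data analysis: visualization

As an example, we draw a bar chart of the ten highest-rated films.

1. Preparation: import the libraries, load the data, and set the theme

import matplotlib.pyplot as plt
import pylab as pl
import pandas as pd

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'
# Keep the minus sign '-' from rendering as a garbled character
plt.rcParams['axes.unicode_minus'] = False
# Set the theme
plt.style.use('ggplot')
# Set the figure size
fig = plt.figure(figsize=(8, 5))
colors1 = '#6D6D6D'
# Load the raw data; utf_8_sig matches the encoding used when writing, so the
# byte-order mark does not end up in the first field
columns = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']
df = pd.read_csv('save2.csv', encoding='utf_8_sig', header=None, names=columns, index_col='index')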

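2. Plotting

def annsis1():
    # ascending=False sorts in descending order, True in ascending order
    df_score = df.sort_values('score', ascending=False)
    name1 = df_score['name'][:10]    # x-axis labels
    score1 = df_score['score'][:10]  # y-axis values
    # Draw the bars; using range() keeps the x-axis in sorted order
    plt.bar(range(10), score1, tick_label=name1)
    plt.ylim(9, 10)
    plt.title("Top 10 films by score", color=colors1)
    plt.xlabel('Film')
    plt.ylabel('Score')
    # Label each bar with its value
    for x, y in enumerate(list(score1)):
        plt.text(x, y + 0.01, '%s' % round(y, 1), ha='center', color=colors1)
    pl.xticks(rotation=270)  # rotate the tick labels by 270°
    plt.tight_layout()       # trim surplus whitespace
    plt.show()

The labels are rotated by 270° so that long film titles do not overlap their neighbours.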

3. Result

[Figure: bar chart of the ten highest-rated films]
