Python Data Analysis in Practice: Visualizing TMDB Movie Data

Published: 2023/12/31

Produced by: Python数据之道 (ID: PyDataLab)

Author: 葉庭云

Editor: Lemon

I. Data Preprocessing

This article is a hands-on project that analyzes and visualizes the TMDB movie dataset. The data comes from Kaggle, and instructions for obtaining it are given at the end of the article.

    import json
    import pandas as pd
    import numpy as np
    from datetime import datetime
    import warnings

    warnings.filterwarnings('ignore')    # suppress warning messages

    # Read the movie data; specify the Python engine, otherwise parsing raises an error
    df = pd.read_csv('tmdb_5000_movies.csv', engine='python')
    df.head()

    # The dataset holds more information than this analysis needs, so keep only the relevant columns:
    # keywords, title, genres, release date, runtime, budget, revenue, vote count, average vote
    # .copy() avoids modifying a view of df in the assignments below
    df1 = df[['keywords', 'original_title', 'genres', 'release_date', 'runtime',
              'budget', 'revenue', 'vote_count', 'vote_average']].copy()
    df1.info()

    # Method 2: look up the missing values and fill them in
    # IMDb: https://www.imdb.com/title/tt3856124/
    df1.loc[2656, 'runtime'] = 98.0
    df1.loc[4140, 'runtime'] = 81.0
    df1.loc[4553, 'release_date'] = '2014-06-01'
    df1.info()

    # Process the genres column
    df1['genres'].head()
    # Parse each JSON string into a list of dicts
    df1['genres'] = df1['genres'].apply(json.loads)


    def decode(col):
        """Join the genre names of one movie into a '|'-separated string."""
        genre = []
        for item in col:
            genre.append(item['name'])
        return '|'.join(genre)


    df1['genres'] = df1['genres'].apply(decode)
    df1.head()

    # Extract the year from release_date
    df1['release_date'] = pd.to_datetime(df1['release_date']).dt.year
    # Rename the column
    col = {'release_date': 'year'}
    df1.rename(columns=col, inplace=True)
    df1['year'] = df1['year'].astype(int)    # store the year as an integer
    df1['year'].head()
    # Save the cleaned data
    df1.to_excel('已清洗数据.xlsx')
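To make the cleaning steps easier to follow without the Kaggle file, here is a minimal sketch on made-up rows. The toy values, the helper name `decode_genres`, and the use of `errors='coerce'` are assumptions for illustration and are not part of the original script. It shows how to locate rows with missing values before filling them, and how a JSON genre string becomes a `|`-separated string.

```python
import json

import pandas as pd

# Toy rows imitating the raw TMDB layout (values are made up for illustration).
raw = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 18, "name": "Drama"}]',
        '[]',
    ],
    'runtime': [120.0, None, 95.0],
    'release_date': ['2009-12-10', '2014-06-01', None],
})

# Count missing values per column, then list the index labels of affected rows.
missing_per_column = raw.isnull().sum()
rows_with_missing = raw[raw.isnull().any(axis=1)].index.tolist()


def decode_genres(cell):
    """Turn a JSON list of {"id", "name"} objects into 'Name1|Name2'."""
    return '|'.join(item['name'] for item in json.loads(cell))


raw['genres'] = raw['genres'].apply(decode_genres)
# errors='coerce' turns missing or unparseable dates into NaT instead of raising.
raw['year'] = pd.to_datetime(raw['release_date'], errors='coerce').dt.year

print(missing_per_column)
print(rows_with_missing)
print(raw[['original_title', 'genres', 'year']])
```

Note that a movie with an empty genre list ends up with an empty string, which is read back as NaN after a round trip through Excel. That is why the analysis scripts call `dropna` after loading the cleaned file.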

II. Data Analysis

1. Building a DataFrame of genre counts by year

How has the number of movies in each genre changed over time? To answer this, build a DataFrame that relates each year to the number of movies per genre, then take the counts for 2000-2016 and visualize them as a heatmap.

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    """
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    import seaborn as sns

    # Read the cleaned Excel data
    df = pd.read_excel('已清洗数据.xlsx')
    # A few rows had an empty genre list (not NaN) during cleaning; they are read back as NaN
    df.dropna(inplace=True)
    # Collect the set of all genres
    genres_set = set()
    for genre in df['genres'].str.split('|'):
        for item in genre:
            genres_set.add(item)

    genres_list = list(genres_set)

    for genre in genres_list:
        # Add one 0/1 column per genre: 1 if the movie has that genre
        df[genre] = df['genres'].str.contains(genre).apply(lambda x: 1 if x else 0)

    genre_year = df.loc[:, genres_list]
    # Use the year as the index
    genre_year.index = df['year']
    # Group by year and sum to get the number of movies per genre in each year
    genresdf = genre_year.groupby('year').sum()
    # DataFrame of year versus genre counts
    print(genresdf)
    # Keep 2000-2016 for the heatmap; 2017 has only a handful of movies
    datas = genresdf.iloc[-18:-1]
    # CJK-capable font; only needed if the labels contain Chinese text
    mpl.rcParams['font.family'] = 'Kaiti'
    fig, ax = plt.subplots(figsize=(15, 9))
    print(datas)
    # Draw the heatmap; cmap maps numbers to colors
    sns.heatmap(data=datas.T, linewidths=0.25, linecolor='white', ax=ax,
                annot=True, fmt='d', cmap='Accent', robust=True)

    # Axis labels and title
    ax.set_xlabel('Year', fontdict={'size': 18, 'weight': 'bold'})
    ax.set_ylabel('Genre', fontdict={'size': 18, 'weight': 'bold'})
    ax.set_title('Number of movies per genre, 2000-2016', fontsize=25, x=0.5, y=1.02)

    # Hide the frame
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)

    # Save and show the figure
    plt.savefig('heat_map.png')
    plt.show()

The heatmap shows that Drama and Comedy have the most movies in every year, and that Thriller also has a sizable count each year.
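The year-by-genre table behind the heatmap can also be built without an explicit loop. The sketch below is an alternative to the script above, not the author's code: it uses `Series.str.get_dummies` on a small made-up table, and the variable names are hypothetical. Because it splits on `|` and compares whole genre names, it does not rely on substring matching.

```python
import pandas as pd

# Made-up data in the same shape as the cleaned file: one row per movie.
movies = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001],
    'genres': ['Drama|Comedy', 'Drama', 'Thriller|Drama', 'Comedy', 'Action|Thriller'],
})

# One 0/1 indicator column per genre, split on the '|' separator.
indicators = movies['genres'].str.get_dummies(sep='|')

# Sum the indicators within each year: movies per genre per year.
genre_by_year = indicators.groupby(movies['year']).sum()
print(genre_by_year)
```

Passing `genre_by_year.T` to `sns.heatmap` would then give genres on the y-axis and years on the x-axis, as in the figure described above.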

2. Top 10 genres by number of movies

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    """
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl

    # Read the data
    df = pd.read_excel('已清洗数据.xlsx')
    # A few rows had an empty genre list (not NaN) during cleaning; they are read back as NaN
    df.dropna(inplace=True)
    # Collect the set of all genres
    genres_set = set()
    for genre in df['genres'].str.split('|'):
        for item in genre:
            genres_set.add(item)

    genres_list = list(genres_set)

    for genre in genres_list:
        # Add one 0/1 column per genre: 1 if the movie has that genre
        df[genre] = df['genres'].str.contains(genre).apply(lambda x: 1 if x else 0)

    genre_year = df.loc[:, genres_list]
    # Use the year as the index
    genre_year.index = df['year']
    # Group by year and sum to get the number of movies per genre in each year
    genresdf = genre_year.groupby('year').sum()
    genres_count = genresdf.sum(axis=0).sort_values(ascending=False)    # descending order
    # print(genres_count.index)
    # print(genres_count.values)
    colors = ['#FF0000', '#FF1493', '#00BFFF', '#9932CC', '#0000CD',
              '#FFD700', '#FF4500', '#00FA9A', '#191970', '#006400']
    # Figure size and resolution
    plt.figure(figsize=(12, 8), dpi=100)
    # CJK-capable font; only needed if the labels contain Chinese text
    mpl.rcParams['font.family'] = 'SimHei'
    plt.style.use('ggplot')
    # Horizontal bar chart of the top 10, largest at the top; set bar height and colors
    plt.barh(genres_count.index[9::-1], genres_count.values[9::-1], height=0.6, color=colors[::-1])
    plt.xlabel('Number of movies', fontsize=12)
    plt.ylabel('Genre', fontsize=12, color='red')
    plt.title('Top 10 genres by number of movies', fontsize=18, x=0.5, y=1.05)
    plt.savefig('test_001.png')
    plt.show()

The five genres with the most movies are Drama, Comedy, Thriller, Action, and Romance.
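If only the ranking is needed, the per-year table is not required. As a rough sketch on made-up data (the variable names and `top_n` are assumptions for illustration), the genre strings can be split, exploded into one row per genre, and counted directly:

```python
import pandas as pd

# Made-up '|'-separated genre strings, one per movie.
genres = pd.Series(['Drama|Comedy', 'Drama', 'Thriller|Drama', 'Comedy', 'Action|Thriller'])

# split -> one list per movie; explode -> one row per (movie, genre) pair;
# value_counts -> counts sorted from most to least frequent.
genre_counts = genres.str.split('|').explode().value_counts()

top_n = 3  # the article plots the top 10
top_genres = genre_counts.head(top_n)
print(top_genres)
```

`top_genres` can be passed to `plt.barh` in the same way as `genres_count` in the script above.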

3. Share of each genre

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    """
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl

    df = pd.read_excel('已清洗数据.xlsx')
    # A few rows had an empty genre list (not NaN) during cleaning; they are read back as NaN
    df.dropna(inplace=True)
    # Collect the set of all genres
    genres_set = set()
    for genre in df['genres'].str.split('|'):
        for item in genre:
            genres_set.add(item)

    genres_list = list(genres_set)

    for genre in genres_list:
        # Add one 0/1 column per genre: 1 if the movie has that genre
        df[genre] = df['genres'].str.contains(genre).apply(lambda x: 1 if x else 0)

    genre_year = df.loc[:, genres_list]
    # Use the year as the index
    genre_year.index = df['year']
    # Group by year and sum to get the number of movies per genre in each year
    genresdf = genre_year.groupby('year').sum()
    genres_count = genresdf.sum(axis=0).sort_values(ascending=False)    # descending order
    # print(genres_count.index)
    # print(genres_count.values)
    # print(len(genres_count.values))

    # CJK-capable font; only needed if the labels contain Chinese text
    mpl.rcParams['font.family'] = 'SimHei'
    # Figure size and resolution
    plt.figure(figsize=(12, 8), dpi=100)
    plt.axes(aspect='equal')   # keep the pie circular
    # Pull out the six smallest slices so their labels do not overlap
    explodes = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]
    plt.pie(genres_count.values, labels=genres_count.index, autopct='%.2f%%', shadow=True,
            explode=explodes, startangle=15, labeldistance=1.1)
    plt.title('Share of each genre', fontsize=18)
    plt.savefig('test_002.png')
    plt.show()

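Among all genres, Drama is the largest at 18.89%, followed by Comedy at 14.16%. Because a movie can carry several genres, these percentages are shares of all genre labels rather than shares of movies.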
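The difference between a share of genre labels and a share of movies is easy to see on a small example. The data below is made up and the variable names are hypothetical; the point is only that the first set of percentages sums to 100 (which is what a pie chart needs), while the second can exceed 100 because movies belong to several genres.

```python
import pandas as pd

# Made-up '|'-separated genre strings, one per movie (5 movies, 8 labels in total).
genres = pd.Series(['Drama|Comedy', 'Drama', 'Thriller|Drama', 'Comedy', 'Action|Thriller'])

counts = genres.str.split('|').explode().value_counts()

# Share of all genre labels: this is what the pie chart's autopct displays.
share_of_labels = (counts / counts.sum() * 100).round(2)

# Share of movies that carry each genre: can add up to more than 100.
share_of_movies = (counts / len(genres) * 100).round(2)

print(share_of_labels)
print(share_of_movies)
```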

4. Movie keyword analysis

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    """
    import json
    import collections

    import pandas as pd
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    df = pd.read_csv('tmdb_5000_movies.csv')['keywords']

    # These two keywords are very frequent but carry little meaning, so filter them out
    ignored = ['aftercreditsstinger', 'duringcreditsstinger']

    key_words_list = []
    for item in df:
        # Parse the JSON string (json.loads is safer than eval for untrusted text)
        item = json.loads(item)
        if item:     # skip empty lists
            key_words_list.extend([x['name'] for x in item if x['name'] not in ignored])

    words_count = collections.Counter(key_words_list)
    print(words_count)

    wc = WordCloud(
        background_color='white',
        max_words=2000,
        max_font_size=100,
        random_state=8,
    )

    wc.generate_from_frequencies(words_count)
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('test_003.png')
    plt.show()

The word cloud of movie keywords shows that the most frequent keywords are woman and independent, followed by murder, violence, revenge, and based on novel. This suggests that audiences are most interested in movies about women and independence, followed by crime movies and novel adaptations.
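The frequencies behind the word cloud can be inspected directly, which is often more precise than reading font sizes. Here is a minimal sketch on made-up keyword strings in the same JSON layout as the `keywords` column; the sample values are assumptions for illustration.

```python
import json
from collections import Counter

# Made-up cells in the same JSON layout as the 'keywords' column.
raw_keywords = [
    '[{"id": 1, "name": "woman director"}, {"id": 2, "name": "murder"}]',
    '[{"id": 2, "name": "murder"}, {"id": 3, "name": "duringcreditsstinger"}]',
    '[]',
]
# Frequent but uninformative keywords that the article also filters out.
ignored = {'aftercreditsstinger', 'duringcreditsstinger'}

# Count every keyword name across all movies, skipping the ignored ones.
keyword_counts = Counter(
    item['name']
    for cell in raw_keywords
    for item in json.loads(cell)
    if item['name'] not in ignored
)

print(keyword_counts.most_common(2))
```

On the real data, `most_common(20)` would list the keywords that dominate the word cloud together with their exact counts.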

5. Number of movies per genre over time

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    """
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl

    # Read the data
    df = pd.read_excel('已清洗数据.xlsx')
    # A few rows had an empty genre list (not NaN) during cleaning; they are read back as NaN
    df.dropna(inplace=True)
    # Collect the set of all genres
    genres_set = set()
    for genre in df['genres'].str.split('|'):
        for item in genre:
            genres_set.add(item)

    genres_list = list(genres_set)

    for genre in genres_list:
        # Add one 0/1 column per genre: 1 if the movie has that genre
        df[genre] = df['genres'].str.contains(genre).apply(lambda x: 1 if x else 0)

    genre_year = df.loc[:, genres_list]
    # Use the year as the index
    genre_year.index = df['year']
    # Group by year and sum to get the number of movies per genre in each year
    genresdf = genre_year.groupby('year').sum()
    print(genresdf)
    # CJK-capable font; only needed if the labels contain Chinese text
    mpl.rcParams['font.family'] = 'SimHei'
    # Figure size and resolution
    plt.figure(figsize=(10, 6), dpi=100)
    # Plot style
    plt.style.use('ggplot')
    # One line per genre column
    plt.plot(genresdf.index, genresdf.values)
    # Axis ticks, labels, and title
    plt.xticks(range(1915, 2018, 5))
    plt.xlabel('Year', fontsize=12)
    plt.ylabel('Number of movies', fontsize=12)
    plt.title('Number of movies per genre over time', fontsize=18, x=0.5, y=1.02)
    # Legend: one entry per genre, in column order
    plt.legend(list(genresdf.columns))
    # Save the figure
    plt.savefig('test_004.png')
    # Show the figure
    plt.show()

The line chart shows that every genre trends upward over time, and that all genres grew rapidly after about 1992. A possible reason is that rising living standards increased the demand for movies. Drama and Comedy grew the fastest and remain the most popular genres.
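A claim such as "growth accelerated after about 1992" can be checked numerically instead of by eye. The sketch below is one possible check, not the author's method: it compares the average yearly count before and after a cutoff year on a made-up table, and the numbers and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up year-by-genre counts in the same shape as genresdf.
genre_by_year = pd.DataFrame(
    {'Drama': [2, 3, 8, 10], 'Comedy': [1, 2, 5, 7]},
    index=pd.Index([1990, 1991, 1993, 1994], name='year'),
)

cutoff = 1992  # the year the article mentions as the start of rapid growth

# Average number of movies per year before and from the cutoff onward.
before = genre_by_year[genre_by_year.index < cutoff].mean()
after = genre_by_year[genre_by_year.index >= cutoff].mean()

# Ratio above 1 means the genre produced more movies per year after the cutoff.
growth = (after / before).round(2)
print(growth.sort_values(ascending=False))
```

Keep in mind that a dataset of this kind may simply include more recent movies, so a rising count does not by itself show that more movies were made.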

6. Box-office revenue versus runtime

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    """
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    # Read the data
    df = pd.read_excel('已清洗数据.xlsx')
    # Runtime and revenue
    run_time, revenue = df['runtime'], df['revenue']

    # CJK-capable font; only needed if the labels contain Chinese text
    mpl.rcParams['font.family'] = 'SimHei'
    # Plot style
    plt.style.use('ggplot')
    # Figure size and resolution
    plt.figure(figsize=(9, 6), dpi=100)
    # Scatter plot; revenue is in US dollars, so divide by 1e8 to match the axis unit
    plt.scatter(run_time, revenue / 1e8)
    # Title and axis labels
    plt.title('Box-office revenue versus runtime', fontsize=18, x=0.5, y=1.02)
    plt.xlabel('Runtime (minutes)')
    plt.ylabel('Revenue (hundred million USD)')
    # Save the figure
    plt.savefig('test_005.png')
    # Show the figure
    plt.show()

7. Average rating versus runtime

    """
    @Author  : 葉庭云
    @Date    : 2020/10/2 11:40
    @CSDN    : https://blog.csdn.net/fyfugoyfa
    """
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np

    # Read the data
    df = pd.read_excel('已清洗数据.xlsx')
    # Runtime and average rating
    run_time, rating_score = df['runtime'], df['vote_average']

    # CJK-capable font; only needed if the labels contain Chinese text
    mpl.rcParams['font.family'] = 'SimHei'
    # Plot style
    plt.style.use('ggplot')
    # Figure size and resolution
    plt.figure(figsize=(9, 6), dpi=100)
    # Scatter plot
    plt.scatter(run_time, rating_score, c='purple')
    plt.yticks(np.arange(0, 10.5, 1))
    # Title and axis labels
    plt.title('Average rating versus runtime', fontsize=18, x=0.5, y=1.02)
    plt.xlabel('Runtime (minutes)')
    plt.ylabel('Average rating')
    # Save the figure
    plt.savefig('test_006.png')
    # Show the figure
    plt.show()

The two scatter plots suggest that a movie aiming for both high box-office revenue and good ratings does best with a runtime of 100-150 minutes.
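The 100-150 minute observation can be made more concrete by grouping movies into runtime bands and summarizing each band. The sketch below uses `pd.cut` on made-up numbers; the band edges, column names in `summary`, and sample values are assumptions for illustration, not results from the TMDB data.

```python
import pandas as pd

# Made-up movies with the three columns used in the two scatter plots.
movies = pd.DataFrame({
    'runtime': [85, 95, 110, 125, 140, 170],
    'revenue': [1.0e7, 3.0e7, 2.0e8, 4.0e8, 3.0e8, 1.0e8],
    'vote_average': [5.5, 6.0, 6.8, 7.2, 7.0, 7.5],
})

# Bands are right-inclusive: (0, 100], (100, 150], (150, 400].
bins = [0, 100, 150, 400]
labels = ['<=100', '100-150', '>150']
movies['runtime_band'] = pd.cut(movies['runtime'], bins=bins, labels=labels)

# Per band: how many movies, typical revenue, and average rating.
summary = movies.groupby('runtime_band', observed=True).agg(
    n_movies=('runtime', 'size'),
    median_revenue=('revenue', 'median'),
    mean_rating=('vote_average', 'mean'),
)
print(summary)
```

The median is used for revenue because box-office figures are heavily skewed, so a few blockbusters would dominate a mean.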

About the author:

葉庭云

Motto: Passion can outlast the long years

CSDN blog: https://blog.csdn.net/fyfugoyfa/

How to get the data

The data used in this article can be obtained as follows:

1. Scan the QR code in the original post

2. Reply with the keyword: 电影

(Copying the keyword is recommended.)
