
Data Analysis - Kaggle TMDB Box Office Prediction


    • Environment Setup
    • Dataset
    • Main Content
    • Data Preprocessing
    • Exploratory Data Analysis
    • Modeling

Environment Setup

Environment used:

  • Windows 10
  • Python 3.7.2
  • Jupyter Notebook (all code was tested there)

Dataset

https://www.kaggle.com/c/tmdb-box-office-prediction/data

Main Content

Before getting started, import the third-party libraries:

```python
import pandas as pd
pd.set_option('display.max_columns', None)  # show every column when previewing wide frames
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
from wordcloud import WordCloud
plt.style.use('ggplot')
import ast
from collections import Counter
import numpy as np
from sklearn.preprocessing import LabelEncoder
# text mining
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LinearRegression
# modeling
from sklearn.model_selection import train_test_split
import lightgbm as lgb
```

Load the data:

```python
train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')
```

Take a quick look at the data:

```python
train.head()
```

[Output: a preview of the 23 columns: id, belongs_to_collection, budget, genres, homepage, imdb_id, original_language, original_title, overview, popularity, poster_path, production_companies, production_countries, release_date, runtime, spoken_languages, status, tagline, title, Keywords, cast, crew, revenue. Columns such as belongs_to_collection, genres, production_companies, cast, and crew hold JSON-like strings, e.g. [{'id': 35, 'name': 'Comedy'}].]

Dataset sizes; the dataset is quite small:

```python
print(train.shape)
print(test.shape)
```

(3000, 23)
(4398, 22)

Data Preprocessing

The preview above shows several columns holding JSON-like data, which must be converted into a workable format. These strings are actually written in Python literal syntax (single-quoted), so rather than going through a JSON parser, ast.literal_eval can turn each string directly into a list of dictionaries; that is the approach used here:

```python
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def json_to_dict(df):
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x))
    return df

train = json_to_dict(train)
test = json_to_dict(test)
```
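As a quick sanity check (my own snippet, with a made-up string in the same format as the genres column), ast.literal_eval turns one of these strings into a list of dictionaries:

```python
import ast

# hypothetical sample string in the same format as the `genres` column
sample = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
parsed = ast.literal_eval(sample)
print(parsed[0]['name'])  # -> Comedy
```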

Next, turn these irregular fields into features. There are two kinds: label extraction and encoding (key cast members, genres, collections, production companies, and so on), and label counts (number of genres, number of cast members, keyword counts, etc.). Note that because the dataset is small, only the top labels should be one-hot encoded, to avoid overfitting:

```python
# collections
train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
train = train.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)

# genres
list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {} else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(15)]
for g in top_genres:
    train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {} else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_genres:
    test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['genres'], axis=1)
test = test.drop(['genres'], axis=1)

# production companies
list_of_companies = list(train['production_companies'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_companies'] = train['production_companies'].apply(lambda x: len(x) if x != {} else 0)
train['all_production_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_companies = [m[0] for m in Counter([i for j in list_of_companies for i in j]).most_common(30)]
for g in top_companies:
    train['production_company_' + g] = train['all_production_companies'].apply(lambda x: 1 if g in x else 0)
test['num_companies'] = test['production_companies'].apply(lambda x: len(x) if x != {} else 0)
test['all_production_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_companies:
    test['production_company_' + g] = test['all_production_companies'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['production_companies', 'all_production_companies'], axis=1)
test = test.drop(['production_companies', 'all_production_companies'], axis=1)

# production countries
list_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_countries'] = train['production_countries'].apply(lambda x: len(x) if x != {} else 0)
train['all_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(25)]
for g in top_countries:
    train['production_country_' + g] = train['all_countries'].apply(lambda x: 1 if g in x else 0)
test['num_countries'] = test['production_countries'].apply(lambda x: len(x) if x != {} else 0)
test['all_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_countries:
    test['production_country_' + g] = test['all_countries'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['production_countries', 'all_countries'], axis=1)
test = test.drop(['production_countries', 'all_countries'], axis=1)

# spoken languages
list_of_languages = list(train['spoken_languages'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_languages'] = train['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
train['all_languages'] = train['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_languages = [m[0] for m in Counter([i for j in list_of_languages for i in j]).most_common(30)]
for g in top_languages:
    train['language_' + g] = train['all_languages'].apply(lambda x: 1 if g in x else 0)
test['num_languages'] = test['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
test['all_languages'] = test['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_languages:
    test['language_' + g] = test['all_languages'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['spoken_languages', 'all_languages'], axis=1)
test = test.drop(['spoken_languages', 'all_languages'], axis=1)

# keywords
list_of_keywords = list(train['Keywords'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_Keywords'] = train['Keywords'].apply(lambda x: len(x) if x != {} else 0)
train['all_Keywords'] = train['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_keywords = [m[0] for m in Counter([i for j in list_of_keywords for i in j]).most_common(30)]
for g in top_keywords:
    train['keyword_' + g] = train['all_Keywords'].apply(lambda x: 1 if g in x else 0)
test['num_Keywords'] = test['Keywords'].apply(lambda x: len(x) if x != {} else 0)
test['all_Keywords'] = test['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_keywords:
    test['keyword_' + g] = test['all_Keywords'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['Keywords', 'all_Keywords'], axis=1)
test = test.drop(['Keywords', 'all_Keywords'], axis=1)

# cast
list_of_cast_names = list(train['cast'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
list_of_cast_genders = list(train['cast'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
list_of_cast_characters = list(train['cast'].apply(lambda x: [i['character'] for i in x] if x != {} else []).values)
train['num_cast'] = train['cast'].apply(lambda x: len(x) if x != {} else 0)
top_cast_names = [m[0] for m in Counter([i for j in list_of_cast_names for i in j]).most_common(15)]
for g in top_cast_names:
    train['cast_name_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
train['genders_0_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
train['genders_1_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
train['genders_2_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
top_cast_characters = [m[0] for m in Counter([i for j in list_of_cast_characters for i in j]).most_common(15)]
for g in top_cast_characters:
    train['cast_character_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
test['num_cast'] = test['cast'].apply(lambda x: len(x) if x != {} else 0)
for g in top_cast_names:
    test['cast_name_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
test['genders_0_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
test['genders_1_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
test['genders_2_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
for g in top_cast_characters:
    test['cast_character_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
train = train.drop(['cast'], axis=1)
test = test.drop(['cast'], axis=1)

# crew
list_of_crew_names = list(train['crew'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
list_of_crew_jobs = list(train['crew'].apply(lambda x: [i['job'] for i in x] if x != {} else []).values)
list_of_crew_genders = list(train['crew'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
list_of_crew_departments = list(train['crew'].apply(lambda x: [i['department'] for i in x] if x != {} else []).values)
train['num_crew'] = train['crew'].apply(lambda x: len(x) if x != {} else 0)
top_crew_names = [m[0] for m in Counter([i for j in list_of_crew_names for i in j]).most_common(15)]
for g in top_crew_names:
    train['crew_name_' + g] = train['crew'].apply(lambda x: 1 if g in str(x) else 0)
train['genders_0_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
train['genders_1_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
train['genders_2_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
top_crew_jobs = [m[0] for m in Counter([i for j in list_of_crew_jobs for i in j]).most_common(15)]
for j in top_crew_jobs:
    train['jobs_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
top_crew_departments = [m[0] for m in Counter([i for j in list_of_crew_departments for i in j]).most_common(15)]
for j in top_crew_departments:
    train['departments_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j]))
test['num_crew'] = test['crew'].apply(lambda x: len(x) if x != {} else 0)
for g in top_crew_names:
    test['crew_name_' + g] = test['crew'].apply(lambda x: 1 if g in str(x) else 0)
test['genders_0_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
test['genders_1_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
test['genders_2_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
for j in top_crew_jobs:
    test['jobs_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
for j in top_crew_departments:
    test['departments_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j]))
train = train.drop(['crew'], axis=1)
test = test.drop(['crew'], axis=1)
```
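Every dictionary column above goes through the same pattern: count the entries, join the names, and one-hot encode the top-K labels. The repetition could be factored into a single helper; here is a minimal sketch (the helper name and its parameters are my own, not from the original notebook):

```python
from collections import Counter

def encode_top_k(train_df, test_df, column, prefix, k):
    """One-hot encode the k most frequent 'name' labels of a list-of-dicts column."""
    # build the frequency table from the training set only
    names = train_df[column].apply(lambda x: [i['name'] for i in x] if x != {} else [])
    top = [m[0] for m in Counter([n for row in names for n in row]).most_common(k)]
    for df in (train_df, test_df):
        df['num_' + column] = df[column].apply(lambda x: len(x) if x != {} else 0)
        joined = df[column].apply(lambda x: ' '.join(sorted(i['name'] for i in x)) if x != {} else '')
        for label in top:
            df[prefix + label] = joined.apply(lambda s: 1 if label in s else 0)

# e.g. encode_top_k(train, test, 'genres', 'genre_', 15)
```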

Preview the data after feature processing:

```python
train.head()
```

[Output: the frame now holds the remaining original fields plus all engineered columns: collection_name, has_collection, num_genres, all_genres and 15 genre_* flags, num_companies and 30 production_company_* flags, num_countries and 25 production_country_* flags, num_languages and 30 language_* flags, num_Keywords and 30 keyword_* flags, num_cast with cast_name_* and cast_character_* flags, genders_0/1/2_cast counts, num_crew with crew_name_* flags, genders_0/1/2_crew, jobs_* and departments_* counts.]

This gives the expected result, but release_date is still a string; convert it to a proper date format:

```python
def fix_date(x):
    """Expand two-digit years: years <= 19 become 20xx, the rest 19xx."""
    year = x.split('/')[2]
    if int(year) <= 19:
        return x[:-2] + '20' + year
    else:
        return x[:-2] + '19' + year

test.loc[test['release_date'].isnull() == True, 'release_date'] = '01/01/98'
train['release_date'] = train['release_date'].apply(lambda x: fix_date(x))
test['release_date'] = test['release_date'].apply(lambda x: fix_date(x))
train['release_date'] = pd.to_datetime(train['release_date'])
test['release_date'] = pd.to_datetime(test['release_date'])
```
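A quick check of the century cutoff (my own snippet; the raw dates are month/day/two-digit-year strings):

```python
print(fix_date('2/20/15'))  # -> 2/20/2015 (15 <= 19, mapped to 20xx)
print(fix_date('8/6/98'))   # -> 8/6/1998  (98 > 19, mapped to 19xx)
```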

Exploratory Data Analysis

First, look at the budget distribution. Most values are small and the distribution is heavily skewed, so a log transform is in order to spread out the small values:

```python
plt.hist(train['budget'])
plt.title('budget distribution')
```

```python
plt.hist(np.log1p(train['budget']))
plt.title('log1p budget distribution')
```
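To quantify the effect rather than just eyeball it, pandas' skew() can compare the two distributions (my own check; log1p is used instead of log so that zero budgets map to 0):

```python
print(train['budget'].skew())            # raw budgets: strongly right-skewed
print(np.log1p(train['budget']).skew())  # after log1p: much closer to symmetric
```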


Revenue obviously gets the same treatment:

```python
# log_budget, normalization
train['log_budget'] = np.log1p(train['budget'])
test['log_budget'] = np.log1p(test['budget'])
# log_revenue, normalization (train only; test has no revenue column)
train['log_revenue'] = np.log1p(train['revenue'])
```

Next, homepage: convert it into a boolean flag, since having an official homepage is itself a sign of a well-backed production:

```python
train['has_homepage'] = 0
train.loc[train['homepage'].isnull() == False, 'has_homepage'] = 1
test['has_homepage'] = 0
test.loc[test['homepage'].isnull() == False, 'has_homepage'] = 1
```

Revenue split by homepage presence; films with a homepage earn more:

```python
sns.catplot(x='has_homepage', y='revenue', data=train)
plt.title('Revenue for film with and without homepage')
```


Revenue by original language:

```python
top10 = train['original_language'].value_counts().head(10).index
sns.boxplot(x='original_language', y='log_revenue',
            data=train.loc[train['original_language'].isin(top10)])
plt.title("Log_Revenue VS Original_language")
```


English-language films include many hits but also many flops; other languages produce successful films too, so overall the differences are small.

The overview column calls for text mining. Here we keep it simple and model it with the common TF-IDF representation plus linear regression:

```python
vectorizer = TfidfVectorizer(sublinear_tf=True, analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 2), min_df=5)
overview_text = vectorizer.fit_transform(train['overview'].fillna(''))
linreg = LinearRegression()
linreg.fit(overview_text, train['log_revenue'])
```
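The fit above uses all of train, so it says nothing about generalization. To get a rough sense of how predictive the overview text is on its own, a sketch with a held-out split (my addition; the split ratio and seed are arbitrary):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    overview_text, train['log_revenue'], test_size=0.2, random_state=42)
text_model = LinearRegression().fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, text_model.predict(X_val)) ** 0.5
print('overview-only validation RMSE (log revenue): %.3f' % rmse)
```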

Use eli5 to visualize how individual terms contribute to log_revenue:

```python
import eli5
print('Target value:', train['log_revenue'][5])
eli5.show_prediction(linreg, doc=train['overview'][5], vec=vectorizer)
```


The raw date feature is coarse; derive weekday, month, quarter, year, and similar features:

```python
def process_date(df):
    date_parts = ['year', 'weekday', 'month', 'weekofyear', 'day', 'quarter']
    for part in date_parts:
        df['release_date_' + part] = getattr(df['release_date'].dt, part).astype(int)
    return df

# note: pandas >= 1.1 deprecates .dt.weekofyear; .dt.isocalendar().week is the replacement
train = process_date(train)
test = process_date(test)
```
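A quick sanity check of the new columns (my own snippet); note that pandas encodes weekday as 0 = Monday through 6 = Sunday, which matters when reading the weekday plots below:

```python
# weekday: 0 = Monday ... 6 = Sunday (pandas convention)
print(train[['release_date', 'release_date_weekday', 'release_date_quarter']].head(3))
```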

First, the number of films released per year, using the interactive visualization library plotly:

```python
d1 = train['release_date_year'].value_counts().sort_index()
d2 = test['release_date_year'].value_counts().sort_index()
py.init_notebook_mode(connected=True)
data = [go.Scatter(x=d1.index, y=d1.values, name='train'),
        go.Scatter(x=d2.index, y=d2.values, name='test')]
layout = go.Layout(dict(title='Number of films per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
```


Total releases versus total revenue per year:

```python
d1 = train['release_date_year'].value_counts().sort_index()
d2 = train.groupby(['release_date_year'])['revenue'].sum()
data = [go.Scatter(x=d1.index, y=d1.values, name='Count'),
        go.Scatter(x=d2.index, y=d2.values, name='overall_revenue', yaxis='y2')]
layout = go.Layout(dict(title='Number of films and total revenue per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count'),
                        yaxis2=dict(title='Total revenue', overlaying='y', side='right')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
```


Total releases versus average revenue per year (average revenue seems to level off after 2000):

```python
d1 = train['release_date_year'].value_counts().sort_index()
d2 = train.groupby(['release_date_year'])['revenue'].mean()
data = [go.Scatter(x=d1.index, y=d1.values, name='Count'),
        go.Scatter(x=d2.index, y=d2.values, name='Average revenue', yaxis='y2')]
layout = go.Layout(dict(title='Number of films and average revenue per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count'),
                        yaxis2=dict(title='Average revenue', overlaying='y', side='right')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
```

Does the day of the week of release relate to revenue?

```python
sns.catplot(x='release_date_weekday', y='revenue', data=train)
plt.title('Revenue on different days of week of release')
```


And the box plot:

```python
sns.boxplot(x='release_date_weekday', y='log_revenue', data=train)
plt.title('Revenue on different days of week of release')
```

Observation: many films released Monday through Wednesday have high revenue, while Saturday releases earn noticeably less.

Analyze the tagline text: which words appear most frequently?

```python
plt.figure(figsize=(12, 12))
text_tagline = ' '.join(train['tagline'].fillna(''))
wordcloud_tagline = WordCloud(max_font_size=None, background_color='white',
                              width=1200, height=1000).generate_from_text(text_tagline)
plt.imshow(wordcloud_tagline)
plt.title('Top words in tagline')
plt.axis('off')
plt.show()
```

The effect of has_collection (whether the film belongs to a franchise) on revenue:

```python
sns.boxplot(x='has_collection', y='log_revenue', data=train)
```


Films that belong to a collection have higher average revenue.

The relationship between the number of genres and revenue:

```python
train['num_genres'].value_counts()
sns.catplot(x='num_genres', y='revenue', data=train)
```


Films with around three genres tend to earn the most; piling on more genres does not help.

Finally, production companies versus revenue, with one distribution plot per company:

```python
f, axes = plt.subplots(6, 5, figsize=(24, 32))
plt.suptitle('Violin of revenue vs production company')
for i, e in enumerate([c for c in train.columns if 'production_company_' in c]):
    sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5])
```

Modeling

First, drop features that are irrelevant to the model:

```python
train = train.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status', 'log_revenue'], axis=1)
test = test.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status'], axis=1)
```

Then drop features that have only a single unique value:

```python
for col in train.columns:
    if train[col].nunique() == 1:
        print(col)
        # drop() is not in-place, so assign the result back; mirror the drop on test
        train = train.drop([col], axis=1)
        test = test.drop([col], axis=1)
```

Encode the categorical labels:

```python
for col in ['original_language', 'collection_name', 'all_genres']:
    le = LabelEncoder()
    # fit on the combined train+test vocabulary so transform never meets an unseen label
    le.fit(list(train[col].fillna('').astype(str)) + list(test[col].fillna('').astype(str)))
    train[col] = le.transform(train[col].fillna('').astype(str))
    test[col] = le.transform(test[col].fillna('').astype(str))
```

Convert the text fields into length and word-count features:

```python
train_texts = train[['title', 'tagline', 'overview', 'original_title']]
test_texts = test[['title', 'tagline', 'overview', 'original_title']]
for col in ['title', 'tagline', 'overview', 'original_title']:
    train['len_' + col] = train[col].fillna('').apply(lambda x: len(str(x)))
    train['words_' + col] = train[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    test['len_' + col] = test[col].fillna('').apply(lambda x: len(str(x)))
    test['words_' + col] = test[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    train = train.drop(col, axis=1)
    test = test.drop(col, axis=1)
```

Assemble the training and test matrices:

```python
X = train.drop(['id', 'revenue'], axis=1)
y = np.log1p(train['revenue'])
X_test = test.drop(['id'], axis=1)
```

Train the model:

```python
# split off a 10% validation set (named X_val/y_val so the competition matrix X_test is not shadowed)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)
# rmse: root mean squared error, (sum(d^2)/N)^0.5
params = {'num_leaves': 30,
          'min_data_in_leaf': 20,
          'objective': 'regression',
          'max_depth': 5,
          'learning_rate': 0.01,
          'boosting': 'gbdt',
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.9,
          'bagging_seed': 11,
          'metric': 'rmse',
          'lambda_l1': 0.2,
          'verbosity': -1}
model1 = lgb.LGBMRegressor(**params, n_estimators=20000, n_jobs=4)
# on lightgbm >= 4.0, replace verbose/early_stopping_rounds with
# callbacks=[lgb.early_stopping(200), lgb.log_evaluation(1000)]
model1.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_val, y_val)],
           eval_metric='rmse', verbose=1000, early_stopping_rounds=200)
```

Training until validation scores don't improve for 200 rounds.
[1000] training's rmse: 1.42756    valid_1's rmse: 2.07259
Early stopping, best iteration is:
[1118] training's rmse: 1.38621    valid_1's rmse: 2.06726

After training, look at the feature importances:
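LightGBM ships a plotting helper that works directly on the fitted regressor; a minimal sketch (max_num_features and the figure size are my choices):

```python
lgb.plot_importance(model1, max_num_features=20, figsize=(10, 8))
plt.title('LightGBM feature importance')
plt.show()
```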
