
Data Analysis - Kaggle TMDB Box Office Prediction


    • Environment Setup
    • Dataset
    • Main Content
    • Data Preprocessing
    • Exploratory Data Analysis
    • Modeling

Environment Setup

Environment used:

  • Windows 10
  • Python 3.7.2
  • Jupyter Notebook (all code was tested there)

Dataset

https://www.kaggle.com/c/tmdb-box-office-prediction/data

Main Content

Before getting started, import the third-party libraries:

```python
import pandas as pd
pd.set_option('display.max_columns', None)  # show every column when previewing wide frames
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
from wordcloud import WordCloud
plt.style.use('ggplot')
import ast
from collections import Counter
import numpy as np
from sklearn.preprocessing import LabelEncoder
# text mining
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LinearRegression
# modeling
from sklearn.model_selection import train_test_split
import lightgbm as lgb
```

Load the data:

```python
train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')
```

Take a quick look at the data:

```python
train.head()
```

[Output: a preview of the 23 columns: id, belongs_to_collection, budget, genres, homepage, imdb_id, original_language, original_title, overview, popularity, poster_path, production_companies, production_countries, release_date, runtime, spoken_languages, status, tagline, title, Keywords, cast, crew, revenue. Columns such as belongs_to_collection, genres, production_companies, cast, and crew hold JSON-like strings, e.g. [{'id': 35, 'name': 'Comedy'}].]

Dataset sizes; the dataset is quite small:

```python
print(train.shape)
print(test.shape)
```

(3000, 23)
(4398, 22)

Data Preprocessing

The preview above shows several columns holding JSON-like data, which must be converted into a workable format. These strings are actually written in Python literal syntax (single-quoted), so rather than going through a JSON parser, ast.literal_eval can turn each string directly into a list of dictionaries; that is the approach used here:

```python
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def json_to_dict(df):
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x))
    return df

train = json_to_dict(train)
test = json_to_dict(test)
```
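As a quick sanity check (my own snippet, with a made-up string in the same format as the genres column), ast.literal_eval turns one of these strings into a list of dictionaries:

```python
import ast

# hypothetical sample string in the same format as the `genres` column
sample = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"
parsed = ast.literal_eval(sample)
print(parsed[0]['name'])  # -> Comedy
```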

Next, turn these irregular fields into features. There are two kinds: label extraction and encoding (key cast members, genres, collections, production companies, and so on), and label counts (number of genres, number of cast members, keyword counts, etc.). Note that because the dataset is small, only the top labels should be one-hot encoded, to avoid overfitting:

```python
# collections
train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
train = train.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)

# genres
list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {} else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(15)]
for g in top_genres:
    train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {} else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_genres:
    test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['genres'], axis=1)
test = test.drop(['genres'], axis=1)

# production companies
list_of_companies = list(train['production_companies'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_companies'] = train['production_companies'].apply(lambda x: len(x) if x != {} else 0)
train['all_production_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_companies = [m[0] for m in Counter([i for j in list_of_companies for i in j]).most_common(30)]
for g in top_companies:
    train['production_company_' + g] = train['all_production_companies'].apply(lambda x: 1 if g in x else 0)
test['num_companies'] = test['production_companies'].apply(lambda x: len(x) if x != {} else 0)
test['all_production_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_companies:
    test['production_company_' + g] = test['all_production_companies'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['production_companies', 'all_production_companies'], axis=1)
test = test.drop(['production_companies', 'all_production_companies'], axis=1)

# production countries
list_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_countries'] = train['production_countries'].apply(lambda x: len(x) if x != {} else 0)
train['all_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(25)]
for g in top_countries:
    train['production_country_' + g] = train['all_countries'].apply(lambda x: 1 if g in x else 0)
test['num_countries'] = test['production_countries'].apply(lambda x: len(x) if x != {} else 0)
test['all_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_countries:
    test['production_country_' + g] = test['all_countries'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['production_countries', 'all_countries'], axis=1)
test = test.drop(['production_countries', 'all_countries'], axis=1)

# spoken languages
list_of_languages = list(train['spoken_languages'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_languages'] = train['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
train['all_languages'] = train['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_languages = [m[0] for m in Counter([i for j in list_of_languages for i in j]).most_common(30)]
for g in top_languages:
    train['language_' + g] = train['all_languages'].apply(lambda x: 1 if g in x else 0)
test['num_languages'] = test['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
test['all_languages'] = test['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_languages:
    test['language_' + g] = test['all_languages'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['spoken_languages', 'all_languages'], axis=1)
test = test.drop(['spoken_languages', 'all_languages'], axis=1)

# keywords
list_of_keywords = list(train['Keywords'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_Keywords'] = train['Keywords'].apply(lambda x: len(x) if x != {} else 0)
train['all_Keywords'] = train['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_keywords = [m[0] for m in Counter([i for j in list_of_keywords for i in j]).most_common(30)]
for g in top_keywords:
    train['keyword_' + g] = train['all_Keywords'].apply(lambda x: 1 if g in x else 0)
test['num_Keywords'] = test['Keywords'].apply(lambda x: len(x) if x != {} else 0)
test['all_Keywords'] = test['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_keywords:
    test['keyword_' + g] = test['all_Keywords'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['Keywords', 'all_Keywords'], axis=1)
test = test.drop(['Keywords', 'all_Keywords'], axis=1)

# cast
list_of_cast_names = list(train['cast'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
list_of_cast_genders = list(train['cast'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
list_of_cast_characters = list(train['cast'].apply(lambda x: [i['character'] for i in x] if x != {} else []).values)
train['num_cast'] = train['cast'].apply(lambda x: len(x) if x != {} else 0)
top_cast_names = [m[0] for m in Counter([i for j in list_of_cast_names for i in j]).most_common(15)]
for g in top_cast_names:
    train['cast_name_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
train['genders_0_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
train['genders_1_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
train['genders_2_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
top_cast_characters = [m[0] for m in Counter([i for j in list_of_cast_characters for i in j]).most_common(15)]
for g in top_cast_characters:
    train['cast_character_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
test['num_cast'] = test['cast'].apply(lambda x: len(x) if x != {} else 0)
for g in top_cast_names:
    test['cast_name_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
test['genders_0_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
test['genders_1_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
test['genders_2_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
for g in top_cast_characters:
    test['cast_character_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
train = train.drop(['cast'], axis=1)
test = test.drop(['cast'], axis=1)

# crew
list_of_crew_names = list(train['crew'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
list_of_crew_jobs = list(train['crew'].apply(lambda x: [i['job'] for i in x] if x != {} else []).values)
list_of_crew_genders = list(train['crew'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
list_of_crew_departments = list(train['crew'].apply(lambda x: [i['department'] for i in x] if x != {} else []).values)
train['num_crew'] = train['crew'].apply(lambda x: len(x) if x != {} else 0)
top_crew_names = [m[0] for m in Counter([i for j in list_of_crew_names for i in j]).most_common(15)]
for g in top_crew_names:
    train['crew_name_' + g] = train['crew'].apply(lambda x: 1 if g in str(x) else 0)
train['genders_0_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
train['genders_1_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
train['genders_2_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
top_crew_jobs = [m[0] for m in Counter([i for j in list_of_crew_jobs for i in j]).most_common(15)]
for j in top_crew_jobs:
    train['jobs_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
top_crew_departments = [m[0] for m in Counter([i for j in list_of_crew_departments for i in j]).most_common(15)]
for j in top_crew_departments:
    train['departments_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j]))
test['num_crew'] = test['crew'].apply(lambda x: len(x) if x != {} else 0)
for g in top_crew_names:
    test['crew_name_' + g] = test['crew'].apply(lambda x: 1 if g in str(x) else 0)
test['genders_0_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
test['genders_1_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
test['genders_2_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
for j in top_crew_jobs:
    test['jobs_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
for j in top_crew_departments:
    test['departments_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j]))
train = train.drop(['crew'], axis=1)
test = test.drop(['crew'], axis=1)
```
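Every dictionary column above goes through the same pattern: count the entries, join the names, and one-hot encode the top-K labels. The repetition could be factored into a single helper; here is a minimal sketch (the helper name and its parameters are my own, not from the original notebook):

```python
from collections import Counter

def encode_top_k(train_df, test_df, column, prefix, k):
    """One-hot encode the k most frequent 'name' labels of a list-of-dicts column."""
    # build the frequency table from the training set only
    names = train_df[column].apply(lambda x: [i['name'] for i in x] if x != {} else [])
    top = [m[0] for m in Counter([n for row in names for n in row]).most_common(k)]
    for df in (train_df, test_df):
        df['num_' + column] = df[column].apply(lambda x: len(x) if x != {} else 0)
        joined = df[column].apply(lambda x: ' '.join(sorted(i['name'] for i in x)) if x != {} else '')
        for label in top:
            df[prefix + label] = joined.apply(lambda s: 1 if label in s else 0)

# e.g. encode_top_k(train, test, 'genres', 'genre_', 15)
```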

Preview the data after feature processing:

```python
train.head()
```

[Output: the frame now holds the remaining original fields plus all engineered columns: collection_name, has_collection, num_genres, all_genres and 15 genre_* flags, num_companies and 30 production_company_* flags, num_countries and 25 production_country_* flags, num_languages and 30 language_* flags, num_Keywords and 30 keyword_* flags, num_cast with cast_name_* and cast_character_* flags, genders_0/1/2_cast counts, num_crew with crew_name_* flags, genders_0/1/2_crew, jobs_* and departments_* counts.]

This gives the expected result, but release_date is still a string; convert it to a proper date format:

```python
def fix_date(x):
    """Expand two-digit years: years <= 19 become 20xx, the rest 19xx."""
    year = x.split('/')[2]
    if int(year) <= 19:
        return x[:-2] + '20' + year
    else:
        return x[:-2] + '19' + year

test.loc[test['release_date'].isnull() == True, 'release_date'] = '01/01/98'
train['release_date'] = train['release_date'].apply(lambda x: fix_date(x))
test['release_date'] = test['release_date'].apply(lambda x: fix_date(x))
train['release_date'] = pd.to_datetime(train['release_date'])
test['release_date'] = pd.to_datetime(test['release_date'])
```
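A quick check of the century cutoff (my own snippet; the raw dates are month/day/two-digit-year strings):

```python
print(fix_date('2/20/15'))  # -> 2/20/2015 (15 <= 19, mapped to 20xx)
print(fix_date('8/6/98'))   # -> 8/6/1998  (98 > 19, mapped to 19xx)
```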

Exploratory Data Analysis

First, look at the budget distribution. Most values are small and the distribution is heavily skewed, so a log transform is in order to spread out the small values:

```python
plt.hist(train['budget'])
plt.title('budget distribution')
```

```python
plt.hist(np.log1p(train['budget']))
plt.title('log1p budget distribution')
```
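To quantify the effect rather than just eyeball it, pandas' skew() can compare the two distributions (my own check; log1p is used instead of log so that zero budgets map to 0):

```python
print(train['budget'].skew())            # raw budgets: strongly right-skewed
print(np.log1p(train['budget']).skew())  # after log1p: much closer to symmetric
```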


Revenue obviously gets the same treatment:

```python
# log_budget, normalization
train['log_budget'] = np.log1p(train['budget'])
test['log_budget'] = np.log1p(test['budget'])
# log_revenue, normalization (train only; test has no revenue column)
train['log_revenue'] = np.log1p(train['revenue'])
```

Next, homepage: convert it into a boolean flag, since having an official homepage is itself a sign of a well-backed production:

```python
train['has_homepage'] = 0
train.loc[train['homepage'].isnull() == False, 'has_homepage'] = 1
test['has_homepage'] = 0
test.loc[test['homepage'].isnull() == False, 'has_homepage'] = 1
```

Revenue split by homepage presence; films with a homepage earn more:

```python
sns.catplot(x='has_homepage', y='revenue', data=train)
plt.title('Revenue for film with and without homepage')
```


Revenue by original language:

```python
top10 = train['original_language'].value_counts().head(10).index
sns.boxplot(x='original_language', y='log_revenue',
            data=train.loc[train['original_language'].isin(top10)])
plt.title("Log_Revenue VS Original_language")
```


English-language films include many hits but also many flops; other languages produce successful films too, so overall the differences are small.

The overview column calls for text mining. Here we keep it simple and model it with the common TF-IDF representation plus linear regression:

```python
vectorizer = TfidfVectorizer(sublinear_tf=True, analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 2), min_df=5)
overview_text = vectorizer.fit_transform(train['overview'].fillna(''))
linreg = LinearRegression()
linreg.fit(overview_text, train['log_revenue'])
```
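The fit above uses all of train, so it says nothing about generalization. To get a rough sense of how predictive the overview text is on its own, a sketch with a held-out split (my addition; the split ratio and seed are arbitrary):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    overview_text, train['log_revenue'], test_size=0.2, random_state=42)
text_model = LinearRegression().fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, text_model.predict(X_val)) ** 0.5
print('overview-only validation RMSE (log revenue): %.3f' % rmse)
```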

Use eli5 to visualize how individual terms contribute to log_revenue:

```python
import eli5
print('Target value:', train['log_revenue'][5])
eli5.show_prediction(linreg, doc=train['overview'][5], vec=vectorizer)
```


The raw date feature is coarse; derive weekday, month, quarter, year, and similar features:

```python
def process_date(df):
    date_parts = ['year', 'weekday', 'month', 'weekofyear', 'day', 'quarter']
    for part in date_parts:
        df['release_date_' + part] = getattr(df['release_date'].dt, part).astype(int)
    return df

# note: pandas >= 1.1 deprecates .dt.weekofyear; .dt.isocalendar().week is the replacement
train = process_date(train)
test = process_date(test)
```
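A quick sanity check of the new columns (my own snippet); note that pandas encodes weekday as 0 = Monday through 6 = Sunday, which matters when reading the weekday plots below:

```python
# weekday: 0 = Monday ... 6 = Sunday (pandas convention)
print(train[['release_date', 'release_date_weekday', 'release_date_quarter']].head(3))
```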

First, the number of films released per year, using the interactive visualization library plotly:

```python
d1 = train['release_date_year'].value_counts().sort_index()
d2 = test['release_date_year'].value_counts().sort_index()
py.init_notebook_mode(connected=True)
data = [go.Scatter(x=d1.index, y=d1.values, name='train'),
        go.Scatter(x=d2.index, y=d2.values, name='test')]
layout = go.Layout(dict(title='Number of films per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
```


Total releases versus total revenue per year:

```python
d1 = train['release_date_year'].value_counts().sort_index()
d2 = train.groupby(['release_date_year'])['revenue'].sum()
data = [go.Scatter(x=d1.index, y=d1.values, name='Count'),
        go.Scatter(x=d2.index, y=d2.values, name='overall_revenue', yaxis='y2')]
layout = go.Layout(dict(title='Number of films and total revenue per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count'),
                        yaxis2=dict(title='Total revenue', overlaying='y', side='right')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
```


Total releases versus average revenue per year (average revenue seems to level off after 2000):

```python
d1 = train['release_date_year'].value_counts().sort_index()
d2 = train.groupby(['release_date_year'])['revenue'].mean()
data = [go.Scatter(x=d1.index, y=d1.values, name='Count'),
        go.Scatter(x=d2.index, y=d2.values, name='Average revenue', yaxis='y2')]
layout = go.Layout(dict(title='Number of films and average revenue per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count'),
                        yaxis2=dict(title='Average revenue', overlaying='y', side='right')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
```

Does the day of the week of release relate to revenue?

```python
sns.catplot(x='release_date_weekday', y='revenue', data=train)
plt.title('Revenue on different days of week of release')
```


And the box plot:

```python
sns.boxplot(x='release_date_weekday', y='log_revenue', data=train)
plt.title('Revenue on different days of week of release')
```

Observation: many films released Monday through Wednesday have high revenue, while Saturday releases earn noticeably less.

Analyze the tagline text: which words appear most frequently?

```python
plt.figure(figsize=(12, 12))
text_tagline = ' '.join(train['tagline'].fillna(''))
wordcloud_tagline = WordCloud(max_font_size=None, background_color='white',
                              width=1200, height=1000).generate_from_text(text_tagline)
plt.imshow(wordcloud_tagline)
plt.title('Top words in tagline')
plt.axis('off')
plt.show()
```

The effect of has_collection (whether the film belongs to a franchise) on revenue:

```python
sns.boxplot(x='has_collection', y='log_revenue', data=train)
```


Films that belong to a collection have higher average revenue.

The relationship between the number of genres and revenue:

```python
train['num_genres'].value_counts()
sns.catplot(x='num_genres', y='revenue', data=train)
```


Films with around three genres tend to earn the most; piling on more genres does not help.

Finally, production companies versus revenue, with one distribution plot per company:

```python
f, axes = plt.subplots(6, 5, figsize=(24, 32))
plt.suptitle('Violin of revenue vs production company')
for i, e in enumerate([c for c in train.columns if 'production_company_' in c]):
    sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5])
```

Modeling

First, drop features that are irrelevant to the model:

```python
train = train.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status', 'log_revenue'], axis=1)
test = test.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status'], axis=1)
```

Then drop features that have only a single unique value:

```python
for col in train.columns:
    if train[col].nunique() == 1:
        print(col)
        # drop() is not in-place, so assign the result back; mirror the drop on test
        train = train.drop([col], axis=1)
        test = test.drop([col], axis=1)
```

Encode the categorical labels:

```python
for col in ['original_language', 'collection_name', 'all_genres']:
    le = LabelEncoder()
    # fit on the combined train+test vocabulary so transform never meets an unseen label
    le.fit(list(train[col].fillna('').astype(str)) + list(test[col].fillna('').astype(str)))
    train[col] = le.transform(train[col].fillna('').astype(str))
    test[col] = le.transform(test[col].fillna('').astype(str))
```

Convert the text fields into length and word-count features:

```python
train_texts = train[['title', 'tagline', 'overview', 'original_title']]
test_texts = test[['title', 'tagline', 'overview', 'original_title']]
for col in ['title', 'tagline', 'overview', 'original_title']:
    train['len_' + col] = train[col].fillna('').apply(lambda x: len(str(x)))
    train['words_' + col] = train[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    test['len_' + col] = test[col].fillna('').apply(lambda x: len(str(x)))
    test['words_' + col] = test[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    train = train.drop(col, axis=1)
    test = test.drop(col, axis=1)
```

Assemble the training and test matrices:

```python
X = train.drop(['id', 'revenue'], axis=1)
y = np.log1p(train['revenue'])
X_test = test.drop(['id'], axis=1)
```

Train the model:

```python
# split off a 10% validation set (named X_val/y_val so the competition matrix X_test is not shadowed)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)
# rmse: root mean squared error, (sum(d^2)/N)^0.5
params = {'num_leaves': 30,
          'min_data_in_leaf': 20,
          'objective': 'regression',
          'max_depth': 5,
          'learning_rate': 0.01,
          'boosting': 'gbdt',
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.9,
          'bagging_seed': 11,
          'metric': 'rmse',
          'lambda_l1': 0.2,
          'verbosity': -1}
model1 = lgb.LGBMRegressor(**params, n_estimators=20000, n_jobs=4)
# on lightgbm >= 4.0, replace verbose/early_stopping_rounds with
# callbacks=[lgb.early_stopping(200), lgb.log_evaluation(1000)]
model1.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_val, y_val)],
           eval_metric='rmse', verbose=1000, early_stopping_rounds=200)
```

Training until validation scores don't improve for 200 rounds.
[1000] training's rmse: 1.42756    valid_1's rmse: 2.07259
Early stopping, best iteration is:
[1118] training's rmse: 1.38621    valid_1's rmse: 2.06726

After training, look at the feature importances:
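LightGBM ships a plotting helper that works directly on the fitted regressor; a minimal sketch (max_num_features and the figure size are my choices):

```python
lgb.plot_importance(model1, max_num_features=20, figsize=(10, 8))
plt.title('LightGBM feature importance')
plt.show()
```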
