TMDB 5000 Movie Dataset
Metadata on ~5,000 movies from TMDb
What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?
This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.
We have removed the original version of this dataset per a DMCA takedown request from IMDB. To minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.
The good news is that:
- You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.
- The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
- Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDB's stars order.
- The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
- Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.
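Kernels that relied on the old "first three actors" layout can recover something similar from the full, credits-ordered cast list. The following is a minimal sketch, not the method used by the kernel mentioned above. It assumes each cast entry has already been decoded into a dict with a "name" key and a numeric "order" key (0 meaning first billed); the function name, both key names, and the sample data are hypothetical.

```python
from typing import Dict, List


def top_billed(cast: List[Dict], n: int = 3) -> List[str]:
    """Return the names of the first n cast members in credits order.

    Assumes each entry is a dict with a "name" key and a numeric "order"
    key (0 = first billed). Both key names are assumptions about the format.
    """
    # Entries without an "order" value are pushed to the end of the list.
    ordered = sorted(cast, key=lambda member: member.get("order", float("inf")))
    return [member["name"] for member in ordered[:n]]


# Hypothetical, hand-written sample; not taken from the dataset.
sample_cast = [
    {"name": "Actor C", "order": 2},
    {"name": "Actor A", "order": 0},
    {"name": "Actor D", "order": 3},
    {"name": "Actor B", "order": 1},
]

print(top_billed(sample_cast))  # ['Actor A', 'Actor B', 'Actor C']
```

Sorting explicitly, rather than trusting the stored order, keeps the result stable even if a kernel has shuffled or filtered the list beforehand.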
Data Source Transfer Details
- Several of the new columns contain JSON. You can save a bit of time by porting the load data functions [from this kernel]().
- Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDB shows the runtime of the original version.
- There's now a separate file containing the full credits for both the cast and crew.
- All fields are filled out by users, so don't expect them to agree on keywords, genres, ratings, or the like.
- Your existing kernels will continue to render normally until they are re-run.
- If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
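Because several columns hold JSON rather than plain values, each such cell has to be decoded before it can be used. The sketch below shows one way to do that with only the standard library. It is not the load function from the kernel referenced above; the sample rows, the "genres" column, the JSON keys, and the helper name are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical two-row sample in the shape the text describes: plain columns
# plus one column whose cells hold JSON. Real file and column names may differ.
SAMPLE_CSV = '''id,title,genres
1,Example One,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}]"
2,Example Two,[]
'''

JSON_COLUMNS = ["genres"]  # assumed list of JSON-encoded columns


def load_rows(handle, json_columns):
    """Read CSV rows and decode the JSON-encoded columns in place."""
    rows = []
    for row in csv.DictReader(handle):
        for column in json_columns:
            # An empty cell is treated as an empty list rather than an error.
            row[column] = json.loads(row[column]) if row[column] else []
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV), JSON_COLUMNS)
genre_names = [[genre["name"] for genre in row["genres"]] for row in rows]
print(genre_names)  # [['Action', 'Adventure'], []]
```

With pandas, the same idea is usually written as `df[column].apply(json.loads)` for each JSON column after reading the file.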
New columns:
- homepage
- id
- original_title
- overview
- popularity
- production_companies
- production_countries
- release_date
- spoken_languages
- status
- tagline
- vote_average
Lost columns:
- actor_1_facebook_likes
- actor_2_facebook_likes
- actor_3_facebook_likes
- aspect_ratio
- cast_total_facebook_likes
- color
- content_rating
- director_facebook_likes
- facenumber_in_poster
- movie_facebook_likes
- movie_imdb_link
- num_critic_for_reviews
- num_user_for_reviews
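When porting a kernel, it can help to check up front which of the columns it uses are on the lost list. The following is a small sketch of such a check. The helper name and the example column list are hypothetical, and the underscore-separated spellings are an assumption about how the lost columns were named in the old dataset.

```python
# Lost columns as listed above; the underscore spellings are assumed.
LOST_COLUMNS = {
    "actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes",
    "aspect_ratio", "cast_total_facebook_likes", "color", "content_rating",
    "director_facebook_likes", "facenumber_in_poster", "movie_facebook_likes",
    "movie_imdb_link", "num_critic_for_reviews", "num_user_for_reviews",
}


def audit_columns(used_columns):
    """Split the columns a kernel uses into lost ones and all others."""
    used = set(used_columns)
    return {
        "lost": sorted(used & LOST_COLUMNS),
        "other": sorted(used - LOST_COLUMNS),
    }


# Hypothetical list of columns referenced by an old kernel.
report = audit_columns(["color", "budget", "movie_imdb_link"])
print(report)  # {'lost': ['color', 'movie_imdb_link'], 'other': ['budget']}
```

Anything reported under "lost" needs to be dropped or replaced in the kernel. Columns under "other" may still need a check against the new format, since the audit only looks at names.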