How Hard Is It to Be a Real Data Scientist
Data Science and Machine Learning are hard sports to play. It’s difficult enough to motivate yourself to sit down and learn some maths, let alone to become an expert on the matter.
I began my journey into machine learning with a prediction problem. I was tasked with predicting a variable, but had around 100 other variables I could use. As a fresh graduate, I understandably took this as a regression problem and despite my colleagues being seemingly impressed, in all honesty, my result was pretty bad. I knew I could do better.
From there, I read, I experimented, I read some more, then experimented some more. This led to a bit of a journey where I quit that job, went back into education, then back into industry, and along the way I’ve been lucky enough to work with people who’ve shaped the field of Artificial Intelligence.
In what follows, I present 5 difficulties that Machine Learning practitioners and Data Scientists deal with on a daily basis. I offer sympathy to those who need it!
Difficulty 1: Adapting to the Problem Domain
How many mathematicians study Linguistics? How many mathematicians study Healthcare? So why are we any good at solving problems in these fields?
The art of being a Mathematician comes from the ability to abstract a problem in a manner that makes it solvable. In Linguistics, we can treat each “phone” as a discrete variable and create a model that determines the joint distribution between each phone. In Healthcare, we can build a model that picks up latent features in X-rays that discern a disease.
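To make the linguistics abstraction a little more concrete, here is a minimal sketch in Python. It assumes a made-up sequence of phones (the symbols in `phones` are hypothetical, not real transcription data) and estimates the joint distribution over adjacent phone pairs by simple counting. A real model would be far richer; this only shows the "treat each phone as a discrete variable" step.

```python
from collections import Counter

# Hypothetical toy sequence of phones; the symbols are illustrative only.
phones = ["k", "ae", "t", "s", "ae", "t", "k", "ae", "b"]

# Pair each phone with the one that follows it.
pairs = list(zip(phones, phones[1:]))

# Count how often each adjacent pair occurs.
pair_counts = Counter(pairs)
total_pairs = sum(pair_counts.values())

# Empirical joint distribution P(current phone, next phone).
joint = {pair: count / total_pairs for pair, count in pair_counts.items()}

for pair, probability in sorted(joint.items()):
    print(pair, round(probability, 3))
```

Counting is the crudest possible estimator, but it shows why the abstraction is useful: once speech is a sequence of discrete symbols, standard probability tools apply.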
But dude, it’s tough.
To be a successful machine learning researcher, you have to be willing to put in the time and effort to fully immerse yourself in the domain knowledge. Many of the game-changers in the field have broken ground in areas they already had experience in. DeepMind’s founder Demis Hassabis ran a games company before returning to UCL to study Neuroscience, which ultimately led to his work in Reinforcement Learning and his advances in games like Atari and Go.
Not all of us are as fortunate as Demis in having a background in the field we’re trying to revolutionise. Often a project comes up at work that we have to try to figure out, and the next week we may have another task entirely. Project switching has its pros and cons, but ultimately you pay for it in the depth you can reach.
It definitely helps if you know a little bit about your niche before you apply some ML, but for what it’s worth, I sympathise with your struggles.
Difficulty 2: Identifying and Ignoring the Noise
Few problems in statistics, machine learning and data science are as pervasive as noise. From dirty data, to rogue data points, to literature built on weak foundations, to models capturing latent bias: it is literally everywhere.
Machine Learning models generally learn by minimising the sum of squared errors (or some form of misclassification measure). But when you’re researching a new topic or getting feedback from a colleague, noise can be pretty hard to define, and the last thing you want to do is disappear down a rabbit hole.
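As a small illustration of why noise matters even in the well-defined case, here is a minimal sketch in Python. It assumes the simplest possible model, a single constant prediction, for which the value minimising the sum of squared errors is the mean. The data and the helper names (`sum_squared_errors`, `mean`) are made up for illustration.

```python
def sum_squared_errors(values, prediction):
    """Sum of squared differences between each value and one prediction."""
    return sum((value - prediction) ** 2 for value in values)


def mean(values):
    """The constant that minimises the sum of squared errors."""
    return sum(values) / len(values)


# Hypothetical measurements, plus the same data with one rogue point added.
clean = [10.0, 11.0, 9.0, 10.0, 10.0]
noisy = clean + [100.0]

clean_fit = mean(clean)
noisy_fit = mean(noisy)

print("fit on clean data:", clean_fit)
print("fit on noisy data:", noisy_fit)
print("error of clean fit on clean data:", sum_squared_errors(clean, clean_fit))
print("error of noisy fit on clean data:", sum_squared_errors(clean, noisy_fit))
```

Because errors are squared, a single rogue point drags the fit a long way from the bulk of the data, which is one reason spotting noise before you model it pays off.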
There are a few ways to get around it:
- Speak to reliable people often, keep them close
- Learn how to spot nonsense, keep it at a distance
- Fail often, fail quickly
Experiment more, speak to people more, try more things and eventually you’ll begin to recognise and ‘smell’ noise. You’ll avoid it and progress more quickly.
As an example: many algorithms have a high accuracy rating simply because the dependent variable happens so infrequently. E.g. a model which predicts whether each person in London gets struck by lightning on a given day will almost certainly be 99.9999% correct without any training, just by always answering ‘no’. Dealing with the “noise” here means recognising that people don’t get struck by lightning that often, and adjusting your model for it.
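The effect is easy to reproduce. The following is a minimal sketch in Python using made-up data (100 positive cases in 100,000, far more frequent than real lightning strikes) and a hypothetical baseline that always predicts "no". The helpers `accuracy` and `recall` are written out by hand for illustration rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positive cases that the model caught."""
    actual_positives = sum(1 for t in y_true if t == 1)
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1
    )
    return true_positives / actual_positives if actual_positives else 0.0


# Made-up labels: 1 means the rare event happened, 0 means it did not.
y_true = [1] * 100 + [0] * 99_900

# Untrained baseline: always predict that the rare event does not happen.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)

print("accuracy:", baseline_accuracy)
print("recall:", baseline_recall)
```

The baseline scores very high on accuracy while catching none of the events it is supposed to detect, so for rare outcomes a measure such as recall is usually more informative than accuracy alone.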
Difficulty 3: Getting a Good Education
Education is so important in this field because the domain of knowledge required is so broad. From computer science, to maths, to algorithms, to statistics: there’s a lot to cover in a relatively short amount of time.
Formal education (like University) is one thing, but education in machine learning goes well beyond that. Practitioners have to develop the ability to learn things quickly by themselves and to implement them well.
The reason why this is so important (and so difficult) is that it’s tempting at times to find a GitHub repository where someone else has spent time solving the same problem you have, pull their code and apply it to your problem. The solution may look OK, but plenty of things can get missed along the way, and there’s no substitute for having the fundamental understanding.
Difficulty 4: Publishing Negative Results
Negative results happen all the time. They’re hard, but they happen. You have to recognise that negative results are also results and that they should be welcomed.
Machine Learning has two sides to it: the theoretical and the applied. Theorists publish less frequently in the hope of making a bigger splash, while applied academics tend to publish more often but solve bigger problems.
However, in the pursuit of experimentation or publication, a lot of negative results are put to one side and rarely discussed. This leads to other practitioners repeating the same experiments and, in aggregate, a lot of time is wasted. This inefficiency also breeds a form of ego where people are respected only for the ‘positive’ results they’ve discovered, rather than for the results they can confirm to be simply incomplete.
Everyone benefits if we can classify problems better.
Difficulty 5: Keeping on Top of the Research
Did I mention that there’s a lot of it?
Google has published over 340 publications THIS YEAR as of writing this article.
Google don’t mess around either: their research is always very good. Add to that all the other publications and universities in the world: how am I meant to keep on top of all this research?
You kind of…just…have to find a way.
I read a lot and spend most of my day looking out for new approaches and methodologies to solve the problems I’m facing. But at times you can get lost in a swathe of research, or never find the right articles at all, because there’s so much of it that it’s hard to identify what’s useful.
Using citations is a great way to filter research, and staying on top of the most cited papers every year definitely helps. But to find an ‘edge’ or to discover ‘novel’ applications of models, you just have to do the legwork and read as much as you can.
Ultimately, in my opinion, to be a successful Machine Learning Researcher or Data Scientist, you need to be able to teach yourself. You just have to find a reason to know how a neural network works or why a Random Forest sucks in some cases, and use this to drive your understanding.
The reason is that it’s such a multi-disciplinary subject, and it moves in leaps and bounds every year. I graduated from my master’s program in 2016 and since then the whole AI sphere has been reinvented 3 times over.
Thanks for reading! If you have any messages, please let me know!
Keep up to date with my latest articles here!
Original article: https://medium.com/swlh/how-hard-is-it-to-be-a-real-data-scientist-85ab88f451f