Ideas for improving machine learning quality: how do you improve data quality for machine learning?
The ultimate goal of every data scientist or machine learning evangelist is to build a better model with higher predictive accuracy. However, while we chase hyperparameter tuning or better modeling algorithms, the data itself might actually be the culprit. There is a famous Chinese saying, “工欲善其事,必先利其器”, which literally translates to: to do a good job, an artisan needs the best tools. So if the data are of poor quality, the results will be subpar at best, regardless of how good the machine learning model is.
Why is data preparation so important?
Photo by Austin Distel on Unsplash
It is no secret that data preparation is ‘an essential but unsexy’ task in data analytics, and more than half of data scientists regard cleaning and organizing data as the least enjoyable part of their work.
Multiple surveys of data scientists and experts have confirmed the common 80/20 trope: 80% of the time is mired in the mundane janitorial work of preparing data, from collecting and cleaning to finding insights in the data (data wrangling or munging), leaving only 20% for the actual analytic work of modeling and building algorithms.
Thus, the Achilles heel of a data analytics process is the unjustifiable amount of time spent on data preparation alone. For data scientists, this can be a big hurdle to productivity when building a meaningful model. For businesses, it can be a huge drain on resources, since only the remaining one-fifth of the investment in data analytics goes to its original intent.
Heard of GIGO (garbage in, garbage out)? This is exactly what happens here. A data scientist arrives at a task with a given set of data and is expected to build the best model to fulfill the goal of the task. But halfway through the assignment, they realize that no matter how good the model is, the results never get better. After some back-and-forth they find lapses in data quality and start scrubbing the data to make them “clean and usable”. By the time the data are finally fit for use, the deadline is creeping up and resources are draining away, leaving limited time to build and refine the actual model they were hired for.
This is akin to a product recall. When defects are discovered in products already on the market, it is often too late to remedy and products have to be recalled to ensure the public safety of consumers. In most cases, the defects are results of negligence in quality control of the components or ingredients used in the supply chain. For example, laptops being recalled due to battery issues or chocolates being recalled due to contamination in the dairy produce. Be it a physical or digital product, the staggering similarity we see here is that it is always the raw material taking the blame.
But if data quality is a problem, why not just improve it?
To answer this question, we first have to understand what data quality is.
The definition of data quality has two aspects. First, independent quality: a measure of the agreement between the data views presented and the same data in the real world, based on inherent characteristics and features. Second, application-dependent quality: a measure of how well the data conform to user needs for the intended purpose.
Let’s say you are a university recruiter trying to recruit fresh grads for entry-level jobs. You have a pretty accurate contact list, but as you go through it you realize that most of the contacts are people over 50 years old, which makes them unsuitable to approach. Applying the definition, this scenario fulfills only the first half: the list is accurate and consists of good data. But it does not meet the second criterion: the data, no matter how accurate, are not suitable for the application.
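The gap between the two halves of the definition can be made concrete with a small sketch. Everything in it is an assumption for illustration: the `Contact` fields, the stand-in accuracy check (a real one would compare each record against a trusted source), and the `max_age` cut-off used to approximate "fresh grad".

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    email: str
    age: int


def is_accurate(c: Contact) -> bool:
    # Stand-in for an intrinsic-quality check (assumption): a real check
    # would verify the record against the real world, not just its shape.
    return "@" in c.email and 0 < c.age < 120


def is_fit_for_campus_recruiting(c: Contact, max_age: int = 30) -> bool:
    # Application-dependent check; max_age is a hypothetical cut-off.
    return c.age <= max_age


# Hypothetical contact list: every record is valid, most are over 50.
contacts = [
    Contact("A", "a@example.com", 54),
    Contact("B", "b@example.com", 61),
    Contact("C", "c@example.com", 23),
    Contact("D", "d@example.com", 58),
]

accuracy = sum(is_accurate(c) for c in contacts) / len(contacts)
fitness = sum(is_fit_for_campus_recruiting(c) for c in contacts) / len(contacts)

print(f"independent quality (accuracy): {accuracy:.2f}")
print(f"application-dependent quality:  {fitness:.2f}")
```

With this toy list the first score is perfect while the second is low, which is the recruiter's situation: good data, wrong purpose.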
In this example, accuracy is the dimension we use to assess the inherent quality of the data. There are many more dimensions. To give you an idea of which ones are commonly studied in peer-reviewed literature, here is a histogram showing the top 6 dimensions found after studying 15 different data quality assessment methodologies involving 32 dimensions.
A systemic approach to Data Quality Assessment
If you fail to plan, you plan to fail. A good systemic approach cannot succeed without good planning. To have a good plan, you need a thorough understanding of the business, especially of the problems associated with data quality. In the previous example, one should be aware that the contact list, albeit correct, has a data quality problem: it cannot be used to achieve the goal of the assigned task.
After the problems become clear, data quality dimensions to be investigated should be defined. This can be done using an empirical approach like surveys among stakeholders to find out which dimension matters the most in reference to the data quality problems.
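One simple way to turn such a survey into a ranking is to tally the votes. The sketch below assumes a made-up survey in which each stakeholder names the two dimensions they care about most; the stakeholder roles and the `timeliness` dimension are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses: stakeholder -> dimensions they picked.
responses = {
    "analyst": ["accuracy", "completeness"],
    "engineer": ["timeliness", "completeness"],
    "manager": ["completeness", "accuracy"],
}

# Count one vote per mention and rank dimensions by vote count.
votes = Counter(dim for picks in responses.values() for dim in picks)
ranked = votes.most_common()

for dimension, count in ranked:
    print(f"{dimension}: {count} vote(s)")
```

A plain vote count treats every stakeholder equally; weighting votes by role would be a reasonable variation if some needs matter more for the task.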
A set of assessment steps should follow. Design the implementation so that these steps map the assessment of the selected dimensions onto the actual data. For instance, the following five requirements can serve as an example:
[1] Timeframe: Decide on the interval over which the data under investigation are collected.
[2] Definition: Define a standard for telling good data from bad data.
[3] Aggregation: Decide how to quantify the data for the assessment.
[4] Interpretability: Provide a mathematical expression to assess the data.
[5] Threshold: Select a cut-off point to evaluate the results.
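As a rough sketch of how the five requirements could map onto actual data, the example below assesses one dimension, completeness, on a handful of invented records. The records, the one-month window, the "non-empty email" rule, and the 0.9 cut-off are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical records: (collection date, email field).
records = [
    (date(2024, 1, 5), "a@example.com"),
    (date(2024, 1, 9), None),
    (date(2024, 1, 20), "c@example.com"),
    (date(2024, 1, 28), ""),
    (date(2024, 2, 3), "e@example.com"),  # falls outside the window
]

# [1] Timeframe: only assess records collected in January 2024.
start, end = date(2024, 1, 1), date(2024, 1, 31)
in_window = [r for r in records if start <= r[0] <= end]


# [2] Definition: a record is "good" if its email field is non-empty.
def is_good(record) -> bool:
    return bool(record[1])


# [3] Aggregation: count good records among those in the window.
good = sum(is_good(r) for r in in_window)
total = len(in_window)

# [4] Interpretability: completeness = good / total.
completeness = good / total if total else 0.0

# [5] Threshold: hypothetical cut-off for an acceptable score.
THRESHOLD = 0.9
passed = completeness >= THRESHOLD

print(f"completeness = {good}/{total} = {completeness:.2f}, passed = {passed}")
```

Keeping each requirement as a separate step makes it easy to swap in another dimension: only the definition in step [2] and the cut-off in step [5] need to change.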
Once the assessment methodologies are in place, it is time to get hands-on and carry out the actual assessment. After the assessment, a reporting mechanism can be set up to evaluate the results. If the data quality is satisfactory, the data are fit for further analysis. Otherwise, the data have to be revised and potentially collected again. An example can be seen in the following illustration.
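The decision at the end of the assessment can be sketched as a small reporting function. The function name `report`, the scores, and the thresholds are hypothetical; the only logic taken from the text is that satisfactory data move on to analysis and everything else is flagged for revision or re-collection.

```python
def report(scores: dict[str, float], thresholds: dict[str, float]) -> dict:
    """Compare each dimension's score with its threshold and summarize."""
    failing = {dim: s for dim, s in scores.items() if s < thresholds[dim]}
    return {
        "fit_for_analysis": not failing,
        "revise_or_recollect": sorted(failing),
    }


# Hypothetical assessment results and cut-offs per dimension.
scores = {"accuracy": 0.97, "completeness": 0.50}
thresholds = {"accuracy": 0.95, "completeness": 0.90}

result = report(scores, thresholds)
print(result)
```

Here a single failing dimension is enough to send the data back for revision; a team could just as well decide that only the dimensions ranked most important block further analysis.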
Conclusion
There is no one-size-fits-all solution to data quality problems: as the definition above shows, half of data quality is highly subjective. However, we can always use a systemic approach to assess data quality. While this approach is largely objective and relatively versatile, some domain knowledge is still required, for example when selecting the data quality dimensions. Data accuracy and completeness might be critical for use case A but less important for use case B.
Source: https://towardsdatascience.com/how-to-improve-data-preparation-for-machine-learning-dde107b60091