The Stages of a Machine Learning Project
Many businesses and organizations are turning to machine learning for solutions to challenging business goals and problems. Providing machine learning solutions to meet these needs requires that one follows a systematic process from problem to solution. The stages of a machine learning project constitute the machine learning pipeline. The machine learning pipeline is a systematic progression of a machine learning task from data to intelligence.
During our training as ML engineers, much of the focus goes to algorithms, techniques, and machine learning tools, while far less attention is given to how to take an industry or business problem from its initial statement to a usable solution.
In this article, I present the machine learning pipeline, which provides a comprehensive approach to solving real-world problems with machine learning. I will start with the observable or explainable problem, as companies and businesses are likely to present it to an engineer, and walk you through the stages a project goes through until it becomes a usable solution available to the platform's end users.
You will see, at a high level, the stages involved in building, for instance, the Netflix movie recommendation engine that runs in the background of the movie platform and personalizes your experience by showing you the movies you are likely to be interested in.
Solving any business problem with machine learning follows these fundamental stages, so every practitioner needs to understand and leverage them. If you sharpen your thinking about machine learning projects in light of this article, I believe you will be more effective and more structured when doing ML projects. You will also learn how to relate better to industry stakeholders who may not follow all the ML buzz but are genuinely seeking relevant solutions to real problems.
I know the frustration of not following this paradigm from experience: my first internship was as a data analyst for Africa Data Centers, one of Africa’s leading data center colocation service providers. At the time, I did not realize that my approach was suboptimal, because I did not know any better. I can only imagine how much time and frustration this understanding would have saved me, and how much better my performance and output would have been.
The stages of a machine learning project are summarized by the figure below.
The machine learning pipeline, business problem to solution (image by author)
The business problem or research problem
Start with the business problem.
In many cases, organizations present this as a goal: what they want to achieve. Very often there is a story behind it, and that story is important: this is how we have been doing things, or how the system has been behaving, and this is what we would like to achieve. In my case, it was something like:
“We would like to use the historical data we have on our energy consumption to determine our options for energy efficiency optimization and cost-saving.”
Simply put, the business problem was: what can we do to reduce our energy expenses?
Framing the machine learning problem
From the business problem, you frame the machine learning problem. This is where domain knowledge/expertise comes in. This is not trivial at all because to get the right solution you must start with the right problem/questions.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question” (J. Tukey, The Future of Data Analysis)
Many organizations that are using machine learning seriously sometimes consult domain experts to help them ask the right questions. However, it may not be the case that when faced with a problem, you would bring in an expert. You may be the expert/consultant that was brought to figure things out. In that case, research is the only way to go. What is the industry doing to solve the same or similar problems?
Your goal is to not waste time answering the wrong question, right? Translating a business problem to a machine learning problem is so important that it determines the fate of your entire project. Completing this step should start leading you towards the kind of data that will be necessary to answer the machine learning question. From all your research and understanding, you should already have relevant features to expect in your dataset.
Data collection and/or integration
Is the data relevant to the problem?
Is the data enough to train a good model?
This step involves putting together existing data or collecting the necessary data. If data already exists, you will have to determine whether it is relevant to the machine learning problem and thus to the business problem. This is especially important if the organization is not a typical machine learning organization and did not decide what it needed before collecting the data it now has. It is not uncommon (I have been there once) to find that an organization has collected data that is not relevant to the problem it wants to solve.
A good rule of thumb is to ask: “What data would a human expert need to solve the problem if this task were left to them?” If a human expert cannot use the available data to make correct predictions, it is almost certain that a machine cannot either. Again, an expert can give you better information about how they would solve the problem and what data they would need to answer the question you are trying to answer using machine learning.
The quality of the model or analysis performed is totally dependent on the quality of the data. Just like one cannot make fine wine with low-quality grapes, one cannot build a good model with poor quality data.
It might be possible to derive more valuable features from the original data using feature engineering, so think critically about whether relevant features are simply hidden in the dataset. Nevertheless, it is better to advise the organization or business on what data it should collect to serve its goal.
The final consideration is the size (number of examples) of the dataset. While there is no definite answer to how much data is enough, algorithms generally perform better when trained on larger amounts of relevant data.
A common rule of thumb is to have at least 10 times as many data examples as there are features in the dataset.
If this is not the case, then more data should be collected. Many options are available for getting more data. These include crowdsourcing using platforms like Amazon Mechanical Turk; other external sources; or internal data collection within the organization. For some problems, it might be possible and appropriate to generate more data from existing data examples. This is best determined by a machine learning engineer.
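As a quick illustration of the rule of thumb above, here is a minimal sketch that checks whether a dataset has at least 10 examples per feature. The function names and the dataset sizes are hypothetical, and the ratio is only the heuristic from this section, not a guarantee of enough data.

```python
def has_enough_examples(n_examples: int, n_features: int, ratio: int = 10) -> bool:
    """Return True when the dataset meets the examples-per-feature rule of thumb."""
    return n_examples >= ratio * n_features


def examples_short(n_examples: int, n_features: int, ratio: int = 10) -> int:
    """Number of additional examples needed to reach the rule of thumb (0 if met)."""
    return max(0, ratio * n_features - n_examples)


# Hypothetical dataset: 350 rows and 40 features.
print(has_enough_examples(350, 40))  # False, since 10 * 40 = 400 > 350
print(examples_short(350, 40))       # 50
```

A shortfall like this one is a signal to look at the collection options listed above before training anything.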
Data preparation/pre-processing
At this stage, you explore the data critically and prepare or transform it such that it is ready for training. Look out for such things as missing data, duplicate examples and features, feature value ranges, the data type of values, feature units, and so on. Use easy tools to quickly examine the data and scavenge as much general information as possible. After gathering useful information, some of the following actions may be required:
Deal with missing data (NaN, NA, “”, ?, None) and outliers: Standardize all missing-value markers to np.nan. Common options for handling missing data and outliers include dropping the examples with missing values or applying imputation techniques (mean, mode/frequency, median).
Deal with duplicate features and/or examples: Duplicate features introduce linear dependence into the dataset, and duplicate examples may give the false impression that there is enough data while the number of unique examples is too small to reasonably train a good model.
Feature scaling, normalization, standardization: You want your features to be in the same or comparable ranges, typically 0 to 1. This helps your model train faster and more stably, especially if you are using optimization algorithms like gradient descent.
Balance the class sizes for categorical data: Ensure that the number of training examples across the different target categories in your dataset is comparable. If the task involves naturally skewed patterns where one class always dominates the other, however, balancing is not an option. This is common in anomaly detection tasks such as rare disease prediction (e.g. cancer) and fraud detection. For skewed datasets that cannot be reasonably balanced, an appropriate training method and evaluation metric must be chosen.
Harmonize inconsistent units: Inconsistent units can easily escape notice. Ensure that all features measuring the same physical quantity use the same unit. To emphasize the point, NASA lost its $125-million Mars Climate Orbiter because of inconsistent units.
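Several of the clean-up actions above can be sketched in a few lines. The example below is a dependency-free illustration, not a production recipe: it uses `math.nan` from the standard library in place of `np.nan`, and the set of missing-value markers, the helper names, and the sample rows are all assumptions made for the example. It removes duplicate examples, standardizes missing values, imputes with the median, and rescales each feature to the range 0 to 1.

```python
import math
from statistics import median

# Markers treated as "missing"; the exact set is an assumption for this sketch.
MISSING_MARKERS = {"", "?", "NA", "NaN", "nan", "None", None}


def to_float_or_nan(value):
    """Map every missing-value marker to NaN and everything else to float."""
    if value in MISSING_MARKERS:
        return math.nan
    return float(value)


def drop_duplicate_rows(rows):
    """Remove exact duplicate examples while keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(column):
    """Replace NaN entries with the median of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the range 0 to 1."""
    low, high = min(column), max(column)
    if high == low:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]


# Hypothetical raw data: two features per example, with one duplicate row.
raw_rows = [["10", "200"], ["?", "400"], ["10", "200"], ["30", "NA"], ["20", "300"]]

rows = drop_duplicate_rows(raw_rows)
columns = [[to_float_or_nan(v) for v in col] for col in zip(*rows)]
clean = [min_max_scale(impute_median(col)) for col in columns]
print(clean)  # [[0.0, 0.5, 1.0, 0.5], [0.0, 1.0, 0.5, 0.5]]
```

Median imputation is used here only because it is one of the options named above; whether to drop or impute depends on how much data is missing and why.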
Data visualization and exploratory analysis
Data visualization is one of the most effective means of exploratory analysis. Using plots like histograms and scatter plots, you can easily spot outliers, trends, clusters, or categories in your dataset. However, visualizations are most useful for low-dimensional data (1D, 2D, 3D), as higher dimensions cannot be plotted directly. For high-dimensional data, you may select some specific features to visualize.
Feature selection and Feature engineering
Which features are relevant to make correct predictions?
The goal is to select features so that you have the least correlation between features but the maximum correlation between each feature and the targets.
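One way to act on this goal is a simple correlation filter. The sketch below is an assumption-laden illustration rather than the author's method: it uses the Pearson correlation, a greedy selection order, and two arbitrary thresholds (`min_target_corr`, `max_pair_corr`), and the tiny energy dataset is made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, min_target_corr=0.5, max_pair_corr=0.9):
    """Greedy filter: keep features that correlate with the target but not with each other.

    Both thresholds are illustrative assumptions, not values from the article.
    """
    # Rank candidates by absolute correlation with the target, strongest first.
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)), reverse=True)
    selected = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_target_corr:
            continue  # too weakly related to the target
        if any(abs(pearson(features[name], features[kept])) > max_pair_corr for kept in selected):
            continue  # nearly redundant with a feature already kept
        selected.append(name)
    return selected


# Hypothetical energy dataset: predict cost from three candidate features.
features = {
    "kwh": [10, 20, 30, 40, 50],
    "kwh_copy": [11, 21, 31, 41, 51],  # redundant with "kwh"
    "noise": [3, 1, 4, 1, 5],
}
cost = [12, 19, 33, 41, 48]
print(select_features(features, cost))  # ['kwh']
```

Pearson correlation only captures linear relationships, so a filter like this is a first pass and not a substitute for domain knowledge.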
Feature engineering involves transforming the original features in the dataset into new, potentially more useful features. As mentioned above, always think about what hidden features might be derived from the original data. Arguably, feature engineering is one of the most critical and time-consuming activities in the ML pipeline.
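As a small example of a hidden feature, a raw timestamp often carries calendar information that is more useful to a model than the timestamp itself. The sketch below assumes hypothetical meter readings with ISO 8601 timestamps; the feature names are illustrative.

```python
from datetime import datetime


def engineer_time_features(timestamp: str) -> dict:
    """Derive calendar features from an ISO 8601 timestamp string."""
    moment = datetime.fromisoformat(timestamp)
    return {
        "hour": moment.hour,
        "day_of_week": moment.weekday(),  # Monday = 0, Sunday = 6
        "is_weekend": moment.weekday() >= 5,
    }


# Hypothetical meter readings: (timestamp, kWh consumed).
readings = [("2021-03-06T14:30:00", 42.0), ("2021-03-08T02:15:00", 17.5)]
for timestamp, kwh in readings:
    print({**engineer_time_features(timestamp), "kwh": kwh})
```

For an energy consumption problem like the one described earlier, features such as hour of day or weekend flags may expose usage patterns that the raw timestamp hides.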
With all the above steps performed, you now have a sizable dataset with features that are relevant to the ML task, and you can proceed (with some confidence) to train a model.
Model Training
The first step before training is to split your dataset, with randomization, into a train set, a cross-validation (or development) set, and a test set.
Randomization helps to reduce bias in your models and is achieved by shuffling the data before splitting. It is especially important when the examples are stored in some order, for example chronologically, because it keeps the model from learning an artifact of that ordering instead of the underlying pattern.
There is no fixed rule for optimum splits, but the main intuition is to have as much training data as possible, a smaller but sufficient set for tuning hyperparameters during training, and enough test data to measure the model’s ability to generalize. Some typical and commonly used splits include:
Some common data split percentages in machine learning
Next, set aside the test set for later testing of your models and proceed with the train set to train your model. It is good to quickly try out different candidate algorithms and either pick the one with the best generalization performance on the cross-validation or dev set for further tuning, or pick a set of algorithms to form an ensemble. Use the dev set for model hyperparameter tuning.
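A shuffled three-way split can be sketched as follows. The 60/20/20 proportions and the fixed seed are assumptions chosen for the example, not the figures from the article's image, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(examples, train_frac=0.6, dev_frac=0.2, seed=42):
    """Shuffle a copy of the data, then cut it into train, dev, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains is held out for testing
    return train, dev, test


train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 60 20 20
```

The test portion is returned last on purpose: it should be stored away and touched only once the model has been tuned on the dev set.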
Dataset splits and usage (image by author)
Model Evaluation
Is the model useful (does it meet the minimum required performance)?
Is the model computationally efficient?
Once you have optimized your model’s performance on the dev set as much as possible, you can now assess how well it performs on unseen data that was set aside in your test set. The performance observed on the test data gives you a glimpse of what you can expect to see in the production environment. Use single value evaluation metrics for quantifying performance.
- Accuracy: suitable for classification tasks
- Precision/recall: suitable for skewed classification tasks
- R-squared: suitable for regression
It is hard to strike a good balance between precision and recall, so they are often combined into a single-value evaluation metric, the F1 score.
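To make the classification metrics above concrete, here is a minimal sketch that computes accuracy, precision, recall, and the F1 score for a binary task. The labels are made up to mimic a skewed dataset, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical skewed labels: 2 positives among 10 examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.8, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Accuracy looks good here (0.8) even though the model finds only one of the two positives, which is why precision, recall, and F1 are preferred for skewed classes.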
If the minimum required performance is obtained, then you have a useful model that is ready for deployment.
“All models are wrong, but some are useful.” (George Box)
Model deployment, integration, and monitoring
“The deployment of machine learning models is the process for making your models available in production environments, where they can provide predictions to other software systems. It is only once models are deployed to production that they start adding value.” (Christopher Samiullah)
Deployment is crucial and is probably the ML engineer’s nightmare, as it is more of a software engineering discipline. Nevertheless, ML engineers are largely expected to be able to deploy and integrate their models with existing software systems to serve end users. I have very little to say about deployment, but it is worth noting how ML deployments fundamentally differ from explicitly programmed software.
Models in production environments suffer from performance decay over time, so monitoring the performance of your model in production is standard practice. Performance decay is inevitable partly because the data distribution in the production environment drifts away from the distribution that existed in the train set. If you notice a significant difference in the production data distribution, you need to retrain your model.
Model in production is continuously monitored, retrained and deployed (image by author)
Over the lifetime of any deployed ML model, the cycle of monitoring, retraining, and updating is a routine process. It helps to log system performance information continuously and to create performance drift alerts for efficient monitoring.
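A drift alert can start out very simple. The sketch below is a deliberately crude assumption, not a standard drift test: it flags a feature when its production mean moves more than a chosen number of training standard deviations away from the training mean. The threshold of 2.0, the function names, and the readings are all hypothetical.

```python
from statistics import mean, stdev


def drift_score(train_values, production_values):
    """Shift of the production mean, in units of the training standard deviation."""
    spread = stdev(train_values)
    if spread == 0:
        return 0.0 if mean(production_values) == mean(train_values) else float("inf")
    return abs(mean(production_values) - mean(train_values)) / spread


def needs_retraining(train_values, production_values, threshold=2.0):
    """Flag drift when the mean shift exceeds the (assumed) alert threshold."""
    return drift_score(train_values, production_values) > threshold


# Hypothetical daily energy readings (kWh) seen in training and in production.
train_kwh = [100, 102, 98, 101, 99, 100, 103, 97]
stable_kwh = [101, 99, 100, 102]
shifted_kwh = [120, 118, 123, 121]

print(needs_retraining(train_kwh, stable_kwh))   # False
print(needs_retraining(train_kwh, shifted_kwh))  # True
```

A check like this only catches shifts in the mean; in practice it would run on logged production inputs alongside monitoring of the model's actual performance.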
Conclusion
In this article, I summarized the stages of a machine learning project from understanding the problem to a usable solution.
An ML solution is a system with a machine learning engine running in the background.
Summary of activities:
- Understand the business problems and needs
- Frame the ML problem
- Understand the data needs and acquire the data
- Clean and preprocess the data
- Select relevant features
- Perform feature engineering
- Train a model
- Tune hyperparameters to optimize the performance of the model (accuracy and speed)
- Test the model
- Deploy the model
- Monitor and update the model/system (continuous process)
Resources
How to deploy machine learning models
6 stages to get success in machine learning projects
Original article: https://medium.com/swlh/the-stages-of-a-machine-learning-project-cf4bb073a4ad