Productionizing Machine Learning Models
The biggest issue in the life cycle of an ML project isn't creating a good algorithm, generalizing the results, or getting good predictions or better accuracy. The biggest issue is putting ML systems into production. One of the known truths of the machine learning world is that only a small part of a real-world ML system is composed of ML code; a big part is model deployment, model retraining, maintenance, ongoing updates and experiments, auditing, versioning and monitoring. And these steps account for a huge part of ML systems' technical debt, as it exists at the system/platform level rather than the code/development level. Hence the model deployment strategy becomes a very crucial step in designing the ML platform.
Introduction
The first step in determining how to deploy a model is understanding the system.
It’s indicative of the complexity of machine learning systems that many large technology companies who depend heavily on machine learning have dedicated teams and platforms that focus on building, training, deploying and maintaining ML models. Here are some examples:
- Databricks has MLFlow
- Google has TensorFlow Extended (TFX)
- Uber has Michelangelo
- Facebook has FBLearner Flow
- Microsoft has AI Lab
- Amazon has Amazon ML
- AirBnb has BigHead
- JPMC has Omni AI
Machine Learning System vs Traditional Software System
1. Unlike traditional software systems, deploying an ML system isn't the same as deploying a trained ML model as a service. ML systems require a multi-step automated deployment pipeline for retraining, validation and deployment of the model, which adds complexity.
2. Testing an ML system involves model validation, model training and so on, in addition to software tests such as unit testing and integration testing.
3. Machine learning systems are much more dynamic in terms of system performance due to varying data profiles, and the model has to be retrained/refreshed often, which leads to more iterations in the pipeline. This is not the case with traditional software systems.
Model Portability (From Model Development to Production)
Writing code to predict/score data is most often done in Jupyter notebooks or an IDE. Taking this model-development code to a production environment requires converting language-specific code to an exchange format (compressed and serialized) that is language-neutral and lightweight. Hence portability of the model is also a key requirement.
Below are the widely used formats for ML model portability:
1. Pickle: The pickle file is the binary version of a Python object, used for serializing and de-serializing a Python object structure. Converting a Python object hierarchy into a byte stream is called "pickling". Converting the byte stream back into an object hierarchy is called "unpickling".
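As a minimal sketch, pickling and unpickling might look like this. The model here is a stand-in dict; a fitted scikit-learn estimator would serialize the same way.

```python
import pickle

# Any Python object can be serialized; here a plain dict stands in
# for a trained model.
model = {"weights": [0.4, 0.6], "intercept": -1.2}

# "Pickling": convert the object hierarchy into a byte stream.
payload = pickle.dumps(model)

# "Unpickling": restore the object hierarchy from the byte stream.
restored = pickle.loads(payload)
assert restored == model
```

In production the byte stream is typically written to a file or object store with `pickle.dump`; note that unpickling untrusted data is unsafe, as it can execute arbitrary code.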
2. ONNX (Open Neural Network Exchange): ONNX is an open-source format for machine learning models. ONNX has a common set of operators and a common file format to use with models across a variety of frameworks and tools.
3. PMML (The Predictive Model Markup Language): PMML is an XML-based predictive model interchange format. With PMML, you can develop a model in one application on one system and deploy it in another application on another system just by transmitting an XML configuration file.
4. PFA (Portable Format for Analytics): PFA is an emerging standard for statistical models and data transformation engines. PFA offers easy portability across different systems, and models, pre-processing functions, and post-processing functions can be chained and built into complex workflows. PFA can be a simple raw-data transformation or a sophisticated suite of concurrent data-mining models, with a JSON or YAML configuration file.
5. NNEF (Neural Network Exchange Format): NNEF helps reduce the pain in the machine learning deployment process by enabling a rich mix of neural network training tools and applications to be used across a range of devices and platforms.
There are some framework-specific formats as well, such as Spark MLWritable (Spark-specific) and POJO / MOJO (H2O.ai-specific).
CI/CD in Machine Learning
In traditional software systems, Continuous Integration & Delivery is the approach that provides the automation, quality, and discipline for creating a reliable, predictable and repeatable process to release software into production. Should the same be applied to ML systems? Yes, but the process is not simple. The reason is that in ML systems, changes to the ML model and to the data used for training also need to be managed, along with the code, in the ML delivery process.
So unlike traditional DevOps, MLOps has 2 more steps every time CI/CD runs.
Continuous integration in machine learning means that each time you update your code or data, the machine learning pipeline reruns, which kicks off builds and test cases. If all the tests are successful, then Continuous Deployment begins, deploying the changes to the environment.
Within ML systems, there is one more term in MLOps, CT (Continuous Training), which comes into the picture if you need to automate the training process.
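The extra steps can be pictured as a pipeline in which data validation and model validation join the usual build/test/deploy stages. The sketch below is a toy illustration with hypothetical stage names, not a real CI tool's API:

```python
# Ordered stages of a hypothetical MLOps pipeline. Compared to a classic
# DevOps pipeline, "validate_data" and "validate_model" are the extra steps.
PIPELINE = ["build", "unit_tests", "validate_data",
            "train_model", "validate_model", "deploy"]

def run_pipeline(stages, runner):
    """Run stages in order; stop at the first failure so no deploy happens."""
    completed = []
    for stage in stages:
        if not runner(stage):
            break
        completed.append(stage)
    return completed

# Simulate a run where model validation fails: deployment never happens.
ok = run_pipeline(PIPELINE, runner=lambda s: s != "validate_model")
```

The gating behavior is the point: a failed model validation blocks the deploy stage exactly as a failed unit test would block a release in traditional CI/CD.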
Although the market has some reliable tools for MLOps, and new tools keep coming up, predicting ML model outcomes in a production environment is still a young practice.
New tools like Gradient and MLflow are becoming popular for building robust CI/CD pipelines in ML systems. Tools such as Quilt and Pachyderm are leading the way for forward-looking data science/ML workflows, but they have not yet seen widespread adoption. Some other alternatives include dat, DVC and gitLFS; but the space is still new and relatively unexplored.
Deployment Strategies
There are many different approaches when it comes to deploying machine learning models into production, and an entire book could be written on this topic. In fact, I am not sure whether one exists already. The choice of deployment strategy depends entirely on the business requirement and how we plan to consume the output predictions. At a very high level, the strategies can be categorized as below:
Batch Prediction
Batch prediction is the simplest form of machine learning deployment strategy, used in online competitions and academia. In this strategy you schedule predictions to run at a particular time and output them to a database or file system.
Implementation
The approaches below can be used to implement batch predictions:
- The simplest way is to write a program in Python and schedule it using cron, but it requires extra effort to introduce functionality for validating, auditing and monitoring. However, nowadays we have many tools/approaches that can make this task simpler.
- Writing a Spark batch job, scheduling it on YARN, and introducing logging for monitoring and retry functionality.
- Using tools like Prefect and Airflow, which provide UI capabilities for scheduling, monitoring and alert notifications in case of failures.
- Platforms like Kubeflow, MLFlow and Amazon SageMaker also provide batch deployment and scheduling capabilities.
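The cron-scheduled Python program from the first option can be sketched as below. This is a minimal stand-in: the `score` function and the column names are assumptions, and a real job would load a pickled model instead.

```python
import csv
import io

def score(row):
    # Stand-in for model.predict(); a real job would load a serialized
    # model at startup and score each record with it.
    return 1 if float(row["amount"]) > 100 else 0

def run_batch(input_csv, output_csv):
    """Read all records, score them in one pass, write predictions out."""
    reader = csv.DictReader(input_csv)
    writer = csv.writer(output_csv)
    writer.writerow(["id", "prediction"])
    count = 0
    for row in reader:
        writer.writerow([row["id"], score(row)])
        count += 1
    return count  # record count, useful for logging/auditing

# In production this script is triggered by cron/Airflow against real
# files; here it is driven from in-memory files for brevity.
inp = io.StringIO("id,amount\n1,50\n2,250\n")
out = io.StringIO()
processed = run_batch(inp, out)
```

The validation, auditing and monitoring the text mentions would hang off the returned record count and the logging around this loop.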
Web Service
The most common and widely used machine learning deployment strategy is a simple web service. It is easy to build and deploy. The web service takes input parameters and outputs the model's predictions. The predictions are almost real-time, and the service doesn't require many resources, since it predicts one record at a time, unlike batch prediction, which processes all the records at once.
Implementation
- To implement predictions as a web service, the simplest way is to write a service and put it in a Docker container to integrate with existing products. Though this is not the sexiest solution, it is probably the cheapest.
- The most common framework for implementing an ML model as a service is Flask. You can then deploy your Flask application on Heroku, Azure, AWS or Google Cloud, or just deploy it using PythonAnywhere.
- Another common way to implement an ML model as a service is to build a Django app and deploy it on Heroku/AWS/Azure/Google Cloud.
- There are a few newer options, such as Falcon, Starlette, Sanic, FastAPI and Tornado, also taking up space in this area. FastAPI together with the Uvicorn server is becoming popular these days because of its minimal code requirements, and it automatically creates both OpenAPI (Swagger) and ReDoc documentation.
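A minimal Flask prediction service might look like the sketch below. The `predict_one` function is a stand-in for a real model, and the route/payload shape is an assumption; a production service would load a serialized model at startup and run behind a WSGI server inside a container.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_one(features):
    # Stand-in for model.predict(); a real service would unpickle the
    # trained model once at startup and call it here.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    return jsonify({"prediction": predict_one(payload["features"])})

# Exercised here via Flask's test client; in production you would run
# the app behind gunicorn/uwsgi rather than the built-in dev server.
client = app.test_client()
resp = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
```

One request scores one record, which is exactly the "one record at a time" behavior described above.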
Why Online/Real-Time Predictions?
The above two approaches are widely used, and almost 90% of the time you will use one of these two strategies to build and deploy your ML pipelines. However, there are a few concerns with both approaches:
1. Performance tuning of bulk size for batch partitioning.
2. Service exhaustion, client starvation, handling failures and retries are common issues with web services. If model calls are asynchronous, this approach fails to apply back pressure in case of a burst of data, such as during restarts. This can lead to out-of-memory failures in the model servers.
The answer to the above issues lies in the next two approaches.
Real-Time Streaming Analytics
Over the last few years, the world of software has moved from RESTful services to streaming APIs, and so should the world of ML.
Hence another ML workflow that is emerging these days is real-time streaming analytics, also known as Hot Path Analytics.
In this approach, requests to the model/data load arrive as a stream of events (commonly a Kafka stream), and the model is placed right in the firehose, to run on the data as it enters the system. This creates a system that is asynchronous, fault-tolerant, replayable and highly scalable.
The ML system in this approach is event-driven, which allows us to gain better model computing performance.
Implementation
To implement an ML system using this strategy, the most common way is to use Apache Spark or Apache Flink (both provide a Python API). Both allow easy integration of ML models written using Scikit-Learn or TensorFlow, in addition to Spark MLlib or Flink ML.
If you are not comfortable with Python, or there is already an existing data pipeline written in Java or Scala, you can use the TensorFlow Java API or third-party libraries such as MLeap or JPMML.
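The event-driven pattern itself can be sketched without Spark or Flink: a consumer drains an event stream and scores each record as it arrives. Below, an in-memory queue stands in for a Kafka topic and a trivial function stands in for the model; this illustrates the shape of the pipeline, not either framework's API.

```python
import queue
import threading

def score(event):
    # Stand-in for the model sitting "in the firehose".
    return {"id": event["id"], "prediction": event["value"] * 2}

def consumer(events, results):
    """Drain the event stream, scoring each record as it arrives."""
    while True:
        event = events.get()
        if event is None:   # sentinel: stream closed
            break
        results.append(score(event))

events = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(events, results))
worker.start()

# In a real pipeline a Kafka topic would feed this queue.
for i in range(3):
    events.put({"id": i, "value": float(i)})
events.put(None)
worker.join()
```

Because the queue is bounded in a real broker, a burst of events makes producers wait instead of overwhelming the model server, which is the back-pressure property the web-service approach lacks.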
Automated Machine Learning
If we just train a model once and never touch it again, we miss out on the information that more/new data could provide us.
This is especially important in environments where behaviors change quickly, so you need an ML model that can learn from new examples in something closer to real time.
With Automated ML, you should both predict and learn in real time.
A lot of engineering is involved in building an ML model that learns online, but the most important factor is the architecture/deployment of the model. Since the model can, and will, change every second, you can't instantiate several instances. It is also not horizontally scalable: you are forced to have a single model instance that consumes new data as fast as it can, spitting out sets of learned parameters behind an API. The most important part of the process (the model) is only vertically scalable. It may not even be feasible to distribute it between threads.
Real-world examples of this strategy include Uber Eats delivery estimation, LinkedIn's connection suggestions, Airbnb's search engine, augmented reality, virtual reality, human-computer interfaces and self-driving cars.
Implementation
- The Sklearn library has a few algorithms that support online incremental learning via the partial_fit method, such as SGDClassifier, SGDRegressor, MultinomialNB, MiniBatchKMeans and MiniBatchDictionaryLearning.
- Spark MLlib doesn't have much support for online learning; it has 2 ML algorithms that support it: StreamingLinearRegressionWithSGD and StreamingKMeans.
- Creme also has good APIs for online learning.
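A small sketch of the `partial_fit` pattern with SGDClassifier: the model is updated in place as mini-batches "arrive", instead of being refit from scratch. The synthetic data and batch sizes here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Incremental learning: the model sees data in small chunks and is
# updated in place with partial_fit.
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # the full label set must be declared up front

rng = np.random.default_rng(0)
for _ in range(20):                       # 20 "arriving" mini-batches
    X = rng.normal(size=(32, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    model.partial_fit(X, y, classes=classes)

# After enough batches the model separates this simple linear problem.
acc = model.score(np.array([[2.0, 2.0], [-2.0, -2.0]]), np.array([1, 0]))
```

Note that `classes` must be passed on the first call (and consistently thereafter), since the model may never see every label in a single mini-batch.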
Challenges
Online training also has some issues associated with it. As the data changes often, your ML model can be sensitive to the new data and change its behavior. Hence mandatory on-the-fly monitoring is required, and if the change exceeds a certain threshold, the data behavior has to be managed properly.
For example, in any recommendation engine, if one user likes or dislikes a category of data in bulk, then this behavior, if not taken care of properly, can influence the results for other users. There is also a chance that this data is a scam, so it should be removed from the training data.
Taking care of these issues/patterns in batch training is relatively easy, and misleading data patterns and outliers can be removed from the training data very easily. But in online learning it is much harder, and creating a monitoring pipeline for such data behavior can also be a big hit on performance, due to the size of the training data.
Other Variants in Deployment Strategies
There are a few other variants of deployment strategies, like ad-hoc predictions via SQL, model servers (RPCs) and embedded model deployments, tiered storage without any data storage, and a database as model storage. All of these are combinations/variants of the above four strategies. Each strategy is a chapter in itself, so they are beyond the scope of this article. But the essence is that deployment strategies can be combined/molded per the business need. For example, if data is changing frequently but you do not have the platform/environment to do online learning, then you can do batch learning (every hour/day, depending on need) in parallel with online prediction.
Monitoring ML Model Performance
Once a model is deployed and running successfully in the production environment, it is necessary to monitor how well the model is performing. Monitoring should be designed to provide early warnings of the myriad things that can go wrong in a production environment.
Model Drift
Model drift is the change in the predictive power of an ML model. In a dynamic data system where new data is acquired very regularly, the data can change significantly over a short period of time. Therefore the data we used to train the model in the research or production environment does not represent the data we actually get in our live system.
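One common way to quantify this kind of drift is the Population Stability Index (PSI) over a feature's binned distribution. The sketch below is a minimal pure-Python version; the example distributions and the 0.1/0.25 thresholds are conventional rules of thumb, not values from this article.

```python
import math

def psi(expected, actual, eps=1e-4):
    """Population Stability Index between two binned distributions.
    Both inputs are lists of bin proportions that each sum to 1."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live_dist = [0.10, 0.20, 0.30, 0.40]   # distribution observed in production

drift = psi(train_dist, live_dist)
# Rule of thumb: < 0.1 stable, 0.1 to 0.25 moderate, > 0.25 significant.
```

A monitoring job would compute this per feature on a schedule and alert when the index crosses the chosen threshold.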
Model Staleness
If we use historic data to train the models, we need to anticipate that the population, consumer behavior, the economy and its effects may not be the same in current times. So the features that were used to train the model will also change.
Negative Feedback Loops
One of the key features of live ML systems is that they tend to influence their own behavior as they update over time, which may lead to a form of analysis debt. This in turn makes it difficult to predict the behavior of an ML model before it is released into the system. These feedback loops are difficult to detect and address, especially if they occur gradually over time, which may be the case when models are not updated frequently.
To avoid/treat the above issues in a production system, there needs to be a process that measures the model's performance against new data. If the model falls below an acceptable performance threshold, then a new process has to be initiated to retrain the model with new/updated data, and that newly trained model should be deployed.
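The threshold check that gates retraining can be as simple as the sketch below; the metric, baseline and tolerance values are illustrative assumptions.

```python
def needs_retraining(live_accuracy, baseline_accuracy, tolerance=0.05):
    """Flag the model for retraining once live accuracy drops more than
    `tolerance` below the accuracy recorded at deployment time."""
    return live_accuracy < baseline_accuracy - tolerance

# Recorded when the model was promoted to production.
baseline = 0.91

# Evaluated periodically against freshly labeled production data.
assert not needs_retraining(0.89, baseline)  # within tolerance: keep serving
assert needs_retraining(0.83, baseline)      # degraded: retrain and redeploy
```

In practice this check would run on a schedule and, when it fires, kick off the retraining pipeline rather than just returning a boolean.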
Conclusion
In the end, there is no generic strategy that fits every problem and every organization. Deciding which practices to use, and implementing them, is at the heart of what machine learning engineering is all about.
You will often see, when starting any ML project, that the primary focus is on the data and ML algorithms; but given how much work is involved in deciding on ML infrastructure and deployment, focus should be given to these factors as well.
Thanks for the read. I hope you liked the article! As always, please reach out with any questions, comments or feedback.
Translated from: https://medium.com/swlh/productionizing-machine-learning-models-bb7f018f8122