System Design for Serving Machine Learning Algorithms: Best Practices for Pipelines in Different Environments
“Eureka!” While working on a persistently difficult problem, you discover a marketable, profitable solution. To simplify your tax reporting, you file the paperwork to start an LLC, because you realize how important it is to keep your personal and business finances separate. If you fail to keep them separate, filing your taxes becomes messy and cumbersome when the time comes. Building pipelines for different environments, specifically Development/Non-Production and Deployment/Production, follows a very similar idea. What is the purpose of having these separate environments? This section covers the benefits of compartmentalizing them.
While constructing our Data Science solution, we develop a series of steps that perform the operations the solution needs. We extract data from our source, engineer our features, train our various models, validate on a subset of our source data, and then upload our predictions. While it is straightforward to construct this solution as a script, deploying it as a pipeline can quickly introduce complexity. Why? One reason is that the data we use in production can differ from the data we use in development.
For this reason alone, we need different considerations or additional steps to compensate for these differences. Once we start syncing data with a Data Lake, our Data Science solution must not write the predictions we make in development to the Data Lake that our clients see. In addition, how we construct our pipelines affects stability and validation tests. If our pipelines have coalesced into one, it becomes challenging to write checks that ensure the expected functionality. By compartmentalizing our environments and narrowing our focus to one at a time, we can save ourselves a measure of “pain and suffering”.
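As a minimal sketch of that separation, the Python below keeps everything that differs between environments in one configuration object, so development predictions can only be written to the development sink. Everything in it is an assumption made for illustration: the names `PipelineConfig`, `load_config`, and `upload_predictions`, the `lake://` URIs, and the `writer` callable that stands in for whatever client actually talks to the Data Lake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Settings that differ between environments (all values are hypothetical)."""
    environment: str
    source_uri: str
    sink_uri: str


# Hypothetical locations; a real project would load these from its own configuration.
CONFIGS = {
    "development": PipelineConfig("development", "lake://dev/raw", "lake://dev/predictions"),
    "production": PipelineConfig("production", "lake://prod/raw", "lake://prod/predictions"),
}


def load_config(environment: str) -> PipelineConfig:
    """Return the settings for one environment, rejecting unknown names."""
    try:
        return CONFIGS[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment!r}") from None


def upload_predictions(predictions, config: PipelineConfig, writer) -> str:
    """Hand the predictions to `writer`, addressed to this environment's sink only."""
    writer(config.sink_uri, predictions)
    return config.sink_uri
```

Because the sink is chosen by the configuration and not by the pipeline code, a test can pass in a fake `writer` and check that a development run never addresses the production location.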
Why Have Separate Training and Prediction Pipelines?
In the previous subsection, we discussed the importance of having separate pipelines for running our Data Science solution in development and in production; doing so increases the stability and testability of our pipeline. However, it is just as easy to develop our pipelines so that the Training and Prediction processes live in the same script. In these situations, a boolean value is commonly passed into the software to denote whether we are training our Machine Learning model or predicting with it. Melding these two fundamentally different processes complicates the Data Science code repository and decreases the maintainability of both the software and the pipeline. Software design principles exist to increase maintainability and decrease cognitive overload. One of them, the Single Responsibility Principle, can make the Training and Prediction processes easier to maintain and manage. If this principle is upheld and enforced, the prediction pipeline could realistically consist of five base operations and up to ten lines of code. Following it also reduces what each member of the deployment team needs to know about the training pipeline.
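To make the idea concrete, here is a small Python sketch in which training and prediction are two separate functions, with no boolean flag switching between them. The article does not say which five operations it has in mind; loading the model, extracting data, engineering features, predicting, and uploading are an assumed split. The toy one-parameter "model", the JSON model file, the column names `x` and `y`, and the `extract` and `upload` callables are likewise hypothetical stand-ins.

```python
import json
from pathlib import Path
from statistics import mean


def engineer_features(rows):
    """Shared feature step: here, just the numeric column `x` (hypothetical)."""
    return [float(row["x"]) for row in rows]


def training_pipeline(rows, model_path):
    """Fit a toy model (mean target divided by mean feature) and persist it."""
    features = engineer_features(rows)
    targets = [float(row["y"]) for row in rows]
    model = {"slope": mean(targets) / mean(features)}
    Path(model_path).write_text(json.dumps(model))
    return model


def prediction_pipeline(extract, model_path, upload):
    """Five assumed operations: load model, extract, engineer, predict, upload."""
    model = json.loads(Path(model_path).read_text())      # 1. load the trained model
    rows = extract()                                       # 2. extract new data
    features = engineer_features(rows)                     # 3. engineer features
    predictions = [model["slope"] * x for x in features]   # 4. predict
    upload(predictions)                                    # 5. upload predictions
    return predictions
```

The only things the two pipelines share are the feature step and the saved model file, so whoever deploys `prediction_pipeline` needs to know where the model lives but not how it was trained.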
Thank you for reading thus far! This is part of a series of articles to come.
Please stay tuned!
As always, #happycoding
Original article: https://towardsdatascience.com/system-design-proposal-for-serving-machine-learning-algorithms-best-practices-for-pipelines-for-8b14d4f6e13c