Our Data Science Robot Intern

Machine learning practitioners know how overwhelming the number of possibilities we face when building a model can be. It is like going to a restaurant and being handed a menu the size of a book, having never tried any of the dishes. Which models do we test? How do we configure their parameters? Which features do we use? Those who try to solve this problem with ad-hoc manual experiments end up having their time consumed by menial tasks, their work constantly interrupted to check results and launch new experiments. This has motivated the rise of the field of automated machine learning.

The main tasks of automated ML are hyperparameter optimization and feature selection, and in this article we will present Legiti's solution to these tasks (no, we didn't hire an intern to do that). We developed a simple algorithm that addresses both challenges and is designed for rapidly changing environments, such as the ones faced by data scientists working on real-world business problems. We present some convenient properties and limitations, as well as possible extensions to the algorithm. We end by showing results obtained by applying it to one of our models.

Introduction and motivation to the problem

At Legiti, we build machine learning models to fight credit card transactional fraud; that is, to predict whether an online purchase was made by (or with the consent of) the person who owns the credit card. We only learn that fraud occurred some time later, when the owner of the card requests a refund, a process known as a chargeback. These chargebacks become the labels for a supervised learning algorithm in our modeling.

If you already have a good knowledge of common tools for feature selection and hyperparameter optimization, you can skip to the “An outline of (the first version of) our algorithm” section.

Machine learning models usually take many parameters that can be adjusted

In our case, we're currently using XGBoost. XGBoost builds a tree ensemble (a set of decision trees whose predictions are averaged) with a technique called gradient boosting. This algorithm usually works quite well “out of the box”, but its performance can be maximized by tuning some characteristics of the model, known as “hyperparameters”. Two examples of hyperparameters are the depth of the trees in the ensemble and the number of trees.

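As a concrete illustration, here is a minimal sketch of setting those two hyperparameters through XGBoost's scikit-learn interface; the synthetic dataset is a stand-in for a real feature matrix and fraud labels:

```python
# Minimal sketch: two common XGBoost hyperparameters, evaluated out-of-sample.
# The synthetic data below is a placeholder for a real fraud dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = XGBClassifier(
    max_depth=6,       # depth of each tree in the ensemble
    n_estimators=300,  # number of trees in the ensemble
)
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```
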
Hyperparameters are not exclusive to XGBoost. Other ensemble algorithms share a similar set of hyperparameters, and neural networks have their own, such as the number of layers and the number of nodes in each layer. Even linear models can have hyperparameters: think of the L1 and L2 regularization parameters in Lasso/Ridge regression. Any applied machine learning practitioner who wants to maximize the performance of their model will sooner or later be engaged in the task of optimizing these parameters, usually by trial and error.

We also usually have many features at our disposal

In solutions that use a lot of structured data to make predictions (in contrast with, for example, image or text processing with deep learning), a lot of the work is not on the model itself but on its inputs; that is, on the features of the model. A problem similar to hyperparameter selection arises here: the problem of feature selection.

Feature selection consists of choosing a configuration of features that maximizes performance. Usually, performance is measured as predictive capability (some accuracy metric on out-of-time validation sets), but other factors can count as well, such as training speed and model interpretability. At Legiti, we have more than 1000 features at our disposal for one of our models. However, simply throwing all 1000 features at a model doesn't necessarily lead to better accuracy than, for example, using only 50 carefully selected features. On top of that, keeping many features that yield little predictive firepower can make training take much longer.

The number of features we are choosing from is larger than the variety of products on this shelf. Photo by NeONBRAND on Unsplash.

No one wants data scientists wasting their time with manual tasks

Unfortunately, when training a model, the model won't tell you what the optimal hyperparameter and feature sets are. What data scientists do is observe out-of-sample performance to avoid the effect of overfitting (this can be done in multiple ways, but we won't go into details here). The simplest way to address the problem is to experiment manually with different sets of hyperparameters and features, eventually developing an intuitive understanding of which kinds of modifications matter. However, this is a very time-consuming practice, and we do not want to waste our data scientists' time on such manual tasks. What happens, then, is that most machine learning “power users” develop and/or use algorithms to automate the process.

Some existing algorithms

Several feature selection and hyperparameter optimization algorithms exist, and many of them are packaged in open-source libraries. Two basic examples of feature selection algorithms are backward selection and forward selection.

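As a sketch of what forward selection looks like in practice, here is a version using scikit-learn's SequentialFeatureSelector; the estimator, the synthetic data, and the budget of 10 features are arbitrary choices for illustration:

```python
# Forward selection sketch: greedily add the feature that most improves the
# cross-validated score, stopping at a fixed budget of features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,  # arbitrary budget for this illustration
    direction="forward",      # "backward" would start from all features instead
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```
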
Another alternative is the Lasso, an L1-regularized linear regression model with sparsity properties. That means many features end up with a weight of exactly 0. Some people use the Lasso only to select the features that were assigned nonzero weights and discard the actual regression model.

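A minimal sketch of that filter-style use of the Lasso might look like this; the regularization strength and the synthetic data are placeholders:

```python
# Lasso as a feature filter: fit it, keep the features with nonzero weights,
# and hand those features to a different model afterwards.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=50, n_informative=8, random_state=0)
X = StandardScaler().fit_transform(X)   # the Lasso is sensitive to feature scale

lasso = Lasso(alpha=0.5).fit(X, y)      # alpha controls how sparse the fit is
selected = np.flatnonzero(lasso.coef_)  # features whose weight survived the L1 penalty
print(f"kept {selected.size} of {X.shape[1]} features:", selected)
```
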
Hyperparameter optimization is usually done either with very simple algorithms (random or grid search) or with complex Bayesian-optimization surrogate models, such as the ones implemented in the Hyperopt and Spearmint packages.

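For context, here is roughly what a minimal Hyperopt run looks like; the search space, the two hyperparameters, and the synthetic data are illustrative choices, not our actual setup:

```python
# Minimal Hyperopt sketch: minimize the negative cross-validated AUC of an
# XGBoost model over two hyperparameters, using the TPE algorithm.
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(params):
    model = XGBClassifier(max_depth=int(params["max_depth"]),
                          n_estimators=int(params["n_estimators"]))
    # fmin minimizes, so return the negative of the score we want to maximize
    return -cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)
```
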
However, despite all this available tooling, as far as we know there is no universal tool that works well for both hyperparameter optimization and feature selection. Another common difficulty is that the optimal hyperparameter set depends on the set of features being used, and vice-versa. Finally, most of these algorithms were designed for a laboratory environment, such as research programs or Kaggle competitions, where usually neither the data nor the problem changes over time.

Given all of that, we decided to build our own.

An outline of (the first version of) our algorithm

When we decided to build our own thing instead of relying on other methods, it wasn't because we thought we could do something “superior” to anything else available. In fact, it was more of a pragmatic decision. We didn't have anything at all in place (we were at the stage of manually tweaking features and hyperparameters) and we wanted to build a sort of minimum viable solution. None of the algorithms we found would work without substantial adaptation, so we decided to code a very simple algorithm and replace it with more elaborate solutions later. However, this algorithm exceeded our expectations and we have been using it ever since. Some numbers are available in the results section below.

Description of the algorithm

The strategy we took was to use our best solution at the time as the initial anchor and apply small random modifications to it. And so, we built an algorithm that takes the following general steps:

1) Choose randomly between a change in features or hyperparameters

2.1) If we are changing a feature, choose a random feature from our feature pool and swap it in the model (that is, remove the feature if it is already in the model, or add it if it is not)

2.2) If we are changing a hyperparameter, choose a random hyperparameter out of the ones we’re using and then randomly increase it or decrease it by a certain configurable value

3) Test the model with this small change using our cross-validation procedure. If the out-of-sample accuracy metric is higher, accept the change to the model; if it doesn't improve, reject the change.

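Condensed into code, the loop looks roughly like this; `evaluate` stands in for our cross-validation procedure, and the model representation, `FEATURE_POOL`, `STEP_SIZE`, and the load/save helpers are hypothetical placeholders:

```python
# A sketch of the accept-if-better loop described above. Everything here is a
# placeholder except the control flow, which mirrors steps 1 to 3. The model
# is assumed to be a dict: {"features": set[str], "hyperparams": dict[str, float]}.
import copy
import random

def propose(model):
    """Apply one small random modification to a copy of the model."""
    candidate = copy.deepcopy(model)
    if random.random() < 0.5:                  # 1) change a feature or a hyperparameter?
        feature = random.choice(FEATURE_POOL)  # 2.1) swap one random feature
        if feature in candidate["features"]:
            candidate["features"].remove(feature)
        else:
            candidate["features"].add(feature)
    else:                                      # 2.2) nudge one random hyperparameter
        name = random.choice(list(candidate["hyperparams"]))
        candidate["hyperparams"][name] += random.choice([-1, 1]) * STEP_SIZE[name]
    return candidate

best = load_current_best_model()               # the anchor: our best known model
best_score = evaluate(best)                    # out-of-sample metric via cross-validation
while True:                                    # a never-ending process, by design
    candidate = propose(best)
    score = evaluate(candidate)
    if score > best_score:                     # 3) accept only if the metric improves
        best, best_score = candidate, score
        save_best_model(best)                  # e.g. commit the new best to version control
```
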
Perhaps some readers will note similarities between this and the Metropolis-Hastings algorithm, or simulated annealing. We can actually see it as a simplification of simulated annealing in which there is no variable “temperature” and the stochastic acceptance rule becomes a deterministic one.

Some big advantages

A few nice properties of this algorithm are:

  • It works for both hyperparameter and feature selection. There is no need to implement and maintain two different processes/algorithms, both from the mathematical/conceptual point of view (there are fewer concepts to think about) and from the engineering point of view (there is less code to write and maintain, so fewer sources of bugs).

  • This method can quickly find a solution that is better than our current best. Unlike many algorithms, there is no need to execute a very long-running task (such as backward/forward selection). We can interrupt it at any time and relaunch it right afterward. This is particularly useful in our circumstances, where we are constantly receiving new data and changing our evaluation procedure to accommodate it. What we do then is simply interrupt the process, evaluate the current best model with the new data, and start the process again. If we were using Hyperopt instead, for example, the whole history of samples gathered to build the surrogate model would suddenly lose its value.

  • It is a never-ending process. This is interesting for a practical reason: no idle time for our machines.

  • The setup is very easy. The only “state” needed in the system is the current best model, which we already track in version control. There is no need for a database or files to keep a history of runs, for example.

  • It is easy to understand what happens under the hood. So when things are not working very well, we don’t need to think about Gaussian processes (the surrogate model used by most hyperparameter algorithms) or anything like that.

Limitations

Of course, this algorithm is not without challenges. However, for the most relevant difficulties there are ways to overcome them with relatively small extensions to the algorithm. Some of the limitations we found are:

  • It can get stuck at local optima, since all tested alternatives are derived from the current best model. A solution is to apply multiple random steps at once: instead of swapping one feature, we swap two features, or we change a hyperparameter and swap a feature, and then test this new model. By doing this we increase the number of possible modifications the algorithm can make, enlarging its search space and decreasing the chances of getting stuck.

Image by peach turns pro on WordPress, with modifications.
  • It can select highly correlated features. If interpretability is a big concern for you, this algorithm might not be the best bet. Eliminating features based on their correlation with other features might be a good choice even if it decreases out-of-sample accuracy. We, however, clearly choose accuracy in the accuracy vs. interpretability trade-off.

  • New features take too long to be tested if the number of features is too high. We are already running into this problem, since we have more than 1300 features in our feature pool. A new feature chosen at random will have to wait, on average, over a thousand iterations of the algorithm before it is tested, where one iteration consists of computing the out-of-sample accuracy of a proposed model with, for example, a cross-validation procedure. To put that in numbers: if the cross-validation procedure takes 10 minutes, waiting roughly 1300 iterations to test one feature means waiting about 9 days. A solution here is to build a feature queue. For every feature, we store the last time it was swapped; then, instead of choosing features randomly, we choose the feature that hasn't been tested for the longest period, or one that was never tested (a sketch of this queue appears after this list).

  • The running time is directly correlated with the model evaluation time. Therefore, if the evaluation process (usually cross-validation) takes a long time to run, this algorithm will be very slow as well. A possible solution is to introduce a pre-selection procedure: for example, before fully testing a model we first run only one of the cross-validation iterations, and we pass the model to the next step (the full test) only if this partial metric outperforms the current best model's partial metric.

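Here is a minimal sketch of the feature queue mentioned above; the pool contents and timestamps are placeholders:

```python
# Feature queue sketch: propose the feature that has waited longest since its
# last test. Never-tested features sort first, with a default timestamp of 0.
import time

last_swapped = {}  # feature name -> timestamp of its last test (empty at start)

def next_feature(pool):
    """Return the stalest feature in the pool instead of a random one."""
    return min(pool, key=lambda f: last_swapped.get(f, 0.0))

def mark_tested(feature):
    last_swapped[feature] = time.time()

# Hypothetical usage inside the optimization loop:
# feature = next_feature(FEATURE_POOL); ...; mark_tested(feature)
```
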
Results

Most importantly, none of this would matter if we couldn’t reap the benefits in terms of better fraud identification for our customers. Fortunately, Optimus Prime (as we decided to call our robot intern) is a results-improvement machine.

You can see some results below. In this specific case, Optimus Prime found multiple ways to improve our model, and we can see in the pull request the difference in our metrics from the old model to the new one.

Note that this PR was not opened by a human. With the help of Python packages for Git and GitHub, we can automate this whole process. All we do now is wait until we are notified of a new pull request with an improvement from Optimus, check the results, and accept the pull request.

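The automation can be sketched roughly as below, using the GitPython and PyGithub packages; the repository name, branch name, file, and commit message are all hypothetical placeholders, not our actual setup:

```python
# Sketch: push the improved model config to a branch and open a PR for it.
# All names below (repo, branch, file, token variable) are hypothetical.
import os

from git import Repo        # GitPython
from github import Github   # PyGithub

branch = "optimus/improvement"                    # hypothetical branch name
local = Repo(".")                                 # local clone with the updated model
local.git.checkout("-b", branch)
local.git.add("model_config.json")                # hypothetical file with the new best model
local.git.commit("-m", "Improve validation metrics")
local.git.push("origin", branch)

gh = Github(os.environ["GITHUB_TOKEN"])           # token taken from the environment
remote = gh.get_repo("legiti/hypothetical-repo")  # hypothetical repository
remote.create_pull(
    title="Optimus Prime: model improvement",
    body="Metric diffs and randomly selected words of wisdom go here.",
    head=branch,
    base="main",
)
```
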
A very important feature of the GitHub API is that we can add text to pull request descriptions; without it, we wouldn’t be able to put randomly selected words of wisdom by Optimus Prime in them. This way we can always be reminded of our place in the universe and keep strengthening our company culture.

Art by Naihaan at DeviantArt.

Next steps

Some of our main challenges were described in the limitations section. The ones we think are the most relevant for us at the moment are:

1. our lack of certainty about whether the model can escape local optima, and

2. the long time it takes for our algorithm to test new features.

The first extension we are testing alternates between a feature optimization mode and a hyperparameter optimization mode. The feature optimization mode uses the exact same algorithm we described, while for hyperparameter optimization we are trying out Hyperopt, with a limited number of trials.

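Roughly, the alternating extension could look like the sketch below; all helper names are placeholders for the pieces described earlier:

```python
# Sketch of the alternating extension: a stretch of feature search, then a
# short Hyperopt run over the hyperparameters, repeated forever.
from hyperopt import Trials, fmin, tpe

while True:
    run_feature_search(iterations=100)   # the swap-and-test loop from above
    trials = Trials()                    # fresh history, since the features changed
    best_params = fmin(
        hyperparameter_objective,        # cross-validated loss with the current features
        hyperparameter_space,
        algo=tpe.suggest,
        max_evals=30,                    # deliberately limited number of trials
        trials=trials,
    )
    update_best_hyperparameters(best_params)
```
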
Conclusion

In this article we gave a brief introduction to the field of automated machine learning, presenting some well-known tools as well as the solution we built at Legiti. As shown in the results section, we now get continuous improvements in predictive capability, and on top of that with almost no ongoing work from our data scientists, who can focus more on the development of new features and on research in general.

Even with the success we are having, we don't see our work on this front as finished. We now see this process as a core part of our product and, like any other part, it should undergo constant improvement.

If you have similar experiences trying to build an automated machine learning system, feel welcome to contribute to the discussion in the comments below!

Translated from: https://medium.com/legiti/our-data-science-robot-intern-cea29894d40
