
[Machine Learning] AutoML --- TPOT

Published: 2023/12/15


Introduction to TPOT

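Automated machine learning (AML) is a pipeline that automates the repetitive steps of a machine learning (ML) problem, saving time and letting you focus on the work where your expertise adds the most value. Best of all, it is not just a vague idea: there are working packages built on standard Python ML libraries such as scikit-learn.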

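At this point anyone familiar with machine learning will probably think of grid search, and they are right to. AML is in effect an extension of the grid search available in scikit-learn: instead of iterating over a predefined set of values and their combinations, it searches across methods, features, transformations, and parameter values to find the best solution. This kind of "grid search" therefore does not need to search the space of possible configurations exhaustively. A nice AML implementation is the TPOT package, which uses techniques such as genetic algorithms to mix the parameters within a configuration and converge on the best settings.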
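To make the contrast with grid search concrete, here is a minimal sketch of a genetic search over a small configuration space. It is not TPOT's actual algorithm (TPOT evolves whole pipelines). The search space, the `toy_score` function, and every other name below are hypothetical stand-ins. A real run would replace `toy_score` with a cross-validated model score.

```python
import random

# Hypothetical search space: each key is a pipeline choice, each list its options.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["logistic", "tree", "knn"],
    "strength": [0.01, 0.1, 1.0, 10.0],
}


def toy_score(config):
    """Stand-in for a cross-validated score (higher is better)."""
    score = {"none": 0.0, "standard": 0.2, "minmax": 0.1}[config["scaler"]]
    score += {"logistic": 0.3, "tree": 0.1, "knn": 0.2}[config["model"]]
    score += 0.5 - abs(config["strength"] - 1.0) / 20.0
    return score


def random_config(rng):
    """Draw one random configuration from the search space."""
    return {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each setting from one of the two parents."""
    return {key: rng.choice([parent_a[key], parent_b[key]]) for key in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """Re-draw each setting with probability `rate`."""
    return {
        key: (rng.choice(SEARCH_SPACE[key]) if rng.random() < rate else value)
        for key, value in config.items()
    }


def evolve(generations=5, population_size=8, seed=0):
    """Keep the better half each generation and refill with mutated children."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        parents = population[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=toy_score)


best = evolve()
print(best, round(toy_score(best), 3))
```

Because the better half of the population is carried over unchanged, the best score never gets worse from one generation to the next, yet only a fraction of the 36 possible configurations needs to be scored.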

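TPOT is a software package written in Python that uses a genetic algorithm for feature selection and model selection. With only a few lines of code it generates complete machine learning code.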

TPOT on GitHub: https://github.com/rhiever/tpot
TPOT official documentation: http://rhiever.github.io/tpot/

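As is well known, a machine learning or data mining problem involves roughly the following steps: data cleaning, feature extraction, feature construction, feature selection, model selection, parameter optimization, and finally cross-validation. The whole process is extremely tedious, but TPOT easily handles the feature extraction and model selection parts, which the original post marks as the shaded region of a pipeline diagram (not reproduced here).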

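The original post's figure of TPOT processing the MNIST dataset (not reproduced here) shows TPOT easily reaching a score of 98.4%, which is a very good result for traditional methods (TPOT does not currently include any neural network algorithms such as CNNs). Most importantly, TPOT can also export the whole processing pipeline as Python code, which is genuinely exciting. Talk is cheap; here is the code.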


Installing TPOT

TPOT runs in a Python environment, so you first need to install the following Python libraries:

NumPy
SciPy
scikit-learn
DEAP
update_checker
tqdm

TPOT also supports xgboost models, so you can optionally install xgboost yourself:

pip install xgboost

Finally, install TPOT itself:

pip install tpot

For installation details, see the official documentation. If you run into problems, you can open an issue on the GitHub project page.
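Before running the examples, it can help to check which of the libraries listed above are importable. The snippet below is a small sketch using only the standard library. The import names are assumptions based on how these packages are usually imported (for example, scikit-learn is imported as `sklearn`), and `missing_packages` is a hypothetical helper, not part of TPOT.

```python
import importlib.util

# Import names for the libraries listed above (assumed; scikit-learn -> "sklearn").
REQUIRED = ["numpy", "scipy", "sklearn", "deap", "update_checker", "tqdm"]
OPTIONAL = ["xgboost"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing required:", missing_packages(REQUIRED))
print("missing optional:", missing_packages(OPTIONAL))
```

An empty list for the required packages means `pip install tpot` should find its dependencies already in place.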

TPOT examples

1. Iris

TPOT is simple to use: load the data, create a TPOTClassifier, call fit, and finally export the code.

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load the iris data and hold out 25% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data.astype(np.float64), iris.target.astype(np.float64),
    train_size=0.75, test_size=0.25)

# Evolve pipelines for 5 generations with 20 candidates per generation
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Write the best pipeline found as a standalone Python script
tpot.export('tpot_iris_pipeline.py')

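The generated tpot_iris_pipeline.py looks like this (it comes from an older TPOT release, so it still imports sklearn.cross_validation, which scikit-learn 0.20 removed in favor of sklearn.model_selection):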

import numpy as np

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    LogisticRegression(C=0.9, dual=False, penalty="l2")
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

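2. Titanic (Kaggle)

TPOT does not include any data cleaning, so the data has to be cleaned by hand first. The code TPOT finally generated for the whole example is shown below (it also comes from an older TPOT release and imports sklearn.cross_validation, which later scikit-learn versions replaced with sklearn.model_selection):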

import numpy as np
import pandas as pd

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import PolynomialFeatures

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR')
training_indices, testing_indices = train_test_split(tpot_data.index, stratify = tpot_data['class'].values, train_size=0.75, test_size=0.25)

result1 = tpot_data.copy()

# Use Scikit-learn's PolynomialFeatures to construct new features from the existing feature set
training_features = result1.loc[training_indices].drop('class', axis=1)

if len(training_features.columns.values) > 0 and len(training_features.columns.values) <= 700:
    # The feature constructor must be fit on only the training data
    poly = PolynomialFeatures(degree=2, include_bias=False)
    poly.fit(training_features.values.astype(np.float64))
    constructed_features = poly.transform(result1.drop('class', axis=1).values.astype(np.float64))
    result1 = pd.DataFrame(data=constructed_features)
    # Restore the class column from the original data; the new DataFrame has none
    result1['class'] = tpot_data['class'].values
else:
    result1 = result1.copy()

result2 = result1.copy()
# Perform classification with an Ada Boost classifier
adab2 = AdaBoostClassifier(learning_rate=0.15, n_estimators=500, random_state=42)
adab2.fit(result2.loc[training_indices].drop('class', axis=1).values, result2.loc[training_indices, 'class'].values)

result2['adab2-classification'] = adab2.predict(result2.drop('class', axis=1).values)


TPOT Notes

1. TPOTClassifier()

The core of TPOT is this constructor, so when using TPOT be sure you understand the important parameters of TPOTClassifier():

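generations: number of evolution rounds of the genetic algorithm; think of it as the number of iterations
population_size: number of candidate pipelines in the population in each generation
num_cv_folds: number of cross-validation folds (renamed cv in later TPOT releases)
scoring: the metric used to evaluate pipelines, playing the role of the loss function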

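generations and population_size together determine how many pipelines TPOT evaluates, and therefore how long a run takes. The remaining parameters are described in the official documentation.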
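As a rough worked example of how these two parameters drive the cost: according to the TPOT documentation, a run evaluates about population_size + generations × offspring_size pipelines, where offspring_size defaults to population_size. The helper names below are hypothetical, and the fold count of 5 is only an illustrative assumption. Actual counts can be lower when TPOT skips duplicate pipelines.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Approximate pipeline count: population_size + generations * offspring_size."""
    if offspring_size is None:
        offspring_size = population_size  # documented default
    return population_size + generations * offspring_size


def model_fits(generations, population_size, cv_folds, offspring_size=None):
    """Each evaluated pipeline is fit roughly once per cross-validation fold."""
    return pipelines_evaluated(generations, population_size, offspring_size) * cv_folds


# Settings from the iris example: generations=5, population_size=20
print(pipelines_evaluated(5, 20))     # 120
print(model_fits(5, 20, cv_folds=5))  # 600
```

Doubling either parameter roughly doubles the number of model fits, which is why small values are a sensible starting point.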

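Tips for TPOT on large-scale data

TPOT is very fast on small datasets and gives strong results, but on large-scale data it is very slow. For a data mining problem, a practical approach is to clean the data, draw a small sample, and run TPOT on that sample. This usually yields a reasonably good starting algorithm.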
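One way to draw that small sample while preserving the class balance is stratified sampling. The function below is a minimal standard-library sketch. `stratified_sample_indices` and its parameters are hypothetical names, and the 10% fraction and the synthetic labels are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, fraction, seed=0):
    """Pick about `fraction` of the row indices from each class, at least one per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


# Synthetic imbalanced labels: 900 rows of class 0 and 100 rows of class 1
labels = [0] * 900 + [1] * 100
subset = stratified_sample_indices(labels, fraction=0.1)
print(len(subset))                     # 100
print(sum(labels[i] for i in subset))  # 10
```

The selected indices can then be used to slice the cleaned feature matrix and target before calling fit.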

Some suggestions:

Set n_jobs=1 for the TPOT object and make sure there is enough RAM to avoid memory issues;

Use the TPOT light configuration;

Apply MDR or other feature selection/dimension reduction algorithms to reduce the number of features before using TPOT.

As TPOT is based on scikit-learn, it supports large-scale ML about as well as scikit-learn does (i.e., not great). My recommendation is to look into ML packages based on TensorFlow and/or that have GPU support to scale ML to that size of data.

If you plan to use Dask for parallel training, make sure to install dask[delay] and dask_ml.
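The first two suggestions can be expressed as constructor arguments. The sketch below assumes the classic TPOT API, where `config_dict='TPOT light'` selects the light configuration and `n_jobs` controls parallelism. Newer TPOT releases may differ. `LARGE_DATA_TPOT_KWARGS` and `build_light_classifier` are hypothetical names, and the small generations and population_size values are illustrative choices for a quick first run.

```python
# Settings drawn from the suggestions above; the generations and
# population_size values are illustrative assumptions.
LARGE_DATA_TPOT_KWARGS = {
    "generations": 5,
    "population_size": 20,
    "n_jobs": 1,                  # single process to keep memory use down
    "config_dict": "TPOT light",  # restrict the search to simpler, faster operators
    "verbosity": 2,
}


def build_light_classifier(**overrides):
    """Create a TPOTClassifier with the settings above, optionally overridden."""
    # Imported lazily so the settings dict can be inspected without TPOT installed.
    from tpot import TPOTClassifier
    return TPOTClassifier(**{**LARGE_DATA_TPOT_KWARGS, **overrides})


print(LARGE_DATA_TPOT_KWARGS)
```

With TPOT installed, `build_light_classifier(generations=10).fit(X_sample, y_sample)` would run the search on a sample, where `X_sample` and `y_sample` stand for your own data.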
