Leverage the Power of PyCaret
PyCaret is an open-source, low-code machine learning library in Python that aims to reduce cycle time, letting you go from preparing your data to deploying your model within seconds in your choice of notebook environment.
This article is aimed at someone who is familiar with machine learning concepts and knows how to implement the various machine learning algorithms using libraries such as Scikit-Learn. The ideal reader is aware of the need for automation and doesn't want to spend too much time seeking the optimal algorithm and its hyperparameters.
As machine learning practitioners, we know that the life cycle of a complete data science project involves several steps before we can actually start model building, evaluation, and prediction: data preprocessing (missing-value and null-value treatment, changing data types, encoding techniques for categorical features), data transformations (log, Box-Cox), feature engineering, exploratory data analysis (EDA), and so on. In Python we normally use various libraries such as numpy, pandas, matplotlib, and scikit-learn to accomplish these tasks. PyCaret is a very powerful library that helps us automate this whole process.
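To make that concrete, here is a minimal sketch of the kind of manual preprocessing PyCaret automates, using pandas and scikit-learn (the median-imputation and one-hot choices are illustrative assumptions, not steps from this article):

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('changed.csv')  # the laptop prices dataset used below

# Missing-value treatment: impute numeric columns with the median
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])

# Encoding: one-hot encode every categorical (object-typed) feature
cat_cols = df.select_dtypes(include='object').columns.tolist()
df = pd.get_dummies(df, columns=cat_cols)

With PyCaret, all of this (and more) is inferred and applied for you by a single setup() call, as shown next.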
Installing PyCaret
!pip install pycaret==2.0

Once PyCaret is installed, we are ready to go! I am going to discuss a regression problem here, but PyCaret can be used for many other problem types such as classification, anomaly detection, clustering, and natural language processing.
I am going to use the Laptop Prices dataset here, which I obtained by scraping the Flipkart website.
import pandas as pd

df = pd.read_csv('changed.csv')  # Reading the dataset
df.head()

from pycaret.regression import *
reg = setup(data = df, target = 'Price')
The setup() function of PyCaret does in a single line most of the data preparation that normally takes many lines of code! That's the beauty of this amazing library!
In the target argument of setup() we pass the name of the dependent variable: here we want to predict the price of a laptop, so Price is the target.
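setup() also accepts many optional arguments. As a hedged sketch (the values are illustrative; session_id and normalize are arguments supported by the regression setup in PyCaret 2.0), you can fix the random seed for reproducibility and scale the numeric features:

from pycaret.regression import setup

# session_id fixes the random seed so runs are reproducible;
# normalize=True scales the numeric features before modeling.
reg = setup(data=df, target='Price', session_id=42, normalize=True)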
X = df.drop('Price', axis=1)
Y = df['Price']
Y = pd.DataFrame(Y)
Comparing all the regression models
compare_models()

This trains all the available regression models. After this, we can create any model, either a CatBoost or an XGBoost regressor, and then perform hyperparameter tuning.
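compare_models() can also rank by a specific metric and hand back the best candidates. A sketch, assuming the sort and n_select arguments available in PyCaret 2.0:

# Rank all models by Mean Absolute Error and keep the top 3 for further tuning
best_models = compare_models(sort='MAE', n_select=3)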
We can see that our Gradient Boosting Regressor (GBR) model performed better than all the other models. But I ran the analysis with the XGBoost model as well, and that model ended up performing better than the GBR model.
[Figure: Errors using the Gradient Boosting Regressor model]

Since we have identified XGBoost as the model to use, we create it with the create_model function and pass max_depth, the maximum depth of each boosted tree (the per-iteration scores shown below come from cross-validation folds, not from max_depth).
Creating the model
xgboost = create_model('xgboost', max_depth = 10)

[Figure: Errors using the XGBoost model]

create_model runs 10-fold cross-validation by default, and in every fold it calculates the MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), R2 (R-squared), and MAPE (Mean Absolute Percentage Error). Finally, it displays the mean and standard deviation of each metric across the 10 folds. The lower the error, the better the machine learning model! So, in order to reduce the error, we try to find the hyperparameters that minimize it.
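For reference, these metrics are standard and can be reproduced outside PyCaret with scikit-learn; MAPE is computed manually here since it only joined scikit-learn in version 0.24:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_metrics(y_true, y_pred):
    # Compute the same metrics PyCaret reports per fold
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # assumes no zero targets
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'MAPE': mape}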
To find them, we apply the tune_model function, which uses K-fold cross-validation to search for the best hyperparameters.
Hyper tuning of the model
xgboost = tune_model(xgboost, fold=5)

[Figure: Errors after hyperparameter tuning]

The model runs 5 folds and gives us the mean and standard deviation of every metric. The mean MAE after 5 folds was almost the same for the GBR and XGBoost models, but after tuning and making predictions, the XGBoost model had the lower error and performed better than the GBR model.
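If the default random search is not enough, tune_model can be pointed at a specific metric and given a larger search budget. A sketch, assuming the n_iter and optimize arguments of tune_model in PyCaret 2.0:

# Try 20 random hyperparameter candidates, scored on MAE over 5 folds
xgboost = tune_model(xgboost, fold=5, n_iter=20, optimize='MAE')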
Making predictions using the best model
predict_model(xgboost)

[Figure: Making the predictions]

This scores the model on the hold-out set (we mainly need the Mean Absolute Error). Here we can see that the MAE of the best model has come down to 10847.2257, so the Mean Absolute Error is approximately 10,000.
Checking all the parameters of the xgboost model
print(xgboost)

[Figure: XGBoost model hyperparameters]

plot_model(xgboost, plot='parameter')

[Figure: Checking the hyperparameters]

Residuals Plot
The distances (errors) between the actual and predicted values
plot_model(xgboost, plot='residuals')

[Figure: Residuals plot]

We can clearly see that my model is overfitting: the R-squared is 0.999 on the training set but 0.843 on the test set. This is actually not surprising, because my dataset contains only 168 rows! But the main point here is to highlight the excellent features of PyCaret: you can create plots and curves with just one line of code!
Plotting the Prediction Error
plot_model(xgboost, plot='error')

[Figure: Prediction error]

The R-squared value for the model is 0.843.
Cook's Distance Plot
plot_model(xgboost, plot='cooks')

[Figure: Cook's distance plot]

Learning Curve
plot_model(xgboost, plot='learning')

[Figure: Learning curve]

Validation Curve
plot_model(xgboost, plot='vc')

[Figure: Validation curve]

These two plots also show us that the model is clearly overfitting!
Plot of Feature Importance
plot_model(xgboost, plot='feature')

[Figure: Feature importance]

From this plot we can see that Processor_Type_i9 (an i9 CPU) is a very important feature for determining the price of a laptop.
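If you prefer a table to a plot, the same information can be read straight off the fitted estimator. A sketch, assuming the trained xgboost object from above and the get_config helper PyCaret 2.0 exposes for recovering the transformed feature names:

import pandas as pd
from pycaret.regression import get_config

# Feature names come from the transformed training data PyCaret keeps internally
feature_names = get_config('X_train').columns

importances = pd.Series(xgboost.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))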
Splitting the dataset into training and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
Final XGBoost parameters for deployment
final_xgboost = finalize_model(xgboost)

[Figure: Final parameters of the XGBoost model]

Making predictions on the unseen data (the test set)
new_predictions = predict_model(xgboost, data=X_test)
new_predictions.head()

[Figure: Predictions on the test set]
Saving the transformation pipeline and model
save_model(xgboost, model_name = 'deployment_08082020')
# Output: Transformation Pipeline and Model Successfully Saved

deployment_08082020 = load_model('deployment_08082020')
# Output: Transformation Pipeline and Model Successfully Loaded

deployment_08082020

[Figure: The final machine learning model]
So this is the final Machine Learning model that can be used for deployment.
The model is saved in the pickle format!
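As a sketch of what using the saved model in a separate deployment script could look like (new_laptops.csv is a hypothetical file with the same feature columns as the training data):

import pandas as pd
from pycaret.regression import load_model, predict_model

# load_model reads deployment_08082020.pkl (the pipeline plus the model)
model = load_model('deployment_08082020')

# Score new, unseen laptops
new_laptops = pd.read_csv('new_laptops.csv')
predictions = predict_model(model, data=new_laptops)
print(predictions['Label'].head())  # 'Label' holds the predicted Price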
For more info, check the documentation here
In this article, I have not discussed everything in detail, but you can always refer to my GitHub repository for the whole code. My takeaway from this article: don't expect a perfect model, but do expect something you can use in your own company or project today!
Shout out to Moez Ali for this absolutely brilliant library!
Connect with me on LinkedIn here
The bottom line is that the automation lowers the risk of human error and adds some intelligence to the enterprise system. — Stephen Elliot
I hope you found the article insightful. I would love to hear feedback so I can improve it and come back with better content.
Thank you so much for reading!
Translated from: https://towardsdatascience.com/leverage-the-power-of-pycaret-d5c3da3adb9b