Used Car Price Prediction Using Machine Learning
You can find all the Python scripts related to this project on my GitHub page. If you are interested, the scripts used for data cleaning and data visualization for this study are in the same repository. The project is also deployed with Django on Heroku. View Deployment
Content
Why is the price feature scaled by log transformation?
In a regression model, for any fixed value of X, Y should be normally distributed. In this problem the target variable (price) is not normally distributed; it is right-skewed.
To solve this, a log transformation is applied to the target variable because it has a skewed distribution, and the inverse function must be applied to the predicted values to recover the actual predicted target value.
Because of this, when evaluating the models, the RMSLE is calculated to check the error and the R² score is calculated to evaluate the accuracy.
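As a rough illustration of this workflow, here is a minimal sketch — not the author's actual code; `model`, `X_train`, `y_train`, `X_test`, and `y_test` are placeholder names for any regressor and a held-out split:

```python
import numpy as np
from sklearn.metrics import r2_score

# Train on the log-transformed target; log1p is safe even for zero prices.
y_train_log = np.log1p(y_train)
model.fit(X_train, y_train_log)

# Predictions come back on the log scale; expm1 is the inverse transform.
pred_log = model.predict(X_test)
predicted_price = np.expm1(pred_log)

# RMSLE is simply the RMSE computed on the log1p scale.
rmsle = np.sqrt(np.mean((pred_log - np.log1p(y_test)) ** 2))
r2 = r2_score(np.log1p(y_test), pred_log)
print(f"RMSLE: {rmsle:.4f}  R2: {r2:.4f}")
```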
Some Key Concepts:
Learning Rate: The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While a low learning rate might be a good idea in terms of making sure we do not miss any local minima, it can also mean taking a long time to converge, especially if we get stuck on a plateau region.
n_estimators: The number of trees to build before taking the maximum voting or averages of predictions. A higher number of trees gives better performance but makes the code slower.
R² Score: A statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean.
1. The Data:
The dataset used in this project was downloaded from Kaggle.
2. Data Cleaning:
The first step is to remove irrelevant/useless features such as ‘url’, ‘region_url’, ‘vin’, ‘image_url’, ‘description’, ‘county’, and ‘state’ from the dataset, as sketched below.
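A minimal sketch of this step follows; the file name `vehicles.csv` and the lowercase column names are assumptions about the Kaggle export, not taken from the author's scripts:

```python
import pandas as pd

df = pd.read_csv("vehicles.csv")

# Drop features that carry no useful signal for predicting price.
irrelevant = ["url", "region_url", "vin", "image_url",
              "description", "county", "state"]
df = df.drop(columns=irrelevant)

# Inspect what remains and how many values are missing per feature.
print(df.shape)
print(df.isnull().sum().sort_values(ascending=False))
```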
As a next step, check missing values for each feature.
Showing missing values (Image by Panwar Abhash Anil)
Next, the missing values were filled with appropriate values by an appropriate method.
To fill the missing values, the IterativeImputer method is used; different estimators are implemented, and the MSE of each estimator is then calculated using cross_val_score.
From the above figure, we can conclude that the ExtraTreesRegressor estimator is the better choice for the imputation method to fill the missing values; a sketch of this comparison follows.
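Below is one way such a comparison can be set up, assuming `X` and `y` are the feature matrix (still containing missing values) and the target; the estimator list and the Ridge scoring model are illustrative choices, not the author's exact setup:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

estimators = [
    BayesianRidge(),
    ExtraTreesRegressor(n_estimators=10, random_state=0),
    KNeighborsRegressor(n_neighbors=15),
]

for est in estimators:
    # Impute with the candidate estimator, then score a simple downstream model.
    pipe = make_pipeline(IterativeImputer(estimator=est, random_state=0), Ridge())
    mse = -cross_val_score(pipe, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{type(est).__name__}: MSE = {mse:.4f}")
```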
(Image by Panwar Abhash Anil)
At last, after dealing with the missing values, there are zero null values.
Outliers: The InterQuartile Range (IQR) method is used to remove outliers from the data, as sketched below.
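A minimal sketch of IQR filtering, assuming the cleaned DataFrame `df` from the previous step (the column list is illustrative):

```python
def remove_outliers_iqr(df, col):
    # Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for the given column.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

for col in ["price", "odometer", "year"]:
    df = remove_outliers_iqr(df, col)
print(df.shape)
```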
(Images by Panwar Abhash Anil)
- From figure 1, prices whose log is below 6.55 or above 11.55 are outliers.
- From figure 2, it is impossible to conclude anything directly, so the IQR is calculated to find the outliers, i.e. odometer values below 6.55 or above 11.55 are outliers.
- From figure 3, years below 1995 or above 2020 are outliers.
Finally, the shape of the dataset before processing is (435849, 25) and after processing is (374136, 18): a total of 61713 rows and 7 columns were removed.
3. Data Preprocessing:
Label Encoder: In our dataset, 12 features are categorical variables and 4 are numerical variables (price column excluded). To apply the ML models, we need to transform these categorical variables into numerical ones; the sklearn LabelEncoder is used to solve this problem.
Normalization: The dataset is not normally distributed and the features have different ranges. Without normalization, the ML model will tend to disregard the coefficients of features with small values, because their impact is tiny compared to features with large values. Hence, the sklearn MinMaxScaler is used for normalization.
Train/test split: 90% of the data was used for training and 10% was held out as test data. A sketch of these preprocessing steps follows.
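Putting the three steps together, a minimal sketch, assuming the cleaned DataFrame `df` from earlier (the log-transformed target follows the discussion at the start of the article):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

# Encode each categorical column with its own LabelEncoder.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

X = df.drop(columns=["price"])
y = np.log1p(df["price"])  # log-transformed target, as discussed above

# Scale all features to [0, 1] so no feature dominates by magnitude alone.
X = MinMaxScaler().fit_transform(X)

# 90% train / 10% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
```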
4. ML Models:
In this section, different machine learning algorithms are used to predict the price (the target variable).
This is a supervised learning problem, so the models are applied in the following order:
Linear Regression
Ridge Regression
Lasso Regression
K-Neighbors Regressor
Random Forest Regressor
Bagging Regressor
AdaBoost Regressor
XGBoost
(1) Linear Regression:
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). In linear regression, the relationships are modelled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. More Details
Coefficients: The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.
- A positive sign indicates that as the predictor variable increases, the response variable also increases.
- A negative sign indicates that as the predictor variable increases, the response variable decreases.
Considering this figure, linear regression suggests that year, cylinders, transmission, fuel, and odometer are the five most important variables; a minimal fitting sketch follows.
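A minimal sketch of fitting the model and reading coefficient signs; `feature_names` is an assumed list of the column names behind `X_train`:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X_train, y_train)

# The sign of each coefficient gives the direction of the relationship;
# sorting by magnitude surfaces the most influential predictors.
for name, coef in sorted(zip(feature_names, lr.coef_),
                         key=lambda t: abs(t[1]), reverse=True):
    print(f"{name:15s} {coef:+.4f}")
```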
(Image by Panwar Abhash Anil)
(2) Ridge Regression:
Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value.
To find the best alpha value for ridge regression, the yellowbrick library's AlphaSelection visualizer was applied.
Graph showing the best value of alpha
From the figure, the best alpha value to fit the dataset is 20.336.
Note: The value of alpha is not constant; it varies from run to run.
Using this value of alpha, the Ridge regressor is implemented, as sketched below.
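A sketch of the alpha search and the final fit, assuming yellowbrick's AlphaSelection wrapped around a RidgeCV; the alpha grid is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from yellowbrick.regressor import AlphaSelection

# Search a log-spaced grid of alphas and visualize the cross-validated error.
alphas = np.logspace(-2, 3, 200)
viz = AlphaSelection(RidgeCV(alphas=alphas))
viz.fit(X_train, y_train)
viz.show()

# Refit a plain Ridge model with the selected alpha.
ridge = Ridge(alpha=viz.estimator.alpha_).fit(X_train, y_train)
print("Test R^2:", ridge.score(X_test, y_test))
```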
Graph showing important features
Considering this figure, ridge regression suggests that year, cylinders, transmission, fuel, and odometer are the five most important variables.
(Image by Panwar Abhash Anil)
The performance of ridge regression is almost the same as linear regression.
(3) Lasso Regression:
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).
Why is lasso regression used?
The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients of some variables to shrink toward zero.
(Image by Panwar Abhash Anil)
But for this dataset there is no need for lasso regression, as there is not much difference in error.
(4) K-Neighbors Regressor: regression based on k-nearest neighbors.
The target is predicted by local interpolation of the targets associated with the nearest neighbours in the training set.
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. Read More
(Image by Panwar Abhash Anil)
From the above figure, KNN gives the least error for k=5, so the dataset is trained using n_neighbors=5 and metric='euclidean', as in the sketch below.
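A sketch of the k search and the final fit; the scanned range of k is an assumption:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Scan a small range of k and report the test RMSE for each.
for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k, metric="euclidean").fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
    print(f"k={k}: RMSE={rmse:.4f}")

# Final model with the best k found above.
knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean").fit(X_train, y_train)
```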
(Image by Panwar Abhash Anil)
KNN performs better: the error decreases and the accuracy increases.
(5) Random Forest:
The random forest is an ensemble algorithm consisting of many decision trees. It uses bagging and feature randomness when building each tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Read More
In our model, 180 decision trees are created with max_features=0.5, as in the sketch below.
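A sketch of that configuration; other hyperparameters are left at their defaults as an assumption, and `feature_names` is again an assumed list of column names:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=180, max_features=0.5,
                           n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("Test R^2:", rf.score(X_test, y_test))

# Feature importances back the bar plot discussed below.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True)[:5]:
    print(f"{name:15s} {imp:.3f}")
```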
Performance of Random Forest (true value vs predicted value)
This simple bar plot illustrates that year is the most important feature of a car, followed by the odometer variable and then the others.
(Image by Panwar Abhash Anil)
The performance of the random forest is better, and accuracy increases by approx. 10%, which is good. Since the random forest uses bagging when building each tree, a Bagging regressor is tried next.
(6) Bagging Regressor:
A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. Read More
In our model, DecisionTreeRegressor is used as the base estimator with max_depth=20, creating 50 decision trees; a sketch of this setup and the results are shown below.
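A minimal sketch of that setup (note: on sklearn versions before 1.2 the keyword is base_estimator rather than estimator):

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=20),  # depth-20 base trees
    n_estimators=50, n_jobs=-1, random_state=42)
bag.fit(X_train, y_train)
print("Test R^2:", bag.score(X_test, y_test))
```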
(Image by Panwar Abhash Anil)
The performance of the random forest is much better than that of the bagging regressor.
The key difference between random forest and bagging: in random forests, only a random subset of features is considered, and the best split feature from that subset is used to split each node in a tree, unlike bagging, where all features are considered when splitting a node.
(7) AdaBoost Regressor:
AdaBoost can be used to boost the performance of any machine learning algorithm. It helps you combine multiple “weak classifiers” into a single “strong classifier”. Library used: AdaBoostRegressor. Read More
This simple bar plot illustrates that year is the most important feature of a car, followed by the odometer variable, then model, etc.
In our model, DecisionTreeRegressor is used as the base estimator with max_depth=24, creating 200 trees, and the model is trained with learning_rate=0.6; a sketch and the result are shown below.
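A minimal sketch of that setup (same estimator/base_estimator caveat as with bagging):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=24),  # depth-24 base trees
    n_estimators=200, learning_rate=0.6, random_state=42)
ada.fit(X_train, y_train)
print("Test R^2:", ada.score(X_test, y_test))
```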
(Image by Panwar Abhash Anil)
(8) XGBoost: XGBoost stands for eXtreme Gradient Boosting
XGBoost is an ensemble learning method: an implementation of gradient boosted decision trees designed for speed and performance. The beauty of this powerful algorithm lies in its scalability, which drives fast learning through parallel and distributed computing and offers efficient memory usage. Read More
This simple bar plot, in descending order of importance, illustrates which features of a car matter most.
According to XGBoost, odometer is the most important feature, whereas in the previous models year was the most important feature.
In this model, 200 decision trees of maximum depth 24 are created, and the model is trained with a 0.4 learning rate, as in the sketch below.
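A minimal sketch of that configuration using the xgboost sklearn wrapper:

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=200, max_depth=24, learning_rate=0.4,
                   n_jobs=-1, random_state=42)
xgb.fit(X_train, y_train)
print("Test R^2:", xgb.score(X_test, y_test))
```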
(Image by Panwar Abhash Anil)
5. Comparison of the Performance of the Models:
(Images by Panwar Abhash Anil)
From the above figures, we can conclude that the XGBoost regressor, with 89.662% accuracy, performs better than the other models.
6. Some Insights from the Dataset:
1. From the pair plot, we cannot conclude anything; there is no correlation between the variables.
Pair plot to find correlation
2. From the distplot, we can conclude that initially the price increases rapidly, but after a particular point it starts decreasing.
(Image by Panwar Abhash Anil)
3. From figure 1, we see that diesel cars have the highest price, followed by electric cars; hybrid cars have the lowest price.
Bar plot showing the price of each fuel type
4. From figure 2, we see that the price for each fuel type also depends on the condition of the car.
Bar plot of fuel vs price with condition as hue
5. From figure 3, car prices increase year over year after 1995, and from figure 4, the number of cars also increases per year; at one point, around 2012, the numbers are nearly the same.
Graph showing how the price varies per year
6. From figure 5, the price of a car also depends on its condition, and from figure 6, the price varies with the condition of the cars and their size as well.
Bar plot showing the price for each condition of the car
7. From figures 7-8, the price of a car also varies with its transmission. People are ready to buy cars with “other” transmissions, and the price of cars with manual transmissions is low.
(Image by Panwar Abhash Anil)
8. Below are similar graphs with the same insight but different features.
Conclusion:
By trying different ML models, we aim to get a better result, i.e. less error with maximum accuracy. Our purpose was to predict the price of used cars from 25 predictors and 509577 data entries.
Initially, data cleaning is performed to remove the null values and outliers from the dataset; then ML models are implemented to predict the price of cars.
Next, the features were explored in depth with the help of data visualization, and the relations between the features were examined.
From the table below, it can be concluded that XGBoost is the best model for predicting used car prices. XGBoost as a regression model gave the best MSLE and RMSLE values.
(Image by Panwar Abhash Anil)
Source: https://towardsdatascience.com/used-car-price-prediction-using-machine-learning-e3be02d977b2