當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

【算法竞赛学习】二手车交易价格预测-Baseline

發布時間：2023/12/15 编程问答 43 豆豆

生活随笔收集整理的這篇文章主要介紹了【算法竞赛学习】二手车交易价格预测-Baseline 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

二手車交易價格預測-Baseline

Baseline-v1.0 版

Tip:這是一個最初始baseline版本,拋磚引玉,為大家提供一個基本Baseline和一個競賽流程的基本介紹，歡迎大家多多交流。

賽題：零基礎入門數據挖掘 - 二手車交易價格預測

地址：https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

# 查看數據文件目錄 list datalab files !ls datalab/ 231784

Step 1:導入函數工具箱

## 基礎工具 import numpy as np import pandas as pd import warnings import matplotlib import matplotlib.pyplot as plt import seaborn as sns from scipy.special import jn from IPython.display import display, clear_output import timewarnings.filterwarnings('ignore') %matplotlib inline## 模型預測的 from sklearn import linear_model from sklearn import preprocessing from sklearn.svm import SVR from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor## 數據降維處理的 from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCAimport lightgbm as lgb import xgboost as xgb## 參數搜索和評價的 from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split from sklearn.metrics import mean_squared_error, mean_absolute_error

Step 2:數據讀取

## 通過Pandas對于數據進行讀取 (pandas是一個很友好的數據讀取函數庫) Train_data = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ') TestA_data = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')## 輸出數據的大小信息 print('Train data shape:',Train_data.shape) print('TestA data shape:',TestA_data.shape) Train data shape: (150000, 31) TestA data shape: (50000, 30)

1) 數據簡要瀏覽

## 通過.head() 簡要瀏覽讀取數據的形式 Train_data.head() SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_1401234

0	736	20040402	30.0	6	1.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	2262	20030301	40.0	1	2.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	14874	20040403	115.0	15	1.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	71865	19960908	109.0	10	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	111080	20120103	110.0	5	1.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482

5 rows × 31 columns

2) 數據信息查看

## 通過 .info() 簡要可以看到對應一些數據列名，以及NAN缺失信息 Train_data.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 150000 entries, 0 to 149999 Data columns (total 31 columns): SaleID 150000 non-null int64 name 150000 non-null int64 regDate 150000 non-null int64 model 149999 non-null float64 brand 150000 non-null int64 bodyType 145494 non-null float64 fuelType 141320 non-null float64 gearbox 144019 non-null float64 power 150000 non-null int64 kilometer 150000 non-null float64 notRepairedDamage 150000 non-null object regionCode 150000 non-null int64 seller 150000 non-null int64 offerType 150000 non-null int64 creatDate 150000 non-null int64 price 150000 non-null int64 v_0 150000 non-null float64 v_1 150000 non-null float64 v_2 150000 non-null float64 v_3 150000 non-null float64 v_4 150000 non-null float64 v_5 150000 non-null float64 v_6 150000 non-null float64 v_7 150000 non-null float64 v_8 150000 non-null float64 v_9 150000 non-null float64 v_10 150000 non-null float64 v_11 150000 non-null float64 v_12 150000 non-null float64 v_13 150000 non-null float64 v_14 150000 non-null float64 dtypes: float64(20), int64(10), object(1) memory usage: 35.5+ MB ## 通過 .columns 查看列名 Train_data.columns Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3','v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12','v_13', 'v_14'],dtype='object') TestA_data.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 30 columns): SaleID 50000 non-null int64 name 50000 non-null int64 regDate 50000 non-null int64 model 50000 non-null float64 brand 50000 non-null int64 bodyType 48587 non-null float64 fuelType 47107 non-null float64 gearbox 48090 non-null float64 power 50000 non-null int64 kilometer 50000 non-null float64 notRepairedDamage 50000 non-null object regionCode 50000 non-null int64 seller 50000 non-null int64 offerType 50000 non-null int64 creatDate 50000 non-null int64 v_0 50000 non-null float64 v_1 50000 non-null float64 v_2 50000 non-null float64 v_3 50000 non-null float64 v_4 50000 non-null float64 v_5 50000 non-null float64 v_6 50000 non-null float64 v_7 50000 non-null float64 v_8 50000 non-null float64 v_9 50000 non-null float64 v_10 50000 non-null float64 v_11 50000 non-null float64 v_12 50000 non-null float64 v_13 50000 non-null float64 v_14 50000 non-null float64 dtypes: float64(20), int64(9), object(1) memory usage: 11.4+ MB

3) 數據統計信息瀏覽

## 通過 .describe() 可以查看數值特征列的一些統計信息 Train_data.describe() SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14countmeanstdmin25%50%75%max

150000.000000	150000.000000	1.500000e+05	149999.000000	150000.000000	145494.000000	141320.000000	144019.000000	150000.000000	150000.000000	...	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000
74999.500000	68349.172873	2.003417e+07	47.129021	8.052733	1.792369	0.375842	0.224943	119.316547	12.597160	...	0.248204	0.044923	0.124692	0.058144	0.061996	-0.001000	0.009035	0.004813	0.000313	-0.000688
43301.414527	61103.875095	5.364988e+04	49.536040	7.864956	1.760640	0.548677	0.417546	177.168419	3.919576	...	0.045804	0.051743	0.201410	0.029186	0.035692	3.772386	3.286071	2.517478	1.288988	1.038685
0.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.168192	-5.558207	-9.639552	-4.153899	-6.546556
37499.750000	11156.000000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243615	0.000038	0.062474	0.035334	0.033930	-3.722303	-1.951543	-1.871846	-1.057789	-0.437034
74999.500000	51638.000000	2.003091e+07	30.000000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	...	0.257798	0.000812	0.095866	0.057014	0.058484	1.624076	-0.358053	-0.130753	-0.036245	0.141246
112499.250000	118841.250000	2.007111e+07	66.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265297	0.102009	0.125243	0.079382	0.087491	2.844357	1.255022	1.776933	0.942813	0.680378
149999.000000	196812.000000	2.015121e+07	247.000000	39.000000	7.000000	6.000000	1.000000	19312.000000	15.000000	...	0.291838	0.151420	1.404936	0.160791	0.222787	12.357011	18.819042	13.847792	11.147669	8.658418

8 rows × 30 columns

TestA_data.describe() SaleIDnameregDatemodelbrandbodyTypefuelTypegearboxpowerkilometer...v_5v_6v_7v_8v_9v_10v_11v_12v_13v_14countmeanstdmin25%50%75%max

50000.000000	50000.000000	5.000000e+04	50000.000000	50000.000000	48587.000000	47107.000000	48090.000000	50000.000000	50000.000000	...	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
174999.500000	68542.223280	2.003393e+07	46.844520	8.056240	1.782185	0.373405	0.224350	119.883620	12.595580	...	0.248669	0.045021	0.122744	0.057997	0.062000	-0.017855	-0.013742	-0.013554	-0.003147	0.001516
14433.901067	61052.808133	5.368870e+04	49.469548	7.819477	1.760736	0.546442	0.417158	185.097387	3.908979	...	0.044601	0.051766	0.195972	0.029211	0.035653	3.747985	3.231258	2.515962	1.286597	1.027360
150000.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.160049	-5.411964	-8.916949	-4.123333	-6.112667
162499.750000	11203.500000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243762	0.000044	0.062644	0.035084	0.033714	-3.700121	-1.971325	-1.876703	-1.060428	-0.437920
174999.500000	52248.500000	2.003091e+07	29.000000	6.000000	1.000000	0.000000	0.000000	109.000000	15.000000	...	0.257877	0.000815	0.095828	0.057084	0.058764	1.613212	-0.355843	-0.142779	-0.035956	0.138799
187499.250000	118856.500000	2.007110e+07	65.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265328	0.102025	0.125438	0.079077	0.087489	2.832708	1.262914	1.764335	0.941469	0.681163
199999.000000	196805.000000	2.015121e+07	246.000000	39.000000	7.000000	6.000000	1.000000	20000.000000	15.000000	...	0.291618	0.153265	1.358813	0.156355	0.214775	12.338872	18.856218	12.950498	5.913273	2.624622

8 rows × 29 columns

Step 3:特征與標簽構建

1) 提取數值類型特征列名

numerical_cols = Train_data.select_dtypes(exclude = 'object').columns print(numerical_cols) Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object') categorical_cols = Train_data.select_dtypes(include = 'object').columns print(categorical_cols) Index(['notRepairedDamage'], dtype='object')

2) 構建訓練和測試樣本

## 選擇特征列 feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']] feature_cols = [col for col in feature_cols if 'Type' not in col]## 提前特征列，標簽列構造訓練樣本和測試樣本 X_data = Train_data[feature_cols] Y_data = Train_data['price']X_test = TestA_data[feature_cols]print('X train shape:',X_data.shape) print('X test shape:',X_test.shape) X train shape: (150000, 18) X test shape: (50000, 18) ## 定義了一個統計函數，方便后續信息統計 def Sta_inf(data):print('_min',np.min(data))print('_max:',np.max(data))print('_mean',np.mean(data))print('_ptp',np.ptp(data))print('_std',np.std(data))print('_var',np.var(data))

3) 統計標簽的基本分布信息

print('Sta of label:') Sta_inf(Y_data) Sta of label: _min 11 _max: 99999 _mean 5923.32733333 _ptp 99988 _std 7501.97346988 _var 56279605.9427 ## 繪制標簽的統計圖，查看標簽分布 plt.hist(Y_data) plt.show() plt.close()

4) 缺省值用-1填補

X_data = X_data.fillna(-1) X_test = X_test.fillna(-1)

Step 4:模型訓練與預測

1) 利用xgb進行五折交叉驗證查看模型的參數效果

## xgb-Model xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'scores_train = [] scores = []## 5折交叉驗證方式 sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0) for train_ind,val_ind in sk.split(X_data,Y_data):train_x=X_data.iloc[train_ind].valuestrain_y=Y_data.iloc[train_ind]val_x=X_data.iloc[val_ind].valuesval_y=Y_data.iloc[val_ind]xgr.fit(train_x,train_y)pred_train_xgb=xgr.predict(train_x)pred_xgb=xgr.predict(val_x)score_train = mean_absolute_error(train_y,pred_train_xgb)scores_train.append(score_train)score = mean_absolute_error(val_y,pred_xgb)scores.append(score)print('Train mae:',np.mean(score_train)) print('Val mae',np.mean(scores)) Train mae: 628.086664863 Val mae 715.990013454

2）定義xgb和lgb模型函數

def build_model_xgb(x_train,y_train):model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'model.fit(x_train, y_train)return modeldef build_model_lgb(x_train,y_train):estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2],}gbm = GridSearchCV(estimator, param_grid)gbm.fit(x_train, y_train)return gbm

3）切分數據集（Train,Val）進行模型訓練，評價和預測

## Split data with val x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3) print('Train lgb...') model_lgb = build_model_lgb(x_train,y_train) val_lgb = model_lgb.predict(x_val) MAE_lgb = mean_absolute_error(y_val,val_lgb) print('MAE of val with lgb:',MAE_lgb)print('Predict lgb...') model_lgb_pre = build_model_lgb(X_data,Y_data) subA_lgb = model_lgb_pre.predict(X_test) print('Sta of Predict lgb:') Sta_inf(subA_lgb) Train lgb... MAE of val with lgb: 689.084070621 Predict lgb... Sta of Predict lgb: _min -519.150259864 _max: 88575.1087721 _mean 5922.98242599 _ptp 89094.259032 _std 7377.29714126 _var 54424513.1104 print('Train xgb...') model_xgb = build_model_xgb(x_train,y_train) val_xgb = model_xgb.predict(x_val) MAE_xgb = mean_absolute_error(y_val,val_xgb) print('MAE of val with xgb:',MAE_xgb)print('Predict xgb...') model_xgb_pre = build_model_xgb(X_data,Y_data) subA_xgb = model_xgb_pre.predict(X_test) print('Sta of Predict xgb:') Sta_inf(subA_xgb) Train xgb... MAE of val with xgb: 715.37757816 Predict xgb... Sta of Predict xgb: _min -165.479 _max: 90051.8 _mean 5922.9 _ptp 90217.3 _std 7361.13 _var 5.41862e+07

4）進行兩模型的結果加權融合

## 這里我們采取了簡單的加權融合的方式 val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb val_Weighted[val_Weighted<0]=10 # 由于我們發現預測的最小值有負數，而真實情況下，price為負是不存在的，由此我們進行對應的后修正 print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted)) MAE of val with Weighted ensemble: 687.275745703 sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb## 查看預測值的統計進行 plt.hist(Y_data) plt.show() plt.close()

5）輸出結果

sub = pd.DataFrame() sub['SaleID'] = X_test.SaleID sub['price'] = sub_Weighted sub.to_csv('./sub_Weighted.csv',index=False) sub.head() SaleIDprice01234

0	39533.727414
1	386.081960
2	7791.974571
3	11835.211966
4	585.420407

總結

以上是生活随笔為你收集整理的【算法竞赛学习】二手车交易价格预测-Baseline的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。