kaggle (02) - House Price Prediction Case Study (Advanced Edition)

Published: 2023/12/13

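House Price Prediction Case Study (Advanced Edition)

This is the advanced version of the notebook. Its main purpose is to compare several model frameworks, so I have not changed anything in the earlier feature-engineering part; the focus is entirely on the model-building section at the end.

Step 1: Inspect the source dataset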

import numpy as np
import pandas as pd

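Read in the data

Generally speaking, the index column of the source data is not much use as a feature, so we can use it as the index of our pandas DataFrame. That also makes lookups easier later on.
Wherever there are people, there is a pecking order. Just like Zhihu, Kaggle is a dangerous place full of sneers. By default Kaggle puts the data under the input folder, so when we write a tutorial we might as well follow this convention and look sophisticated.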

train_df = pd.read_csv('../input/train.csv', index_col=0)
test_df = pd.read_csv('../input/test.csv', index_col=0)
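To see what index_col=0 actually does, here is a minimal sketch on a tiny in-memory CSV. The two rows reuse values from the first rows of the training data, but the three-column layout is a cut-down stand-in for illustration, not the real file.

```python
import io
import pandas as pd

# Cut-down stand-in for train.csv: only three of the real columns.
csv_text = "Id,MSSubClass,SalePrice\n1,60,208500\n2,20,181500\n"

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The first column (Id) becomes the index instead of a feature column.
print(df.index.name)            # Id
print(list(df.columns))         # ['MSSubClass', 'SalePrice']

# Rows can now be looked up directly by their Id.
print(df.loc[2, "SalePrice"])   # 181500
```

Because Id is the index, it never ends up among the features fed to a model later on.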

Inspect the source data

train_df.head()

Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
1           60        RL         65.0     8450    Pave    NaN       Reg          Lvl     AllPub     Inside  ...         0     NaN    NaN          NaN        0       2    2008        WD         Normal     208500
2           20        RL         80.0     9600    Pave    NaN       Reg          Lvl     AllPub        FR2  ...         0     NaN    NaN          NaN        0       5    2007        WD         Normal     181500
3           60        RL         68.0    11250    Pave    NaN       IR1          Lvl     AllPub     Inside  ...         0     NaN    NaN          NaN        0       9    2008        WD         Normal     223500
4           70        RL         60.0     9550    Pave    NaN       IR1          Lvl     AllPub     Corner  ...         0     NaN    NaN          NaN        0       2    2006        WD        Abnorml     140000
5           60        RL         84.0    14260    Pave    NaN       IR1          Lvl     AllPub        FR2  ...         0     NaN    NaN          NaN        0      12    2008        WD         Normal     250000

5 rows × 80 columns

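At this point you should have a rough idea of which parts need some manual treatment to make the source data easier to process.

Step 2: Merge the data

We do this mainly because it makes preprocessing with a DataFrame more convenient. Once all the required preprocessing is done, we split the two sets apart again.

First, SalePrice is our training target, so it only appears in the training set and not in the test set (otherwise what would you be testing?). So we first take the SalePrice column out so that it does not get in the way.

Let's first see what SalePrice looks like: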

%matplotlib inline
prices = pd.DataFrame({"price": train_df["SalePrice"], "log(price + 1)": np.log1p(train_df["SalePrice"])})
prices.hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000009B8DE48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000009BF4710>]], dtype=object)

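As you can see, the label itself is skewed rather than smooth. So that our model (a regressor here) learns more accurately, we first "smooth" the label, that is, bring it closer to a normal distribution.

Most people miss this step, which is why their results never quite reach a decent standard.

Here we use the fanciest option, log1p, i.e. log(x+1). It avoids the problem values of plain log: log(x) is undefined at x = 0 and negative for 0 < x < 1, whereas log1p(x) is 0 at x = 0 and non-negative for every x >= 0.

Remember: if we smooth the data here, then when we compute the final result we must transform the predicted smoothed values back.

Following the "go back the way you came" principle, log1p() needs expm1(); likewise, log() needs exp(), and so on.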
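The round trip can be checked directly. This is a small sketch that reuses the first three SalePrice values from the table above; nothing else in it comes from the notebook.

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])  # first three SalePrice values

log_prices = np.log1p(prices)     # forward transform: log(1 + x)
restored = np.expm1(log_prices)   # inverse transform: exp(y) - 1

print(log_prices)                      # values on the log scale, around 12
print(np.allclose(restored, prices))   # True: expm1 undoes log1p
print(np.log1p(0.0))                   # 0.0, whereas np.log(0.0) would give -inf
```

Any prediction made on the log scale has to go through np.expm1 before it is read as a price.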

y_train = np.log1p(train_df.pop('SalePrice'))

Then we merge the remaining parts together

all_df = pd.concat((train_df, test_df), axis=0)

Now we can see that all_df is our combined DataFrame

all_df.shape
(2919, 79)

while y_train is the SalePrice column (after log1p)

y_train.head()
Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64
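The merge-then-split pattern used in this step can be sketched on two miniature frames. The frames and their values below are hypothetical stand-ins; the point is only that splitting back by index works because the Id values of the two sets do not overlap.

```python
import pandas as pd

# Hypothetical miniature train/test frames, both indexed by Id.
train = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]},
                     index=pd.Index([1, 2], name="Id"))
test = pd.DataFrame({"LotArea": [11622]}, index=pd.Index([1461], name="Id"))

y = train.pop("SalePrice")               # pop removes the column from train in place
both = pd.concat((train, test), axis=0)  # stack the rows of both sets

# ... shared preprocessing on `both` would go here ...

train_again = both.loc[train.index]      # recover each part by its original Ids
test_again = both.loc[test.index]
print(both.shape, train_again.shape, test_again.shape)
```

If the two sets shared Id values, .loc would return rows from both, so this pattern assumes disjoint indices.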

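Step 3: Variable transformation

This is similar to "feature engineering": we unify data that is awkward to process or not in a uniform form.

Correct the variable types

First, note that the values of MSSubClass should really be a category,

but pandas has no way of knowing that. When you use a DataFrame, numeric codes like these are stored as numbers by default.

That is quite misleading, so we need to turn it back into a string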

all_df['MSSubClass'].dtypes
dtype('int64')
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)

Once it is a str, a quick count makes things clear

all_df['MSSubClass'].value_counts()
20     1079
60      575
50      287
120     182
30      139
70      128
160     128
80      118
90      109
190      61
85       48
75       23
45       18
180      17
40        6
150       1
Name: MSSubClass, dtype: int64

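Convert categorical variables to a numerical representation

When we use numbers to represent categories, bear in mind that numbers carry an ordering by magnitude, so assigning numbers carelessly causes trouble for model training later on. Instead we can use One-Hot encoding to represent categories.

pandas' built-in get_dummies function does One-Hot encoding for you in one call.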

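pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass').head()

Columns (16): MSSubClass_120, MSSubClass_150, MSSubClass_160, MSSubClass_180, MSSubClass_190, MSSubClass_20, MSSubClass_30, MSSubClass_40, MSSubClass_45, MSSubClass_50, MSSubClass_60, MSSubClass_70, MSSubClass_75, MSSubClass_80, MSSubClass_85, MSSubClass_90

Only three of these columns are non-zero in the first five rows; every other column is 0 for all five:

Id  MSSubClass_20  MSSubClass_60  MSSubClass_70
1               0              1              0
2               1              0              0
3               0              1              0
4               0              0              1
5               0              1              0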

MSSubClass has now been split into 16 columns, one per category: 1 if the row belongs to that category, 0 if not.
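The reason for the astype(str) cast shows up clearly on a toy frame. This sketch uses the five MSSubClass codes from the head of the data; called on a whole DataFrame, get_dummies leaves numeric columns alone and only encodes string-like or categorical ones.

```python
import pandas as pd

df = pd.DataFrame({"MSSubClass": [60, 20, 60, 70, 60]},
                  index=pd.Index([1, 2, 3, 4, 5], name="Id"))

# While the column is int64, get_dummies on the frame leaves it untouched.
as_numbers = pd.get_dummies(df)
print(list(as_numbers.columns))       # ['MSSubClass']

# After casting to str, each distinct code becomes its own indicator column.
df["MSSubClass"] = df["MSSubClass"].astype(str)
one_hot = pd.get_dummies(df)
print(list(one_hot.columns))          # ['MSSubClass_20', 'MSSubClass_60', 'MSSubClass_70']
print(one_hot.sum(axis=1).tolist())   # [1, 1, 1, 1, 1]: exactly one indicator per row
```

Skipping the cast would silently keep MSSubClass as one numeric feature, with 190 treated as "bigger" than 20.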

In the same way, we One-Hot encode all of the categorical data

all_dummy_df = pd.get_dummies(all_df)
all_dummy_df.head()

First 10 columns:

Id  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF
1          65.0     8450            7            5       2003          2003       196.0       706.0         0.0      150.0
2          80.0     9600            6            8       1976          1976         0.0       978.0         0.0      284.0
3          68.0    11250            7            5       2001          2002       162.0       486.0         0.0      434.0
4          60.0     9550            7            5       1915          1970         0.0       216.0         0.0      540.0
5          84.0    14260            8            5       2000          2000       350.0       655.0         0.0      490.0

Last 10 columns:

Id  SaleType_ConLw  SaleType_New  SaleType_Oth  SaleType_WD  SaleCondition_Abnorml  SaleCondition_AdjLand  SaleCondition_Alloca  SaleCondition_Family  SaleCondition_Normal  SaleCondition_Partial
1                0             0             0            1                      0                      0                     0                     0                     1                      0
2                0             0             0            1                      0                      0                     0                     0                     1                      0
3                0             0             0            1                      0                      0                     0                     0                     1                      0
4                0             0             0            1                      1                      0                     0                     0                     0                      0
5                0             0             0            1                      0                      0                     0                     0                     1                      0

5 rows × 303 columns

Handle the numerical variables

Even numerical variables still have a few small problems.

For example, some values are missing:

all_dummy_df.isnull().sum().sort_values(ascending=False).head(10)
LotFrontage     486
GarageYrBlt     159
MasVnrArea       23
BsmtHalfBath      2
BsmtFullBath      2
BsmtFinSF2        1
GarageCars        1
TotalBsmtSF       1
BsmtUnfSF         1
GarageArea        1
dtype: int64

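As you can see, the column with the most missing values is LotFrontage

Handling these missing values comes down to reading the problem statement carefully. Usually the dataset description spells out clearly what each kind of missing value means. And if it really doesn't, you can only fall back on your own guesswork.

Here, we fill the gaps with the mean of each column.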

mean_cols = all_dummy_df.mean()
mean_cols.head(10)
LotFrontage        69.305795
LotArea         10168.114080
OverallQual         6.089072
OverallCond         5.564577
YearBuilt        1971.312778
YearRemodAdd     1984.264474
MasVnrArea        102.201312
BsmtFinSF1        441.423235
BsmtFinSF2         49.582248
BsmtUnfSF         560.772104
dtype: float64
all_dummy_df = all_dummy_df.fillna(mean_cols)

Let's check that there are no gaps left:

all_dummy_df.isnull().sum().sum()
0
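What fillna does with a Series of column means is easiest to see on a toy frame. The values below are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap per column.
df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0, 60.0],
                   "GarageCars": [2.0, 2.0, np.nan, 3.0]})

col_means = df.mean()           # NaN entries are skipped when averaging
filled = df.fillna(col_means)   # each column's gaps get that column's own mean

print(filled.isnull().sum().sum())    # 0
print(filled.loc[1, "LotFrontage"])   # (65 + 68 + 60) / 3
```

fillna matches the Series to the columns by label, so each gap receives the mean of its own column rather than one global value.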

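Standardize the numerical data

This step is not strictly necessary; it depends on which model you want to use. Generally speaking, regression models are rather fussy, and it is best to put the source data on a standard scale so that the differences between values are not too large.

Here, of course, we do not need to standardize the One-Hot 0/1 columns. Our target should be the columns that were numerical to begin with:

First, let's see which ones are numerical: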

numeric_cols = all_df.columns[all_df.dtypes != 'object']
numeric_cols
Index([u'LotFrontage', u'LotArea', u'OverallQual', u'OverallCond',
       u'YearBuilt', u'YearRemodAdd', u'MasVnrArea', u'BsmtFinSF1',
       u'BsmtFinSF2', u'BsmtUnfSF', u'TotalBsmtSF', u'1stFlrSF', u'2ndFlrSF',
       u'LowQualFinSF', u'GrLivArea', u'BsmtFullBath', u'BsmtHalfBath',
       u'FullBath', u'HalfBath', u'BedroomAbvGr', u'KitchenAbvGr',
       u'TotRmsAbvGrd', u'Fireplaces', u'GarageYrBlt', u'GarageCars',
       u'GarageArea', u'WoodDeckSF', u'OpenPorchSF', u'EnclosedPorch',
       u'3SsnPorch', u'ScreenPorch', u'PoolArea', u'MiscVal', u'MoSold',
       u'YrSold'],
      dtype='object')

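Compute the standardized values: (X - X') / s, where X' is the column mean and s is the column standard deviation

This puts our data points on a common scale and makes them easier to compute with.

Note: we could also keep using log here; I am just showing you more than one way to "smooth" the data.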

numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
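A quick sanity check of the standardization step, as a sketch: the LotArea values are the five shown in the head of the data, and is_RL is a hypothetical stand-in for a One-Hot column that should stay untouched.

```python
import pandas as pd

df = pd.DataFrame({"LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0],
                   "is_RL": [1, 1, 1, 1, 1]})   # hypothetical 0/1 indicator column
numeric_cols = ["LotArea"]

means = df.loc[:, numeric_cols].mean()
stds = df.loc[:, numeric_cols].std()    # pandas computes the sample std (ddof=1)
df.loc[:, numeric_cols] = (df.loc[:, numeric_cols] - means) / stds

# After the transform the column has mean ~0 and sample std ~1.
print(df["LotArea"].mean(), df["LotArea"].std())
print(df["is_RL"].tolist())             # indicator column is unchanged
```

Note that the notebook computes the mean and std over train and test together, since the transform runs on the merged frame.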

Step 4: Build the models

Split the dataset back into training/test sets

dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
dummy_train_df.shape, dummy_test_df.shape
((1460, 303), (1459, 303))
X_train = dummy_train_df.values
X_test = dummy_test_df.values

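A bit of fancier Ensemble work

Generally speaking, the performance of a single model is quite limited. We tend to combine many models into one "combined model" to get the best result.

From the experiment we ran earlier, we know that Ridge(alpha=15) gave us the best result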

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=15)

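Bagging

Bagging puts many small models together, trains each one on a random part of the data, and then combines their final results (by majority vote for classification; for regression, as here, by averaging the predictions).

Sklearn already provides this framework, so we can call it directly: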

from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
E:\Anaconda2\soft\lib\site-packages\sklearn\ensemble\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d

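Here, we use the CV results to test how the number of models affects the final result.

Note that when setting up Bagging, you need to pass your small model (ridge) as its base_estimator argument (scikit-learn 1.2 and later call this argument estimator)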

params = [1, 10, 15, 20, 25, 30, 40]
test_scores = []
for param in params:
    clf = BaggingRegressor(n_estimators=param, base_estimator=ridge)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("n_estimator vs CV Error");

As you can see, in the previous version the best ridge result was about 0.135, whereas here, bagging 25 small ridge models gets the CV error below 0.132.
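The scoring line in the loop is worth unpacking: 'neg_mean_squared_error' returns the negated MSE per fold, so the sign is flipped before taking the square root to get an RMSE. Below is a minimal sketch on synthetic data (the data and its coefficients are assumptions, not the housing set). The base model is passed positionally so the same line works whichever keyword name the installed scikit-learn uses.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# First positional argument is the base model in both old and new releases.
clf = BaggingRegressor(Ridge(alpha=15), n_estimators=10, random_state=0)

neg_mse = cross_val_score(clf, X, y, cv=10, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)   # flip the sign, then take the root
print(neg_mse.shape)                # (10,): one score per fold
print(rmse_per_fold.mean())         # the value the notebook appends to test_scores
```

Because y_train in the notebook is already on the log1p scale, the number plotted there is an RMSE of log prices, not of prices.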

Of course, if you have not tested a ridge model beforehand, you can also use Bagging's built-in DecisionTree model:

The code is the same; just remove base_estimator

params = [10, 15, 20, 25, 30, 40, 50, 60, 70, 100]
test_scores = []
for param in params:
    clf = BaggingRegressor(n_estimators=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("n_estimator vs CV Error");

Hmm, it looks like plain DT does not work so well. The best result is only about 0.140
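To make the idea behind Bagging concrete, here is a hand-rolled sketch in plain NumPy: draw bootstrap samples, fit one small ridge model per sample, and average the predictions. This is not how sklearn implements BaggingRegressor internally; the function names, the no-intercept closed-form ridge, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge weights without intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def bagged_ridge_predict(X_train, y_train, X_new, n_estimators=25, alpha=15.0, seed=0):
    """Average the predictions of ridge models fitted on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample rows with replacement
        w = fit_ridge(X_train[idx], y_train[idx], alpha)
        preds.append(X_new @ w)
    return np.mean(preds, axis=0)          # regression: average instead of voting

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=300)

y_hat = bagged_ridge_predict(X[:250], y[:250], X[250:])
rmse = np.sqrt(np.mean((y_hat - y[250:]) ** 2))
print(rmse)   # hold-out RMSE of the averaged prediction
```

Averaging mainly reduces variance, which is one plausible reason it helps less with an already stable model like ridge than with unpruned trees.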

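Boosting

In theory Boosting is a bit more advanced than Bagging. It also gathers a bunch of models, but arranges them in sequence: each model puts a higher weight on the samples that the previous model handled badly, so that it learns that part more "deeply".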

from sklearn.ensemble import AdaBoostRegressor
params = [10, 15, 20, 25, 30, 35, 40, 45, 50]
test_scores = []
for param in params:
    # use AdaBoost here (the original line reused BaggingRegressor by mistake)
    clf = AdaBoostRegressor(n_estimators=param, base_estimator=ridge)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title("n_estimator vs CV Error");

Here Adaboost+Ridge, with 25 small models, also reaches a result close to 0.132.

Likewise, you can leave out base_estimator here and use Adaboost's built-in DT.

params = [10, 15, 20, 25, 30, 35, 40, 45, 50]
test_scores = []
for param in params:
    # AdaBoost with its default decision-tree base model
    clf = AdaBoostRegressor(n_estimators=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title("n_estimator vs CV Error");

It looks like we may need to tune our DT model first before running this experiment...
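For a feel of what the fitted booster looks like, here is a small sketch of AdaBoostRegressor with its default tree base model. The one-dimensional sine data is a synthetic assumption chosen so the example runs on its own; it says nothing about the housing results.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic 1-D regression problem, for illustration only.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# No base model given, so AdaBoostRegressor falls back to a shallow decision tree.
model = AdaBoostRegressor(n_estimators=25, random_state=0)
model.fit(X, y)

# The stages are fitted one after another; boosting may stop early,
# so there can be fewer fitted stages than n_estimators.
print(len(model.estimators_))
print(model.estimator_weights_[:3])   # weights of the first stages in the final combination
print(model.score(X, y))              # R^2 on the training data
```

Since each stage depends on the errors of the previous ones, the stages cannot be trained independently the way Bagging's models can.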

XGBoost

Finally, let's look at the awesome XGBoost, nicknamed the Kaggle secret weapon

It is still a Boosting-style model, but with many improvements.

from xgboost import XGBRegressor

Test the model with Sklearn's built-in cross validation

params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

Store all the CV values and see which max_depth value is better (that is, "parameter tuning")

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("max_depth vs CV Error");

Wow, at a depth of 5 the CV error (RMSE) drops to 0.127

That is why everyone in the restless competition scene is using XGBoost
