

Multiple Linear Regression Exercise: Predicting House Prices

Published: 2024/8/1

Goal:

Understand the description of each feature in the dataset, then use the other variables in the dataset to build the best model for predicting the median home price.

Dataset description:

The dataset contains 506 cases in total.

Each case has 14 attributes:

Feature descriptions:

MedianHomePrice: median home price (the response variable)
CRIM: per-capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
NOX: nitric oxide concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of lower-status population
MEDV: median value of owner-occupied homes (in $1000s)

Set up the libraries and data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

# Load the built-in dataset (note: load_boston was removed in
# scikit-learn 1.2, so this requires an older version)
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df2.columns = boston_data.feature_names
df = df.join(df2)
df.head()
```

```
   MedianHomePrice     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0             24.0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1             21.6  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2             34.7  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3             33.4  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4             36.2  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
```

1. Summarize each feature in the dataset

Use the corr method to compute the correlations between variables and check for multicollinearity.

```python
# Plot a correlation heatmap
import seaborn as sns
plt.subplots(figsize=(10, 10))  # adjust figure size
sns.heatmap(df.corr(), annot=True, vmax=1, square=True, cmap='RdPu')
```
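Reading strong correlations off the heatmap can also be done programmatically. A minimal sketch of the idea, using a small invented DataFrame (the column names and the 0.7 threshold are illustrative, not from the Boston data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
demo = pd.DataFrame({
    'a': a,
    'b': a * 0.9 + rng.normal(scale=0.3, size=n),  # strongly tied to 'a'
    'c': rng.normal(size=n),                        # independent noise
})

corr = demo.corr()
# Keep each unordered pair once: iterate over the upper triangle only.
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > 0.7
]
print(pairs)  # only the ('a', 'b') pair clears the threshold
```

Applied to df.corr(), the same loop would surface pairs such as TAX/RAD directly, without eyeballing the heatmap.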

2. Split the dataset

Create a training set and a test set, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, y_test.

```python
X = df.drop('MedianHomePrice', axis=1, inplace=False)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
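Under the hood, train_test_split is essentially a seeded shuffle followed by a slice. A numpy-only sketch of the same idea (the arrays here are random stand-ins with the Boston shapes, 506 rows and 13 features):

```python
import numpy as np

rng = np.random.default_rng(42)
X_all = rng.normal(size=(506, 13))  # stand-in feature matrix
y_all = rng.normal(size=506)        # stand-in response

idx = rng.permutation(len(X_all))   # shuffle the row indices
cut = int(len(X_all) * 0.8)         # 80% train / 20% test boundary
train_idx, test_idx = idx[:cut], idx[cut:]

X_tr, X_te = X_all[train_idx], X_all[test_idx]
y_tr, y_te = y_all[train_idx], y_all[test_idx]
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)
```

The 404/102 split matches the "No. Observations: 404" reported by the OLS summaries later in this exercise.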

3. Standardization

Use StandardScaler to scale all the x variables in the dataset. Store the result in X_scaled_train.

```python
# Reset y_train's index to start from 0; otherwise its index will not line up
# with training_data below and the join will misalign rows.
y_train = pd.Series(y_train.values)

# Use StandardScaler to scale all the x variables and store the result in
# X_scaled_train (note that this name actually holds the scaler object).
X_scaled_train = StandardScaler()

# Create a pandas DataFrame holding the scaled x variables plus y_train,
# named training_data.
training_data = X_scaled_train.fit_transform(X_train)
training_data = pd.DataFrame(training_data, columns=X_train.columns)
training_data['MedianHomePrice'] = y_train
training_data.head()
```

```
       CRIM        ZN     INDUS      CHAS       NOX        RM       AGE       DIS       RAD       TAX   PTRATIO         B     LSTAT  MedianHomePrice
0  1.287702 -0.500320  1.033237 -0.278089  0.489252 -1.428069  1.028015 -0.802173  1.706891  1.578434  0.845343 -0.074337  1.753505             12.0
1 -0.336384 -0.500320 -0.413160 -0.278089 -0.157233 -0.680087 -0.431199  0.324349 -0.624360 -0.584648  1.204741  0.430184 -0.561474             19.9
2 -0.403253  1.013271 -0.715218 -0.278089 -1.008723 -0.402063 -1.618599  1.330697 -0.974048 -0.602724 -0.637176  0.065297 -0.651595             19.4
3  0.388230 -0.500320  1.033237 -0.278089  0.489252 -0.300450  0.591681 -0.839240  1.706891  1.578434  0.845343 -3.868193  1.525387             13.4
4 -0.325282 -0.500320 -0.413160 -0.278089 -0.157233 -0.831094  0.033747 -0.005494 -0.624360 -0.584648  1.204741  0.379119 -0.165787             18.2
```
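One detail worth emphasizing: the scaler should be fit on the training split only and then reused to transform the test split, so that no test-set statistics leak into training. A minimal sketch with synthetic data (the shapes and distribution parameters are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(400, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test_demo)        # reuse them on test

# Training columns are now exactly mean 0 / std 1; test columns only
# approximately so, because they were scaled with the training statistics.
print(X_train_scaled.mean(axis=0).round(6))
```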

4. Model 1: all features

Fit a linear model on the training set training_data and use the p-values to judge significance.

```python
# Fit a linear model using all the scaled features to predict the response
# (median home price). Don't forget to add an intercept.
training_data['intercept'] = 1
X_train1 = training_data.drop('MedianHomePrice', axis=1, inplace=False)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()
```

```
OLS Regression Results
======================================================================
Dep. Variable:    MedianHomePrice   R-squared:           0.751
Model:            OLS               Adj. R-squared:      0.743
Method:           Least Squares     F-statistic:         90.43
Date:             Sun, 10 May 2020  Prob (F-statistic):  6.21e-109
Time:             20:22:27          Log-Likelihood:      -1194.3
No. Observations: 404               AIC:                 2417.
Df Residuals:     390               BIC:                 2473.
Df Model:         13
Covariance Type:  nonrobust
----------------------------------------------------------------------
              coef   std err        t    P>|t|   [0.025   0.975]
CRIM       -1.0021     0.308   -3.250    0.001   -1.608   -0.396
ZN          0.6963     0.370    1.882    0.061   -0.031    1.423
INDUS       0.2781     0.464    0.599    0.549   -0.634    1.190
CHAS        0.7187     0.247    2.914    0.004    0.234    1.204
NOX        -2.0223     0.498   -4.061    0.000   -3.001   -1.043
RM          3.1452     0.329    9.567    0.000    2.499    3.792
AGE        -0.1760     0.407   -0.432    0.666   -0.977    0.625
DIS        -3.0819     0.481   -6.408    0.000   -4.027   -2.136
RAD         2.2514     0.652    3.454    0.001    0.970    3.533
TAX        -1.7670     0.704   -2.508    0.013   -3.152   -0.382
PTRATIO    -2.0378     0.321   -6.357    0.000   -2.668   -1.408
B           1.1296     0.271    4.166    0.000    0.596    1.663
LSTAT      -3.6117     0.395   -9.133    0.000   -4.389   -2.834
intercept  22.7965     0.236   96.774    0.000   22.333   23.260
----------------------------------------------------------------------
Omnibus:        133.052   Durbin-Watson:     2.114
Prob(Omnibus):  0.000     Jarque-Bera (JB):  579.817
Skew:           1.379     Prob(JB):          1.24e-126
Kurtosis:       8.181     Cond. No.          9.74
```


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
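The coefficients in the summary above come from ordinary least squares, i.e. solving the normal equations beta = (X'X)^-1 X'y. A numpy sketch on synthetic data (the "true" coefficient values are invented; they are not the Boston estimates):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Design matrix: an intercept column plus two standardized features.
X_mat = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([22.8, 3.1, -2.0])  # invented coefficients
y_vec = X_mat @ true_beta + rng.normal(scale=0.5, size=n)

# lstsq solves the least-squares problem in a numerically stable way,
# which is what sm.OLS(...).fit() does internally (plus inference).
beta_hat, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)
print(beta_hat.round(2))  # recovers roughly [22.8, 3.1, -2.0]
```

Because the features in training_data are standardized, the fitted coefficients are directly comparable in magnitude, which is why RM (+3.15) and LSTAT (-3.61) stand out as the strongest effects in the summary.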

5. Check whether the explanatory variables are correlated

Compute the VIFs on the training set.

```python
# Compute the VIF for every x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - the dataframe holding both x and y
    response - the column name of the response variable (string)
    OUTPUT:
    vif - a dataframe of the vifs
    '''
    df2 = df.drop(response, axis=1, inplace=False)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif

vif = vif_calculator(training_data, 'MedianHomePrice')
vif
```

```
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - self.ssr/self.centered_tss

    VIF Factor   features
0          0.0   Intercept
1          1.7   CRIM
2          2.5   ZN
3          3.9   INDUS
4          1.1   CHAS
5          4.5   NOX
6          1.9   RM
7          3.0   AGE
8          4.2   DIS
9          7.7   RAD
10         8.9   TAX
11         1.9   PTRATIO
12         1.3   B
13         2.8   LSTAT
14         0.0   intercept
```

(The table contains two constant columns, patsy's automatic Intercept plus the manually added intercept; their perfect collinearity is what triggers the divide-by-zero warning, and their VIF rows can be ignored.)
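The VIF for feature j is 1/(1 - R²_j), where R²_j comes from regressing x_j on all the other features. A numpy-only sketch of this definition on synthetic data (the variables and noise scales are invented for illustration):

```python
import numpy as np

def vif_numpy(X):
    """VIF for each column of X; X should not include an intercept column."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])  # add an intercept for the auxiliary fits
    vifs = []
    for j in range(k):
        y_j = X[:, j]
        others = np.delete(Xc, j + 1, axis=1)  # drop column j, keep the intercept
        beta, *_ = np.linalg.lstsq(others, y_j, rcond=None)
        resid = y_j - others @ beta
        r2 = 1 - resid.var() / y_j.var()       # R^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.5, size=300)  # collinear with a
c = rng.normal(size=300)                 # independent
demo_X = np.column_stack([a, b, c])
print(vif_numpy(demo_X).round(1))  # a and b are inflated, c stays near 1
```

This is the same quantity variance_inflation_factor computes, and it makes the "limit VIFs to 4" rule concrete: VIF = 4 means the other features explain 75% of that feature's variance.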

Combine the VIFs, correlations, and p-values to decide which variables to remove:

Limit VIFs to at most 4. INDUS, RAD, TAX, and NOX have large VIFs.

TAX and RAD are strongly correlated, as are INDUS and NOX, so removing just one variable from each highly correlated pair is enough to shrink the VIF of the other.

Limit p-values to at most 0.05. AGE and INDUS have large p-values.

Based on the p-values and the VIFs, if we choose to keep RAD and INDUS, then we remove AGE, NOX, and TAX. After dropping these features, fit a new linear model with the remaining ones.

6. Model 2: drop AGE, NOX, and TAX

```python
X_train1 = training_data.drop(['AGE', 'NOX', 'TAX', 'MedianHomePrice'], axis=1, inplace=False)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()
```

```
OLS Regression Results
======================================================================
Dep. Variable:    MedianHomePrice   R-squared:           0.733
Model:            OLS               Adj. R-squared:      0.727
Method:           Least Squares     F-statistic:         108.1
Date:             Sun, 10 May 2020  Prob (F-statistic):  2.77e-106
Time:             21:02:41          Log-Likelihood:      -1208.0
No. Observations: 404               AIC:                 2438.
Df Residuals:     393               BIC:                 2482.
Df Model:         10
Covariance Type:  nonrobust
----------------------------------------------------------------------
              coef   std err        t    P>|t|   [0.025   0.975]
CRIM       -0.9116     0.317   -2.876    0.004   -1.535   -0.289
ZN          0.5622     0.363    1.548    0.123   -0.152    1.276
INDUS      -0.8746     0.411   -2.128    0.034   -1.683   -0.067
CHAS        0.6896     0.252    2.738    0.006    0.194    1.185
RM          3.2406     0.330    9.818    0.000    2.592    3.889
DIS        -2.1728     0.434   -5.010    0.000   -3.025   -1.320
RAD         0.4380     0.389    1.126    0.261   -0.327    1.202
PTRATIO    -1.6369     0.310   -5.288    0.000   -2.246   -1.028
B           1.2106     0.279    4.345    0.000    0.663    1.758
LSTAT      -3.9851     0.381  -10.470    0.000   -4.733   -3.237
intercept  22.7965     0.243   93.916    0.000   22.319   23.274
----------------------------------------------------------------------
Omnibus:        126.568   Durbin-Watson:     2.033
Prob(Omnibus):  0.000     Jarque-Bera (JB):  542.197
Skew:           1.310     Prob(JB):          1.83e-118
Kurtosis:       8.034     Cond. No.          4.66
```


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on the p-values, RAD should also be dropped (p = 0.261) while the other variables are kept.

7. Model 3: drop AGE, NOX, TAX, and RAD

```python
X_train2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD', 'MedianHomePrice'], axis=1, inplace=False)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()
```

```
OLS Regression Results
======================================================================
Dep. Variable:    MedianHomePrice   R-squared:           0.733
Model:            OLS               Adj. R-squared:      0.726
Method:           Least Squares     F-statistic:         119.9
Date:             Sun, 10 May 2020  Prob (F-statistic):  4.60e-107
Time:             21:02:09          Log-Likelihood:      -1208.6
No. Observations: 404               AIC:                 2437.
Df Residuals:     394               BIC:                 2477.
Df Model:         9
Covariance Type:  nonrobust
----------------------------------------------------------------------
              coef   std err        t    P>|t|   [0.025   0.975]
CRIM       -0.7616     0.288   -2.647    0.008   -1.327   -0.196
ZN          0.6151     0.360    1.707    0.089   -0.093    1.323
INDUS      -0.7544     0.397   -1.900    0.058   -1.535    0.026
CHAS        0.7067     0.252    2.810    0.005    0.212    1.201
RM          3.3022     0.326   10.142    0.000    2.662    3.942
DIS        -2.2235     0.432   -5.153    0.000   -3.072   -1.375
PTRATIO    -1.5090     0.288   -5.239    0.000   -2.075   -0.943
B           1.1502     0.273    4.206    0.000    0.613    1.688
LSTAT      -3.9413     0.379  -10.406    0.000   -4.686   -3.197
intercept  22.7965     0.243   93.884    0.000   22.319   23.274
----------------------------------------------------------------------
Omnibus:        134.948   Durbin-Watson:     2.028
Prob(Omnibus):  0.000     Jarque-Bera (JB):  619.161
Skew:           1.381     Prob(JB):          3.56e-135
Kurtosis:       8.399     Cond. No.          4.36
```


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Double-check that all the VIFs are now below 4. Compared with the previous model, the R-squared value is unchanged (0.733).

```python
training_data2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif
```

```
    VIF Factor   features
0          0.0   Intercept
1          1.4   CRIM
2          2.2   ZN
3          2.7   INDUS
4          1.1   CHAS
5          1.8   RM
6          3.2   DIS
7          1.4   PTRATIO
8          1.3   B
9          2.4   LSTAT
10         0.0   intercept
```

8. Model evaluation

Score how well each model's test predictions match the actual test values.

```python
# Model with all variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # score
```

```
0.66848257539715972
```

```python
# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)
X_test_red = X_test.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)

# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
X_test_red2 = X_test.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)

lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # score

lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # score
```

```
0.639421781821
0.63441065636
```
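The .score method of LinearRegression returns R² on the given data, i.e. 1 - SS_res/SS_tot. A sketch verifying this by hand on synthetic data (the coefficients and split sizes are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Train on the first 160 rows, evaluate on the last 40.
model = LinearRegression().fit(X_demo[:160], y_demo[:160])
pred = model.predict(X_demo[160:])
y_eval = y_demo[160:]

ss_res = np.sum((y_eval - pred) ** 2)               # residual sum of squares
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)      # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 6), round(model.score(X_demo[160:], y_eval), 6))
```

The two printed numbers agree, which is why the 0.668 / 0.639 / 0.634 scores above can be read directly as test-set R² values.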

The scores show that, on this test set, the model with all variables performs best. As a next step, cross-validation (repeating this process over multiple train/test splits) could be used to check whether this result is stable.
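The cross-validation idea can be sketched with scikit-learn's cross_val_score; here on synthetic data, with cv=5 as an arbitrary choice (the arrays and fold count are illustrative, not part of the exercise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(300, 5))
y_cv = X_cv @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

# Each fold holds out 1/5 of the rows; the score is R-squared on that fold.
scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5)
print(scores.round(3), scores.mean().round(3))
```

The spread of the five fold scores indicates how sensitive a model's ranking is to the particular train/test split, which is exactly the stability question raised above.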
