

Multiple Linear Regression Exercise: Predicting House Prices

Published: 2024/8/1

Goal:

Understand the description of each feature in the dataset, then use the other variables in the dataset to build the best model for predicting the median home price.

Dataset description:

The dataset contains 506 cases in total.

Each case has 14 attributes:

Feature descriptions:

MedianHomePrice: median home price (the response variable)
CRIM: per-capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
NOX: nitric oxide concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of lower-status population
MEDV: median value of owner-occupied homes (in $1000s)

Set up the libraries and data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

# Load the built-in dataset (note: load_boston was removed in
# scikit-learn 1.2, so this requires an older version)
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df2.columns = boston_data.feature_names
df = df.join(df2)
df.head()
```

```
   MedianHomePrice     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0             24.0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1             21.6  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2             34.7  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3             33.4  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4             36.2  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
```

1. Summarize each feature in the dataset

Use the corr method to compute the correlations between variables and check for multicollinearity.

```python
# Plot a correlation heatmap
import seaborn as sns
plt.subplots(figsize=(10, 10))  # adjust figure size
sns.heatmap(df.corr(), annot=True, vmax=1, square=True, cmap='RdPu')
```
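Reading strong correlations off the heatmap can also be done programmatically. A minimal sketch of the idea, using a small invented DataFrame (the column names and the 0.7 threshold are illustrative, not from the Boston data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
demo = pd.DataFrame({
    'a': a,
    'b': a * 0.9 + rng.normal(scale=0.3, size=n),  # strongly tied to 'a'
    'c': rng.normal(size=n),                        # independent noise
})

corr = demo.corr()
# Keep each unordered pair once: iterate over the upper triangle only.
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > 0.7
]
print(pairs)  # only the ('a', 'b') pair clears the threshold
```

Applied to df.corr(), the same loop would surface pairs such as TAX/RAD directly, without eyeballing the heatmap.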

2. Split the dataset

Create a training set and a test set, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, y_test.

```python
X = df.drop('MedianHomePrice', axis=1, inplace=False)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
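Under the hood, train_test_split is essentially a seeded shuffle followed by a slice. A numpy-only sketch of the same idea (the arrays here are random stand-ins with the Boston shapes, 506 rows and 13 features):

```python
import numpy as np

rng = np.random.default_rng(42)
X_all = rng.normal(size=(506, 13))  # stand-in feature matrix
y_all = rng.normal(size=506)        # stand-in response

idx = rng.permutation(len(X_all))   # shuffle the row indices
cut = int(len(X_all) * 0.8)         # 80% train / 20% test boundary
train_idx, test_idx = idx[:cut], idx[cut:]

X_tr, X_te = X_all[train_idx], X_all[test_idx]
y_tr, y_te = y_all[train_idx], y_all[test_idx]
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)
```

The 404/102 split matches the "No. Observations: 404" reported by the OLS summaries later in this exercise.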

3. Standardization

Use StandardScaler to scale all the x variables in the dataset. Store the result in X_scaled_train.

```python
# Reset y_train's index to start from 0; otherwise its index will not line up
# with training_data below and the join will misalign rows.
y_train = pd.Series(y_train.values)

# Use StandardScaler to scale all the x variables and store the result in
# X_scaled_train (note that this name actually holds the scaler object).
X_scaled_train = StandardScaler()

# Create a pandas DataFrame holding the scaled x variables plus y_train,
# named training_data.
training_data = X_scaled_train.fit_transform(X_train)
training_data = pd.DataFrame(training_data, columns=X_train.columns)
training_data['MedianHomePrice'] = y_train
training_data.head()
```

```
       CRIM        ZN     INDUS      CHAS       NOX        RM       AGE       DIS       RAD       TAX   PTRATIO         B     LSTAT  MedianHomePrice
0  1.287702 -0.500320  1.033237 -0.278089  0.489252 -1.428069  1.028015 -0.802173  1.706891  1.578434  0.845343 -0.074337  1.753505             12.0
1 -0.336384 -0.500320 -0.413160 -0.278089 -0.157233 -0.680087 -0.431199  0.324349 -0.624360 -0.584648  1.204741  0.430184 -0.561474             19.9
2 -0.403253  1.013271 -0.715218 -0.278089 -1.008723 -0.402063 -1.618599  1.330697 -0.974048 -0.602724 -0.637176  0.065297 -0.651595             19.4
3  0.388230 -0.500320  1.033237 -0.278089  0.489252 -0.300450  0.591681 -0.839240  1.706891  1.578434  0.845343 -3.868193  1.525387             13.4
4 -0.325282 -0.500320 -0.413160 -0.278089 -0.157233 -0.831094  0.033747 -0.005494 -0.624360 -0.584648  1.204741  0.379119 -0.165787             18.2
```
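One detail worth emphasizing: the scaler should be fit on the training split only and then reused to transform the test split, so that no test-set statistics leak into training. A minimal sketch with synthetic data (the shapes and distribution parameters are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(400, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test_demo)        # reuse them on test

# Training columns are now exactly mean 0 / std 1; test columns only
# approximately so, because they were scaled with the training statistics.
print(X_train_scaled.mean(axis=0).round(6))
```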

4. Model 1: all features

Fit a linear model on the training set training_data and use the p-values to judge significance.

```python
# Fit a linear model using all the scaled features to predict the response
# (median home price). Don't forget to add an intercept.
training_data['intercept'] = 1
X_train1 = training_data.drop('MedianHomePrice', axis=1, inplace=False)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()
```

```
OLS Regression Results
======================================================================
Dep. Variable:    MedianHomePrice   R-squared:           0.751
Model:            OLS               Adj. R-squared:      0.743
Method:           Least Squares     F-statistic:         90.43
Date:             Sun, 10 May 2020  Prob (F-statistic):  6.21e-109
Time:             20:22:27          Log-Likelihood:      -1194.3
No. Observations: 404               AIC:                 2417.
Df Residuals:     390               BIC:                 2473.
Df Model:         13
Covariance Type:  nonrobust
----------------------------------------------------------------------
              coef   std err        t    P>|t|   [0.025   0.975]
CRIM       -1.0021     0.308   -3.250    0.001   -1.608   -0.396
ZN          0.6963     0.370    1.882    0.061   -0.031    1.423
INDUS       0.2781     0.464    0.599    0.549   -0.634    1.190
CHAS        0.7187     0.247    2.914    0.004    0.234    1.204
NOX        -2.0223     0.498   -4.061    0.000   -3.001   -1.043
RM          3.1452     0.329    9.567    0.000    2.499    3.792
AGE        -0.1760     0.407   -0.432    0.666   -0.977    0.625
DIS        -3.0819     0.481   -6.408    0.000   -4.027   -2.136
RAD         2.2514     0.652    3.454    0.001    0.970    3.533
TAX        -1.7670     0.704   -2.508    0.013   -3.152   -0.382
PTRATIO    -2.0378     0.321   -6.357    0.000   -2.668   -1.408
B           1.1296     0.271    4.166    0.000    0.596    1.663
LSTAT      -3.6117     0.395   -9.133    0.000   -4.389   -2.834
intercept  22.7965     0.236   96.774    0.000   22.333   23.260
----------------------------------------------------------------------
Omnibus:        133.052   Durbin-Watson:     2.114
Prob(Omnibus):  0.000     Jarque-Bera (JB):  579.817
Skew:           1.379     Prob(JB):          1.24e-126
Kurtosis:       8.181     Cond. No.          9.74
```


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
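The coefficients in the summary above come from ordinary least squares, i.e. solving the normal equations beta = (X'X)^-1 X'y. A numpy sketch on synthetic data (the "true" coefficient values are invented; they are not the Boston estimates):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Design matrix: an intercept column plus two standardized features.
X_mat = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([22.8, 3.1, -2.0])  # invented coefficients
y_vec = X_mat @ true_beta + rng.normal(scale=0.5, size=n)

# lstsq solves the least-squares problem in a numerically stable way,
# which is what sm.OLS(...).fit() does internally (plus inference).
beta_hat, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)
print(beta_hat.round(2))  # recovers roughly [22.8, 3.1, -2.0]
```

Because the features in training_data are standardized, the fitted coefficients are directly comparable in magnitude, which is why RM (+3.15) and LSTAT (-3.61) stand out as the strongest effects in the summary.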

5. Check whether the explanatory variables are correlated

Compute the VIFs on the training set.

```python
# Compute the VIF for every x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - the dataframe holding both x and y
    response - the column name of the response variable (string)
    OUTPUT:
    vif - a dataframe of the vifs
    '''
    df2 = df.drop(response, axis=1, inplace=False)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif

vif = vif_calculator(training_data, 'MedianHomePrice')
vif
```

```
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - self.ssr/self.centered_tss

    VIF Factor   features
0          0.0   Intercept
1          1.7   CRIM
2          2.5   ZN
3          3.9   INDUS
4          1.1   CHAS
5          4.5   NOX
6          1.9   RM
7          3.0   AGE
8          4.2   DIS
9          7.7   RAD
10         8.9   TAX
11         1.9   PTRATIO
12         1.3   B
13         2.8   LSTAT
14         0.0   intercept
```

(The table contains two constant columns, patsy's automatic Intercept plus the manually added intercept; their perfect collinearity is what triggers the divide-by-zero warning, and their VIF rows can be ignored.)
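The VIF for feature j is 1/(1 - R²_j), where R²_j comes from regressing x_j on all the other features. A numpy-only sketch of this definition on synthetic data (the variables and noise scales are invented for illustration):

```python
import numpy as np

def vif_numpy(X):
    """VIF for each column of X; X should not include an intercept column."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])  # add an intercept for the auxiliary fits
    vifs = []
    for j in range(k):
        y_j = X[:, j]
        others = np.delete(Xc, j + 1, axis=1)  # drop column j, keep the intercept
        beta, *_ = np.linalg.lstsq(others, y_j, rcond=None)
        resid = y_j - others @ beta
        r2 = 1 - resid.var() / y_j.var()       # R^2 of the auxiliary regression
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.5, size=300)  # collinear with a
c = rng.normal(size=300)                 # independent
demo_X = np.column_stack([a, b, c])
print(vif_numpy(demo_X).round(1))  # a and b are inflated, c stays near 1
```

This is the same quantity variance_inflation_factor computes, and it makes the "limit VIFs to 4" rule concrete: VIF = 4 means the other features explain 75% of that feature's variance.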

Combine the VIFs, correlations, and p-values to decide which variables to remove:

Limit VIFs to at most 4. INDUS, RAD, TAX, and NOX have large VIFs.

TAX and RAD are strongly correlated, as are INDUS and NOX, so removing just one variable from each highly correlated pair is enough to shrink the VIF of the other.

Limit p-values to at most 0.05. AGE and INDUS have large p-values.

Based on the p-values and the VIFs, if we choose to keep RAD and INDUS, then we remove AGE, NOX, and TAX. After dropping these features, fit a new linear model with the remaining ones.

6. Model 2: drop AGE, NOX, and TAX

```python
X_train1 = training_data.drop(['AGE', 'NOX', 'TAX', 'MedianHomePrice'], axis=1, inplace=False)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()
```

```
OLS Regression Results
======================================================================
Dep. Variable:    MedianHomePrice   R-squared:           0.733
Model:            OLS               Adj. R-squared:      0.727
Method:           Least Squares     F-statistic:         108.1
Date:             Sun, 10 May 2020  Prob (F-statistic):  2.77e-106
Time:             21:02:41          Log-Likelihood:      -1208.0
No. Observations: 404               AIC:                 2438.
Df Residuals:     393               BIC:                 2482.
Df Model:         10
Covariance Type:  nonrobust
----------------------------------------------------------------------
              coef   std err        t    P>|t|   [0.025   0.975]
CRIM       -0.9116     0.317   -2.876    0.004   -1.535   -0.289
ZN          0.5622     0.363    1.548    0.123   -0.152    1.276
INDUS      -0.8746     0.411   -2.128    0.034   -1.683   -0.067
CHAS        0.6896     0.252    2.738    0.006    0.194    1.185
RM          3.2406     0.330    9.818    0.000    2.592    3.889
DIS        -2.1728     0.434   -5.010    0.000   -3.025   -1.320
RAD         0.4380     0.389    1.126    0.261   -0.327    1.202
PTRATIO    -1.6369     0.310   -5.288    0.000   -2.246   -1.028
B           1.2106     0.279    4.345    0.000    0.663    1.758
LSTAT      -3.9851     0.381  -10.470    0.000   -4.733   -3.237
intercept  22.7965     0.243   93.916    0.000   22.319   23.274
----------------------------------------------------------------------
Omnibus:        126.568   Durbin-Watson:     2.033
Prob(Omnibus):  0.000     Jarque-Bera (JB):  542.197
Skew:           1.310     Prob(JB):          1.83e-118
Kurtosis:       8.034     Cond. No.          4.66
```


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on the p-values, RAD should also be dropped (p = 0.261) while the other variables are kept.

7. Model 3: drop AGE, NOX, TAX, and RAD

```python
X_train2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD', 'MedianHomePrice'], axis=1, inplace=False)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()
```

```
OLS Regression Results
======================================================================
Dep. Variable:    MedianHomePrice   R-squared:           0.733
Model:            OLS               Adj. R-squared:      0.726
Method:           Least Squares     F-statistic:         119.9
Date:             Sun, 10 May 2020  Prob (F-statistic):  4.60e-107
Time:             21:02:09          Log-Likelihood:      -1208.6
No. Observations: 404               AIC:                 2437.
Df Residuals:     394               BIC:                 2477.
Df Model:         9
Covariance Type:  nonrobust
----------------------------------------------------------------------
              coef   std err        t    P>|t|   [0.025   0.975]
CRIM       -0.7616     0.288   -2.647    0.008   -1.327   -0.196
ZN          0.6151     0.360    1.707    0.089   -0.093    1.323
INDUS      -0.7544     0.397   -1.900    0.058   -1.535    0.026
CHAS        0.7067     0.252    2.810    0.005    0.212    1.201
RM          3.3022     0.326   10.142    0.000    2.662    3.942
DIS        -2.2235     0.432   -5.153    0.000   -3.072   -1.375
PTRATIO    -1.5090     0.288   -5.239    0.000   -2.075   -0.943
B           1.1502     0.273    4.206    0.000    0.613    1.688
LSTAT      -3.9413     0.379  -10.406    0.000   -4.686   -3.197
intercept  22.7965     0.243   93.884    0.000   22.319   23.274
----------------------------------------------------------------------
Omnibus:        134.948   Durbin-Watson:     2.028
Prob(Omnibus):  0.000     Jarque-Bera (JB):  619.161
Skew:           1.381     Prob(JB):          3.56e-135
Kurtosis:       8.399     Cond. No.          4.36
```


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Double-check that all the VIFs are now below 4. Compared with the previous model, the R-squared value is unchanged (0.733).

```python
training_data2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif
```

```
    VIF Factor   features
0          0.0   Intercept
1          1.4   CRIM
2          2.2   ZN
3          2.7   INDUS
4          1.1   CHAS
5          1.8   RM
6          3.2   DIS
7          1.4   PTRATIO
8          1.3   B
9          2.4   LSTAT
10         0.0   intercept
```

8. Model evaluation

Score how well each model's test predictions match the actual test values.

```python
# Model with all variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # score
```

```
0.66848257539715972
```

```python
# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)
X_test_red = X_test.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)

# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
X_test_red2 = X_test.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)

lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # score

lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # score
```

```
0.639421781821
0.63441065636
```
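The .score method of LinearRegression returns R² on the given data, i.e. 1 - SS_res/SS_tot. A sketch verifying this by hand on synthetic data (the coefficients and split sizes are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Train on the first 160 rows, evaluate on the last 40.
model = LinearRegression().fit(X_demo[:160], y_demo[:160])
pred = model.predict(X_demo[160:])
y_eval = y_demo[160:]

ss_res = np.sum((y_eval - pred) ** 2)               # residual sum of squares
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)      # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 6), round(model.score(X_demo[160:], y_eval), 6))
```

The two printed numbers agree, which is why the 0.668 / 0.639 / 0.634 scores above can be read directly as test-set R² values.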

The scores show that, on this test set, the model with all variables performs best. As a next step, cross-validation (repeating this process over multiple train/test splits) could be used to check whether this result is stable.
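The cross-validation idea can be sketched with scikit-learn's cross_val_score; here on synthetic data, with cv=5 as an arbitrary choice (the arrays and fold count are illustrative, not part of the exercise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(300, 5))
y_cv = X_cv @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

# Each fold holds out 1/5 of the rows; the score is R-squared on that fold.
scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5)
print(scores.round(3), scores.mean().round(3))
```

The spread of the five fold scores indicates how sensitive a model's ranking is to the particular train/test split, which is exactly the stability question raised above.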
