Machine Learning: A Linear Regression Experiment on California Housing Prices

Published: 2023/12/8

1. How the closed-form solution for linear regression works

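Define $X$ as an $m \times (n+1)$ matrix, $y$ as an $m \times 1$ vector, and $\theta$ as an $(n+1) \times 1$ vector. The hypothesis defined earlier can then be written as $h(x) = X\theta$, and the cost function as $J(\theta) = \frac{1}{2}(X\theta - y)^T (X\theta - y)$. $J(\theta)$ is a convex function, so to minimize it we only need to differentiate it and set the derivative to zero. Differentiating gives $X^T X\theta - X^T y$; setting this to zero yields $\theta = (X^T X)^{-1} X^T y$, provided $X^T X$ is invertible.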
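As a quick check of the closed-form result, here is a minimal sketch on synthetic data. The data, the seed, and the names `true_theta`, `theta_closed`, and `theta_lstsq` are assumptions made for illustration and are not part of the original experiment. It solves the linear system $(X^T X)\theta = X^T y$ rather than forming the inverse explicitly, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: m = 200 samples, n = 3 features.
m, n = 200, 3
features = rng.normal(size=(m, n))
X = np.hstack([np.ones((m, 1)), features])            # m x (n+1), first column is the bias
true_theta = np.array([[2.0], [-1.0], [0.5], [3.0]])  # (n+1) x 1
y = X @ true_theta + 0.01 * rng.normal(size=(m, 1))   # m x 1, with a little noise

# Normal equation: solve (X^T X) theta = X^T y without forming the inverse.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which also copes with a rank-deficient X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The derivative X^T X theta - X^T y should vanish at the solution.
grad_at_solution = X.T @ X @ theta_closed - X.T @ y

print(theta_closed.ravel())
print(np.abs(grad_at_solution).max())
```

Solving the system is usually numerically safer than calling `np.linalg.inv`, and `np.linalg.lstsq` remains usable when columns of $X$ are linearly dependent.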

2. How gradient descent for linear regression works

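We have constructed the hypothesis $h_\theta(x)$ and obtained the loss function $J(\theta)$, and we want the $\theta$ that minimizes $J(\theta)$. The idea still relies on partial derivatives, but instead of solving directly for the point where the derivative is zero, we repeatedly move $\theta$ against the gradient until the gradient is approximately zero. Differentiating $J(\theta)$ with respect to $\theta_j$ for a single sample gives $(h_\theta(x) - y)\,x_j$, which leads to the update rule $\theta_j := \theta_j - \eta\,(h_\theta(x) - y)\,x_j$, where $\eta$ is the learning rate.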
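For reference, the partial derivative can be worked out with the chain rule. The sketch below uses the same cost as section 1, written as a sum over samples; the sample index $(i)$ and the learning-rate symbol $\eta$ are notational assumptions (the program listing calls the learning rate `eta`).

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2,
\qquad h_\theta(x) = \sum_{k=0}^{n}\theta_k x_k, \quad x_0 = 1 \\
\frac{\partial J}{\partial \theta_j}
  &= \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,
     \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
   = \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} \\
\nabla_\theta J &= X^T(X\theta - y) = X^T X\theta - X^T y
\end{aligned}
```

The last line stacks the $n+1$ partial derivatives into a vector and matches the derivative used for the closed-form solution in section 1.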

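Because the dataset is fairly large, stochastic gradient descent was adopted, although its accuracy is lower than that of batch gradient descent. In my program the data is stored in matrix form, so the update can be written as $\theta := \theta - \eta \cdot \frac{1}{m} X^T (X\theta - Y)$. This is the form implemented by the gradient function in the program listing; note that it uses all $m$ samples in every step.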

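Here $\theta$ is $(n+1) \times 1$, $X$ is $m \times (n+1)$, and $Y$ is $m \times 1$. In practice, the condition that the gradient is zero is approximated by stopping once the change in the loss between two consecutive iterations falls below 1e-18. This indicates a local minimum of the loss function. In general a local minimum need not be the global minimum, although for the convex least-squares loss used here the two coincide.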
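The matrix-form update and the stopping rule can be sketched as follows. This is an illustration on synthetic data, not the author's program: the function name `batch_gradient_descent`, the looser tolerance `tol=1e-12`, the learning rate `eta=0.1`, and the `max_iter` safety cap are all assumptions chosen so that the example finishes quickly.

```python
import numpy as np

def cost(X, theta, Y):
    """Half mean squared error, as in the program listing."""
    residual = X @ theta - Y
    return float(np.sum(residual ** 2) / (2 * len(Y)))

def batch_gradient_descent(X, Y, eta=0.1, tol=1e-12, max_iter=100_000):
    """Update theta with the full-data gradient until the loss stops changing."""
    theta = np.zeros((X.shape[1], 1))
    previous = cost(X, theta, Y)
    for iteration in range(1, max_iter + 1):
        grad = X.T @ (X @ theta - Y) / len(Y)   # (n+1) x 1
        theta = theta - eta * grad
        current = cost(X, theta, Y)
        if abs(previous - current) < tol:       # loss difference as a proxy for zero gradient
            break
        previous = current
    return theta, iteration

# Hypothetical noise-free data so the exact minimizer is known.
rng = np.random.default_rng(1)
m = 500
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
true_theta = np.array([[1.0], [2.0], [-3.0]])
Y = X @ true_theta

theta_gd, n_iter = batch_gradient_descent(X, Y)
print(theta_gd.ravel(), n_iter)
```

A cap on the number of iterations is a common safeguard, since a very small tolerance such as 1e-18 can otherwise keep the loop running for a long time.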

3. Related files

4. Program listing

Packages used:

import os
import time

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

(1) Read the data:

# Read the data
HOUSE_PATH = './'

def load_housing_data(housing_path=HOUSE_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)

housing = load_housing_data()

(2) Data processing:

# Fill missing values with the column median
median_bedrooms = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median_bedrooms)

# One-hot encode the categorical column
housing_category = housing[["ocean_proximity"]]
cat_encoder = OneHotEncoder()
housing_category_onehot = cat_encoder.fit_transform(housing_category)
housing = housing.drop("ocean_proximity", axis=1)
housing_values = np.c_[housing.values, housing_category_onehot.toarray()]
housing_fixed = pd.DataFrame(
    housing_values,
    columns=list(housing.columns) + ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
    index=housing.index
)
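The hard-coded column names above depend on the order in which the encoder emits its columns. The following minimal sketch, which uses a hypothetical six-row stand-in for the ocean_proximity column, shows that the order can be read from the encoder's `categories_` attribute instead of being typed by hand.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature stand-in for the ocean_proximity column (one column, six rows).
categories = np.array([
    ["NEAR BAY"], ["INLAND"], ["<1H OCEAN"],
    ["INLAND"], ["ISLAND"], ["NEAR OCEAN"],
])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(categories).toarray()  # sparse result converted to dense

# Columns follow the sorted category labels, exposed as categories_.
column_names = list(encoder.categories_[0])

print(column_names)
print(onehot)
```

Each row contains exactly one 1, so the five one-hot columns always sum to 1. Together with a bias column this makes the design matrix linearly dependent, which is worth keeping in mind when inverting $X^T X$.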

(3) Analyze feature correlations

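# Analyze feature correlations
corr_matrix = housing_fixed.corr()  # pairwise correlation coefficients between features
# Correlation of every feature with the median house value of the block
correlation = corr_matrix["median_house_value"].sort_values(ascending=False)
print(correlation)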

(4) Split the dataset

# Split the dataset into training and test sets
train_set, test_set = train_test_split(housing_fixed, test_size=0.3, random_state=42)
feature_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                   'total_bedrooms', 'population', 'households', 'median_income',
                   '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
X_train = train_set[feature_columns]
y_train = train_set[["median_house_value"]]
X_test = test_set[feature_columns]
y_test = test_set[["median_house_value"]]

(5) Feature standardization

# Feature standardization: fit on the training set only, then reuse those statistics on the test set
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
y_stdsc = StandardScaler()  # separate scaler for the target
y_train_std = y_stdsc.fit_transform(y_train)
y_test_std = y_stdsc.transform(y_test)

# Build the design matrices (this must run after standardization)
X = np.hstack([np.ones((len(X_train_std), 1)), X_train_std])  # training set X with bias column
Y = np.array(y_train_std)                                     # training set Y
x = np.hstack([np.ones((len(X_test_std), 1)), X_test_std])    # test set x with bias column
y = np.array(y_test_std)                                      # test set y
y_var = np.var(y)                                             # variance of the test targets
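To make the standardization step concrete, the sketch below uses hypothetical random arrays in place of the housing features. It shows that `transform` on the test set applies the mean and standard deviation learned from the training set, which is what `fit_transform` followed by `transform` is meant to achieve.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical train/test feature matrices with 3 features each.
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mean and std from the training set
X_test_std = scaler.transform(X_test)        # reuses the training statistics

# The same transformation written out by hand (StandardScaler uses the population std).
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so training and test data would no longer share one scale.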

(6) Initialize θ

theta = np.zeros((14, 1))  # initialize theta: 13 features plus 1 bias term

(7) Normal equation and gradient descent

# Normal equation (closed-form solution)
# Note: np.linalg.inv assumes X^T X is invertible; np.linalg.pinv is more robust
# when columns are linearly dependent.
def nomal(X, Y):
    theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
    return theta

# Loss function
def cost_function(x, theta, y):
    cost = np.sum((np.dot(x, theta) - y) ** 2)
    return cost / (2 * len(y))

# Gradient of the loss with respect to theta
def gradient(X, theta, Y):
    return X.T.dot(X.dot(theta) - Y) / len(Y)

# Gradient descent: stop when the loss changes by less than 1e-18 between iterations
def gradient_descent(X, theta, Y, eta):
    while True:
        last_theta = theta
        grad = gradient(X, theta, Y)
        theta = theta - eta * grad
        if abs(cost_function(X, last_theta, Y) - cost_function(X, theta, Y)) < 1e-18:
            break
    return theta

start = time.time()  # start timing the solver

# theta = nomal(X, Y)  # closed-form solution
theta = gradient_descent(X, theta, Y, 0.001)  # gradient descent

# Evaluation metric (R2)
def evaluation(x, theta, y, y_var):
    return 1 - (np.sum((np.dot(x, theta) - y) ** 2) / (y_var * len(y)))

MSE = np.sum(np.power(np.dot(x, theta) - y, 2)) / len(y)
cost = cost_function(x, theta, y)    # loss on the test set
R2 = evaluation(x, theta, y, y_var)  # R2 on the test set

end = time.time()

# print("The normal equations:")
print("Gradient descent:")
print("theta=")
print(theta)
print("MSE=", MSE)
print("cost=", cost)
print("R2=", R2)
print('Running time: %s Seconds' % (end - start))
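The three reported quantities are closely related: the loss is half the MSE, and R2 equals 1 minus the MSE divided by the variance of the targets. The sketch below checks these identities on hypothetical synthetic data; `r2_score_manual` and `r2_from_mse` are names introduced here for illustration only.

```python
import numpy as np

def cost_function(x, theta, y):
    """Half mean squared error, matching the listing above."""
    return np.sum((x @ theta - y) ** 2) / (2 * len(y))

def r2_score_manual(x, theta, y):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((x @ theta - y) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical data: a known theta plus noise.
rng = np.random.default_rng(3)
x = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
theta = np.array([[0.5], [1.0], [-2.0]])
y = x @ theta + rng.normal(scale=0.5, size=(50, 1))

mse = np.sum((x @ theta - y) ** 2) / len(y)
cost = cost_function(x, theta, y)
r2 = r2_score_manual(x, theta, y)
r2_from_mse = 1 - mse / np.var(y)  # same value, since var(y) * len(y) is the total sum of squares

print(mse, cost, r2, r2_from_mse)
```

For a target with unit variance, a loss of 0.18 corresponds to an MSE of 0.36 and therefore an R2 of about 0.64.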

Experimental results:
(1) Normal equation solution

(2) Gradient descent solution

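The two methods produce very similar results. On the test set, both give a loss of about 0.18 and an R2 slightly above 0.6, with gradient descent fitting marginally better than the normal equation. During the runs, the normal equation was clearly much faster than gradient descent, because gradient descent needs many iterations of updating $\theta$ before it reaches a good solution. However, gradient descent still works well when the number of features $n$ is large, whereas the normal equation requires computing $(X^T X)^{-1}$; with many features this becomes expensive, since inverting the matrix takes $O(n^3)$ time. The normal equation also applies only to linear models and cannot be used for other models such as logistic regression. In this experiment the number of features is small, so solving with the normal equation is the more reasonable choice.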
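The trade-off between the two solvers can be reproduced on a small synthetic problem. This is a sketch under stated assumptions, not a rerun of the housing experiment: the data are random, the learning rate is 0.05, and the loop stops on the gradient norm with an iteration cap rather than on the loss difference used in the listing.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical problem: m = 1000 samples, n = 5 features plus a bias column.
m, n = 1000, 5
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = rng.normal(size=(n + 1, 1))
Y = X @ true_theta + 0.1 * rng.normal(size=(m, 1))

# Closed form: one linear solve of (X^T X) theta = X^T Y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent: many cheap steps toward the same minimizer.
theta_gd = np.zeros((n + 1, 1))
eta = 0.05
iterations = 0
while True:
    grad = X.T @ (X @ theta_gd - Y) / m
    if np.linalg.norm(grad) < 1e-8 or iterations >= 50_000:
        break
    theta_gd = theta_gd - eta * grad
    iterations += 1

max_gap = float(np.max(np.abs(theta_gd - theta_normal)))
print(iterations, max_gap)
```

Both solvers arrive at essentially the same parameters; the closed form needs a single solve, while gradient descent needs many iterations, each costing only a few matrix-vector products.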
