Andrew Ng's "Machine Learning", Study Notes 7: Logistic Regression (Binary Classification) Code

Published: 2024/7/23

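I. Logistic regression without regularization
- 1. Problem description
- 2. Importing modules
- 3. Preparing the data
- 4. Hypothesis function
- 5. Cost function
- 6. Gradient descent
- 7. Fitting the parameters
- 8. Predicting and validating on the training set
- 9. Finding the decision boundary
II. Regularized logistic regression
- 1. Preparing the data
- 2. Feature mapping
- 3. Regularized cost function
- 4. Regularized gradient
- 5. Fitting the parameters
- 6. Prediction
- 7. Plotting the decision boundary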

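Course link: https://www.bilibili.com/video/BV164411b7dx?from=search&seid=5329376196520099118

These notes follow directly from the previous two sets of notes, on the logistic regression model and on regularization, and solve a classification problem using both. As I see it, machine learning needs a grip on theory and code alike: even once the theory is understood, the code is a further hurdle, so it pays to practice a lot.

Dataset used in these notes: https://pan.baidu.com/s/1h5Ygse5q2wkTeXA9Pwq2RA
Extraction code: 5rd4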

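I. Logistic regression without regularization

1. Problem description

Build a logistic regression model to predict whether a student is admitted to a university. Each applicant's chance of admission is determined from the results of two exams. Historical data on previous applicants is available and serves as the training set for the logistic regression.

Goals of the Python implementation:
- Build a classifier by solving for the three parameters θ0, θ1, θ2, which gives the decision boundary. Note: θ1 corresponds to the 'Exam 1' score and θ2 to the 'Exam 2' score.
- Set a threshold and decide the admission outcome from it. Note: the threshold is applied to the predicted probability and turns it into a class label. Typically a probability of 0.5 or more means admitted and a probability below 0.5 means not admitted.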

2. Importing modules

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import classification_report  # evaluation report

    plt.style.use('fivethirtyeight')  # nicer plot style

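1. Seaborn is a matplotlib-based visualization package for Python. It provides a high-level interface that makes it easy to draw attractive statistical charts.

Seaborn wraps matplotlib in a higher-level API, which makes plotting easier. In most cases seaborn produces attractive figures with little effort, while matplotlib allows more customized ones. Seaborn should be seen as a complement to matplotlib, not a replacement. It also works well with numpy and pandas data structures and with the statistical routines in scipy and statsmodels.

2. The plt.style.use() function sets the overall style of the figures. plt.style.available lists all the available themes. See the plt.style.use() documentation for details.

3. The classification_report function in sklearn produces a text report of the main classification metrics. The report shows precision, recall, F1 score and related information for each class. See the classification_report documentation for details.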
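To make the columns of that report concrete, the sketch below computes precision, recall and F1 for the positive class by hand. The function name `binary_metrics` and the toy label lists are invented for this illustration; `classification_report` itself reports these values per class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, invented for this example
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

With three true positives, one false positive and one false negative, all three metrics come out to 0.75 here.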

3. Preparing the data

    data = pd.read_csv('work/ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
    data.head()  # first five rows

    data.describe()

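Once the data is loaded, visualize it to see how it is distributed:

    # Style settings: darkgrid theme (grey background with white grid), 2-color palette
    sns.set(context="notebook", style="darkgrid", palette=sns.color_palette("RdBu", 2))

    # hue overlays the groups named by that column in a single plot
    sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
               height=6,       # this argument was called `size` before seaborn 0.9
               fit_reg=False,  # fit_reg controls whether a fitted line is drawn
               scatter_kws={"s": 50})
    plt.show()  # see what the data looks like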

The three functions below extract the features X from the data, extract the labels y, and standardize the features.

    def get_X(df):
        """Read the features.

        Uses concat to add the intercept feature and avoid side effects;
        not efficient for big datasets though.
        """
        ones = pd.DataFrame({'ones': np.ones(len(df))})  # DataFrame with m rows, 1 column
        data = pd.concat([ones, df], axis=1)  # axis=1: align on rows and join the columns
        # Returns an ndarray, not a matrix (as_matrix() was removed in pandas 1.0)
        return data.iloc[:, :-1].to_numpy()

    def get_y(df):
        """Read the labels; assumes the last column is the target."""
        return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column of df

    def normalize_feature(df):
        """Standardize each column; feature scaling applies to logistic regression too."""
        return df.apply(lambda column: (column - column.mean()) / column.std())
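The same three steps can also be done on plain NumPy arrays, which sidesteps the pandas conversion call entirely. This is only a sketch: the helper names `add_intercept` and `standardize` and the small toy table are assumptions made for the example, not part of the notes.

```python
import numpy as np

def add_intercept(features):
    """Prepend a column of ones to an (m, n) feature array."""
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def standardize(features):
    """Scale each column to zero mean and unit sample standard deviation."""
    # ddof=1 matches the default of pandas' Series.std()
    return (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)

# Toy table: two exam scores and a 0/1 label in the last column
raw = np.array([[30.0, 40.0, 0.0],
                [60.0, 80.0, 1.0],
                [90.0, 60.0, 1.0]])
X_demo = add_intercept(raw[:, :-1])  # features with the intercept column
y_demo = raw[:, -1]                  # labels
scaled = standardize(raw[:, :-1])
print(X_demo.shape, y_demo.shape)
```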

Extract the features and labels:

    X = get_X(data)
    print(X.shape)

    y = get_y(data)
    print(y.shape)

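4. Hypothesis function

The hypothesis function of the logistic regression model is h_θ(x) = g(θᵀx), where g is the sigmoid function g(z) = 1 / (1 + e^(-z)):

    def sigmoid(z):
        """Sigmoid function g(z)."""
        return 1 / (1 + np.exp(-z))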
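One practical caveat: for strongly negative z, `np.exp(-z)` overflows and NumPy emits a runtime warning, although the result still rounds to 0. A possible variant that avoids exponentiating large positive numbers is sketched below; the name `stable_sigmoid` is hypothetical, and the notes themselves use the plain formula.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number.

    Always returns a 1-D or higher array, even for scalar input.
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```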

Plot the sigmoid function:

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(np.arange(-10, 10, step=0.01),
            sigmoid(np.arange(-10, 10, step=0.01)))
    ax.set_ylim((-0.1, 1.1))  # visible range of the y axis
    ax.set_xlabel('z', fontsize=18)
    ax.set_ylabel('g(z)', fontsize=18)
    ax.set_title('sigmoid function', fontsize=18)
    plt.show()

5. Cost function

Initialize the parameters:

    theta = np.zeros(3)  # X is m x n, so theta is n x 1
    theta

Define the cost function:

    def cost(theta, X, y):
        """The cost function is -l(theta), which is what we minimize."""
        costf = np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))
        return costf
    # Hint: X @ theta is equivalent to X.dot(theta)
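For reference, the quantity this function computes is the average cross-entropy over the m training examples. It is written here in the course's usual notation, with h_θ(x) = g(θᵀx); the formula is reconstructed from the code above rather than copied from the original notes.

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m}
  \Big[ -y^{(i)} \log h_\theta\big(x^{(i)}\big)
        - \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
```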

Compute the initial value of the cost function:

    cost(theta, X, y)
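A quick sanity check on this value: with θ = 0 every predicted probability is 0.5, so the cost is ln 2 ≈ 0.693 whatever the data. The sketch below checks this on made-up data (the arrays `X_demo` and `y_demo` are assumptions for the example, not the course dataset).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    """Mean cross-entropy, same formula as the cost function above."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Made-up data: intercept column plus two random features, arbitrary 0/1 labels
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 2))])
y_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])

initial_cost = cost(np.zeros(3), X_demo, y_demo)
print(initial_cost, np.log(2))
```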

6. Gradient descent

This is batch gradient descent, rewritten as a vectorized computation.

Define the gradient accordingly:

    def gradient(theta, X, y):
        """Vectorized gradient of the cost function."""
        return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)
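Written out, the gradient that this function implements has the following per-component and vectorized forms. The formulas are reconstructed from the code (m is the number of training examples, g the sigmoid) and use the course's usual notation.

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)},
\qquad
\nabla_\theta J(\theta) = \frac{1}{m} X^{T} \big( g(X\theta) - y \big)
```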

Compute the initial value of the gradient:

    gradient(theta, X, y)
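A common way to gain confidence in a hand-derived gradient is to compare it against a finite-difference approximation of the cost. The sketch below does this on random data; `numerical_gradient`, the step size `eps` and the demo arrays are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central-difference approximation of the gradient of cost."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps  # perturb one parameter at a time
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Made-up data and an arbitrary parameter vector
rng = np.random.default_rng(1)
X_demo = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 2))])
y_demo = rng.integers(0, 2, size=10)
theta_demo = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta_demo, X_demo, y_demo)
numeric = numerical_gradient(theta_demo, X_demo, y_demo)
print(np.max(np.abs(analytic - numeric)))  # should be tiny
```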

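7. Fitting the parameters

Instead of writing a custom parameter-update function, this step uses scipy.optimize.minimize to find the parameters automatically.

    import scipy.optimize as opt

    res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
    print(res)

In the printed result, fun is the value of the cost function after optimization and x holds the three optimized parameter values. With that, training is done.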
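For comparison, a hand-written update loop could look like the sketch below. It is not what these notes use: the learning rate `alpha`, the iteration count and the synthetic data are all assumptions, and a fixed-step loop like this generally converges more slowly than the optimizer above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Repeat theta := theta - alpha * gradient, using all examples each step."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta

# Synthetic data with overlapping classes (invented for this example)
rng = np.random.default_rng(0)
m = 200
features = rng.normal(size=(m, 2))
noise = rng.normal(scale=1.0, size=m)
labels = (features[:, 0] + features[:, 1] + noise > 0).astype(int)
X_demo = np.hstack([np.ones((m, 1)), features])

theta_gd = batch_gradient_descent(X_demo, labels)
start_cost = cost(np.zeros(3), X_demo, labels)
final_cost = cost(theta_gd, X_demo, labels)
print(start_cost, final_cost)
```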

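8. Predicting and validating on the training set

No validation set is provided here, so the training set is used for prediction and validation: the trained model predicts on the training set and the predictions are compared against the true labels.

    def predict(x, theta):
        prob = sigmoid(x @ theta)
        return (prob >= 0.5).astype(int)  # convert booleans to 0/1

    final_theta = res.x
    y_pred = predict(X, final_theta)

    print(classification_report(y, y_pred))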

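9. Finding the decision boundary

The decision boundary is the line θ0 + θ1·x1 + θ2·x2 = 0, that is, x2 = -(θ0/θ2) - (θ1/θ2)·x1:

    print(res.x)  # this is final theta

    coef = -(res.x / res.x[2])  # find the equation
    print(coef)

    x = np.arange(130, step=0.1)
    y = coef[0] + coef[1] * x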
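The rearrangement behind `coef` can be checked with a small worked example. The parameter values below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the fitted values from this dataset, and `boundary_exam2` is a name made up for the sketch.

```python
def boundary_exam2(theta, exam1):
    """Exam 2 score on the decision boundary for a given Exam 1 score."""
    theta0, theta1, theta2 = theta
    # Solve theta0 + theta1 * exam1 + theta2 * exam2 = 0 for exam2
    return -(theta0 + theta1 * exam1) / theta2

# Hypothetical parameters
theta_demo = (-25.0, 0.2, 0.2)
exam1 = 50.0
exam2 = boundary_exam2(theta_demo, exam1)

# On the boundary the linear score is zero, so the predicted probability is 0.5
score = theta_demo[0] + theta_demo[1] * exam1 + theta_demo[2] * exam2
print(exam2, score)
```

With these numbers, an Exam 1 score of 50 puts the boundary at an Exam 2 score of 75.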

Look at the data description again to settle the ranges of x and y:

    data.describe()  # find the range of x and y

    # Uses the default "notebook" context; context controls the scale of the output figure
    sns.set(context="notebook", style="ticks", font_scale=1.5)

    sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
               height=6,  # this argument was called `size` before seaborn 0.9
               fit_reg=False,
               scatter_kws={"s": 25})

    plt.plot(x, y, 'grey')
    plt.xlim(0, 130)
    plt.ylim(0, 130)
    plt.title('Decision Boundary')
    plt.show()

II. Regularized logistic regression

1. Preparing the data

This part uses a new dataset:

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    df.head()

    sns.set(context="notebook", style="ticks", font_scale=1.5)

    sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
               height=6,  # this argument was called `size` before seaborn 0.9
               fit_reg=False,
               scatter_kws={"s": 50})

    plt.title('Regularized Logistic Regression')
    plt.show()

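Judging from this distribution, no straight line can separate the two classes well. A feature mapping is therefore needed: on top of the two existing features, add combinations of their higher powers, so that the decision boundary can become a curve that separates the classes reasonably well.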

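2. Feature mapping

Here I map the two features to the following set: all products test1^a · test2^b with non-negative integer exponents and a + b ≤ 6.

That gives 28 terms in total. Each combination can be treated as an independent feature, i.e. as x1, x2, ..., x28, and the problem is then solved with logistic regression as before.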
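The count of 28 can be checked by listing the exponent pairs directly. This is a minimal sketch; `monomial_exponents` is a name invented for the example, and it enumerates the pairs in the order degree 0, degree 1, and so on.

```python
def monomial_exponents(power):
    """All exponent pairs (a, b) of test1**a * test2**b with a + b <= power."""
    return [(i - p, p) for i in range(power + 1) for p in range(i + 1)]

exponents = monomial_exponents(6)
print(len(exponents))  # number of mapped features
print(exponents[:6])   # the terms of total degree 0, 1 and 2
```

Degree i contributes i + 1 terms, so the total for power 6 is 1 + 2 + ... + 7 = 28.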

    def feature_mapping(x, y, power, as_ndarray=False):
        """Return mapped features as ndarray or dataframe."""
        data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
                for i in np.arange(power + 1)
                for p in np.arange(i + 1)}

        if as_ndarray:
            return pd.DataFrame(data).to_numpy()  # as_matrix() was removed in pandas 1.0
        else:
            return pd.DataFrame(data)

    x1 = np.array(df.test1)
    x2 = np.array(df.test2)

    data = feature_mapping(x1, x2, power=6)
    print(data.shape)
    data.head()

Below is the dataset after feature mapping; it now has 28 feature dimensions:

    data.describe()

3. Regularized cost function

Compared with the earlier expression, it adds a regularization penalty term.
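In the course's usual notation the regularized cost can be written as below. The formula is reconstructed from the code in this section; note that the penalty sum starts at j = 1, so the intercept θ0 is not penalized.

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m}
  \Big[ -y^{(i)} \log h_\theta\big(x^{(i)}\big)
        - \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
  + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}
```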

    theta = np.zeros(data.shape[1])
    X = feature_mapping(x1, x2, power=6, as_ndarray=True)
    print(X.shape)

    y = get_y(df)
    print(y.shape)

    def regularized_cost(theta, X, y, l=1):
        theta_j1_to_n = theta[1:]  # theta0 is not penalized
        regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()
        return cost(theta, X, y) + regularized_term

Compute the initial value of the cost function:

    regularized_cost(theta, X, y, l=1)

Because theta is set to zero, the regularized cost should equal the unregularized cost here.
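The sketch below illustrates that claim, and also shows that the penalty ignores θ0. The data and the three parameter vectors are made up for the example; `regularized_cost` follows the same formula as above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def regularized_cost(theta, X, y, l=1):
    penalty = (l / (2 * len(X))) * np.sum(theta[1:] ** 2)  # skips theta[0]
    return cost(theta, X, y) + penalty

# Made-up data: 6 examples, intercept column plus 3 features
rng = np.random.default_rng(2)
X_demo = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 3))])
y_demo = np.array([1, 0, 0, 1, 1, 0])

zero = np.zeros(4)
intercept_only = np.array([0.7, 0.0, 0.0, 0.0])
mixed = np.array([0.0, 1.0, -2.0, 0.5])

# Difference between regularized and plain cost is exactly the penalty term
gap_zero = regularized_cost(zero, X_demo, y_demo, l=10) - cost(zero, X_demo, y_demo)
gap_intercept = (regularized_cost(intercept_only, X_demo, y_demo, l=10)
                 - cost(intercept_only, X_demo, y_demo))
gap_mixed = regularized_cost(mixed, X_demo, y_demo, l=10) - cost(mixed, X_demo, y_demo)
print(gap_zero, gap_intercept, gap_mixed)
```

For the last vector the penalty is 10 / (2 · 6) · (1 + 4 + 0.25) = 4.375.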

4. Regularized gradient

    def regularized_gradient(theta, X, y, l=1):
        theta_j1_to_n = theta[1:]  # leave out theta0
        regularized_theta = (l / len(X)) * theta_j1_to_n
        regularized_term = np.concatenate([np.array([0]), regularized_theta])
        return gradient(theta, X, y) + regularized_term

Compute the initial value of the gradient:

    regularized_gradient(theta, X, y)

5. Fitting the parameters

    import scipy.optimize as opt

    print('init cost = {}'.format(regularized_cost(theta, X, y)))

    res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
    res

6. Prediction

    final_theta = res.x
    y_pred = predict(X, final_theta)

    print(classification_report(y, y_pred))

7. Plotting the decision boundary

We need all points x that satisfy X×θ = 0. Rather than solving the polynomial equation, build a sufficiently dense grid and evaluate X×θ at every grid point. Where the absolute value of the result is below a small threshold, such as 10^-3, the point is treated as lying on the boundary. Going over every point of the grid yields an approximate boundary.

    def draw_boundary(power, l):
        """
        power: polynomial power for mapped feature
        l: lambda constant
        """
        density = 1000
        threshhold = 2 * 10**-3

        final_theta = feature_mapped_logistic_regression(power, l)
        x, y = find_decision_boundary(density, power, final_theta, threshhold)

        df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
        sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
                   height=6,  # this argument was called `size` before seaborn 0.9
                   fit_reg=False, scatter_kws={"s": 100})

        plt.scatter(x, y, c='r', s=10)  # lowercase 'r' is the red shorthand
        plt.title('Decision boundary')
        plt.show()

    def feature_mapped_logistic_regression(power, l):
        """For drawing purposes only; not a well-generalized logistic regression.
        power: int
            raise x1, x2 to polynomial power
        l: number
            lambda constant for regularization term
        """
        df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
        x1 = np.array(df.test1)
        x2 = np.array(df.test2)
        y = get_y(df)

        X = feature_mapping(x1, x2, power, as_ndarray=True)
        theta = np.zeros(X.shape[1])

        res = opt.minimize(fun=regularized_cost,
                           x0=theta,
                           args=(X, y, l),
                           method='TNC',
                           jac=regularized_gradient)
        final_theta = res.x
        return final_theta

    def find_decision_boundary(density, power, theta, threshhold):
        """Find the grid points that lie approximately on the decision boundary."""
        t1 = np.linspace(-1, 1.5, density)  # `density` points per axis (1000 here)
        t2 = np.linspace(-1, 1.5, density)

        cordinates = [(x, y) for x in t1 for y in t2]
        x_cord, y_cord = zip(*cordinates)
        mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a dataframe

        inner_product = mapped_cord.to_numpy() @ theta  # as_matrix() was removed in pandas 1.0
        decision = mapped_cord[np.abs(inner_product) < threshhold]

        return decision.f10, decision.f01
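The grid-search idea is easier to verify on a function whose zero set is known in advance. The sketch below swaps the fitted model for a stand-in function (the unit circle) and uses `np.meshgrid` instead of a list of coordinate pairs; `grid_boundary`, `circle` and the chosen density and threshold are assumptions for illustration, not part of the notes.

```python
import numpy as np

def grid_boundary(func, low, high, density, threshold):
    """Return the grid points where |func(x, y)| < threshold."""
    axis = np.linspace(low, high, density)
    xx, yy = np.meshgrid(axis, axis)  # every (x, y) combination on the grid
    mask = np.abs(func(xx, yy)) < threshold
    return xx[mask], yy[mask]

def circle(x, y):
    """Stand-in for X @ theta: zero exactly on the unit circle."""
    return x ** 2 + y ** 2 - 1.0

bx, by = grid_boundary(circle, -1.5, 1.5, density=500, threshold=5e-3)
radii = np.sqrt(bx ** 2 + by ** 2)
print(len(bx), radii.min(), radii.max())  # all radii should be close to 1
```

The same trade-off applies as in the notes: a denser grid or a looser threshold yields more boundary points, at the cost of more evaluations or a thicker band.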

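Now let's see how different regularization coefficients change the decision boundary.

    draw_boundary(power=6, l=1)  # lambda = 1

    draw_boundary(power=6, l=0)  # lambda = 0 (the lambda < 0.1 case: no regularization)

    draw_boundary(power=6, l=100)  # lambda = 100 (the lambda > 10 case)

The three examples above show a good fit (λ = 1), overfitting (λ = 0) and underfitting (λ = 100), respectively.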
