[scikit-learn Machine Learning] 3. K-Nearest Neighbors: Classification and Regression

Published: 2024/7/5

Contents

- 1. KNN model
- 2. KNN classification
- 3. KNN classification with sklearn
- 4. KNN regression

These are study notes on the book 《scikit-learn 机器学习（第2版）》.

The k-nearest neighbors method (K-Nearest Neighbor, K-NN) is commonly used in search and recommendation systems.

1. KNN model

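Choose a distance metric (for example, Euclidean distance).
Find the K training samples nearest to the query point and apply a decision strategy to their responses to make the prediction.
Model assumption: samples that are close to each other have similar response values.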
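The three points above can be sketched as a single function. This is a minimal illustration, not the book's code: `knn_predict`, its `task` parameter, and the toy data are hypothetical names introduced here, and the decision strategies (majority vote for classification, mean for regression) are the ones used in the later sections.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3, task="classify"):
    """Hypothetical helper: predict for one query point x with plain KNN."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    # Step 1: Euclidean distance from x to every training sample
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    # Step 3: decision strategy (majority vote or mean)
    if task == "classify":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two well-separated clusters
X_toy = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
label = knn_predict(X_toy, ["a", "a", "a", "b", "b"], [0.5, 0.5], k=3)
value = knn_predict(X_toy, [1.0, 2.0, 3.0, 10.0, 12.0], [5.5, 5.0],
                    k=2, task="regress")
print(label, value)
```

The query near the origin is surrounded by three "a" samples, and the regression query sits between the two samples of the second cluster, so its prediction is the mean of their targets.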

2. KNN classification

Classify sex from height and weight.

    import numpy as np
    import matplotlib.pyplot as plt

    X_train = np.array([
        [158, 64], [170, 86], [183, 84], [191, 80], [155, 49],
        [163, 59], [180, 67], [158, 54], [170, 67]
    ])
    y_train = ['male', 'male', 'male', 'male',
               'female', 'female', 'female', 'female', 'female']

    plt.figure()
    plt.title('Human Heights and Weights by Sex')
    plt.xlabel('Height in cm')
    plt.ylabel('Weight in kg')
    for i, x in enumerate(X_train):
        if y_train[i] == 'male':
            c1 = plt.scatter(x[0], x[1], c='k', marker='x')
        else:
            c2 = plt.scatter(x[0], x[1], c='r', marker='o')
    plt.grid(True)
    plt.legend((c1, c2), ('male', 'female'), loc='lower right')
    # plt.show()

Predict the sex of a person who is 155 cm tall and weighs 70 kg, using a KNN model with k = 3.

Compute the distances:

    x = np.array([[155, 70]])
    dis = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    dis

Select the k nearest neighbors:

    nearest_k_neighbor = dis.argsort()[0:3]
    k_genders = [y_train[i] for i in nearest_k_neighbor]
    k_genders  # ['male', 'female', 'female']

Count the labels of the k nearest neighbors:

    from collections import Counter
    # b = Counter(np.take(y_train, dis.argsort()[0:3]))
    b = Counter(k_genders)
    b  # Counter({'male': 1, 'female': 2}); female is the majority
    # help(Counter.most_common)
    # most_common(self, n=None)
    #     List the n most common elements and their counts from the most
    #     common to the least. If n is None, then list all element counts.
    b.most_common(2)  # [('female', 2), ('male', 1)]
    b.most_common(1)[0][0]  # 'female'

3. KNN classification with sklearn

Convert the labels (male, female) to numbers. LabelBinarizer encodes female as 0 and male as 1.

    from sklearn.preprocessing import LabelBinarizer
    from sklearn.neighbors import KNeighborsClassifier

    lb = LabelBinarizer()
    y_train_lb = lb.fit_transform(y_train)
    y_train_lb
    # array([[1],[1],[1],[1],[0],[0],[0],[0],[0]])
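To see which label maps to which number, the fitted binarizer can be inspected. The sketch below is an illustration only: the `mapping` dictionary is a name introduced here, built from the fitted `classes_` attribute, and the training labels are repeated so the block runs on its own.

```python
from sklearn.preprocessing import LabelBinarizer

y_train = ['male', 'male', 'male', 'male',
           'female', 'female', 'female', 'female', 'female']

lb = LabelBinarizer()
y_train_lb = lb.fit_transform(y_train)

# classes_ holds the sorted label values; with two classes,
# the first is encoded as 0 and the second as 1
mapping = {str(label): code for code, label in enumerate(lb.classes_)}
print(mapping)

# Round trip: decoding the encoded column gives back the original labels
decoded = lb.inverse_transform(y_train_lb)
print(list(decoded) == y_train)
```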

Predict the sex for the earlier example:

    K = 3
    clf = KNeighborsClassifier(n_neighbors=K)
    clf.fit(X_train, y_train_lb.ravel())
    pred_gender = clf.predict(x)
    pred_gender  # array([0])
    pred_label_gender = lb.inverse_transform(pred_gender)
    pred_label_gender  # array(['female'], dtype='<U6')
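One way to check that the classifier relies on the same neighbors as the manual calculation is the `kneighbors` method of the fitted estimator. The following is a minimal sketch with the data repeated so it runs on its own; the labels are written directly as 0/1 (an assumption matching the binarized output shown earlier).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([
    [158, 64], [170, 86], [183, 84], [191, 80], [155, 49],
    [163, 59], [180, 67], [158, 54], [170, 67]
])
y_train_lb = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])  # 1 = male, 0 = female
x = np.array([[155, 70]])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train_lb)

# kneighbors returns the distances to, and the row indices of,
# the nearest training samples for each query point
distances, indices = clf.kneighbors(x)
neighbor_labels = y_train_lb[indices[0]]
print(indices[0], distances[0].round(2), neighbor_labels)
```

The neighbor labels contain one male and two females, which is why the majority vote yields female.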

Validate on the test set:

    X_test = np.array([[168, 65], [180, 96], [160, 52], [169, 67]])
    y_test = ['male', 'male', 'female', 'female']
    y_test_lb = lb.transform(y_test)
    pred_lb = clf.predict(X_test)
    print('Predicted labels: %s' % lb.inverse_transform(pred_lb))
    # Predicted labels: ['female' 'male' 'female' 'female']

Compute evaluation metrics

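Accuracy: the fraction of predictions that are correct, here 3/4.

    from sklearn.metrics import accuracy_score
    accuracy_score(y_test_lb, pred_lb)  # 0.75

Precision: with male as the positive class, males predicted as male / (males predicted as male + females predicted as male).

    from sklearn.metrics import precision_score
    precision_score(y_test_lb, pred_lb)  # 1.0

Recall: males predicted as male / (males predicted as male + males predicted as female).

    from sklearn.metrics import recall_score
    recall_score(y_test_lb, pred_lb)  # 0.5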
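The three ratios can also be worked out by hand from the four test predictions. This is a small sketch, assuming the 0/1 encoding above (1 = male, 0 = female); the count variables `tp`, `fp`, `fn`, and `tn` are names introduced here.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0])  # male, male, female, female
y_pred = np.array([0, 1, 0, 0])  # female, male, female, female

# Counts with male (1) as the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # males predicted male
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # females predicted male
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # males predicted female
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # females predicted female

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```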

F1 score

The F1 score balances precision and recall.

    from sklearn.metrics import f1_score
    f1_score(y_test_lb, pred_lb)  # 0.6667

Classification report:

    from sklearn.metrics import classification_report
    # help(classification_report)
    # classification_report(y_true, y_pred, labels=None, target_names=None,
    #     sample_weight=None, digits=2, output_dict=False, zero_division='warn')
    print(classification_report(y_test_lb, pred_lb,
                                target_names=['male', 'female'], labels=[1, 0]))
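As a quick check of the 0.6667 figure, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below plugs in the precision and recall reported earlier and compares the result with `f1_score` on the same 0/1 labels (written out by hand here so the block runs on its own).

```python
from sklearn.metrics import f1_score

precision, recall = 1.0, 0.5

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

# Same test labels and predictions as above, 1 = male, 0 = female
f1_sklearn = f1_score([1, 1, 0, 0], [0, 1, 0, 0])
print(round(f1_manual, 4), round(f1_sklearn, 4))
```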

4. KNN regression

Predict weight from height and sex.

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    X_train = np.array([
        [158, 1], [170, 1], [183, 1], [191, 1], [155, 0],
        [163, 0], [180, 0], [158, 0], [170, 0]
    ])
    y_train = [64, 86, 84, 80, 49, 59, 67, 54, 67]
    X_test = np.array([[168, 1], [180, 1], [160, 0], [169, 0]])
    y_test = [65, 96, 52, 67]

    K = 3
    clf = KNeighborsRegressor(n_neighbors=K)
    clf.fit(X_train, y_train)
    predictions = clf.predict(np.array(X_test))
    predictions
    # array([70.66666667, 79.        , 59.        , 70.66666667])

    # help(r2_score)
    # R^2 (coefficient of determination)
    r2_score(y_test, predictions)  # 0.6290565226735438
    # Mean absolute error
    mean_absolute_error(y_test, predictions)  # 8.333333333333336
    # Mean squared error
    mean_squared_error(y_test, predictions)  # 95.8888888888889
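The regressor's output can be reproduced by hand: each prediction is the mean weight of the three nearest training samples. The sketch below is an illustration under that assumption (uniform weights, Euclidean distance, which are the `KNeighborsRegressor` defaults); `knn_regress` is a hypothetical helper name, and R^2 is computed from its definition 1 - SS_res / SS_tot.

```python
import numpy as np

X_train = np.array([
    [158, 1], [170, 1], [183, 1], [191, 1], [155, 0],
    [163, 0], [180, 0], [158, 0], [170, 0]
], dtype=float)
y_train = np.array([64, 86, 84, 80, 49, 59, 67, 54, 67], dtype=float)
X_test = np.array([[168, 1], [180, 1], [160, 0], [169, 0]], dtype=float)
y_test = np.array([65, 96, 52, 67], dtype=float)


def knn_regress(x, k=3):
    """Hypothetical helper: mean target of the k nearest training samples."""
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()


manual = np.array([knn_regress(x) for x in X_test])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - manual) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(manual.round(2), round(r2, 4))
```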

Effect of unstandardized data

    from scipy.spatial.distance import euclidean
    # help(euclidean)  # Euclidean distance; expects 1-D vectors

    # Heights in millimeters (separate names so the regression data is kept)
    X_train_mm = np.array([[1700, 1], [1600, 0]])
    x_test_mm = np.array([1640, 1])
    print(euclidean(X_train_mm[0, :], x_test_mm))
    print(euclidean(X_train_mm[1, :], x_test_mm))
    # 60.0
    # 40.01249804748511

    # The same heights in meters
    X_train_m = np.array([[1.7, 1], [1.6, 0]])
    x_test_m = np.array([1.64, 1])
    print(euclidean(X_train_m[0, :], x_test_m))
    print(euclidean(X_train_m[1, :], x_test_m))
    # 0.06000000000000005
    # 1.0007996802557444

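The Euclidean distances change drastically with the unit: in millimeters the second sample is the nearer one, in meters the first one is.

Standardize the data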

    from sklearn.preprocessing import StandardScaler

    # X_train is the height/sex training matrix from section 4
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    print(X_train)
    print(X_train_scaled)

Output:

    [[158   1]
     [170   1]
     [183   1]
     [191   1]
     [155   0]
     [163   0]
     [180   0]
     [158   0]
     [170   0]]
    [[-0.9908706   1.11803399]
     [ 0.01869567  1.11803399]
     [ 1.11239246  1.11803399]
     [ 1.78543664  1.11803399]
     [-1.24326216 -0.89442719]
     [-0.57021798 -0.89442719]
     [ 0.86000089 -0.89442719]
     [-0.9908706  -0.89442719]
     [ 0.01869567 -0.89442719]]
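The scaled values are z-scores: each column has its mean subtracted and is divided by its standard deviation. A minimal check, assuming the population standard deviation (NumPy's default `ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([
    [158, 1], [170, 1], [183, 1], [191, 1], [155, 0],
    [163, 0], [180, 0], [158, 0], [170, 0]
], dtype=float)

# Column-wise mean and population standard deviation (ddof=0)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
manual = (X_train - mean) / std

scaled = StandardScaler().fit_transform(X_train)
print(np.allclose(manual, scaled))
print(manual[0].round(4))
```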

After standardizing the features, the model error is lower:

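    # Scale the test set with the statistics learned from the training set,
    # then refit the regressor on the scaled training data
    X_test_scaled = scaler.transform(X_test)
    clf.fit(X_train_scaled, y_train)
    pred = clf.predict(X_test_scaled)
    pred
    # array([78.        , 83.33333333, 54.        , 64.33333333])

    # R^2 (coefficient of determination)
    r2_score(y_test, pred)  # 0.6706425961745109
    # Mean absolute error
    mean_absolute_error(y_test, pred)  # 7.583333333333336
    # Mean squared error
    mean_squared_error(y_test, pred)  # 85.13888888888893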
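The scale-then-fit steps can also be bundled so that the test data is always transformed with the training statistics. This is an alternative sketch, not the approach used in these notes: it relies on scikit-learn's `make_pipeline`, and the variable names `model`, `pipeline_pred`, and `manual_pred` are introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([
    [158, 1], [170, 1], [183, 1], [191, 1], [155, 0],
    [163, 0], [180, 0], [158, 0], [170, 0]
])
y_train = [64, 86, 84, 80, 49, 59, 67, 54, 67]
X_test = np.array([[168, 1], [180, 1], [160, 0], [169, 0]])

# The pipeline fits the scaler on X_train only and applies the
# same transformation to anything passed to predict()
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(X_train, y_train)
pipeline_pred = model.predict(X_test)

# Equivalent manual steps, for comparison
scaler = StandardScaler().fit(X_train)
knn = KNeighborsRegressor(n_neighbors=3).fit(scaler.transform(X_train), y_train)
manual_pred = knn.predict(scaler.transform(X_test))

print(np.allclose(pipeline_pred, manual_pred))
```

Keeping the scaler inside the pipeline avoids accidentally fitting it on the test data or forgetting to refit the regressor after scaling.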
