
Nonlinear Regression Models (Part 3): K-Nearest Neighbors


Study notes, for reference only; corrections are welcome.

PS: These notes were originally written in a mixed Chinese-English format.


Nonlinear Regression Models


K-Nearest Neighbors


The KNN approach simply predicts a new sample using the K closest samples from the training set.

KNN cannot be cleanly summarized by a model. Instead, its construction is based solely on the individual samples from the training data.

To predict a new sample for regression, KNN identifies that sample's K nearest neighbors in the predictor space. The predicted response for the new sample is then the mean of the K neighbors' responses. Other summary statistics, such as the median, can also be used in place of the mean to predict the new sample.
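As an illustration of this rule, here is a minimal NumPy sketch; the function name and arguments are hypothetical and not part of the original text:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict a new sample as the mean response of its k closest training samples."""
    # Euclidean distance from the new sample to every training sample
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors in the predictor space
    nearest = np.argsort(dists)[:k]
    # Mean of the neighbors' responses (np.median would give a median-based prediction)
    return y_train[nearest].mean()
```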

The basic KNN method as described above depends on how the user defines the distance between samples. Euclidean distance is the most commonly used metric and is defined as follows:
$$\left(\sum_{j=1}^{P}(x_{aj}-x_{bj})^2\right)^{1/2}$$
where $x_a$ and $x_b$ are two individual samples. Minkowski distance is a generalization of Euclidean distance and is defined as:
$$\left(\sum_{j=1}^{P}|x_{aj}-x_{bj}|^q\right)^{1/q}$$

where q > 0. It is easy to see that when q = 2, Minkowski distance is the same as Euclidean distance. When q = 1, Minkowski distance is equivalent to Manhattan distance, a common metric for samples with binary predictors.
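A small sketch of these metrics in Python; the helper function and toy vectors are made up for illustration:

```python
import numpy as np

def minkowski(xa, xb, q):
    """Minkowski distance between two samples; q = 2 is Euclidean, q = 1 is Manhattan."""
    return (np.abs(xa - xb) ** q).sum() ** (1.0 / q)

xa = np.array([1.0, 2.0, 3.0])
xb = np.array([2.0, 0.0, 3.0])
print(minkowski(xa, xb, 2))  # Euclidean: sqrt(1 + 4 + 0) ~= 2.236
print(minkowski(xa, xb, 1))  # Manhattan: 1 + 2 + 0 = 3.0
```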

Because the KNN method fundamentally depends on distances between samples, the scale of the predictors can have a dramatic influence on those distances.

Data with predictors that are on vastly different scales will generate distances that are weighted towards the predictors with the largest scales.

That is, predictors with the largest scales will contribute most to the distance between samples. To avoid this potential bias and to enable each predictor to contribute equally to the distance calculation, we recommend that all predictors be centered and scaled prior to performing KNN.
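One common way to bundle this preprocessing with the model is a pipeline. The sketch below assumes scikit-learn, and X_train, y_train, X_new are hypothetical data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Center and scale every predictor, then fit KNN in the standardized space
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
# knn.fit(X_train, y_train)      # X_train, y_train are hypothetical training data
# y_pred = knn.predict(X_new)
```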

In addition to the issue of scaling, using distances between samples can be problematic if one or more of the predictor values for a sample are missing, since it is then not possible to compute the distance between samples.

In this case there are a couple of options. First, either the samples or the predictors can be excluded from the analysis.

If a predictor contains a sufficient amount of information across the samples, then an alternative approach is to impute the missing data using a naive estimator such as the mean of the predictor, or a nearest-neighbor approach that uses only the predictors with complete information.
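Both strategies have off-the-shelf counterparts; here is a sketch assuming scikit-learn's imputers, with a small hypothetical array containing a missing value:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 6.0]])

# Naive estimator: fill a missing value with that predictor's mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Nearest-neighbor imputation, where distances use only the complete predictors
X_nn = KNNImputer(n_neighbors=1).fit_transform(X)
```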

Upon pre-processing the data and selecting the distance metric, the next step is to find the optimal number of neighbors. Like tuning parameters in other models, K can be determined by resampling.

Note that a small K tends to overfit, while a large K tends to underfit. In a typical plot of RMSE against K for a KNN model, the RMSE first drops quickly as K increases, then levels off, and finally rises slowly.
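A resampling-based search for K could look like the following sketch, assuming scikit-learn's cross-validated grid search and hypothetical X_train, y_train:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsRegressor())])

# Evaluate K = 1..20 with 10-fold cross-validation, scored by RMSE
grid = GridSearchCV(pipe,
                    param_grid={"knn__n_neighbors": range(1, 21)},
                    cv=10,
                    scoring="neg_root_mean_squared_error")
# grid.fit(X_train, y_train)             # hypothetical training data
# grid.best_params_["knn__n_neighbors"]  # K with the lowest cross-validated RMSE
```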


The elementary version of KNN is intuitive and straightforward and can produce decent predictions, especially when the response is dependent on the local predictor structure.

However, this version does have some notable problems, for which researchers have sought solutions. Two commonly noted problems are the computational time and the possible disconnect between local structure and the predictive ability of KNN.

對于計算時間的問題,我們可以使用k維樹(或稱為k-d樹)來解決。

A k-d tree orthogonally partitions the predictor space using a tree approach. After the tree has been grown, a new sample is placed through the structure, and distances are computed only for those training observations in the tree that are close to the new sample.
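For example, SciPy's cKDTree exposes this structure directly; a minimal sketch with made-up data:

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(1000, 3)   # hypothetical training predictors
tree = cKDTree(X_train)             # grow the k-d tree once, up front

x_new = np.random.rand(3)
# The query only computes distances for training observations near the new sample
dists, idx = tree.query(x_new, k=5)
```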

When the local structure of the predictors is not relevant to the response, KNN can have poor predictive performance. Irrelevant or noise-laden predictors are a particular concern, because they can push samples that are truly alike away from each other in the predictor space.

Hence, removing irrelevant, noise-laden predictors is a key pre-processing step for KNN.

Another approach to enhancing KNN predictivity is to weight each neighbor's contribution to the prediction of a new sample by its distance to that sample. In this variation, training samples that are closer to the new sample contribute more to the predicted response, while those that are farther away contribute less.
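In scikit-learn this variation corresponds to the weights="distance" option; a short sketch, again with hypothetical training data:

```python
from sklearn.neighbors import KNeighborsRegressor

# Closer neighbors contribute more: responses are weighted by 1 / distance
knn_weighted = KNeighborsRegressor(n_neighbors=5, weights="distance")
# knn_weighted.fit(X_train, y_train)   # hypothetical training data
# knn_weighted.predict(X_new)
```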
