
Cross Validation in Python

Cross validation may be any of various model validation techniques that are used to assess how well a predictive model will generalize to an independent set of data that the model has not seen before. Hence, it is typically employed in situations where we predict something and we want to gain a rough estimate of how well our predictive model will perform in practice.


“Before Creating any Machine Learning Model, we must know what Cross Validation is and how to choose the best Cross Validation” — Abhishek Thakur, Kaggle’s first 4x Grandmaster


By the end of this post you will have a good understanding of the popular cross validation techniques, how we can implement them using scikit-learn, and how to select the correct CV given a specific problem.


Popular Cross Validation Techniques

Essentially, selecting the correct cross validation technique boils down to the data we have on hand, which is why a cross validation scheme that works for one dataset may not work for another. However, the goal of employing a cross validation technique remains constant: we want to estimate the expected level of fit of a predictive model on unseen data, since with this information we can make the necessary adaptations to our predictive model (if required) or decide to use a totally different one.


Hold-Out Based Cross Validation

I may receive abuse from some experienced Data Science, Machine Learning, and/or Deep Learning practitioners for improper terminology, because cross validation often allows the predictive model to train and test on various splits whereas hold-out sets do not. Regardless, hold-out based cross validation is when we split our data into a train set and a test set. This is often the first validation technique you'll have implemented, and the easiest to get your head around. It consists of quite literally dividing up your data into separate portions, so that you may train your predictive model on one portion and evaluate it on the test set.


Note: Some people take this further and will have a training dataset, a validation dataset, and a test dataset. The validation dataset is used to tune the predictive model, and the test set is used to test how well the model generalizes.

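One common way to produce the three-way split mentioned in the note is to apply train_test_split twice. A minimal sketch, with the 60/20/20 proportions and the toy arrays chosen purely for illustration:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape((50, 2)), np.arange(50)

# first carve off 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# then split the remainder into train and validation (0.25 * 0.8 = 0.2 of the total)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10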

When dividing the data, something to keep in mind is that you have to determine what proportion of the data is for training and what is for testing. I've seen various splits, from 60% train / 40% test to 80% train / 20% test. It is safe to say that 60%-80% of your data should go towards training your predictive model, and the remainder may go directly to the test set (or be split again into validation and test sets).


Source: Adi Bronshtein, Train/Test Split and Cross Validation in Python (https://bit.ly/3fUuyOy)

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
print(X)
# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]
print(list(y))
# [0, 1, 2, 3, 4]

# hold out 20% of the data for testing (an illustrative proportion)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Performing the hold-out based validation technique is most effective when we have a very large dataset. Since we are not required to test on various splits, this technique uses much less computational power, making it the go-to strategy for validation on large datasets.


K-Fold Cross Validation

I briefly touched on what cross validation consists of above: "cross validation often allows the predictive model to train and test on various splits whereas hold-out sets do not." In other words, cross validation is a resampling procedure. When "k" is present in machine learning discussions, it is often used to represent a constant value; for instance, k in k-means clustering refers to the number of clusters, and k in k-Nearest Neighbors refers to the number of neighbors to consider when performing a plurality vote (for classification). This pattern holds true for k-fold cross validation as well, where k refers to the number of groups that a given data sample should be split into.


In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized groups. One of the k groups is removed to serve as the hold-out set, and the remaining groups become the training data. The predictive model is then fit on the training data and evaluated on the hold-out set. This procedure is repeated k times so that each group serves exactly once as the hold-out set.


Source: Scikit-Learn Documentation (https://bit.ly/2POmqVb)

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
print(kf.get_n_splits(X))
# 2
print(kf)
# KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# TRAIN: [2 3] TEST: [0 1]
# TRAIN: [0 1] TEST: [2 3]

Due to its ease of comprehension, k-fold cross validation is quite popular, and it often results in a less biased outcome than performing hold-out based validation. This technique is often a good starting point for regression problems, although it may be better to use stratified k-fold if the distribution of the target variable is not consistent, which would require binning the target variable. Something to consider is the configuration of k: it must split the data so that each train/test split is large enough to be statistically representative of the broader dataset.

由于易于理解,k折交叉驗證非常受歡迎。 與執行基于保留的驗證相比,它通常會減少偏差。 該技術通常是回歸問題的一個很好的起點,盡管如果目標變量的分布不一致,則最好使用分層k倍,這將需要對目標變量進行分箱。 需要考慮的是k的配置,該配置必須拆分數據,以使數據樣本的每個訓練/測試拆分都足夠大,足以在統計學上代表更廣泛的數據集。

Stratified K-Fold

Both of the techniques we have covered so far are relatively effective in many scenarios, although misleading results (and potentially overall failure) can arise when the target data has imbalanced labels. I've been careful not to frame this only as a problem for classification tasks, because a regression task can be adjusted in certain ways so that we are able to perform stratified k-fold validation as well. In these cases, a better solution is to split the data randomly in such a way that we maintain the same class distribution in each subset, which is what we refer to as stratification.


Note: Other than the way we randomly split the data, stratified k-fold cross validation is the same as simple k-fold cross validation.


Source: Scikit-Learn Documentation (https://bit.ly/3iCHavo)

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
print(skf.get_n_splits(X, y))
# 2
print(skf)
# StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# TRAIN: [1 3] TEST: [0 2]
# TRAIN: [0 2] TEST: [1 3]

According to the first 4X Kaggle Grandmaster, Abhishek Thakur, it is safe to say that if we have a standard classification task then applying stratified k-fold cross validation blindly is not a bad idea at all.

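Since stratification needs discrete labels, a regression target has to be binned before stratified k-fold can be applied, as mentioned above. A minimal sketch of that idea, assuming Sturges' rule as the bin-count heuristic; the dataframe and column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# hypothetical dataframe with a continuous regression target
df = pd.DataFrame({"feature": np.random.rand(1000), "target": np.random.rand(1000)})

# Sturges' rule gives a reasonable number of bins for stratification
n_bins = int(np.floor(1 + np.log2(len(df))))
df["bins"] = pd.cut(df["target"], bins=n_bins, labels=False)

# stratify on the bins rather than the raw continuous target
skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X=df, y=df["bins"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(valid_idx)} validation rows")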

Leave-One-Out Cross Validation

Leave-one-out cross validation may be thought of as a special case of k-fold cross validation where k = n, and n is the number of samples within the original dataset. In other words, the model will be trained on n - 1 samples and used to predict the sample that was left out, and this is repeated n times so that each sample serves once as the left-out sample.


Source: DataCamp (https://bit.ly/2POw0HU)

import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
print(loo.get_n_splits(X))
# 2
print(loo)
# LeaveOneOut()
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
# TRAIN: [1] TEST: [0]
# [[3 4]] [[1 2]] [2] [1]
# TRAIN: [0] TEST: [1]
# [[1 2]] [[3 4]] [1] [2]

This technique may require a large amount of computation time, in which case k-fold cross validation may be a better solution. Alternatively, if the dataset is small, then any sizeable hold-out would deprive the predictive model of much of its training data, making leave-one-out a good solution in this situation.

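In practice the n fits are usually driven through cross_val_score rather than a manual loop. A minimal sketch on the iris dataset, which is an illustrative choice rather than one from the original article:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# one model fit per sample (150 fits here), which is why LOOCV gets expensive
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of left-out samples predicted correctly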

Group K-Fold Cross Validation

GroupKFold cross validation is another variation of k-fold cross validation which ensures that the same group is not represented in both the train and test sets. For instance, if we would like to build a predictive model that classifies images of patients' skin as malignant or benign, it is likely that we would have multiple images from the same patient. Since we do not want to split a single patient across the training and test sets, we use GroupKFold instead of k-fold (or stratified k-fold, for that matter); the patients are then treated as the groups.


Source: Scikit-Learn Documentation (https://scikit-learn.org/stable/modules/cross_validation.html)

from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))
# [0 1 2 3 4 5] [6 7 8 9]
# [0 1 2 6 7 8 9] [3 4 5]
# [3 4 5 6 7 8 9] [0 1 2]
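To tie this back to the patient example, a hypothetical sketch where each image's patient ID serves as the group label; the features, labels, and IDs are made up for illustration:

from sklearn.model_selection import GroupKFold

# hypothetical image features, labels, and the patient each image came from
X = [[0.2], [0.4], [0.3], [0.9], [0.8], [0.7]]
y = [0, 0, 1, 1, 0, 1]  # 0 = benign, 1 = malignant (illustrative)
patient_ids = ["p1", "p1", "p2", "p2", "p3", "p3"]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=patient_ids):
    # no patient's images ever appear in both the train and test indices
    print(train, test)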

Note: Another variant of this technique is Stratified Group K-Fold Cross Validation, which is performed when we want to preserve the class distribution and we don't want the same group to appear in two different folds. There is no built-in solution for this in scikit-learn, but there is a nice implementation in this Kaggle Notebook.


Wrap Up

Cross validation is the first step to building machine learning models, and it's extremely important that we consider the data we have when deciding which technique to employ. In some cases, it may even be necessary to adopt new forms of cross validation, depending on the data.


“If you have a good Cross Validation scheme in which validation data is representative of the training and real-world data, you will be able to build a good Machine Learning Model which is highly generalizable” — Abhishek Thakur


This story was highly inspired by the book Approaching (Almost) Any Machine Learning Problem (the link is not an affiliate link and I have not been asked to promote the book) by the first 4x Kaggle Grandmaster, Abhishek Thakur. If you already have some experience with machine learning and want more practical advice, then I'd highly recommend this book.


If you’d like to get in contact with me, I am most reachable on LinkedIn:


Translated from: https://towardsdatascience.com/cross-validation-c4fae714f1c5

