Feature Engineering 4. Feature Selection

Published: 2024/7/5


Contents

- 1. Univariate Feature Selection
- 2. L1 regularization

Notes based on https://www.kaggle.com/learn/feature-engineering

Previous post: Feature Engineering 3. Feature Generation

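After the various encoding and feature generation steps, you will often end up with hundreds or thousands of features. This can cause two problems:

First, the more features you have, the more likely the model is to overfit.
Second, the more features you have, the longer it takes to train the model and tune its hyperparameters. Using fewer features can speed up prediction, though it may cost some predictive accuracy.

To address these problems, use feature selection techniques to keep only the most informative features for the model.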

1. Univariate Feature Selection

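The simplest and fastest methods are based on univariate statistical tests.

For each feature, measure how strongly the label depends on that feature alone.
In scikit-learn's feature selection module, feature_selection.SelectKBest returns the K best features according to a scoring function.
For classification problems, the module provides three scoring functions: $\chi^2$, ANOVA F-value, and mutual information score.
The F-value measures the linear dependence between a feature and the target, so if the relationship is non-linear, the score may underestimate how strongly the two are related.
The mutual information score is non-parametric and can capture non-linear relationships.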

    from sklearn.feature_selection import SelectKBest, f_classif

    feature_cols = baseline_data.columns.drop('outcome')

    # Keep the 5 best features
    selector = SelectKBest(f_classif, k=5)  # scoring function, number of features to keep
    X_new = selector.fit_transform(baseline_data[feature_cols],
                                   baseline_data['outcome'])  # features, labels
    X_new

    array([[2015.,    5.,    9.,   18., 1409.],
           [2017.,   13.,   22.,   31.,  957.],
           [2013.,   13.,   22.,   31.,  739.],
           ...,
           [2010.,   13.,   22.,   31.,  238.],
           [2016.,   13.,   22.,   31., 1100.],
           [2011.,   13.,   22.,   31.,  542.]])
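To illustrate the difference between the F-value and mutual information described above, here is a small sketch on synthetic data. It is not part of the course: the features `x_nonlinear` and `x_noise` are made up for the example. The label depends on `x_nonlinear` only through its absolute value, so the class means are almost identical and the F-value has little to work with, while mutual information should still detect the dependence.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 2000

# Hypothetical feature with a purely non-linear link to the label:
# the label is 1 when the feature is far from zero, in either direction.
x_nonlinear = rng.uniform(-1, 1, size=n_samples)
y = (np.abs(x_nonlinear) > 0.5).astype(int)

# Hypothetical feature that is pure noise, unrelated to the label.
x_noise = rng.normal(size=n_samples)

X = np.column_stack([x_nonlinear, x_noise])

# ANOVA F-value compares the per-class means of each feature.
f_scores, p_values = f_classif(X, y)

# Mutual information is estimated non-parametrically for each feature.
mi_scores = mutual_info_classif(X, y, random_state=0)

print("F-values:          ", f_scores)
print("Mutual information:", mi_scores)
```

On data like this, the mutual information of the first feature should stand out clearly from that of the noise feature, while the F-values of the two features are of a similar, small size.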

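However, the code above makes a serious mistake: the selector was fit on all of the data, which leaks information from the validation and test sets into the feature selection.
We should fit the selector, and therefore choose the features, using the training set only.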
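The code in this post calls `get_data_splits`, a helper from the earlier parts of this series that is not reproduced here. The following is only a minimal sketch of what such a helper might look like, assuming the rows are already in chronological order and that the last two slices of the data are held out for validation and test. The body and the `valid_fraction` parameter are assumptions, not the author's definition.

```python
import pandas as pd


def get_data_splits(dataframe, valid_fraction=0.1):
    """Hypothetical stand-in: ordered train/validation/test split.

    Assumes the rows are already sorted in time and that
    len(dataframe) * valid_fraction is at least 1.
    """
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe.iloc[:-valid_rows * 2]            # everything before the last two slices
    valid = dataframe.iloc[-valid_rows * 2:-valid_rows]  # second-to-last slice
    test = dataframe.iloc[-valid_rows:]                  # last slice
    return train, valid, test


# Toy data, made up for the example.
toy = pd.DataFrame({"feature": range(100),
                    "outcome": [i % 2 for i in range(100)]})
train_toy, valid_toy, test_toy = get_data_splits(toy)
print(len(train_toy), len(valid_toy), len(test_toy))  # 80 10 10
```

Whatever the helper's actual definition, the point is that the selector only ever sees the `train` part.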

    feature_cols = baseline_data.columns.drop('outcome')
    train, valid, _ = get_data_splits(baseline_data)

    # Keep 5 features
    selector = SelectKBest(f_classif, k=5)
    X_new = selector.fit_transform(train[feature_cols], train['outcome'])  # the difference: training set only
    X_new

    array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
           [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
           [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
           ...,
           [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
           [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
           [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])

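As you can see, the two approaches select different features.
Next, we need to map the selected values back to their original columns and drop the features that were not selected.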

    import pandas as pd

    # Get back the features we've kept, zero out all other features
    selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                     index=train.index,
                                     columns=feature_cols)
    selected_features.head()

The result has all 13 original columns: goal, hour, day, month, year, category, currency, country, category_currency, category_country, currency_country, count_7_days, time_since_last_project. Only the five selected columns contain non-zero values:

         year  currency  country  currency_country  count_7_days
    0  2015.0       5.0      9.0              18.0        1409.0
    1  2017.0      13.0     22.0              31.0         957.0
    2  2013.0      13.0     22.0              31.0         739.0
    3  2012.0      13.0     22.0              31.0         907.0
    4  2015.0      13.0     22.0              31.0        1429.0

After the inverse transform, every feature that was not selected is 0.0 in all rows, so those columns need to be dropped.

    # Dropped columns have values of all 0s, so their variance is 0.
    # Keep only the columns whose variance is non-zero.
    selected_columns = selected_features.columns[selected_features.var() != 0]

    # Get the valid dataset with the selected features.
    valid[selected_columns].head()

            year  currency  country  currency_country  count_7_days
    302896  2015        13       22                31        1534.0
    302897  2013        13       22                31         625.0
    302898  2014         5        9                18         851.0
    302899  2014        13       22                31        1973.0
    302900  2014         5        9                18        2163.0
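The variance check above would also drop a selected feature that happens to be constant in the training set. As an alternative sketch, not taken from the course, the selection mask can be read directly from the fitted selector with `get_support()`. The column names and the synthetic data below are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_rows = 500
labels = rng.integers(0, 2, size=n_rows)

# Hypothetical data: two columns that track the label, two that are noise.
demo = pd.DataFrame({
    "signal_a": labels + rng.normal(scale=0.5, size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "signal_b": 2 * labels + rng.normal(scale=0.5, size=n_rows),
    "noise_b": rng.normal(size=n_rows),
    "outcome": labels,
})

# Ordered split: fit the selector on the training rows only.
train_demo, valid_demo = demo.iloc[:400], demo.iloc[400:]
demo_cols = demo.columns.drop("outcome")

demo_selector = SelectKBest(f_classif, k=2)
demo_selector.fit(train_demo[demo_cols], train_demo["outcome"])

# get_support() returns a boolean mask over the input columns.
kept_columns = demo_cols[demo_selector.get_support()]
print(list(kept_columns))

# Apply the same column choice to the validation rows.
valid_selected = valid_demo[kept_columns]
```

This skips the round trip through `inverse_transform` and keeps the original column dtypes.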

2. L1 regularization

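Univariate methods consider only one feature at a time when making a selection decision.

Instead, we can select features using all of them at once, by including every feature in a linear model with L1 regularization.

Compared with L2 (Ridge) regression, which penalizes the square of the coefficients, this type of regularization (sometimes called Lasso) penalizes the absolute magnitude of the coefficients.

As the strength of the L1 regularization increases, the coefficients of features that matter less for predicting the target are set to 0.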
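Written out, the two penalties differ as follows. The notation is generic and is not taken from the course: $L(w)$ is assumed to be the unpenalized loss, $w_j$ the coefficient of feature $j$, and $\lambda \ge 0$ the regularization strength.

$$
\text{L1 (Lasso):}\quad \min_{w}\; L(w) + \lambda \sum_{j} |w_j|
\qquad\qquad
\text{L2 (Ridge):}\quad \min_{w}\; L(w) + \lambda \sum_{j} w_j^2
$$

The slope of $|w_j|$ stays constant as $w_j$ approaches zero, so the L1 penalty can push a coefficient exactly to zero. The slope of $w_j^2$ vanishes near zero, so the L2 penalty shrinks coefficients without usually zeroing them. Note that in scikit-learn's LogisticRegression the parameter C is the inverse of the regularization strength, so a smaller C means a stronger penalty.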

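For regression problems, you can use sklearn.linear_model.Lasso.
For classification problems, you can use sklearn.linear_model.LogisticRegression.
Either can be combined with sklearn.feature_selection.SelectFromModel to select the features with non-zero coefficients.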
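The post only shows the classification case, so here is a minimal sketch of the regression case with Lasso and SelectFromModel. It uses synthetic data from `make_regression`; the sample sizes and the value `alpha=1.0` are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 features, of which only 3 drive the target.
X_reg, y_reg, true_coef = make_regression(
    n_samples=300, n_features=10, n_informative=3,
    noise=1.0, coef=True, random_state=0,
)

# Fit an L1-penalized linear regression; alpha is the penalty strength.
lasso = Lasso(alpha=1.0).fit(X_reg, y_reg)

# Wrap the already fitted model and keep features with non-zero coefficients.
reg_selector = SelectFromModel(lasso, prefit=True)
kept = reg_selector.get_support()
X_reg_selected = reg_selector.transform(X_reg)

print("Truly informative features:", np.flatnonzero(true_coef))
print("Features kept by Lasso:    ", np.flatnonzero(kept))
```

In a real project the model would be fit on the training split only, for the same leakage reason discussed in the univariate section.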

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    train, valid, _ = get_data_splits(baseline_data)
    X, y = train[train.columns.drop("outcome")], train['outcome']

    # Set the regularization parameter C=1 (C is the inverse of the regularization strength).
    # solver="liblinear" is specified because the default lbfgs solver does not support the L1 penalty.
    logistic = LogisticRegression(C=1, penalty="l1", solver="liblinear", random_state=7).fit(X, y)
    model = SelectFromModel(logistic, prefit=True)
    X_new = model.transform(X)
    X_new

    array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01, 1.409e+03],
           [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01, 9.570e+02],
           [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01, 7.390e+02],
           ...,
           [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01, 5.150e+02],
           [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00, 1.306e+03],
           [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01, 1.084e+03]])

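As with the univariate test, this returns an array containing only the selected features. We convert it to a DataFrame so that we can recover the names of the selected columns.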

    # Get back the kept features as a DataFrame with dropped columns as all 0s
    selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                     index=X.index,
                                     columns=X.columns)

    # Dropped columns have values of all 0s, keep other columns
    selected_columns = selected_features.columns[selected_features.var() != 0]
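To see how the regularization strength changes the number of surviving features, here is a small sketch on synthetic data from `make_classification`. The data, the list of C values, and the variable names are all made up for the example. Recall that in scikit-learn a smaller C means a stronger L1 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic classification data: 20 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_redundant=0, random_state=0,
)
# Put all features on the same scale so the penalty treats them alike.
X_demo = StandardScaler().fit_transform(X_demo)

n_kept = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    clf = LogisticRegression(C=C, penalty="l1", solver="liblinear", random_state=7)
    clf.fit(X_demo, y_demo)
    # Count the coefficients that the L1 penalty did not set to zero.
    n_kept[C] = int(np.sum(clf.coef_ != 0))
    print(f"C={C}: {n_kept[C]} of {X_demo.shape[1]} coefficients are non-zero")
```

In practice C is a hyperparameter: one reasonable approach is to try several values and keep the one whose selected features give the best score on the validation set.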

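In general, feature selection with L1 regularization is more powerful than univariate tests.
However, with a large amount of data and a large number of features, L1-based selection can also be very slow.
On large datasets, univariate tests will be much faster, but the resulting predictive performance may be worse.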

Finish the course and its exercises to earn a certificate. Keep it up! 🚀🚀🚀
