[sklearn study notes] Data Preprocessing and Feature Engineering

Published: 2023/12/15 编程问答 25 豆豆

Principal component analysis: sklearn.decomposition.PCA

Feature selection: sklearn.feature_selection

Feature processing: sklearn.preprocessing

Feature extraction: sklearn.feature_extraction

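Data preprocessing

Feature scaling (making data dimensionless)

Converting data measured on different scales to a common scale, or converting data with different distributions to one specific distribution, is collectively called making the data "dimensionless" (feature scaling).

Linear scaling

Centering: subtract a fixed value from every record, which shifts the data.

Scaling: divide every record by a fixed value, which confines the data to a certain range. Taking the logarithm is also commonly treated as a form of scaling, although it is not a linear one.

preprocessing.MinMaxScaler

Normalization: the data is centered on the minimum and then scaled by the range (maximum minus minimum). The data is shifted by the minimum and compressed into [0, 1]: x' = (x - min) / (max - min), computed per column.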

    >>> from sklearn.preprocessing import MinMaxScaler
    >>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
    >>> scaler = MinMaxScaler()
    >>> print(scaler.fit(data))
    MinMaxScaler()
    >>> print(scaler.data_max_)
    [ 1. 18.]
    >>> print(scaler.transform(data))
    [[0.   0.  ]
     [0.25 0.25]
     [0.5  0.5 ]
     [1.   1.  ]]
    >>> print(scaler.transform([[2, 2]]))
    [[1.5 0. ]]

    # Fit and transform in a single step
    result = scaler.fit_transform(data)
    # Map the normalized result back to the original scale
    scaler.inverse_transform(result)
    # When the data is too large to fit in one pass, fit incrementally with partial_fit
    scaler = scaler.partial_fit(data)
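To make the arithmetic behind the scaler concrete, here is a minimal pure-Python sketch of per-column min-max normalization, x' = (x - min) / (max - min). The helper name `min_max_scale` is hypothetical and this is not how scikit-learn implements it; mapping a constant column to 0.0 is an assumption of the sketch.

```python
def min_max_scale(rows):
    """Scale each column of `rows` to [0, 1] using (x - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        # A constant column has range 0; this sketch maps it to 0.0 (assumption).
        [(x - lo) / rng if rng != 0 else 0.0 for x, lo, rng in zip(row, mins, ranges)]
        for row in rows
    ]


data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaled = min_max_scale(data)
print(scaled)  # [[0.0, 0.0], [0.25, 0.25], [0.5, 0.5], [1.0, 1.0]]
```

The result agrees with the `scaler.transform(data)` output shown in the session above.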

preprocessing.StandardScaler

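Standardization: the data is centered on the mean and then scaled by the standard deviation, so each feature ends up with mean 0 and variance 1. This does not change the shape of the distribution: the result is normally distributed only if the original data already was.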

    >>> from sklearn.preprocessing import StandardScaler
    >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
    >>> scaler = StandardScaler()
    >>> print(scaler.fit(data))
    StandardScaler()
    >>> print(scaler.mean_)
    [0.5 0.5]
    >>> print(scaler.transform(data))
    [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
    >>> print(scaler.transform([[2, 2]]))
    [[3. 3.]]
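The same computation can be sketched by hand as z = (x - mean) / std per column. The sketch below uses the population standard deviation (dividing by n), which is what StandardScaler uses; the helper name `standardize` is hypothetical, and mapping a zero-variance column to 0.0 is an assumption of the sketch.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Standardize each column: subtract the mean, divide by the population std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A zero-variance column is mapped to 0.0 in this sketch (assumption).
        [(x - m) / s if s != 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


data = [[0, 0], [0, 0], [1, 1], [1, 1]]
standardized = standardize(data)
print(standardized)  # [[-1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [1.0, 1.0]]
```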

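For most machine learning algorithms, StandardScaler is the usual choice for feature scaling, because MinMaxScaler is very sensitive to outliers: a single extreme value sets the minimum or maximum and squeezes every other value into a narrow band.

Use MinMaxScaler when the data has to be compressed into the [0, 1] interval.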
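A small illustration of the outlier sensitivity of min-max scaling, using made-up numbers (the list `values` is purely illustrative): one extreme value defines the range, so the remaining points end up crowded near 0.

```python
# Four ordinary values plus one outlier (illustrative data, not from the article).
values = [1, 2, 3, 4, 100]

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]

# The outlier maps to 1.0 while the other four values land below 0.05.
print([round(s, 3) for s in scaled])  # [0.0, 0.01, 0.02, 0.03, 1.0]
```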

sklearn.impute.SimpleImputer()

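Fills in missing values.

Parameter	Meaning
missing_values	Placeholder that marks a missing value; defaults to np.nan
strategy	Imputation strategy: "mean" (default), "median", "most_frequent" (mode), or "constant"
fill_value	Used when strategy is "constant": the string or number to fill in
copy	Defaults to True, which creates a copy of the feature matrix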

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    >>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
    SimpleImputer()
    >>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
    >>> print(imp_mean.transform(X))
    [[ 7.   2.   3. ]
     [ 4.   3.5  6. ]
     [10.   3.5  9. ]]
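What the "mean" strategy does can be sketched in plain Python: learn one mean per column from the fitting data, ignoring missing entries, then substitute that mean wherever a value is missing. The helper names `column_means` and `fill_missing` are hypothetical, and the sketch assumes every column has at least one observed value.

```python
import math


def column_means(rows):
    """Mean of each column, skipping NaN entries (assumes no all-NaN column)."""
    means = []
    for col in zip(*rows):
        present = [x for x in col if not math.isnan(x)]
        means.append(sum(present) / len(present))
    return means


def fill_missing(rows, fill_values):
    """Replace NaN entries with the per-column fill value."""
    return [
        [f if math.isnan(x) else float(x) for x, f in zip(row, fill_values)]
        for row in rows
    ]


nan = float("nan")
train = [[7, 2, 3], [4, nan, 6], [10, 5, 9]]
X = [[nan, 2, 3], [4, nan, 6], [10, nan, 9]]

means = column_means(train)      # learned from the fitting data only
filled = fill_missing(X, means)
print(means)   # [7.0, 3.5, 6.0]
print(filled)  # [[7.0, 2.0, 3.0], [4.0, 3.5, 6.0], [10.0, 3.5, 9.0]]
```

As in the session above, the fill values come from the data passed to `fit`, not from the data being transformed.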

preprocessing.LabelEncoder

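Converts class labels into integer codes. It is intended for the target labels (a one-dimensional array) rather than the feature matrix.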

    >>> from sklearn.preprocessing import LabelEncoder
    >>> le = LabelEncoder()
    >>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
    LabelEncoder()
    >>> # Inspect the classes found in the labels
    >>> list(le.classes_)
    ['amsterdam', 'paris', 'tokyo']
    >>> le.transform(["tokyo", "tokyo", "paris"])
    array([2, 2, 1]...)
    >>> list(le.inverse_transform([2, 2, 1]))
    ['tokyo', 'tokyo', 'paris']
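The mapping itself is simple: sort the distinct labels and use each label's position as its code. A minimal sketch, with hypothetical helper names `fit_classes`, `encode`, and `decode`:

```python
def fit_classes(labels):
    """Sorted list of distinct labels; a label's index is its integer code."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map each label to its integer code (raises KeyError for unseen labels)."""
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels]


def decode(codes, classes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


classes = fit_classes(["paris", "paris", "tokyo", "amsterdam"])
codes = encode(["tokyo", "tokyo", "paris"], classes)
print(classes)                 # ['amsterdam', 'paris', 'tokyo']
print(codes)                   # [2, 2, 1]
print(decode(codes, classes))  # ['tokyo', 'tokyo', 'paris']
```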

preprocessing.OneHotEncoder

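One-hot encoding: creates dummy variables.

The integer codes produced by LabelEncoder imply an order (0 < 1 < 2), so they are not suitable for categorical variables whose values have no natural order.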

    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder(handle_unknown='ignore')
    >>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
    >>> enc.fit(X)
    OneHotEncoder(handle_unknown='ignore')
    >>> enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
    array([[1., 0., 1., 0., 0.],
           [0., 1., 0., 0., 0.]])
    >>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
    array([['Male', 1],
           [None, 2]], dtype=object)
    >>> enc.get_feature_names_out(['gender', 'group'])
    array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
    >>> # drop='first' removes the first category of each feature
    >>> drop_enc = OneHotEncoder(drop='first').fit(X)
    >>> drop_enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
    array([[0., 0., 0.],
           [1., 1., 0.]])
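As a rough sketch of what the encoder produces, each feature gets one output column per category, with a 1 in the column matching the value. The helper names `fit_categories` and `one_hot` are hypothetical; leaving an unseen value as all zeros is meant to mirror handle_unknown='ignore', not to reproduce the library's implementation.

```python
def fit_categories(rows):
    """Sorted distinct values of each feature (column)."""
    return [sorted(set(col)) for col in zip(*rows)]


def one_hot(row, categories):
    """Concatenate one indicator vector per feature.

    A value not seen during fitting yields all zeros for that feature.
    """
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1.0 if value == cat else 0.0 for cat in cats)
    return encoded


X = [['Male', 1], ['Female', 3], ['Female', 2]]
categories = fit_categories(X)
print(categories)                          # [['Female', 'Male'], [1, 2, 3]]
print(one_hot(['Female', 1], categories))  # [1.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot(['Male', 4], categories))    # [0.0, 1.0, 0.0, 0.0, 0.0]
```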

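Handling continuous data

sklearn.preprocessing.Binarizer

Binarizes data according to a threshold (default 0.0); used for continuous variables. Values greater than the threshold map to 1; values less than or equal to the threshold map to 0.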

    >>> from sklearn.preprocessing import Binarizer
    >>> X = [[ 1., -1.,  2.],
    ...      [ 2.,  0.,  0.],
    ...      [ 0.,  1., -1.]]
    >>> transformer = Binarizer().fit(X)  # fit does nothing.
    >>> transformer
    Binarizer()
    >>> transformer.transform(X)
    array([[1., 0., 1.],
           [1., 0., 0.],
           [0., 1., 0.]])
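The thresholding rule is a one-liner in plain Python. This sketch (the helper name `binarize` is hypothetical) uses a strict comparison, so a value exactly equal to the threshold maps to 0:

```python
def binarize(rows, threshold=0.0):
    """Map values strictly greater than `threshold` to 1.0, everything else to 0.0."""
    return [[1.0 if x > threshold else 0.0 for x in row] for row in rows]


X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
binary = binarize(X)
print(binary)  # [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```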

preprocessing.KBinsDiscretizer

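Discretizes continuous variables into categorical ones: the values of each feature are assigned to ordered bins (intervals), and the bin membership is then encoded.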

    >>> from sklearn.preprocessing import KBinsDiscretizer
    >>> X = [[-2, 1, -4,   -1],
    ...      [-1, 2, -3, -0.5],
    ...      [ 0, 3, -2,  0.5],
    ...      [ 1, 4, -1,    2]]
    >>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    >>> est.fit(X)
    KBinsDiscretizer(...)
    >>> Xt = est.transform(X)
    >>> Xt
    array([[ 0., 0., 0., 0.],
           [ 1., 1., 1., 0.],
           [ 2., 2., 2., 1.],
           [ 2., 2., 2., 2.]])
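With strategy='uniform' and encode='ordinal', each column's range is split into n_bins equal-width intervals and every value is replaced by the index of its interval. The sketch below is a simplified version of that idea, with a hypothetical helper `uniform_bin`; it is not the library's algorithm, and placing a constant column entirely in bin 0 is an assumption of the sketch.

```python
def uniform_bin(column, n_bins):
    """Assign each value to one of `n_bins` equal-width bins over [min, max]."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins
    if width == 0:
        # Constant column: everything falls in bin 0 (assumption of this sketch).
        return [0 for _ in column]
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((x - lo) // width), n_bins - 1) for x in column]


X = [[-2, 1, -4, -1],
     [-1, 2, -3, -0.5],
     [0, 3, -2, 0.5],
     [1, 4, -1, 2]]

# Bin each column independently, then transpose back to rows.
binned_columns = [uniform_bin(col, n_bins=3) for col in zip(*X)]
Xt = [list(row) for row in zip(*binned_columns)]
print(Xt)  # [[0, 0, 0, 0], [1, 1, 1, 0], [2, 2, 2, 1], [2, 2, 2, 2]]
```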
