    Feature Engineering 2. Categorical Encodings


    Contents

      • 1. Count Encoding
      • 2. Target Encoding
      • 3. CatBoost Encoding

    learn from https://www.kaggle.com/learn/feature-engineering

    Previous: Feature Engineering 1. Baseline Model

    Next: Feature Engineering 3. Feature Generation


    Label Encoding and One-Hot Encoding were covered in Intermediate Machine Learning. This post covers count encoding, target encoding, and CatBoost encoding.

    In the previous post, using LabelEncoder() gave a Validation AUC score of 0.7467:

    # Label encoding
    from sklearn.preprocessing import LabelEncoder

    cat_features = ['category', 'currency', 'country']
    encoder = LabelEncoder()
    encoded = ks[cat_features].apply(encoder.fit_transform)

    1. Count Encoding

    • Count encoding replaces each value of a categorical feature with the number of times that value appears in the feature.
      For example, if the value CN appears 100 times in a feature, every CN is replaced with the number 100.

    • category_encoders.CountEncoder() gives a Validation AUC score of 0.7486.

    import category_encoders as ce

    cat_features = ['category', 'currency', 'country']
    count_enc = ce.CountEncoder()
    count_encoded = count_enc.fit_transform(ks[cat_features])

    data = baseline_data.join(count_encoded.add_suffix("_count"))

    # Training a model on the baseline data
    train, valid, test = get_data_splits(data)
    bst = train_model(train, valid)
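    To make the replacement concrete, here is a minimal sketch of count encoding done by hand with pandas on a hypothetical toy column (the toy DataFrame and column name are assumptions for illustration); for a single column trained on the same data, this should match what CountEncoder computes.

    import pandas as pd

    # Hypothetical toy data to illustrate count encoding by hand
    df = pd.DataFrame({'country': ['CN', 'US', 'CN', 'GB', 'CN', 'US']})

    # Replace each value with the number of times it appears in the column
    counts = df['country'].value_counts()
    df['country_count'] = df['country'].map(counts)
    # CN -> 3, US -> 2, GB -> 1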

    2. Target Encoding

    • category_encoders.TargetEncoder() gives a Validation AUC score of 0.7491.

    Target encoding replaces a categorical value with the average value of the target for that value of the feature.
    For example, given the country value "CA", you would calculate the average outcome for all rows with country == 'CA', which is around 0.28.
    This mean is often blended with the target probability over the entire dataset, which reduces the variance of the encoding for values with few occurrences.
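    As a hedged illustration of that blending, the sketch below uses simple additive smoothing with a weight m (the toy data and the value of m are assumptions); category_encoders.TargetEncoder uses its own smoothing parameters, so this is not its exact formula.

    import pandas as pd

    # Hypothetical toy data: 'outcome' is the binary target
    df = pd.DataFrame({'country': ['CA', 'CA', 'GB', 'GB', 'GB', 'NZ'],
                       'outcome': [1, 0, 1, 1, 0, 1]})

    prior = df['outcome'].mean()   # target mean over the whole dataset
    stats = df.groupby('country')['outcome'].agg(['mean', 'count'])

    m = 10  # smoothing weight; larger m pulls rare categories toward the prior
    smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)

    df['country_target'] = df['country'].map(smoothed)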


    This technique uses the target to create new features, so including the validation or test data when fitting the target encoding would be a form of target leakage.
    Instead, learn the target encoding from the training set only and apply it to the other splits.

    import category_encoders as ce

    cat_features = ['category', 'currency', 'country']

    # Create the encoder itself
    target_enc = ce.TargetEncoder(cols=cat_features)

    train, valid, _ = get_data_splits(data)

    # Fit the encoder using the categorical features and target
    target_enc.fit(train[cat_features], train['outcome'])

    # Transform the features, rename the columns with _target suffix, and join to dataframe
    train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
    valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

    train.head()
    bst = train_model(train, valid)

    3. CatBoost Encoding

    • category_encoders.CatBoostEncoder() gives a Validation AUC score of 0.7492.

    This is similar to target encoding in that it is based on the target probability for a given value.
    With CatBoost encoding, however, the target statistic for each row is calculated only from the rows before it, which limits leakage from a row's own target.

    cat_features = ['category', 'currency', 'country']
    target_enc = ce.CatBoostEncoder(cols=cat_features)

    train, valid, _ = get_data_splits(data)
    target_enc.fit(train[cat_features], train['outcome'])

    train = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
    valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

    bst = train_model(train, valid)
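    To see what "only the rows before it" means, here is a hedged manual sketch of an ordered target statistic on a toy column (the toy data and the weight given to the prior are assumptions); the real CatBoostEncoder has its own handling of the prior and of row order, so treat this as an approximation of the idea rather than its exact formula.

    import pandas as pd

    # Hypothetical toy data, kept in its original row order
    df = pd.DataFrame({'country': ['CA', 'CA', 'CA', 'GB', 'CA'],
                       'outcome': [1, 0, 1, 1, 1]})

    prior = df['outcome'].mean()

    # For each row, use only the target values of *earlier* rows with the same category,
    # blended with the prior so the first occurrence of a category is not undefined.
    grp = df.groupby('country')['outcome']
    earlier_sum = grp.cumsum() - df['outcome']   # sum of targets over previous rows of this category
    earlier_cnt = grp.cumcount()                 # number of previous rows of this category
    df['country_cb'] = (earlier_sum + prior) / (earlier_cnt + 1)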

