[Algorithm Competition Study] Heartbeat Signal Classification Prediction: Feature Engineering

Published: 2023/12/15

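Task 3: Feature Engineering

This is the Task 3 (feature engineering) part of the beginner data mining series on heartbeat signal classification prediction. It introduces feature engineering and analysis methods for time series data. Discussion and feedback are welcome.

Competition task: Zero-Basics Introduction to Data Mining - Heartbeat Signal Classification Prediction

Project URL:
Competition URL: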

3.1 Learning Objectives

  • Learn feature preprocessing methods for time series data
  • Learn to use Tsfresh (TimeSeries Fresh), a feature processing tool for time series

3.2 Content Overview

  • Data preprocessing
    • Formatting the time series data
    • Adding the time-step feature time
  • Feature engineering
    • Constructing time series features
    • Feature selection
    • Processing time series features with tsfresh

3.3 Code Examples

3.3.1 Importing Packages and Reading the Data

    # Import packages
    import pandas as pd
    import numpy as np
    import tsfresh as tsf
    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    # Read the data
    data_train = pd.read_csv("train.csv")
    data_test_A = pd.read_csv("testA.csv")

    print(data_train.shape)
    print(data_test_A.shape)

    (100000, 3)
    (20000, 2)

    data_train.head()

       id                                  heartbeat_signals  label
    0   0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
    1   1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
    2   2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
    3   3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
    4   4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0

    data_test_A.head()

           id                                  heartbeat_signals
    0  100000  0.9915713654170097,1.0,0.6318163407681274,0.13...
    1  100001  0.6075533139615096,0.5417083883163654,0.340694...
    2  100002  0.9752726292239277,0.6710965234906665,0.686758...
    3  100003  0.9956348033996116,0.9170249621481004,0.521096...
    4  100004  1.0,0.8879490481178918,0.745564725322326,0.531...

3.3.2 Data Preprocessing

    # Reshape the ECG signals from wide to long format (one row per time step),
    # and add the time-step feature "time" to each signal
    train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
    train_heartbeat_df = train_heartbeat_df.reset_index()
    train_heartbeat_df = train_heartbeat_df.set_index("level_0")
    train_heartbeat_df.index.name = None
    train_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
    train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

    train_heartbeat_df

           time  heartbeat_signals
    0         0           0.991230
    0         1           0.943533
    0         2           0.764677
    0         3           0.618571
    0         4           0.379632
    ...     ...                ...
    99999   200           0.000000
    99999   201           0.000000
    99999   202           0.000000
    99999   203           0.000000
    99999   204           0.000000

    20500000 rows × 2 columns

    # Join the reshaped ECG signals back onto the training data,
    # and store the label column of the training data separately
    data_train_label = data_train["label"]
    data_train = data_train.drop("label", axis=1)
    data_train = data_train.drop("heartbeat_signals", axis=1)
    data_train = data_train.join(train_heartbeat_df)

    data_train

              id  time  heartbeat_signals
    0          0     0           0.991230
    0          0     1           0.943533
    0          0     2           0.764677
    0          0     3           0.618571
    0          0     4           0.379632
    ...      ...   ...                ...
    99999  99999   200                0.0
    99999  99999   201                0.0
    99999  99999   202                0.0
    99999  99999   203                0.0
    99999  99999   204                0.0

    20500000 rows × 4 columns

    data_train[data_train["id"]==1]

       id  time  heartbeat_signals
    1   1     0           0.971482
    1   1     1           0.928969
    1   1     2           0.572933
    1   1     3           0.178457
    1   1     4           0.122962
    ..  ...   ...                ...
    1   1   200                0.0
    1   1   201                0.0
    1   1   202                0.0
    1   1   203                0.0
    1   1   204                0.0

    205 rows × 4 columns

As shown, the ECG feature of each sample consists of heartbeat signal values at 205 time steps.
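The wide-to-long reshaping used in this step can be tried on a small table first. The sketch below is only an illustration: the DataFrame `toy` and its values are made up (three time steps per sample instead of 205), and the index level names `row` and `time` are chosen here for readability rather than taken from the original code.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: two samples, three time steps each.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.4,0.0"],
    "label": [0.0, 2.0],
})

# Split each string into one column per time step, then stack the columns
# into rows. The result is a Series indexed by (original row, time step).
stacked = toy["heartbeat_signals"].str.split(",", expand=True).stack()
stacked.index.names = ["row", "time"]

# Convert to float and move the time step out of the index into a column.
long_df = (
    stacked.astype(float)
    .rename("heartbeat_signals")
    .reset_index(level="time")
)

# Attach the sample id by joining on the original row index.
toy_long = toy[["id"]].join(long_df)

# Every sample should contribute the same number of time steps.
steps_per_id = toy_long.groupby("id")["time"].count()

print(toy_long)
print(steps_per_id.to_dict())
```

Running the same count on the real `data_train` would be one way to confirm that every id has exactly 205 rows before passing the table to tsfresh.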

3.3.3 Processing Time Series Features with tsfresh

  • Feature extraction
    Tsfresh (TimeSeries Fresh) is a third-party Python package that automatically computes a large number of features from time series data. It also includes methods for evaluating feature importance and for feature selection, so it is a good choice for feature extraction in both classification and regression problems on time series data. Official documentation: Introduction — tsfresh 0.17.1.dev24+g860c4e1 documentation

        from tsfresh import extract_features

        # Feature extraction
        train_features = extract_features(data_train, column_id='id', column_sort='time')
        train_features

        id     sum_values  abs_energy  mean_abs_change  mean_change  ...
        0      38.927945   18.216197   0.019894         -0.004859    ...
        1      19.445634   7.705092    0.019952         -0.004762    ...
        2      21.192974   9.140423    0.009863         -0.004902    ...
        ...    ...         ...         ...              ...          ...
        99997  40.897057   16.412857   0.019470         -0.004538    ...
        99998  42.333303   14.281281   0.017032         -0.004902    ...
        99999  53.290117   21.637471   0.021870         -0.004539    ...

        100000 rows × 779 columns
  • Feature selection
    train_features contains 779 common time series features of heartbeat_signals (the official documentation explains each of them). Some of these features may be NaN because the data does not support their computation. The impute function handles them by replacing each NaN with the median of its column:

        from tsfresh.utilities.dataframe_functions import impute

        # Replace the NaN values in the extracted features
        impute(train_features)

        id     sum_values  abs_energy  mean_abs_change  mean_change  ...
        0      38.927945   18.216197   0.019894         -0.004859    ...
        1      19.445634   7.705092    0.019952         -0.004762    ...
        2      21.192974   9.140423    0.009863         -0.004902    ...
        ...    ...         ...         ...              ...          ...
        99997  40.897057   16.412857   0.019470         -0.004538    ...
        99998  42.333303   14.281281   0.017032         -0.004902    ...
        99999  53.290117   21.637471   0.021870         -0.004539    ...

        100000 rows × 779 columns
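To make the feature table less of a black box, the first four columns shown above (sum_values, abs_energy, mean_abs_change, mean_change) can be recomputed by hand. The sketch below uses plain Python and follows the usual definitions of these quantities; it is an illustration under that assumption, not tsfresh's own code, and the short `signal` list is made up.

```python
def sum_values(x):
    """Sum of all values in the series."""
    return sum(x)


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def mean_abs_change(x):
    """Mean absolute difference between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)


def mean_change(x):
    """Mean difference between consecutive values.

    The consecutive differences telescope, so this equals
    (last value - first value) / (n - 1).
    """
    return (x[-1] - x[0]) / (len(x) - 1)


# Hypothetical five-step signal (the real signals have 205 steps).
signal = [1.0, 0.5, 0.25, 0.5, 0.0]

features = {
    "sum_values": sum_values(signal),
    "abs_energy": abs_energy(signal),
    "mean_abs_change": mean_abs_change(signal),
    "mean_change": mean_change(signal),
}
print(features)
```

The telescoping form of mean_change gives a quick consistency check against the table: sample 0 starts at 0.991230 and ends at 0.0 over 205 time steps, so (0.0 - 0.991230) / 204 ≈ -0.004859, which matches the mean_change value listed for id 0.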

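    Next, features are selected according to their relevance to the response variable. This takes two steps: first, the relevance of each feature to the response variable is evaluated individually; then the Benjamini-Yekutieli procedure [1] is applied to decide which features are kept.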

        from tsfresh import select_features

        # Select features according to their relevance to the data label
        train_features_filtered = select_features(train_features, data_train_label)

        train_features_filtered

        id     sum_values  fft_coefficient__attr_"abs"__coeff_35  fft_coefficient__attr_"abs"__coeff_34  ...
        0      38.927945   1.168685                               0.982133                               ...
        1      19.445634   1.460752                               1.924501                               ...
        2      21.192974   1.787166                               2.1469872                              ...
        ...    ...         ...                                    ...                                    ...
        99997  40.897057   1.190514                               0.674603                               ...
        99998  42.333303   1.237608                               1.325212                               ...
        99999  53.290117   0.154759                               2.921164                               ...

        100000 rows × 700 columns

    As shown, 700 features remain after feature selection.
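The second step of the selection, the Benjamini-Yekutieli procedure [1], can be sketched in a few lines of plain Python. This is a minimal illustration of the step-up rule under stated assumptions, not tsfresh's implementation: the function name `benjamini_yekutieli`, the p-values, and the level of 0.05 are all chosen here for the example.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a list of booleans: True where the hypothesis is rejected,
    i.e. where the corresponding feature would be kept as relevant.

    Step-up rule: sort the p-values, find the largest rank k with
        p_(k) <= k * fdr_level / (m * c(m)),   c(m) = 1 + 1/2 + ... + 1/m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic-sum correction that makes the bound valid under dependency.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr_level / (m * c_m):
            largest_k = rank

    keep = [False] * m
    for idx in order[:largest_k]:
        keep[idx] = True
    return keep


# Hypothetical p-values, one per feature, from the per-feature relevance tests.
p_values = [0.001, 0.2, 0.004, 0.03, 0.6]
keep = benjamini_yekutieli(p_values)
print(keep)
```

With five hypotheses the threshold at rank k is roughly k × 0.0044, so only the two smallest p-values (0.001 and 0.004) pass, and the feature with p = 0.03 is dropped even though it would pass an uncorrected 0.05 cut-off.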

    References

    [1] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 1165–1188.
