Introduction to ECG Databases
Existing ECG datasets
一、四大數(shù)據(jù)庫概述
目前國際上最重要的,具有權威性的心電數(shù)據(jù)庫有四個:
the MIT-BIH ECG database, built jointly by the Massachusetts Institute of Technology and Beth Israel Hospital in the United States;
the AHA arrhythmia ECG database of the American Heart Association;
the European CSE ECG database;
the European ST-T ECG database.
Beyond these, other widely recognized ECG databases include the Sudden Cardiac Death Holter Database, the PTB Diagnostic ECG Database, and the PAF Prediction Challenge Database.
1. The MIT-BIH ECG databases (USA)
MIT-BIH Arrhythmia Database
Diagnosis: Arrhythmia
Sampling rate: 360 Hz
Resolution: 11 bit
Leads: 2
Record length: 30+ min
Storage format: Format 212
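These numbers fit together: at 360 Hz, a record slightly longer than 30 minutes holds about 650,000 samples. A quick sketch of the sample/time conversion (the helper names here are ours, not part of any library):

```python
fs = 360  # MIT-BIH Arrhythmia Database sampling rate in Hz

def seconds_to_samples(t, fs=fs):
    """Number of samples covering t seconds at sampling rate fs."""
    return round(t * fs)

def samples_to_seconds(n, fs=fs):
    """Duration in seconds of n samples at sampling rate fs."""
    return n / fs

# a 30-minute record holds 648,000 samples; the actual records
# hold 650,000 samples each, i.e. just over 30 minutes
print(seconds_to_samples(30 * 60))           # 648000
print(round(samples_to_seconds(650000), 1))  # 1805.6 s, about 30.1 min
```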
23 of the records come from inpatients; 25 were selected because they contain rare but clinically important phenomena. Each of the 48 records is longer than 30 minutes.

Notes on individual records:
106 — ectopic beats are more prominent
114 — the signals were recorded reversed
102, 104, 107, 217 — from patients wearing pacemakers; because the paced rhythm is close to sinus rhythm, many pacemaker-fusion beats occur, but the site notes that, apart from some muscle noise, the signal quality is still very good
108 — the lower channel exhibits considerable noise and baseline shifts
111 — occasional muscle noise and baseline drift, but overall excellent signal quality
113 — the variation in the rate of normal sinus rhythm is possibly due to a wandering atrial pacemaker
114 — the PVCs are uniform
118, 119 — the PVCs are multiform
122 — the lower channel has low-amplitude high-frequency noise throughout
200 — occasional high-frequency noise in the upper channel, severe noise and artifact in the lower channel
203 — QRS morphology changes in the upper channel due to axis shifts; there is considerable noise in both channels, including muscle artifact and baseline shifts — a very difficult record, even for humans
207 — an extremely difficult record; a ventricular rhythm appears after the longest episode of ventricular flutter, and the record ends during an episode of SVTA
214 — two spurious drops in amplitude and one tape slippage
215 — two tape slippages of less than 1 s
219 — following some conversions from atrial fibrillation to normal sinus rhythm there are pauses up to 3 seconds in duration
222 — both channels contain high-frequency noise and artifact
228 — three brief tape slippages, the longest lasting 2.2 s

Records commonly used for denoising algorithms: 100/101/103/105/106/115/215

MIT-BIH ST Change Database
Diagnosis: recorded during exercise stress tests; the records exhibit transient ST depression
Sampling rate: 360 Hz
Resolution: 12 bit
Leads: 2
Record length: varying lengths
Storage format: Format 212
MIT-BIH Atrial Fibrillation Database
Diagnosis: Atrial fibrillation (mostly paroxysmal)
Sampling rate: 250 Hz
Resolution: 12 bit
Leads: 2
Record length: 10 h
Storage format: Format 212
URL: http://ecg.mit.edu/
MIT-BIH Noise Stress Test Database
Twelve half-hour ECG records and three half-hour records of noise typical of ambulatory ECG recordings.
The noisy ECG records were produced from two clean records, 118 and 119, by adding calibrated amounts of noise; the resulting signal-to-noise ratios are tabulated on the database page.
The original records are noise-free, and the reference annotations of the new records are copied from those of the original data.
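The idea of "adding calibrated amounts of noise" can be sketched in a few lines: scale the noise so that the mixture reaches a chosen SNR in dB. This illustrates the principle only and is not the actual NSTDB generation code (`add_noise_at_snr` is our own helper name):

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so signal power / noise power equals 10**(snr_db/10)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

# a stand-in "clean ECG" and random noise, mixed at 6 dB SNR
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(0, 10, 1 / 360))
noisy = add_noise_at_snr(clean, rng.normal(size=clean.size), snr_db=6)

# verify the achieved SNR of the mixture
residual = noisy - clean
snr = 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
print(round(snr, 6))  # 6.0
```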
2. The AHA arrhythmia ECG database
The AHA arrhythmia ECG database was developed by the American Heart Association (AHA) with funding from the US National Heart, Lung, and Blood Institute; its purpose is to evaluate the detection performance of ventricular arrhythmia detectors.
Diagnoses:
No ventricular ectopy (records 1001 through 1010)
Isolated unifocal PVCs (records 2001 through 2010)
Isolated multifocal PVCs (records 3001 through 3010)
Ventricular bi- and trigeminy (records 4001 through 4010)
R-on-T PVCs (records 5001 through 5010)
Ventricular couplets (records 6001 through 6010)
Ventricular tachycardia (records 7001 through 7010)
Ventricular flutter/fibrillation (records 8001 through 8010)
Sampling rate: 250 Hz
Resolution: 12 bit
URL: https://www.ecri.org/Products/Pages/AHA_ECG_DVD.aspx
3. The European CSE database
The European CSE (Common Standards for Electrocardiography) ECG database contains 1000 short ECG recordings in 12 or 15 leads; it was developed mainly to evaluate the performance of automatic ECG analyzers.
e-mail: Paul.Rubel@insa-lyon.fr
4. The European ST-T database
The European ST-T Database was developed by the European Society of Cardiology to evaluate the performance of ST-segment and T-wave detection algorithms.
Diagnoses: each subject was diagnosed with, or suspected of having, myocardial ischemia. Additional selection criteria were established to obtain a representative selection of ECG abnormalities in the database, including baseline ST-segment displacement resulting from conditions such as hypertension, ventricular dyskinesia, and effects of medication.
Sampling rate: 250 Hz
Resolution: 12 bit
Leads: 2
Record length: 2 h
Storage format: Format 212
URL: http://www.escardio.org/Pages/index.aspx
Sudden Cardiac Death Holter Database
Sudden cardiac death is estimated to claim some 400,000 lives per year in the United States and millions more worldwide, so PhysioNet assembled the Sudden Cardiac Death Holter Database to support and stimulate electrophysiological research in this important area.
Diagnoses: 18 patients with underlying sinus rhythm (4 with intermittent pacing), 1 who was continuously paced, and 4 with atrial fibrillation. All patients had a sustained ventricular tachyarrhythmia, and most had an actual cardiac arrest.
Sampling rate: 250 Hz
Resolution: 12 bit
Leads: 2
Record length: 30 min
Storage format: Format 212
URL: http://physionet.org/physiobank/database/sddb/
PTB Diagnostic ECG Database:
A digitized ECG database provided by the Physikalisch-Technische Bundesanstalt, Germany's national metrology institute, for research, algorithmic benchmarking, and teaching. The data were collected at the Department of Cardiology of the Benjamin Franklin University Medical School in Berlin.
Sampling rate: 1000 Hz
Resolution: 16 bit (± 16.384 mV)
Leads: 16 input channels (14 ECG channels, 1 respiration channel, 1 line-voltage channel)
Record length: varying lengths (mostly around 2 min)
Storage format: Format 16
URL: https://archive.physionet.org/cgi-bin/atm/ATM
PAF Prediction Challenge Database
The PAF Prediction Challenge Database comes from an open competition held in 2001 on automatically predicting paroxysmal atrial fibrillation/flutter (PAF). The competition was intended to stimulate and focus exploration of this clinically significant problem and to foster an environment of friendly competition and wide-ranging collaboration.
Diagnosis: paroxysmal atrial fibrillation
Sampling rate: 128 Hz
Resolution: 16 bit
Leads: 2
Record length: 5 min / 30 min
Storage format: Format 16
URL: http://physionet.org/challenge/2001/
以上數(shù)據(jù)庫數(shù)據(jù)共分兩種保存格式,即WFDB signal files Format 212和WFDB signal files Format 16
MIT數(shù)據(jù)集&&PTB數(shù)據(jù)集
WFDB讀取心電數(shù)據(jù)(針對MIT-BIH)
.hea文件格式
100 2 360 650000
100.dat 212 200 11 1024 995 -22131 0 MLII
100.dat 212 200 11 1024 1011 20052 0 V5
# 69 M 1085 1629 x1
# Aldomet, Inderal

What the fields mean: the record line gives the record name, number of signals, sampling frequency (Hz), and number of samples per signal; each signal line gives the data file name, storage format, ADC gain (adu/mV), ADC resolution (bits), ADC zero, the first sample value, a checksum, the block size, and the lead name; comment lines beginning with # carry age, sex, and medications.
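A sketch of pulling these header fields apart with plain string handling (`parse_hea` is an illustrative helper, not part of the wfdb package; the field order follows the WFDB header specification):

```python
def parse_hea(text):
    """Parse the record line and signal lines of a WFDB .hea header."""
    lines = [l for l in text.splitlines() if l and not l.startswith('#')]
    name, n_sig, fs, n_samp = lines[0].split()[:4]
    signals = []
    for line in lines[1:1 + int(n_sig)]:
        f = line.split()
        signals.append({'file': f[0], 'format': f[1],
                        'gain': float(f[2]),   # ADC units per millivolt
                        'adc_res': int(f[3]),  # bits
                        'adc_zero': int(f[4]),
                        'first_value': int(f[5]),
                        'lead': f[-1]})
    return {'name': name, 'fs': int(fs), 'n_samples': int(n_samp),
            'signals': signals}

hea = parse_hea("""100 2 360 650000
100.dat 212 200 11 1024 995 -22131 0 MLII
100.dat 212 200 11 1024 1011 20052 0 V5""")
print(hea['fs'], [s['lead'] for s in hea['signals']])  # 360 ['MLII', 'V5']
# physical value of the first MLII sample: (raw - adc_zero) / gain, in mV
print((hea['signals'][0]['first_value'] - hea['signals'][0]['adc_zero'])
      / hea['signals'][0]['gain'])  # -0.145
```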
數(shù)據(jù)集處理
MIT數(shù)據(jù)集V5導聯(lián)(并非常用的MLII)
"""本程序用于重新預處理raw ecg以獲取更短的seg和更精確的label描述 """ import pandas as pd # from config import slice_window, slice_stride, RpreDistance, RpostDisance import numpy as np import os import wfdb import pickle import gzip import re #當前路徑下所有子目錄 ''':type file_path = '../records100/' pathname filename = [] for root, dirs, files in os.walk(file_path):for file in files:if file.endswith(".dat"):# print(os.path.join(root, file))filename.append(file.split(".")[0])# print(file) print(filename)''' file_path = '../records100/' pathname=[] filename = [] for root, dirs, files in os.walk(file_path):for file in files:if file.endswith(".dat"):# print(os.path.join(root, file))# pathname.append(root)filename.append(os.path.join(root, file))# print(file)# pathname.append(filename.split(".")[0]) # print(filename) number03 = [] for i in filename:number03.append(i.split(".dat")[0]) # print(number03) # all_files =os.listdir(file_path) # print(all_files) # filenames = [] # for dirs in os.walk(file_path): # for file in dirs: # if file.endswith(".dat"): # filenames.append(file)# print(dirs)# print(filenames) numberSet = ['100', '101', '103', '105', '106', '108', '109', '111', '112', '113', '114', '115', '116','117', '118', '119', '121', '122', '123', '124', '200', '201', '202', '203', '205', '207','208', '209', '210', '212', '213', '214', '215', '219', '220', '221', '222', '223', '228','230', '231', '232', '233', '234'] numberSet1 = ['123'] numset=[]# 讀取心電數(shù)據(jù)和對應標簽 def getDataSet(number_str):# ecgClassSet = ['N', 'A', 'V', 'L', 'R']# 讀取心電數(shù)據(jù)記錄print("正在讀取 " + number_str + " 號心電數(shù)據(jù)...")# record = wfdb.rdrecord('C:/Users/MeetT/Desktop/ECG_codes/MIT-BIH/' + number_str, channel_names=['MLII'])record = wfdb.rdrecord( number_str, channel_names=['V5'])# data = record.p_signal.flatten() # (650000,)data = record.p_signal# print(data)# rdata = denoise(data=data)# 獲取心電數(shù)據(jù)記錄中R波的位置和對應的標簽# annotation = wfdb.rdann('C:/Users/MeetT/Desktop/ECG_codes/MIT-BIH/' + number_str, 'atr')# 
annotation = wfdb.rdann(number_str,'dat')# Rlocation = annotation.sample # (xx,)# Rclass = annotation.symbol # (xx,)# print(annotation.symbol)## return data, Rlocation, Rclassreturn datadef prepare_df_02(save_path, slice_window=512, slice_stride=512, RpreDistance=115, RpostDisance=144):# 115=0.32s,144=0.4s# 加載數(shù)據(jù)集并進行預處理,與old_prepare_df的區(qū)別在于采用了新的標簽設置方式"""標簽設置方式:1.截取片段2.if seg_label has 'A' ==> label = 'A'3.else:4. if d_t <= d_Rpre and d_b <= d_Rpost ==> label='N'5. else:6. 截取兩側相鄰label7. if 左N右N ==> label = 'N'8. else if 左N右A ==> 向后取一個拍,從當前位置到R_point+d_Rpost , label = 'A'9. else:10. 向前取到R_point+d_Rpre, label = 'A'"""slices_list = []l = []for n_str in number03:labels_list = []# print(tmp_annot) ['+', 'L', 'L', 'L', 'L', 'L' 所有記錄的第一個標簽是+, 對應節(jié)律改變,應該屬于一種異常,如果必要時把前后5個心拍給刪掉tmp_data = getDataSet(n_str) # ndarray,ndarray,listslices_list.append(tmp_data) ## str(slices_list)print("slices_list",slices_list)print(len(slices_list))np.savetxt('ptb_II.csv',slices_list.reshape(1,-1),delimiter=',')#可根據(jù)需求取消注釋# if not os.path.exists(save_path):# os.makedirs(save_path)# np.savetxt( 'ptb_II.csv',slices_list,delimiter=',' )## slices_list = []# else:# np.savetxt('ptb_II.csv',slices_list,delimiter=',')# print(slices_list)# slices_list = []# for i in slices_list.split(' '):# l.append(i)# print(l)# print(len(l))# print(type(slices_list))# print(len(slices_list))print(len(slices_list))# if len(slices_list)%1 == 0:# if not os.path.exists(save_path):# os.makedirs(save_path)# np.savetxt( 'ptb_II_142.csv',slices_list,delimiter=' ' )## slices_list = []# else:# np.savetxt('ptb_II_142.csv',slices_list,delimiter=' ')# print(slices_list)# slices_list = []# print('DataFrame of ' + slices_list + ' patients is saved.')# 寫成一個迭代器,保存每個片段在源數(shù)據(jù)中所在位置的首索引# slices_head_index_generator = (x for x in range(0, len(tmp_data), slice_stride)# if x + slice_window <= len(tmp_data))# # 直接二元化處理標簽,讀取數(shù)據(jù)及標簽直接判定,不再保存片段內(nèi)的原始標簽# # labels = [0 if x == 'N' else 1 for x in tmp_annot]## for 
per_slice_head_index in slices_head_index_generator:# # 截取片段:# # 如果最后無法截取完整的心拍就跳過# if per_slice_head_index+slice_window > len(tmp_data):# print(n_str+' has stoped in advance.')# break # 快要到頭了,不再迭代# 尋找片段在標簽時間中經(jīng)過排序后在tmp_atrtime(R峰對應的點的位置,有序,不是time)中對應位置的索引# first_R_index_in_atrtime = np.searchsorted(tmp_atrtime, per_slice_head_index, side='left') # 第一個R峰的位置# last_R_index_in_Ratrtime = np.searchsorted(tmp_atrtime, per_slice_head_index + slice_window-1, side='right') # 最后一個R峰再加一個R峰的位置## # 考慮到一種情況,就是first_R_index_in_atrtime=0,last_R_index_in_Ratrtime=len(tmp_atrtime),代表離首尾太近取不到完整的seg:跳過本次循環(huán)# if first_R_index_in_atrtime == 0 or last_R_index_in_Ratrtime >= len(tmp_atrtime): # 因為atrtime最后一個index是len()-1# continue## per_slice = tmp_data[per_slice_head_index:per_slice_head_index + slice_window]# # 進入標簽設置算法# slice_labels = labels[first_R_index_in_atrtime:last_R_index_in_Ratrtime]# if 1 in slice_labels:# per_label = 1# else:# d_left = tmp_atrtime[first_R_index_in_atrtime] - per_slice_head_index + 1 # RpreDistance >=# # d_left:首元素距離片段內(nèi)第一個R峰的位置;d_right:末尾元素距離片段內(nèi)最后一個R峰的位置# d_right = per_slice_head_index + slice_window - 1 - tmp_atrtime[last_R_index_in_Ratrtime-1] # RpostDisance <=# if d_left <= RpreDistance and d_right <= RpostDisance:# per_label = 0# else:# NN_L_LABEL_NORMAL = labels[first_R_index_in_atrtime - 1] == 0 # 片段外最近的左邊的標簽# NN_R_LABEL_NORMAL = labels[last_R_index_in_Ratrtime] == 0# if NN_L_LABEL_NORMAL and NN_R_LABEL_NORMAL:# per_label = 0# elif NN_L_LABEL_NORMAL and not NN_R_LABEL_NORMAL:# # 向后取到NN_R_point+R_POST# # 只計算要多余截取的片段至:tmp_atrtime[last_R_index_in_Ratrtime]+R_post 這一點,往前倒推slice_window個# new_slice_end = tmp_atrtime[last_R_index_in_Ratrtime] + RpostDisance + 1 # 加1防止切片取不到# per_slice = tmp_data[new_slice_end - slice_window:new_slice_end] # 因為切片上限取不到# # 因為d_right已經(jīng)求出來了,所以這里可以怎么搞?--不搞了,懶得算# per_label = 1# else:# # 向前取到NN_R_point+R_PRE# # 計算要向前多余截取的片段至:tmp_atrtime[first_R_index_in_atrtime-1]-R_pre-1 這一點# new_slice_start = 
tmp_atrtime[first_R_index_in_atrtime - 1] - RpreDistance# per_slice = tmp_data[new_slice_start:new_slice_start + slice_window]# per_label = 1## slices_list.append(per_slice)# labels_list.append(per_label)# 利用pandas追加寫的特點將數(shù)據(jù)逐條保存至csv; 選擇csv的理由是比較簡單好操作,很熟悉,而且能打開# 為了避免頻繁的讀寫操作,一個病人的所有數(shù)據(jù)一起寫入,包括id,per_slice,per_label; 用存儲空間換io次數(shù)# 生成dataframe保存至csv文件內(nèi)# keep_head = True if n_str == '100' else False# df1 = pd.DataFrame({'id': np.array([int(n_str) for _ in range(len(slices_list))])}).astype('category')# # df2涉及多個值,需要zip# df2 = pd.DataFrame(np.array(slices_list))# df3 = pd.DataFrame({'labels': labels_list}).astype('category') ## dfA = pd.concat([df1, df2, df3], axis=1) # 這是對于一個病人數(shù)據(jù)的處理# 保存dfA# if not os.path.exists(save_path):# os.makedirs(save_path)# dfA.to_csv(save_path + 'df.csv', index=False, header=keep_head, mode='w')# else:# dfA.to_csv(save_path + 'df.csv', index=False, header=keep_head, mode='a')## print('DataFrame of ' + n_str + ' patients is saved.')def save_pkl(filename, data, compress=True):""" Save dictionary in a pickle file. 
"""if compress:with gzip.open(filename, 'wb') as fh:pickle.dump(data, fh, protocol=4)else:with open(filename, 'wb') as fh:pickle.dump(data, fh, protocol=4)def save_patients_pkl(csv_path, s_window):""":param csv_path: 路徑:param slice_window: 窗口,用于提取樣本:return: 針對每個病人保存一條pkl,不再劃分訓練和測試集"""df = pd.read_csv(csv_path + 'df.csv')print(df.head(5))ids = df['id'].valuesslice_window=512data = df.iloc[:, 1:slice_window + 1].values# print(data[0].dtype) float64labels = df['labels'].valuesassert ids.shape[0] == data.shape[0] == labels.shape[0], 'Length of data and labels must be the same'del dffor n in numberSet:index = np.where(ids == int(n))# print(index[0]) index是包含一個列表作為元素的元組n_data = data[index[0]]n_labels = np.expand_dims(np.array(labels)[index[0]], axis=1) # 加入numpy數(shù)組轉化是因為單個元素數(shù)組作為標量進行索引不行n_set = np.concatenate([n_data, n_labels], axis=1)save_pkl(csv_path + n + '.pkl', n_set, compress=True)print('patient ' + n + ' data have been saved!')print('things have been rearranged.')slice_window=512 slice_stride=512 RpreDistance=115 RpostDisance=144 # data_path = './data/win{}strd{}Rpre{}Rpo{}/'.format(slice_window,slice_stride,RpreDistance,RpostDisance) data_path = './data03/'# data_path = 'intra-patient ECG anomaly detection/DATA/win{}strd{}Rpre{}Rpo{}/'.format(slice_window, # slice_stride, # RpreDistance, # RpostDisance) # np.set_printoptions(linewidth=10000) prepare_df_02(data_path) # res = pd.read_csv('../data02/df02_v5.csv') # print(res) # save_patients_pkl(data_path,slice_window)
Partially reproduced from: https://blog.csdn.net/HJ33_/article/details/120405786
Backed up here in case the original link goes dead.
For study and exchange only.
Summary