當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

kaggle项目：基于随机森林模型的心脏病人预测分类

發布時間：2023/12/9 编程问答 36 豆豆

生活随笔收集整理的這篇文章主要介紹了 kaggle项目：基于随机森林模型的心脏病人预测分类小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

Kaggle案例：基于隨機森林的心臟病人預測分類

大家好，我是Peter~

今天給大家分享的一個kaggle案例：基于隨機森林模型（RandomForest）的心臟病人預測分類。本文涉及到的知識點主要包含：

數據預處理和類型轉化
隨機森林模型建立與解釋
部分依賴圖PDP的繪制和解釋
AutoML機器學習SHAP庫的使用和解釋（個人待提升）

導讀

Of all the applications of machine-learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is the particular course of treatment (potentially with side-effects), or surgery, or the absence of treatment, people are going to want to know why.

在機器學習的所有應用中，使用黑匣子診斷任何嚴重疾病總是很難的。如果模型的輸出是特定的治療過程（可能有副作用）、手術或是否有療效，人們會想知道為什么。

This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first used in a simple random forest model, and then the model is investigated using ML explainability tools and techniques.

該數據集提供了許多變量以及患有或不患有心臟病的目標條件。下面，數據首先用于一個簡單的隨機森林模型，然后使用 ML 可解釋性工具和技術對該模型進行研究。

原notebook地址：https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model，感興趣的可以參考原文學習。

導入庫

本案例中涉及到多個不同方向的庫：

數據預處理
多種可視化繪圖；尤其是shap的可視化，模型可解釋性的模塊的使用（后面會專門寫這個庫）
隨機森林模型
模型評價等

import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from sklearn.ensemble import RandomForestClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.tree import export_graphviz from sklearn.metrics import roc_curve, auc from sklearn.metrics import classification_report from sklearn.metrics import confusion_matrix from sklearn.model_selection import train_test_split import eli5 from eli5.sklearn import PermutationImportance import shap from pdpbox import pdp, info_plots np.random.seed(123) pd.options.mode.chained_assignment = None

數據探索EDA

1、導入數據

2、缺失值情況

數據比較完美，沒有任何缺失值！

字段含義

在這里重點介紹下各個字段的含義。Peter近期導出的數據集中字段和原notebook中的字段名字寫法稍有差異，已經為大家做了一一對應的關系，下面是具體的中文含義：

age：年齡

sex 性別 1=male 0=female

cp 胸痛類型；4種取值情況

1：典型心絞痛
2：非典型心絞痛
3：非心絞痛
4：無癥狀

trestbps 靜息血壓

chol 血清膽固醇

fbs 空腹血糖 >120mg/dl ：1=true； 0=false

restecg 靜息心電圖(值0,1,2)

thalach 達到的最大心率

exang 運動誘發的心絞痛(1=yes;0=no)

oldpeak 相對于休息的運動引起的ST值(ST值與心電圖上的位置有關)

slope 運動高峰ST段的坡度

1：upsloping向上傾斜
2： flat持平
3：downsloping向下傾斜

ca The number of major vessels(血管) (0-3)

thal A blood disorder called thalassemia ，一種叫做地中海貧血的血液疾病(3 = normal；6 = fixed defect;；7 = reversable defect)

target 生病沒有(0=no；1=yes)

原notebook中的英文含義；

下面是Peter整理的對應關系。本文中以當前的版本為標準：

字段轉化

轉化編碼

對部分字段進行一一的轉化。以sex字段為例：將數據中的0變成female，1變成male

# 1、sexdf["sex"][df["sex"] == 0] = "female" df["sex"][df["sex"] == 1] = "male"

字段類型轉化

# 指定數據類型 df["sex"] = df["sex"].astype("object") df["cp"] = df["cp"].astype("object") df["fbs"] = df["fbs"].astype("object") df["restecg"] = df["restecg"].astype("object") df["exang"] = df["exang"].astype("object") df["slope"] = df["slope"].astype("object") df["thal"] = df["thal"].astype("object")

生成啞變量

# 生成啞變量 df = pd.get_dummies(df,drop_first=True) df

隨機森林RandomForest

切分數據

# 生成特征變量數據集和因變量數據集 X = df.drop("target",1) y = df["target"]# 切分比例為8：2 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)X_train

建模

rf = RandomForestClassifier(max_depth=5) rf.fit(X_train, y_train)

3個重要屬性

隨機森林中3個重要的屬性：

查看森林中樹的狀況：estimators_
袋外估計準確率得分：oob_score_，必須是oob_score參數選擇True的時候才可用
變量的重要性：feature_importances_

決策樹可視化

在這里我們選擇的第二棵樹的可視化過程：

# 查看第二棵樹的狀況 estimator = rf.estimators_[1]# 全部屬性 feature_names = [i for i in X_train.columns] #print(feature_names) # 指定數據類型 y_train_str = y_train.astype('str') # 0-no 1-disease y_train_str[y_train_str == '0'] = 'no disease' y_train_str[y_train_str == '1'] = 'disease' # 訓練數據的取值 y_train_str = y_train_str.values y_train_str[:5]

繪圖的具體代碼為：

# 繪圖顯示export_graphviz(estimator, # 傳入第二顆樹out_file='tree.dot', # 導出文件名feature_names = feature_names, # 屬性名class_names = y_train_str, # 最終的分類數據rounded = True, proportion = True, label='root',precision = 2, filled = True )from subprocess import call call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])from IPython.display import Image Image(filename = 'tree.png')

決策樹的可視化過程能夠讓我們看到具體的分類過程，但是并不能解決哪些特征或者屬性比較重要。后面會對部分屬性的特征重要性進行探索

模型得分驗證

關于混淆矩陣和使用特異性（specificity）以及靈敏度（sensitivity）這兩個指標來描述分類器的性能：

# 模型預測 y_predict = rf.predict(X_test) y_pred_quant = rf.predict_proba(X_test)[:,1] y_pred_bin = rf.predict(X_test)# 混淆矩陣 confusion_matrix = confusion_matrix(y_test,y_pred_bin) confusion_matrix# 計算sensitivity and specificity total=sum(sum(confusion_matrix)) sensitivity = confusion_matrix[0,0]/(confusion_matrix[0,0]+confusion_matrix[1,0]) specificity = confusion_matrix[1,1]/(confusion_matrix[1,1]+confusion_matrix[0,1])

繪制ROC曲線

fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)fig, ax = plt.subplots() ax.plot(fpr, tpr)ax.plot([0,1],[0,1],transform = ax.transAxes,ls = "--",c = ".3")plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.0])plt.rcParams['font.size'] = 12# 標題 plt.title('ROC Curve') # 兩個軸的名稱 plt.xlabel('False Positive Rate (1 - Specificity)') plt.ylabel('True Positive Rate (Sensitivity)') # 網格線 plt.grid(True)

本案例中的ROC曲線值：

auc(fpr, tpr) # 結果 0.9076923076923078

根據一般ROC曲線的評價標準，案例的表現結果還是不錯的：

0.90 - 1.00 = excellent
0.80 - 0.90 = good
0.70 - 0.80 = fair
0.60 - 0.70 = poor
0.50 - 0.60 = fail

補充知識點：分類器的評價指標

考慮一個二分類的情況，類別為1和0，我們將1和0分別作為正類（positive）和負類（negative），根據實際的結果和預測的結果，則最終的結果有4種，表格如下：

常見的評價指標：

1、ACC：classification accuracy，描述分類器的分類準確率

計算公式為：ACC=(TP+TN)/(TP+FP+FN+TN)

2、BER：balanced error rate
計算公式為：BER=1/2*(FPR+FN/(FN+TP))

3、TPR：true positive rate，描述識別出的所有正例占所有正例的比例
計算公式為：TPR=TP/ (TP+ FN)

4、FPR：false positive rate，描述將負例識別為正例的情況占所有負例的比例
計算公式為：FPR= FP / (FP + TN)

5、TNR：true negative rate，描述識別出的負例占所有負例的比例
計算公式為：TNR= TN / (FP + TN)

6、PPV：Positive predictive value
計算公式為：PPV=TP / (TP + FP)

7、NPV：Negative predictive value
計算公式：NPV=TN / (FN + TN)

其中TPR即為敏感度（sensitivity），TNR即為特異度（specificity）。

來自維基百科的經典圖形：

可解釋性

排列重要性-Permutation Importance

下面的內容是關于機器學習模型的結果可解釋性。首先考察的是每個變量對模型的重要性。重點考量的排列重要性Permutation Importance：

部分依賴圖（ Partial dependence plots ，PDP）

一維PDP

Partial Dependence就是用來解釋某個特征和目標值y的關系的，一般是通過畫出Partial Dependence Plot(PDP)來體現。也就是說PDP在X1的值，就是把訓練集中第一個變量換成X1之后，原模型預測出來的平均值。

重點：查看單個特征和目標值的關系

字段ca

base_features = df.columns.values.tolist() base_features.remove("target")feat_name = 'ca' # ca-num_major_vessels 原文 pdp_dist = pdp.pdp_isolate(model=rf, # 模型dataset=X_test, # 測試集model_features=base_features, # 特征變量；除去目標值 feature=feat_name # 指定單個字段 )pdp.pdp_plot(pdp_dist, feat_name) # 傳入兩個參數 plt.show()

通過下面的圖形我們觀察到：當ca字段增加的時候，患病的幾率在下降。ca字段的含義是血管數量（num_major_vessels），也就是說當血管數量增加的時候，患病率隨之降低

字段age

feat_name = 'age'pdp_dist = pdp.pdp_isolate(model=rf, dataset=X_test, model_features=base_features, feature=feat_name)pdp.pdp_plot(pdp_dist, feat_name) plt.show()

關于年齡字段age，原文的描述：

That’s a bit odd. The higher the age, the lower the chance of heart disease? Althought the blue confidence regions show that this might not be true (the red baseline is within the blue zone).

翻譯：這有點奇怪。年齡越大，患心臟病的幾率越低？盡管藍色置信區間表明這可能不是真的（紅色基線在藍色區域內）

字段oldpeak

feat_name = 'oldpeak'pdp_dist = pdp.pdp_isolate(model=rf, dataset=X_test, model_features=base_features, feature=feat_name)pdp.pdp_plot(pdp_dist, feat_name) plt.show()

oldpeak字段同樣表明：取值越大，患病幾率越低。

這個變量稱之為“相對休息運動引起的ST壓低值”。正常的狀態下，該值越高，患病幾率越高。但是上面的圖像卻顯示了相反的結果。

作者推斷：造成這個結果的原因除了the depression amount，可能還和slope type有關系。原文摘錄如下，于是作者繪制了2D-PDP圖形

Perhaps it’s not just the depression amount that’s important, but the interaction with the slope type? Let’s check with a 2D PDP

2D-PDP圖

查看的是 slope_upsloping 、slope_flat和 oldpeak的關系：

inter1 = pdp.pdp_interact(model=rf, # 模型dataset=X_test, # 特征數據集model_features=base_features, # 特征features=['slope_upsloping', 'oldpeak'])pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['slope_upsloping', 'oldpeak'], plot_type='contour') plt.show()## ------------inter1 = pdp.pdp_interact(model=rf, dataset=X_test,model_features=base_features, features=['slope_flat', 'oldpeak'] )pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['slope_flat', 'oldpeak'], plot_type='contour') plt.show()

從兩張圖形中我們可以觀察到：在oldpeak取值較低的時候，患病幾率都比較高（黃色），這是一個奇怪的現象。于是作者進行了如下的SHAP可視化探索：針對單個變量進行分析。

SHAP可視化

關于SHAP的介紹可以參考文章：https://zhuanlan.zhihu.com/p/83412330 和 https://blog.csdn.net/sinat_26917383/article/details/115400327

SHAP是Python開發的一個"模型解釋"包，可以解釋任何機器學習模型的輸出。下面SHAP使用的部分功能：

Explainer

在SHAP中進行模型解釋之前需要先創建一個explainer，SHAP支持很多類型的explainer，例如deep, gradient, kernel, linear, tree, sampling)。在這個案例我們以tree為例：

# 傳入隨機森林模型rf explainer = shap.TreeExplainer(rf) # 在explainer中傳入特征值的數據，計算shap值 shap_values = explainer.shap_values(X_test) shap_values

Feature Importance

取每個特征SHAP值的絕對值的平均值作為該特征的重要性，得到一個標準的條形圖(multi-class則生成堆疊的條形圖：

結論：能夠直觀地觀察到ca字段的SHAP值最高

summary_plot

summary plot 為每個樣本繪制其每個特征的SHAP值，這可以更好地理解整體模式，并允許發現預測異常值。

每一行代表一個特征，橫坐標為SHAP值
一個點代表一個樣本，顏色表示特征值的高低(紅色高，藍色低)

個體差異

查看單個病人的不同特征屬性對其結果的影響，原文描述為：

Next, let’s pick out individual patients and see how the different variables are affecting their outcomes

def heart_disease_risk_factors(model, patient):explainer = shap.TreeExplainer(model) # 建立explainershap_values = explainer.shap_values(patient) # 計算shape值shap.initjs()return shap.force_plot(explainer.expected_value[1],shap_values[1], patient)

從兩個病人的結果中顯示：

P1：預測準確率高達29%（baseline是57%），更多的因素集中在ca、thal_fixed_defect、oldpeak等藍色部分。
P3：預測準確率高達82%，更多的影響因素在sel_male=0，thalach=143等

通過對比不同的患者，我們是可以觀察到不同病人之間的預測率和主要影響因素。

dependence_plot

為了理解單個feature如何影響模型的輸出，我們可以將該feature的SHAP值與數據集中所有樣本的feature值進行比較：

多樣本可視化探索

下面的圖形是針對多患者的預測和影響因素的探索。在jupyter notebook的交互式作用下，能夠觀察不同的特征屬性對前50個患者的影響：

shap_values = explainer.shap_values(X_train.iloc[:50]) shap.force_plot(explainer.expected_value[1], shap_values[1], X_test.iloc[:50])

總結

以上是生活随笔為你收集整理的kaggle项目：基于随机森林模型的心脏病人预测分类的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： hive中任意相邻时间段数据获取
下一篇：知识点025-服务器的基础优化脚本