

[Kaggle] Heart Disease Prediction


Contents

    • 1. Data Exploration
    • 2. Feature Processing Pipeline
    • 3. Model Training
    • 4. Prediction

Kaggle project page

1. Data Exploration

import pandas as pd

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

train.info()
test.info()
abs(train.corr()['target']).sort_values(ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       241 non-null    int64
 1   sex       241 non-null    int64
 2   cp        241 non-null    int64
 3   trestbps  241 non-null    int64
 4   chol      241 non-null    int64
 5   fbs       241 non-null    int64
 6   restecg   241 non-null    int64
 7   thalach   241 non-null    int64
 8   exang     241 non-null    int64
 9   oldpeak   241 non-null    float64
 10  slope     241 non-null    int64
 11  ca        241 non-null    int64
 12  thal      241 non-null    int64
 13  target    241 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 26.5 KB

The training set has 241 rows and 13 features (all numeric); the label column is target.

  • Correlation coefficients between each feature and the label:

target      1.000000
cp          0.457688
exang       0.453784
ca          0.408107
thalach     0.390346
oldpeak     0.389787
slope       0.334991
thal        0.324611
sex         0.281272
age         0.242338
restecg     0.196018
chol        0.170592
trestbps    0.154086
fbs         0.035450
Name: target, dtype: float64
  • Inspect each feature's unique values:

for col in train.columns:
    print(col)
    print(train[col].unique())

age [37 41 56 44 52 57 54 48 64 50 66 43 69 42 61 71 59 65 46 51 45 47 53 63 58 35 62 29 55 60 68 39 34 67 74 49 76 70 38 77 40]
sex [1 0]
cp [2 1 0 3]
trestbps [130 140 120 172 150 110 160 125 142 135 155 104 138 128 108 134 122 115 118 100 124 94 112 102 152 101 132 178 129 136 106 156 170 117 145 180 165 192 144 123 126 154 148 114 164]
chol [250 204 294 263 199 168 239 275 211 219 226 247 233 243 302 212 177 273 304 232 269 360 308 245 208 235 257 216 234 141 252 201 222 260 303 265 309 186 203 183 220 209 258 227 261 221 205 318 298 277 197 214 248 255 207 223 160 394 315 270 195 240 196 244 254 126 313 262 215 193 271 268 267 210 295 178 242 180 228 149 253 342 157 175 286 229 256 224 206 230 276 353 225 330 290 266 172 305 188 282 185 326 274 164 307 249 341 407 217 174 281 288 289 246 322 299 300 293 184 409 283 259 200 327 237 319 166 218 335 169 187 176 241 264 236]
fbs [0 1]
restecg [1 0 2]
thalach [187 172 153 173 162 174 160 139 144 158 114 171 151 179 178 137 157 140 152 170 165 148 142 180 156 115 175 186 185 159 130 190 132 182 143 163 147 154 202 161 166 164 184 122 168 169 138 111 145 194 131 133 155 167 192 121 96 126 105 181 116 149 150 125 108 129 112 128 109 113 99 177 141 146 136 127 103 124 88 120 195 95 117 71 118 134 90 123]
exang [0 1]
oldpeak [3.5 1.4 1.3 0.  0.5 1.6 1.2 0.2 1.8 2.6 1.5 0.4 1.  0.8 3.  0.6 2.4 0.1 1.9 4.2 1.1 2.  0.7 0.3 0.9 2.3 3.6 3.2 2.2 2.8 3.4 6.2 4.  5.6 2.1 4.4]
slope [0 2 1]
ca [0 2 1 4 3]
thal [2 3 0 1]
target [1 0]
  • Some features are categorical rather than ordinal, so convert them to strings (they will be one-hot encoded later):

object_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']

def strfeatures(data):
    data_ = data.copy()
    for col in object_cols:
        data_[col] = data_[col].astype(str)
    return data_

train_ = strfeatures(train)
test_ = strfeatures(test)

2. Feature Processing Pipeline

  • Separate numeric and categorical features:

def num_cat_split(data):
    s = (data.dtypes == 'object')
    object_cols = list(s[s].index)
    num_cols = list(set(data.columns) - set(object_cols))
    return num_cols, object_cols

num_cols, object_cols = num_cat_split(train_)
num_cols.remove('target')
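A quick sanity check (a small sketch added here, not in the original post) confirms what the split produces: the five string-typed columns from the previous step, plus the eight remaining numeric features.

# Added sketch: inspect the result of num_cat_split.
print(object_cols)  # ['cp', 'restecg', 'slope', 'ca', 'thal']
print(num_cols)     # the 8 remaining numeric columns (order varies because a set is used)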
  • Hold out part of the data for local validation:

# Local testing: stratified sampling to split the training and validation sets
from sklearn.model_selection import StratifiedShuffleSplit

splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
for train_idx, valid_idx in splt.split(train_, train_['target']):
    train_part = train_.loc[train_idx]
    valid_part = train_.loc[valid_idx]

train_part_y = train_part['target']
valid_part_y = valid_part['target']
train_part = train_part.drop(['target'], axis=1)
valid_part = valid_part.drop(['target'], axis=1)
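Because the dataset is small, it is worth confirming that stratification preserved the label ratio in both parts; a minimal check (added here, not in the original) is:

# Added sketch: both parts should show nearly identical class proportions.
print(train_part_y.value_counts(normalize=True))
print(valid_part_y.value_counts(normalize=True))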
  • Data processing pipeline:

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given columns from a DataFrame and return a NumPy array."""
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_name].values

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_cols)),
    # ('imputer', SimpleImputer(strategy='median')),
    # ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(object_cols)),
    # note: sklearn >= 1.2 renames `sparse` to `sparse_output`
    ('cat_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
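The pipeline is only used inside the model pipelines below, but as a standalone sketch (not in the original post) it can be fitted on the training split and reused on the validation split; the shapes assume every category appears in the training part:

# Added sketch: fit the encoders on the training split, then reuse them.
X_tr = full_pipeline.fit_transform(train_part)  # (192, 27): 8 numeric + 19 one-hot columns
X_va = full_pipeline.transform(valid_part)      # (49, 27); unseen categories encode as all zeros
print(X_tr.shape, X_va.shape)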

3. Model Training

# Local testing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()
svc = SVC()
gbdt = GradientBoostingClassifier()
perceptron = Perceptron()

models = [perceptron, knn, lr, svc, rf, gbdt]
param_grid_list = [
    # perceptron
    [{'model__max_iter': [10000, 5000]}],
    # knn
    [{'model__n_neighbors': [3, 5, 10, 15, 35],
      'model__leaf_size': [3, 5, 10, 20, 30, 40, 50]}],
    # lr (note: 'l1' needs a solver such as 'liblinear'; with the default
    # lbfgs solver those fits fail and are scored NaN, so 'l2' wins here)
    [{'model__penalty': ['l1', 'l2'],
      'model__C': [0.05, 0.1, 0.2, 0.5, 1, 1.2],
      'model__max_iter': [50000]}],
    # svc
    [{'model__degree': [3, 5, 7],
      'model__C': [0.2, 0.5, 1, 1.2, 1.5],
      'model__kernel': ['rbf', 'sigmoid', 'poly']}],
    # rf
    [{
      # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': [100, 200, 250, 300, 350],
      'model__max_features': [5, 8, 10, 12, 15, 20, 30, 40, 50],
      'model__max_depth': [3, 5, 7]}],
    # gbdt
    [{'model__learning_rate': [0.02, 0.05, 0.1, 0.2],
      'model__n_estimators': [30, 50, 100, 150],
      'model__max_features': [5, 8, 10, 20, 30, 40],
      'model__max_depth': [3, 5, 7],
      'model__min_samples_split': [10, 20, 40],
      'model__min_samples_leaf': [5, 10, 20],
      'model__subsample': [0.5, 0.8, 1]}],
]

for i, model in enumerate(models):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model)])
    grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,
                               scoring='accuracy', verbose=2, n_jobs=-1)
    grid_search.fit(train_part, train_part_y)
    print(grid_search.best_params_)
    final_model = grid_search.best_estimator_
    pred = final_model.predict(valid_part)
    print('accuracy score: ', accuracy_score(valid_part_y, pred))

Fitting 3 folds for each of 2 candidates, totalling 6 fits
{'model__max_iter': 10000}
accuracy score:  0.4489795918367347

Fitting 3 folds for each of 35 candidates, totalling 105 fits
{'model__leaf_size': 3, 'model__n_neighbors': 3}
accuracy score:  0.5306122448979592

Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'model__C': 0.1, 'model__max_iter': 50000, 'model__penalty': 'l2'}
accuracy score:  0.8979591836734694

Fitting 3 folds for each of 45 candidates, totalling 135 fits
{'model__C': 1, 'model__degree': 5, 'model__kernel': 'poly'}
accuracy score:  0.6326530612244898

Fitting 3 folds for each of 135 candidates, totalling 405 fits
{'model__max_depth': 5, 'model__max_features': 5, 'model__n_estimators': 250}
accuracy score:  0.8775510204081632

Fitting 3 folds for each of 7776 candidates, totalling 23328 fits
{'model__learning_rate': 0.05, 'model__max_depth': 7, 'model__max_features': 20, 'model__min_samples_leaf': 10, 'model__min_samples_split': 40, 'model__n_estimators': 150, 'model__subsample': 0.5}
accuracy score:  0.8163265306122449

LR, RF, and GBDT perform best.
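Before retraining on the full data, an optional cross-check (a sketch added here, not in the original post; it reuses the imports, estimator classes, and full_pipeline defined above) is to score the three front-runners with 5-fold cross-validation on all labeled data, plugging in the best hyperparameters found above:

# Added sketch: 5-fold CV of the three strongest model families on all labeled data.
from sklearn.model_selection import cross_val_score

X_all = train_.drop(['target'], axis=1)
y_all = train_['target']

candidates = [
    ('lr',   LogisticRegression(C=0.1, penalty='l2', max_iter=50000)),
    ('rf',   RandomForestClassifier(n_estimators=250, max_features=5, max_depth=5)),
    ('gbdt', GradientBoostingClassifier(learning_rate=0.05, n_estimators=150,
                                        max_depth=7, max_features=20,
                                        min_samples_leaf=10, min_samples_split=40,
                                        subsample=0.5)),
]
for label, model in candidates:
    pipe = Pipeline([('preparation', full_pipeline), ('model', model)])
    scores = cross_val_score(pipe, X_all, y_all, cv=5, scoring='accuracy')
    print(label, round(scores.mean(), 4))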

4. Prediction

# Train on the full data and generate submission files,
# using randomized parameter search
y_train = train_['target']
X_train = train_.drop(['target'], axis=1)
X_test = test_

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import numpy as np

select_model = [lr, rf, gbdt]
name = ['lr', 'rf', 'gbdt']
param_distribs = [
    # lr
    [{'model__penalty': ['l1', 'l2'],
      'model__C': np.linspace(0.01, 0.5, 10),
      'model__max_iter': [50000]}],
    # rf
    [{
      # 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
      'model__n_estimators': randint(low=50, high=500),
      'model__max_features': randint(low=3, high=30),
      'model__max_depth': randint(low=2, high=20)}],
    # gbdt
    [{'model__learning_rate': np.linspace(0.01, 0.3, 10),
      'model__n_estimators': randint(low=30, high=500),
      'model__max_features': randint(low=5, high=50),
      'model__max_depth': randint(low=3, high=20),
      'model__min_samples_split': randint(low=10, high=100),
      'model__min_samples_leaf': randint(low=3, high=50),
      # subsample must lie in (0, 1]; the original upper bound of 1.5 would raise
      'model__subsample': np.linspace(0.5, 1.0, 10)}],
]

for i, model in enumerate(select_model):
    pipe = Pipeline([
        ('preparation', full_pipeline),
        ('model', model)])
    rand_search = RandomizedSearchCV(pipe, param_distributions=param_distribs[i],
                                     cv=5, n_iter=1000, scoring='accuracy',
                                     verbose=2, n_jobs=-1)
    rand_search.fit(X_train, y_train)
    print(rand_search.best_params_)
    final_model = rand_search.best_estimator_
    pred = final_model.predict(X_test)
    print(model, "\nFINISH !!!")
    res = pd.DataFrame()
    res['Id'] = range(1, 63, 1)
    res['Prediction'] = pred
    res.to_csv('{}_pred.csv'.format(name[i]), index=False)

The submission results are as follows.

[Submission results screenshot]
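One optional follow-up, not in the original write-up, is to majority-vote the three submission files, which can smooth out single-model noise on a dataset this small:

# Hypothetical extension: combine the three submissions by majority vote.
import pandas as pd

preds = [pd.read_csv('{}_pred.csv'.format(m))['Prediction'] for m in ['lr', 'rf', 'gbdt']]
vote = (sum(preds) >= 2).astype(int)  # predict 1 when at least 2 of the 3 models agree

res = pd.DataFrame({'Id': range(1, 63), 'Prediction': vote})
res.to_csv('vote_pred.csv', index=False)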

