Column distribution differences of all features across classes and between train and test (plotting)

Published: 2023/12/20

The code comes from:

https://www.kaggle.com/ragnar123/e-d-a-and-baseline-mix-lgbm

The original code has bugs, which have been fixed in the version below.

－－－－－－－－－－ Probability distribution differences of all features between classes －－－－－－－－－－

The code is as follows:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_feature_distribution(df1, df2, label1, label2, features):
    fail_list = []  # features whose density plot could not be drawn
    sns.set_style('whitegrid')
    # 68 rows of 5 subplots each (340 slots), enough for all 339 features.
    fig, ax = plt.subplots(68, 5, figsize=(18, 220))
    for i, feature in enumerate(features, start=1):
        print("feature=", feature)
        plt.subplot(68, 5, i)
        try:
            # bw is the older seaborn name for the bandwidth argument;
            # seaborn 0.11 and later call it bw_method.
            sns.kdeplot(np.log(df1[feature]), bw=0.5, label=label1)
            sns.kdeplot(np.log(df2[feature]), bw=0.5, label=label2)
            plt.xlabel(feature, fontsize=9)
            plt.legend(fontsize=6)
            plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
            plt.tick_params(axis='y', which='major', labelsize=6)
        except Exception:
            print("Failed to plot feature %s" % feature)
            fail_list.append(feature)
    print("Features that failed to plot:", fail_list)
    plt.show()

# train is assumed to be an already loaded pandas DataFrame
# with an isFraud column and feature columns V1 to V339.
V = [f"V{i + 1}" for i in range(339)]

# Distribution of each feature in train for isFraud == 0 versus isFraud == 1.
t0 = train[train['isFraud'] == 0]
t1 = train[train['isFraud'] == 1]
first = V[0:339]
print("first=", first)
plot_feature_distribution(t0, t1, '0', '1', first)

The result is:

The figure above contains 339 subplots in total, but because CSDN limits image size, only a small part of it is shown here.
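The plotting code hard-codes a grid of 68 rows and 5 columns for the 339 subplots. As a minimal sketch, the grid could instead be derived from the number of features. The helper name `grid_shape` and its default of 5 columns are assumptions for illustration, not part of the original code.

```python
import math

def grid_shape(n_plots, n_cols=5):
    """Return (rows, cols) of the smallest grid with n_cols columns that holds n_plots."""
    if n_plots < 1 or n_cols < 1:
        raise ValueError("n_plots and n_cols must be positive")
    # Round up so a partially filled last row is still counted.
    return math.ceil(n_plots / n_cols), n_cols

rows, cols = grid_shape(339)
# rows, cols, and the number of unused slots in the last row
print(rows, cols, rows * cols - 339)  # 68 5 1
```

The resulting pair could be passed to `plt.subplots(rows, cols, ...)` in place of the literal `68, 5`, so the grid keeps fitting if the feature list changes.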

Of the 339 V features, V27, V28, V68 and V89 failed to plot (this depends on the values those features take).
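The source does not say which values cause the failures. Two plausible causes, stated here as assumptions, are non-positive values (whose logarithm is negative infinity or undefined) and a constant column (zero variance, for which a bandwidth cannot be estimated). The sketch below is a dependency-free heuristic check; the function name `log_kde_issues` is hypothetical, and it assumes missing values appear as NaN.

```python
import math

def log_kde_issues(values):
    """List likely reasons a KDE of log(values) could fail (heuristic, not exhaustive)."""
    # Drop missing values, assumed to be encoded as NaN.
    clean = [float(v) for v in values if not math.isnan(float(v))]
    if not clean:
        return ["no non-missing values"]
    issues = []
    if any(v <= 0 for v in clean):
        issues.append("non-positive values: log is -inf or undefined")
    if len(set(clean)) == 1:
        issues.append("constant column: zero variance")
    return issues

print(log_kde_issues([0.0, 1.0, 2.0]))
print(log_kde_issues([1.0, 1.0, float("nan")]))
```

Because the function only iterates over its argument, a column such as `train['V27']` could be passed directly to see whether either condition applies.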

－－－－－－－－－－ Probability distribution differences of all features between train and test －－－－－－－－－－

Next, still using plot_feature_distribution from above, we only change the arguments passed in:

t0 = train
t1 = test
first = V[0:339]
print("first=", first)
plot_feature_distribution(t0, t1, 'train', 'test', first)

The result is:

Again there are 339 subplots; because of CSDN's image upload size limit, only a small part is captured here.

＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃ Supplement ＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃＃

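The plots above are rather small, so here is code that draws larger ones. The trade-off is that the large-plot code cannot fit all 339 features into a single figure; it draws one figure per feature.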

The code is as follows:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_feature_one(train, test, feature, log, fail_list):
    # Optionally log-transform the values before estimating the density.
    transform = np.log if log else (lambda s: s)
    df1_0 = train[train['isFraud'] == 0]
    df1_1 = train[train['isFraud'] == 1]
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(13, 9))

    # Top panel: distribution difference between the two classes.
    # bw and shade are the older seaborn argument names;
    # seaborn 0.11 and later call them bw_method and fill.
    try:
        sns.kdeplot(transform(df1_0[feature]), bw=0.001, shade=True, label='Not Fraud', ax=ax1)
        sns.kdeplot(transform(df1_1[feature]), bw=0.001, shade=True, label='Fraud', ax=ax1)
        ax1.set_title(feature, fontsize='large', fontweight='bold')
        ax1.legend()
    except Exception:
        fail_list.append(feature)

    # Bottom panel: distribution difference between train and test.
    try:
        sns.kdeplot(transform(train[feature]), bw=0.001, shade=True, label='Train', ax=ax2)
        sns.kdeplot(transform(test[feature]), bw=0.001, shade=True, label='Test', ax=ax2)
        ax2.set_title(feature, fontsize='large', fontweight='bold')
        ax2.legend()
    except Exception:
        # Record each failing feature only once.
        if feature not in fail_list:
            fail_list.append(feature)
    plt.show()

# train and test are assumed to be already loaded pandas DataFrames.
V = [f"V{i + 1}" for i in range(339)]
fail_list = []
for feature in V[0:339]:
    print(feature)
    plot_feature_one(train, test, feature, True, fail_list)
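Between one figure with 339 tiny subplots and 339 separate figures, a possible middle ground is to draw a fixed number of features per figure. The sketch below only shows the batching step; the helper name `chunk_features` and the page size of 20 are assumptions for illustration and do not come from the original code.

```python
def chunk_features(features, per_figure=20):
    """Yield consecutive slices of at most per_figure feature names."""
    if per_figure < 1:
        raise ValueError("per_figure must be positive")
    for start in range(0, len(features), per_figure):
        yield features[start:start + per_figure]

V = [f"V{i + 1}" for i in range(339)]
pages = list(chunk_features(V))
# Number of figures, and how many features land on the last one.
print(len(pages), len(pages[-1]))  # 17 19
```

Each page could then be drawn on a grid sized for 20 subplots (for example 4 rows of 5) rather than the fixed 68 by 5 grid, which keeps every subplot readable.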
