

[Algorithm Competition Learning] Financial Risk Control: Loan Default Prediction - Data Analysis


Task 2: Data Analysis

This is the Task 2 (data analysis) part of the beginner-level financial risk control track. It walks you through getting to know the data and becoming familiar with it in preparation for the feature engineering that follows. Questions and discussion are welcome.

Competition: Introductory Data Mining - Financial Risk Control: Loan Default Prediction

Objectives:

  • 1. The main value of EDA lies in getting familiar with the basic condition of the whole dataset (missing values, outliers) and verifying that it is fit for the machine-learning or deep-learning modeling that follows.
  • 2. Understand the relationships among variables and between each variable and the target.
  • 3. Prepare for feature engineering.

Project repository: https://github.com/datawhalechina/team-learning-data-mining/tree/master/FinancialRiskControl

Competition page: https://tianchi.aliyun.com/competition/entrance/531830/introduction

2.1 Learning Objectives

  • Learn how to analyze the overall profile of a dataset, including its basic condition (missing values, outliers)
  • Learn to understand the relationships among variables and between variables and the target
  • Complete the corresponding check-in task

2.2 Outline

  • Overall view of the data:
    • Read the dataset and check its size and the original feature dimensions;
    • Get familiar with the data types via info;
    • Take a rough look at the basic statistics of each feature;
  • Missing values and unique values:
    • Check the missing values
    • Check features with a single unique value
  • Digging into the data: data types
    • Categorical data
    • Numeric data
      • Discrete numeric data
      • Continuous numeric data
  • Correlations in the data
    • Relationships between features
    • Relationships between features and the target variable
  • Generate a data report with pandas_profiling

2.3 Code Examples

2.3.1 Import the libraries needed for analysis and visualization

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')

All of the libraries above can be installed with pip install. If both Python 2 and Python 3 exist on your machine and it is unclear which one pip points to, use pip3 install instead, or install directly inside a notebook with '!pip3 install ****'.

Note:

This exploration, and the visualization part in particular, picks specific variables as examples; it is a demonstration of methods, not a complete analysis solution for the whole competition dataset.

2.3.2 Read the files

data_train = pd.read_csv('./train.csv')
data_test_a = pd.read_csv('./testA.csv')

2.3.2.1 Extra notes on reading files

  • If loading by relative path fails in pandas, try os.getcwd() to check the current working directory.
  • Difference between TSV and CSV:
    • As the names say, TSV uses the tab character (Tab, '\t') as the field separator, while CSV uses the half-width comma (',');
    • Python's support for TSV files:
      Strictly speaking, Python's csv module should be called a dsv module, because it actually supports delimiter-separated values (DSV) files in general.
      The delimiter parameter defaults to the half-width comma, i.e. by default the file being processed is treated as CSV; with delimiter='\t' it is treated as TSV.
  • Reading only part of a file (useful when the file is very large)
    • Use the nrows parameter to read only the first n rows; nrows is an integer >= 0.
    • Chunked reading
data_train_sample = pd.read_csv("./train.csv", nrows=5)
# set the chunksize parameter to control how much data each iteration yields
chunker = pd.read_csv("./train.csv", chunksize=5)
for item in chunker:
    print(type(item))  # <class 'pandas.core.frame.DataFrame'>
    print(len(item))   # 5
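The delimiter point above can be sketched with the standard csv module. The in-memory sample data below is invented for illustration; with pandas the equivalent is read_csv(..., sep='\t').

```python
import csv
import io

# invented TSV content: three columns, two data rows, tab-separated
tsv_text = "id\tloanAmnt\tterm\n1\t35000.0\t5\n2\t18000.0\t5\n"

# delimiter defaults to ',', so pass '\t' explicitly to parse TSV
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
rows = list(reader)
print(rows[0])    # ['id', 'loanAmnt', 'term']
print(len(rows))  # 3
```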

2.3.3 Overall view

Check the number of samples and the original feature dimensions of the datasets

data_test_a.shape
(200000, 48)
data_train.shape
(800000, 47)
data_train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')

Look at the concrete column names. The feature meanings were already given in the problem-understanding section; they are repeated here for easier reading:

  • id: unique credit identifier assigned to the loan listing
  • loanAmnt: loan amount
  • term: loan term (years)
  • interestRate: loan interest rate
  • installment: installment payment amount
  • grade: loan grade
  • subGrade: sub-grade within the loan grade
  • employmentTitle: employment title
  • employmentLength: employment length (years)
  • homeOwnership: home-ownership status provided by the borrower at registration
  • annualIncome: annual income
  • verificationStatus: verification status
  • issueDate: month the loan was issued
  • purpose: loan purpose category provided by the borrower in the application
  • postCode: first 3 digits of the postal code provided by the borrower in the application
  • regionCode: region code
  • dti: debt-to-income ratio
  • delinquency_2years: number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
  • ficoRangeLow: lower bound of the borrower's FICO range at loan origination
  • ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
  • openAcc: number of open credit lines in the borrower's credit file
  • pubRec: number of derogatory public records
  • pubRecBankruptcies: number of public record bankruptcies
  • revolBal: total revolving credit balance
  • revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower uses relative to all available revolving credit
  • totalAcc: total number of credit lines currently in the borrower's credit file
  • initialListStatus: initial listing status of the loan
  • applicationType: whether the loan is an individual application or a joint application with two co-borrowers
  • earliesCreditLine: month the borrower's earliest reported credit line was opened
  • title: loan title provided by the borrower
  • policyCode: publicly available policy_code = 1; new products not publicly available policy_code = 2
  • n-series anonymous features: n0-n14, processed counting features of borrower behavior

Use info() to get familiar with the data types

data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  800000 non-null  int64
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object
 6   subGrade            800000 non-null  object
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object
 9   homeOwnership       800000 non-null  int64
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64
 12  issueDate           800000 non-null  object
 13  isDefault           800000 non-null  int64
 14  purpose             800000 non-null  int64
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64
 28  applicationType     800000 non-null  int64
 29  earliesCreditLine   800000 non-null  object
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n2.1                759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

Take a rough overall look at the basic statistics of each feature in the dataset

data_train.describe()

(output truncated here to the first five numeric columns; the full table is 8 rows × 42 columns)

                  id       loanAmnt           term   interestRate    installment
count  800000.000000  800000.000000  800000.000000  800000.000000  800000.000000
mean   399999.500000   14416.818875       3.482745      13.238391     437.947723
std    230940.252015    8716.086178       0.855832       4.765757     261.460393
min         0.000000     500.000000       3.000000       5.310000      15.690000
25%    199999.750000    8000.000000       3.000000       9.750000     248.450000
50%    399999.500000   12000.000000       3.000000      12.740000     375.135000
75%    599999.250000   20000.000000       3.000000      15.990000     580.710000
max    799999.000000   40000.000000       5.000000      30.990000    1715.420000

8 rows × 42 columns

data_train.head(3).append(data_train.tail(3))

(output: the first 3 and last 3 rows of the table; e.g. row 0 has loanAmnt 35000.0, term 5, interestRate 19.52, installment 917.97, grade E, subGrade E2, employmentLength 2 years, and row 799999 has loanAmnt 9000.0, term 3, interestRate 11.06, grade B, subGrade B3, employmentLength 5 years)

6 rows × 47 columns

2.3.4 Check missing values, unique values, etc. in the dataset

Check the missing values

print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
There are 22 columns in train dataset with missing values.

The output above shows that 22 columns of the training set have missing values; next, check which of those features are missing in more than 50% of the rows.

have_null_fea_dict = (data_train.isnull().sum() / len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key, value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
fea_null_moreThanHalf
{}

Look at the specific missing features and their missing rates

# visualize the NaNs
missing = data_train.isnull().sum() / len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1229ab890>

  • Vertically: find out which columns contain NaN and print the NaN counts. The main goal is to see whether any single column really has a very large number of NaNs: if a column is mostly NaN, it contributes almost nothing to the label and can be considered for deletion; if only a few values are missing, imputation is usually a reasonable choice.
  • Horizontally: if most of the columns of certain samples are missing, and there are enough samples overall, those rows can be considered for deletion.

Tips:
The competition workhorse LightGBM handles missing values automatically; Task 4 on modeling will cover the model in detail!
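The "horizontal" check above can be sketched as follows. The toy DataFrame and the 0.5 threshold are assumptions chosen for illustration, not values from the competition data.

```python
import numpy as np
import pandas as pd

# toy data: row 0 is missing 2 of 3 columns, row 1 all 3, row 2 none
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 2.0],
    'c': [np.nan, np.nan, 1.0],
})

# per-row missing ratio: count NaNs along axis=1 and divide by column count
row_missing_ratio = df.isnull().sum(axis=1) / df.shape[1]

# keep only rows where fewer than half the columns are missing
kept = df[row_missing_ratio < 0.5]
print(kept.shape[0])  # 1 — only row 2 survives
```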

Check which features take only a single value in the train and test sets

one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea
['policyCode']
one_value_fea_test
['policyCode']
print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.

Summary:

Of the 47 columns, 22 have missing data, which is quite normal in the real world. 'policyCode' has a single unique value (or is entirely missing). There are many continuous variables and some categorical variables.
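Following up on that summary, a feature such as 'policyCode' with a single unique value carries no information for modeling. Below is a minimal sketch of detecting and dropping such constant columns; the toy DataFrame is invented apart from policyCode's constant value 1.0 seen in the data.

```python
import pandas as pd

# toy frame: policyCode is constant, loanAmnt varies (values invented)
df = pd.DataFrame({
    'policyCode': [1.0, 1.0, 1.0],
    'loanAmnt': [35000.0, 18000.0, 12000.0],
})

# nunique() <= 1 catches constant columns, and all-NaN columns as well
const_cols = [c for c in df.columns if df[c].nunique() <= 1]
df_reduced = df.drop(columns=const_cols)
print(const_cols)                 # ['policyCode']
print(list(df_reduced.columns))   # ['loanAmnt']
```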

2.3.5 Check which features are numeric and which are object type

  • 特征一般都是由類別型特征和數值型特征組成,而數值型特征又分為連續型和離散型。
  • 類別型特征有時具有非數值關系,有時也具有數值關系。比如‘grade’中的等級A,B,C等,是否只是單純的分類,還是A優于其他要結合業務判斷。
  • 數值型特征本是可以直接入模的,但往往風控人員要對其做分箱,轉化為WOE編碼進而做標準評分卡等操作。從模型效果上來看,特征分箱主要是為了降低變量的復雜性,減少變量噪音對模型的影響,提高自變量和因變量的相關度。從而使模型更加穩定。
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea, list(data_train.columns)))
numerical_fea
['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
data_train.grade
0         E
1         D
2         D
3         A
4         C
         ..
799995    C
799996    A
799997    C
799998    A
799999    B
Name: grade, Length: 800000, dtype: object
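The binning + WOE idea mentioned above can be sketched as follows. This is only an illustrative toy, not the competition pipeline: the data, the two-bin split via pd.qcut, and the formula WOE_i = ln((bad_i/bad_total)/(good_i/good_total)) are all chosen for the example.

```python
import numpy as np
import pandas as pd

# invented mini-dataset: loan amounts with a 0/1 default label
df = pd.DataFrame({
    'loanAmnt':  [1000, 2000, 3000, 8000, 9000, 10000, 15000, 20000],
    'isDefault': [0,    0,    1,    0,    1,    1,     1,     0],
})

# equal-frequency binning into 2 bins (split at the median, 8500)
df['bin'] = pd.qcut(df['loanAmnt'], q=2, duplicates='drop')

# per-bin bad/good counts
grouped = df.groupby('bin', observed=True)['isDefault'].agg(bad='sum', count='count')
grouped['good'] = grouped['count'] - grouped['bad']

# WOE per bin: ln of (bad share of the bin) over (good share of the bin)
woe = np.log((grouped['bad'] / grouped['bad'].sum()) /
             (grouped['good'] / grouped['good'].sum()))
print(woe)  # low-amount bin: -ln(3); high-amount bin: +ln(3)
```

A positive WOE here means the bin is riskier than average; a standard scorecard would then replace each raw value with its bin's WOE.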

Analysis of numeric variables: they necessarily include both continuous and discrete ones, so separate the two.

  • Split the numeric variables into continuous and discrete ones
# split out the discrete (category-like) numeric features
def get_numerical_serial_fea(data, feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea

numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(data_train, numerical_fea)
numerical_serial_fea
['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
numerical_noserial_fea
['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
  • Discrete numeric variable analysis
data_train['term'].value_counts()  # discrete variable
3    606902
5    193098
Name: term, dtype: int64
data_train['homeOwnership'].value_counts()  # discrete variable
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64
data_train['verificationStatus'].value_counts()  # discrete variable
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64
data_train['initialListStatus'].value_counts()  # discrete variable
0    466438
1    333562
Name: initialListStatus, dtype: int64
data_train['applicationType'].value_counts()  # discrete variable
0    784586
1     15414
Name: applicationType, dtype: int64
data_train['policyCode'].value_counts()  # discrete variable; useless, a single value throughout
1.0    800000
Name: policyCode, dtype: int64
data_train['n11'].value_counts()  # discrete variable; extremely imbalanced, analyze further before deciding whether to use it
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64
data_train['n12'].value_counts()  # discrete variable; extremely imbalanced, analyze further before deciding whether to use it
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64
  • Continuous numeric variable analysis
# visualize the distribution of every numeric feature
f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

  • Check the distribution of a given numeric variable and whether it follows a normal distribution; if it does not, apply a log transform and check again whether it is approximately normal.
  • If you want to standardize a batch of variables in one pass, you must first set aside the ones that have already been normalized.
  • Why normalize: in some situations it lets models converge faster, and some models require (approximately) normal data (e.g. GMM, KNN). The key is to keep the data from being too skewed, since heavy skew can hurt the model's predictions.
# Plotting Transaction Amount Values Distribution
plt.figure(figsize=(16, 12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)

plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
Text(0, 0.5, 'Probability')

  • Non-numeric categorical variable analysis
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
data_train['grade'].value_counts()
B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64
data_train['subGrade'].value_counts()
C1    50763
B4    49516
B5    48965
B3    48600
C2    47068
C3    44751
C4    44272
B2    44227
B1    42382
C5    40264
A5    38045
A4    30928
D1    30538
D2    26528
A1    25909
D3    23410
A3    22655
A2    22124
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64
data_train['employmentLength'].value_counts()
10+ years    262753
2 years       72358
< 1 year      64237
3 years       64152
1 year        52489
5 years       50102
4 years       47985
6 years       37254
8 years       36192
7 years       35407
9 years       30272
Name: employmentLength, dtype: int64
data_train['issueDate'].value_counts()
2016-03-01    29066
2015-10-01    25525
2015-07-01    24496
2015-12-01    23245
2014-10-01    21461
              ...
2007-08-01       23
2007-07-01       21
2008-09-01       19
2007-09-01        7
2007-06-01        1
Name: issueDate, Length: 139, dtype: int64
data_train['earliesCreditLine'].value_counts()
Aug-2001    5567
Sep-2003    5403
Aug-2002    5403
Oct-2001    5258
Aug-2000    5246
            ...
May-1960       1
Apr-1958       1
Feb-1960       1
Aug-1946       1
Mar-1958       1
Name: earliesCreditLine, Length: 720, dtype: int64
data_train['isDefault'].value_counts()
0    640390
1    159610
Name: isDefault, dtype: int64

Summary:

  • Above we inspected the distributions of feature values with value_counts() and similar functions, but charts are the most convenient way to summarize raw information.
  • Numbers without visual form give little intuition.
  • The same dataset, portrayed at different scales, reveals different patterns. Python turns the data into charts, but you are responsible for whether the conclusions drawn from them are correct.

2.3.6 Visualizing variable distributions

Visualizing the distribution of a single variable

plt.figure(figsize=(8, 8))
sns.barplot(data_train["employmentLength"].value_counts(dropna=False)[:20],
            data_train["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()

Visualizing the distribution of a feature x split by different values of y

  • First, look at the distribution of the categorical variables across different y values
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()

  • Next, look at the distribution of the continuous variables across different y values
fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
data_train.loc[data_train['isDefault'] == 1] \
    ['loanAmnt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Fraud',
          color='r',
          xlim=(-3, 10),
          ax=ax1)
data_train.loc[data_train['isDefault'] == 0] \
    ['loanAmnt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Not Fraud',
          color='b',
          xlim=(-3, 10),
          ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x126a44b50>

total = len(data_train)
total_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(12, 5))
plt.subplot(121)  # 1 row, 2 columns of subplots; this draws the first one
plot_tr = sns.countplot(x='isDefault', data=data_train)  # count of each class of data_train's 'isDefault'
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x() + p.get_width() / 2.,
                 height + 3,
                 '{:1.2f}%'.format(height / total * 100),
                 ha="center", fontsize=15)

percent_amt = (data_train.groupby(['isDefault'])['loanAmnt'].sum())
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x() + p.get_width() / 2.,
                   height + 3,
                   '{:1.2f}%'.format(height / total_amt * 100),
                   ha="center", fontsize=15)

2.3.7 Processing and inspecting time-format data

# convert to datetime; the issueDateDT feature is the number of days between
# issueDate and the earliest date in the dataset (2007-06-01)
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x - startdate).dt.days

# convert to datetime (note: the original read from data_train here by mistake)
data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x - startdate).dt.days

plt.hist(data_train['issueDateDT'], label='train')
plt.hist(data_test_a['issueDateDT'], label='test')
plt.legend()
plt.title('Distribution of issueDateDT dates')
# the train and test issueDateDT ranges overlap, so using a time-based split for validation would be unwise
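By the same logic as issueDate above, earliesCreditLine (strings such as 'Aug-2001' in the earlier value_counts output) can be parsed with an explicit format string; a small sketch, with the sample values taken from that output:

```python
import pandas as pd

# sample month strings in the same 'Mon-YYYY' shape as earliesCreditLine
s = pd.Series(['Aug-2001', 'Sep-2003', 'Mar-1958'])

# '%b-%Y' matches the abbreviated English month name plus a 4-digit year
parsed = pd.to_datetime(s, format='%b-%Y')
print(parsed.dt.year.tolist())  # [2001, 2003, 1958]
```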

2.3.8 Pivot tables give us a better view of the data

# pivot table: the index can take multiple columns, "columns" is optional, and the
# aggregation function aggfunc is applied to the items listed under "values"
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
pivot

(output: 7 rows × 139 columns, the summed loanAmnt for each grade A through G across every issueDateDT value)

2.3.9 Generate a data report with pandas_profiling

import pandas_profiling

pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")

2.4 Summary

Exploratory data analysis is the stage where we get a first understanding of the data and become familiar with it in preparation for feature engineering; quite often, features extracted during EDA can even be used directly as rules, which shows how important EDA is. The main work at this stage is still to use simple statistics to understand the data as a whole, analyze the relationships between variables of each type, and visualize them with suitable charts for direct inspection.
