

A First Attempt at Kaggle: Titanic Survival Prediction

Published: 2023/12/18

Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.

Titanic for Machine Learning

Imports and Reading the Data

# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()

   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    S

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The features are:
- PassengerId: no particular meaning.
- Pclass: cabin class. Does it affect survival? Did higher-class passengers have a better chance?
- Name: can help us infer sex and approximate age.
- Sex: did women survive at a higher rate?
- Age: do different age groups survive at different rates?
- SibSp and Parch: counts of siblings/spouses and parents/children aboard. Does having relatives raise or lower survival?
- Fare: did a higher fare buy a better chance?
- Cabin, Embarked: cabin and port of embarkation… intuitively these should not affect survival.

train.describe()

       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200
train.describe(include=['O'])  # 'O' selects the object (categorical) columns

        Name                                             Sex   Ticket  Cabin        Embarked
count   891                                              891   891     204          889
unique  891                                              2     681     147          3
top     Hippach, Mrs. Louis Albert (Ida Sophia Fischer)  male  1601    C23 C25 C27  S
freq    1                                                577   7       4            644

The target feature: Survived

survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0, 0.1], autopct='%1.1f%%', labels=['died', 'survived'], shadow=True)
plt.show()

x = [0, 1]
plt.bar(x, survive_num, width=0.35)
plt.xticks(x, ('died', 'survived'))
plt.show()

Feature Analysis

num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f] == 'object']
print('there are %d numerical features:' % len(num_f), num_f)
print('there are %d category features:' % len(cat_f), cat_f)

there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Feature types:
- numerical
- categorical: orderable or unorderable
- the unorderable categorical features here are Sex and Embarked

Categorical features

Sex

train.groupby(['Sex'])['Survived'].count()

Sex
female    314
male      577
Name: Survived, dtype: int64

f, ax = plt.subplots(figsize=(8, 6))
fig = sns.countplot(x='Sex', hue='Survived', data=train)
fig.set_title('Sex: Survived vs Dead')
plt.show()

train.groupby(['Sex'])['Survived'].sum() / train.groupby(['Sex'])['Survived'].count()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

There were far more men than women aboard, yet women survived at about 74%, far above the men's 18-19%. Sex is clearly an important feature.

Embarked

sns.factorplot('Embarked', 'Survived', data=train)
plt.show()

f, ax = plt.subplots(1, 3, figsize=(24, 6))
sns.countplot('Embarked', data=train, ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked', hue='Survived', data=train, ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked', hue='Pclass', data=train, ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
# plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

# pd.pivot_table(train, index='Embarked', columns='Pclass', values='Fare')
sns.boxplot(x='Embarked', y='Fare', hue='Pclass', data=train)
plt.show()

The plots show that most passengers boarded at port S, the majority of them in class 3, although S also has the most class-1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because a relatively large share of its passengers are in class 1; port Q is almost entirely class 3. Mean fares for classes 1 and 2 at port C are higher, which may hint that its passengers had higher social status. Logically, though, the port of embarkation should not itself affect survival, so it can be converted to dummy variables or dropped.

Pclass

train.groupby('Pclass')['Survived'].value_counts()

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

plt.subplots(figsize=(8, 6))
f = sns.countplot('Pclass', hue='Survived', data=train)

sns.factorplot('Pclass', 'Survived', hue='Sex', data=train)
plt.show()

Survival in classes 1 and 2 is clearly higher: more than half of class 1 survived, class 2 is roughly even, and women in classes 1 and 2 survived at a rate approaching 1. Cabin class has a large effect on survival.

SibSp

train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

   SibSp  Survived
1  1      0.535885
2  2      0.464286
0  0      0.345395
3  3      0.250000
4  4      0.166667
5  5      0.000000
6  8      0.000000
sns.factorplot('SibSp', 'Survived', data=train)
plt.show()

# pd.pivot_table(train, values='Survived', index='SibSp', columns='Pclass')
sns.countplot(x='SibSp', hue='Pclass', data=train)
plt.show()

Passengers travelling without siblings or spouses survived at roughly 0.3. Those with one companion had the highest rate, above 0.5, possibly because a larger share of them were in classes 1 and 2. Beyond that, survival falls as the count rises, mainly because passengers with more than three companions were mostly in class 3, where such groups rarely survived.

Parch

# pd.pivot_table(train, values='Survived', index='Parch', columns='Pclass')
sns.countplot(x='Parch', hue='Pclass', data=train)
plt.show()

sns.factorplot('Parch', 'Survived', data=train)
plt.show()

The trend is similar to SibSp: travelling alone lowers survival, having 1-3 parents/children raises it, and it then drops sharply, since most of the larger families were in class 3.

Age

train.groupby('Survived')['Age'].describe()

          count  mean       std        min   25%   50%   75%   max
Survived
0         424.0  30.626179  14.172110  1.00  21.0  28.0  39.0  74.0
1         290.0  28.343690  14.950952  0.42  19.0  28.0  36.0  80.0

f, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.violinplot('Pclass', 'Age', hue='Survived', data=train, split=True, ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot('Sex', 'Age', hue='Survived', data=train, split=True, ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()

Survivors in class 1 skew younger overall but span a wide age range, and survival between roughly 20 and 50 is relatively high there, perhaps because class-1 passengers are older on average. Children around age 10 show a clear survival boost in classes 2 and 3, and the same boost appears among boys. Surviving women are concentrated in the young-to-middle-aged range, while deaths are heaviest among passengers aged roughly 20-40.

Name

Name is mainly useful for inferring sex, and for filling missing ages using the mean age of passengers who share the same title.

# use a regular expression to pull the title out of each name
def getTitle(data):
    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.', data.Name[i]))
    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'", "")
        name = name.replace(".", "").strip()
        name = name.replace(" ", "")
        Salut.append(name)
    data['Title'] = Salut

getTitle(train)
train.head(2)

   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked  Title
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171  7.2500   NaN    S         Mr
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599   71.2833  C85    C         Mrs
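As a side note, the same extraction can be written much more concisely with pandas' vectorized string methods. This is only a sketch on a hypothetical two-row frame, not the notebook's code:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Titanic 'Name' column.
df = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
]})

# Capture the word that ends with a period -- the honorific title.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(df['Title'].tolist())  # ['Mr', 'Miss']
```

Vectorized extraction avoids the Python-level loop and the string cleanup in getTitle.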
pd.crosstab(train['Title'], train['Sex'])

Sex       female  male
Title
Capt      0       1
Col       0       2
Countess  1       0
Don       0       1
Dr        1       6
Jonkheer  0       1
Lady      1       0
Major     0       2
Master    0       40
Miss      182     0
Mlle      2       0
Mme       1       0
Mr        0       517
Mrs       124     0
Mrs,L     1       0
Ms        1       0
Rev       0       6
Sir       0       1

A quick English lesson on the titles: Mme addresses an "upper-class" married woman (or a professional woman) from a non-English-speaking country, equivalent to Mrs; Jonkheer is a Dutch squire; Capt a captain; Lady a noblewoman; Don a Spanish honorific for nobles and men of standing; the Countess a countess; Ms (or Mz) a woman of unspecified marital status; Col a colonel; Major a major; Mlle mademoiselle, i.e. Miss; Rev a reverend.

Fare

train.groupby('Pclass')['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

sns.distplot(train['Fare'].dropna())
plt.xlim((0, 200))
plt.xticks(np.arange(0, 200, 10))
plt.show()

Preliminary conclusions:
- Women survived at a much higher rate than men.
- Class 1 survival is high and class 3 very low; women in classes 1 and 2 survived at a rate approaching 1.
- Children around age 10 show a clear survival boost.
- SibSp and Parch behave similarly: travelling alone lowers survival, 1-2 siblings/spouses or 1-3 parents/children raise it, and larger groups see a sharp drop.
- Name and Age can be processed across the whole dataset: extract the title from the name, then fill missing ages with the per-title mean.

Data Processing

# merge the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train, test], keys=["train", "test"])
all_data.shape
# all_data.head()

(1309, 13)

# count missing values
NAs = pd.concat([train.isnull().sum(),
                 train.isnull().sum() / train.isnull().count(),
                 test.isnull().sum(),
                 test.isnull().sum() / test.isnull().count()],
                axis=1, keys=["train", "percent_train", "test", "percent"])
NAs[NAs.sum(axis=1) > 1].sort_values(by="percent", ascending=False)

          train  percent_train  test   percent
Cabin     687    0.771044       327.0  0.782297
Age       177    0.198653       86.0   0.205742
Fare      0      0.000000       1.0    0.002392
Embarked  2      0.002245       0.0    0.000000
# drop features with no predictive meaning
all_data.drop(['PassengerId', 'Cabin'], axis=1, inplace=True)
all_data.head(2)

         Age   Embarked  Fare     Name                                             Parch  Pclass  Sex     SibSp  Survived  Ticket     Title
train 0  22.0  S         7.2500   Braund, Mr. Owen Harris                          0      3       male    1      0.0       A/5 21171  Mr
      1  38.0  C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      1       female  1      1.0       PC 17599   Mrs

Handling Age

# first extract the title from Name on the combined data
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])

Sex       female  male
Title
Capt      0       1
Col       0       4
Countess  1       0
Don       0       1
Dona      1       0
Dr        1       7
Jonkheer  0       1
Lady      1       0
Major     0       2
Master    0       61
Miss      260     0
Mlle      2       0
Mme       1       0
Mr        0       757
Mrs       196     0
Mrs,L     1       0
Ms        2       0
Rev       0       8
Sir       0       1
# consolidate the rare titles into Mr / Mrs / Miss / Master
all_data['Title'] = all_data['Title'].replace(['Lady', 'Dr', 'Dona', 'Mme', 'Countess'], 'Mrs')
all_data['Title'] = all_data['Title'].replace('Mlle', 'Miss')
all_data['Title'] = all_data['Title'].replace('Mrs,L', 'Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
all_data['Title'] = all_data['Title'].replace(['Capt', 'Col', 'Don', 'Major', 'Rev', 'Jonkheer', 'Sir'], 'Mr')

all_data.Title.isnull().sum()

0

all_data[:train.shape[0]].groupby('Title')['Age'].mean()

Title
Master     4.574167
Miss      21.845638
Mr        32.891990
Mrs       36.188034
Name: Age, dtype: float64

# fill missing ages with the per-title mean from the training set
all_data.loc[(all_data.Age.isnull()) & (all_data.Title == 'Mr'), 'Age'] = 32
all_data.loc[(all_data.Age.isnull()) & (all_data.Title == 'Mrs'), 'Age'] = 36
all_data.loc[(all_data.Age.isnull()) & (all_data.Title == 'Master'), 'Age'] = 5
all_data.loc[(all_data.Age.isnull()) & (all_data.Title == 'Miss'), 'Age'] = 22

all_data.Age.isnull().sum()

0

all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

   Title   Survived
0  Master  0.575000
1  Miss    0.702703
2  Mr      0.158192
3  Mrs     0.777778
f, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex == 'female', 'Age'], color='red', ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex == 'male', 'Age'], color='blue', ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived == 0, 'Age'], color='red', label='Not Survived', ax=ax[1])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived == 1, 'Age'], color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()

  • Children under about 16 survived at a higher rate, and the oldest passenger (80) survived.
  • A large number of passengers aged 16-40 did not survive.
  • Most passengers were between 16 and 40.
  • To help the classifiers, bin Age into bands as a new feature and add a child feature.

add isChild

def male_female_child(passenger):
    # unpack age and sex
    age, sex = passenger
    # flag children as their own category
    if age < 16:
        return 'child'
    else:
        return sex

# create the new feature
all_data['person'] = all_data[['Age', 'Sex']].apply(male_female_child, axis=1)

# ages run 0-80; split them into three bands: young, middle-aged, old
all_data['Age_band'] = 0
all_data.loc[all_data['Age'] <= 16, 'Age_band'] = 0
all_data.loc[(all_data['Age'] > 16) & (all_data['Age'] <= 40), 'Age_band'] = 1
all_data.loc[all_data['Age'] > 40, 'Age_band'] = 2
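The three manual threshold assignments can equivalently be expressed with `pd.cut`. A small sketch on made-up ages; the bin edges 16 and 40 mirror the thresholds used above:

```python
import pandas as pd

# Hypothetical ages; (0, 16], (16, 40], (40, 80] map to bands 0, 1, 2.
ages = pd.Series([5.0, 22.0, 38.0, 62.0])
age_band = pd.cut(ages, bins=[0, 16, 40, 80], labels=[0, 1, 2])
print(age_band.tolist())  # [0, 1, 1, 2]
```

One line replaces the three `.loc` assignments, and the bin edges live in a single place.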

Handling Name

df = pd.get_dummies(all_data['Title'], prefix='Title')
all_data = pd.concat([all_data, df], axis=1)
all_data.drop('Title', axis=1, inplace=True)
# drop Name as well
all_data.drop('Name', axis=1, inplace=True)

Fill missing Embarked

all_data.loc[all_data.Embarked.isnull()]

           Age   Embarked  Fare  Parch  Pclass  Sex     SibSp  Survived  Ticket  person  Age_band
train 61   38.0  NaN       80.0  0      1       female  0      1.0       113572  female  1
      829  62.0  NaN       80.0  0      1       female  0      1.0       113572  female  2

A fare of 80 in first class most likely means these passengers boarded at port C.
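That guess can be sanity-checked by comparing fares per port and class; on the real data one would run `train.groupby(['Embarked', 'Pclass'])['Fare'].median()`. A sketch on a tiny made-up frame:

```python
import pandas as pd

# Hypothetical class-1 fares at two ports (values are invented).
sample = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', 'S'],
    'Pclass':   [1, 1, 1, 1],
    'Fare':     [78.0, 82.0, 52.0, 54.0],
})
med = sample.groupby(['Embarked', 'Pclass'])['Fare'].median()
print(med)
```

If a fare of 80 sits closest to the class-1 median at C, filling the two missing ports with 'C' is reasonable.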

all_data['Embarked'].fillna('C', inplace=True)
all_data.Embarked.isnull().any()

False

embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data, embark_dummy], axis=1)
all_data.head(2)

         Age   Embarked  Fare     Parch  Pclass  Sex     SibSp  Survived  Ticket     person  Age_band  Title_Master  Title_Miss  Title_Mr  Title_Mrs  C  Q  S
train 0  22.0  S         7.2500   0      3       male    1      0.0       A/5 21171  male    1         0             0           1         0          0  0  1
      1  38.0  C         71.2833  0      1       female  1      1.0       PC 17599   female  1         0             0           0         1          1  0  0

Combining SibSp and Parch

# create two new features: Family_size and alone
all_data['Family_size'] = all_data['SibSp'] + all_data['Parch']  # total relatives aboard
all_data['alone'] = 0
all_data.loc[all_data.Family_size == 0, 'alone'] = 1  # 1 means travelling alone

f, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.factorplot('Family_size', 'Survived', data=all_data[:train.shape[0]], ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone', 'Survived', data=all_data[:train.shape[0]], ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

Passengers travelling alone survived at only about 0.3; survival rises with 1-3 family members, then drops sharply above 4.

# bin family size
all_data['Family_size'] = np.where(all_data['Family_size'] == 0, 'solo',
                          np.where(all_data['Family_size'] <= 3, 'normal', 'big'))
sns.factorplot('alone', 'Survived', hue='Sex', data=all_data[:train.shape[0]], col='Pclass')
plt.show()

For women in classes 1 and 2, travelling alone makes little difference; for class-3 women, travelling alone actually raises survival.

all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex'] == 'female') & (all_data['Pclass'] == 3) & (all_data['alone'] == 1), 'poor_girl'] = 1

Filling and binning the continuous Fare variable

# fill the missing fare with the (rounded) class mean
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass == 1), 'Fare'] = 84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass == 2), 'Fare'] = 21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass == 3), 'Fare'] = 14

sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived == 0, 'Fare'], color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived == 1, 'Fare'], color='blue', label='Survived')
plt.xlim((0, 100))

(0, 100)
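Instead of hard-coding 84/21/14, the class-wise statistic could be computed and filled in one step with `groupby(...).transform`. A sketch using the median (an alternative to the mean used above) on made-up fares:

```python
import pandas as pd

# Hypothetical fares with one gap per class; values are invented.
df = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                   'Fare':   [80.0, None, 20.0, 22.0, 14.0, None]})

# transform('median') broadcasts each class's median back to its rows,
# so fillna patches every gap with its own class's median.
df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('median'))
print(df['Fare'].tolist())  # [80.0, 80.0, 20.0, 22.0, 14.0, 14.0]
```

This stays correct if the underlying data changes, whereas hard-coded constants silently go stale.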

sns.lmplot('Fare', 'Survived', data=all_data[:train.shape[0]])
plt.show()

# split Fare into three equal-frequency bins and inspect mean survival in each
all_data['Fare_band'] = pd.qcut(all_data['Fare'], 3)
all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()

Fare_band
(-0.001, 8.662]    0.198052
(8.662, 26.0]      0.402778
(26.0, 512.329]    0.559322
Name: Survived, dtype: float64

# discretize the continuous Fare using those bin edges
all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare'] <= 8.662, 'Fare_cut'] = 0
all_data.loc[(all_data['Fare'] > 8.662) & (all_data['Fare'] <= 26), 'Fare_cut'] = 1
all_data.loc[(all_data['Fare'] > 26) & (all_data['Fare'] < 513), 'Fare_cut'] = 2

sns.factorplot('Fare_cut', 'Survived', hue='Sex', data=all_data[:train.shape[0]])
plt.show()

Survival rises with fare, especially for men.

# create a feature for wealthy men
all_data['rich_man'] = 0
all_data.loc[(all_data['Fare'] >= 80) & (all_data['Sex'] == 'male'), 'rich_man'] = 1

Encoding the categorical features as numbers

all_data.head()

5 rows × 24 columns — the frame now mixes the raw fields (Age, Embarked, Fare, Parch, Pclass, Sex, SibSp, Survived, Ticket) with the engineered ones (person, Age_band, the Title_* dummies, the C/Q/S dummies, Family_size, alone, poor_girl, Fare_band, Fare_cut, rich_man).

Features to drop: Embarked (already one-hot encoded), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), plus Ticket, one redundant port dummy (the code below drops C), SibSp, and Parch.

'''
Drop the features no longer needed:
Age  -> replaced by the Age_band bins
Fare -> replaced by the Fare_cut bins (via Fare_band)
Ticket carries no meaning
'''
all_data.drop(['Age', 'Fare', 'Ticket', 'Embarked', 'C', 'Fare_band', 'SibSp', 'Parch'], axis=1, inplace=True)
all_data.head(2)

         Pclass  Sex     Survived  person  Age_band  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Q  S  Family_size  alone  poor_girl  Fare_cut  rich_man
train 0  3       male    0.0       male    1         0             0           1         0          0  1  normal       0      0          0         0
      1  1       female  1.0       female  1         0             0           0         1          0  0  normal       0      0          2         0

df1 = pd.get_dummies(all_data['Family_size'], prefix='Family_size')
df2 = pd.get_dummies(all_data['person'], prefix='person')
df3 = pd.get_dummies(all_data['Age_band'], prefix='age')
all_data = pd.concat([all_data, df1, df2, df3], axis=1)
all_data.head()

5 rows × 25 columns — Family_size, person, and Age_band now also exist as dummy columns.

all_data.drop(['Sex', 'person', 'Age_band', 'Family_size'], axis=1, inplace=True)
all_data.head()

5 rows × 21 columns — every remaining feature is numeric.

Building the Models

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix  # compare predictions against targets
from sklearn.model_selection import cross_val_predict  # returns the cross-validated predictions
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:' + str(train_data.shape))
print('test data:' + str(test_data.shape))

train data:(668, 21)
test data:(641, 21)

train, test = train_test_split(train_data, test_size=0.25, random_state=0, stratify=train_data['Survived'])
train_x = train.drop('Survived', axis=1)
train_y = train['Survived']
test_x = test.drop('Survived', axis=1)
test_y = test['Survived']
print(train_x.shape)
print(test_x.shape)

(668, 20)
(223, 20)

# define accuracy scores on the train and held-out splits
def cv_score(model):
    cv_result = cross_val_score(model, train_x, train_y, cv=10, scoring="accuracy")
    return cv_result

def cv_score_test(model):
    cv_result_test = cross_val_score(model, test_x, test_y, cv=10, scoring="accuracy")
    return cv_result_test

RBF SVM

# RBF SVM model
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)

Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.826306967835
0.816196122718

Decision Tree

# a simple tree
clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x, train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)

0.808216271583
0.811631846414

KNN

# test n_neighbors
pred = []
for i in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x, train_y)
    pred.append(cv_score(model).mean())
n = list(range(1, 11))
plt.plot(n, pred)
plt.xticks(range(1, 11))
plt.show()

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x, train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)

0.826239790353
0.829653679654

Logistic Regression

# logistic regression
clf_LR = LogisticRegression()
clf_LR.fit(train_x, train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)

0.838226647511
0.811848296631

Gaussian Naive Bayes

clf_gb = GaussianNB()
clf_gb.fit(train_x, train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)

0.794959693511
0.789695087521

Random Forest

n_estimators = range(100, 1000, 100)
grid = {'n_estimators': n_estimators}
clf_forest = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=grid, verbose=True)
clf_forest.fit(train_x, train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
# print(cv_score(clf_forest).mean())
# print(cv_score_test(clf_forest).mean())

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
    oob_score=False, random_state=0, verbose=0, warm_start=False)
0.817365269461

clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x, train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)

0.811178066885
0.811434217956

pd.Series(clf_forest.feature_importances_, train_x.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.show()

models = pd.DataFrame({'model': ['SVM', 'Decision Tree', 'KNN', 'Logistic regression', 'Gaussian Bayes', 'Random Forest'],
                       'score on train': [acc_svc_train, acc_tree_train, acc_knn_train, acc_LR_train, acc_gb_train, acc_forest_train],
                       'score on test': [acc_svc_test, acc_tree_test, acc_knn_test, acc_LR_test, acc_gb_test, acc_forest_test]})
models.sort_values(by='score on test', ascending=False)

   model                score on test  score on train
2  KNN                  0.829654       0.826240
0  SVM                  0.816196       0.826307
3  Logistic regression  0.811848       0.838227
1  Decision Tree        0.811632       0.808216
5  Random Forest        0.811434       0.811178
4  Gaussian Bayes       0.789695       0.794960

Ensemble

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

# bagging with the tuned SVM as the base estimator
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_, n_estimators=200, random_state=0)
bag_tree.fit(train_x, train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test = cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)

0.82782211935
0.816196122718

AdaBoost

n_estimators = range(100, 1000, 100)
a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
grid = {'n_estimators': n_estimators, 'learning_rate': a}
ada = GridSearchCV(AdaBoostClassifier(), param_grid=grid, verbose=True)
ada.fit(train_x, train_y)
print(ada.best_estimator_)
print(ada.best_score_)

Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed: 5.4min finished
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
    learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317

ada = AdaBoostClassifier(n_estimators=200, random_state=0, learning_rate=0.2)
ada.fit(train_x, train_y)
acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()
print(acc_ada_train)
print(acc_ada_test)

0.829248144305
0.825719932242

# confusion matrix to inspect the predictions
y_pred = cross_val_predict(ada, test_x, test_y, cv=10)
sns.heatmap(confusion_matrix(test_y, y_pred), cmap='winter', annot=True, fmt='2.0f')
plt.show()

GradientBoosting

n_estimators = range(100, 1000, 100)
a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
grid = {'n_estimators': n_estimators, 'learning_rate': a}
grad = GridSearchCV(GradientBoostingClassifier(), param_grid=grid, verbose=True)
grad.fit(train_x, train_y)
print(grad.best_estimator_)
print(grad.best_score_)

Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed: 2.4min finished
GradientBoostingClassifier(criterion='friedman_mse', init=None,
    learning_rate=0.05, loss='deviance', max_depth=3,
    max_features=None, max_leaf_nodes=None,
    min_impurity_split=1e-07, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0,
    n_estimators=200, presort='auto', random_state=None,
    subsample=1.0, verbose=0, warm_start=False)
0.824850299401

# use the best estimator found for gradient boosting
clf_grad = GradientBoostingClassifier(n_estimators=200, random_state=0, learning_rate=0.05)
clf_grad.fit(train_x, train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()
print(acc_grad_train)
print(acc_grad_test)

0.818709926304
0.807500470544

from sklearn.metrics import precision_score

class Ensemble(object):
    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        # print(x)
        return self.clf.predict(x)

    def score(self, x, y):
        s = precision_score(y, self.predict(x))
        return s

ensem = Ensemble([('Ada', ada), ('Bag', bag_tree), ('SVM', clf_svc.best_estimator_), ('LR', clf_LR), ('gbdt', clf_grad)])
score = 0
for i in range(0, 10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x, test_y) * 100, 2)
    score += sco
print(score / 10)

89.83

Submission

pre = ensem.predict(test_data.drop('Survived', axis=1))
submission = pd.DataFrame({'PassengerId': passID, 'Survived': pre})
submission.to_csv('submission.csv', index=False)

Judging from the submitted score, the ensemble did not clearly beat the single models. Likely causes: the base models are strongly correlated, the training data is small, and the one-hot encoding may introduce collinearity. Although the held-out and training scores were close, the leaderboard score dropped noticeably, probably because the data is limited, training is insufficient, and the features are few and strongly correlated; introducing more features could help.
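The "correlated base models" hypothesis is easy to quantify: compare the base models' predictions pairwise. A sketch with hypothetical 0/1 predictions from two models on the same ten passengers (on the real data one would feed in `ada.predict(test_x)`, `clf_grad.predict(test_x)`, etc.):

```python
import numpy as np

# Hypothetical 0/1 predictions from two base models.
pred_a = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 1])
pred_b = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])

# If the models agree on almost every row, stacking them adds little.
agreement = (pred_a == pred_b).mean()
corr = np.corrcoef(pred_a, pred_b)[0, 1]
print(agreement, round(corr, 3))  # 0.9 0.816
```

High agreement and high correlation together suggest the meta-learner has little disagreement to exploit, which matches the flat leaderboard result.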
