Python Statistical Analysis -- 2. Pre-analysis: Handling Outliers and Missing Values

Published: 2023/12/15 | python | 豆豆

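Contents

1. Missing value handling
- 1.1 Importing the data
- 1.2 Inspecting the data
- 1.3 Methods for handling missing values
2. Outlier handling
- 2.1 Outliers: handling extreme outliers
- 2.2 Feature selection (Filter method)
- 2.3 Collinearity
- 2.4 Plotting logistic, logarithmic, exponential, inverse, power, and S curves
3. Encoding
- 3.1 Outliers: handling multivariate outliers
- 3.2 Feature selection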

1. Missing value handling

1.1 Importing the data

First import the required packages, then load the data.

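# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn import linear_model
import seaborn as sns
# Jupyter magic: render figures inline in the notebook
%matplotlib inline
plt.rcParams["font.sans-serif"] = ["SimHei"]  # font that can render Chinese labels
plt.rcParams["axes.unicode_minus"] = False    # draw the minus sign correctly with that font
# Read the data with pandas; read_excel supports both xls and xlsx
data = pd.read_excel(r"殘耗.xlsx")
data.head(2)
data.info()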

1.2 Inspecting the data

Inspect the data and record the variables whose distributions look abnormal.

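# Steps 1-3 (defining the requirement, y, and x) are all settled from theory
# ------------ Step 4 --------------
# Describe the data: distribution shape, mean, median, maximum, minimum and other common statistics of each variable.
# Distribution shape: record the variables whose distribution looks abnormal
data.iloc[:, 1:].hist(figsize=(20, 16))
# Summary statistics
data.iloc[:, 1:].describe()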

Variables whose histograms are skewed beyond roughly 3:1 should be adjusted appropriately. This adjustment is part of outlier handling.
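
As a complement to reading the histograms by eye, skewed variables can also be flagged numerically. The following is a minimal sketch, not the article's method: the helper name `flag_skewed`, the synthetic data, and the skewness threshold of 1.0 are all assumptions chosen for illustration, and the threshold is not a translation of the 3:1 rule above.

```python
import numpy as np
import pandas as pd

def flag_skewed(df: pd.DataFrame, threshold: float = 1.0) -> list:
    """Return the numeric columns whose absolute sample skewness exceeds threshold.

    Hypothetical helper; the default threshold is an assumption.
    """
    skew = df.select_dtypes(include="number").skew()
    return skew[skew.abs() > threshold].index.tolist()

# Synthetic demo data: one symmetric column and one right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})
skewed_cols = flag_skewed(demo)
print(skewed_cols)
```

Columns returned by such a check are candidates for the capping described in section 2.1, but the final decision should still rest on the histograms and on domain knowledge.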

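1.3 Methods for handling missing values

Rules of thumb for handling missing values in large data sets:
Below 3% missing: usually fill with the median.
Between 3% and 20% missing: usually fill with model-based imputation (KNN or random forest).
Between 20% and 80% missing: usually treat "missing" as its own category.
Above 80% missing: usually hand the variable to business staff for review (in most cases it is deleted).

Here the missing values are filled with the random forest method.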
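
The thresholds above can be turned into a small lookup that suggests a strategy per column. This is only a sketch: the function name `imputation_strategy`, the demo columns, and the treatment of values that fall exactly on a boundary (3%, 20%, 80%) are assumptions, since the rules do not say which side a boundary belongs to.

```python
import numpy as np
import pandas as pd

def imputation_strategy(missing_ratio: float) -> str:
    """Map a column's missing ratio to the rule-of-thumb strategy (hypothetical helper)."""
    if missing_ratio < 0.03:
        return "median"
    if missing_ratio < 0.20:
        return "model (KNN or random forest)"
    if missing_ratio < 0.80:
        return "treat missing as its own category"
    return "review with business staff (usually drop)"

# Synthetic demo frame with 0%, 10%, 50% and 90% missing values
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0] * 25,
    "b": [np.nan] * 10 + [1.0] * 90,
    "c": [np.nan] * 50 + [1.0] * 50,
    "d": [np.nan] * 90 + [1.0] * 10,
})
ratios = demo.isnull().mean()          # fraction of missing values per column
plan = ratios.map(imputation_strategy)
print(plan)
```

`demo.isnull().mean()` gives the same ratio as `data.isnull().sum()/data.shape[0]`, just written more compactly.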

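# ---------- Step 5 --------------
# 5.1 Missing values: columns with more than 60% missing are handled by the category method or deleted; the rest are filled with the median
data.isnull().sum() / data.shape[0]  # ratio of missing values per column
# data51 = data.fillna(value=data.median())  # median imputation (model score 77.5% with the median, 79.2% with random forest)
# Compatibility shim: missingpy imports sklearn.neighbors.base, which newer scikit-learn versions renamed
# import sklearn.neighbors._base
# import sys
# sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
# ----------- or use model-based imputation ----------
# Fill the missing values with a random forest
# MissForest and RandomForestRegressor are both random forests
# MissForest is a ready-made imputer that works without manual setup
# RandomForestRegressor requires building the imputation by hand, so MissForest is usually preferred
from missingpy import KNNImputer, MissForest
imput = MissForest(n_estimators=2, min_samples_leaf=9000, n_jobs=-1, copy=False)
data5 = imput.fit_transform(data.iloc[:, 1:])
data51 = pd.DataFrame(data5, columns=data.iloc[:, 1:].columns)
# data51.info()
x, y = data51.iloc[:, 1:], data51['v殘耗']
reg = linear_model.LinearRegression()
reg.fit(x, y)
reg.score(x, y)  # R^2 of the linear model on the imputed data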

With random forest imputation, the linear model's score (the R² returned by reg.score) is 0.7907806712612082.
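
For a regression model, scikit-learn's `score` method returns the coefficient of determination R², not a classification accuracy. A small sketch of what that number measures, using a hypothetical helper `r_squared` and toy values that are not from the article's data:

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                            # predictions match exactly
mean_only = r_squared(y, np.full_like(y, y.mean()))  # always predicting the mean
print(perfect, mean_only)
```

A score of about 0.79 therefore means the linear model explains roughly 79% of the variance of the target on the data it was fitted to.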

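2. Outlier handling

2.1 Outliers: handling extreme outliers

When a value in column i of data51 is greater than j, it is replaced by j.
When a value in column i of data51 is less than t, it is replaced by t.
The capped values are stored in a new column named i plus the suffix 01. See the comments in the code for details.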

# 5.2 Outliers: handling extreme outliers (model score 83%)
# Each tuple is (lower bound t, column name i, upper bound j)
var = [
    (-0.01, 'lHH', 140000), (-0.01, '偏離位', 10000), (0, '助燃', 100),
    (-0.01, '助燃反應', 2000), (-0.01, '助燃檸檬', 10000), (20, '助燃添加', 29.7),
    (0, '助燃點', 1000), (-0.01, '吸阻', 1000), (10, '吸阻過濾', 129),
    (0, '噪聲', 100), (-10000, '圓周點位', 29.7), (-0.01, '撤回點位', 1000),
    (0, '收緊度', 1000), (0, '標注', 129), (0, '檢查點位', 100),
    (-0.01, '氣體綜合', 10000), (0, '消耗煙脂', 500), (-200, '溫控', 200),
    (-0.01, '煙堿HW', 2000), (-0.01, '煙堿量', 10000), (0, '焦油量', 200),
    (-0.01, '起點位', 1000), (-0.01, '過濾時效', 1500), (30, '通路', 40),
    (-10000, '鈉元素', 500), (20, '鉀元素', 100),
]
# Where column i of data51 is >= j, use j; where it is <= t, use t; otherwise keep the value.
# The result is written to a new column named i + "01".
for (t, i, j) in var:
    data51[i + "01"] = np.where(data51[i] >= j, j, np.where(data51[i] <= t, t, data51[i].copy()))
    # print(data51[i + "01"].describe())
# Keep the first 6 original columns plus the 26 capped columns
data52 = data51.iloc[:, [*range(0, 6), *range(32, 58)]]
# data52.info()
x, y = data52.iloc[:, 2:], data52['v殘耗']
reg = linear_model.LinearRegression().fit(x, y)
reg.score(x, y)
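
The nested `np.where` capping used here gives the same result as pandas' `Series.clip`, provided the lower bound does not exceed the upper bound. A minimal sketch on made-up values (the series and the bounds 0 and 100 are illustrative assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series([-5.0, 0.0, 50.0, 150.0, 2000.0])
lower, upper = 0, 100

# Capping written as nested np.where
nested = np.where(s >= upper, upper, np.where(s <= lower, lower, s))

# The same capping written with Series.clip
clipped = s.clip(lower=lower, upper=upper)

same = np.array_equal(nested, clipped.to_numpy())
print(clipped.tolist(), same)
```

`clip` keeps the index and returns a Series, which makes it convenient when the capped values are assigned back to a DataFrame column.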

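2.2 Feature selection (Filter method)

# 5.3 Feature selection (Filter method): drop the features that matter least
# SelectKBest keeps a fixed number of features
# SelectPercentile keeps a fixed percentage of features
# f_regression scores each feature with a univariate F-test for regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression
# x: all rows, from the third column (index 2) to the last column
# y: v殘耗 (the dependent variable)
x, y = data52.iloc[:, 2:], data52['v殘耗']
# Score with f_regression and keep the top 60% of the features
fit = SelectPercentile(score_func=f_regression, percentile=60)
fitt = fit.fit_transform(x, y)
# fit.get_support(indices=True) returns the positions of the selected columns, here
# array([ 0, 1, 2, 5, 6, 8, 10, 11, 12, 13, 15, 17, 19, 22, 23, 25, 26, 28], dtype=int64)
# pd.concat joins y and the selected columns into one data set
data53 = pd.concat([data52['v殘耗'], x.iloc[:, fit.get_support(indices=True)]], axis=1)
data53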
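
To see what `SelectPercentile` with `f_regression` does in isolation, here is a small sketch on synthetic data. The column names, coefficients, and the 50% setting are assumptions made for the demo and have nothing to do with the article's data set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_regression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# Only the two signal columns influence y
y = 3.0 * X["signal_a"] - 2.0 * X["signal_b"] + rng.normal(scale=0.5, size=n)

# Keep the top 50% of features ranked by their univariate F-score
selector = SelectPercentile(score_func=f_regression, percentile=50).fit(X, y)
kept = X.columns[selector.get_support(indices=True)].tolist()
print(kept)
```

Because each feature is scored on its own, a filter of this kind ignores interactions and collinearity between features, which is why the collinearity check still follows as a separate step.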

2.3 Collinearity

See the comments in the code for details.

# 5.4 Collinearity: above 0.9 is severe; merge or delete. Collinearity is between x and x, not between x and y
# corr() computes the correlation matrix; values not above 0.9 are replaced with 0.01 so that
# high correlations stand out clearly from low ones in the heatmap
# d = data53.corr(); d[d <= 0.9] = 0.01  # highlight and extract the highly correlated variables
# Draw the heatmap
# sns.heatmap(d)
# print([data53['氣體綜合01'].corr(data53['煙堿量01']), data53['過濾時效01'].corr(data53['v3燃料類型'])])
# plt.scatter(data53['v3燃料類型'], data53['過濾時效01'])  # 過濾時效01 is deleted; it is not important for the business
# plt.scatter(data53['氣體綜合01'], data53['煙堿量01'])
# Fit a model of linear form
from scipy.optimize import curve_fit
def f(x, b0, b1):
    # Any functional relationship between x and y can be used here, e.g. b0*np.exp(b1*dt['x'])
    return b0 + b1 * x
popt, pcov = curve_fit(f, data53["煙堿量01"], data53["氣體綜合01"])
b0 = popt[0]
b1 = popt[1]
# Build the merged field from the fitted line
data53["成分煙堿"] = b0 + b1 * data53["煙堿量01"]
# Pearson correlation r (not r squared) between the merged field and y;
# keep the merged field only if it correlates with y more strongly than each single x does
print("r:", data53["成分煙堿"].corr(data53['v殘耗']))
# drop removes columns: "氣體綜合01", '過濾時效01' and "成分煙堿" are the weakly correlated
# or merged variables to delete. The final decision was to delete "氣體綜合01" and '過濾時效01'
data54 = data53.drop(["氣體綜合01", '過濾時效01', "成分煙堿"], axis=1)
data54.shape

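Collinearity: a correlation above 0.9 between two predictors counts as severe collinearity, and the variables should be merged or one of them deleted. Collinearity refers to the relationship between x and x, not between x and y.

A heatmap is drawn to inspect the correlations among the independent variables.
It shows which independent variables are strongly correlated with each other; their correlation with v殘耗 is not what matters here.

Correlation between v3燃料類型 and 過濾時效01. A scatter plot is generally not used for this pair because it does not show the relationship clearly.

Correlation between 氣體綜合01 and 煙堿量01.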
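
Besides reading a heatmap, the strongly correlated predictor pairs can be listed directly from the correlation matrix. The sketch below uses a hypothetical helper `high_corr_pairs` and synthetic columns; taking the absolute value of the correlation (so that strong negative correlations are also reported) is an assumption that goes slightly beyond the rule stated above.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """List (col_a, col_b, |r|) for every column pair whose |r| exceeds threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    "x1": base,
    "x2": 2.0 * base + rng.normal(scale=0.1, size=1000),  # nearly a multiple of x1
    "x3": rng.normal(size=1000),                          # independent of the others
})
pairs = high_corr_pairs(demo)
print(pairs)
```

Only predictor columns should be passed in; including the target would mix x-to-y correlation into a check that is meant to be x-to-x.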

2.4 Plotting logistic, logarithmic, exponential, inverse, power, and S curves

# -------------- Functions and their plots --------------
plt.subplots(2, 3, figsize=(16, 8))
# Logistic curve
b0 = 1; b1 = 2
plt.subplot(231); x = np.random.randint(-5, 5, 100); y = 1 / (1 + np.exp(-b0 - b1 * x))
plt.scatter(x, y, label='logistic'); plt.legend()
# Logarithmic curve (x starts at 1 because log(0) is undefined)
plt.subplot(232); b0 = 5; b1 = 2; x = np.random.randint(1, 100, 100); y = b0 + (b1 * np.log(x))
plt.scatter(x, y, label='logarithmic'); plt.legend()
# Exponential curve
plt.subplot(233); b0 = 5; b1 = 2; x = np.random.randint(0, 10, 100); y = b0 * np.exp(b1 * x)
plt.scatter(x, y, label='exponential'); plt.legend()
# Inverse curve (x starts at 1 to avoid division by zero)
plt.subplot(234); b0 = 5; b1 = 2; x = np.random.randint(1, 10, 100); y = b0 + (b1 / x)
plt.scatter(x, y, label='inverse'); plt.legend()
# Power curve
plt.subplot(235); b0 = 5; b1 = 2; x = np.random.randint(0, 10, 100); y = b0 * (x ** b1)
plt.scatter(x, y, label='power'); plt.legend()
# S curve (zeros are dropped to avoid division by zero)
plt.subplot(236); b0 = 5; b1 = 2; x = np.random.randint(-100, 100, 100); x = x[x != 0]; y = np.exp(b0 + (b1 / x))
plt.scatter(x, y, label='S curve'); plt.legend()

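[Figure: a 2x3 grid of scatter plots showing the logistic, logarithmic, exponential, inverse, power, and S curves]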

3. Encoding

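# 5.5 Variable transformation: check whether y needs to be transformed
# data54['v殘耗log'] = np.log(data54['v殘耗'])  # not needed in this case
# 5.6 Encoding: remove outliers by grouping values into bins (labelling)
# data53['煙堿量02'] = pd.qcut(data53['煙堿量01'], q=4)  # not needed in this case
data54.shape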
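
The `pd.qcut` call mentioned in step 5.6 splits a variable into bins that hold roughly equal numbers of observations, which dampens the influence of extreme values. A minimal sketch on toy values (the series 1 to 100 is an illustrative assumption):

```python
import pandas as pd

values = pd.Series(range(1, 101))

# Four quantile-based bins; labels=False returns the bin number 0..3 instead of an interval
bins = pd.qcut(values, q=4, labels=False)
counts = bins.value_counts().sort_index().tolist()
print(counts)
```

With `labels=False` the result is an ordinal code that can be fed to a model directly; without it, `qcut` returns interval categories that are easier to read in a report.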

Overall result

x, y = data54.iloc[:, 1:], data54['v殘耗']
reg = linear_model.LinearRegression()  # suitable for regression on large data sets
reg.fit(x, y)
reg.score(x, y)
plt.subplots(1, 2, figsize=(12, 8))
# Left panel: observed y against predicted y, with R^2 shown in the legend
plt.subplot(121)
r2 = reg.score(x, y); plt.plot(y, reg.predict(x), 'o', label=r2)
plt.legend()
# Right panel: standardized residuals against predicted values
plt.subplot(122)
resid = y - reg.predict(x)
std_resid = (resid - np.mean(resid)) / np.std(resid)
plt.plot(reg.predict(x), std_resid, 'o', label="residual plot")
plt.legend()

3.1 Outliers: handling multivariate outliers

# 5.2+ Outliers: handling multivariate outliers
data54["標準化殘差"] = std_resid  # store the standardized residuals as a column
data54_99 = data54[np.abs(data54["標準化殘差"]) <= 6]  # outlier condition: keep rows within 6 standard deviations
data54_2 = data54_99.drop(["標準化殘差"], axis=1)
print(data54_2.shape)
x, y = data54_2.iloc[:, 1:], data54_2['v殘耗']
reg = linear_model.LinearRegression()
reg.fit(x, y)
print(reg.score(x, y))
resid = y - reg.predict(x)
# Standardized residuals of the refitted model against its predictions
plt.plot(reg.predict(x), (resid - np.mean(resid)) / np.std(resid), 'o', label="residual plot")
plt.legend()
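
The idea behind this step is to fit a model, standardize its residuals, and drop the rows whose residual is far from zero. The sketch below reproduces that idea on synthetic one-dimensional data with `np.polyfit`; the data, the injected outlier, and the use of a simple straight-line fit instead of the article's multi-variable regression are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)
y[0] += 50.0  # inject one gross outlier

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (slope * x + intercept)

# Standardize the residuals and keep rows within 6 standard deviations
std_resid = (resid - resid.mean()) / resid.std()
keep = np.abs(std_resid) <= 6
x_clean, y_clean = x[keep], y[keep]
print(keep.sum(), "rows kept out of", len(x))
```

Because the outlier itself inflates the residual standard deviation, a row must be very extreme to exceed a threshold of 6; refitting on the cleaned data, as the article does, gives a tighter estimate.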

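3.2 Feature selection

# 5.3+ Feature selection (wrapper method: RFE; embedded method: SelectFromModel)
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVR
x54_1, y54_1 = data54_2.iloc[:, 1:], data54_2['v殘耗']
# The next two lines were commented out in the original, but selector is used below, so they have to run
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=10000)
selector = RFE(rfr, n_features_to_select=5).fit(x54_1, y54_1)
# get_support returns positions within x54_1, so the columns are taken from x54_1
# (the original indexed data54_2.columns, which is shifted by one because of the y column)
data54_3 = pd.concat([data54_2['v殘耗'], x54_1.iloc[:, selector.get_support(indices=True)]], axis=1)
# Column 0 is y, so the features start at column 1
x, y = data54_3.iloc[:, 1:], data54_3['v殘耗']
reg = linear_model.LinearRegression().fit(x, y)
print(reg.score(x, y))
data54_3.head(6)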
