Python Training Day 9: Data Preprocessing with pandas

Published: 2023/12/18

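1 Merging data

1.1 Stacking data

1.1.1 Horizontal stacking (rows aligned, tables joined side by side)

Horizontal stacking joins two tables along the X axis. It can be done with the concat function, whose basic syntax is as follows (join_axes, which appears in older signatures, was removed in pandas 1.0 and is omitted here).

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

The commonly used behavior is as follows.

① When axis=1, concat aligns on the row index and places the columns of two or more tables side by side. When the indexes of the tables are not identical, the join parameter selects an inner or an outer join. An inner join returns only the rows whose index appears in every table. An outer join returns the union of the indexes and fills the positions that have no data with missing values (NaN).

② When the two tables have exactly the same index, the result is the same whether join is inner or outer: the two tables are joined in full along the X axis.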

import pandas as pd

detail = pd.read_excel(r'F:\Desktop\meal_order_detail.xlsx')

detail.shape #(2779, 19)

info = pd.read_csv(r'F:\Desktop\meal_order_info.csv', encoding='gbk')

info.shape #(945, 21)

a = detail.iloc[:, :10]

a.shape #(2779, 10)

b = detail.iloc[:, 10:]

b.shape #(2779, 9)

# Horizontal stacking

c = pd.concat([a, b], axis=1)

c.shape #(2779, 19)
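The split-and-rejoin example above uses tables with identical indexes, so it does not show what join does. The following is a minimal sketch on two made-up tables (the names left and right and their values are assumptions for illustration) whose indexes only partly overlap.

```python
import pandas as pd

# Two small, made-up tables whose row indexes only partly overlap
left = pd.DataFrame({"price": [10, 20, 30]}, index=["a", "b", "c"])
right = pd.DataFrame({"qty": [1, 2, 3]}, index=["b", "c", "d"])

# Outer join (the default): union of the indexes, gaps filled with NaN
outer = pd.concat([left, right], axis=1, join="outer")

# Inner join: only the index labels present in both tables
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```

The outer result has one row per label in either table, while the inner result keeps only the labels b and c.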

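1.1.2 Vertical stacking (columns aligned, tables joined top to bottom)

(1) The concat function:

With the default axis=0, concat aligns on the column names and stacks two or more tables with different row indexes vertically. When the column names of the tables are not identical, join='inner' returns only the columns whose names appear in every table, while join='outer' returns the union of the column names and fills the gaps with missing values.

When the two tables have exactly the same column names, the result is the same whether join is inner or outer: the two tables are joined in full along the Y axis.

(2) The append method

The append method can also stack two tables vertically. It is intended for tables whose column names are identical; columns that do not match are filled with missing values. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions concat should be used instead. The basic syntax of append is as follows:

pandas.DataFrame.append(self, other, ignore_index=False, verify_integrity=False)

The following code applies both approaches to two slices of detail.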

a = detail.iloc[:100, :]

a.shape #(100, 19)

b = detail.iloc[100:, :]

b.shape #(2679, 19)

#--concat function

c = pd.concat([a, b], axis=0)

c.shape #(2779, 19)

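#--append method: intended for two tables with identical column names

# DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([a, b]) instead

d = a.append(b)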

d.shape #(2779, 19)
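To see the effect of join when the column names differ, here is a small sketch with two made-up tables (top and bottom are hypothetical names) that share only the id column. It also uses ignore_index=True to renumber the rows after stacking.

```python
import pandas as pd

# Two made-up tables that share only the "id" column
top = pd.DataFrame({"id": [1, 2], "price": [10.0, 20.0]})
bottom = pd.DataFrame({"id": [3, 4], "qty": [5, 6]})

# Outer join keeps the union of the columns; missing cells become NaN
outer = pd.concat([top, bottom], axis=0, join="outer", ignore_index=True)

# Inner join keeps only the columns common to both tables
inner = pd.concat([top, bottom], axis=0, join="inner", ignore_index=True)

print(outer.columns.tolist())
print(inner.columns.tolist())
```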

1.2 Merging data on a key

A key-based merge joins the rows of two datasets through one or more keys, similar to JOIN in SQL. Given two tables that share the same key but contain different fields, the rows are matched on those key fields and combined. The number of columns in the result is the sum of the column counts of the two source tables minus the number of join keys.

Like a database join, the merge function supports left, right, inner and outer joins. Compared with JOIN in SQL it also has features of its own, such as optionally sorting the result by the join keys during the merge (sort=True).

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

By adjusting these parameters as needed, a key-based merge can be carried out in several ways.

e = pd.merge(detail, info, on='emp_id')

e.shape #(7856, 39)

detail.shape #(2779, 19)

info.shape #(945, 21)
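In the result above, 39 columns is 19 + 21 minus the one join key. The sketch below repeats the idea on two tiny made-up tables (orders and staff are hypothetical, not the course data) and shows how the how and indicator parameters change the outcome.

```python
import pandas as pd

# Made-up tables that share the key column "emp_id"
orders = pd.DataFrame({"emp_id": [1, 1, 2, 4],
                       "dish": ["soup", "rice", "tea", "cake"]})
staff = pd.DataFrame({"emp_id": [1, 2, 3],
                      "name": ["Ann", "Bob", "Cid"]})

# Inner join: only keys found in both tables (emp_id 1 and 2)
inner = pd.merge(orders, staff, on="emp_id", how="inner")

# Outer join: all keys; indicator=True adds a _merge column telling
# whether each row came from the left table, the right table, or both
outer = pd.merge(orders, staff, on="emp_id", how="outer", indicator=True)

print(inner.shape)
print(outer)
```

Each table has 2 columns and there is 1 join key, so the inner result has 2 + 2 - 1 = 3 columns.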

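1.3 Combining overlapping data

During data analysis and processing, two datasets may have almost the same content, except that some features are complete in one table and missing in the other. In that case the combine_first method can be used to combine the overlapping data: missing values in the calling table are filled with the values at the same position (same index and column) in the other table.

The usage of combine_first is as follows.

pandas.DataFrame.combine_first(other)

Here other is the DataFrame that supplies values for the positions where the calling DataFrame is missing.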

import numpy as np

a = pd.DataFrame({'id':[1, np.nan, 3, np.nan, 5], 'cpu':[np.nan, 'i3', 'i5', np.nan, np.nan]})

b = pd.DataFrame({'id':[np.nan, 2, 3, np.nan, 5], 'cpu':['i7', 'i3', np.nan, 'i5', 'i3']})

a.combine_first(b)
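The same two tables can be used to check the result position by position. In this sketch the variable names first, second and filled are chosen here for illustration; the data is the same as above.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({"id": [1, np.nan, 3, np.nan, 5],
                      "cpu": [np.nan, "i3", "i5", np.nan, np.nan]})
second = pd.DataFrame({"id": [np.nan, 2, 3, np.nan, 5],
                       "cpu": ["i7", "i3", np.nan, "i5", "i3"]})

# Values from first win; second is consulted only where first is missing
filled = first.combine_first(second)

print(filled)
```

Row 3 of id stays missing because neither table has a value there.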

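2 Cleaning data

2.1 Detecting and handling duplicate values

2.1.1 Duplicate records

Record duplication means that several records have exactly the same values for one or more features.

① Method 1 deduplicates with a list and a custom function.

② Method 2 relies on the uniqueness of set elements, for example dish_set = set(dishes).

Comparing the two, method 1 is verbose and slow on large data, since every membership test scans the list. Method 2 is much shorter, but it does not preserve the order of the data.

③ pandas provides a deduplication method called drop_duplicates, which is only available on DataFrame and Series objects. It keeps the original order of the data and is both concise and reliable. Besides deduplicating a single feature, it can deduplicate a DataFrame based on one or several of its features.

pandas.DataFrame(Series).drop_duplicates(self, subset=None, keep='first', inplace=False)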

#--Method 1: write a custom function
def delRep(data):
    list1 = []  # holds the values seen so far, without duplicates
    for i in data:
        if i not in list1:
            list1.append(i)
    return list1

list1 = delRep(detail['order_id'])
print('Size before deduplication:', len(detail['order_id']))
print('Size after deduplication:', len(list1))

#--Method 2: deduplicate with a set (the order of the data is not preserved)
set([1, 2, 2, 4, 3, 1])
set(detail['order_id'])

#--Method 3: deduplicate with drop_duplicates
# Assign the result to a new variable so that detail stays complete for the later sections
detail_unique = detail.drop_duplicates(subset='order_id')
detail['order_id'].drop_duplicates()
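The subset and keep parameters decide which rows count as duplicates and which copy survives. A minimal sketch on a made-up orders table (the values are assumptions for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [417, 417, 301, 301, 413],
    "dish": ["soup", "soup", "rice", "tea", "cake"],
})

# No subset: a row is a duplicate only if every column matches
full_rows = orders.drop_duplicates()

# subset: compare only order_id; keep decides which copy survives
by_order_first = orders.drop_duplicates(subset="order_id", keep="first")
by_order_last = orders.drop_duplicates(subset="order_id", keep="last")

# keep=False drops every row whose order_id occurs more than once
no_repeats = orders.drop_duplicates(subset="order_id", keep=False)

print(by_order_first)
print(by_order_last)
```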

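2.1.2 Duplicate features

Drawing on mathematics and statistics, duplicated continuous features can be removed by using the similarity between features: when two features have a similarity of 1, one of them is dropped. A similarity of 1 means the two features are perfectly linearly related, which is not necessarily the same as having identical values. In pandas the similarity is computed with the corr method. It uses the "pearson" method by default; the method parameter also accepts "spearman" and "kendall".

Deduplicating through a similarity matrix has one drawback: it only works for numeric features, because the similarity between categorical features cannot be measured with a correlation coefficient.

Besides the similarity matrix, features can also be deduplicated with the DataFrame.equals method.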

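#--Method 1: use corr to check whether features are duplicated

detail[['counts', 'amounts']].corr()

# corr only handles numeric features; numeric_only=True (pandas 1.5+) skips the text column, which would otherwise raise an error in pandas 2.0+
detail[['counts', 'amounts', 'dishes_name']].corr(numeric_only=True)

#--Method 2: use equals to check whether features are duplicated

detail['counts'].equals(detail['amounts'])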
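Checking one pair at a time does not scale to many features. One possible way to apply equals to every pair of columns is sketched below; the table df and the column counts_copy are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Made-up table in which counts_copy duplicates counts
df = pd.DataFrame({
    "counts": [1, 2, 3],
    "counts_copy": [1, 2, 3],
    "amounts": [10, 25, 30],
    "dish": ["soup", "rice", "tea"],
})

# For every pair of columns, record the later one if its values are identical
duplicates = [
    later
    for earlier, later in combinations(df.columns, 2)
    if df[earlier].equals(df[later])
]

# Drop each duplicated column once
deduped = df.drop(columns=sorted(set(duplicates)))

print(duplicates)
print(deduped.columns.tolist())
```

Unlike corr, this comparison also works for categorical columns, but it only detects exact copies.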

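2.2 Detecting and handling missing values

2.2.1 Detecting missing values

When the values of one or more features in the data are incomplete, those values are called missing values.

pandas provides isnull to identify missing values and notnull to identify non-missing values. Both return an object of the same shape filled with the Boolean values True and False.

Combining sum with isnull or notnull shows how the missing values are distributed and how many the data contains in total.

isnull and notnull return exactly opposite results, so either one is enough to locate the missing values.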

detail.isnull() # Boolean values

detail.isnull().sum() # number of missing values in each column

detail.isnull().sum().sum() # number of missing values in the whole dataset
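The same counts can be verified by hand on a tiny table. This sketch uses made-up data; the extra mean() call, which gives the share of missing values per column, is an addition not used in the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, np.nan, 3], "cpu": [None, "i3", np.nan]})

per_column = df.isnull().sum()   # missing values per column
total = df.isnull().sum().sum()  # missing values in the whole table
present = df.notnull().sum().sum()  # non-missing values in the whole table
share = df.isnull().mean()       # fraction of missing values per column

print(per_column)
print(total, present)
```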

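2.2.2 Deletion

Deletion comes in two forms, deleting observation records and deleting features. It trades a smaller sample for more complete information and is the simplest way to handle missing values.

pandas provides the dropna method for this, which can delete either records or features.

pandas.DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

Among the commonly used parameters, axis=0 deletes records and axis=1 deletes features, while how='any' deletes when any value is missing and how='all' only when every value is missing.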

detail.dropna(axis=1, how='all', inplace=True)
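A short sketch of how the dropna parameters interact, on a made-up table with one completely empty column (all names and values here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, np.nan, 3, np.nan],
    "cpu": ["i3", np.nan, "i5", "i7"],
    "empty": [np.nan] * 4,
})

# how='any' on rows removes everything here, because "empty" is missing in every row
rows_any = df.dropna(axis=0, how="any")

# Dropping all-missing columns first avoids that
cols_all = df.dropna(axis=1, how="all")

# subset: only look at the id column when deciding which rows to drop
rows_subset = df.dropna(subset=["id"])

# thresh: keep rows that have at least 2 non-missing values
rows_thresh = cols_all.dropna(thresh=2)

print(cols_all)
print(rows_thresh)
```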

2.2.3 Replacement

a = pd.DataFrame({'id':[1, np.nan, 3, np.nan, 5], 'cpu':[np.nan, 'i3', 'i5', 'i3', np.nan]})

#--Numeric data

id_mean = a['id'].mean()

a['id'].fillna(id_mean) # fill missing values with the mean

#--Categorical data: replace missing values with the mode

zhongshu = a['cpu'].value_counts().index[0] # zhongshu: the mode (most frequent value)

a['cpu'].fillna(zhongshu)
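fillna returns a new object and leaves the original untouched, so the result has to be assigned to be kept. The sketch below stores the filled columns in a copy (the name filled is chosen here) and uses mode()[0] as an alternative way to obtain the most frequent value.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"id": [1, np.nan, 3, np.nan, 5],
                  "cpu": [np.nan, "i3", "i5", "i3", np.nan]})

filled = a.copy()

# Numeric column: the mean of the known values 1, 3 and 5 is 3.0
filled["id"] = a["id"].fillna(a["id"].mean())

# Categorical column: the most frequent known value is "i3"
filled["cpu"] = a["cpu"].fillna(a["cpu"].mode()[0])

print(filled)
```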

2.2.4 Interpolation

Interpolation is a process or method for estimating new data points within the range of known, discrete data points.

# Lagrange interpolation

from scipy.interpolate import lagrange

x = np.array([1, 2, 4, 6, 9])

y = np.array([2, 4, 6, 9, 10])

model = lagrange(x, y)

model([5])
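To make clear what Lagrange interpolation computes, here is a hand-written sketch of the formula using only NumPy. The function name lagrange_value and the sample points are assumptions for illustration; it is not how scipy implements lagrange internally.

```python
import numpy as np

def lagrange_value(x, y, x_new):
    """Evaluate the Lagrange interpolating polynomial through (x, y) at x_new."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    total = 0.0
    for i in range(len(x)):
        others = np.delete(x, i)
        # Basis polynomial: equals 1 at x[i] and 0 at every other known point
        basis = np.prod((x_new - others) / (x[i] - others))
        total += y[i] * basis
    return total

# Three points taken from y = x**2; the interpolating polynomial is x**2 itself
x = [1, 2, 4]
y = [1, 4, 16]
print(lagrange_value(x, y, 3))  # approximately 9
```

The polynomial passes through every known point exactly, which is why it can oscillate strongly when many points are used.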

2.3 Detecting and handling outliers

2.3.1 The 3σ rule

u = detail['counts'].mean() # mean

o = detail['counts'].std() # standard deviation

detail['counts'].apply(lambda x:x < u-3*o or x > u+3*o) # True marks values more than 3 standard deviations from the mean
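The same test can be written without apply by comparing the whole Series at once, which is usually faster. A sketch on made-up values, where one entry is far from the rest:

```python
import pandas as pd

# Twenty ordinary values and one extreme value (made-up data)
values = pd.Series([10] * 20 + [100])

mean = values.mean()
std = values.std()  # pandas uses the sample standard deviation (ddof=1)

# Vectorized 3-sigma test: True where a value lies outside mean ± 3*std
is_outlier = (values < mean - 3 * std) | (values > mean + 3 * std)
outliers = values[is_outlier]

print(outliers)
```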

2.3.2 Box plot analysis

import matplotlib.pyplot as plt

a = plt.boxplot(detail['counts'])

plt.show()

a['fliers'][0].get_ydata()
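The points a box plot draws as fliers can also be computed without plotting. The sketch below assumes the common default of whiskers at 1.5 times the interquartile range (IQR), and the data is made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)
print(outliers.tolist())
```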

3 Standardizing data

3.1 Min-max standardization

def MinMaxScaler(data):
    new_data = (data - data.min()) / (data.max() - data.min())
    return new_data

MinMaxScaler(detail['amounts'])

3.2 Standard deviation (z-score) standardization

def StandarScaler(data):
    new_data = (data - data.mean()) / data.std()
    return new_data

StandarScaler(detail['amounts'])

3.3 Decimal scaling standardization

import numpy as np

def DecimalScaler(data):
    # ceil rounds up to the nearest integer, giving the power of 10 to divide by
    new_data = data / 10 ** np.ceil(np.log10(data.abs().max()))
    return new_data

DecimalScaler(detail['amounts'])
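The three formulas can be compared side by side on a small made-up Series, which makes the resulting ranges easy to check by hand:

```python
import numpy as np
import pandas as pd

data = pd.Series([120.0, 45.0, 300.0, 80.0])

# Min-max: maps the minimum to 0 and the maximum to 1
min_max = (data - data.min()) / (data.max() - data.min())

# Z-score: result has mean 0 and standard deviation 1
z_score = (data - data.mean()) / data.std()

# Decimal scaling: the largest absolute value is 300, so divide by 10**3
decimal = data / 10 ** np.ceil(np.log10(data.abs().max()))

print(min_max.tolist())
print(z_score.round(3).tolist())
print(decimal.tolist())
```

Min-max and decimal scaling are sensitive to extreme values, because a single large value compresses all the others toward 0.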

4 Transforming data

4.1 Dummy variables for categorical data

pd.get_dummies(detail['dishes_name'])

pd.get_dummies(detail)
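A small sketch of what get_dummies produces, on a made-up table. The columns and prefix arguments restrict the encoding to one column and control the new column names; the element type of the indicator columns depends on the pandas version (Boolean in recent releases, integer in older ones).

```python
import pandas as pd

df = pd.DataFrame({"dish": ["soup", "rice", "soup"],
                   "amount": [10, 20, 30]})

# Encode only the dish column; numeric columns are left unchanged
dummies = pd.get_dummies(df, columns=["dish"], prefix="dish")

print(dummies)
```

Each category becomes its own column, so a feature with many distinct values produces a very wide table.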

4.2 Discretizing continuous data

4.2.1 Equal-width discretization

detail['amounts'] = pd.cut(detail['amounts'], 5)

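4.2.2 Equal-frequency discretization

# Re-read the data, because amounts was overwritten by the equal-width bins above
detail = pd.read_excel(r'F:\Desktop\meal_order_detail.xlsx')

def SameRateCue(data, k):
    w = data.quantile(np.linspace(0, 1, k + 1))  # k+1 quantile edges for k bins
    # include_lowest=True keeps the minimum value inside the first bin
    new_data = pd.cut(data, w, include_lowest=True)
    return new_data

SameRateCue(detail['amounts'], 5)
SameRateCue(detail['amounts'], 5).value_counts()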
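pandas also ships pd.qcut, which performs equal-frequency binning directly. The sketch below contrasts it with equal-width pd.cut on made-up, skewed data; it is offered as an alternative to the custom function, not as what the original code does.

```python
import pandas as pd

# Skewed made-up data: nine small values and one large one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: five bins of the same width, so most values fall in the first bin
equal_width = pd.cut(values, 5)

# Equal frequency: five bins with (roughly) the same number of values each
equal_freq = pd.qcut(values, 5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```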

4.2.3 k-means clustering discretization

def KmeansCut(data, k):
    from sklearn.cluster import KMeans  # pip install scikit-learn -i https://mirrors.aliyun.com/pypi/simple
    # Build the model (the n_jobs argument was removed from KMeans in scikit-learn 1.0)
    kmodel = KMeans(n_clusters=k)
    # Train the model
    kmodel.fit(np.array(data).reshape(len(data), 1))  # reshape into a single column
    # Put the cluster centers into a table and sort them
    c = pd.DataFrame(kmodel.cluster_centers_).sort_values(0)
    # Midpoints of adjacent centers serve as bin boundaries
    w = c.rolling(2).mean().iloc[1:]  # skip row 0, which has no predecessor
    # Add the first and last boundaries: 0 and the maximum
    w = [0] + list(w[0]) + [data.max()]
    data = pd.cut(data, w)
    return data

KmeansCut(detail['amounts'], 5).value_counts()
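The step that turns cluster centers into bin edges can be followed in isolation. In this sketch the centers are simply assumed (10, 50 and 90) instead of being fitted with KMeans, and the data is made up, so that the arithmetic is easy to check.

```python
import pandas as pd

# Assumed cluster centers, deliberately unsorted as KMeans may return them
centers = pd.Series([90.0, 10.0, 50.0]).sort_values()

# Mean of each adjacent pair of centers: (10+50)/2 = 30 and (50+90)/2 = 70
midpoints = centers.rolling(2).mean().iloc[1:]

data = pd.Series([5, 12, 45, 60, 88, 100])

# Add the outer boundaries: 0 on the left and the maximum on the right
edges = [0] + midpoints.tolist() + [data.max()]
binned = pd.cut(data, edges)

print(edges)
print(binned.value_counts().sort_index())
```

Because pd.cut uses intervals that are open on the left, a value equal to 0 would fall outside the first bin.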
