[Machine Learning Basics] Prerequisites (4): Mastering Pandas in One Article

Published: 2025/3/12

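Pandas is a powerful Python data analysis library that provides fast, flexible, and expressive data structures.

This article is part of the machine learning prerequisites tutorial series.

I. Series and DataFrame

Pandas is built on top of NumPy. For more on NumPy, see my earlier article "Prerequisites for Machine Learning (3): Master Common NumPy Usage in 30 Minutes".

"Prerequisites for Machine Learning (3): Master Common NumPy Usage in 30 Minutes": http://blog.caiyongji.com/2020/12/06/pre-ml-numpy-3.html

Pandas is especially well suited to tabular data such as SQL tables and Excel spreadsheets, to ordered or unordered time series, and to arbitrary matrix data with row and column labels.

Open Jupyter Notebook and import numpy and pandas to begin the tutorial:

import numpy as np
import pandas as pd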

1. pandas.Series

A Series is a one-dimensional ndarray with an index. Index values need not be unique, but they must be hashable.

pd.Series([1, 3, 5, np.nan, 6, 8])

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

The default index is the integers 0, 1, 2, 3, 4, 5. Passing the index parameter sets the labels to 'c', 'a', 'i', 'yong', 'j', 'i' instead.

pd.Series([1, 3, 5, np.nan, 6, 8], index=['c', 'a', 'i', 'yong', 'j', 'i'])

The output is shown below. Note that index labels may repeat.

c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64
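To make the effect of repeated labels concrete, here is a minimal sketch (the variable names are illustrative and not from the original article). Looking up a label that occurs once yields a scalar, while a label that occurs twice yields a Series containing every match.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['c', 'a', 'i', 'yong', 'j', 'i'])

unique_hit = s['a']                    # 'a' occurs once, so this is a scalar
duplicate_hit = s['i']                 # 'i' occurs twice, so this is a two-element Series
has_unique_index = s.index.is_unique   # False because 'i' repeats

print(unique_hit)
print(duplicate_hit)
print(has_unique_index)
```

Code that assumes a scalar result should therefore check `index.is_unique` first.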

2. pandas.DataFrame

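A DataFrame is a tabular structure with rows and columns. It can be thought of as a dictionary of Series objects that share one index.

pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['i', 'ii', 'iii'], columns=['A', 'B', 'C'])

The resulting table is shown below; index supplies the row labels and columns supplies the column labels.

	A	B	C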

i	1	2	3
ii	4	5	6
iii	7	8	9
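The "dictionary of Series" view can be sketched directly. In this minimal example (names such as `from_dict` are illustrative assumptions), the same table is built from a dict that maps column names to Series, and selecting a column hands back a Series.

```python
import pandas as pd

rows = ['i', 'ii', 'iii']

# Each dict key becomes a column; each Series supplies that column's values.
from_dict = pd.DataFrame({
    'A': pd.Series([1, 4, 7], index=rows),
    'B': pd.Series([2, 5, 8], index=rows),
    'C': pd.Series([3, 6, 9], index=rows),
})

column_a = from_dict['A']   # selecting one column returns a Series

print(from_dict)
print(type(column_a).__name__)
```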

II. Common Pandas Usage

1. Viewing data

Prepare the data: randomly generate a two-dimensional array with 6 rows and 4 columns, using the dates 20210101 through 20210106 as row labels and A, B, C, D as column labels.

import numpy as np
import pandas as pd

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20210101', periods=6), columns=list('ABCD'))
df

The table looks like this:

	A	B	C	D

2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-03	0.325415	-0.602236	-0.134508	1.28121
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804
2021-01-05	0.348708	1.27175	0.626011	-0.253845
2021-01-06	-0.816064	1.30197	0.656281	-1.2718

1.1 head() and tail()

View the first few rows:

df.head(2)

The table looks like this:

	A	B	C	D

2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841

View the last few rows:

df.tail(3)

The table looks like this:

	A	B	C	D

2021-01-04	-0.33032	-1.40384	-0.93809	1.48804
2021-01-05	0.348708	1.27175	0.626011	-0.253845
2021-01-06	-0.816064	1.30197	0.656281	-1.2718

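1.2 describe()

The describe method generates descriptive statistics for a DataFrame, which makes it easy to inspect how the data is distributed. Note that NaN values are excluded from these statistics.

df.describe()

The result looks like this:

	A	B	C	D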

count	6	6	6	6
mean	0.0825402	0.0497552	-0.181309	0.22896
std	0.551412	1.07834	0.933155	1.13114
min	-0.816064	-1.40384	-1.64592	-1.2718
25%	-0.18	-0.553043	-0.737194	-0.587269
50%	0.298188	-0.134555	0.106933	0.287363
75%	0.342885	0.987901	0.556601	1.16805
max	0.696541	1.30197	0.656281	1.48804

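Let us first review the relevant formulas, where $x_1, \dots, x_n$ are the non-missing values of a column.

Mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance (by default pandas computes the sample variance, dividing by $n-1$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

Standard deviation (std):

$$s = \sqrt{s^2}$$

The fields reported by describe are explained below, using column A as the example.

count is the number of values. Column A has 6 non-null values.
mean is the average. The mean of all non-null values in column A is 0.0825402.
std is the standard deviation. The standard deviation of column A is 0.551412.
min is the minimum. The minimum of column A is -0.816064; that is, 0% of the data is smaller than -0.816064.
25% is the first quartile. The first quartile of column A is -0.18; that is, 25% of the data is smaller than -0.18.
50% is the median (second quartile). The median of column A is 0.298188; that is, 50% of the data is smaller than 0.298188.
75% is the third quartile. The third quartile of column A is 0.342885; that is, 75% of the data is smaller than 0.342885.
max is the maximum. The maximum of column A is 0.696541; that is, 100% of the data is less than or equal to 0.696541.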
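As a worked check of the definitions above, the sketch below recomputes the mean, standard deviation, and first quartile of column A by hand and compares them with pandas. It assumes the column A values as printed in the table (rounded to six significant digits) and assumes that pandas' default quantile method is linear interpolation between sorted values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Column A as printed in the article's table (rounded values).
a = pd.Series([0.270961, 0.696541, 0.325415, -0.33032, 0.348708, -0.816064])

n = a.count()                                             # number of non-missing values
mean = a.sum() / n
sample_std = np.sqrt(((a - mean) ** 2).sum() / (n - 1))   # divide by n - 1

# First quartile: interpolate linearly around position 0.25 * (n - 1) in the sorted data.
ordered = a.sort_values().to_numpy()
pos = 0.25 * (n - 1)
lower = int(np.floor(pos))
q25 = ordered[lower] + (pos - lower) * (ordered[lower + 1] - ordered[lower])

print(mean, sample_std, q25)
print(a.describe())
```

The hand-computed numbers should agree with the mean, std, and 25% rows that describe prints, up to rounding of the inputs.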

1.3 T

T is short for transpose: rows and columns are swapped.

df.T

The table looks like this:

	2021-01-01	2021-01-02	2021-01-03	2021-01-04	2021-01-05	2021-01-06

A	0.270961	0.696541	0.325415	-0.33032	0.348708	-0.816064
B	-0.405463	0.136352	-0.602236	-1.40384	1.27175	1.30197
C	0.348373	-1.64592	-0.134508	-0.93809	0.626011	0.656281
D	0.828572	-0.69841	1.28121	1.48804	-0.253845	-1.2718

1.4 sort_values()

Sort by a given column. The code below sorts by column C in ascending order.

df.sort_values(by='C')

The table looks like this:

	A	B	C	D

2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804
2021-01-03	0.325415	-0.602236	-0.134508	1.28121
2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-05	0.348708	1.27175	0.626011	-0.253845
2021-01-06	-0.816064	1.30197	0.656281	-1.2718

1.5 nlargest()

Select the n rows with the largest values in a column. For example, df.nlargest(2,'A') returns the 2 rows with the largest values in column A.

df.nlargest(2,'A')

The table looks like this:

	A	B	C	D

2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-05	0.348708	1.27175	0.626011	-0.253845

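1.6 sample()

The sample method returns a random sample of rows.

df.sample(5) returns 5 random rows.

df.sample(5)

The frac parameter stands for fraction: frac=0.01 returns a random 1% of the rows. The row count is rounded, so on a large dataset this gives a small preview, whereas on our 6-row table it returns an empty DataFrame.

df.sample(frac=0.01)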
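The following sketch illustrates how frac translates into a row count and how random_state makes a sample reproducible. The 6-row table of integers is a stand-in chosen for illustration, not the article's random data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

half = df.sample(frac=0.5, random_state=42)         # round(0.5 * 6) = 3 rows
half_again = df.sample(frac=0.5, random_state=42)   # same seed, so the same rows
tiny = df.sample(frac=0.01)                         # round(0.01 * 6) = 0 rows

print(len(half), len(tiny))
print(half.equals(half_again))
```

Fixing random_state is useful in tutorials and tests, where the sampled rows should not change between runs.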

2. Selecting data

2.1 Selecting by label

Enter df['A'] to select column A.

df['A']

This outputs the data of column A, which is itself a Series object:

2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Name: A, dtype: float64

df[0:3] is equivalent to df.head(3). The slice syntax is the same one NumPy arrays use, which shows how closely Pandas follows NumPy conventions.

df[0:3]

The table looks like this:

	A	B	C	D

2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-03	0.325415	-0.602236	-0.134508	1.28121

Use loc to select by row and column labels.

df.loc['2021-01-01':'2021-01-02', ['A', 'B']]

The table looks like this:

	A	B

2021-01-01	0.270961	-0.405463
2021-01-02	0.696541	0.136352

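2.2 Selecting by position

iloc differs from loc: loc takes the labels themselves, whereas iloc takes integer positions. df.iloc[3:5, 0:3] selects the rows at positions 3 and 4 and the columns at positions 0, 1, and 2, that is, the 4th and 5th rows and the 1st, 2nd, and 3rd columns. Note that positions start at 0. The colon denotes a range, with the start on the left and the end on the right. For example, 3:5 is the left-closed, right-open interval [3,5), which does not include 5 itself.

df.iloc[3:5, 0:3]

	A	B	C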

2021-01-04	-0.33032	-1.40384	-0.93809
2021-01-05	0.348708	1.27175	0.626011

df.iloc[:, 1:3]

	B	C

2021-01-01	-0.405463	0.348373
2021-01-02	0.136352	-1.64592
2021-01-03	-0.602236	-0.134508
2021-01-04	-1.40384	-0.93809
2021-01-05	1.27175	0.626011
2021-01-06	1.30197	0.656281
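One difference between the two selectors is easy to miss: a loc slice includes its end label, while an iloc slice excludes its end position. The sketch below shows this on a small table of integers (a stand-in for the article's random data, chosen so the selected cells are easy to verify).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=pd.date_range('20210101', periods=6),
                  columns=list('ABCD'))

# Label-based: both '2021-01-01' and '2021-01-02' are included.
by_label = df.loc['2021-01-01':'2021-01-02', ['A', 'B']]

# Position-based: rows 3 and 4, columns 0 to 2; the end positions 5 and 3 are excluded.
by_position = df.iloc[3:5, 0:3]

print(by_label)
print(by_position)
```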

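2.3 Boolean indexing

A DataFrame can be filtered by a condition: rows where the condition is True are kept, and rows where it is False are dropped.

We build a filter that tests whether column A is greater than 0.

filter = df['A'] > 0  # note: this name shadows Python's built-in filter function
filter

The output is shown below; the rows for 2021-01-04 and 2021-01-06 are False.

2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Name: A, dtype: bool

We apply the filter to the dataset.

df[filter]
# equivalent: df[df['A'] > 0]

In the resulting table, the rows for 2021-01-04 and 2021-01-06 have been filtered out.

	A	B	C	D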

2021-01-01	0.270961	-0.405463	0.348373	0.828572
2021-01-02	0.696541	0.136352	-1.64592	-0.69841
2021-01-03	0.325415	-0.602236	-0.134508	1.28121
2021-01-05	0.348708	1.27175	0.626011	-0.253845
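Conditions can also be combined. The sketch below (with made-up data and the illustrative name `positive_a`) uses `&`, `|`, and `~` for element-wise and, or, and not; each comparison needs its own parentheses because these operators bind more tightly than `>`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [10, 20, -30, -40]})

positive_a = df['A'] > 0
positive_b = df['B'] > 0

both = df[(df['A'] > 0) & (df['B'] > 0)]    # rows where both conditions hold
either = df[positive_a | positive_b]        # rows where at least one holds
neither = df[~(positive_a | positive_b)]    # rows where none holds

print(both)
print(either)
print(neither)
```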

3. Handling missing values

Prepare the data.

df2 = df.copy()
df2.loc[df2.index[:3], 'E'] = 1.0  # loc is label-based, so pick the first three index labels explicitly
f_series = pd.Series({'2021-01-02': 1.0, '2021-01-03': 2.0, '2021-01-04': 3.0, '2021-01-05': 4.0, '2021-01-06': 5.0})
f_series.index = pd.to_datetime(f_series.index)  # match df2's DatetimeIndex so the rows align
df2['F'] = f_series
df2

The table looks like this:

	A	B	C	D	E	F

2021-01-01	0.270961	-0.405463	0.348373	0.828572	1	nan
2021-01-02	0.696541	0.136352	-1.64592	-0.69841	1	1
2021-01-03	0.325415	-0.602236	-0.134508	1.28121	1	2
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804	nan	3
2021-01-05	0.348708	1.27175	0.626011	-0.253845	nan	4
2021-01-06	-0.816064	1.30197	0.656281	-1.2718	nan	5

3.1 dropna()

Use the dropna method to drop NaN values. Note: dropna returns a new DataFrame and does not modify the original.

df2.dropna(how='any')

The code above drops every row in which any value is missing.

	A	B	C	D	E	F

2021-01-02	0.696541	0.136352	-1.64592	-0.69841	1	1
2021-01-03	0.325415	-0.602236	-0.134508	1.28121	1	2

3.2 fillna()

Use the fillna method to fill in NaN values.

df2.fillna(df2.mean())

The code above fills each gap with the mean of its column. Like dropna, fillna does not modify the original DataFrame; to update it, reassign the result: df2 = df2.fillna(df2.mean()).

The table looks like this:

	A	B	C	D	E	F

2021-01-01	0.270961	-0.405463	0.348373	0.828572	1	3
2021-01-02	0.696541	0.136352	-1.64592	-0.69841	1	1
2021-01-03	0.325415	-0.602236	-0.134508	1.28121	1	2
2021-01-04	-0.33032	-1.40384	-0.93809	1.48804	1	3
2021-01-05	0.348708	1.27175	0.626011	-0.253845	1	4
2021-01-06	-0.816064	1.30197	0.656281	-1.2718	1	5
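The behaviour of both methods can be verified on a tiny table. This is a minimal sketch with made-up values in columns E and F; it also confirms that neither call changes the original DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'E': [1.0, 1.0, np.nan], 'F': [np.nan, 2.0, 4.0]})

dropped = df.dropna(how='any')    # keeps only rows without any NaN
filled = df.fillna(df.mean())     # column means: E -> 1.0, F -> 3.0

missing_in_original = int(df.isna().sum().sum())   # still 2: df itself is unchanged

print(dropped)
print(filled)
print(missing_in_original)
```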

4. Operations

4.1 agg()

agg is short for aggregate.

Common aggregation methods are listed below:

mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes
count(): Compute count of group
std(): Standard deviation of groups
var(): Compute variance of groups
sem(): Standard error of the mean of groups
describe(): Generates descriptive statistics
first(): Compute first of group values
last(): Compute last of group values
nth() : Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values

df.mean()

This returns the mean of each column:

A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64

Pass the axis parameter to compute the mean of each row instead.

df.mean(axis=1)

Output:

2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
dtype: float64

What if we want several aggregate statistics for a single column?
In that case we can call the agg method:

df.agg(['std','mean'])['A']

The result shows the standard deviation (std) and the mean:

std     0.551412
mean    0.082540
Name: A, dtype: float64

To apply different aggregation functions to different columns:

df.agg({'A': ['max', 'mean'], 'B': ['mean', 'std', 'var']})

The result is shown below; a cell is NaN where that function was not requested for the column.

	A	B

max	0.696541	nan
mean	0.0825402	0.0497552
std	nan	1.07834
var	nan	1.16281

4.2 apply()

apply() calls a function on the data. For example, df.apply(np.sum) calls np.sum on each column and returns the sum of each column.

df.apply(np.sum)

The output is:

A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64

The apply method also accepts lambda expressions.

df.apply(lambda n: n*2)

	A	B	C	D

2021-01-01	0.541923	-0.810925	0.696747	1.65714
2021-01-02	1.39308	0.272704	-3.29185	-1.39682
2021-01-03	0.65083	-1.20447	-0.269016	2.56242
2021-01-04	-0.66064	-2.80768	-1.87618	2.97607
2021-01-05	0.697417	2.5435	1.25202	-0.50769
2021-01-06	-1.63213	2.60393	1.31256	-2.5436
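By default apply passes each column to the function; with axis=1 it passes each row instead. The sketch below contrasts the two on made-up data (the lambda bodies are illustrative choices, not from the original article).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col_sums = df.apply(lambda col: col.sum())                          # one result per column
row_ranges = df.apply(lambda row: row.max() - row.min(), axis=1)    # one result per row

print(col_sums)
print(row_ranges)
```

For simple arithmetic such as doubling every value, the vectorized expression `df * 2` gives the same result as the lambda version and is usually faster.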

4.3 value_counts()

The value_counts method counts how often each distinct value occurs. We generate some new integer data so that some values repeat.

np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0, 9, size=(6, 4)), columns=list('ABCD'))
df3

	A	B	C	D

0	1	6	7	8
1	4	8	5	0
2	5	8	1	3
3	8	3	3	2
4	8	3	7	0
5	7	8	4	3

Call the value_counts() method.

df3['A'].value_counts()

The output shows that the number 8 appears twice in column A, and every other number appears once.

8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
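value_counts can also report relative frequencies through its normalize parameter. A minimal sketch on made-up values:

```python
import pandas as pd

s = pd.Series([8, 4, 5, 8, 8, 7])

counts = s.value_counts()                  # absolute counts, most frequent first
shares = s.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```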

4.4 str

Pandas has built-in string processing methods.

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])
names.str.upper()

The code above converts every string in the Series to upper case.

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object

Capitalize the first letter:

names.str.capitalize()

The output is:

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

Test whether each string consists of digits:

names.str.isdigit()

The output is:

0    False
1    False
2    False
3    False
4     True
dtype: bool

Split strings:

tech_finance = ['GOOG,APPL,AMZN', 'JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]

Splitting on commas and keeping the first two items gives:

0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
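When the pieces should end up in separate columns rather than in lists, split accepts expand=True. A short sketch using the same ticker strings as above:

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])

parts = tickers.str.split(',', expand=True)   # DataFrame with one column per piece
first = tickers.str.split(',').str[0]         # first piece of each string

print(parts)
print(first)
```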

5. Merging

5.1 concat()

concat joins datasets end to end. First we prepare the data.

data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'], 'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)

Use the concat method to join the two datasets.

pd.concat([one, two])

The resulting table:

	Col1	Col2

0	A0	B0
1	A1	B1
2	A2	B2
3	A3	B3
0	C0	D0
1	C1	D1
2	C2	D2
3	C3	D3
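Note that the result above keeps each input's original index, so the labels 0 to 3 appear twice. The sketch below, on shortened two-row versions of the same tables, shows two commonly used options: ignore_index=True to renumber the rows, and axis=1 to place the tables side by side.

```python
import pandas as pd

one = pd.DataFrame({'Col1': ['A0', 'A1'], 'Col2': ['B0', 'B1']})
two = pd.DataFrame({'Col1': ['C0', 'C1'], 'Col2': ['D0', 'D1']})

stacked = pd.concat([one, two])                         # index is 0, 1, 0, 1
renumbered = pd.concat([one, two], ignore_index=True)   # index is 0, 1, 2, 3
side_by_side = pd.concat([one, two], axis=1)            # 2 rows, 4 columns

print(stacked)
print(renumbered)
print(side_by_side)
```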

5.2 merge()

merge corresponds to a JOIN in SQL: it connects two datasets through some relationship.

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4], 'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4], 'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

We join the two tables on name, using an outer join.

pd.merge(left=registrations, right=logins, how='outer', on='name')

The result is:

	reg_id	name	log_id

0	1	Andrew	2
1	2	Bobo	4
2	3	Claire	nan
3	4	David	nan
4	nan	Xavier	1
5	nan	Yolanda	3

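Note that how accepts one of four join types, {'left', 'right', 'outer', 'inner'}, which determine which rows are kept when a key is missing from one of the tables. For example, left keeps every row of the left table; where the right table has no matching row, the right-hand columns are filled with NaN. Put simply, treat the keys of the left and right tables as two sets.

left keeps all keys of the left table (the intersection included)
right keeps all keys of the right table (the intersection included)
outer keeps the union of the two tables
inner keeps the intersection of the two tables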
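The four join types can be compared by counting the rows each one returns for the registrations and logins tables defined above. This sketch also uses merge's indicator parameter, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(registrations, logins, how=how, on='name'))
    for how in ['left', 'right', 'outer', 'inner']
}

# indicator=True labels each row as 'both', 'left_only', or 'right_only'.
outer = pd.merge(registrations, logins, how='outer', on='name', indicator=True)
in_both = sorted(outer.loc[outer['_merge'] == 'both', 'name'])

print(row_counts)
print(in_both)
```

Only Andrew and Bobo appear in both tables, which is why the inner join has two rows and the outer join has six.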

6. GroupBy

Grouping in Pandas closely resembles the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. If you have never used SQL, that is fine: grouping is simply the process of splitting a table by the values of some column, computing statistics for each part, and combining the results.

Prepare the data.

np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

As the table shows, columns A and B contain many repeated values, so we can group by foo/bar or by one/two/three.

	A	B	C	D

0	foo	one	0.270961	0.325415
1	bar	one	-0.405463	-0.602236
2	foo	two	0.348373	-0.134508
3	bar	three	0.828572	1.28121
4	foo	two	0.696541	-0.33032
5	bar	two	0.136352	-1.40384
6	foo	one	-1.64592	-0.93809
7	foo	three	-0.69841	1.48804

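6.1 Grouping by one column

We use the groupby method to group the data in the table above.

df.groupby('A')

Running the code above shows that groupby returns an object of type DataFrameGroupBy. It cannot be inspected directly; an aggregation function has to be applied to it (see section 4.1).

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014C6742E248>

Let us try the aggregation function sum. We select the numeric columns C and D first, because summing the string column B is not meaningful.

df.groupby('A')[['C', 'D']].sum()

The table looks like this:

A	C	D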

bar	0.559461	-0.724868
foo	-1.02846	0.410533

6.2 Grouping by multiple columns

The groupby method accepts a list of several columns.

df.groupby(['A', 'B']).sum()

The grouped result is:

A	B	C	D

bar	one	-0.405463	-0.602236
	three	0.828572	1.28121
	two	0.136352	-1.40384
foo	one	-1.37496	-0.612675
	three	-0.69841	1.48804
	two	1.04491	-0.464828

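6.3 Applying several aggregations

We call agg() and pass it a list of aggregation functions. The code below groups by A and computes statistics for column C only.

df.groupby('A')['C'].agg(['sum', 'mean', 'std'])

The results of each aggregation function for the bar and foo groups are:

A	sum	mean	std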

bar	0.559461	0.186487	0.618543
foo	-1.02846	-0.205692	0.957242

6.4 Different aggregations for different columns

The code below aggregates columns C and D differently: it sums column C and computes the standard deviation of column D.

df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})

The output is:

A	C	D

bar	0.559461	1.37837
foo	-1.02846	0.907422
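pandas also offers named aggregation, where each output column is given a name together with the source column and the function to apply. The sketch below uses made-up data; the output names `c_sum` and `d_std` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [10.0, 20.0, 30.0, 50.0]})

# output_name=(source_column, aggregation)
summary = df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))

print(summary)
```

Compared with the dictionary form, this avoids lambdas for standard statistics and gives the result columns descriptive names.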

6.5 More

For more on the Pandas groupby method, see the official documentation: Pandas User Guide - groupby

Pandas User Guide - groupby: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

III. Advanced Pandas Usage

1. reshape

Reshaping means restructuring a table. A complex table often has to be converted into a form that is easier to reason about, for example by grouping on certain attributes and summarizing each group separately.

1.1 stack() and unstack()

The stack method splits a table into an index part and a data part: the index columns are kept, and the data columns are stacked into a single column.

Prepare the data.

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

The code above creates a MultiIndex (a hierarchical index).

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

We create a DataFrame that uses this MultiIndex.

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

The output is:

first	second	A	B

bar	one	0.270961	-0.405463
	two	0.348373	0.828572
baz	one	0.696541	0.136352
	two	-1.64592	-0.69841
foo	one	0.325415	-0.602236
	two	-0.134508	1.28121
qux	one	-0.33032	-1.40384
	two	-0.93809	1.48804

We call the stack method.

stacked = df.stack()
stacked

The stacked (compressed) table is shown below, with values rounded to six significant digits as in the table above. Note: the output in Jupyter Notebook/Lab may be laid out slightly differently; the index labels are repeated on every row here for readability.

first  second
bar    one     A    0.270961
bar    one     B   -0.405463
bar    two     A    0.348373
bar    two     B    0.828572
baz    one     A    0.696541
baz    one     B    0.136352
baz    two     A   -1.64592
baz    two     B   -0.69841
foo    one     A    0.325415
foo    one     B   -0.602236
foo    two     A   -0.134508
foo    two     B    1.28121
qux    one     A   -0.33032
qux    one     B   -1.40384
qux    two     A   -0.93809
qux    two     B    1.48804
dtype: float64

We call unstack to expand the data again.

stacked.unstack()

This returns the original table.

first	second	A	B

bar	one	0.270961	-0.405463
	two	0.348373	0.828572
baz	one	0.696541	0.136352
	two	-1.64592	-0.69841
foo	one	0.325415	-0.602236
	two	-0.134508	1.28121
qux	one	-0.33032	-1.40384
	two	-0.93809	1.48804

We now add the level parameter.

stacked.unstack(level=0)
# stacked.unstack(level=1)

With level=0, the first index level (bar, baz, foo, qux) moves to the columns, giving the output below. Try level=1 to see what it produces.

second		bar	baz	foo	qux

one	A	0.270961	0.696541	0.325415	-0.33032
one	B	-0.405463	0.136352	-0.602236	-1.40384
two	A	0.348373	-1.64592	-0.134508	-0.93809
two	B	0.828572	-0.69841	1.28121	1.48804
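The round trip between stack and unstack is easier to follow with small integers than with random numbers. The sketch below uses a reduced two-by-two MultiIndex and the values 0 to 7 (both are illustrative stand-ins for the article's data).

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = df.stack()                  # column labels A/B become the innermost index level
restored = stacked.unstack()          # by default the innermost level moves back to the columns
by_first = stacked.unstack(level=0)   # moves 'first' (bar/baz) to the columns instead

print(stacked)
print(restored)
print(by_first)
```

With level=0, each remaining row is identified by the pair (second, A/B), and each column holds the values for one entry of `first`.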

1.2 pivot_table()

pivot_table creates a pivot table, a table format that rearranges data dynamically and summarizes it by category.

We generate a DataFrame without a custom index.

np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df

The table looks like this:

	A	B	C	D	E

0	one	A	foo	-0.142359	0.0235001
1	one	B	foo	2.05722	0.456201
2	two	C	foo	0.283262	0.270493
3	three	A	bar	1.32981	-1.43501
4	one	B	bar	-0.154622	0.882817
5	one	C	bar	-0.0690309	-0.580082
6	two	A	foo	0.75518	-0.501565
7	three	B	foo	0.825647	0.590953
8	one	C	foo	-0.113069	-0.731616
9	one	A	bar	-2.36784	0.261755
10	two	B	bar	-0.167049	-0.855796
11	three	C	bar	0.685398	-0.187526

Looking at the data, columns A, B, and C clearly act as categorical attributes. We call the pivot_table method.

pd.pivot_table(df, values=['D', 'E'], index=['A', 'B'], columns=['C'])

The code above uses columns D and E as the data columns, A and B as a composite row index, and the values of C as the column index.

	('D', 'bar')	('D', 'foo')	('E', 'bar')	('E', 'foo')

('one', 'A')	-2.36784	-0.142359	0.261755	0.0235001
('one', 'B')	-0.154622	2.05722	0.882817	0.456201
('one', 'C')	-0.0690309	-0.113069	-0.580082	-0.731616
('three', 'A')	1.32981	nan	-1.43501	nan
('three', 'B')	nan	0.825647	nan	0.590953
('three', 'C')	0.685398	nan	-0.187526	nan
('two', 'A')	nan	0.75518	nan	-0.501565
('two', 'B')	-0.167049	nan	-0.855796	nan
('two', 'C')	nan	0.283262	nan	0.270493
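When several rows fall into the same cell, pivot_table aggregates them, using the mean unless told otherwise. The sketch below uses made-up data to show the default next to aggfunc='sum', with fill_value replacing empty cells.

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                   'C': ['foo', 'bar', 'foo', 'foo', 'foo'],
                   'D': [1.0, 2.0, 3.0, 5.0, 7.0]})

# Default aggregation is the mean; cells with no data become NaN.
mean_table = pd.pivot_table(df, values='D', index='A', columns='C')

# Sum instead of mean, with 0 in place of NaN.
sum_table = pd.pivot_table(df, values='D', index='A', columns='C',
                           aggfunc='sum', fill_value=0)

print(mean_table)
print(sum_table)
```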

2. Time series

date_range is the Pandas function for generating evenly spaced dates. We run the code below:

rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

date_range starts at 00:00:00 on January 1, 2021 and generates 100 timestamps spaced 1 second apart. The output is:

2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ...
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32

Let us change freq from S (second) to M (month end). In pandas 2.2 and later, the alias 's' replaces 'S' and 'ME' replaces 'M'; the older spellings are deprecated.

rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

Output:

2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ...
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32

We can also generate periods at quarterly frequency.

prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)

The output lists every quarter from the first quarter of 2018 through the fourth quarter of 2020.

2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
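A series indexed by dates can be regrouped into coarser intervals with resample, which the original article does not cover; the following is a small sketch under that caveat. It builds 14 daily values (0 to 13) and sums them in 7-day bins.

```python
import pandas as pd

rng = pd.date_range('2021-01-01', periods=14, freq='D')
daily = pd.Series(range(14), index=rng)

# Each bin covers 7 days, starting from the first timestamp.
weekly = daily.resample('7D').sum()

print(weekly)
```

The first bin sums the values 0 to 6 and the second sums 7 to 13.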

3. Categoricals

Pandas has a special data type called categorical, written dtype="category". Setting a column to this type lets us classify its values.

Prepare the data.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df

	id	raw_grade

0	1	a
1	2	b
2	3	b
3	4	a
4	5	a
5	6	e

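We add a new column grade and set its data type to category.

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

The grade column has only 3 distinct values: a, b, and e.

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

We rename a, b, and e, in that order, to very good, good, and very bad. Assigning to .cat.categories directly, as older tutorials do, was removed in pandas 2.0, so we use rename_categories.

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

The table now looks like this:

	id	raw_grade	grade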

0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

We sort the table. Categorical values sort by category order, not alphabetically:

df.sort_values(by="grade", ascending=False)

	id	raw_grade	grade

5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very good

Count the rows in each category:

df.groupby("grade").size()

The code above outputs:

grade
very good    3
good         2
very bad     1
dtype: int64
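The category order can also be declared explicitly, which makes comparisons such as max meaningful. This sketch uses pd.CategoricalDtype with ordered=True; the ranking from 'very bad' up to 'very good' is an illustrative choice and differs from the order used in the article's example.

```python
import pandas as pd

grades = pd.Series(['good', 'very bad', 'very good', 'good'])

# Explicit ranking from worst to best.
grade_dtype = pd.CategoricalDtype(['very bad', 'good', 'very good'], ordered=True)
ordered = grades.astype(grade_dtype)

ranked = ordered.sort_values().tolist()   # sorted by rank, not alphabetically
best = ordered.max()

print(ranked)
print(best)
```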

4. IO

Pandas can read and write data files directly, in formats such as CSV, JSON, and Excel. The supported formats are listed below.

Format Type	Data Description	Reader	Writer

text	CSV	read_csv	to_csv
text	Fixed-Width Text File	read_fwf	
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel	
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	ORC Format	read_orc	
binary	Msgpack (removed in pandas 1.0)	read_msgpack	to_msgpack
binary	Stata	read_stata	to_stata
binary	SAS	read_sas	
binary	SPSS	read_spss	
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq

We use CSV as the only worked example. For other formats, refer to the table above.

We load data from a CSV file. The domain name in the URL below is not important.

df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")

View the first 5 rows:

df.head(5)

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity

0	-122.23	37.88	41	880	129	322	126	8.3252	452600	NEAR BAY
1	-122.22	37.86	21	7099	1106	2401	1138	8.3014	358500	NEAR BAY
2	-122.24	37.85	52	1467	190	496	177	7.2574	352100	NEAR BAY
3	-122.25	37.85	52	1274	235	558	219	5.6431	341300	NEAR BAY
4	-122.25	37.85	52	1627	280	565	259	3.8462	342200	NEAR BAY
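Writing works the same way in reverse. The sketch below round-trips a small made-up table through CSV text held in memory, so it runs without a network connection or a file on disk; the column names and values are illustrative only.

```python
import io

import pandas as pd

original = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND'],
                         'total_rooms': [880, 7099]})

# index=False leaves the row labels out of the CSV text.
csv_text = original.to_csv(index=False)

# read_csv accepts any file-like object, not only paths and URLs.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```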

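5. Plotting

Pandas integrates with matplotlib, a powerful Python visualization library. This section only briefly introduces the plotting methods Pandas provides; matplotlib itself is covered in detail in the next article. Follow me so you do not miss the update.

np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

We call the plot method directly. Two things are worth noting:

The plot method here is the one provided by Pandas, not a direct matplotlib call.

Python statements do not need a terminating semicolon. Here, in a Jupyter notebook, the trailing semicolon suppresses the text representation of the returned object, so only the rendered figure is displayed.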

df.plot();

df.plot.bar();

df.plot.bar(stacked=True);

IV. More

The next article covers matplotlib.
