Advanced Pandas Tutorial: Using GroupBy

Published: 2024/2/28

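Contents

Introduction
Splitting the data
- Multi-index
- get_group
- dropna
- The groups attribute
- Index levels
Iterating over groups
Aggregation
- Common aggregation methods
- Applying several aggregations at once
- NamedAgg
- Different aggregations for different columns
Transformation
Filtering
Apply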

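Introduction

A pandas DataFrame (DF) supports groupby operations much like a database table does. A groupby operation usually has three parts: splitting the data, applying a function to each piece, and combining the results.

This article walks through groupby operations in pandas in detail.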
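To make the three steps concrete, here is a minimal sketch that performs split, apply and combine by hand and then compares the result with a single groupby call. The frame and its column names (key, value) are made up for illustration and do not appear elsewhere in this article.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame(
    {
        "key": ["a", "b", "a", "b"],
        "value": [1, 2, 3, 4],
    }
)

# Split: one sub-DataFrame per distinct key.
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: reduce each piece to a single number.
sums = {key: int(piece["value"].sum()) for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby performs all three steps in one call.
grouped = df.groupby("key")["value"].sum()

print(manual)
print(grouped)
```

Both Series hold the same totals per key; groupby simply saves writing the loop.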

Splitting the data

The goal of splitting is to divide the DF into groups. To run a groupby, the DF needs labels to group on, so we create one with suitable columns:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

In [61]: df
Out[61]:
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, groupby splits along the rows (axis=0). You can group by a single column or by several:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

Multi-index

Since version 0.24, if the data has a multi-level index, we can pick specific index levels to group by:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
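Because the frames in this article are filled with random numbers, the totals cannot be checked by eye. The sketch below repeats the idea with small fixed values (an assumption made for illustration) and groups by each index level by name.

```python
import pandas as pd

# Fixed values instead of random ones, so the sums are easy to verify.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)
df2 = df.set_index(["A", "B"])

# Group by one level of the MultiIndex at a time, referring to it by name.
by_a = df2.groupby(level="A").sum()
by_b = df2.groupby(level="B").sum()

print(by_a)
print(by_b)
```

Here by_a adds up the rows that share the same A label, and by_b does the same for B.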

get_group

get_group returns the rows that belong to a single group:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
   X  Y
1  B  4
3  B  2

dropna

By default, rows whose group key is NaN are left out of the groupby. Setting dropna=False keeps them:

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

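The groups attribute

A groupby object has a groups attribute. It is a dictionary whose keys are the group labels and whose values are the row labels that belong to each group.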

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6
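Since the values in groups are row labels, they can be handed straight to df.loc to pull out the matching rows. A small sketch, using a made-up frame with fixed values:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [10, 20, 30, 40]})
grouped = df.groupby("A")

# The row labels recorded for the "foo" group.
foo_rows = list(grouped.groups["foo"])

# Use those labels to select the rows; get_group returns the same rows.
foo_frame = df.loc[grouped.groups["foo"]]
same_rows = grouped.get_group("foo")

print(foo_rows)
print(foo_frame)
```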

Index levels

For an object with a multi-level index, groupby can be told which index level to group on:

In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....:

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]:
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Grouping by the first level:

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]:
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Grouping by the second level:

In [46]: s.groupby(level="second").sum()
Out[46]:
second
one    0.980950
two    1.991575
dtype: float64

Iterating over groups

Once we have a groupby object, we can loop over its groups with a for statement:

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....:
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by several fields, the group name is a tuple:

In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....:
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652
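Iteration is handy when each group needs its own bookkeeping. The sketch below, built on a small made-up frame, collects one result per group into a dictionary; note how the dictionary keys are plain strings for a single grouping column and tuples for two.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo"],
        "B": ["one", "one", "two"],
        "C": [1.0, 2.0, 3.0],
    }
)

# Single key: each name is a scalar label.
row_counts = {name: len(group) for name, group in df.groupby("A")}

# Two keys: each name is a tuple of labels.
pair_totals = {
    name: float(group["C"].sum()) for name, group in df.groupby(["A", "B"])
}

print(row_counts)
print(pair_totals)
```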

Aggregation

After grouping, we can aggregate each group:

In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]:
            C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

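When grouping by several keys, the result has a multi-level index by default. To get a fresh integer index, with the group keys kept as ordinary columns, pass as_index=False: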

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]:
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

The result is the same as calling reset_index on the default output:

In [74]: df.groupby(["A", "B"]).sum().reset_index()
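The equivalence can be checked directly. This sketch uses a small frame with fixed integers (an assumption for illustration) and compares the two results with DataFrame.equals.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)

# Group keys stay as ordinary columns.
flat = df.groupby(["A", "B"], as_index=False).sum()

# Group keys become the index first, then are moved back into columns.
reset = df.groupby(["A", "B"]).sum().reset_index()

print(flat)
print(flat.equals(reset))
```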

grouped.size() returns the size of each group:

In [75]: grouped.size()
Out[75]:
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2

grouped.describe() returns summary statistics for each group:

In [76]: grouped.describe()
Out[76]:
      C                                                    ...         D
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

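Common aggregation methods

The following aggregation methods are available on groupby objects:

Function	Description
mean()	Mean of each group
sum()	Sum of the group values
size()	Size of each group
count()	Count of values in each group
std()	Standard deviation
var()	Variance
sem()	Standard error of the mean
describe()	Descriptive statistics
first()	First value in each group
last()	Last value in each group
nth()	nth value in each group
min()	Minimum
max()	Maximum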
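A few of these are easy to confuse, size() and count() in particular. The sketch below uses a made-up frame with one missing value to show the difference: size() counts rows, while count() counts non-missing values, and mean() skips the missing value as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in the "foo" group.
df = pd.DataFrame({"A": ["foo", "foo", "bar"], "C": [1.0, np.nan, 3.0]})
grouped = df.groupby("A")

sizes = grouped.size()         # rows per group, NaN rows included
counts = grouped["C"].count()  # non-NaN values per group
means = grouped["C"].mean()    # NaN is skipped when averaging

print(sizes)
print(counts)
print(means)
```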

Applying several aggregations at once

Several aggregation methods can be passed together:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

The resulting columns can be renamed:

In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....:
Out[84]:
          foo       bar       baz
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
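agg also accepts the method names as strings, which avoids passing NumPy functions. A minimal sketch with fixed values; the frame and the new column names (total, average, spread) are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]}
)

# One output column per aggregation, named after the method.
stats = df.groupby("A")["C"].agg(["sum", "mean", "std"])

# Rename the generated columns afterwards.
renamed = stats.rename(
    columns={"sum": "total", "mean": "average", "std": "spread"}
)

print(stats)
print(renamed)
```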

NamedAgg

NamedAgg lets you define an aggregation more precisely. It has two fields, column and aggfunc.

In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....:

In [89]: animals
Out[89]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....:
Out[90]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

A plain tuple works as well:

In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....:
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Different aggregations for different columns

Passing a dictionary to agg applies a different aggregation to each column:

In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785
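The same call with fixed numbers makes the per-column behaviour easier to follow. In this sketch (data made up for illustration) column C is summed while column D keeps only its maximum.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1, 2, 3, 4],
        "D": [10, 20, 30, 40],
    }
)

# Keys of the dictionary are column names, values are aggregation names.
result = df.groupby("A").agg({"C": "sum", "D": "max"})

print(result)
```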

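Transformation

A transformation returns an object of the same size as the one being grouped. Transformations come up often in data analysis.

transform accepts a lambda: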

In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
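The series ts is not defined in this article, so the line above cannot be run on its own. The sketch below fills the gap with a small, made-up time series; the dates and values are assumptions chosen so that the result is easy to check. Each value is replaced by the max-minus-min range of its year, and the output keeps the original index.

```python
import pandas as pd

# Hypothetical time series: two observations in each of two years.
ts = pd.Series(
    [1.0, 4.0, 2.0, 10.0],
    index=pd.to_datetime(
        ["2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01"]
    ),
)

# Group by the year of each index entry, then broadcast the yearly range
# back to every row of that year.
spread = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

print(spread)
```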

Filling NA values:

In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
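The grouped object in the line above is not shown in this article. As a self-contained sketch, assume a frame with one missing value per illustration: the NaN in each group is replaced by that group's own mean, while the other values are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "foo" group has one missing value.
df = pd.DataFrame(
    {"A": ["foo", "foo", "bar", "bar"], "C": [1.0, np.nan, 4.0, 6.0]}
)

# Fill each NaN with the mean of its own group, not the global mean.
filled = df.groupby("A")["C"].transform(lambda x: x.fillna(x.mean()))

print(filled)
```

The mean of the "foo" group is computed from its one known value, so the gap is filled with 1.0 rather than with the overall mean of the column.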

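Filtering

The filter method takes a function, often a lambda, that returns True or False for each group. Groups for which it returns False are dropped: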

In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]:
3    3
4    3
5    3
dtype: int64
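filter works on a DataFrame groupby in the same way: the function receives each group as a sub-DataFrame, and the rows of the surviving groups are returned with their original index. A sketch on made-up data that keeps only groups with more than one row:

```python
import pandas as pd

# Hypothetical data: "foo" has two rows, "bar" has one.
df = pd.DataFrame({"A": ["foo", "foo", "bar"], "C": [1, 2, 3]})

# Keep every row that belongs to a group with more than one member.
kept = df.groupby("A").filter(lambda g: len(g) > 1)

print(kept)
```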

Apply

Some operations fit neither aggregation nor transformation. For these, pandas provides the apply method, which is more flexible:

In [156]: df
Out[156]:
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]:
A
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

A separately defined function can be passed as well:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:

In [161]: grouped.apply(f)
Out[161]:
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
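The function given to apply may also return a single value per group, in which case the result is indexed by the group labels. The sketch below uses a made-up frame and a hypothetical helper, value_range, to compute the spread of each group.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 5.0, 4.0]}
)


def value_range(group):
    """Return the difference between the largest and smallest value."""
    return group.max() - group.min()


# One number per group, indexed by the group label.
ranges = df.groupby("A")["C"].apply(value_range)

print(ranges)
```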

This article is also published at http://www.flydean.com/11-python-pandas-groupby/
