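Advanced pandas tutorial: using GroupBy
Table of contents
- Introduction
- Splitting the data
- MultiIndex
- get_group
- dropna
- The groups attribute
- Index levels
- Iterating over groups
- Aggregation
- Common aggregation methods
- Applying several aggregations at once
- NamedAgg
- Different aggregations for different columns
- Transformation
- Filtration
- Apply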
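Introduction
A pandas DataFrame (DF) supports groupby operations much like a database table does. A groupby operation generally consists of three steps: splitting the data, applying a function to each piece, and combining the results.
This article walks through the groupby operations in pandas in detail.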
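The three steps can be written out by hand to see what groupby does in a single call. The following is only a sketch: the `sales` frame and its column names are hypothetical and do not come from the article.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article.
sales = pd.DataFrame(
    {
        "city": ["Oslo", "Rome", "Oslo", "Rome", "Oslo"],
        "amount": [10, 20, 30, 40, 50],
    }
)

# 1. Split: one sub-frame per distinct value of "city".
pieces = {city: sales[sales["city"] == city] for city in sales["city"].unique()}

# 2. Apply: reduce each piece to a single number.
totals = {city: piece["amount"].sum() for city, piece in pieces.items()}

# 3. Combine: collect the per-group results into one Series.
manual = pd.Series(totals, name="amount").sort_index()

# groupby performs the same three steps in one call.
grouped = sales.groupby("city")["amount"].sum()

print(manual.to_dict() == grouped.to_dict())
```

The manual version exists only to make the steps visible; in practice the single groupby call is the idiomatic form.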
Splitting the data
The purpose of splitting is to divide a DataFrame into groups. To group, the DataFrame needs labels to group by, so they are specified when it is created:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df
Out[61]:
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, groupby splits the data along axis 0, that is, by rows. You can group by a single column or by several:
In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

MultiIndex
As of pandas 0.24, when the DataFrame has a MultiIndex you can group by a subset of its levels. The example below groups by every level except B:
In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938

get_group
get_group returns the rows that belong to a single group, selected by its key:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby("X").get_group("A")
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby("X").get_group("B")
Out[26]:
   X  Y
1  B  4
3  B  2

dropna
By default, rows whose group key is NaN are excluded from the result. Pass dropna=False to keep them as a group of their own:
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

The groups attribute
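A GroupBy object has a groups attribute. It is a dictionary whose keys are the distinct group keys and whose values are the index labels of the rows that fall into each group.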
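As a small sketch of how this mapping can be inspected, the snippet below converts each value to a plain list and checks the number of groups. The `pets` frame is a hypothetical example, not data from the article.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
pets = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog", "bird"],
        "age": [3, 5, 1, 7, 2],
    }
)

grouped = pets.groupby("kind")

# Each value in grouped.groups holds row labels; turn them into plain lists.
label_map = {key: list(labels) for key, labels in grouped.groups.items()}
print(label_map)

# The number of groups is available as ngroups or via len().
print(grouped.ngroups, len(grouped))
```

Because the values are row labels rather than the rows themselves, the mapping stays small even for wide frames.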
In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

Index levels
For an object with a hierarchical index (MultiIndex), groupby can group by one of the index levels:
In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....:

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]:
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Grouping by the first level:
In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]:
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Grouping by the second level:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one    0.980950
two    1.991575
dtype: float64

Iterating over groups
Once you have a GroupBy object, you can iterate over its groups with a for loop:
In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....:
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by several columns, each group name is a tuple:
In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....:
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652

Aggregation
Once the data is grouped, you can aggregate each group into a single value per column.
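An aggregation can also be named with a string instead of a NumPy function. The sketch below assumes a hypothetical frame `df_demo` and selects the numeric columns explicitly, so the non-numeric key column plays no part in the sum. Recent pandas versions emit a warning when a NumPy callable such as np.sum is passed to aggregate, which the string form avoids.

```python
import pandas as pd

# Hypothetical data with one key column and two numeric columns.
df_demo = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

# Select the numeric columns explicitly, then name the reduction as a string.
totals = df_demo.groupby("A")[["C", "D"]].agg("sum")
print(totals)

# The dedicated method gives the same result.
same = df_demo.groupby("A")[["C", "D"]].sum()
print(totals.equals(same))
```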
In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]:
            C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

When you group by several keys, the result has a MultiIndex by default. To keep the group keys as ordinary columns and get a fresh integer index instead, pass as_index=False:
In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]:
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

The same result can be obtained with reset_index:
In [74]: df.groupby(["A", "B"]).sum().reset_index()

grouped.size() returns the size of each group:
In [75]: grouped.size()
Out[75]:
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2

grouped.describe() computes summary statistics for each group:
In [76]: grouped.describe()
Out[76]:
      C                                                    ...         D
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

Common aggregation methods
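The following aggregation methods are available on GroupBy objects:
| mean() | Mean of each group |
| sum() | Sum of each group |
| size() | Number of rows in each group |
| count() | Number of non-NA values in each group |
| std() | Standard deviation of each group |
| var() | Variance of each group |
| sem() | Standard error of the mean of each group |
| describe() | Descriptive statistics for each group |
| first() | First value in each group |
| last() | Last value in each group |
| nth() | n-th value in each group |
| min() | Minimum of each group |
| max() | Maximum of each group |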
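Two entries in this table are easy to confuse: size() counts rows, while count() counts non-NA values. The sketch below shows the difference on a hypothetical `scores` frame that contains one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group "a".
scores = pd.DataFrame(
    {
        "team": ["a", "a", "a", "b", "b"],
        "score": [1.0, np.nan, 3.0, 4.0, 6.0],
    }
)
grouped = scores.groupby("team")["score"]

sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group
means = grouped.mean()    # NaN is skipped when averaging

print(pd.DataFrame({"size": sizes, "count": counts, "mean": means}))
```

For group "a" the two counts differ by one, which is a quick way to spot missing values per group.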
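Applying several aggregations at once
You can pass a list of aggregation functions to apply them all at once:
In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

The resulting columns can be renamed: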
In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....:
Out[84]:
          foo       bar       baz
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

NamedAgg
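NamedAgg lets you define an aggregation more precisely. It has two fields, column (the column to aggregate) and aggfunc (the function to apply), and the keyword it is assigned to becomes the name of the output column.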
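The two fields can be read back from a NamedAgg instance, and a plain (column, aggfunc) tuple produces the same result. This is a minimal sketch on a hypothetical `orders` frame; the names `spec` and `top_price` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data; column names are illustrative.
orders = pd.DataFrame(
    {
        "shop": ["x", "y", "x", "y"],
        "price": [2.0, 5.0, 4.0, 7.0],
    }
)

spec = pd.NamedAgg(column="price", aggfunc="max")
# The two fields are available by name.
print(spec.column, spec.aggfunc)

# The keyword (top_price) becomes the output column name.
by_named = orders.groupby("shop").agg(top_price=spec)
by_tuple = orders.groupby("shop").agg(top_price=("price", "max"))
print(by_named.equals(by_tuple))
```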
In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....:

In [89]: animals
Out[89]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....:
Out[90]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Alternatively, pass a plain tuple:
In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....:
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Different aggregations for different columns
Passing a dictionary to agg lets you apply a different aggregation to each column:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785

Transformation
A transformation returns an object of the same size as the one being grouped. Such operations come up frequently in data analysis.
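As a self-contained sketch of two typical transformations, the snippet below broadcasts each year's value range back onto every row and fills missing values with the group mean. The dated Series `ts` and the `readings` frame are hypothetical stand-ins, not data from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical dated series spanning two calendar years.
ts = pd.Series(
    [1.0, 3.0, 5.0, 10.0, 20.0],
    index=pd.to_datetime(
        ["2020-01-01", "2020-06-01", "2020-12-31", "2021-01-01", "2021-12-31"]
    ),
)

# Group by the year of each timestamp and broadcast the group range
# (max - min) back onto every row of that group.
year_range = ts.groupby(lambda stamp: stamp.year).transform(
    lambda values: values.max() - values.min()
)
print(year_range)

# Hypothetical frame with one missing reading at site "n".
readings = pd.DataFrame(
    {"site": ["n", "n", "s", "s"], "temp": [10.0, np.nan, 20.0, 30.0]}
)

# Replace each missing value with the mean of its own group.
filled = readings.groupby("site")["temp"].transform(
    lambda values: values.fillna(values.mean())
)
print(filled)
```

In both cases the result keeps the original index, so it can be assigned straight back as a new column.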
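transform accepts a lambda. Here ts is a Series indexed by dates and is grouped by year:
In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Filling NA values with the group mean: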
In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))

Filtration
The filter method takes a function, typically a lambda, and drops every group for which it returns False.
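The same idea applies to a DataFrame, where the function receives each group as a sub-frame. The sketch below keeps only the groups with at least two rows; `df_demo` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: group "a" has three rows, "b" has one, "c" has two.
df_demo = pd.DataFrame(
    {
        "key": ["a", "a", "b", "c", "a", "c"],
        "value": [1, 2, 3, 4, 5, 6],
    }
)

# Keep only the rows whose group has at least two members.
kept = df_demo.groupby("key").filter(lambda group: len(group) >= 2)
print(kept)
```

The surviving rows keep their original order and index labels, so the result is a row subset of the input rather than a per-group summary.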
In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]:
3    3
4    3
5    3
dtype: int64

Apply
Some operations fit neither aggregation nor transformation. For these, pandas provides the apply method, which is more flexible.
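One case that fits neither category is a function that returns several rows per group, but not the same number as the group contains. As a sketch, the snippet below picks the two largest values of C in each group of a hypothetical frame `df_demo`.

```python
import pandas as pd

# Hypothetical data; group "foo" has three rows and "bar" has two.
df_demo = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo"],
        "C": [5.0, 1.0, 9.0, 4.0, 7.0],
    }
)

# Two largest values of C per group: more than one value per group,
# but fewer rows than some groups contain.
top2 = df_demo.groupby("A")["C"].apply(lambda s: s.nlargest(2))
print(top2)
```

The result is indexed by the group key and the original row label, which makes it possible to trace each value back to its source row.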
In [156]: df
Out[156]:
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]:
A
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

You can also pass a named function:
In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:

In [161]: grouped.apply(f)
Out[161]:
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

This article is also published at http://www.flydean.com/11-python-pandas-groupby/