
50 Exercises to Master Pandas

Published: 2025/3/8


Author: 王大毛, 和鯨社區

Source: https://www.kesci.com/home/project/5ddc974ef41512002cec1dca

Edited by: 黃海廣

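Pandas is a data-processing tool built on NumPy and created for data-analysis tasks. It incorporates a large number of libraries and some standard data models, and it provides the functions and methods needed to work efficiently with large datasets. These exercises focus on basic operations on DataFrame and Series objects, including indexing, grouping, summarizing, and cleaning data.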

The code for this article can be downloaded from GitHub: https://github.com/fengdu78/Data-Science-Notes/tree/master/3.pandas/4.Pandas50

Basic operations

1. Import pandas under the alias pd and print its version

    import pandas as pd
    pd.__version__

Output:

    '0.22.0'

2. Create a Series from a list

    arr = [0, 1, 2, 3, 4]
    df = pd.Series(arr)  # without an explicit index, labels start at 0
    df

Output:

    0    0
    1    1
    2    2
    3    3
    4    4
    dtype: int64

3. Create a Series from a dictionary

    d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
    df = pd.Series(d)
    df

Output:

    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64

4. Create a DataFrame from a NumPy array

    import numpy as np
    dates = pd.date_range('today', periods=6)  # a datetime sequence used as the index
    num_arr = np.random.randn(6, 4)            # a random NumPy array as the data
    columns = ['A', 'B', 'C', 'D']             # a list used as the column names
    df = pd.DataFrame(num_arr, index=dates, columns=columns)
    df

Output (the values are random, so yours will differ):

                                       A         B         C         D
    2020-01-10 22:46:01.642021  0.277099  0.665053  0.882637 -0.598895
    2020-01-11 22:46:01.642021  0.365233 -2.529804 -0.699849  0.159623
    2020-01-12 22:46:01.642021 -0.831850 -2.099049 -0.976407 -0.342800
    2020-01-13 22:46:01.642021  0.680800  1.682999  0.144469 -2.503013
    2020-01-14 22:46:01.642021 -0.413880  0.876169 -1.047877  0.996865
    2020-01-15 22:46:01.642021  1.373956  0.029732 -0.549268 -0.287584

5. Create a DataFrame from a CSV file that uses ";" as the separator and gbk encoding

    df = pd.read_csv('test.csv', encoding='gbk', sep=';')

6. Create a DataFrame from a dictionary and set its index

    import numpy as np
    data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
            'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
            'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
            'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
    labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    df = pd.DataFrame(data, index=labels)
    df

Output:

       age animal priority  visits
    a  2.5    cat      yes       1
    b  3.0    cat      yes       3
    c  0.5  snake       no       2
    d  NaN    dog      yes       3
    e  5.0    dog       no       2
    f  2.0    cat       no       3
    g  4.5  snake       no       1
    h  NaN    cat      yes       1
    i  7.0    dog       no       2
    j  3.0    dog       no       1

7. Show basic information about df: the number of rows, the column names, and each column's non-null count and dtype

    df.info()
    # Related: df.describe() returns summary statistics for the numeric columns
    # df.describe()

Output:

    <class 'pandas.core.frame.DataFrame'>
    Index: 10 entries, a to j
    Data columns (total 4 columns):
    age         8 non-null float64
    animal      10 non-null object
    priority    10 non-null object
    visits      10 non-null int64
    dtypes: float64(1), int64(1), object(2)
    memory usage: 400.0+ bytes

8. Show the first 3 rows of df

    df.iloc[:3]
    # Alternative
    # df.head(3)

Output:

       age animal priority  visits
    a  2.5    cat      yes       1
    b  3.0    cat      yes       3
    c  0.5  snake       no       2

9. Select the animal and age columns of df

    df.loc[:, ['animal', 'age']]
    # Alternative
    # df[['animal', 'age']]

Output:

      animal  age
    a    cat  2.5
    b    cat  3.0
    c  snake  0.5
    d    dog  NaN
    e    dog  5.0
    f    cat  2.0
    g  snake  4.5
    h    cat  NaN
    i    dog  7.0
    j    dog  3.0

10. Select the animal and age columns for the rows at positions [3, 4, 8]

    df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

Output:

      animal  age
    d    dog  NaN
    e    dog  5.0
    i    dog  7.0

11. Select the rows where age is greater than 3

    df[df['age'] > 3]

Output:

       age animal priority  visits
    e  5.0    dog       no       2
    g  4.5  snake       no       1
    i  7.0    dog       no       2

12. Select the rows where age is missing

    df[df['age'].isnull()]

Output:

       age animal priority  visits
    d  NaN    dog      yes       3
    h  NaN    cat      yes       1

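13. Select the rows where age is between 2 and 4 (exclusive)

    df[(df['age'] > 2) & (df['age'] < 4)]
    # Alternative (pandas 1.3 or later; between() includes both endpoints by default)
    # df[df['age'].between(2, 4, inclusive='neither')]

Output:

       age animal priority  visits
    a  2.5    cat      yes       1
    b  3.0    cat      yes       3
    j  3.0    dog       no       1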
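Strict and inclusive bounds are easy to mix up here, so the following is a minimal sketch of the difference. The frame `ages` is a hypothetical cut-down version of the exercise data, not the `df` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame reusing a few ages from the exercise data.
ages = pd.DataFrame({'age': [2.5, 3.0, 0.5, np.nan, 5.0, 2.0, 4.5]},
                    index=list('abcdefg'))

# Strict inequalities: 2 < age < 4. Comparisons against NaN are False,
# so rows with a missing age drop out on their own.
strict = ages[(ages['age'] > 2) & (ages['age'] < 4)]

# between() includes both endpoints by default, so age == 2.0 is kept.
inclusive = ages[ages['age'].between(2, 4)]

print(strict.index.tolist())     # ['a', 'b']
print(inclusive.index.tolist())  # ['a', 'b', 'f']
```

Row f (age 2.0) is the only row that separates the two results, which is why the default `between(2, 4)` is not a drop-in replacement for an exclusive range.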

14. Change the age in row f to 1.5

    df.loc['f', 'age'] = 1.5

15. Compute the sum of visits

    df['visits'].sum()

Output:

    19

16. Compute the mean age for each kind of animal

    df.groupby('animal')['age'].mean()

Output:

    animal
    cat      2.333333
    dog      5.000000
    snake    2.500000
    Name: age, dtype: float64
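The group means above skip missing ages rather than treating them as zero. A minimal sketch on a hypothetical frame `animals` (not the exercise's `df`) makes that visible by showing the non-null count next to the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one of the two cats has no recorded age.
animals = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog'],
                        'age': [2.0, np.nan, 4.0, 6.0]})

# count reports non-null values only; mean is computed over those same values.
stats = animals.groupby('animal')['age'].agg(['count', 'mean'])
print(stats)
```

Here the cat mean is 2.0 from a single valid value, not 1.0 from two rows.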

17. Insert a new row k into df, then delete it

    # Insert
    df.loc['k'] = [5.5, 'dog', 'no', 2]
    # Delete
    df = df.drop('k')
    df

Output:

       age animal priority  visits
    a  2.5    cat      yes       1
    b  3.0    cat      yes       3
    c  0.5  snake       no       2
    d  NaN    dog      yes       3
    e  5.0    dog       no       2
    f  1.5    cat       no       3
    g  4.5  snake       no       1
    h  NaN    cat      yes       1
    i  7.0    dog       no       2
    j  3.0    dog       no       1

18. Count the number of rows for each kind of animal in df

    df['animal'].value_counts()

Output:

    dog      4
    cat      4
    snake    2
    Name: animal, dtype: int64

19. Sort by age in descending order, then by visits in ascending order

    df.sort_values(by=['age', 'visits'], ascending=[False, True])

Output:

       age animal priority  visits
    i  7.0    dog       no       2
    e  5.0    dog       no       2
    g  4.5  snake       no       1
    j  3.0    dog       no       1
    b  3.0    cat      yes       3
    a  2.5    cat      yes       1
    f  1.5    cat       no       3
    c  0.5  snake       no       2
    h  NaN    cat      yes       1
    d  NaN    dog      yes       3

20. Replace yes and no in the priority column with the booleans True and False

    df['priority'] = df['priority'].map({'yes': True, 'no': False})
    df

Output:

       age animal  priority  visits
    a  2.5    cat      True       1
    b  3.0    cat      True       3
    c  0.5  snake     False       2
    d  NaN    dog      True       3
    e  5.0    dog     False       2
    f  1.5    cat     False       3
    g  4.5  snake     False       1
    h  NaN    cat      True       1
    i  7.0    dog     False       2
    j  3.0    dog     False       1
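One caveat with mapping through a dictionary: values that have no key in the dictionary become NaN instead of being left alone. A minimal sketch, using a hypothetical Series `priority` with an extra value `'maybe'` that does not appear in the exercise data:

```python
import pandas as pd

# Hypothetical input containing a value the mapping does not cover.
priority = pd.Series(['yes', 'no', 'maybe'])

# map() looks each value up in the dict; missing keys yield NaN.
mapped = priority.map({'yes': True, 'no': False})
print(mapped.isna().tolist())  # [False, False, True]
```

In the exercise every value is either yes or no, so nothing is lost, but on messier data it is worth checking `mapped.isna().sum()` after the call.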

21. Replace snake with python in the animal column

    df['animal'] = df['animal'].replace('snake', 'python')
    df

Output:

       age  animal  priority  visits
    a  2.5     cat      True       1
    b  3.0     cat      True       3
    c  0.5  python     False       2
    d  NaN     dog      True       3
    e  5.0     dog     False       2
    f  1.5     cat     False       3
    g  4.5  python     False       1
    h  NaN     cat      True       1
    i  7.0     dog     False       2
    j  3.0     dog     False       1

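22. For each animal type and each distinct number of visits, compute the mean age. In other words, return a table whose rows are animal types, whose columns are visit counts, and whose values are the mean age for that combination.

    df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

Output:

    visits    1    2     3
    animal
    cat     2.5  NaN  2.25
    dog     3.0  6.0   NaN
    python  4.5  0.5   NaN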
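A pivot table of this kind can also be read as a two-key groupby followed by an unstack. The sketch below compares the two routes on a hypothetical frame `pets` (smaller than the exercise's `df`); it is meant as an illustration of the idea, not as a description of how pandas implements `pivot_table` internally.

```python
import pandas as pd

# Hypothetical data with the same three columns as the exercise.
pets = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'cat'],
                     'visits': [1, 3, 1, 2, 3],
                     'age': [2.5, 3.0, 3.0, 6.0, 1.5]})

pivoted = pets.pivot_table(index='animal', columns='visits',
                           values='age', aggfunc='mean')

# Same idea by hand: mean age per (animal, visits) pair,
# then move the visits level into the columns.
grouped = pets.groupby(['animal', 'visits'])['age'].mean().unstack()

print(pivoted)
print(grouped)
```

Combinations that never occur, such as a dog with 3 visits in this sample, show up as NaN in both tables.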

Advanced operations

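23. Given a DataFrame with a single integer column A, drop the rows whose value is a duplicate

    df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
    print(df)
    # Keep a row only when its value differs from the previous row
    # (this removes consecutive duplicates)
    df1 = df.loc[df['A'].shift() != df['A']]
    # Alternative (removes every repeated value, consecutive or not)
    # df1 = df.drop_duplicates(subset='A')
    print(df1)

Output:

        A
    0   1
    1   2
    2   2
    3   3
    4   4
    5   5
    6   5
    7   5
    8   6
    9   7
    10  7
       A
    0  1
    1  2
    3  3
    4  4
    5  5
    8  6
    9  7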
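The two approaches in exercise 23 agree only because the sample column is already sorted. A minimal sketch on a hypothetical unsorted column shows where they diverge:

```python
import pandas as pd

# Hypothetical column in which the value 1 reappears after other values.
frame = pd.DataFrame({'A': [1, 2, 2, 1, 3, 3, 1]})

# Compare each row with the previous one: only consecutive repeats are dropped.
consecutive = frame.loc[frame['A'].shift() != frame['A']]

# drop_duplicates keeps the first occurrence of every value.
unique = frame.drop_duplicates(subset='A')

print(consecutive['A'].tolist())  # [1, 2, 1, 3, 1]
print(unique['A'].tolist())       # [1, 2, 3]
```

Which one is "right" depends on whether the task means runs of equal values or repeated values anywhere in the column.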

24. Given an all-numeric DataFrame, subtract each row's mean from every value in that row

    df = pd.DataFrame(np.random.random(size=(5, 3)))
    print(df)
    df1 = df.sub(df.mean(axis=1), axis=0)
    print(df1)

Output (the values are random, so yours will differ):

              0         1         2
    0  0.465407  0.152497  0.861174
    1  0.623682  0.627339  0.495652
    2  0.835176  0.862376  0.693047
    3  0.319698  0.306709  0.654063
    4  0.234855  0.194232  0.438597
              0         1         2
    0 -0.027619 -0.340529  0.368148
    1  0.041457  0.045115 -0.086572
    2  0.038310  0.065509 -0.103819
    3 -0.107125 -0.120114  0.227239
    4 -0.054373 -0.094996  0.149368

25. Given a DataFrame with 5 columns, find the column with the smallest sum

    df = pd.DataFrame(np.random.random(size=(5, 5)), columns=list('abcde'))
    print(df)
    df.sum().idxmin()

Output (the values are random, so yours will differ):

              a         b         c         d         e
    0  0.653658  0.730994  0.223025  0.456730  0.288283
    1  0.937546  0.640995  0.197359  0.671524  0.006035
    2  0.392762  0.174955  0.053928  0.318634  0.464534
    3  0.741499  0.197861  0.988105  0.633780  0.914250
    4  0.469285  0.309043  0.162127  0.032480  0.863017
    'c'

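26. Given a DataFrame, for each value in column A compute the sum of its 3 largest values in column B

    df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                       'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
    print(df)
    # The original used .sum(level=0), which only works on pandas < 2.0;
    # grouping on index level 0 gives the same result on old and new versions
    df1 = df.groupby('A')['B'].nlargest(3).groupby(level=0).sum()
    print(df1)

Output:

        A    B
    0   a   12
    1   a  345
    2   a    3
    3   b    1
    4   b   45
    5   c   14
    6   a    4
    7   a   52
    8   b   54
    9   c   23
    10  c  235
    11  c   21
    12  b   57
    13  b    3
    14  c   87
    A
    a    409
    b    156
    c    345
    Name: B, dtype: int64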
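The `sum(level=0)` shortcut appears to have been removed in pandas 2.0, so here is a hedged sketch of two version-independent ways to get the same per-group top-3 sums. The variable names `via_nlargest` and `via_sort` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# Route 1: nlargest per group yields a Series indexed by (A, original row);
# group again on the first index level and sum.
via_nlargest = df.groupby('A')['B'].nlargest(3).groupby(level=0).sum()

# Route 2: sort once, keep the first 3 rows of each group, then sum per group.
via_sort = (df.sort_values('B', ascending=False)
              .groupby('A').head(3)
              .groupby('A')['B'].sum())

print(via_nlargest.to_dict())  # {'a': 409, 'b': 156, 'c': 345}
print(via_sort.to_dict())      # {'a': 409, 'b': 156, 'c': 345}
```

Both routes reproduce the sums shown in the exercise output (409, 156, 345).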

27. Given a DataFrame with columns A and B, where the values of A lie in 1-100 (inclusive), compute the sum of B for each block of 10 consecutive values of A

    df = pd.DataFrame({'A': [1, 2, 11, 11, 33, 34, 35, 40, 79, 99],
                       'B': [1, 2, 11, 11, 33, 34, 35, 40, 79, 99]})
    print(df)
    # observed=False keeps the empty bins in the result, matching the output below
    df1 = df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)), observed=False)['B'].sum()
    print(df1)

Output:

        A   B
    0   1   1
    1   2   2
    2  11  11
    3  11  11
    4  33  33
    5  34  34
    6  35  35
    7  40  40
    8  79  79
    9  99  99
    A
    (0, 10]        3
    (10, 20]      22
    (20, 30]       0
    (30, 40]     142
    (40, 50]       0
    (50, 60]       0
    (60, 70]       0
    (70, 80]      79
    (80, 90]       0
    (90, 100]     99
    Name: B, dtype: int64
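The bins produced by `pd.cut` are closed on the right by default, which is why 40 lands in (30, 40] above. A small sketch with hypothetical boundary values shows how the bin assignment changes with `right=False`; the integer codes are simply the positions of the bins.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to sit on bin edges.
values = pd.Series([1, 10, 11, 40, 100])
bins = np.arange(0, 101, 10)

# Default: intervals are (left, right], so 10 belongs to (0, 10].
right_closed = pd.cut(values, bins).cat.codes.tolist()

# right=False: intervals are [left, right), so 10 moves to [10, 20)
# and 100 falls outside the last bin [90, 100), giving code -1 (NaN).
left_closed = pd.cut(values, bins, right=False).cat.codes.tolist()

print(right_closed)  # [0, 0, 1, 3, 9]
print(left_closed)   # [0, 1, 1, 4, -1]
```

For data in 1-100 inclusive, the default right-closed bins starting at 0 cover every value, which is presumably why the exercise uses them.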

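28. Given a DataFrame, compute for each element the distance to the nearest 0 on its left (or to the start of the column) and store it in a new column Y

    df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
    # Method 1
    x = (df['X'] != 0).cumsum()
    y = x != x.shift()
    df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()
    print(df)

Output:

       X    Y
    0  7  1.0
    1  2  2.0
    2  0  0.0
    3  3  1.0
    4  4  2.0
    5  2  3.0
    6  5  4.0
    7  0  0.0
    8  3  1.0
    9  4  2.0

    # Method 2
    df['Y'] = df.groupby((df['X'] == 0).cumsum()).cumcount()
    # idxmax returns a label; with the default integer index it equals the position
    first_zero_idx = (df['X'] == 0).idxmax()
    # Rows before the first zero count from the start, so shift them by one.
    # A single iloc call avoids the chained assignment df['Y'].iloc[...] += 1
    df.iloc[:first_zero_idx, df.columns.get_loc('Y')] += 1
    print(df)

Output:

       X  Y
    0  7  1
    1  2  2
    2  0  0
    3  3  1
    4  4  2
    5  2  3
    6  5  4
    7  0  0
    8  3  1
    9  4  2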
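The vectorized solutions to exercise 28 are compact but hard to verify by eye. A plain-Python reference, sketched below with a hypothetical helper `distance_to_previous_zero`, can serve as a check: it counts the steps since the last zero, and counts from the start when no zero has been seen yet.

```python
def distance_to_previous_zero(values):
    """Return, for each element, the number of steps since the last 0.

    Before the first 0, positions are counted from the start (1, 2, ...).
    """
    distances = []
    steps = 0
    for v in values:
        # Reset at a zero, otherwise move one step further away.
        steps = 0 if v == 0 else steps + 1
        distances.append(steps)
    return distances


result = distance_to_previous_zero([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
print(result)  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```

The result matches the Y column printed by both methods in the exercise.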

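29. Given an all-numeric DataFrame, return the coordinates of its 3 largest values

    df = pd.DataFrame(np.random.random(size=(5, 3)))
    print(df)
    # unstack() indexes by (column, row), so each tuple reads (column, row)
    df.unstack().sort_values()[-3:].index.tolist()

Output (the values are random, so yours will differ):

              0         1         2
    0  0.974321  0.454025  0.018815
    1  0.323491  0.468609  0.834424
    2  0.340960  0.826835  0.503252
    3  0.812414  0.202745  0.965168
    4  0.633172  0.270281  0.915212
    [(2, 4), (2, 3), (0, 0)]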
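The coordinates in exercise 29 come back as (column, row) pairs because `unstack()` puts the column label first. If (row, column) order is preferred, one option is `stack()`, sketched here on a small hypothetical frame `grid` with fixed values so the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values instead of random ones.
grid = pd.DataFrame([[0.1, 0.9, 0.3],
                     [0.8, 0.2, 0.7],
                     [0.4, 0.6, 0.5]])

# stack() moves the columns into the inner index level, so each label
# is a (row, column) pair; nlargest returns them from largest to smallest.
top3 = grid.stack().nlargest(3).index.tolist()
print(top3)  # [(0, 1), (1, 0), (1, 2)]
```

Unlike the sort-and-slice version in the exercise, this lists the largest value first.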

30. Given a DataFrame, replace each negative value with the mean of its group

    df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                       'vals': [-12, 345, 3, 1, 45, 14, 4, -52, 54, 23, -235, 21, 57, 3, 87]})
    print(df)

    def replace(group):
        # Keep non-negative values; replace negatives with the mean of the
        # non-negative values in the same group (without mutating the input)
        mask = group < 0
        return group.where(~mask, group[~mask].mean())

    df['vals'] = df.groupby(['grps'])['vals'].transform(replace)
    print(df)

Output:

       grps  vals
    0     a   -12
    1     a   345
    2     a     3
    3     b     1
    4     b    45
    5     c    14
    6     a     4
    7     a   -52
    8     b    54
    9     c    23
    10    c  -235
    11    c    21
    12    b    57
    13    b     3
    14    c    87
       grps        vals
    0     a  117.333333
    1     a  345.000000
    2     a    3.000000
    3     b    1.000000
    4     b   45.000000
    5     c   14.000000
    6     a    4.000000
    7     a  117.333333
    8     b   54.000000
    9     c   23.000000
    10    c   36.250000
    11    c   21.000000
    12    b   57.000000
    13    b    3.000000
    14    c   87.000000

31. Compute the mean over a sliding window of size 3 within each group, ignoring NaN

    df = pd.DataFrame({'group': list('aabbabbbabab'),
                       'value': [1, 2, 3, np.nan, 2, 3, np.nan, 1, 7, 3, np.nan, 8]})
    print(df)
    g1 = df.groupby(['group'])['value']
    g2 = df.fillna(0).groupby(['group'])['value']
    # Window sum with NaN counted as 0, divided by the number of non-NaN values
    s = g2.rolling(3, min_periods=1).sum() / g1.rolling(3, min_periods=1).count()
    s.reset_index(level=0, drop=True).sort_index()

Output:

       group  value
    0      a    1.0
    1      a    2.0
    2      b    3.0
    3      b    NaN
    4      a    2.0
    5      b    3.0
    6      b    NaN
    7      b    1.0
    8      a    7.0
    9      b    3.0
    10     a    NaN
    11     b    8.0
    0     1.000000
    1     1.500000
    2     3.000000
    3     3.000000
    4     1.666667
    5     3.000000
    6     3.000000
    7     2.000000
    8     3.666667
    9     2.000000
    10    4.500000
    11    4.000000
    Name: value, dtype: float64
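The sum-divided-by-count construction in exercise 31 spells out what "ignoring NaN" means. As far as I can tell, `rolling(...).mean()` with `min_periods=1` behaves the same way, since it averages only the non-NaN values in each window. The sketch below checks this on a hypothetical ungrouped Series `values`; treat the equivalence as an observation on this sample, not a guarantee for every pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical series, including one window that is entirely NaN.
values = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 5.0])

# Manual version: window sum with NaN treated as 0, over the non-NaN count.
manual = (values.fillna(0).rolling(3, min_periods=1).sum()
          / values.rolling(3, min_periods=1).count())

# Built-in version: mean over the non-NaN values of each window.
builtin = values.rolling(3, min_periods=1).mean()

print(pd.DataFrame({'manual': manual, 'builtin': builtin}))
```

Both columns should read 1, 1, 2, 3, 3, NaN, 5; the NaN appears where the window contains no valid value at all.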

Series and DatetimeIndex

32. Create a Series s of random values indexed by every business day of 2015

    dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
    s = pd.Series(np.random.rand(len(dti)), index=dti)
    s.head(10)

Output (the values are random, so yours will differ):

    2015-01-01    0.503458
    2015-01-02    0.194185
    2015-01-05    0.550930
    2015-01-06    0.174309
    2015-01-07    0.316911
    2015-01-08    0.288385
    2015-01-09    0.293285
    2015-01-12    0.340436
    2015-01-13    0.630009
    2015-01-14    0.076130
    Freq: B, dtype: float64

33. Sum the values for all Wednesdays

    s[s.index.weekday == 2].sum()

Output:

    27.272318047689705

34. Compute the mean for each calendar month

    # On pandas 2.2 and later, the month-end alias is 'ME' rather than 'M'
    s.resample('M').mean()

Output:

    2015-01-31    0.375417
    2015-02-28    0.551560
    2015-03-31    0.540772
    2015-04-30    0.450957
    2015-05-31    0.369119
    2015-06-30    0.588625
    2015-07-31    0.584358
    2015-08-31    0.609751
    2015-09-30    0.511285
    2015-10-31    0.555546
    2015-11-30    0.528777
    2015-12-31    0.574317
    Freq: M, dtype: float64

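35. For each group of 4 consecutive months, find the date on which the maximum value occurred

    # On pandas 2.2 and later, write the frequency as '4ME' rather than '4M'
    s.groupby(pd.Grouper(freq='4M')).idxmax()

Output:

    2015-01-31   2015-01-15
    2015-05-31   2015-02-04
    2015-09-30   2015-06-02
    2016-01-31   2015-12-08
    dtype: datetime64[ns]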

36. Create the sequence of the third Thursday of every month in 2015-2016

    pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

Data cleaning

    df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                                   'Budapest_PaRis', 'Brussels_londOn'],
                       'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                       'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                       'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                                   '12. Air France', '"Swiss Air"']})
    df

Output:

                   Airline  FlightNumber           From_To  RecentDelays
    0               KLM(!)       10045.0      LoNDon_paris      [23, 47]
    1    <Air France> (12)           NaN      MAdrid_miLAN            []
    2  (British Airways. )       10065.0  londON_StockhOlm  [24, 43, 87]
    3       12. Air France           NaN    Budapest_PaRis          [13]
    4          "Swiss Air"       10085.0   Brussels_londOn      [67, 32]

37. Some values in the FlightNumber column are missing. The numbers should increase by 10 from one row to the next. Fill in the missing values and convert the column to integers.

    df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
    df

Output:

                   Airline  FlightNumber           From_To  RecentDelays
    0               KLM(!)         10045      LoNDon_paris      [23, 47]
    1    <Air France> (12)         10055      MAdrid_miLAN            []
    2  (British Airways. )         10065  londON_StockhOlm  [24, 43, 87]
    3       12. Air France         10075    Budapest_PaRis          [13]
    4          "Swiss Air"         10085   Brussels_londOn      [67, 32]

38. Split the From_To column at the underscore into two columns, From and To, and drop the original column

    temp = df.From_To.str.split('_', expand=True)
    temp.columns = ['From', 'To']
    df = df.join(temp)
    df = df.drop('From_To', axis=1)
    df

Output:

                   Airline  FlightNumber  RecentDelays      From         To
    0               KLM(!)         10045      [23, 47]    LoNDon      paris
    1    <Air France> (12)         10055            []    MAdrid      miLAN
    2  (British Airways. )         10065  [24, 43, 87]    londON  StockhOlm
    3       12. Air France         10075          [13]  Budapest      PaRis
    4          "Swiss Air"         10085      [67, 32]  Brussels     londOn

39. Normalize the case of From and To: first letter uppercase, the rest lowercase

    df['From'] = df['From'].str.capitalize()
    df['To'] = df['To'].str.capitalize()
    df

Output:

                   Airline  FlightNumber  RecentDelays      From         To
    0               KLM(!)         10045      [23, 47]    London      Paris
    1    <Air France> (12)         10055            []    Madrid      Milan
    2  (British Airways. )         10065  [24, 43, 87]    London  Stockholm
    3       12. Air France         10075          [13]  Budapest      Paris
    4          "Swiss Air"         10085      [67, 32]  Brussels     London

40. The Airline column contains stray punctuation; extract the correct airline names. For example, '(British Airways. )' should become 'British Airways'.

    # A raw string keeps \s from being treated as a string escape
    df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
    df

Output:

               Airline  FlightNumber  RecentDelays      From         To
    0              KLM         10045      [23, 47]    London      Paris
    1       Air France         10055            []    Madrid      Milan
    2  British Airways         10065  [24, 43, 87]    London  Stockholm
    3       Air France         10075          [13]  Budapest      Paris
    4        Swiss Air         10085      [67, 32]  Brussels     London
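To see why the pattern `[a-zA-Z\s]+` works on these names, it helps to run it outside pandas. The sketch below applies the same pattern with the standard `re` module to the five raw airline strings from the data-cleaning frame; `str.extract` likewise takes the first match in each string.

```python
import re

raw_names = ['KLM(!)', '<Air France> (12)', '(British Airways. )',
             '12. Air France', '"Swiss Air"']

# First run of letters and whitespace in each string.
pattern = re.compile(r'([a-zA-Z\s]+)')

# search() finds the first match; strip() removes the leading or trailing
# whitespace that the \s in the character class lets through.
cleaned = [pattern.search(name).group(1).strip() for name in raw_names]
print(cleaned)
# ['KLM', 'Air France', 'British Airways', 'Air France', 'Swiss Air']
```

For '12. Air France' the first match starts at the space after the dot, which is why the final `strip()` is needed.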

41. The RecentDelays column stores its data as lists, but we want each number in its own column: delay_1, delay_2, and so on, with NaN where a value is missing.

    delays = df['RecentDelays'].apply(pd.Series)
    delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]
    df = df.drop('RecentDelays', axis=1).join(delays)
    df

Output:

               Airline  FlightNumber      From         To  delay_1  delay_2  delay_3
    0              KLM         10045    London      Paris     23.0     47.0      NaN
    1       Air France         10055    Madrid      Milan      NaN      NaN      NaN
    2  British Airways         10065    London  Stockholm     24.0     43.0     87.0
    3       Air France         10075  Budapest      Paris     13.0      NaN      NaN
    4        Swiss Air         10085  Brussels     London     67.0     32.0      NaN
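`apply(pd.Series)` builds one Series per row, which can be slow on large frames. A common alternative, sketched here on a standalone hypothetical Series `recent_delays` that copies the exercise's lists, is to hand the lists to the DataFrame constructor in one go; shorter lists are padded with NaN.

```python
import pandas as pd

# Standalone copy of the list-valued column from the exercise.
recent_delays = pd.Series([[23, 47], [], [24, 43, 87], [13], [67, 32]])

# Build all columns at once; passing the index keeps row alignment for a later join.
delays = pd.DataFrame(recent_delays.tolist(), index=recent_delays.index)
delays.columns = ['delay_{}'.format(n) for n in range(1, delays.shape[1] + 1)]

print(delays)
```

The resulting frame has the same three delay columns as the exercise output and can be joined back in the same way.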

Hierarchical indexing

42. Use every combination of letters = ['A', 'B', 'C'] and numbers = list(range(4)) as the hierarchical index of a Series of random values

    letters = ['A', 'B', 'C']
    numbers = list(range(4))
    mi = pd.MultiIndex.from_product([letters, numbers])
    s = pd.Series(np.random.rand(12), index=mi)
    s

Output (the values are random, so yours will differ):

    A  0    0.250785
       1    0.146978
       2    0.596062
       3    0.064608
    B  0    0.709660
       1    0.515778
       2    0.483163
       3    0.524490
    C  0    0.360434
       1    0.987620
       2    0.527151
       3    0.636960
    dtype: float64

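43. Check whether the index of s is lexicographically sorted

    s.index.is_monotonic_increasing
    # The original calls below only work on pandas < 2.0:
    # s.index.is_lexsorted()
    # s.index.lexsort_depth == s.index.nlevels

Output:

    True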
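`MultiIndex.is_lexsorted()` appears to have been removed in pandas 2.0. The closest replacement I am aware of is `is_monotonic_increasing`, which checks that the index tuples are in ascending order. The sketch below applies it to an index built like the one in exercise 42; the names `swapped` and `resorted` are illustrative.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B', 'C'], range(4)])
print(mi.is_monotonic_increasing)  # True: (A, 0), (A, 1), ..., (C, 3)

# Swapping the levels gives (0, A), (1, A), ... which is no longer ascending.
swapped = mi.swaplevel(0, 1)
print(swapped.is_monotonic_increasing)  # False

# Sorting a Series by its index restores ascending order.
resorted = pd.Series(range(12), index=swapped).sort_index()
print(resorted.index.is_monotonic_increasing)  # True
```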

44. Select the rows whose second-level index is 1 or 3

    s.loc[:, [1, 3]]

Output:

    A  1    0.146978
       3    0.064608
    B  1    0.515778
       3    0.524490
    C  1    0.987620
       3    0.636960
    dtype: float64

45. Slice s: take the first level up to B and the second level from 2 to the end

    s.loc[pd.IndexSlice[:'B', 2:]]
    # Alternative
    # s.loc[slice(None, 'B'), slice(2, None)]

Output:

    A  2    0.596062
       3    0.064608
    B  2    0.483163
       3    0.524490
    dtype: float64

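46. Compute the sum for each first-level index value (one sum each for A, B, and C)

    # The original used s.sum(level=0), which only works on pandas < 2.0
    s.groupby(level=0).sum()
    # Alternative: letters become rows after unstack(), so sum across the columns
    # s.unstack().sum(axis=1)

Output:

    A    1.058433
    B    2.233091
    C    2.512164
    dtype: float64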
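When summing by the first index level through `unstack()`, the axis matters: after unstacking, the first level forms the rows, so the per-letter sums run across the columns (`axis=1`), while `axis=0` would give one sum per second-level value instead. A minimal sketch with hypothetical fixed values:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], [0, 1, 2]])
series = pd.Series([1, 2, 3, 10, 20, 30], index=mi)

# Sum per first-level label.
by_group = series.groupby(level=0).sum()

# unstack() puts A/B on the rows and 0/1/2 on the columns.
by_row = series.unstack().sum(axis=1)   # per letter
by_col = series.unstack().sum(axis=0)   # per number, a different question

print(by_group.to_dict())  # {'A': 6, 'B': 60}
print(by_row.to_dict())    # {'A': 6, 'B': 60}
print(by_col.to_dict())    # {0: 11, 1: 22, 2: 33}
```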

47. Swap the index levels. Is the new Series lexicographically sorted? If not, sort it.

    new_s = s.swaplevel(0, 1)
    print(new_s)
    # The original printed new_s.index.is_lexsorted(), which only works on pandas < 2.0
    print(new_s.index.is_monotonic_increasing)
    new_s = new_s.sort_index()
    print(new_s)

Output:

    0  A    0.250785
    1  A    0.146978
    2  A    0.596062
    3  A    0.064608
    0  B    0.709660
    1  B    0.515778
    2  B    0.483163
    3  B    0.524490
    0  C    0.360434
    1  C    0.987620
    2  C    0.527151
    3  C    0.636960
    dtype: float64
    False
    0  A    0.250785
       B    0.709660
       C    0.360434
    1  A    0.146978
       B    0.515778
       C    0.987620
    2  A    0.596062
       B    0.483163
       C    0.527151
    3  A    0.064608
       B    0.524490
       C    0.636960
    dtype: float64

Visualization

    import matplotlib.pyplot as plt
    df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
    plt.style.use('ggplot')

48. Draw a scatter plot of df

    df.plot.scatter("xs", "ys", color="black", marker="x")

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x1f188ddacc0>

49. Visualize the given 4-dimensional DataFrame

    df = pd.DataFrame({"productivity": [5, 2, 3, 1, 4, 5, 6, 7, 8, 3, 4, 8, 9],
                       "hours_in": [1, 9, 6, 5, 3, 9, 2, 9, 1, 7, 4, 2, 2],
                       "happiness": [2, 1, 3, 2, 3, 1, 2, 3, 1, 2, 2, 1, 3],
                       "caffienated": [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]})
    # Marker size encodes happiness and marker color encodes caffienated
    df.plot.scatter("hours_in", "productivity", s=df.happiness * 100, c=df.caffienated)

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x1f18aea4c18>

50. Plot two sets of data in the same figure, sharing the x-axis but using different y-axes

    df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                       "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                       "month": range(12)})
    ax = df.plot.bar("month", "revenue", color="green")
    df.plot.line("month", "advertising", secondary_y=True, ax=ax)
    ax.set_xlim((-1, 12))

Output:

    (-1, 12)

