50 Exercises to Master Pandas

Published: 2025/3/8

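Author: 王大毛, 和鯨社區 (Heywhale community)

Source: https://www.kesci.com/home/project/5ddc974ef41512002cec1dca

Edited by: 黃海廣

Pandas is a data-processing tool built on NumPy and created for data-analysis tasks. It brings in a number of libraries and some standard data models, and provides the functions and methods needed to work efficiently with large datasets. These exercises focus on basic operations on DataFrame and Series objects, including indexing, grouping, summarizing, and cleaning data.

The code for this article can be downloaded from GitHub: https://github.com/fengdu78/Data-Science-Notes/tree/master/3.pandas/4.Pandas50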

Basic operations

1. Import the Pandas library as pd and print its version number

import pandas as pd
pd.__version__

'0.22.0'

2. Create a Series from a list

arr = [0, 1, 2, 3, 4]
df = pd.Series(arr)  # if no index is given, it defaults to 0, 1, 2, ...
df

0    0
1    1
2    2
3    3
4    4
dtype: int64

3. Create a Series from a dictionary

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
df = pd.Series(d)
df

a    1
b    2
c    3
d    4
e    5
dtype: int64

4. Create a DataFrame from a NumPy array

import numpy as np
dates = pd.date_range('today', periods=6)  # a date sequence used as the index
num_arr = np.random.randn(6, 4)            # a random NumPy array as the data
columns = ['A', 'B', 'C', 'D']             # a list used as the column names
df = pd.DataFrame(num_arr, index=dates, columns=columns)
df

                                   A         B         C         D
2020-01-10 22:46:01.642021  0.277099  0.665053  0.882637 -0.598895
2020-01-11 22:46:01.642021  0.365233 -2.529804 -0.699849  0.159623
2020-01-12 22:46:01.642021 -0.831850 -2.099049 -0.976407 -0.342800
2020-01-13 22:46:01.642021  0.680800  1.682999  0.144469 -2.503013
2020-01-14 22:46:01.642021 -0.413880  0.876169 -1.047877  0.996865
2020-01-15 22:46:01.642021  1.373956  0.029732 -0.549268 -0.287584

5. Create a DataFrame from a CSV file that uses ";" as the separator and gbk encoding

df = pd.read_csv('test.csv', encoding='gbk', sep=';')

6. Create a DataFrame from a dictionary and set its index

import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  2.0    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1

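7. Show basic information about df: the number of rows, the column names, and the count and type of the values in each column

df.info()
# Method 2 (returns summary statistics for the numeric columns rather than dtypes and non-null counts)
# df.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
age         8 non-null float64
animal      10 non-null object
priority    10 non-null object
visits      10 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

8. Show the first 3 rows of df

df.iloc[:3]
# Method 2
# df.head(3)

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2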

9. Select the animal and age columns of df

df.loc[:, ['animal', 'age']]
# Method 2
# df[['animal', 'age']]

  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0

10. Select the animal and age columns for the rows at positions [3, 4, 8]

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

  animal  age
d    dog  NaN
e    dog  5.0
i    dog  7.0

11. Select the rows where age is greater than 3

df[df['age'] > 3]

   age animal priority  visits
e  5.0    dog       no       2
g  4.5  snake       no       1
i  7.0    dog       no       2

12. Select the rows where age is missing

df[df['age'].isnull()]

   age animal priority  visits
d  NaN    dog      yes       3
h  NaN    cat      yes       1

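13. Select the rows where age is between 2 and 4 (exclusive)

df[(df['age'] > 2) & (df['age'] < 4)]
# Method 2 (pandas 1.3+; by default between() includes both endpoints)
# df[df['age'].between(2, 4, inclusive='neither')]

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
j  3.0    dog       no       1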
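The strict comparison and `between` behave differently at the endpoints. The following is a minimal sketch using a made-up `ages` Series (not the exercise data) to show that `between`, called with its default arguments, keeps both bounds while the strict mask drops them:

```python
import pandas as pd

# Hypothetical ages, chosen so that both endpoints (2 and 4) appear
ages = pd.Series([1.5, 2.0, 2.5, 3.0, 4.0, 4.5])

# Strict comparison: 2.0 and 4.0 are excluded
strict = ages[(ages > 2) & (ages < 4)]

# between() with default arguments: 2.0 and 4.0 are included
inclusive = ages[ages.between(2, 4)]

print(strict.tolist())
print(inclusive.tolist())
```

For an "exclusive" requirement such as the one in exercise 13, the strict mask is therefore the safer default.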

14. Change the age in row f to 1.5

df.loc['f', 'age'] = 1.5

15. Compute the sum of visits

df['visits'].sum()

19

16. Compute the mean age for each kind of animal

df.groupby('animal')['age'].mean()

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

17. Insert a new row k into df, then delete it

# Insert
df.loc['k'] = [5.5, 'dog', 'no', 2]
# Delete
df = df.drop('k')
df

   age animal priority  visits
a  2.5    cat      yes       1
b  3.0    cat      yes       3
c  0.5  snake       no       2
d  NaN    dog      yes       3
e  5.0    dog       no       2
f  1.5    cat       no       3
g  4.5  snake       no       1
h  NaN    cat      yes       1
i  7.0    dog       no       2
j  3.0    dog       no       1

18. Count the number of each kind of animal in df

df['animal'].value_counts()

dog      4
cat      4
snake    2
Name: animal, dtype: int64

19. Sort by age in descending order first, then by visits in ascending order

df.sort_values(by=['age', 'visits'], ascending=[False, True])

   age animal priority  visits
i  7.0    dog       no       2
e  5.0    dog       no       2
g  4.5  snake       no       1
j  3.0    dog       no       1
b  3.0    cat      yes       3
a  2.5    cat      yes       1
f  1.5    cat       no       3
c  0.5  snake       no       2
h  NaN    cat      yes       1
d  NaN    dog      yes       3

20. Replace yes and no in the priority column with the booleans True and False

df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

   age animal  priority  visits
a  2.5    cat      True       1
b  3.0    cat      True       3
c  0.5  snake     False       2
d  NaN    dog      True       3
e  5.0    dog     False       2
f  1.5    cat     False       3
g  4.5  snake     False       1
h  NaN    cat      True       1
i  7.0    dog     False       2
j  3.0    dog     False       1

21. Replace snake with python in the animal column

df['animal'] = df['animal'].replace('snake', 'python')
df

   age  animal  priority  visits
a  2.5     cat      True       1
b  3.0     cat      True       3
c  0.5  python     False       2
d  NaN     dog      True       3
e  5.0     dog     False       2
f  1.5     cat     False       3
g  4.5  python     False       1
h  NaN     cat      True       1
i  7.0     dog     False       2
j  3.0     dog     False       1

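22. For each kind of animal and each distinct number of visits, compute the mean age. In other words, return a table whose rows are animal kinds, whose columns are visit counts, and whose values are the mean age for that animal kind and visit count

df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

visits    1    2     3
animal
cat     2.5  NaN  2.25
dog     3.0  6.0   NaN
python  4.5  0.5   NaN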

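Advanced operations

23. Given a DataFrame with a column A of integers, delete the rows with repeated values

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
print(df)
# Keeps a row only when it differs from the row directly above it
df1 = df.loc[df['A'].shift() != df['A']]
# Method 2 (drops every repeat, adjacent or not; same result for this data)
# df1 = df.drop_duplicates(subset='A')
print(df1)

    A
0   1
1   2
2   2
3   3
4   4
5   5
6   5
7   5
8   6
9   7
10  7

   A
0  1
1  2
3  3
4  4
5  5
8  6
9  7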
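The two approaches in exercise 23 agree only because the repeated values in that data are always adjacent. The sketch below uses a different, made-up column in which a value recurs later on, to show where the `shift` comparison and `drop_duplicates` part ways:

```python
import pandas as pd

# Hypothetical data: the value 1 comes back after other values
df = pd.DataFrame({'A': [1, 2, 2, 1, 3, 3, 1]})

# Removes a row only when it equals the row directly above it
consecutive = df.loc[df['A'].shift() != df['A']]

# Removes every later occurrence of a value already seen
unique = df.drop_duplicates(subset='A')

print(consecutive['A'].tolist())
print(unique['A'].tolist())
```

Which one is "correct" depends on whether the task means run-length de-duplication or global uniqueness.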

24. Given an all-numeric DataFrame, subtract the row mean from every number in that row

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
df1 = df.sub(df.mean(axis=1), axis=0)
print(df1)

          0         1         2
0  0.465407  0.152497  0.861174
1  0.623682  0.627339  0.495652
2  0.835176  0.862376  0.693047
3  0.319698  0.306709  0.654063
4  0.234855  0.194232  0.438597

          0         1         2
0 -0.027619 -0.340529  0.368148
1  0.041457  0.045115 -0.086572
2  0.038310  0.065509 -0.103819
3 -0.107125 -0.120114  0.227239
4 -0.054373 -0.094996  0.149368

25. Given a DataFrame with 5 columns, find the column with the smallest sum

df = pd.DataFrame(np.random.random(size=(5, 5)), columns=list('abcde'))
print(df)
df.sum().idxmin()

          a         b         c         d         e
0  0.653658  0.730994  0.223025  0.456730  0.288283
1  0.937546  0.640995  0.197359  0.671524  0.006035
2  0.392762  0.174955  0.053928  0.318634  0.464534
3  0.741499  0.197861  0.988105  0.633780  0.914250
4  0.469285  0.309043  0.162127  0.032480  0.863017

'c'

26. Given a DataFrame, for each value in column A compute the sum of its 3 largest values in column B

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
# sum(level=0) works in the pandas 0.22 used here; it was removed in pandas 2.0
df1 = df.groupby('A')['B'].nlargest(3).sum(level=0)
print(df1)

    A    B
0   a   12
1   a  345
2   a    3
3   b    1
4   b   45
5   c   14
6   a    4
7   a   52
8   b   54
9   c   23
10  c  235
11  c   21
12  b   57
13  b    3
14  c   87

A
a    409
b    156
c    345
Name: B, dtype: int64
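The `level=` argument of `sum` used in exercise 26 is not available in pandas 2.0 and later. A sketch of an equivalent that should run on both old and new versions groups on the first index level instead; the variable names `top3` and `result` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# nlargest(3) per group returns a Series indexed by (A, original row label)
top3 = df.groupby('A')['B'].nlargest(3)

# Summing over the first index level collapses it back to one value per A
result = top3.groupby(level=0).sum()
print(result)
```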

27. Given a DataFrame with columns A and B, where the values of A lie in 1-100 (inclusive), compute the sum of B for each step of 10 in A

df = pd.DataFrame({'A': [1, 2, 11, 11, 33, 34, 35, 40, 79, 99],
                   'B': [1, 2, 11, 11, 33, 34, 35, 40, 79, 99]})
print(df)
df1 = df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()
print(df1)

    A   B
0   1   1
1   2   2
2  11  11
3  11  11
4  33  33
5  34  34
6  35  35
7  40  40
8  79  79
9  99  99

A
(0, 10]        3
(10, 20]      22
(20, 30]       0
(30, 40]     142
(40, 50]       0
(50, 60]       0
(60, 70]       0
(70, 80]      79
(80, 90]       0
(90, 100]     99
Name: B, dtype: int64
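Two details of exercise 27 are easy to miss: `pd.cut` builds right-closed intervals, so 40 falls in (30, 40], and the empty bins appear in the result only when unobserved categories are kept. The sketch below makes the second point explicit by passing `observed=False`, which assumes a pandas version that supports this `groupby` argument (it is not available in very old releases such as 0.22):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 11, 11, 33, 34, 35, 40, 79, 99],
                   'B': [1, 2, 11, 11, 33, 34, 35, 40, 79, 99]})

bins = np.arange(0, 101, 10)       # edges 0, 10, ..., 100
labels = pd.cut(df['A'], bins)     # right-closed intervals such as (30, 40]

# observed=False keeps bins that contain no rows; their sum is 0
sums = df.groupby(labels, observed=False)['B'].sum()
print(sums)
```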

28. Given a DataFrame, compute for each element the distance to the nearest 0 on its left (or to the start of the column), and store the result in a new column Y

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

# Method 1
x = (df['X'] != 0).cumsum()
y = x != x.shift()
df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()
print(df)

   X    Y
0  7  1.0
1  2  2.0
2  0  0.0
3  3  1.0
4  4  2.0
5  2  3.0
6  5  4.0
7  0  0.0
8  3  1.0
9  4  2.0

# Method 2
df['Y'] = df.groupby((df['X'] == 0).cumsum()).cumcount()
first_zero_idx = (df['X'] == 0).idxmax()
df['Y'].iloc[0:first_zero_idx] += 1
print(df)

   X  Y
0  7  1
1  2  2
2  0  0
3  3  1
4  4  2
5  2  3
6  5  4
7  0  0
8  3  1
9  4  2

29. Given an all-numeric DataFrame, return the coordinates of the 3 largest values

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
# unstack() puts the column label first, so each tuple is (column, row)
df.unstack().sort_values()[-3:].index.tolist()

          0         1         2
0  0.974321  0.454025  0.018815
1  0.323491  0.468609  0.834424
2  0.340960  0.826835  0.503252
3  0.812414  0.202745  0.965168
4  0.633172  0.270281  0.915212

[(2, 4), (2, 3), (0, 0)]
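The coordinates returned in exercise 29 are (column, row) pairs in ascending order of value. If (row, column) pairs with the largest value first are preferred, one possible variant is `stack()` followed by `nlargest`. The sketch uses a small fixed table instead of random numbers so the result is reproducible:

```python
import pandas as pd

# Fixed, made-up values so the positions are easy to verify by eye
df = pd.DataFrame([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.7],
                   [0.4, 0.6, 0.5]])

# unstack(): index is (column, row); sorted ascending, so the maximum is last
col_row = df.unstack().sort_values().iloc[-3:].index.tolist()

# stack(): index is (row, column); nlargest returns the maximum first
row_col = df.stack().nlargest(3).index.tolist()

print(col_row)
print(row_col)
```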

30. Given a DataFrame, replace each negative value with the mean of its group

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [-12, 345, 3, 1, 45, 14, 4, -52, 54, 23, -235, 21, 57, 3, 87]})
print(df)

def replace(group):
    # The mean is taken over the non-negative values of the group only
    mask = group < 0
    group[mask] = group[~mask].mean()
    return group

df['vals'] = df.groupby(['grps'])['vals'].transform(replace)
print(df)

   grps  vals
0     a   -12
1     a   345
2     a     3
3     b     1
4     b    45
5     c    14
6     a     4
7     a   -52
8     b    54
9     c    23
10    c  -235
11    c    21
12    b    57
13    b     3
14    c    87

   grps        vals
0     a  117.333333
1     a  345.000000
2     a    3.000000
3     b    1.000000
4     b   45.000000
5     c   14.000000
6     a    4.000000
7     a  117.333333
8     b   54.000000
9     c   23.000000
10    c   36.250000
11    c   21.000000
12    b   57.000000
13    b    3.000000
14    c   87.000000
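The `replace` helper in exercise 30 modifies each group in place. As an alternative sketch (not the original author's method), the same result can be reached without a custom function by masking the negatives to NaN, computing the per-group mean of what remains, and filling the gaps. The names `nonneg` and `group_mean` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [-12, 345, 3, 1, 45, 14, 4, -52, 54, 23, -235, 21, 57, 3, 87]})

# Negative values become NaN, so they are ignored by mean()
nonneg = df['vals'].where(df['vals'] >= 0)

# Mean of the non-negative values of each group, aligned to the original rows
group_mean = nonneg.groupby(df['grps']).transform('mean')

# Fill the masked positions with their group's mean
df['vals'] = nonneg.fillna(group_mean)
print(df)
```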

31. Compute the mean over a sliding window of 3 values within each group, ignoring NaN

df = pd.DataFrame({'group': list('aabbabbbabab'),
                   'value': [1, 2, 3, np.nan, 2, 3, np.nan, 1, 7, 3, np.nan, 8]})
print(df)

g1 = df.groupby(['group'])['value']
g2 = df.fillna(0).groupby(['group'])['value']

# Window sum with NaN treated as 0, divided by the number of non-NaN values
s = g2.rolling(3, min_periods=1).sum() / g1.rolling(3, min_periods=1).count()
s.reset_index(level=0, drop=True).sort_index()

   group  value
0      a    1.0
1      a    2.0
2      b    3.0
3      b    NaN
4      a    2.0
5      b    3.0
6      b    NaN
7      b    1.0
8      a    7.0
9      b    3.0
10     a    NaN
11     b    8.0

0     1.000000
1     1.500000
2     3.000000
3     3.000000
4     1.666667
5     3.000000
6     3.000000
7     2.000000
8     3.666667
9     2.000000
10    4.500000
11    4.000000
Name: value, dtype: float64
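The sum-over-count construction in exercise 31 can be cross-checked against the built-in rolling mean, which, to the best of my understanding, already skips NaN values inside a window as long as `min_periods` is satisfied. The sketch below is a sanity check under that assumption rather than a replacement for the original solution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': list('aabbabbbabab'),
                   'value': [1, 2, 3, np.nan, 2, 3, np.nan, 1, 7, 3, np.nan, 8]})

# Rolling mean per group; NaN entries inside a window are left out of the mean
rolled = df.groupby('group')['value'].rolling(3, min_periods=1).mean()

# Drop the group level and restore the original row order
s = rolled.reset_index(level=0, drop=True).sort_index()
print(s)
```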

Series and Datetime indexes

32. Create a Series s of random values indexed by every business day of 2015

dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
s = pd.Series(np.random.rand(len(dti)), index=dti)
s.head(10)

2015-01-01    0.503458
2015-01-02    0.194185
2015-01-05    0.550930
2015-01-06    0.174309
2015-01-07    0.316911
2015-01-08    0.288385
2015-01-09    0.293285
2015-01-12    0.340436
2015-01-13    0.630009
2015-01-14    0.076130
Freq: B, dtype: float64

33. Sum the values of all Wednesdays

s[s.index.weekday == 2].sum()

27.272318047689705

34. Compute the mean for each calendar month

# Newer pandas releases (2.2+) spell the month-end alias 'ME' instead of 'M'
s.resample('M').mean()

2015-01-31    0.375417
2015-02-28    0.551560
2015-03-31    0.540772
2015-04-30    0.450957
2015-05-31    0.369119
2015-06-30    0.588625
2015-07-31    0.584358
2015-08-31    0.609751
2015-09-30    0.511285
2015-10-31    0.555546
2015-11-30    0.528777
2015-12-31    0.574317
Freq: M, dtype: float64
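Because the spelling of the month-end resampling alias differs between pandas versions, a version-independent way to get the monthly means in exercise 34 is to group by the monthly period of each timestamp. This is a sketch of that alternative; it uses consecutive numbers rather than random values so the result can be verified:

```python
import numpy as np
import pandas as pd

dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')

# Deterministic stand-in for the random values: 0.0, 1.0, 2.0, ...
s = pd.Series(np.arange(len(dti), dtype=float), index=dti)

# Group by calendar month; the result is indexed by monthly Periods
monthly = s.groupby(s.index.to_period('M')).mean()
print(monthly)
```

The result is labeled by period (for example 2015-01) rather than by the month-end date that `resample` produces.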

35. For each group of 4 consecutive months, find the date on which the maximum value occurred

# Newer pandas releases (2.2+) spell this frequency '4ME' instead of '4M'
s.groupby(pd.Grouper(freq='4M')).idxmax()

2015-01-31   2015-01-15
2015-05-31   2015-02-04
2015-09-30   2015-06-02
2016-01-31   2015-12-08
dtype: datetime64[ns]

36. Create the sequence of the third Thursday of every month in 2015-2016

pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

Data cleaning

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df

               Airline  FlightNumber           From_To  RecentDelays
0               KLM(!)       10045.0      LoNDon_paris      [23, 47]
1    <Air France> (12)           NaN      MAdrid_miLAN            []
2  (British Airways. )       10065.0  londON_StockhOlm  [24, 43, 87]
3       12. Air France           NaN    Budapest_PaRis          [13]
4          "Swiss Air"       10085.0   Brussels_londOn      [67, 32]

37. Some values in the FlightNumber column are missing. The numbers are supposed to increase by 10 with each row; fill in the missing values and make the column an integer type

df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df

               Airline  FlightNumber           From_To  RecentDelays
0               KLM(!)         10045      LoNDon_paris      [23, 47]
1    <Air France> (12)         10055      MAdrid_miLAN            []
2  (British Airways. )         10065  londON_StockhOlm  [24, 43, 87]
3       12. Air France         10075    Budapest_PaRis          [13]
4          "Swiss Air"         10085   Brussels_londOn      [67, 32]

38. Split the From_To column at the underscore into two columns, From and To, and delete the original column

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
df = df.join(temp)
df = df.drop('From_To', axis=1)
df

               Airline  FlightNumber  RecentDelays      From         To
0               KLM(!)         10045      [23, 47]    LoNDon      paris
1    <Air France> (12)         10055            []    MAdrid      miLAN
2  (British Airways. )         10065  [24, 43, 87]    londON  StockhOlm
3       12. Air France         10075          [13]  Budapest      PaRis
4          "Swiss Air"         10085      [67, 32]  Brussels     londOn

39. Normalize the case of From and To: first letter uppercase, the rest lowercase

df['From'] = df['From'].str.capitalize()
df['To'] = df['To'].str.capitalize()
df

               Airline  FlightNumber  RecentDelays      From         To
0               KLM(!)         10045      [23, 47]    London      Paris
1    <Air France> (12)         10055            []    Madrid      Milan
2  (British Airways. )         10065  [24, 43, 87]    London  Stockholm
3       12. Air France         10075          [13]  Budapest      Paris
4          "Swiss Air"         10085      [67, 32]  Brussels     London

40. The Airline column contains stray punctuation; extract the correct airline names. For example, '(British Airways. )' should become 'British Airways'

# Raw string so that \s reaches the regex engine unchanged
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
df

           Airline  FlightNumber  RecentDelays      From         To
0              KLM         10045      [23, 47]    London      Paris
1       Air France         10055            []    Madrid      Milan
2  British Airways         10065  [24, 43, 87]    London  Stockholm
3       Air France         10075          [13]  Budapest      Paris
4        Swiss Air         10085      [67, 32]  Brussels     London

41. In the RecentDelays column the data was entered as lists, but we want each number in its own column, delay_1, delay_2, ..., with NaN where a value is missing

delays = df['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns) + 1)]
df = df.drop('RecentDelays', axis=1).join(delays)
df

           Airline  FlightNumber      From         To  delay_1  delay_2  delay_3
0              KLM         10045    London      Paris     23.0     47.0      NaN
1       Air France         10055    Madrid      Milan      NaN      NaN      NaN
2  British Airways         10065    London  Stockholm     24.0     43.0     87.0
3       Air France         10075  Budapest      Paris     13.0      NaN      NaN
4        Swiss Air         10085  Brussels     London     67.0     32.0      NaN
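Expanding a column of lists with `apply(pd.Series)`, as in exercise 41, builds one Series per row. A commonly used alternative is to hand the lists to the `DataFrame` constructor directly, which pads shorter lists with missing values. The sketch below shows that variant on the `RecentDelays` data alone; it is an alternative under that assumption, not the original solution:

```python
import pandas as pd

df = pd.DataFrame({'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]]})

# One row per list; shorter lists are padded with NaN on the right
delays = pd.DataFrame(df['RecentDelays'].tolist(), index=df.index)
delays.columns = ['delay_{}'.format(n) for n in range(1, delays.shape[1] + 1)]

df = df.drop('RecentDelays', axis=1).join(delays)
print(df)
```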

Hierarchical indexing

42. Use the combinations of letters = ['A', 'B', 'C'] and numbers = list(range(4)) as the hierarchical index of a Series of random values

letters = ['A', 'B', 'C']
numbers = list(range(4))
mi = pd.MultiIndex.from_product([letters, numbers])
s = pd.Series(np.random.rand(12), index=mi)
s

A  0    0.250785
   1    0.146978
   2    0.596062
   3    0.064608
B  0    0.709660
   1    0.515778
   2    0.483163
   3    0.524490
C  0    0.360434
   1    0.987620
   2    0.527151
   3    0.636960
dtype: float64

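43. Check whether the index of s is lexicographically sorted

# is_lexsorted() exists in the pandas 0.22 used here; it was removed in pandas 2.0
s.index.is_lexsorted()
# Method 2
# s.index.lexsort_depth == s.index.nlevels

True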
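On pandas versions where `is_lexsorted()` is no longer available, one way to ask the same question is the `is_monotonic_increasing` property of the index. The sketch below applies it before and after swapping the index levels; the values of `s` are placeholders:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B', 'C'], list(range(4))])
s = pd.Series(np.arange(12), index=mi)   # placeholder values 0..11

# A product index is built in sorted order
sorted_before = s.index.is_monotonic_increasing

# Swapping the levels yields (0, 'A'), (1, 'A'), ..., which is not sorted
swapped = s.swaplevel(0, 1)
sorted_after_swap = swapped.index.is_monotonic_increasing

# sort_index() restores lexicographic order
sorted_again = swapped.sort_index().index.is_monotonic_increasing

print(sorted_before, sorted_after_swap, sorted_again)
```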

44. Select the rows whose second-level index is 1 or 3

s.loc[:, [1, 3]]

A  1    0.146978
   3    0.064608
B  1    0.515778
   3    0.524490
C  1    0.987620
   3    0.636960
dtype: float64

45. Slice s: take the first-level index up to B, and the second-level index from 2 to the end

s.loc[pd.IndexSlice[:'B', 2:]]
# Method 2
# s.loc[slice(None, 'B'), slice(2, None)]

A  2    0.596062
   3    0.064608
B  2    0.483163
   3    0.524490
dtype: float64

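46. Compute the sum for each first-level index value (one sum each for A, B, and C)

# sum(level=0) works in the pandas 0.22 used here; it was removed in pandas 2.0
s.sum(level=0)
# Method 2 (after unstack() the letters are the rows, so sum across the columns)
# s.unstack().sum(axis=1)

A    1.058433
B    2.233091
C    2.512164
dtype: float64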
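For exercise 46, the following sketch shows two formulations that should work on both old and recent pandas versions, and checks that they agree. It uses the integers 0 to 11 instead of random values so the sums can be verified by hand:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B', 'C'], list(range(4))])
s = pd.Series(np.arange(12), index=mi)   # A: 0..3, B: 4..7, C: 8..11

# Group on the first index level and sum within each group
by_level = s.groupby(level=0).sum()

# unstack() moves the second level into columns; sum across them (axis=1)
via_unstack = s.unstack().sum(axis=1)

print(by_level)
print(via_unstack)
```

Summing the unstacked table with `axis=0` would instead give one total per number (0 to 3), which answers a different question.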

47. Swap the index levels. Is the new Series lexicographically sorted? If not, sort it

new_s = s.swaplevel(0, 1)
print(new_s)
print(new_s.index.is_lexsorted())
new_s = new_s.sort_index()
print(new_s)

0  A    0.250785
1  A    0.146978
2  A    0.596062
3  A    0.064608
0  B    0.709660
1  B    0.515778
2  B    0.483163
3  B    0.524490
0  C    0.360434
1  C    0.987620
2  C    0.527151
3  C    0.636960
dtype: float64
False
0  A    0.250785
   B    0.709660
   C    0.360434
1  A    0.146978
   B    0.515778
   C    0.987620
2  A    0.596062
   B    0.483163
   C    0.527151
3  A    0.064608
   B    0.524490
   C    0.636960
dtype: float64

Visualization

import matplotlib.pyplot as plt
df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
plt.style.use('ggplot')

48. Draw a scatter plot of df

df.plot.scatter("xs", "ys", color="black", marker="x")

<matplotlib.axes._subplots.AxesSubplot at 0x1f188ddacc0>

49. Visualize the given 4-dimensional DataFrame

df = pd.DataFrame({"productivity": [5, 2, 3, 1, 4, 5, 6, 7, 8, 3, 4, 8, 9],
                   "hours_in": [1, 9, 6, 5, 3, 9, 2, 9, 1, 7, 4, 2, 2],
                   "happiness": [2, 1, 3, 2, 3, 1, 2, 3, 1, 2, 2, 1, 3],
                   "caffienated": [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]})
# Marker size encodes happiness, marker color encodes caffienated
df.plot.scatter("hours_in", "productivity", s=df.happiness * 100, c=df.caffienated)

<matplotlib.axes._subplots.AxesSubplot at 0x1f18aea4c18>

50. Plot two data series in the same figure, sharing the x-axis but with different y-axes

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                   "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                   "month": range(12)})
ax = df.plot.bar("month", "revenue", color="green")
df.plot.line("month", "advertising", secondary_y=True, ax=ax)
ax.set_xlim((-1, 12))

(-1, 12)
