當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

十分钟入门Pandas

發(fā)布時間：2023/12/14 编程问答 18 豆豆

生活随笔收集整理的這篇文章主要介紹了十分钟入门Pandas 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

文章目錄

1 創(chuàng)建對象
- 1.1 Series
- 1.2 DataFrame
2 查看數(shù)據(jù)
3 選擇
- 3.1 獲取行、列
- 3.2 設(shè)置行、列
4 缺失值
4 操作
- 4.1 統(tǒng)計
- 4.2 Apply
- 4.3 直方圖
- 4.4 字符串函數(shù)
5 合并
- 5.1 Concat
- 5.2 Join
- 5.3 Append
6 Grouping
7 Reshaping
- 7.1 Stack層疊
- 7.2 Pivot tables
8 Time series
9 Categoricals
10 Plotting
11 讀取、寫入數(shù)據(jù)
- 11.1 CSV
- 11.2 HDF5
- 11.3 Excel
12 注意

原文：pandas官方文檔 10 minutes to pandas

import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inline

1 創(chuàng)建對象

1.1 Series

Series是一個列向量，它僅包含一列數(shù)據(jù)及其索引，通過向Series中傳遞一個列表來創(chuàng)建:

s = pd.Series([1, 3, 5, np.nan, 6, 8]) print (s) 0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64

1.2 DataFrame

DataFrame是一個矩陣，通過傳遞一個Numpy數(shù)組，以日期作為行索引，以’ABCD’作為列索引來創(chuàng)建：

dates = pd.date_range('20191214', periods=6) df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD')) print (df) A B C D 2019-12-14 0.616569 0.051401 1.121409 0.307267 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 2019-12-16 0.366885 -1.617911 0.438302 1.124135 2019-12-17 0.717372 0.233221 1.839167 -1.535794 2019-12-18 1.753858 0.266110 -0.333931 0.281345 2019-12-19 1.048606 0.984771 -0.279106 0.592285

通過傳遞一個字典來創(chuàng)建DataFrame，每個value都會被轉(zhuǎn)化成一個 Series：

df2 = pd.DataFrame({'A': 1.,'B': pd.Timestamp('20130102'),'C': pd.Series(1, index=list(range(4)), dtype='float32'),'D': np.array([3] * 4, dtype='int32'),'E': pd.Categorical(["test", "train", "test", "train"]),'F': 'foo'}) print (df2) A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo #DataFrame每一列的數(shù)據(jù)類型都不同: print (df2.dtypes) A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object

所以我們可以認為，DataFrame是由一列一列的Series拼成的

2 查看數(shù)據(jù)

用head和tail查看開始和結(jié)尾幾行的數(shù)據(jù)，用index和columns查看行索引和列索引，用vlaues顯示不帶索引的內(nèi)容：

print (df.head(), '\n') print (df.tail(3), '\n') print (df.index, '\n') print (df.columns, '\n') print (df.values, '\n') A B C D 2019-12-14 0.616569 0.051401 1.121409 0.307267 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 2019-12-16 0.366885 -1.617911 0.438302 1.124135 2019-12-17 0.717372 0.233221 1.839167 -1.535794 2019-12-18 1.753858 0.266110 -0.333931 0.281345 A B C D 2019-12-17 0.717372 0.233221 1.839167 -1.535794 2019-12-18 1.753858 0.266110 -0.333931 0.281345 2019-12-19 1.048606 0.984771 -0.279106 0.592285 DatetimeIndex(['2019-12-14', '2019-12-15', '2019-12-16', '2019-12-17','2019-12-18', '2019-12-19'],dtype='datetime64[ns]', freq='D') Index(['A', 'B', 'C', 'D'], dtype='object') [[ 0.61656893 0.05140077 1.12140918 0.30726721][ 0.48399521 -1.25127827 -1.77202211 1.2826497 ][ 0.36688539 -1.61791122 0.43830175 1.12413452][ 0.71737246 0.23322127 1.83916745 -1.53579412][ 1.75385779 0.26611012 -0.33393125 0.281345 ][ 1.04860622 0.98477115 -0.27910636 0.59228493]]

Numpy arrays整個數(shù)組只有一個數(shù)據(jù)類型，而DataFrame每一列一個數(shù)據(jù)類型。所以在將df2轉(zhuǎn)化為numpy的時候，每列的數(shù)據(jù)類型統(tǒng)一為object:

print (df.to_numpy()) df2.to_numpy() [[ 0.61656893 0.05140077 1.12140918 0.30726721][ 0.48399521 -1.25127827 -1.77202211 1.2826497 ][ 0.36688539 -1.61791122 0.43830175 1.12413452][ 0.71737246 0.23322127 1.83916745 -1.53579412][ 1.75385779 0.26611012 -0.33393125 0.281345 ][ 1.04860622 0.98477115 -0.27910636 0.59228493]]array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],dtype=object)

<kbdDescribe顯示每一列數(shù)據(jù)的統(tǒng)計概要:

print(df.describe()) A B C D count 6.000000 6.000000 6.000000 6.000000 mean 0.831214 -0.222281 0.168970 0.341981 std 0.508670 0.998449 1.264150 1.008979 min 0.366885 -1.617911 -1.772022 -1.535794 25% 0.517139 -0.925609 -0.320225 0.287826 50% 0.666971 0.142311 0.079598 0.449776 75% 0.965798 0.257888 0.950632 0.991172 max 1.753858 0.984771 1.839167 1.282650

T表示矩陣的轉(zhuǎn)置：

print (df.T) 2019-12-14 2019-12-15 2019-12-16 2019-12-17 2019-12-18 2019-12-19 A 0.616569 0.483995 0.366885 0.717372 1.753858 1.048606 B 0.051401 -1.251278 -1.617911 0.233221 0.266110 0.984771 C 1.121409 -1.772022 0.438302 1.839167 -0.333931 -0.279106 D 0.307267 1.282650 1.124135 -1.535794 0.281345 0.592285

sort_index按照軸排序，axis=1表示第二個維度，即按照列索引排序：

print (df.sort_index(axis=1, ascending=False)) D C B A 2019-12-14 0.307267 1.121409 0.051401 0.616569 2019-12-15 1.282650 -1.772022 -1.251278 0.483995 2019-12-16 1.124135 0.438302 -1.617911 0.366885 2019-12-17 -1.535794 1.839167 0.233221 0.717372 2019-12-18 0.281345 -0.333931 0.266110 1.753858 2019-12-19 0.592285 -0.279106 0.984771 1.048606

sort_values按照值排序，by='B’表示對第二列排序：

print (df.sort_values(by='B')) A B C D 2019-12-16 0.366885 -1.617911 0.438302 1.124135 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 2019-12-14 0.616569 0.051401 1.121409 0.307267 2019-12-17 0.717372 0.233221 1.839167 -1.535794 2019-12-18 1.753858 0.266110 -0.333931 0.281345 2019-12-19 1.048606 0.984771 -0.279106 0.592285

3 選擇

3.1 獲取行、列

與numpy類似，DataFrame也可以用[]進行選擇：

print (df['A'], '\n') print (df[0:3], '\n') #行切片 print (df['20191216':'20191219'], '\n') 2019-12-14 0.616569 2019-12-15 0.483995 2019-12-16 0.366885 2019-12-17 0.717372 2019-12-18 1.753858 2019-12-19 1.048606 Freq: D, Name: A, dtype: float64 A B C D 2019-12-14 0.616569 0.051401 1.121409 0.307267 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 2019-12-16 0.366885 -1.617911 0.438302 1.124135 A B C D 2019-12-16 0.366885 -1.617911 0.438302 1.124135 2019-12-17 0.717372 0.233221 1.839167 -1.535794 2019-12-18 1.753858 0.266110 -0.333931 0.281345 2019-12-19 1.048606 0.984771 -0.279106 0.592285

通過label進行選擇

print (df.loc[dates[0]], '\n') print (df.loc[:, ['A', 'B']], '\n') print (df.loc['20191216':'20191218', ['A', 'B']], '\n') print (df.at[dates[0], 'A'], '\n') #選擇單個數(shù)據(jù)時用at，速度更快 A 0.616569 B 0.051401 C 1.121409 D 0.307267 Name: 2019-12-14 00:00:00, dtype: float64 A B 2019-12-14 0.616569 0.051401 2019-12-15 0.483995 -1.251278 2019-12-16 0.366885 -1.617911 2019-12-17 0.717372 0.233221 2019-12-18 1.753858 0.266110 2019-12-19 1.048606 0.984771 A B 2019-12-16 0.366885 -1.617911 2019-12-17 0.717372 0.233221 2019-12-18 1.753858 0.266110 0.6165689271991402

通過整數(shù)下標(biāo)進行選擇

print (df.iloc[3], '\n') print (df.iloc[3:5, 0:2], '\n') print (df.iloc[[1,2,4], [0,2]], '\n') print (df.iloc[1:3, :], '\n') print (df.iat[1,1], '\n') A 0.717372 B 0.233221 C 1.839167 D -1.535794 Name: 2019-12-17 00:00:00, dtype: float64 A B 2019-12-17 0.717372 0.233221 2019-12-18 1.753858 0.266110 A C 2019-12-15 0.483995 -1.772022 2019-12-16 0.366885 0.438302 2019-12-18 1.753858 -0.333931 A B C D 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 2019-12-16 0.366885 -1.617911 0.438302 1.124135 -1.2512782747247186

通過布爾值下標(biāo)進行選擇

print (df[df.A > 0], '\n') #可以用df.A來表示 print (df[df > 0], '\n') df2 = df.copy() df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three'] print (df2, '\n') print (df2[df2['E'].isin(['two', 'four'])]) #用isin來過濾 A B C D 2019-12-14 0.616569 0.051401 1.121409 0.307267 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 2019-12-16 0.366885 -1.617911 0.438302 1.124135 2019-12-17 0.717372 0.233221 1.839167 -1.535794 2019-12-18 1.753858 0.266110 -0.333931 0.281345 2019-12-19 1.048606 0.984771 -0.279106 0.592285 A B C D 2019-12-14 0.616569 0.051401 1.121409 0.307267 2019-12-15 0.483995 NaN NaN 1.282650 2019-12-16 0.366885 NaN 0.438302 1.124135 2019-12-17 0.717372 0.233221 1.839167 NaN 2019-12-18 1.753858 0.266110 NaN 0.281345 2019-12-19 1.048606 0.984771 NaN 0.592285 A B C D E 2019-12-14 0.616569 0.051401 1.121409 0.307267 one 2019-12-15 0.483995 -1.251278 -1.772022 1.282650 one 2019-12-16 0.366885 -1.617911 0.438302 1.124135 two 2019-12-17 0.717372 0.233221 1.839167 -1.535794 three 2019-12-18 1.753858 0.266110 -0.333931 0.281345 four 2019-12-19 1.048606 0.984771 -0.279106 0.592285 three A B C D E 2019-12-16 0.366885 -1.617911 0.438302 1.124135 two 2019-12-18 1.753858 0.266110 -0.333931 0.281345 four

3.2 設(shè)置行、列

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20191215', periods=6)) print (s1, '\n')df['F'] = s1 df.at[dates[0], 'A'] = 0 df.iat[0, 1] = 0 df.loc[:, 'D'] = np.array([5] * len(df)) print (df)df2 = df.copy() df2[df2 > 0] = -df2 print (df2) 2019-12-15 1 2019-12-16 2 2019-12-17 3 2019-12-18 4 2019-12-19 5 2019-12-20 6 Freq: D, dtype: int64 A B C D F 2019-12-14 0.000000 0.000000 1.121409 5 NaN 2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 2019-12-16 0.366885 -1.617911 0.438302 5 2.0 2019-12-17 0.717372 0.233221 1.839167 5 3.0 2019-12-18 1.753858 0.266110 -0.333931 5 4.0 2019-12-19 1.048606 0.984771 -0.279106 5 5.0A B C D F 2019-12-14 0.000000 0.000000 -1.121409 -5 NaN 2019-12-15 -0.483995 -1.251278 -1.772022 -5 -1.0 2019-12-16 -0.366885 -1.617911 -0.438302 -5 -2.0 2019-12-17 -0.717372 -0.233221 -1.839167 -5 -3.0 2019-12-18 -1.753858 -0.266110 -0.333931 -5 -4.0 2019-12-19 -1.048606 -0.984771 -0.279106 -5 -5.0

4 缺失值

pandas用np.nan表示缺失值，通常它不會被計算。

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E']) df1.loc[dates[0]:dates[1], 'E'] = 1 print (df1, '\n') print (df1.dropna(how='any'), '\n') #刪除所有存在缺失值的行 print (df1.fillna(value=5), '\n') #將所有缺失值用5填充 print (pd.isna(df1)) #獲取所有位置是否是缺失值的boolean值 A B C D F E 2019-12-14 0.000000 0.000000 1.121409 5 NaN 1.0 2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0 2019-12-16 0.366885 -1.617911 0.438302 5 2.0 NaN 2019-12-17 0.717372 0.233221 1.839167 5 3.0 NaN A B C D F E 2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0 A B C D F E 2019-12-14 0.000000 0.000000 1.121409 5 5.0 1.0 2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 1.0 2019-12-16 0.366885 -1.617911 0.438302 5 2.0 5.0 2019-12-17 0.717372 0.233221 1.839167 5 3.0 5.0 A B C D F E 2019-12-14 False False False False True False 2019-12-15 False False False False False False 2019-12-16 False False False False False True 2019-12-17 False False False False False True

4 操作

4.1 統(tǒng)計

通常，操作都會把 NaN 排除在外

print (df.mean(), '\n') print (df.mean(1),'\n') #對列求均值 print (pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2), '\n') #將所有數(shù)據(jù)下移兩位 A 0.728453 B -0.230848 C 0.168970 D 5.000000 F 3.000000 dtype: float64 2019-12-14 1.530352 2019-12-15 0.692139 2019-12-16 1.237455 2019-12-17 2.157952 2019-12-18 2.137207 2019-12-19 2.350854 Freq: D, dtype: float64 2019-12-14 NaN 2019-12-15 NaN 2019-12-16 1.0 2019-12-17 3.0 2019-12-18 5.0 2019-12-19 NaN Freq: D, dtype: float64

4.2 Apply

對數(shù)據(jù)apply一個函數(shù)

print (df, '\n') print (df.apply(np.cumsum), '\n') #cumsum表示每行逐次累加 print (df.apply(lambda x: x.max()-x.min())) #每一列的最大值減去最小值 A B C D F 2019-12-14 0.000000 0.000000 1.121409 5 NaN 2019-12-15 0.483995 -1.251278 -1.772022 5 1.0 2019-12-16 0.366885 -1.617911 0.438302 5 2.0 2019-12-17 0.717372 0.233221 1.839167 5 3.0 2019-12-18 1.753858 0.266110 -0.333931 5 4.0 2019-12-19 1.048606 0.984771 -0.279106 5 5.0 A B C D F 2019-12-14 0.000000 0.000000 1.121409 5 NaN 2019-12-15 0.483995 -1.251278 -0.650613 10 1.0 2019-12-16 0.850881 -2.869189 -0.212311 15 3.0 2019-12-17 1.568253 -2.635968 1.626856 20 6.0 2019-12-18 3.322111 -2.369858 1.292925 25 10.0 2019-12-19 4.370717 -1.385087 1.013819 30 15.0 A 1.753858 B 2.602682 C 3.611190 D 0.000000 F 4.000000 dtype: float64

4.3 直方圖

s = pd.Series(np.random.randint(0, 7, size=10)) print (s, '\n') print (s.value_counts()) 0 4 1 3 2 3 3 6 4 4 5 5 6 2 7 2 8 3 9 0 dtype: int32 3 3 4 2 2 2 6 1 5 1 0 1 dtype: int64

4.4 字符串函數(shù)

Series在str屬性中自帶了很多字符串函數(shù)：

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) print (s.str.lower(), '\n') 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object

5 合并

5.1 Concat

簡單地按行拼接

df = pd.DataFrame(np.random.randn(10, 4)) print (df, '\n')#break it into pieces pieces = [df[:3], df[3:7], df[7:]]print (pd.concat(pieces)) 0 1 2 3 0 1.434680 -0.776149 -1.218836 2.397118 1 -1.632868 -0.806080 0.634619 -0.760958 2 1.028005 -0.020240 0.696301 0.090692 3 -1.516807 0.952234 0.758215 -0.272409 4 0.540413 0.258468 -0.563238 -1.380825 5 0.021616 0.097505 -1.052843 0.272892 6 -0.957512 0.088345 -0.972309 -1.224522 7 0.211805 1.674883 -0.919839 -0.159892 8 1.651509 0.908129 -0.416815 -0.629014 9 -0.616352 0.160981 1.059860 -0.521867 0 1 2 3 0 1.434680 -0.776149 -1.218836 2.397118 1 -1.632868 -0.806080 0.634619 -0.760958 2 1.028005 -0.020240 0.696301 0.090692 3 -1.516807 0.952234 0.758215 -0.272409 4 0.540413 0.258468 -0.563238 -1.380825 5 0.021616 0.097505 -1.052843 0.272892 6 -0.957512 0.088345 -0.972309 -1.224522 7 0.211805 1.674883 -0.919839 -0.159892 8 1.651509 0.908129 -0.416815 -0.629014 9 -0.616352 0.160981 1.059860 -0.521867

5.2 Join

SQL樣式的合并方法

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]}) right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]}) print (left, '\n') print (right, '\n') print (pd.merge(left, right, on='key')) key lval 0 foo 1 1 foo 2 key rval 0 foo 4 1 foo 5 key lval rval 0 foo 1 4 1 foo 1 5 2 foo 2 4 3 foo 2 5

5.3 Append

合并行為DataFrame

df = pd.DataFrame(np.random.randn(8,4), columns=['A', 'B', 'C', 'D']) print (df, '\n') s =df.iloc[3] print (df.append(s, ignore_index=True)) A B C D 0 0.481732 1.270100 0.518214 0.028305 1 0.004151 0.602318 1.274942 -2.683102 2 0.130687 -0.402626 0.701901 -1.182579 3 -1.047072 0.656017 0.468616 -0.694842 4 -0.097726 1.669621 0.901135 1.337648 5 0.493724 -0.865273 -1.390465 1.143148 6 -0.691091 0.460376 0.540177 -0.205331 7 -0.634427 2.625610 0.067031 -2.041728 A B C D 0 0.481732 1.270100 0.518214 0.028305 1 0.004151 0.602318 1.274942 -2.683102 2 0.130687 -0.402626 0.701901 -1.182579 3 -1.047072 0.656017 0.468616 -0.694842 4 -0.097726 1.669621 0.901135 1.337648 5 0.493724 -0.865273 -1.390465 1.143148 6 -0.691091 0.460376 0.540177 -0.205331 7 -0.634427 2.625610 0.067031 -2.041728 8 -1.047072 0.656017 0.468616 -0.694842

6 Grouping

與SQL中的group by類似，包含以下步驟：

根據(jù)規(guī)則將數(shù)據(jù)分組
將函數(shù)獨立的應(yīng)用到每個分組
將結(jié)果合并

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],'C': np.random.randn(8),'D': np.random.randn(8)}) print (df, '\n') print (df.groupby('A').sum(), '\n') print (df.groupby(['A', 'B']).sum()) A B C D 0 foo one 0.182951 -0.346016 1 bar one 1.906573 -0.386902 2 foo two 0.102443 -1.087078 3 bar three 1.023196 0.467237 4 foo two -0.940578 -0.414947 5 bar two -1.075379 -0.844441 6 foo one 0.514689 0.038393 7 foo three -2.136912 -1.001738 C D A bar 1.854390 -0.764105 foo -2.277406 -2.811385 C D A B bar one 1.906573 -0.386902three 1.023196 0.467237two -1.075379 -0.844441 foo one 0.697640 -0.307622three -2.136912 -1.001738two -0.838134 -1.502025

7 Reshaping

7.1 Stack層疊

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']])) print (tuples) index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B']) df2 = df[:4] print (df2, '\n') [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]A B first second bar one 0.220077 1.659987two -1.504953 0.350697 baz one 1.631244 0.637820two -0.083256 0.370784

stacked = df2.stack() print (stacked, '\n')print(stacked.unstack(), '\n') print(stacked.unstack(0), '\n') print(stacked.unstack(1), '\n') first second bar one A 0.220077B 1.659987two A -1.504953B 0.350697 baz one A 1.631244B 0.637820two A -0.083256B 0.370784 dtype: float64 A B first second bar one 0.220077 1.659987two -1.504953 0.350697 baz one 1.631244 0.637820two -0.083256 0.370784 first bar baz second one A 0.220077 1.631244B 1.659987 0.637820 two A -1.504953 -0.083256B 0.350697 0.370784 second one two first bar A 0.220077 -1.504953B 1.659987 0.350697 baz A 1.631244 -0.083256B 0.637820 0.370784

7.2 Pivot tables

pivot是把指定索引所對應(yīng)的內(nèi)容，作為新表的行(index)、列(columns)

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,'B': ['A', 'B', 'C'] * 4,'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,'D': np.random.randn(12),'E': np.random.randn(12)}) print (df, '\n')print (pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']), '\n') A B C D E 0 one A foo -0.647116 -0.383797 1 one B foo -1.009740 -0.149929 2 two C foo -0.000333 0.233550 3 three A bar 0.308752 -0.204858 4 one B bar 0.157966 -0.066707 5 one C bar 1.028976 -1.697499 6 two A foo -0.170791 -0.161074 7 three B foo 1.222346 -0.465335 8 one C foo -0.024742 -1.984697 9 one A bar -1.253050 0.675345 10 two B bar -0.124737 -0.169243 11 three C bar 0.501580 1.049995 C bar foo A B one A -1.253050 -0.647116B 0.157966 -1.009740C 1.028976 -0.024742 three A 0.308752 NaNB NaN 1.222346C 0.501580 NaN two A NaN -0.170791B -0.124737 NaNC NaN -0.000333

8 Time series

rng = pd.date_range('1/1/2020', periods=120, freq='S') ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng) print (ts, '\n') print (ts.resample('1Min').sum()) #resapmle重采樣，每隔一分鐘為index，對值進行求和 2020-01-01 00:00:00 349 2020-01-01 00:00:01 141 2020-01-01 00:00:02 467 2020-01-01 00:00:03 152 2020-01-01 00:00:04 491... 2020-01-01 00:01:55 29 2020-01-01 00:01:56 383 2020-01-01 00:01:57 310 2020-01-01 00:01:58 204 2020-01-01 00:01:59 115 Freq: S, Length: 120, dtype: int32 2020-01-01 00:00:00 15490 2020-01-01 00:01:00 13744 Freq: T, dtype: int32 # 時區(qū)表示 rng = pd.date_range('3/6/2020 00:00', periods=5, freq='D') ts = pd.Series(np.random.randn(len(rng)), rng) print (ts, '\n') ts_utc = ts.tz_localize('UTC') print (ts_utc, '\n') 2020-03-06 -0.264762 2020-03-07 1.358651 2020-03-08 -0.316810 2020-03-09 -0.134225 2020-03-10 -0.960168 Freq: D, dtype: float64 2020-03-06 00:00:00+00:00 -0.264762 2020-03-07 00:00:00+00:00 1.358651 2020-03-08 00:00:00+00:00 -0.316810 2020-03-09 00:00:00+00:00 -0.134225 2020-03-10 00:00:00+00:00 -0.960168 Freq: D, dtype: float64

#轉(zhuǎn)換為另一個時區(qū) print (ts_utc.tz_convert('US/Eastern'), '\n') 2020-03-05 19:00:00-05:00 -0.264762 2020-03-06 19:00:00-05:00 1.358651 2020-03-07 19:00:00-05:00 -0.316810 2020-03-08 20:00:00-04:00 -0.134225 2020-03-09 20:00:00-04:00 -0.960168 Freq: D, dtype: float64

#在時間跨度的表示之間進行轉(zhuǎn)換 rng = pd.date_range('1/1/2020', periods=5, freq='M') ts = pd.Series(np.random.randn(len(rng)), index=rng) print (ts, '\n') ps = ts.to_period() print (ps, '\n') print (ps.to_timestamp(),'\n') 2020-01-31 -0.325968 2020-02-29 -0.989123 2020-03-31 -0.323522 2020-04-30 0.491792 2020-05-31 -0.161315 Freq: M, dtype: float64 2020-01 -0.325968 2020-02 -0.989123 2020-03 -0.323522 2020-04 0.491792 2020-05 -0.161315 Freq: M, dtype: float64 2020-01-01 -0.325968 2020-02-01 -0.989123 2020-03-01 -0.323522 2020-04-01 0.491792 2020-05-01 -0.161315 Freq: MS, dtype: float64

#在period和timestamp之間進行轉(zhuǎn)換可以使用一些方便的算術(shù)函數(shù): prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV') ts = pd.Series(np.random.randn(len(prng)), prng) print (ts[:6], '\n') ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9 print (ts.head(), '\n') 1990Q1 -0.694847 1990Q2 1.123105 1990Q3 0.395293 1990Q4 -0.517952 1991Q1 0.689373 1991Q2 0.664599 Freq: Q-NOV, dtype: float64 1990-03-01 09:00 -0.694847 1990-06-01 09:00 1.123105 1990-09-01 09:00 0.395293 1990-12-01 09:00 -0.517952 1991-03-01 09:00 0.689373 Freq: H, dtype: float64

9 Categoricals

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],"raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})#將raw_grade轉(zhuǎn)化為categorical的數(shù)據(jù)類型 df["grade"] = df["raw_grade"].astype("category") print (df["grade"], '\n') 0 a 1 b 2 b 3 a 4 a 5 e Name: grade, dtype: category Categories (3, object): [a, b, e]

df["grade"].cat.categories = ["very good", "good", "very bad"] print (df["grade"], '\n')#重新排序類別，同時添加缺少的類別;新類別按照舊類別的順序匹配 df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) print (df["grade"], '\n') 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (3, object): [very good, good, very bad] 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]

#排序是按類別中的順序進行的，而不是詞匯順序。 print (df.sort_values(by="grade"), '\n')print (df.groupby("grade").size()) id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64

10 Plotting

ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000)) ts = ts.cumsum() ts.plot()

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D']) df = df.cumsum() plt.figure() df.plot() plt.legend(loc='best') #best表示自動分配最佳位置

11 讀取、寫入數(shù)據(jù)

11.1 CSV

#寫入 df.to_csv('foo.csv')#讀取 print (pd.read_csv('foo.csv'), '\n') Unnamed: 0 A B C D 0 2020-01-01 0.674980 0.021207 -0.984289 -0.261462 1 2020-01-02 2.434732 0.361986 1.523058 -0.428913 2 2020-01-03 4.390606 0.680106 0.511064 -0.722390 3 2020-01-04 4.721387 0.118680 0.446910 -0.788997 4 2020-01-05 5.966886 -1.940727 1.120733 -0.169760 .. ... ... ... ... ... 995 2022-09-22 9.467967 9.121625 57.829448 31.631881 996 2022-09-23 8.552319 10.117774 56.513708 30.713588 997 2022-09-24 10.064727 9.115677 56.434259 32.638695 998 2022-09-25 9.974457 7.912468 56.121923 32.281864 999 2022-09-26 8.935891 8.412897 58.160465 32.012264[1000 rows x 5 columns]

11.2 HDF5

#寫入 df.to_hdf('foo.h5', 'df')#讀取 print (pd.read_hdf('foo.h5', 'df'), '\n') A B C D 2020-01-01 0.674980 0.021207 -0.984289 -0.261462 2020-01-02 2.434732 0.361986 1.523058 -0.428913 2020-01-03 4.390606 0.680106 0.511064 -0.722390 2020-01-04 4.721387 0.118680 0.446910 -0.788997 2020-01-05 5.966886 -1.940727 1.120733 -0.169760 ... ... ... ... ... 2022-09-22 9.467967 9.121625 57.829448 31.631881 2022-09-23 8.552319 10.117774 56.513708 30.713588 2022-09-24 10.064727 9.115677 56.434259 32.638695 2022-09-25 9.974457 7.912468 56.121923 32.281864 2022-09-26 8.935891 8.412897 58.160465 32.012264[1000 rows x 4 columns]

11.3 Excel

#寫入 df.to_excel('foo.xlsx', sheet_name='Sheet1')#讀取 print (pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']), '\n') Unnamed: 0 A B C D 0 2020-01-01 0.674980 0.021207 -0.984289 -0.261462 1 2020-01-02 2.434732 0.361986 1.523058 -0.428913 2 2020-01-03 4.390606 0.680106 0.511064 -0.722390 3 2020-01-04 4.721387 0.118680 0.446910 -0.788997 4 2020-01-05 5.966886 -1.940727 1.120733 -0.169760 .. ... ... ... ... ... 995 2022-09-22 9.467967 9.121625 57.829448 31.631881 996 2022-09-23 8.552319 10.117774 56.513708 30.713588 997 2022-09-24 10.064727 9.115677 56.434259 32.638695 998 2022-09-25 9.974457 7.912468 56.121923 32.281864 999 2022-09-26 8.935891 8.412897 58.160465 32.012264[1000 rows x 5 columns]

12 注意

不能直接把返回值當(dāng)作布爾值：

if pd.Series([False, True, False]):print("I was true") ---------------------------------------------------------------------------ValueError Traceback (most recent call last)<ipython-input-210-5c782b38cd2f> in <module> ----> 1 if pd.Series([False, True, False]):2 print("I was true")D:\Applications\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)1553 "The truth value of a {0} is ambiguous. "1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format( -> 1555 self.__class__.__name__1556 )1557 )ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

總結(jié)

以上是生活随笔為你收集整理的十分钟入门Pandas的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：建模比赛心得
下一篇：程序员必须要掌握的 Git 操作技巧