8. Basic Usage of Pandas

Published: 2024/7/5 · Programming Q&A · 豆豆


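4.1.0 Overview

Pandas basics
    What is Pandas? Why use it?
    Core data structures
        DataFrame
        Panel
        Series
    Basic operations
    Computation
    Plotting
    File reading and writing
Advanced processing

4.1 Introduction to Pandas
    4.1.1 Introduction to Pandas: a data processing tool
        panel + data + analysis
        panel: panel data, the three-dimensional data of econometrics
    4.1.2 Why use Pandas
        Convenient data processing
        Easy file reading
        Wraps Matplotlib plotting and NumPy computation
    4.1.3 DataFrame
        Structure: a two-dimensional array with both a row index and a column index
        Attributes: shape, index, columns, values, T
        Methods: head(), tail()
        3 Setting the DataFrame index
            1) Modify row/column index values
            2) Reset the index
            3) Set a new index
    2 Panel
        A container of DataFrames
    3 Series
        A one-dimensional array with an index
        Attributes: index, values
    Summary:
        A DataFrame is a container of Series
        A Panel is a container of DataFrames

4.2 Basic data operations
    4.2.1 Index operations
        1) Direct indexing: column first, then row
        2) Label-based indexing: loc
        3) Position-based indexing: iloc
        4) Combined indexing: positions and labels
    4.2.3 Sorting
        Sort by content: DataFrame, Series
        Sort by index: DataFrame, Series

4.3 DataFrame computation
    Arithmetic operations
    Logical operations
        Logical operators
            Boolean indexing
        Logical functions
            query()
            isin()
    Statistical operations
        min, max, mean, median, var, std
        np.argmax(), np.argmin()
    Custom operations
        apply(func, axis=0)
            func: a user-defined function

4.4 Pandas plotting
    sr.plot()

4.5 File reading and writing
    4.5.1 CSV
        pd.read_csv(path)
            usecols=
            names=
        dataframe.to_csv(path)
            columns=[]
            index=False
            header=False
    4.5.2 HDF5
        An HDF5 file can store three-dimensional data
            key1: dataframe1 (two-dimensional data)
            key2: dataframe2 (two-dimensional data)
        pd.read_hdf(path, key=)
        df.to_hdf(path, key=)
    4.5.3 JSON
        pd.read_json(path)
            orient="records"
            lines=True
        df.to_json(path)
            orient="records"
            lines=True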

4.1.3 DataFrame

import numpy as np

# Create rise/fall data for 10 stocks over 5 days, drawn from a normal distribution
stock_change = np.random.normal(0, 1, (10, 5))
stock_change

array([[ 0.77072465,  1.30408183, -0.44043464,  0.8900768 , -0.80947118],
       [ 0.92407994,  0.01646795, -1.26614793,  1.52393669, -0.85373051],
       [-1.68378051,  0.4302981 ,  0.8069393 ,  0.60557427, -0.03960376],
       [ 0.75708007, -0.39899325,  0.23027082, -0.89585658, -1.86590247],
       [-0.41516245, -1.31841546,  0.16256478, -0.67449097, -1.26234013],
       [-0.27687242, -0.74154521, -0.03755446,  1.24182603, -0.79444361],
       [-0.2549323 , -0.41034663, -1.85076521, -1.28663451, -0.28566877],
       [ 1.22453612, -1.60200055, -1.83171522, -0.85322799, -1.70950421],
       [ 2.00461483,  1.49338564,  0.33928513, -0.1776084 , -0.39698965],
       [ 0.2184662 , -0.03868143, -0.21432675,  0.00604093,  1.35011139]])

import pandas as pd
pd.DataFrame(stock_change)

	0	1	2	3	4
0	0.770725	1.304082	-0.440435	0.890077	-0.809471
1	0.924080	0.016468	-1.266148	1.523937	-0.853731
2	-1.683781	0.430298	0.806939	0.605574	-0.039604
3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
8	2.004615	1.493386	0.339285	-0.177608	-0.396990
9	0.218466	-0.038681	-0.214327	0.006041	1.350111

# Build the list of row labels ('股票' means "stock")
stock_code = ['股票' + str(i) for i in range(stock_change.shape[0])]
stock_code

['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9']

# Add the row index
data = pd.DataFrame(stock_change, index=stock_code)
data

	0	1	2	3	4
股票0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
股票5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票9	0.218466	-0.038681	-0.214327	0.006041	1.350111

# Build the column index
# start: the start date; periods: how many dates to generate;
# freq: the spacing between dates, e.g. 'D', 'W', '5h' ('B' means business days)
date = pd.date_range(start="20200618", periods=5, freq="B")
date

DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23', '2020-06-24'], dtype='datetime64[ns]', freq='B')

# Add the column index
data = pd.DataFrame(stock_change, index=stock_code, columns=date)
data

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
股票5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票9	0.218466	-0.038681	-0.214327	0.006041	1.350111

DataFrame attributes

data.shape

(10, 5)

data.index

Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')

data.columns

DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23', '2020-06-24'], dtype='datetime64[ns]', freq='B')

data.values

array([[ 0.77072465,  1.30408183, -0.44043464,  0.8900768 , -0.80947118],
       [ 0.92407994,  0.01646795, -1.26614793,  1.52393669, -0.85373051],
       [-1.68378051,  0.4302981 ,  0.8069393 ,  0.60557427, -0.03960376],
       [ 0.75708007, -0.39899325,  0.23027082, -0.89585658, -1.86590247],
       [-0.41516245, -1.31841546,  0.16256478, -0.67449097, -1.26234013],
       [-0.27687242, -0.74154521, -0.03755446,  1.24182603, -0.79444361],
       [-0.2549323 , -0.41034663, -1.85076521, -1.28663451, -0.28566877],
       [ 1.22453612, -1.60200055, -1.83171522, -0.85322799, -1.70950421],
       [ 2.00461483,  1.49338564,  0.33928513, -0.1776084 , -0.39698965],
       [ 0.2184662 , -0.03868143, -0.21432675,  0.00604093,  1.35011139]])

data.T

	股票0	股票1	股票2	股票3	股票4	股票5	股票6	股票7	股票8	股票9
2020-06-18	0.770725	0.924080	-1.683781	0.757080	-0.415162	-0.276872	-0.254932	1.224536	2.004615	0.218466
2020-06-19	1.304082	0.016468	0.430298	-0.398993	-1.318415	-0.741545	-0.410347	-1.602001	1.493386	-0.038681
2020-06-22	-0.440435	-1.266148	0.806939	0.230271	0.162565	-0.037554	-1.850765	-1.831715	0.339285	-0.214327
2020-06-23	0.890077	1.523937	0.605574	-0.895857	-0.674491	1.241826	-1.286635	-0.853228	-0.177608	0.006041
2020-06-24	-0.809471	-0.853731	-0.039604	-1.865902	-1.262340	-0.794444	-0.285669	-1.709504	-0.396990	1.350111

DataFrame methods

data.head()  # returns the first 5 rows

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340

data.tail()  # returns the last 5 rows

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票9	0.218466	-0.038681	-0.214327	0.006041	1.350111

3 Setting the DataFrame index

Modifying row and column index values

data.index[2]

'股票2'

# Note: a single index label cannot be changed in place;
# the index of a DataFrame can only be replaced as a whole
data.index[2] = "股票88"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-9e95917cc4d9> in <module>
----> 1 data.index[2] = "股票88"

D:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   3908
   3909     def __setitem__(self, key, value):
-> 3910         raise TypeError("Index does not support mutable operations")
   3911
   3912     def __getitem__(self, key):

TypeError: Index does not support mutable operations

# Replace the whole index instead
stock_ = ["股票_{}".format(i) for i in range(10)]
data.index = stock_
data.index

Index(['股票_0', '股票_1', '股票_2', '股票_3', '股票_4', '股票_5', '股票_6', '股票_7', '股票_8', '股票_9'], dtype='object')
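Because an Index is immutable, one common way to change a single label without rebuilding the full list is DataFrame.rename with a mapping. The sketch below is a minimal illustration on a small stand-in table; the name demo and its values are assumptions made for the example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Small stand-in for the stock table above (values are arbitrary)
demo = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["股票_0", "股票_1", "股票_2"],
    columns=["a", "b"],
)

# Assigning to one position of the index raises TypeError
try:
    demo.index[2] = "股票_88"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# rename returns a new DataFrame in which only the mapped label changes
renamed = demo.rename(index={"股票_2": "股票_88"})
print(assignment_failed)
print(renamed.index.tolist())
```

rename leaves demo itself untouched, so the result has to be assigned to a name if it is needed later.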

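Resetting the index

reset_index(drop=False)
Replaces the current index with a new positional index (0, 1, 2, ...).
drop: defaults to False, which keeps the old index as an ordinary column; if True, the old index values are discarded.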

# Reset the index with drop=False (the default)
data.reset_index()

	index	2020-06-18 00:00:00	2020-06-19 00:00:00	2020-06-22 00:00:00	2020-06-23 00:00:00	2020-06-24 00:00:00
0	股票_0	0.770725	1.304082	-0.440435	0.890077	-0.809471
1	股票_1	0.924080	0.016468	-1.266148	1.523937	-0.853731
2	股票_2	-1.683781	0.430298	0.806939	0.605574	-0.039604
3	股票_3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
4	股票_4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
5	股票_5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
6	股票_6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
7	股票_7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
8	股票_8	2.004615	1.493386	0.339285	-0.177608	-0.396990
9	股票_9	0.218466	-0.038681	-0.214327	0.006041	1.350111

# Reset the index with drop=True
data.reset_index(drop=True)

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
0	0.770725	1.304082	-0.440435	0.890077	-0.809471
1	0.924080	0.016468	-1.266148	1.523937	-0.853731
2	-1.683781	0.430298	0.806939	0.605574	-0.039604
3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
8	2.004615	1.493386	0.339285	-0.177608	-0.396990
9	0.218466	-0.038681	-0.214327	0.006041	1.350111

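Setting a column as the new index

set_index(keys, drop=True)
keys: a column label, or a list of column labels
drop: boolean, default True; the columns used as the new index are removed from the regular columns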
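The effect of drop is easiest to see by comparing the remaining columns. This is a minimal sketch on a small sales table; the names dropped and kept are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 4, 7, 10],
    "year": [2012, 2014, 2013, 2014],
    "sale": [55, 40, 84, 31],
})

# Default drop=True: 'month' moves into the index and leaves the columns
dropped = df.set_index("month")

# drop=False: 'month' becomes the index and also stays as a column
kept = df.set_index("month", drop=False)

print(dropped.columns.tolist())
print(kept.columns.tolist())
```

Both calls return a new DataFrame; df keeps its original positional index.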

Example: setting a new index

1. Create the DataFrame

df = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df

	month	year	sale
0	1	2012	55
1	4	2014	40
2	7	2013	84
3	10	2014	31

2. Use the month as the new index

df.set_index('month')

month	year	sale
1	2012	55
4	2014	40
7	2013	84
10	2014	31

3. Use several columns as the index: year and month

new_df = df.set_index(['year', 'month'])
new_df

year	month	sale
2012	1	55
2014	4	40
2013	7	84
2014	10	31

new_df.index

MultiIndex([(2012, 1), (2014, 4), (2013, 7), (2014, 10)], names=['year', 'month'])

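4.1.4 The relationship between MultiIndex and Panel

1 MultiIndex: a multi-level (hierarchical) index object.

Attributes of the index

names: the names of the levels

levels: the unique values found in each level

new_df.index.names

FrozenList(['year', 'month'])

new_df.index.levels

FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])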
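Once a MultiIndex is in place, loc can select by the outer level alone or by one label per level. The sketch below rebuilds the same table; sorting the index first is an extra step assumed here to keep lookups simple, and the names rows_2014 and sale_2013_07 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 4, 7, 10],
    "year": [2012, 2014, 2013, 2014],
    "sale": [55, 40, 84, 31],
})
new_df = df.set_index(["year", "month"]).sort_index()

# A single label selects on the outer level (year); the result is indexed by month
rows_2014 = new_df.loc[2014]

# A tuple gives one label per level and pins down a single row
sale_2013_07 = new_df.loc[(2013, 7), "sale"]

print(rows_2014)
print(sale_2013_07)
```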

2 Panel

p = pd.Panel()
p  # Panel has been removed in newer versions of pandas

D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  """Entry point for launching an IPython kernel.
<pandas.__getattr__.<locals>.Panel at 0x203fd31ea08>

data

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票_0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票_1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票_2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票_3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票_4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
股票_5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票_6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票_7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票_8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票_9	0.218466	-0.038681	-0.214327	0.006041	1.350111
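With Panel gone, three-dimensional data is usually kept in a DataFrame with a MultiIndex (the xarray package is the other commonly suggested route). The following is a rough sketch of that idea using pd.concat with a dict; the table names open_price and close_price, the labels s0 and s1, and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20200618", periods=3, freq="B")
codes = ["s0", "s1"]

# Two 2-D tables that share the same rows and columns (values are arbitrary)
open_price = pd.DataFrame(np.ones((2, 3)), index=codes, columns=dates)
close_price = pd.DataFrame(np.zeros((2, 3)), index=codes, columns=dates)

# Stack them into one DataFrame; the dict keys become the outer index level
stacked = pd.concat(
    {"open": open_price, "close": close_price},
    names=["item", "code"],
)

print(stacked.index.nlevels)
print(stacked.loc["open"].shape)
```

Selecting stacked.loc["open"] gives back one two-dimensional slice, which is the role a Panel item used to play.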

3 Series

data.iloc[1, :]  # a one-dimensional array with an index

2020-06-18    0.924080
2020-06-19    0.016468
2020-06-22   -1.266148
2020-06-23    1.523937
2020-06-24   -0.853731
Freq: B, Name: 股票_1, dtype: float64

type(data.iloc[1, :])

pandas.core.series.Series

Attributes

data.iloc[1, :].index

DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23', '2020-06-24'], dtype='datetime64[ns]', freq='B')

data.iloc[1, :].values

array([ 0.92407994,  0.01646795, -1.26614793,  1.52393669, -0.85373051])

1. Creating a Series

From existing data

Values only, with the default index

pd.Series(np.arange(10))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

With an explicit index

pd.Series([6.7, 5.6, 3, 10, 2], index=[1, 2, 3, 4, 5])

1     6.7
2     5.6
3     3.0
4    10.0
5     2.0
dtype: float64

From a dictionary

pd.Series({'red': 100, 'blue': 200, 'green': 500, 'yellow': 1000})

red        100
blue       200
green      500
yellow    1000
dtype: int64

Summary

A DataFrame is a container of Series.
A Panel is a container of DataFrames.
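The first statement can be checked directly: a dict of Series builds a DataFrame, and taking out one column or one row gives a Series back. A minimal sketch, with made-up names and values:

```python
import pandas as pd

red = pd.Series([1, 2, 3], index=["a", "b", "c"])
blue = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

# A dict of Series becomes a DataFrame with one column per Series
frame = pd.DataFrame({"red": red, "blue": blue})

# Pulling out a single column or a single row gives a Series again
col = frame["blue"]
row = frame.loc["b"]

print(type(col).__name__, type(row).__name__)
```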

4.2 Basic data operations

datas = pd.read_excel("./datas/szfj_baoan.xls")
datas

	district	roomnum	hall	AREA	C_floor	floor_num	school	subway	per_price
0	baoan	3	2	89.3	middle	31	0	0	7.0773
1	baoan	4	2	127.0	high	31	0	0	6.9291
2	baoan	1	1	28.0	low	39	0	0	3.9286
3	baoan	1	1	28.0	middle	30	0	0	3.3568
4	baoan	2	2	78.0	middle	8	1	1	5.0769
...	...	...	...	...	...	...	...	...	...
1246	baoan	4	2	89.3	low	8	0	0	4.2553
1247	baoan	2	1	67.0	middle	30	0	0	3.8060
1248	baoan	2	2	67.4	middle	29	1	0	5.3412
1249	baoan	2	2	73.1	low	15	1	0	5.9508
1250	baoan	3	2	86.2	middle	32	0	1	4.5244

1251 rows × 9 columns

datas.columns

Index(['district', 'roomnum', 'hall', 'AREA', 'C_floor', 'floor_num', 'school', 'subway', 'per_price'], dtype='object')

# Drop columns (columns= already says what to drop, so no axis argument is needed)
datas = datas.drop(columns=['school', 'subway'])
datas

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	baoan	3	2	89.3	middle	31	7.0773
1	baoan	4	2	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	2	78.0	middle	8	5.0769
...	...	...	...	...	...	...	...
1246	baoan	4	2	89.3	low	8	4.2553
1247	baoan	2	1	67.0	middle	30	3.8060
1248	baoan	2	2	67.4	middle	29	5.3412
1249	baoan	2	2	73.1	low	15	5.9508
1250	baoan	3	2	86.2	middle	32	4.5244

1251 rows × 7 columns

4.2.1 Index operations

1. Direct indexing with labels (column first, then row)

datas["per_price"][0]

7.0773

2. Label-based indexing with loc (row first, then column)

datas.loc[0]["per_price"]

7.0773

datas.loc[0, "per_price"]

7.0773

3. Position-based indexing with iloc

datas.iloc[0, 6]

7.0773

4. Combined indexing (positions and labels)

# Turn row positions into row labels, then use loc
datas.index[0:4]

RangeIndex(start=0, stop=4, step=1)

datas.loc[datas.index[0:4], ["district", "roomnum"]]

	district	roomnum
0	baoan	3
1	baoan	4
2	baoan	1
3	baoan	1

# datas.columns.get_indexer() turns column names into positions, then use iloc
datas.columns.get_indexer(["district", "roomnum"])

array([0, 1], dtype=int64)

datas.iloc[0:4, datas.columns.get_indexer(["district", "roomnum"])]

	district	roomnum
0	baoan	3
1	baoan	4
2	baoan	1
3	baoan	1

4.2.2 Assignment

# Overwrite a whole column in place
datas["hall"] = 5
datas.head()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	baoan	3	5	89.3	middle	31	7.0773
1	baoan	4	5	127.0	high	31	6.9291
2	baoan	1	5	28.0	low	39	3.9286
3	baoan	1	5	28.0	middle	30	3.3568
4	baoan	2	5	78.0	middle	8	5.0769

# Or use attribute access
datas.hall = 1
datas.head()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	baoan	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769

# Overwrite a single cell by position
datas.iloc[0, 0] = "zzzz"
datas.head()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	zzzz	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769

4.2.3 Sorting

# Sort by content; ascending=False sorts in descending order (the default, True, is ascending)
datas.sort_values(by="per_price", ascending=False)

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
917	baoan	4	1	93.59	high	28	21.9040
356	baoan	8	1	248.99	low	7	21.2860
576	baoan	1	1	21.95	middle	22	19.3622
296	baoan	4	1	93.59	high	28	19.2328
186	baoan	3	1	113.60	middle	31	16.5493
...	...	...	...	...	...	...	...
911	baoan	2	1	89.00	middle	16	1.6854
841	baoan	2	1	75.00	high	7	1.6667
1188	baoan	3	1	110.00	middle	33	1.5909
684	baoan	3	1	89.00	middle	26	1.2247
1047	baoan	3	1	98.90	middle	26	1.1931

1251 rows × 7 columns

datas.sort_values(by="per_price")

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1047	baoan	3	1	98.90	middle	26	1.1931
684	baoan	3	1	89.00	middle	26	1.2247
1188	baoan	3	1	110.00	middle	33	1.5909
841	baoan	2	1	75.00	high	7	1.6667
911	baoan	2	1	89.00	middle	16	1.6854
...	...	...	...	...	...	...	...
186	baoan	3	1	113.60	middle	31	16.5493
296	baoan	4	1	93.59	high	28	19.2328
576	baoan	1	1	21.95	middle	22	19.3622
356	baoan	8	1	248.99	low	7	21.2860
917	baoan	4	1	93.59	high	28	21.9040

1251 rows × 7 columns

# Sort by several columns:
# first by "district"; rows with the same district are then ordered by "per_price"
datas.sort_values(by=["district", "per_price"])

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1047	baoan	3	1	98.90	middle	26	1.1931
684	baoan	3	1	89.00	middle	26	1.2247
1188	baoan	3	1	110.00	middle	33	1.5909
841	baoan	2	1	75.00	high	7	1.6667
911	baoan	2	1	89.00	middle	16	1.6854
...	...	...	...	...	...	...	...
296	baoan	4	1	93.59	high	28	19.2328
576	baoan	1	1	21.95	middle	22	19.3622
356	baoan	8	1	248.99	low	7	21.2860
917	baoan	4	1	93.59	high	28	21.9040
0	zzzz	3	1	89.30	middle	31	7.0773

1251 rows × 7 columns

# Sort by row index, ascending by default
datas.sort_index()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	zzzz	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769
...	...	...	...	...	...	...	...
1246	baoan	4	1	89.3	low	8	4.2553
1247	baoan	2	1	67.0	middle	30	3.8060
1248	baoan	2	1	67.4	middle	29	5.3412
1249	baoan	2	1	73.1	low	15	5.9508
1250	baoan	3	1	86.2	middle	32	4.5244

1251 rows × 7 columns

sr = datas["per_price"]
sr

0       7.0773
1       6.9291
2       3.9286
3       3.3568
4       5.0769
         ...
1246    4.2553
1247    3.8060
1248    5.3412
1249    5.9508
1250    4.5244
Name: per_price, Length: 1251, dtype: float64

# Sort a Series by its values
sr.sort_values()

1047     1.1931
684      1.2247
1188     1.5909
841      1.6667
911      1.6854
         ...
186     16.5493
296     19.2328
576     19.3622
356     21.2860
917     21.9040
Name: per_price, Length: 1251, dtype: float64

# Sort a Series by its index
sr.sort_index()

0       7.0773
1       6.9291
2       3.9286
3       3.3568
4       5.0769
         ...
1246    4.2553
1247    3.8060
1248    5.3412
1249    5.9508
1250    4.5244
Name: per_price, Length: 1251, dtype: float64
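When sorting by several columns, ascending also accepts a list with one flag per column, so each key can have its own direction. A small sketch on invented data (the houses table below is a stand-in, not the Excel file used above):

```python
import pandas as pd

houses = pd.DataFrame(
    {
        "district": ["baoan", "zzzz", "baoan", "baoan"],
        "per_price": [6.9, 7.1, 3.9, 5.1],
    },
    index=[10, 11, 12, 13],
)

# district ascending, then per_price descending within each district
ordered = houses.sort_values(by=["district", "per_price"], ascending=[True, False])

# sort_index(ascending=False) orders by row label, largest first
by_label = houses.sort_index(ascending=False)

print(ordered.index.tolist())
print(by_label.index.tolist())
```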

4.3 DataFrame computation

Arithmetic operations

# Operate on a Series
datas["roomnum"] + 3

0       6
1       7
2       4
3       4
4       5
       ..
1246    7
1247    5
1248    5
1249    5
1250    6
Name: roomnum, Length: 1251, dtype: int64

datas["roomnum"].add(3).head()

0    6
1    7
2    4
3    4
4    5
Name: roomnum, dtype: int64

datas.iloc[:, 1:4]

	roomnum	hall	AREA
0	3	1	89.3
1	4	1	127.0
2	1	1	28.0
3	1	1	28.0
4	2	1	78.0
...	...	...	...
1246	4	1	89.3
1247	2	1	67.0
1248	2	1	67.4
1249	2	1	73.1
1250	3	1	86.2

1251 rows × 3 columns

# Operate on a DataFrame
datas.iloc[:, 1:4] + 10

	roomnum	hall	AREA
0	13	11	99.3
1	14	11	137.0
2	11	11	38.0
3	11	11	38.0
4	12	11	88.0
...	...	...	...
1246	14	11	99.3
1247	12	11	77.0
1248	12	11	77.4
1249	12	11	83.1
1250	13	11	96.2

1251 rows × 3 columns

Logical operations

# The result of a comparison can be used as a filter
datas['AREA'] > 100

0       False
1        True
2       False
3       False
4       False
        ...
1246    False
1247    False
1248    False
1249    False
1250    False
Name: AREA, Length: 1251, dtype: bool

# Boolean indexing
datas[datas['AREA'] > 100]

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
16	baoan	3	1	151.00	high	20	4.9669
25	baoan	3	1	116.00	high	18	5.0000
26	baoan	5	1	151.25	high	30	7.6033
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1239	baoan	3	1	102.10	middle	28	10.8717
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269

322 rows × 7 columns

# Combining several conditions:
# rows whose area is above 100 and whose price is below 40000
(datas["AREA"] > 100) & (datas["per_price"] < 40000)

0       False
1        True
2       False
3       False
4       False
        ...
1246    False
1247    False
1248    False
1249    False
1250    False
Length: 1251, dtype: bool

# Boolean indexing
datas[(datas["AREA"] > 100) & (datas["per_price"] < 40000)]

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
16	baoan	3	1	151.00	high	20	4.9669
25	baoan	3	1	116.00	high	18	5.0000
26	baoan	5	1	151.25	high	30	7.6033
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1239	baoan	3	1	102.10	middle	28	10.8717
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269

322 rows × 7 columns
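Besides & (and), boolean Series can be combined with | (or) and negated with ~ (not). The sketch below uses a four-row stand-in table with invented values, and a price threshold chosen so that the conditions actually differ.

```python
import pandas as pd

houses = pd.DataFrame({
    "AREA": [89.3, 127.0, 28.0, 151.0],
    "per_price": [7.08, 6.93, 3.93, 4.97],
})

# Each comparison needs its own parentheses because & and | bind tighter than > and <
big_and_cheap = houses[(houses["AREA"] > 100) & (houses["per_price"] < 5)]
big_or_cheap = houses[(houses["AREA"] > 100) | (houses["per_price"] < 5)]
not_big = houses[~(houses["AREA"] > 100)]

print(big_and_cheap.index.tolist())
print(big_or_cheap.index.tolist())
print(not_big.index.tolist())
```

Python's own and, or and not do not work element-wise on a Series, which is why the bitwise operators are used here.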

Logical functions

# query() filters rows with a condition string
datas.query("AREA>100 & per_price<40000")

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
16	baoan	3	1	151.00	high	20	4.9669
25	baoan	3	1	116.00	high	18	5.0000
26	baoan	5	1	151.25	high	30	7.6033
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1239	baoan	3	1	102.10	middle	28	10.8717
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269

322 rows × 7 columns

datas["roomnum"].isin([4, 5])

0       False
1        True
2       False
3       False
4       False
        ...
1246     True
1247    False
1248    False
1249    False
1250    False
Name: roomnum, Length: 1251, dtype: bool

# isin() tests membership in a list of values, which can then be used as a filter
# Select the rows with 4 or 5 rooms
datas[datas["roomnum"].isin([4, 5])]

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
26	baoan	5	1	151.25	high	30	7.6033
29	baoan	4	1	143.45	middle	25	6.9711
36	baoan	4	1	134.60	middle	32	9.1828
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269
1246	baoan	4	1	89.30	low	8	4.2553

224 rows × 7 columns

Statistical operations

# Count, mean, standard deviation, minimum, quartiles and maximum of each numeric column
datas.describe()

	roomnum	hall	AREA	floor_num	per_price
count	1251.000000	1251.0	1251.000000	1251.000000	1251.000000
mean	2.906475	1.0	92.409976	24.598721	6.643429
std	0.940663	0.0	37.798122	9.332119	2.435132
min	1.000000	1.0	21.950000	1.000000	1.193100
25%	2.000000	1.0	75.000000	17.000000	5.075850
50%	3.000000	1.0	87.800000	28.000000	5.906800
75%	3.000000	1.0	101.375000	31.000000	7.761950
max	8.000000	1.0	352.900000	53.000000	21.904000

Statistical functions

# axis=0 gives the maximum of each column; axis=1 gives the maximum of each row
datas.max(axis=0)

district       zzzz
roomnum           8
hall              1
AREA          352.9
C_floor      middle
floor_num        53
per_price    21.904
dtype: object

# Variance (numeric_only=True skips the text columns, which recent pandas versions require)
datas.var(axis=0, numeric_only=True)

roomnum         0.884846
hall            0.000000
AREA         1428.698032
floor_num      87.088446
per_price       5.929870
dtype: float64

# Standard deviation
datas.std(axis=0, numeric_only=True)

roomnum       0.940663
hall          0.000000
AREA         37.798122
floor_num     9.332119
per_price     2.435132
dtype: float64

datas.iloc[:, 3]

0        89.3
1       127.0
2        28.0
3        28.0
4        78.0
        ...
1246     89.3
1247     67.0
1248     67.4
1249     73.1
1250     86.2
Name: AREA, Length: 1251, dtype: float64

# Index label of the maximum value
datas.iloc[:, 3].idxmax(axis=0)

759

datas.iloc[759, 3]

352.9

# Index label of the minimum value
datas.iloc[:, 3].idxmin(axis=0)

576

datas.iloc[576, 3]

21.95
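The overview also lists mean, median and np.argmax. The sketch below, on a tiny invented table, shows the axis argument in both directions and the difference between idxmax (a label) and NumPy's argmax (a position); all names and values are illustrative.

```python
import pandas as pd

scores = pd.DataFrame(
    {"a": [1.0, 2.0, 6.0], "b": [4.0, 8.0, 0.0]},
    index=["x", "y", "z"],
)

col_mean = scores.mean(axis=0)      # one value per column
col_median = scores.median(axis=0)  # one value per column
row_max = scores.max(axis=1)        # one value per row

# idxmax returns the row label of the maximum ...
label_of_max = scores["a"].idxmax()
# ... while NumPy's argmax returns its integer position
position_of_max = int(scores["a"].to_numpy().argmax())

print(col_mean.tolist(), col_median.tolist(), row_max.tolist())
print(label_of_max, position_of_max)
```

With a default RangeIndex, as in the housing data above, label and position happen to coincide, which is why datas.iloc[759, 3] finds the right row.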

Cumulative statistics

datas["per_price"]

0       7.0773
1       6.9291
2       3.9286
3       3.3568
4       5.0769
         ...
1246    4.2553
1247    3.8060
1248    5.3412
1249    5.9508
1250    4.5244
Name: per_price, Length: 1251, dtype: float64

# Cumulative sum
datas["per_price"].cumsum()

0          7.0773
1         14.0064
2         17.9350
3         21.2918
4         26.3687
          ...
1246    8291.3076
1247    8295.1136
1248    8300.4548
1249    8306.4056
1250    8310.9300
Name: per_price, Length: 1251, dtype: float64

datas["per_price"].sort_index().cumsum().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2039a3a3dc8>

import matplotlib.pyplot as plt
datas["per_price"].sort_index().cumsum().plot()
plt.show()

[Figure: line plot of the cumulative sum of per_price]
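cumsum has siblings for the running maximum, minimum and product. A short sketch on a five-element Series with arbitrary values:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum()    # 3, 4, 8, 9, 14
running_max = s.cummax()    # 3, 3, 4, 4, 5
running_min = s.cummin()    # 3, 1, 1, 1, 1
running_prod = s.cumprod()  # 3, 3, 12, 12, 60

print(running_sum.tolist())
print(running_max.tolist())
print(running_min.tolist())
print(running_prod.tolist())
```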

Custom operations

# Apply a custom function that computes maximum minus minimum for each column
datas[["per_price"]].apply(lambda x: x.max() - x.min(), axis=0)

per_price    20.7109
dtype: float64
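With axis=0 the function receives one column at a time; with axis=1 it receives one row at a time. The sketch below shows both on a three-row stand-in table; the helper value_range and the product computed per row are only there to illustrate the mechanics.

```python
import pandas as pd

houses = pd.DataFrame({
    "AREA": [89.3, 127.0, 28.0],
    "per_price": [7.0, 6.0, 4.0],
})

def value_range(col):
    """Return the maximum minus the minimum of a Series."""
    return col.max() - col.min()

# axis=0: the function is called once per column
per_column = houses.apply(value_range, axis=0)

# axis=1: the function is called once per row
area_times_price = houses.apply(lambda row: row["AREA"] * row["per_price"], axis=1)

print(per_column)
print(area_times_price)
```

For simple arithmetic like the row-wise product, houses["AREA"] * houses["per_price"] gives the same result and is usually faster; apply is mainly useful when the logic cannot be expressed with vectorized operations.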

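4.4 Pandas plotting

# Look at the relationship between area and price
datas.plot(x="AREA", y="per_price", kind="scatter")

<matplotlib.axes._subplots.AxesSubplot at 0x203a343dec8>

[Figure: scatter plot of per_price against AREA]

# Look at the relationship between floor number and price
datas.plot(x="floor_num", y="per_price", kind="scatter")

<matplotlib.axes._subplots.AxesSubplot at 0x203a3a81bc8>

[Figure: scatter plot of per_price against floor_num]

datas.plot(x="AREA", y="per_price", kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x203a2147f08>

[Figure: horizontal bar chart of per_price by AREA]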

4.5 File reading and writing

1. Reading a CSV file with read_csv()

iris_data = pd.read_csv("./datas/iris.data.csv")
iris_data.head()

	feature1	feature2	feature3	feature4	result
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

# usecols: the columns to read, given as a list of names
iris_data1 = pd.read_csv("./datas/iris.data.csv", usecols=["feature1", "feature2", "result"])
iris_data1.head()

	feature1	feature2	result
0	5.1	3.5	Iris-setosa
1	4.9	3.0	Iris-setosa
2	4.7	3.2	Iris-setosa
3	4.6	3.1	Iris-setosa
4	5.0	3.6	Iris-setosa

# This file has no header row, so the first data row is mistaken for the column names
iris_data2 = pd.read_csv("./datas/iris.data2.csv")
iris_data2.head()

	5.1	3.5	1.4	0.2	Iris-setosa
0	4.9	3.0	1.4	0.2	Iris-setosa
1	4.7	3.2	1.3	0.2	Iris-setosa
2	4.6	3.1	1.5	0.2	Iris-setosa
3	5.0	3.6	1.4	0.2	Iris-setosa
4	5.4	3.9	1.7	0.4	Iris-setosa

# names: if the data set has no column names of its own, supply them here
iris_data2 = pd.read_csv("./datas/iris.data2.csv", names=["feature1", "feature2", "feature3", "feature4", "result"])
iris_data2.head()

	feature1	feature2	feature3	feature4	result
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

2. Writing a CSV file with to_csv()

datas.head(5)

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	zzzz	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769

# Save only the per_price column
# index=False leaves out the row index
# mode="a" appends to the file
# header=False avoids writing the column name again on every append
datas[:-1].to_csv("./price_test", columns=['per_price'], index=False, mode="a", header=False)

# Read the file back and inspect it
price_test = pd.read_csv("./price_test")
price_test

	per_price
0	7.0773
1	6.9291
2	3.9286
3	3.3568
4	5.0769
...	...
3746	6.1932
3747	4.2553
3748	3.806
3749	5.3412
3750	5.9508

3751 rows × 1 columns
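The overview also lists JSON with orient="records" and lines=True, which the walkthrough above does not demonstrate. The following is a minimal round-trip sketch on a two-row table; the file name demo.json and the temporary directory are assumptions made so the example does not touch real data. (HDF5 via to_hdf and read_hdf follows the same pattern with a key= argument, but needs the optional PyTables package, so it is left out here.)

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "feature1": [5.1, 4.9],
    "result": ["Iris-setosa", "Iris-setosa"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.json")  # hypothetical file name

    # orient="records" with lines=True writes one JSON object per line
    df.to_json(path, orient="records", lines=True)

    with open(path, encoding="utf-8") as fh:
        first_line = fh.readline().strip()

    # The same two options are needed to read the file back
    restored = pd.read_json(path, orient="records", lines=True)

print(first_line)
print(restored)
```

One object per line makes the file easy to append to and to process as a stream, which is the usual reason to choose lines=True.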
