Advanced Pandas Tutorial: Handling Missing Data

Published: 2024/2/28

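Contents

  • Introduction
  • A NaN example
  • Missing values in integer data
  • Missing values in datetime data
  • Converting between None and np.nan
  • Calculations with missing values
  • Filling NaN values with fillna
  • Dropping NA data with dropna
  • Interpolation
  • Replacing values with replace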

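Introduction

When pandas processes data, it uses NaN to mark values that are missing or could not be parsed. This gives every cell a representation, but NaN carries no numeric information: it propagates through arithmetic, so it has to be handled explicitly before analysis.

This article walks through the tools pandas provides for working with NaN.

A NaN example

As noted above, missing data is represented as NaN. Let's look at a concrete example.

First, build a DataFrame (the sessions in this article assume import numpy as np and import pandas as pd):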

In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])
   ...:

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

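The DataFrame above only has the index labels a, c, e, f, and h. Reindexing it with a fuller set of labels introduces rows that have no data: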

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

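Every label that had no data in the original DataFrame is filled with NaN.

To detect NaN, use the isna() or notna() methods, which are also available as the top-level functions pd.isna() and pd.notna():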

In [7]: df2['one']
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

Note that in Python, None compares equal to itself:

In [11]: None == None  # noqa: E711
Out[11]: True

np.nan, however, does not compare equal to itself, so == cannot be used to find missing values:

In [12]: np.nan == np.nan
Out[12]: False
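A practical consequence is that an equality test against np.nan never matches anything. The short sketch below uses a made-up three-element Series (not part of the original example) to contrast that with isna():

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
s = pd.Series([1.0, np.nan, 3.0])

# Element-wise == against np.nan is False everywhere, because NaN != NaN.
eq_mask = s == np.nan

# isna() is the reliable way to locate missing values.
na_mask = s.isna()

print(eq_mask.tolist())  # [False, False, False]
print(na_mask.tolist())  # [False, True, False]
```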

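Missing values in integer data

NaN is a float, so integer data that contains a missing value is stored as float64 by default. To keep the values as integers, request the nullable integer dtype Int64 explicitly; it uses <NA> as its missing-value marker: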

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64
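To make the difference concrete, here is a small sketch that builds the same values twice, once with the default dtype and once with the nullable integer dtype. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 4]

# Default inference: the NaN forces the whole Series to float64.
as_float = pd.Series(values)

# Nullable integer dtype: integers stay integers, the gap becomes <NA>.
as_int = pd.Series(values, dtype="Int64")

print(as_float.dtype)          # float64
print(as_int.dtype)            # Int64
print(as_int.isna().tolist())  # [False, False, True, False]
```

The string alias "Int64" (capital I) is shorthand for pd.Int64Dtype() as used in the session above.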

Missing values in datetime data

Missing values in datetime data are represented by NaT:

In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64
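The same behavior can be seen on a standalone Series. The following is a minimal sketch with made-up dates: assigning np.nan into datetime data stores NaT, the column keeps a datetime dtype, and isna() reports the gap as usual.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps, for illustration only.
stamps = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]))

# Assigning np.nan to a datetime Series stores NaT rather than a float NaN.
stamps.iloc[1] = np.nan

still_datetime = pd.api.types.is_datetime64_any_dtype(stamps)
missing = stamps.isna().tolist()

print(still_datetime)  # True
print(missing)         # [False, True, False]
```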

Converting between None and np.nan

For numeric data, assigning None converts the value to NaN, and an integer Series becomes float64 as a result:

In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]:
0    NaN
1    2.0
2    3.0
dtype: float64

For object data, the assigned value is kept as is, whether it is None or np.nan:

In [24]: s = pd.Series(["a", "b", "c"])

In [25]: s.loc[0] = None

In [26]: s.loc[1] = np.nan

In [27]: s
Out[27]:
0    None
1     NaN
2       c
dtype: object
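Although an object Series can hold both None and np.nan, isna() treats them the same way. The sketch below builds the Series directly through the constructor rather than by assignment; that is an illustrative choice, not what the session above does:

```python
import numpy as np
import pandas as pd

# Numeric input: None is converted to NaN and the dtype becomes float64.
nums = pd.Series([1, None, 3])

# Object input: None and np.nan are stored unchanged...
objs = pd.Series(["a", None, np.nan], dtype=object)

# ...but both count as missing.
obj_missing = objs.isna().tolist()

print(nums.dtype)   # float64
print(obj_missing)  # [False, True, True]
```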

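Calculations with missing values

Arithmetic between pandas objects propagates missing values: wherever either operand is NaN, or a label exists in only one of the operands, the result is NaN: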

In [28]: a
Out[28]:
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]:
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542

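Descriptive statistics, on the other hand, skip NaN by default. In a sum this is equivalent to treating NaN as zero; in a mean, NaN values are left out of both the total and the count: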

In [31]: df
Out[31]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.9853605075978744

In [33]: df.mean(1)
Out[33]:
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64

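Cumulative methods such as cumsum and cumprod also skip NaN by default, while leaving NaN at the corresponding positions of the result. To have a NaN propagate to every later entry instead, pass skipna=False: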

In [34]: df.cumsum()
Out[34]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  0.929249 -1.682273
e  NaN -0.114987 -2.544122
f  NaN -0.609917 -1.472318
h  NaN -1.316688 -2.511893
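The effect of skipna is easier to check by hand on a tiny Series. This sketch uses made-up values so that every result can be verified mentally:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one gap between 1 and 3.
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                    # NaN skipped: 1 + 3 = 4.0
average = s.mean()                 # mean of the two valid values: 2.0
strict_total = s.sum(skipna=False) # NaN propagates: nan

running = s.cumsum()               # 1.0, NaN, 4.0 (gap kept, total continues)
strict_running = s.cumsum(skipna=False)  # 1.0, NaN, NaN

print(total, average, strict_total)
print(running.tolist())
print(strict_running.tolist())
```

Note that the mean is 2.0, not 4/3: the missing entry is excluded from the count rather than being counted as a zero.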

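Filling NaN values with fillna

In data analysis, NaN values usually have to be dealt with before any further processing. One option is to fill them using fillna.

Filling with a constant: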

In [42]: df2
Out[42]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]:
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0

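You can also specify a fill method, such as pad, which propagates the last valid value forward. Leading NaN values with nothing before them stay as NaN. (Recent pandas releases deprecate the method argument of fillna in favor of calling ffill() or bfill() directly.)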

In [45]: df
Out[45]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575

Use limit to cap the number of consecutive NaN values that get filled:

In [48]: df.fillna(method='pad', limit=1)

Summary of fill methods:

Method              Action
pad / ffill         Fill values forward
bfill / backfill    Fill values backward
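The two directions and the limit argument can be compared side by side. This is a minimal sketch on made-up data, calling the ffill() and bfill() methods directly instead of passing method= to fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a gap of two consecutive NaN values.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()                  # carry the last valid value forward
backward = s.bfill()                 # pull the next valid value backward
forward_limited = s.ffill(limit=1)   # fill at most one NaN per gap

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
print(forward_limited.isna().tolist())  # [False, False, True, False]
```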

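You can also fill with a pandas object, for example a Series of column means. The fill values are aligned on column labels, so passing only part of that Series fills only the matching columns: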

In [53]: dff
Out[53]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

The full fill, dff.fillna(dff.mean()), can equivalently be written with where:

In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
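The equivalence between fillna with a Series of means and the where form can be checked on a small frame. The data below is made up for the purpose, so that the column means (2.0 and 5.0) are easy to verify:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one gap per column.
dff = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                    "B": [np.nan, 4.0, 6.0]})

means = dff.mean()   # A -> 2.0, B -> 5.0 (NaN skipped)

# Fill each column's gaps with that column's mean.
filled = dff.fillna(means)

# Same result: keep values where they are present, otherwise take the mean,
# aligning the Series of means along the columns.
via_where = dff.where(dff.notna(), means, axis="columns")

print(filled)
print(filled.equals(via_where))  # True
```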

Dropping NA data with dropna

Besides filling with fillna, you can use dropna to remove the rows (axis=0) or columns (axis=1) that contain NA values.

In [57]: df
Out[57]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]:
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
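Dropping every row that has any NaN can discard a lot of data, as the empty result above shows. dropna also accepts the how, thresh, and subset parameters for finer control; they are not covered in the text above, and the frame below is made up to show how they differ:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row b is entirely missing, rows a and d partially.
df = pd.DataFrame(
    {"one": [np.nan, np.nan, 1.0, np.nan],
     "two": [1.0, np.nan, 3.0, np.nan],
     "three": [2.0, np.nan, 4.0, 5.0]},
    index=["a", "b", "c", "d"],
)

any_na = df.dropna()                     # drop rows with at least one NaN
all_na = df.dropna(how="all")            # drop rows where every value is NaN
at_least_two = df.dropna(thresh=2)       # keep rows with >= 2 non-NaN values
by_column = df.dropna(subset=["three"])  # only look at column 'three'

print(list(any_na.index))        # ['c']
print(list(all_na.index))        # ['a', 'c', 'd']
print(list(at_least_two.index))  # ['a', 'c']
print(list(by_column.index))     # ['a', 'c', 'd']
```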

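Interpolation

Sometimes a gap is better filled with an estimate based on the neighboring values, which keeps the series smooth. interpolate() does this, using linear interpolation by default: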

In [61]: ts
Out[61]:
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]:
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

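interpolate also takes a method argument. The default treats all points as equally spaced; method='time' instead interpolates in proportion to the time elapsed between index values: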

In [67]: ts2
Out[67]:
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]:
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64

For a float index, method='values' interpolates using the index values as the x-coordinates:

In [70]: ser
Out[70]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64
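The difference between the default method and the index-aware methods is easiest to see when the spacing is uneven. In this sketch (dates and values are made up), the missing point sits one day after the first observation and nine days before the last, so a time-weighted estimate lands much closer to the first value:

```python
import numpy as np
import pandas as pd

# Hypothetical unevenly spaced dates: gaps of 1 day and 9 days.
idx = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-11"])
ts = pd.Series([0.0, np.nan, 10.0], index=idx)

# Default (linear): ignores the index, so the gap gets the midpoint.
linear = ts.interpolate()

# method='time': weights by elapsed time, 1 day out of 10.
by_time = ts.interpolate(method="time")

# The same idea with a float index and method='values'.
num = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])
by_values = num.interpolate(method="values")

print(linear.iloc[1])     # 5.0
print(by_time.iloc[1])    # approximately 1.0
print(by_values.iloc[1])  # 1.0
```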

Besides a Series, you can also interpolate a DataFrame, in which case each column is interpolated independently:

In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
   ....:

In [74]: df
Out[74]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

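interpolate also accepts a limit argument, the maximum number of consecutive NaN values to fill. (The ser below is a different Series from the one in the previous example; its definition is not shown.)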

In [95]: ser.interpolate(limit=1)
Out[95]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64
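The session above does not show how ser was built. The sketch below assumes a Series that reproduces that output, and also shows the limit_direction parameter, which is not discussed in the text but controls whether gaps are filled forward, backward, or from both sides:

```python
import numpy as np
import pandas as pd

# Assumed input, chosen to match the output shown above.
ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

# Default direction is forward: one value is filled after each valid point.
forward = ser.interpolate(limit=1)

# 'both' additionally fills one value before each valid point.
both = ser.interpolate(limit=1, limit_direction="both")

print(forward.tolist())
print(both.tolist())
```

With limit=1 and the forward direction, positions 3 and 7 are filled (7.0 and 13.0). With limit_direction="both", positions 1 and 5 are filled as well (5.0 and 11.0).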

Replacing values with replace

replace can substitute a single value, or map a list of values to a list of replacements:

In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

In [103]: ser.replace(0, 5)
Out[103]:
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]:
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

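You can also target specific values in specific DataFrame columns by passing a dict that maps each column to the value to replace: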

In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]:
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9
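In the context of missing data, a common use of replace is turning a sentinel value into NaN so that the tools described earlier can work with it. The sketch below uses -999 as a hypothetical sentinel, and also shows that the dict form only touches the value named for each column:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where -999 stands for "missing".
raw = pd.Series([1.0, -999.0, 3.0, -999.0])
clean = raw.replace(-999.0, np.nan)

# Per-column replacement: 0 is replaced in 'a' only, 5 in 'b' only.
df = pd.DataFrame({"a": [0, 1, 2], "b": [5, 0, 7]})
per_col = df.replace({"a": 0, "b": 5}, 100)

print(clean.isna().tolist())  # [False, True, False, True]
print(per_col["a"].tolist())  # [100, 1, 2]
print(per_col["b"].tolist())  # [100, 0, 7]
```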

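Instead of supplying a replacement value, you can pass a fill method, so that each matched value is replaced by the nearest preceding value that was not matched. (Recent pandas releases deprecate the method argument of replace.)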

In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]:
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64
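The same result can be obtained without the method argument by masking the matched values and then forward filling. This is one possible equivalent, sketched with the same ser as above, and not the only way to write it:

```python
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Step 1: turn every matched value (1, 2, or 3) into NaN.
masked = ser.mask(ser.isin([1, 2, 3]))

# Step 2: carry the last unmatched value forward over the gaps.
result = masked.ffill()

print(result.tolist())  # [0.0, 0.0, 0.0, 0.0, 4.0]
```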

This article is also published at http://www.flydean.com/07-python-pandas-missingdata/
