
Data Science and AI Technical Notes, Part 19: Data Wrangling (Part 2)

Published: 2023/12/10

19. Data Wrangling (Part 2)

Author: Chris Albon

Translator: 飛龍

License: CC BY-NC-SA 4.0

Joining and Merging DataFrames

# Import modules
import pandas as pd
from IPython.display import display
from IPython.display import Image

# Create the first DataFrame
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_a

  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches

# Create the second DataFrame
raw_data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_b

  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan

# Create the third DataFrame
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data, columns=['subject_id', 'test_id'])
df_n

  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16
# Concatenate the two DataFrames along rows
df_new = pd.concat([df_a, df_b])
df_new

  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan

# Concatenate the two DataFrames along columns
pd.concat([df_a, df_b], axis=1)

  subject_id first_name last_name subject_id first_name last_name
0          1       Alex  Anderson          4      Billy    Bonder
1          2        Amy  Ackerman          5      Brian     Black
2          3      Allen       Ali          6       Bran   Balwner
3          4      Alice      Aoni          7      Bryce     Brice
4          5     Ayoung   Atiches          8      Betty    Btisan
# Merge the two DataFrames on subject_id
pd.merge(df_new, df_n, on='subject_id')

  subject_id first_name last_name  test_id
0          1       Alex  Anderson       51
1          2        Amy  Ackerman       15
2          3      Allen       Ali       15
3          4      Alice      Aoni       61
4          4      Billy    Bonder       61
5          5     Ayoung   Atiches       16
6          5      Brian     Black       16
7          7      Bryce     Brice       14
8          8      Betty    Btisan       15

# Merge the two DataFrames, naming the key column explicitly for the left and the right DataFrame
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')

  subject_id first_name last_name  test_id
0          1       Alex  Anderson       51
1          2        Amy  Ackerman       15
2          3      Allen       Ali       15
3          4      Alice      Aoni       61
4          4      Billy    Bonder       61
5          5     Ayoung   Atiches       16
6          5      Brian     Black       16
7          7      Bryce     Brice       14
8          8      Betty    Btisan       15

Merge with an outer join.

"Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null." – [source](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/)

pd.merge(df_a, df_b, on='subject_id', how='outer')

  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         Alex    Anderson          NaN         NaN
1          2          Amy    Ackerman          NaN         NaN
2          3        Allen         Ali          NaN         NaN
3          4        Alice        Aoni        Billy      Bonder
4          5       Ayoung     Atiches        Brian       Black
5          6          NaN         NaN         Bran     Balwner
6          7          NaN         NaN        Bryce       Brice
7          8          NaN         NaN        Betty      Btisan
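
When reading an outer join it can help to see which side each row came from. The sketch below is not part of the original notes; it uses two small hypothetical frames (`left` and `right`, cut-down stand-ins for `df_a` and `df_b`) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical miniature versions of df_a and df_b
left = pd.DataFrame({'subject_id': ['1', '4', '5'],
                     'first_name': ['Alex', 'Alice', 'Ayoung']})
right = pd.DataFrame({'subject_id': ['4', '5', '6'],
                      'first_name': ['Billy', 'Brian', 'Bran']})

# indicator=True adds a _merge column holding 'left_only', 'right_only' or 'both'
merged = pd.merge(left, right, on='subject_id', how='outer', indicator=True)

# Count how many keys fall into each category
origin_counts = merged['_merge'].value_counts().to_dict()

print(merged)
print(origin_counts)
```

Filtering on `_merge == 'both'` gives the rows an inner join would keep, so the column is a handy check before choosing a join type.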

Merge with an inner join.

"Inner join produces only the set of records that match in both Table A and Table B." – source

pd.merge(df_a, df_b, on='subject_id', how='inner')

  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black

# Merge with a right join
pd.merge(df_a, df_b, on='subject_id', how='right')

  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black
2          6          NaN         NaN         Bran     Balwner
3          7          NaN         NaN        Bryce       Brice
4          8          NaN         NaN        Betty      Btisan

Merge with a left join.

"Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null." – source

pd.merge(df_a, df_b, on='subject_id', how='left')

  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         Alex    Anderson          NaN         NaN
1          2          Amy    Ackerman          NaN         NaN
2          3        Allen         Ali          NaN         NaN
3          4        Alice        Aoni        Billy      Bonder
4          5       Ayoung     Atiches        Brian       Black

# Add custom suffixes to the duplicated column names when merging
pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))

  subject_id first_name_left last_name_left first_name_right last_name_right
0          1            Alex       Anderson              NaN             NaN
1          2             Amy       Ackerman              NaN             NaN
2          3           Allen            Ali              NaN             NaN
3          4           Alice           Aoni            Billy          Bonder
4          5          Ayoung        Atiches            Brian           Black

# Merge based on the indexes
pd.merge(df_a, df_b, right_index=True, left_index=True)

  subject_id_x first_name_x last_name_x subject_id_y first_name_y last_name_y
0            1         Alex    Anderson            4        Billy      Bonder
1            2          Amy    Ackerman            5        Brian       Black
2            3        Allen         Ali            6         Bran     Balwner
3            4        Alice        Aoni            7        Bryce       Brice
4            5       Ayoung     Atiches            8        Betty      Btisan

Listing the Unique Values in a pandas Column

Special thanks to Bob Haffner for pointing out a better way of doing this.

# Import modules
import pandas as pd

# Set the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)

# Set the maximum number of columns displayed
pd.set_option('display.max_columns', 50)

# Create an example DataFrame
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014

# List the unique values in df['name']
df.name.unique()

# array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object)
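
Alongside `unique()`, it is often useful to know how many distinct values there are and how often each occurs. A minimal sketch, assuming a cut-down copy of the example frame above; `nunique()` and `value_counts()` are standard pandas Series methods, but this block is an illustration rather than part of the original notes.

```python
import pandas as pd

# Cut-down copy of the example data (only the columns needed here)
df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'year': [2012, 2012, 2013, 2014, 2014]})

# Number of distinct years
n_years = df['year'].nunique()

# How many rows fall in each year, ordered by year
year_counts = df['year'].value_counts().sort_index()

print(n_years)
print(year_counts)
```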

Loading a JSON File

# Load library
import pandas as pd

# URL of the JSON file (this could also be a file path)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'

# Load the JSON file into a DataFrame
df = pd.read_json(url, orient='columns')

# View the first ten rows
df.head(10)

    category            datetime  integer
0          0 2015-01-01 00:00:00        5
1          0 2015-01-01 00:00:01        5
10         0 2015-01-01 00:00:10        5
11         0 2015-01-01 00:00:11        5
12         0 2015-01-01 00:00:12        8
13         0 2015-01-01 00:00:13        9
14         0 2015-01-01 00:00:14        8
15         0 2015-01-01 00:00:15        8
16         0 2015-01-01 00:00:16        2
17         0 2015-01-01 00:00:17        1

Loading an Excel File

# Load library
import pandas as pd

# URL of the Excel file (this could also be a file path)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.xlsx'

# Load the first sheet of the Excel file into a DataFrame
# (the keyword is sheet_name; the older spelling sheetname is no longer accepted by current pandas)
df = pd.read_excel(url, sheet_name=0, header=1)

# View the first ten rows
df.head(10)

   5 2015-01-01 00:00:00  0
0  5 2015-01-01 00:00:01  0
1  9 2015-01-01 00:00:02  0
2  6 2015-01-01 00:00:03  0
3  6 2015-01-01 00:00:04  0
4  9 2015-01-01 00:00:05  0
5  7 2015-01-01 00:00:06  0
6  1 2015-01-01 00:00:07  0
7  6 2015-01-01 00:00:08  0
8  9 2015-01-01 00:00:09  0
9  5 2015-01-01 00:00:10  0

將 Excel 表格加載為數據幀

# 導入模塊 import pandas as pd# 加載 excel 文件并賦給 xls_file xls_file = pd.ExcelFile('../data/example.xls') xls_file# <pandas.io.excel.ExcelFile at 0x111912be0> # 查看電子表格的名稱 xls_file.sheet_names# ['Sheet1'] # 將 xls 文件 的 Sheet1 加載為數據幀 df = xls_file.parse('Sheet1') df yeardeaths_attackerdeaths_defendersoldiers_attackersoldiers_defenderwounded_attackerwounded_defender
019454254232532372354114
11956242264634625232141424
21964323123133412133131131
3196922323673212451212
419717832312563267112334
519814364223567832124124
6198232412425326222641124
719923321631527733313111431
8199926223227322522132122
920048432136278267736232563

加載 CSV

# 導入模塊 import pandas as pd import numpy as npraw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, ".", "."],'postTestScore': ["25,000", "94,000", 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df first_namelast_nameagepreTestScorepostTestScore
0JasonMiller42425,000
1MollyJacobson522494,000
2Tina.363157
3JakeMilner24.62
4AmyCooze73.70
# 將數據幀保存為工作目錄中的 csv df.to_csv('pandas_dataframe_importing_csv/example.csv')df = pd.read_csv('pandas_dataframe_importing_csv/example.csv') df Unnamed: 0first_namelast_nameagepreTestScorepostTestScore
00JasonMiller42425,000
11MollyJacobson522494,000
22Tina.363157
33JakeMilner24.62
44AmyCooze73.70
# 加載無頭 CSV df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None) df 012345
0NaNfirst_namelast_nameagepreTestScorepostTestScore
10.0JasonMiller42425,000
21.0MollyJacobson522494,000
32.0Tina.363157
43.0JakeMilner24.62
54.0AmyCooze73.70
# 在加載 csv 時指定列名稱 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score']) df UIDFirst NameLast NameAgePre-Test ScorePost-Test Score
0NaNfirst_namelast_nameagepreTestScorepostTestScore
10.0JasonMiller42425,000
21.0MollyJacobson522494,000
32.0Tina.363157
43.0JakeMilner24.62
54.0AmyCooze73.70
# 通過將索引列設置為 UID 來加載 csv df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score']) df First NameLast NameAgePre-Test ScorePost-Test Score
UID
NaNfirst_namelast_nameagepreTestScorepostTestScore
0.0JasonMiller42425,000
1.0MollyJacobson522494,000
2.0Tina.363157
3.0JakeMilner24.62
4.0AmyCooze73.70
# 在加載 csv 時將索引列設置為名字和姓氏 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score']) df UIDAgePre-Test ScorePost-Test Score
First NameLast Name
first_namelast_nameNaNagepreTestScorepostTestScore
JasonMiller0.042425,000
MollyJacobson1.0522494,000
Tina.2.0363157
JakeMilner3.024.62
AmyCooze4.073.70
# 在加載 csv 時指定 '.' 為缺失值 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.']) pd.isnull(df) Unnamed: 0first_namelast_nameagepreTestScorepostTestScore
0FalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalse
2FalseFalseTrueFalseFalseFalse
3FalseFalseFalseFalseTrueFalse
4FalseFalseFalseFalseTrueFalse
# Load the csv, treating '.' and 'NA' as missing values in the last_name column
# and '.' as missing in the preTestScore column.
# The dictionary keys must match the column names in the file.
sentinels = {'last_name': ['.', 'NA'], 'preTestScore': ['.']}
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels)
df

   Unnamed: 0 first_name last_name  age  preTestScore postTestScore
0           0      Jason    Miller   42           4.0        25,000
1           1      Molly  Jacobson   52          24.0        94,000
2           2       Tina       NaN   36          31.0            57
3           3       Jake    Milner   24           NaN            62
4           4        Amy     Cooze   73           NaN            70
# 在加載 csv 時跳過前 3 行 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3) df 2Tina.363157
03JakeMilner24.62
14AmyCooze73.70
# 加載 csv,同時將數字字符串中的 ',' 解釋為千位分隔符 df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',') df Unnamed: 0first_namelast_nameagepreTestScorepostTestScore
00JasonMiller42425000
11MollyJacobson522494000
22Tina.363157
33JakeMilner24.62
44AmyCooze73.70

Long to Wide Format

# Import modules
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df

   patient  obs  treatment  score
0        1    1          0   6252
1        1    2          1  24243
2        1    3          0   2345
3        2    1          1   2342
4        2    2          0  23525

Make "wide" data.

Now we will create a "wide" DataFrame in which the rows are indexed by patient number, the columns by observation number, and the cell values are the scores.

df.pivot(index='patient', columns='obs', values='score')

obs           1        2       3
patient
1        6252.0  24243.0  2345.0
2        2342.0  23525.0     NaN
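
The reverse trip, from wide back to long, is not covered in the original notes. One possible sketch uses `melt`; the variable names `wide` and `long_again` are illustrative, and dropping the `NaN` cell assumes that a missing observation should not become a row.

```python
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data)

# Long -> wide, as in the text
wide = df.pivot(index='patient', columns='obs', values='score')

# Wide -> long: turn the index back into a column, unpivot the observation
# columns, drop the cell that had no observation, and restore the row order
long_again = (wide.reset_index()
                  .melt(id_vars='patient', var_name='obs', value_name='score')
                  .dropna(subset=['score'])
                  .sort_values(['patient', 'obs'])
                  .reset_index(drop=True))

print(long_again)
```

Note that the scores come back as floats, because the wide table needed `NaN` for the missing third observation of patient 2.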

在數據幀中小寫列名

# 導入模塊 import pandas as pd# 設置 ipython 的最大行顯示 pd.set_option('display.max_row', 1000)# 設置 ipython 的最大列寬 pd.set_option('display.max_columns', 50)# 創建示例數據幀 data = {'NAME': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'YEAR': [2012, 2012, 2013, 2014, 2014], 'REPORTS': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df NAMEREPORTSYEAR
CochiceJason42012
PimaMolly242012
Santa CruzTina312013
MaricopaJake22014
YumaAmy32014
# 小寫列名稱 # Map the lowering function to all column names df.columns = map(str.lower, df.columns)df namereportsyear
CochiceJason42012
PimaMolly242012
Santa CruzTina312013
MaricopaJake22014
YumaAmy32014

使用函數創建新列

# 導入模塊 import pandas as pd# 示例數據幀 raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]} df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore']) df regimentcompanynamepreTestScorepostTestScore
0Nighthawks1stMiller425
1Nighthawks1stJacobson2494
2Nighthawks2ndAli3157
3Nighthawks2ndMilner262
4Dragoons1stCooze370
5Dragoons1stJacon425
6Dragoons2ndRyaner2494
7Dragoons2ndSone3157
8Scouts1stSloan262
9Scouts1stPiger370
10Scouts2ndRiani262
11Scouts2ndAli370
# 創建一個接受兩個輸入,pre 和 post 的函數 def pre_post_difference(pre, post):# 返回二者的差return post - pre# 創建一個變量,它是函數的輸出 df['score_change'] = pre_post_difference(df['preTestScore'], df['postTestScore'])# 查看數據幀 df regimentcompanynamepreTestScorepostTestScorescore_change
0Nighthawks1stMiller42521
1Nighthawks1stJacobson249470
2Nighthawks2ndAli315726
3Nighthawks2ndMilner26260
4Dragoons1stCooze37067
5Dragoons1stJacon42521
6Dragoons2ndRyaner249470
7Dragoons2ndSone315726
8Scouts1stSloan26260
9Scouts1stPiger37067
10Scouts2ndRiani26260
11Scouts2ndAli37067
# 創建一個接受一個輸入 x 的函數 def score_multipler_2x_and_3x(x):# 返回兩個東西,2x 和 3xreturn x*2, x*3# 創建兩個新變量,它是函數的兩個輸出 df['post_score_x2'], df['post_score_x3'] = zip(*df['postTestScore'].map(score_multipler_2x_and_3x)) df regimentcompanynamepreTestScorepostTestScorescore_changepost_score_x2post_score_x3
0Nighthawks1stMiller425215075
1Nighthawks1stJacobson249470188282
2Nighthawks2ndAli315726114171
3Nighthawks2ndMilner26260124186
4Dragoons1stCooze37067140210
5Dragoons1stJacon425215075
6Dragoons2ndRyaner249470188282
7Dragoons2ndSone315726114171
8Scouts1stSloan26260124186
9Scouts1stPiger37067140210
10Scouts2ndRiani26260124186
11Scouts2ndAli37067140210

將外部值映射為數據幀的值

# 導入模塊 import pandas as pdraw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city']) df first_namelast_nameagecity
0JasonMiller42San Francisco
1MollyJacobson52Baltimore
2TinaAli36Miami
3JakeMilner24Douglas
4AmyCooze73Boston
# 創建值的字典 city_to_state = { 'San Francisco' : 'California', 'Baltimore' : 'Maryland', 'Miami' : 'Florida', 'Douglas' : 'Arizona', 'Boston' : 'Massachusetts'}df['state'] = df['city'].map(city_to_state) df first_namelast_nameagecitystate
0JasonMiller42San FranciscoCalifornia
1MollyJacobson52BaltimoreMaryland
2TinaAli36MiamiFlorida
3JakeMilner24DouglasArizona
4AmyCooze73BostonMassachusetts

數據幀中的缺失數據

# 導入模塊 import pandas as pd import numpy as npraw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 'age': [42, np.nan, 36, 24, 73], 'sex': ['m', np.nan, 'f', 'm', 'f'], 'preTestScore': [4, np.nan, np.nan, 2, 3],'postTestScore': [25, np.nan, np.nan, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore']) df first_namelast_nameagesexpreTestScorepostTestScore
0JasonMiller42.0m4.025.0
1NaNNaNNaNNaNNaNNaN
2TinaAli36.0fNaNNaN
3JakeMilner24.0m2.062.0
4AmyCooze73.0f3.070.0
# 丟棄缺失值 df_no_missing = df.dropna() df_no_missing first_namelast_nameagesexpreTestScorepostTestScore
0JasonMiller42.0m4.025.0
3JakeMilner24.0m2.062.0
4AmyCooze73.0f3.070.0

# 刪除所有單元格為 NA 的行 df_cleaned = df.dropna(how='all') df_cleaned first_namelast_nameagesexpreTestScorepostTestScore
0JasonMiller42.0m4.025.0
2TinaAli36.0fNaNNaN
3JakeMilner24.0m2.062.0
4AmyCooze73.0f3.070.0
# 創建一個缺失值填充的新列 df['location'] = np.nan df first_namelast_nameagesexpreTestScorepostTestScorelocation
0JasonMiller42.0m4.025.0NaN
1NaNNaNNaNNaNNaNNaNNaN
2TinaAli36.0fNaNNaNNaN
3JakeMilner24.0m2.062.0NaN
4AmyCooze73.0f3.070.0NaN
# 如果列僅包含缺失值,刪除列 df.dropna(axis=1, how='all') first_namelast_nameagesexpreTestScorepostTestScore
0JasonMiller42.0m4.025.0
1NaNNaNNaNNaNNaNNaN
2TinaAli36.0fNaNNaN
3JakeMilner24.0m2.062.0
4AmyCooze73.0f3.070.0
# 刪除少于五個觀測值的行 # 這對時間序列來說非常有用 df.dropna(thresh=5) first_namelast_nameagesexpreTestScorepostTestScorelocation
0JasonMiller42.0m4.025.0NaN
3JakeMilner24.0m2.062.0NaN
4AmyCooze73.0f3.070.0NaN
# 用零填充缺失數據 df.fillna(0) first_namelast_nameagesexpreTestScorepostTestScorelocation
0JasonMiller42.0m4.025.00.0
1000.000.00.00.0
2TinaAli36.0f0.00.00.0
3JakeMilner24.0m2.062.00.0
4AmyCooze73.0f3.070.00.0
# Fill the missing values in preTestScore with the mean of preTestScore
# Assigning the result back saves the change in df
# (calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas)
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())
df

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN

# Fill the missing values in postTestScore with the mean postTestScore of each sex
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))
df

  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0           70.0       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN
# 選擇年齡不是 NaN 且性別不是 NaN 的行 df[df['age'].notnull() & df['sex'].notnull()] first_namelast_nameagesexpreTestScorepostTestScorelocation
0JasonMiller42.0m4.025.0NaN
2TinaAli36.0f3.070.0NaN
3JakeMilner24.0m2.062.0NaN
4AmyCooze73.0f3.070.0NaN

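Moving Averages in pandas

# Import modules
import pandas as pd

# Create data
data = {'score': [1, 1, 1, 2, 2, 2, 3, 3, 3]}

# Create the DataFrame
df = pd.DataFrame(data)

# View the DataFrame
df

   score
0      1
1      1
2      1
3      2
4      2
5      2
6      3
7      3
8      3

# Calculate the moving average: take the first two values and average them,
# then drop the first value, add the third, and so on.
df.rolling(window=2).mean()

   score
0    NaN
1    1.0
2    1.0
3    1.5
4    2.0
5    2.0
6    2.5
7    3.0
8    3.0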
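
The first row of the result is `NaN` because a window of two values is not yet full there. If a partial window is acceptable, `rolling` takes a `min_periods` argument; the sketch below is an illustration added here, not part of the original recipe.

```python
import pandas as pd

df = pd.DataFrame({'score': [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# min_periods=1 lets the first window average over the single value available
rolled = df['score'].rolling(window=2, min_periods=1).mean()

print(rolled.tolist())
```

Whether a partial window is meaningful depends on the analysis; for a strict two-point average, the default behaviour shown above is the safer choice.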

Normalizing a Column

# Import required modules
import pandas as pd
from sklearn import preprocessing

# Show plots inline (IPython/Jupyter magic)
%matplotlib inline

# Create an example DataFrame with an unnormalized column
data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
df = pd.DataFrame(data)
df

   score
0    234
1     24
2     14
3     27
4    -74
5     46
6     73
7    -18
8     59
9    160

# View the unnormalized data
df['score'].plot(kind='bar')

# <matplotlib.axes._subplots.AxesSubplot at 0x11b9c88d0>

# Create x, holding the values of the score column as floats
x = df[['score']].values.astype(float)

# Create a min-max scaler object
min_max_scaler = preprocessing.MinMaxScaler()

# Fit the scaler to the data and transform it
x_scaled = min_max_scaler.fit_transform(x)

# Put the normalized values into a DataFrame
df_normalized = pd.DataFrame(x_scaled)

# View the DataFrame
df_normalized

          0
0  1.000000
1  0.318182
2  0.285714
3  0.327922
4  0.000000
5  0.389610
6  0.477273
7  0.181818
8  0.431818
9  0.759740

# Plot the DataFrame
df_normalized.plot(kind='bar')

# <matplotlib.axes._subplots.AxesSubplot at 0x11ba31c50>
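
Min-max scaling maps each value to (x - min) / (max - min), so the same numbers can be reproduced with plain pandas arithmetic. The sketch below is an equivalent shown for illustration, not the method used in the original notes; the column name `score_scaled` is made up here.

```python
import pandas as pd

df = pd.DataFrame({'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]})

# Work in floats so the division is exact for the whole column
score = df['score'].astype(float)

# (x - min) / (max - min) rescales the column to the range [0, 1]
df['score_scaled'] = (score - score.min()) / (score.max() - score.min())

print(df)
```

The scikit-learn scaler is still the better choice when the same scaling has to be re-applied to new data, since it remembers the fitted minimum and maximum.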

Pandas 中的級聯表

# 導入模塊 import pandas as pdraw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]} df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'TestScore']) df regimentcompanyTestScore
0Nighthawks1st4
1Nighthawks1st24
2Nighthawks2nd31
3Nighthawks2nd2
4Dragoons1st3
5Dragoons1st4
6Dragoons2nd24
7Dragoons2nd31
8Scouts1st2
9Scouts1st3
10Scouts2nd2
11Scouts2nd3
# 按公司和團隊創建分組均值的透視表 pd.pivot_table(df, index=['regiment','company'], aggfunc='mean') TestScore
regimentcompany
Dragoons1st3.5
2nd27.5
Nighthawks1st14.0
2nd16.5
Scouts1st2.5
2nd2.5
# 按公司和團隊創建分組計數的透視表 df.pivot_table(index=['regiment','company'], aggfunc='count') TestScore
regimentcompany
Dragoons1st2
2nd2
Nighthawks1st2
2nd2
Scouts1st2
2nd2

Quickly Changing String Columns in Pandas

I often need or want to change the case of every item in a column of strings (e.g. BRAZIL to Brazil). There are many ways to do this, but I have settled on this as the easiest and quickest.

# Import pandas
import pandas as pd

# Create a Series of names
first_names = pd.Series(['Steve Murrey', 'Jane Fonda', 'Sara McGully', 'Mary Jane'])

# Print the column
first_names

'''
0    Steve Murrey
1      Jane Fonda
2    Sara McGully
3       Mary Jane
dtype: object
'''

# Print the column in lowercase
first_names.str.lower()

'''
0    steve murrey
1      jane fonda
2    sara mcgully
3       mary jane
dtype: object
'''

# Print the column in uppercase
first_names.str.upper()

'''
0    STEVE MURREY
1      JANE FONDA
2    SARA MCGULLY
3       MARY JANE
dtype: object
'''

# Print the column in title case
first_names.str.title()

'''
0    Steve Murrey
1      Jane Fonda
2    Sara Mcgully
3       Mary Jane
dtype: object
'''

# Print the column split on spaces
first_names.str.split(" ")

'''
0    [Steve, Murrey]
1      [Jane, Fonda]
2    [Sara, McGully]
3       [Mary, Jane]
dtype: object
'''

# Print the column with only the first letter capitalized
first_names.str.capitalize()

'''
0    Steve murrey
1      Jane fonda
2    Sara mcgully
3       Mary jane
dtype: object
'''

You get the idea. Many more string methods are available through the .str accessor.
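
The `.str` methods return a new Series, so they can be chained. A small sketch for the BRAZIL-to-Brazil case mentioned above, using made-up country names with stray whitespace:

```python
import pandas as pd

# Hypothetical messy input: mixed case and stray spaces
countries = pd.Series(['BRAZIL', ' france', 'south AFRICA '])

# Strip surrounding whitespace first, then convert to title case
cleaned = countries.str.strip().str.title()

print(cleaned.tolist())
```

As the `Sara Mcgully` output above shows, `title()` lowercases everything after the first letter of each word, so names with internal capitals need separate handling.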

Randomly Sampling a DataFrame

# Import modules
import pandas as pd
import numpy as np

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70

# Select a random subset of size 2 without replacement
df.take(np.random.permutation(len(df))[:2])

  first_name last_name  age  preTestScore  postTestScore
1      Molly  Jacobson   52            24             94
4        Amy     Cooze   73             3             70
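
pandas also offers `DataFrame.sample`, which does the same job in one call and accepts a seed. This is an alternative to the permutation approach above rather than the original author's method; the seed value `0` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'age': [42, 52, 36, 24, 73]})

# Draw 2 rows without replacement; random_state makes the draw repeatable
sampled = df.sample(n=2, random_state=0)

print(sampled)
```

Passing `frac=0.4` instead of `n=2` would draw the same share of rows from a frame of any size.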

對數據幀的行排名

# 導入模塊 import pandas as pd# 創建數據幀 data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3],'coverage': [25, 94, 57, 62, 70]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df coveragenamereportsyear
Cochice25Jason42012
Pima94Molly242012
Santa Cruz57Tina312013
Maricopa62Jake22014
Yuma70Amy32014

5 rows × 4 columns

# Create a new column holding the ascending rank of the coverage values
df['coverageRanked'] = df['coverage'].rank(ascending=True)
df

            coverage   name  reports  year  coverageRanked
Cochice           25  Jason        4  2012               1
Pima              94  Molly       24  2012               5
Santa Cruz        57   Tina       31  2013               2
Maricopa          62   Jake        2  2014               3
Yuma              70    Amy        3  2014               4

5 rows × 5 columns

Regular Expression Basics

# Import the regex package
import re
import sys

text = 'The quick brown fox jumped over the lazy black bear.'

# Use a raw string so the backslash reaches the regex engine unchanged
three_letter_word = r'\w{3}'

pattern_re = re.compile(three_letter_word); pattern_re

# re.compile(r'\w{3}', re.UNICODE)

re_search = re.search('..own', text)

if re_search:
    # Print the search result
    print(re_search.group())

# brown

re.match

re.match() only matches at the beginning of a string or the whole string. For anything else, use re.search().
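
The difference is easiest to see side by side. The sketch below reuses the example sentence and adds `re.fullmatch`, which is not discussed in the original notes but is the standard-library call for requiring the whole string to match.

```python
import re

text = 'The quick brown fox jumped over the lazy black bear.'

# re.match only looks at the start of the string
at_start = re.match('The', text)         # match object
not_at_start = re.match('quick', text)   # None: 'quick' is not at position 0

# re.search scans the whole string for the first occurrence
anywhere = re.search('quick', text)      # match object starting at index 4

# re.fullmatch succeeds only if the pattern covers the entire string
whole = re.fullmatch('The', text)        # None: the text continues after 'The'

print(at_start.group(), not_at_start, anywhere.start(), whole)
```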

# Try to match the pattern '..own' at the start of the text
re_match = re.match('..own', text)

if re_match:
    # Print the match
    print(re_match.group())
else:
    # Otherwise print this
    print('No matches')

# No matches

re.split

# Split the string using 'e' as the delimiter.
re_split = re.split('e', text); re_split

# ['Th', ' quick brown fox jump', 'd ov', 'r th', ' lazy black b', 'ar.']

re.sub

Replaces a regex pattern with something else. The 3 is the maximum number of substitutions to make.

# Replace the first three instances of 'e' with 'E', then print the result
re_sub = re.sub('e', 'E', text, count=3); print(re_sub)

# ThE quick brown fox jumpEd ovEr the lazy black bear.
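
To find out how many substitutions were actually made, the standard library also provides `re.subn`, which returns the new string together with the count. A short sketch on the same sentence (not part of the original notes):

```python
import re

text = 'The quick brown fox jumped over the lazy black bear.'

# Limit the number of replacements with the count keyword
replaced, n = re.subn('e', 'E', text, count=3)

# Without count, every occurrence is replaced
all_replaced, n_all = re.subn('e', 'E', text)

print(replaced, n)
print(all_replaced, n_all)
```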

Regular Expression Examples

# Import regex
import re

# Create some data
text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'

re.findall('^A', text)  # ['A']
re.findall('bears.$', text)  # ['bears.']
re.findall('f..es', text)  # ['foxes']

# Find all lowercase vowels
re.findall('[aeiou]', text)
# ['o', 'o', 'u', 'i', 'o', 'o', 'e', 'u', 'e', 'o', 'e', 'a', 'o', 'e', 'a']

# Find all characters that are not lowercase vowels
re.findall('[^aeiou]', text)
'''
['A',' ','f','l','c','k',' ','f',' ','1','2','0',' ','q','c','k',' ','b','r','w','n',' ','f','x','s',' ','j','m','p','d',' ','v','r',' ','3','0',' ','l','z','y',' ','b','r','w','n',',',' ','b','r','s','.']
'''

re.findall('a|A', text)  # ['A', 'a', 'a']

# Find any instance of 'foxes'
re.findall('(foxes)', text)  # ['foxes']

# Find every run of five word characters (note 'jumpe', taken from 'jumped')
re.findall(r'\w\w\w\w\w', text)
# ['flock', 'quick', 'brown', 'foxes', 'jumpe', 'brown', 'bears']

re.findall(r'\W\W', text)  # [', ']
re.findall(r'\s', text)  # [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

re.findall(r'\S\S', text)
'''
['fl','oc','of','12','qu','ic','br','ow','fo','xe','ju','mp','ed','ov','er','30','la','zy','br','ow','n,','be','ar','s.']
'''

re.findall(r'\d\d\d', text)  # ['120']

re.findall(r'\D\D\D\D\D', text)
'''
['A flo','ck of',' quic','k bro','wn fo','xes j','umped',' over',' lazy',' brow','n, be']
'''

re.findall(r'\AA', text)  # ['A']
re.findall(r'bears.\Z', text)  # ['bears.']

# A raw string is required here: in a normal string '\b' is a backspace
# character, so the pattern would match nothing and return []
re.findall(r'\b[foxes]', text)  # ['f', 'o', 'f', 'o']

re.findall(r'\n', text)  # []
re.findall('[Ff]oxes', 'foxes Foxes Doxes')  # ['foxes', 'Foxes']
re.findall('[a-z]', 'foxes Foxes')  # ['f', 'o', 'x', 'e', 's', 'o', 'x', 'e', 's']
re.findall('[A-Z]', 'foxes Foxes')  # ['F']
re.findall('[a-zA-Z0-9]', 'foxes Foxes')  # ['f', 'o', 'x', 'e', 's', 'F', 'o', 'x', 'e', 's']
re.findall('[^aeiou]', 'foxes Foxes')  # ['f', 'x', 's', ' ', 'F', 'x', 's']
re.findall('[^0-9]', 'foxes Foxes')  # ['f', 'o', 'x', 'e', 's', ' ', 'F', 'o', 'x', 'e', 's']
re.findall('foxes?', 'foxes Foxes')  # ['foxes']
re.findall('ox*', 'foxes Foxes')  # ['ox', 'ox']
re.findall('ox+', 'foxes Foxes')  # ['ox', 'ox']
re.findall(r'\d{3}', text)  # ['120']
re.findall(r'\d{2,}', text)  # ['120', '30']
re.findall(r'\d{2,3}', text)  # ['120', '30']
re.findall('bears(?=.)', text)  # ['bears']
re.findall('foxes(?!!)', 'foxes foxes!')  # ['foxes']
re.findall('foxes|foxes!', 'foxes foxes!')  # ['foxes', 'foxes']
re.findall('fox(es!)', 'foxes foxes!')  # ['es!']
re.findall('foxes(!)', 'foxes foxes!')  # ['!']

重索引序列和數據幀

# 導入模塊 import pandas as pd import numpy as np# 創建亞利桑那州南部的火災風險序列 brushFireRisk = pd.Series([34, 23, 12, 23], index = ['Bisbee', 'Douglas', 'Sierra Vista', 'Tombstone']) brushFireRisk''' Bisbee 34 Douglas 23 Sierra Vista 12 Tombstone 23 dtype: int64 '''# 重索引這個序列并創建一個新的序列變量 brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson']) brushFireRiskReindexed''' Tombstone 23.0 Douglas 23.0 Bisbee 34.0 Sierra Vista 12.0 Barley NaN Tucson NaN dtype: float64 '''# 重索引序列并在任何缺失的索引處填入 0 brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson'], fill_value = 0) brushFireRiskReindexed''' Tombstone 23 Douglas 23 Bisbee 34 Sierra Vista 12 Barley 0 Tucson 0 dtype: int64 '''# 創建數據幀 data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data) df countyreportsyear
0Cochice42012
1Pima242012
2Santa Cruz312013
3Maricopa22014
4Yuma32014
# 更改行的順序(索引) df.reindex([4, 3, 2, 1, 0]) countyreportsyear
4Yuma32014
3Maricopa22014
2Santa Cruz312013
1Pima242012
0Cochice42012
# 更改列的順序(索引) columnsTitles = ['year', 'reports', 'county'] df.reindex(columns=columnsTitles) yearreportscounty
020124Cochice
1201224Pima
2201331Santa Cruz
320142Maricopa
420143Yuma

重命名列標題

來自 StackOverflow 上的 rgalbo。

# 導入所需模塊 import pandas as pd# 創建列表的字典,作為值 raw_data = {'0': ['first_name', 'Molly', 'Tina', 'Jake', 'Amy'], '1': ['last_name', 'Jacobson', 'Ali', 'Milner', 'Cooze'], '2': ['age', 52, 36, 24, 73], '3': ['preTestScore', 24, 31, 2, 3]}# 創建數據幀 df = pd.DataFrame(raw_data)# 查看數據幀 df 0123
0first_namelast_nameagepreTestScore
1MollyJacobson5224
2TinaAli3631
3JakeMilner242
4AmyCooze733
# 從數據集的第一行創建一個名為 header 的新變量 header = df.iloc[0]''' 0 first_name 1 last_name 2 age 3 preTestScore Name: 0, dtype: object '''# 將數據幀替換為不包含第一行的新數據幀 df = df[1:]# 使用標題變量重命名數據幀的列值 df.rename(columns = header) first_namelast_nameagepreTestScore
1MollyJacobson5224
2TinaAli3631
3JakeMilner242
4AmyCooze733

重命名多個數據幀的列名

# 導入模塊 import pandas as pd# 設置 ipython 的最大行顯示 pd.set_option('display.max_row', 1000)# 設置 ipython 的最大列寬 pd.set_option('display.max_columns', 50)# 創建示例數據幀 data = {'Commander': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'Date': ['2012, 02, 08', '2012, 02, 08', '2012, 02, 08', '2012, 02, 08', '2012, 02, 08'], 'Score': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df CommanderDateScore
CochiceJason2012, 02, 084
PimaMolly2012, 02, 0824
Santa CruzTina2012, 02, 0831
MaricopaJake2012, 02, 082
YumaAmy2012, 02, 083
# 重命名列名 df.columns = ['Leader', 'Time', 'Score']df LeaderTimeScore
CochiceJason2012, 02, 084
PimaMolly2012, 02, 0824
Santa CruzTina2012, 02, 0831
MaricopaJake2012, 02, 082
YumaAmy2012, 02, 083
df.rename(columns={'Leader': 'Commander'}, inplace=True)df CommanderTimeScore
CochiceJason2012, 02, 084
PimaMolly2012, 02, 0824
Santa CruzTina2012, 02, 0831
MaricopaJake2012, 02, 082
YumaAmy2012, 02, 083

替換值

# 導入模塊 import pandas as pd import numpy as npraw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [-999, -999, -999, 2, 1],'postTestScore': [2, 2, -999, 2, -999]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df first_namelast_nameagepreTestScorepostTestScore
0JasonMiller42-9992
1MollyJacobson52-9992
2TinaAli36-999-999
3JakeMilner2422
4AmyCooze731-999
# 將所有 -999 替換為 NAN df.replace(-999, np.nan) first_namelast_nameagepreTestScorepostTestScore
0JasonMiller42NaN2.0
1MollyJacobson52NaN2.0
2TinaAli36NaNNaN
3JakeMilner242.02.0
4AmyCooze731.0NaN

將數據幀保存為 CSV

# 導入模塊 import pandas as pdraw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, 2, 3],'postTestScore': [25, 94, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df first_namelast_nameagepreTestScorepostTestScore
0JasonMiller42425
1MollyJacobson522494
2TinaAli363157
3JakeMilner24262
4AmyCooze73370

將名為df的數據幀保存為 csv。

df.to_csv('example.csv')

在列中搜索某個值

# 導入模塊 import pandas as pdraw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'], 'age': [42, 42, 36, 24, 73], 'preTestScore': [4, 4, 31, 2, 3],'postTestScore': [25, 25, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) df first_namelast_nameagepreTestScorepostTestScore
0JasonMiller42425
1JasonMiller42425
2TinaAli363157
3JakeMilner24262
4AmyCooze73370
# 在列中尋找值在哪里 # 查看 postTestscore 大于 50 的地方 df['preTestScore'].where(df['postTestScore'] > 50)''' 0 NaN 1 NaN 2 31.0 3 2.0 4 3.0 Name: preTestScore, dtype: float64 '''

選擇包含特定值的行和列

# 導入模塊 import pandas as pd# 設置 ipython 的最大行顯示 pd.set_option('display.max_row', 1000)# 設置 ipython 的最大列寬 pd.set_option('display.max_columns', 50)# 創建示例數據幀 data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]} df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']) df namereportsyear
CochiceJason42012
PimaMolly242012
Santa CruzTina312013
MaricopaJake22014
YumaAmy32014
# 按照列值抓取行 value_list = ['Tina', 'Molly', 'Jason']df[df.name.isin(value_list)] namereportsyear
CochiceJason42012
PimaMolly242012
Santa CruzTina312013
# 獲取列值不是某個值的行 df[~df.name.isin(value_list)] namereportsyear
MaricopaJake22014
YumaAmy32014

選擇具有特定值的行

import pandas as pd# 創建示例數據幀 data = {'name': ['Jason', 'Molly'], 'country': [['Syria', 'Lebanon'],['Spain', 'Morocco']]} df = pd.DataFrame(data) df countryname
0[Syria, Lebanon]Jason
1[Spain, Morocco]Molly
df[df['country'].map(lambda country: 'Syria' in country)] countryname
0[Syria, Lebanon]Jason

使用多個過濾器選擇行

import pandas as pd# 創建示例數據幀 data = {'name': ['A', 'B', 'C', 'D', 'E'], 'score': [1,2,3,4,5]} df = pd.DataFrame(data) df namescore
0A1
1B2
2C3
3D4
4E5
# 選擇數據幀的行,其中 df.score 大于 1 且小于 5 df[(df['score'] > 1) & (df['score'] < 5)] namescore
1B2
2C3
3D4

根據條件選擇數據幀的行

# 導入模塊 import pandas as pd import numpy as np# 創建數據幀 raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan], 'nationality': ['USA', 'USA', 'France', 'UK', 'UK'], 'age': [42, 52, 36, 24, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'nationality', 'age']) df first_namenationalityage
0JasonUSA42
1MollyUSA52
2NaNFrance36
3NaNUK24
4NaNUK70
# 方法 1:使用布爾變量 # 如果國籍是美國,則變量為 TRUE american = df['nationality'] == "USA"# 如果年齡大于 50,則變量為 TRUE elderly = df['age'] > 50# 選擇所有國籍為美國且年齡大于 50 的案例 df[american & elderly] first_namenationalityage
1MollyUSA52
# 方法 2:使用變量屬性 # 選擇所有不缺少名字且國籍為美國的案例 df[df['first_name'].notnull() & (df['nationality'] == "USA")] first_namenationalityage
0JasonUSA42
1MollyUSA52

Simple example DataFrames

# Import modules
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70

# Create a second DataFrame
raw_data_2 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
              'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
              'age': [53, 26, 72, 73, 24],
              'preTestScore': [13, 52, 72, 26, 26],
              'postTestScore': [82, 52, 56, 234, 254]}
df_2 = pd.DataFrame(raw_data_2, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df_2

  first_name last_name  age  preTestScore  postTestScore
0      Sarah    Mornig   53            13             82
1    Gueniva     Jaker   26            52             52
2       Know      Alom   72            72             56
3       Sara     Ormon   73            26            234
4        Cat    Koozer   24            26            254

# Create a third DataFrame
raw_data_3 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
              'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
              'postTestScore_2': [82, 52, 56, 234, 254]}
df_3 = pd.DataFrame(raw_data_3, columns=['first_name', 'last_name', 'postTestScore_2'])
df_3

  first_name last_name  postTestScore_2
0      Sarah    Mornig               82
1    Gueniva     Jaker               52
2       Know      Alom               56
3       Sara     Ormon              234
4        Cat    Koozer              254

Sorting the rows of a DataFrame

# Import modules
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [1, 2, 1, 2, 3],
        'coverage': [2, 2, 3, 3, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

            coverage   name  reports  year
Cochice            2  Jason        1  2012
Pima               2  Molly        2  2012
Santa Cruz         3   Tina        1  2013
Maricopa           3   Jake        2  2014
Yuma               3    Amy        3  2014

# Sort the rows by reports, in descending order
df.sort_values(by='reports', ascending=False)

            coverage   name  reports  year
Yuma               3    Amy        3  2014
Pima               2  Molly        2  2012
Maricopa           3   Jake        2  2014
Cochice            2  Jason        1  2012
Santa Cruz         3   Tina        1  2013

# Sort the rows by coverage and then by reports, in ascending order
df.sort_values(by=['coverage', 'reports'])

            coverage   name  reports  year
Cochice            2  Jason        1  2012
Pima               2  Molly        2  2012
Santa Cruz         3   Tina        1  2013
Maricopa           3   Jake        2  2014
Yuma               3    Amy        3  2014
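
When sorting by several columns, `ascending` can also be a list with one flag per column. The sketch below (with `sorted_df` as an illustrative name) sorts by coverage in descending order and breaks ties by reports in ascending order.

```python
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [1, 2, 1, 2, 3],
        'coverage': [2, 2, 3, 3, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# One ascending flag per sort key: coverage descending, reports ascending
sorted_df = df.sort_values(by=['coverage', 'reports'], ascending=[False, True])

print(sorted_df)
```

`sort_values` returns a new DataFrame, so `df` itself keeps its original row order.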

Splitting a latitude/longitude variable into separate variables

import pandas as pd
import numpy as np

raw_data = {'geo': ['40.0024, -105.4102', '40.0068, -105.266', '39.9318, -105.2813', np.nan]}
df = pd.DataFrame(raw_data, columns=['geo'])
df

                  geo
0  40.0024, -105.4102
1   40.0068, -105.266
2  39.9318, -105.2813
3                 NaN

# Create two lists to hold the results of the loop
lat = []
lon = []

# For each row in the variable
for row in df['geo']:
    try:
        # Split the row on the comma and convert both parts to float:
        # the part before the comma is the latitude, the part after it the longitude
        lat_value, lon_value = (float(part) for part in row.split(','))
    except (AttributeError, ValueError):
        # A missing (NaN) or malformed row gets missing values in both lists
        lat_value, lon_value = np.nan, np.nan
    lat.append(lat_value)
    lon.append(lon_value)

# Create two new columns from lat and lon
df['latitude'] = lat
df['longitude'] = lon
df

                  geo  latitude  longitude
0  40.0024, -105.4102   40.0024  -105.4102
1   40.0068, -105.266   40.0068  -105.2660
2  39.9318, -105.2813   39.9318  -105.2813
3                 NaN       NaN        NaN
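
The loop can also be replaced by pandas string methods. This is a sketch of one alternative, not the approach used in the original notes: `str.split(..., expand=True)` returns one column per piece and leaves missing rows as NaN. The intermediate name `parts` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'geo': ['40.0024, -105.4102', '40.0068, -105.266',
                           '39.9318, -105.2813', np.nan]})

# Split on the comma; expand=True gives a DataFrame with columns 0 and 1
parts = df['geo'].str.split(',', expand=True)

# Strip the space left after the comma, then convert the text to float
df['latitude'] = parts[0].str.strip().astype(float)
df['longitude'] = parts[1].str.strip().astype(float)

print(df)
```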

Data pipelines

# Create some raw data
raw_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Define a generator that yields input + 6
def add_6(numbers):
    for x in numbers:
        output = x + 6
        yield output

# Define a generator that yields input - 2
def subtract_2(numbers):
    for x in numbers:
        output = x - 2
        yield output

# Define a generator that yields input * 100
def multiply_by_100(numbers):
    for x in numbers:
        output = x * 100
        yield output

# First step of the pipeline
step1 = add_6(raw_data)

# Second step of the pipeline
step2 = subtract_2(step1)

# Third step of the pipeline
pipeline = multiply_by_100(step2)

# Result for the first element of the raw data
next(pipeline)
# 500

# Result for the second element of the raw data
next(pipeline)
# 600

# Process all the remaining data
for value in pipeline:
    print(value)

'''
700
800
900
1000
1100
1200
1300
1400
'''
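
Chaining the steps by hand gets repetitive as a pipeline grows. A minimal sketch of a helper that threads the data through any number of generator steps is shown here; `run_pipeline` is a hypothetical name introduced for illustration, not part of the original notes.

```python
from functools import reduce

def add_6(numbers):
    for x in numbers:
        yield x + 6

def subtract_2(numbers):
    for x in numbers:
        yield x - 2

def multiply_by_100(numbers):
    for x in numbers:
        yield x * 100

def run_pipeline(data, *steps):
    """Feed data through each generator step in order (hypothetical helper)."""
    return reduce(lambda stream, step: step(stream), steps, data)

pipeline = run_pipeline(range(1, 11), add_6, subtract_2, multiply_by_100)

# Nothing is computed until values are requested
first_two = [next(pipeline), next(pipeline)]

# The generator remembers its position, so this picks up at the third element
rest = list(pipeline)

print(first_two, rest)
```

Because every step is lazy, only one element is in flight at a time, and a generator that has been consumed yields nothing more.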

String munging in a DataFrame

# Import modules
import pandas as pd
import numpy as np
import re

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.nan, 'battler@milner.com', 'Ames1234@yahoo.com'],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df

  first_name last_name                email  preTestScore  postTestScore
0      Jason    Miller     jas203@gmail.com             4             25
1      Molly  Jacobson  momomolly@gmail.com            24             94
2       Tina       Ali                  NaN            31             57
3       Jake    Milner   battler@milner.com             2             62
4        Amy     Cooze   Ames1234@yahoo.com             3             70

# Which strings in the email column contain 'gmail'
df['email'].str.contains('gmail')

'''
0     True
1     True
2      NaN
3    False
4    False
Name: email, dtype: object
'''

# Three capture groups: user name, domain, top-level domain
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Find every match of the pattern in each string
df['email'].str.findall(pattern, flags=re.IGNORECASE)

'''
0       [(jas203, gmail, com)]
1    [(momomolly, gmail, com)]
2                          NaN
3     [(battler, milner, com)]
4     [(Ames1234, yahoo, com)]
Name: email, dtype: object
'''

# Pull the capture groups into columns.
# The original notes used str.match here, which returned the groups in old
# pandas versions; in current versions str.match returns booleans, so
# str.extract is used instead.
matches = df['email'].str.extract(pattern, flags=re.IGNORECASE)
matches

'''
           0       1    2
0     jas203   gmail  com
1  momomolly   gmail  com
2        NaN     NaN  NaN
3    battler  milner  com
4   Ames1234   yahoo  com
'''

# The second group is the domain
matches[1]

'''
0     gmail
1     gmail
2       NaN
3    milner
4     yahoo
Name: 1, dtype: object
'''
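
Named capture groups make the extracted columns self-describing. The sketch below applies the same pattern with `(?P<name>...)` groups to a few made-up addresses; the addresses and the group names `user`, `domain` and `tld` are assumptions for illustration only.

```python
import re

import numpy as np
import pandas as pd

# Made-up addresses, used only to illustrate the technique
emails = pd.Series(['alice@example.com', 'Bob.Smith@mail.example.org', np.nan])

# Same pattern as in the text, with a name attached to each group
pattern = r'(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<tld>[A-Z]{2,4})'

# str.extract returns one column per group, named after the group
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

print(parts)
```

Rows that are missing or do not match come back as NaN in every column, so the result lines up with the original Series.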

Using list comprehensions with pandas

# Import modules
import pandas as pd

# Set the maximum number of rows IPython displays
pd.set_option('display.max_rows', 1000)

# Set the maximum number of columns IPython displays
pd.set_option('display.max_columns', 50)

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

             name  reports  year
Cochice     Jason        4  2012
Pima        Molly       24  2012
Santa Cruz   Tina       31  2013
Maricopa     Jake        2  2014
Yuma          Amy        3  2014

As a loop.

# Create a variable
next_year = []

# For each row in df.year
for row in df['year']:
    # Add 1 to the row and append the result to next_year
    next_year.append(row + 1)

# Create df.next_year
df['next_year'] = next_year

# View the DataFrame
df

             name  reports  year  next_year
Cochice     Jason        4  2012       2013
Pima        Molly       24  2012       2013
Santa Cruz   Tina       31  2013       2014
Maricopa     Jake        2  2014       2015
Yuma          Amy        3  2014       2015

As a list comprehension.

# For each row in df.year, subtract 1 from the row
df['previous_year'] = [row - 1 for row in df['year']]
df

             name  reports  year  next_year  previous_year
Cochice     Jason        4  2012       2013           2011
Pima        Molly       24  2012       2013           2011
Santa Cruz   Tina       31  2013       2014           2012
Maricopa     Jake        2  2014       2015           2013
Yuma          Amy        3  2014       2015           2013
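
For plain arithmetic like this, pandas can also operate on the whole column at once, which is usually shorter and faster than either a loop or a comprehension. A comprehension is still handy when each value needs ordinary Python logic. The sketch below shows both; the `label` column is a made-up example.

```python
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# Vectorized arithmetic: applied to every element of the column at once
df['next_year'] = df['year'] + 1
df['previous_year'] = df['year'] - 1

# A comprehension over two columns, for per-row Python logic
df['label'] = [f'{name} ({year})' for name, year in zip(df['name'], df['year'])]

print(df)
```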

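Visualizing a DataFrame with Seaborn

import pandas as pd
%matplotlib inline
import random
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame()

# Draw 25 distinct random integers between 1 and 99 for each column
df['x'] = random.sample(range(1, 100), 25)
df['y'] = random.sample(range(1, 100), 25)

# The values are random, so the rows shown here differ from run to run
df.head()

    x   y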
01825
14267
25277
3434
41469
# Scatter plot (recent seaborn versions require x and y as keyword arguments)
sns.lmplot(x='x', y='y', data=df, fit_reg=False)

# Density plot
sns.kdeplot(df['y'])

# Two-dimensional density plot
sns.kdeplot(x=df['y'], y=df['x'])

# Histogram with a density curve
# (the original used sns.distplot, which is deprecated; histplot with kde=True is the closest replacement)
sns.histplot(df['x'], kde=True)

# Histogram with a rug plot
plt.hist(df['x'], alpha=.3)
sns.rugplot(df['x'])

# Box plot
sns.boxplot(data=df[['y', 'x']])

# Violin plot
sns.violinplot(data=df[['y', 'x']])

# Heatmap
sns.heatmap(df[['y', 'x']].T, annot=True, fmt="d")

# Cluster map
sns.clustermap(df)

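Pandas data structures

# Import modules
import pandas as pd

Series 101

A Series is a one-dimensional array (similar to a vector in R).

# Create a Series of the number of flooding reports
floodingReports = pd.Series([5, 6, 2, 9, 12])
floodingReports

'''
0     5
1     6
2     2
3     9
4    12
dtype: int64
'''

Note that the first column of numbers (0 to 4) is the index.

# Use the county names as the index of the floodingReports Series
floodingReports = pd.Series([5, 6, 2, 9, 12], index=['Cochise County', 'Pima County', 'Santa Cruz County', 'Maricopa County', 'Yuma County'])
floodingReports

'''
Cochise County        5
Pima County           6
Santa Cruz County     2
Maricopa County       9
Yuma County          12
dtype: int64
'''

# Look up a value by its index label
floodingReports['Cochise County']
# 5

# Keep the entries greater than 6
floodingReports[floodingReports > 6]

'''
Maricopa County     9
Yuma County        12
dtype: int64
'''

Creating a pandas Series from a dictionary.

Note: when you do this, the dictionary's keys become the Series index.

# Create a dictionary
fireReports_dict = {'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13, 'Maricopa County': 42, 'Yuma County': 52}

# Convert the dictionary to a pd.Series, then view it
fireReports = pd.Series(fireReports_dict)
fireReports

'''
Cochise County        12
Pima County          342
Santa Cruz County     13
Maricopa County       42
Yuma County           52
dtype: int64
'''

The output above assumes a current pandas version, which keeps the dictionary's insertion order. Older versions sorted the keys alphabetically, and in that case the positional index assignment below would attach the wrong labels to the values.

# Replace the index labels (assigned by position, so the order must match the data)
fireReports.index = ["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"]
fireReports

'''
Cochice        12
Pima          342
Santa Cruz     13
Maricopa       42
Yuma           52
dtype: int64
'''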
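
Assigning a new index by position only works when the new labels are listed in exactly the same order as the data. A safer sketch is to derive each new label from the old one with `Series.rename`, which accepts a function (or a dictionary) and never reorders the values. The names `fire_reports` and `short_names` are illustrative.

```python
import pandas as pd

fire_reports = pd.Series({'Cochise County': 12, 'Pima County': 342,
                          'Santa Cruz County': 13, 'Maricopa County': 42,
                          'Yuma County': 52})

# Build each new label from the old one, so values stay with their county
short_names = fire_reports.rename(lambda label: label.replace(' County', ''))

print(short_names)
```

Because every label is computed from its own old label, the result does not depend on the order in which the Series happens to be stored.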

DataFrame 101

A DataFrame is like a data frame in R.

# Create a DataFrame from a dictionary of equal-length lists or NumPy arrays
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df

       county  reports  year
0     Cochice        4  2012
1        Pima       24  2012
2  Santa Cruz       31  2013
3    Maricopa        2  2014
4        Yuma        3  2014

# Use the columns parameter to set the order of the columns
dfColumnOrdered = pd.DataFrame(data, columns=['county', 'year', 'reports'])
dfColumnOrdered

       county  year  reports
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
3    Maricopa  2014        2
4        Yuma  2014        3

# Add a column
dfColumnOrdered['newsCoverage'] = pd.Series([42.3, 92.1, 12.2, 39.3, 30.2])
dfColumnOrdered

       county  year  reports  newsCoverage
0     Cochice  2012        4          42.3
1        Pima  2012       24          92.1
2  Santa Cruz  2013       31          12.2
3    Maricopa  2014        2          39.3
4        Yuma  2014        3          30.2

# Delete a column
del dfColumnOrdered['newsCoverage']
dfColumnOrdered

       county  year  reports
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
3    Maricopa  2014        2
4        Yuma  2014        3

# Transpose the DataFrame
dfColumnOrdered.T

               0     1           2         3     4
county   Cochice  Pima  Santa Cruz  Maricopa  Yuma
year        2012  2012        2013      2014  2014
reports        4    24          31         2     3
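
Adding a column by assignment and removing it with `del` both modify the DataFrame in place. When the original should stay untouched, `assign` and `drop` return new DataFrames instead. A minimal sketch, with illustrative result names:

```python
import pandas as pd

data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, columns=['county', 'year', 'reports'])

# assign() returns a copy with the extra column; df is left unchanged
with_coverage = df.assign(newsCoverage=[42.3, 92.1, 12.2, 39.3, 30.2])

# drop() returns a copy without the column
without_coverage = with_coverage.drop(columns='newsCoverage')

print(with_coverage)
```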

Pandas time series basics

# Import modules
from datetime import datetime
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as pyplot

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41]}
df = pd.DataFrame(data, columns=['date', 'battle_deaths'])
print(df)

'''
                         date  battle_deaths
0  2014-05-01 18:47:05.069722             34
1  2014-05-01 18:47:05.119994             25
2  2014-05-02 18:47:05.178768             26
3  2014-05-02 18:47:05.230071             15
4  2014-05-02 18:47:05.230071             15
5  2014-05-02 18:47:05.280592             14
6  2014-05-03 18:47:05.332662             26
7  2014-05-03 18:47:05.385109             25
8  2014-05-04 18:47:05.436523             62
9  2014-05-04 18:47:05.486877             41
'''

# Convert the date strings to datetimes
df['date'] = pd.to_datetime(df['date'])

# Use the dates as the index, then drop the date column
df.index = df['date']
del df['date']
df

                            battle_deaths
date
2014-05-01 18:47:05.069722             34
2014-05-01 18:47:05.119994             25
2014-05-02 18:47:05.178768             26
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.280592             14
2014-05-03 18:47:05.332662             26
2014-05-03 18:47:05.385109             25
2014-05-04 18:47:05.436523             62
2014-05-04 18:47:05.486877             41
# View all observations from 2014
# (the original wrote df['2014']; current pandas requires .loc for selecting rows by a date string)
df.loc['2014']

                            battle_deaths
date
2014-05-01 18:47:05.069722             34
2014-05-01 18:47:05.119994             25
2014-05-02 18:47:05.178768             26
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.280592             14
2014-05-03 18:47:05.332662             26
2014-05-03 18:47:05.385109             25
2014-05-04 18:47:05.436523             62
2014-05-04 18:47:05.486877             41

# View all observations from May 2014
df.loc['2014-05']

                            battle_deaths
date
2014-05-01 18:47:05.069722             34
2014-05-01 18:47:05.119994             25
2014-05-02 18:47:05.178768             26
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.280592             14
2014-05-03 18:47:05.332662             26
2014-05-03 18:47:05.385109             25
2014-05-04 18:47:05.436523             62
2014-05-04 18:47:05.486877             41

# View all observations on or after May 3, 2014
df.loc[datetime(2014, 5, 3):]

                            battle_deaths
date
2014-05-03 18:47:05.332662             26
2014-05-03 18:47:05.385109             25
2014-05-04 18:47:05.436523             62
2014-05-04 18:47:05.486877             41

Observations between May 3rd and May 4th

# View all observations from May 3 through May 4, 2014
df.loc['5/3/2014':'5/4/2014']

                            battle_deaths
date
2014-05-03 18:47:05.332662             26
2014-05-03 18:47:05.385109             25
2014-05-04 18:47:05.436523             62
2014-05-04 18:47:05.486877             41

# Drop every observation after 5/3/2014 00:00 (this keeps May 1 and May 2)
df.truncate(after='5/3/2014')

                            battle_deaths
date
2014-05-01 18:47:05.069722             34
2014-05-01 18:47:05.119994             25
2014-05-02 18:47:05.178768             26
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.280592             14

# Observations from May 2014, using a month-year string
df.loc['5-2014']

                            battle_deaths
date
2014-05-01 18:47:05.069722             34
2014-05-01 18:47:05.119994             25
2014-05-02 18:47:05.178768             26
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.230071             15
2014-05-02 18:47:05.280592             14
2014-05-03 18:47:05.332662             26
2014-05-03 18:47:05.385109             25
2014-05-04 18:47:05.436523             62
2014-05-04 18:47:05.486877             41
# Count the number of observations per timestamp
df.groupby(level=0).count()

                            battle_deaths
date
2014-05-01 18:47:05.069722              1
2014-05-01 18:47:05.119994              1
2014-05-02 18:47:05.178768              1
2014-05-02 18:47:05.230071              2
2014-05-02 18:47:05.280592              1
2014-05-03 18:47:05.332662              1
2014-05-03 18:47:05.385109              1
2014-05-04 18:47:05.436523              1
2014-05-04 18:47:05.486877              1

# Mean of battle_deaths per day
df.resample('D').mean()

            battle_deaths
date
2014-05-01           29.5
2014-05-02           17.5
2014-05-03           25.5
2014-05-04           51.5

# Total battle_deaths per day
df.resample('D').sum()

            battle_deaths
date
2014-05-01             59
2014-05-02             70
2014-05-03             51
2014-05-04            103

# Plot the total number of deaths per day
df.resample('D').sum().plot()
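
Several daily statistics can be computed in one pass by giving `agg` a list of function names after `resample`. The sketch below rebuilds the example data so that it runs on its own; `daily` is an illustrative name.

```python
import pandas as pd

data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994',
                 '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071',
                 '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592',
                 '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109',
                 '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
        'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41]}
df = pd.DataFrame(data)

# Parse the strings and use the timestamps as the index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# One row per calendar day, one column per statistic
daily = df['battle_deaths'].resample('D').agg(['count', 'sum', 'mean'])

print(daily)
```

The `sum` and `mean` columns reproduce the two daily tables above, and `count` shows how many observations fall on each day.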
