2-6_Cleaning_Data
Cleaning Data

Cleaning and preprocessing data is usually an essential step, and that's what this section covers.
```python
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings

warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=25)  # display Chinese characters
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
```

What kind of data counts as dirty or problematic data?
Let's look at the NYC 311 service request data together. The dataset isn't small, and it genuinely has a few things worth cleaning up.
```python
requests = pd.read_csv('311-service-requests.csv')
requests.head()
```

(Output: the first five rows of a very wide table, one 311 complaint per row — Noise, Illegal Parking, Rodent and so on — with agency, address, status, and latitude/longitude columns, and plenty of NaN and Unspecified cells.)
6.1 How do we find dirty data?
There's no magic trick for this: you still have to pull some data out and look at it. In this dataset, for example, we noticed that the zip code field might be problematic.

One thing worth mentioning is that the .unique() function is surprisingly handy here: list every zip code that has ever appeared (and maybe look at the distribution afterwards?), and some ideas will probably come to mind.
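As a tiny illustration of that idea (a made-up Series, not the 311 file): `unique()` shows each distinct value once, and `value_counts()` shows the distribution.

```python
import pandas as pd

# A made-up zip column with the kinds of values a messy CSV can contain
zips = pd.Series(['11432', '11432', '10027', None, '00083', '11432'])

print(zips.unique())        # each distinct value, once
print(zips.value_counts())  # how often each value occurs
```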
Now let's actually use unique(). You'll find there really are some problems, for example:
- Why were most values parsed as numbers, while some came out as strings?
- There are lots of missing values (nan)
- The formats are inconsistent: some are 29616-0759, others are 83
- Some values pandas doesn't recognize as missing, such as 'N/A' or 'NO CLUE'
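The first and third quirks both come from pandas' type inference. A minimal illustration with a made-up two-column CSV: a column that looks entirely numeric is parsed as integers (losing leading zeros — which is how a zip like 00083 turns into 83), while a single dashed value keeps the whole column as strings.

```python
import io
import pandas as pd

# A made-up two-column CSV: column a looks entirely numeric,
# column b contains one dashed value that blocks numeric parsing
csv = "a,b\n00083,00083\n11432,29616-0759\n"
df = pd.read_csv(io.StringIO(csv))

print(df['a'].tolist())  # integers: the leading zeros of 00083 are gone
print(df['b'].tolist())  # strings: 00083 survives intact
```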
So what can we do?

- Fold 'N/A' and 'NO CLUE' into the missing-value bucket
- Figure out what on earth 83 is, then decide how to handle it
- Make everything consistent: just treat them all as strings
6.3 Handling missing values and the string/float mess
When reading the data with pd.read_csv, we can pass it a na_values list to clean away part of the dirty data up front, and we can also insist that the zip codes be kept as strings — no silently turning them into numbers!
```python
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv',
                       na_values=na_values,
                       dtype={'Incident Zip': str})
requests['Incident Zip'].unique()
```

```
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419', '11417',
       '10011', '11225', '11218', '10003', '10029', '10466', '11219', '10025',
       '10310', '11236', nan, '10033', '11216', '10016', '10305', '10312',
       '10026', '10309', '10036', '11433', '11235', '11213', '11379', '11101',
       '10014', '11231', '11234', '10457', '10459', '10465', '11207', '10002',
       '10034', '11233', '10453', '10456', '10469', '11374', '11221', '11421',
       '11215', '10007', '10019', '11205', '11418', '11369', '11249', '10005',
       '10009', '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212', '10022',
       '11232', '11040', '11226', '10281', '11102', '11208', '10001', '10472',
       '11414', '11223', '10040', '11220', '11373', '11203', '11691', '11356',
       '10017', '10452', '10280', '11217', '10031', '11201', '11358', '10128',
       '11423', '10039', '10010', '11209', '10021', '10037', '11413', '11375',
       '11238', '10473', '11103', '11354', '11361', '11106', '11385', '10463',
       '10467', '11204', '11237', '11377', '11364', '11434', '11435', '11210',
       '11228', '11368', '11694', '10464', '11415', '10314', '10301', '10018',
       '10038', '11105', '11230', '10468', '11104', '10471', '11416', '10075',
       '11422', '11355', '10028', '10462', '10306', '10461', '11224', '11429',
       '10035', '11366', '11362', '11206', '10460', '10304', '11360', '11411',
       '10455', '10475', '10069', '10303', '10308', '10302', '11357', '10470',
       '11367', '11370', '10454', '10451', '11436', '11426', '10153', '11004',
       '11428', '11427', '11001', '11363', '10004', '10474', '11430', '10000',
       '10307', '11239', '10119', '10006', '10048', '11697', '11692', '11693',
       '10573', '00083', '11559', '10020', '77056', '11776', '70711', '10282',
       '11109', '10044', '02061', '77092-2016', '14225', '55164-0737', '19711',
       '07306', '000000', '90010', '11747', '23541', '11788', '07604', '10112',
       '11563', '11580', '07087', '11042', '07093', '11501', '92123', '00000',
       '11575', '07109', '11797', '10803', '11716', '11722', '11549-3650',
       '10162', '23502', '11518', '07020', '08807', '11577', '07114', '11003',
       '07201', '61702', '10103', '29616-0759', '35209-3114', '11520', '11735',
       '10129', '11005', '41042', '11590', '06901', '07208', '11530', '13221',
       '10954', '11111', '10107'], dtype=object)
```

6.4 What are those zip codes joined with a dash?
```python
requests.loc[requests['Incident Zip'].str.contains('-').fillna(False), 'Incident Zip']
```

```
29136    77092-2016
30939    55164-0737
70539    11549-3650
85821    29616-0759
89304    35209-3114
Name: Incident Zip, dtype: object
```

```python
rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
len(requests[rows_with_dashes])
```

```
5
```

Annoying, but there are only 5 of them. Let's print them out and see what they are.
```python
requests[rows_with_dashes]
```

(Output: the five rows in full. They are Department of Consumer Affairs and Taxi and Limousine Commission complaints whose addresses lie outside NYC: HOUSTON, ST. PAUL, HEMSTEAD, GREENVILLE, and BIRMINGHAM.)
Since there are only 5 anyway, the original plan was to just set them all to missing (nan):

```python
requests['Incident Zip'][rows_with_dashes] = np.nan
```
But after looking it up, it turns out the first five digits are the real zip code (the rest is the ZIP+4 extension), so let's simply truncate instead.
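The truncation itself is one `str.slice` call, shown here on a toy Series:

```python
import pandas as pd

# A toy Series with the ZIP+4 style values found above
zips = pd.Series(['77092-2016', '55164-0737', '11432'])

# Keep only the first five characters, i.e. the real zip code
print(zips.str.slice(0, 5).tolist())
```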
Done!
Damn — looking up 00000 shows it isn't a US or Canadian zip code at all, so that one can't be fixed by truncation. It really does have to be reset to missing.
```python
# Find the 00000 rows and reset their zip to missing
zero_zips = requests['Incident Zip'] == '00000'
requests.loc[zero_zips, 'Incident Zip'] = np.nan
```

Finished!! Let's see what the data looks like now.
```python
unique_zips = requests['Incident Zip'].unique()
# unique_zips.sort_values()
unique_zips
```

```
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419', '11417',
       '10011', '11225', '11218', '10003', '10029', '10466', '11219', '10025',
       '10310', '11236', nan, '10033', '11216', '10016', '10305', '10312',
       '10026', '10309', '10036', '11433', '11235', '11213', '11379', '11101',
       '10014', '11231', '11234', '10457', '10459', '10465', '11207', '10002',
       '10034', '11233', '10453', '10456', '10469', '11374', '11221', '11421',
       '11215', '10007', '10019', '11205', '11418', '11369', '11249', '10005',
       '10009', '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212', '10022',
       '11232', '11040', '11226', '10281', '11102', '11208', '10001', '10472',
       '11414', '11223', '10040', '11220', '11373', '11203', '11691', '11356',
       '10017', '10452', '10280', '11217', '10031', '11201', '11358', '10128',
       '11423', '10039', '10010', '11209', '10021', '10037', '11413', '11375',
       '11238', '10473', '11103', '11354', '11361', '11106', '11385', '10463',
       '10467', '11204', '11237', '11377', '11364', '11434', '11435', '11210',
       '11228', '11368', '11694', '10464', '11415', '10314', '10301', '10018',
       '10038', '11105', '11230', '10468', '11104', '10471', '11416', '10075',
       '11422', '11355', '10028', '10462', '10306', '10461', '11224', '11429',
       '10035', '11366', '11362', '11206', '10460', '10304', '11360', '11411',
       '10455', '10475', '10069', '10303', '10308', '10302', '11357', '10470',
       '11367', '11370', '10454', '10451', '11436', '11426', '10153', '11004',
       '11428', '11427', '11001', '11363', '10004', '10474', '11430', '10000',
       '10307', '11239', '10119', '10006', '10048', '11697', '11692', '11693',
       '10573', '00083', '11559', '10020', '77056', '11776', '70711', '10282',
       '11109', '10044', '02061', '77092', '14225', '55164', '19711', '07306',
       '90010', '11747', '23541', '11788', '07604', '10112', '11563', '11580',
       '07087', '11042', '07093', '11501', '92123', '11575', '07109', '11797',
       '10803', '11716', '11722', '11549', '10162', '23502', '11518', '07020',
       '08807', '11577', '07114', '11003', '07201', '61702', '10103', '29616',
       '35209', '11520', '11735', '10129', '11005', '41042', '11590', '06901',
       '07208', '11530', '13221', '10954', '11111', '10107'], dtype=object)
```

That looks a lot cleaner.
But are we really done?
We can sort and print out a few corresponding fields to check.
```python
# (not defined earlier in this excerpt) is_far flags zips outside the
# NYC area: NYC-area zips start with '0' or '1'
zips = requests['Incident Zip']
is_close = zips.str.startswith('0') | zips.str.startswith('1')
is_far = ~(is_close) & zips.notnull()

requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort_values('Incident Zip')
```

| Incident Zip | Descriptor | City |
|---|---|---|
| 23502 | Harassment | NORFOLK |
| 23541 | Harassment | NORFOLK |
| 29616 | Debt Not Owed | GREENVILLE |
| 35209 | Harassment | BIRMINGHAM |
| 41042 | Harassment | FLORENCE |
| 55164 | Harassment | ST. PAUL |
| 61702 | Billing Dispute | BLOOMIGTON |
| 70711 | Contract Dispute | CLIFTON |
| 77056 | Debt Not Owed | HOUSTON |
| 77092 | False Advertising | HOUSTON |
| 90010 | Billing Dispute | LOS ANGELES |
| 92123 | Harassment | SAN DIEGO |
| 92123 | Billing Dispute | SAN DIEGO |
Ahem. It suddenly dawns on us that the big pile of work above really just demonstrates how you can go about handling and patching up data.

In practice, though, you'd notice that simply matching against the City field would already let you fill in a fair amount.
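A sketch of that idea on a hypothetical toy frame (a real fix would need a trustworthy city-to-zip table, since one city spans many zip codes):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: City is present even where Incident Zip is missing
df = pd.DataFrame({
    'City': ['HOUSTON', 'HOUSTON', 'NORFOLK', 'NORFOLK'],
    'Incident Zip': ['77056', np.nan, '23502', np.nan],
})

# Map each city to a zip already seen for it, then fill the gaps from that map
city_to_zip = (df.dropna(subset=['Incident Zip'])
                 .groupby('City')['Incident Zip'].first())
df['Incident Zip'] = df['Incident Zip'].fillna(df['City'].map(city_to_zip))
print(df['Incident Zip'].tolist())
```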
6.5 Putting it all together
So, to summarize, this is how we cleaned the zip code field:
```python
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('311-service-requests.csv',
                       na_values=na_values,
                       dtype={'Incident Zip': str})

def fix_zip_codes(zips):
    # Truncate everything to length 5
    zips = zips.str.slice(0, 5)
    # Set 00000 zip codes to nan
    zero_zips = zips == '00000'
    zips[zero_zips] = np.nan
    return zips

requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
requests['Incident Zip'].unique()
```

```
array(['11432', '11378', '10032', '10023', '10027', '11372', '11419', '11417',
       '10011', '11225', '11218', '10003', '10029', '10466', '11219', '10025',
       '10310', '11236', nan, '10033', '11216', '10016', '10305', '10312',
       '10026', '10309', '10036', '11433', '11235', '11213', '11379', '11101',
       '10014', '11231', '11234', '10457', '10459', '10465', '11207', '10002',
       '10034', '11233', '10453', '10456', '10469', '11374', '11221', '11421',
       '11215', '10007', '10019', '11205', '11418', '11369', '11249', '10005',
       '10009', '11211', '11412', '10458', '11229', '10065', '10030', '11222',
       '10024', '10013', '11420', '11365', '10012', '11214', '11212', '10022',
       '11232', '11040', '11226', '10281', '11102', '11208', '10001', '10472',
       '11414', '11223', '10040', '11220', '11373', '11203', '11691', '11356',
       '10017', '10452', '10280', '11217', '10031', '11201', '11358', '10128',
       '11423', '10039', '10010', '11209', '10021', '10037', '11413', '11375',
       '11238', '10473', '11103', '11354', '11361', '11106', '11385', '10463',
       '10467', '11204', '11237', '11377', '11364', '11434', '11435', '11210',
       '11228', '11368', '11694', '10464', '11415', '10314', '10301', '10018',
       '10038', '11105', '11230', '10468', '11104', '10471', '11416', '10075',
       '11422', '11355', '10028', '10462', '10306', '10461', '11224', '11429',
       '10035', '11366', '11362', '11206', '10460', '10304', '11360', '11411',
       '10455', '10475', '10069', '10303', '10308', '10302', '11357', '10470',
       '11367', '11370', '10454', '10451', '11436', '11426', '10153', '11004',
       '11428', '11427', '11001', '11363', '10004', '10474', '11430', '10000',
       '10307', '11239', '10119', '10006', '10048', '11697', '11692', '11693',
       '10573', '00083', '11559', '10020', '77056', '11776', '70711', '10282',
       '11109', '10044', '02061', '77092', '14225', '55164', '19711', '07306',
       '90010', '11747', '23541', '11788', '07604', '10112', '11563', '11580',
       '07087', '11042', '07093', '11501', '92123', '11575', '07109', '11797',
       '10803', '11716', '11722', '11549', '10162', '23502', '11518', '07020',
       '08807', '11577', '07114', '11003', '07201', '61702', '10103', '29616',
       '35209', '11520', '11735', '10129', '11005', '41042', '11590', '06901',
       '07208', '11530', '13221', '10954', '11111', '10107'], dtype=object)
```