2-6_Cleaning_Data

Published: 2023/12/14

Cleaning Data

Cleaning and preprocessing data is usually a very important step as well, so this section touches on it.

    %matplotlib inline
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import warnings
    warnings.filterwarnings('ignore')
    plt.style.use("bmh")
    plt.rc('font', family='SimHei', size=25)  # font that can display Chinese characters
    pd.set_option('display.max_columns', 1000)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_colwidth', 1000)

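What counts as dirty or problematic data?

Let's look at the NYC 311 service request data together. The dataset is not small, and it does contain a few things that need cleaning up.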

    requests = pd.read_csv('311-service-requests.csv')
    requests.head()

The output has 5 rows and 52 columns (Unique Key through Location). A selection of the columns:

       Unique Key  Created Date            Agency  Complaint Type           Descriptor                    Incident Zip  City      Status
    0  26589651    10/31/2013 02:08:41 AM  NYPD    Noise - Street/Sidewalk  Loud Talking                  11432         JAMAICA   Assigned
    1  26593698    10/31/2013 02:01:04 AM  NYPD    Illegal Parking          Commercial Overnight Parking  11378         MASPETH   Open
    2  26594139    10/31/2013 02:00:24 AM  NYPD    Noise - Commercial       Loud Music/Party              10032         NEW YORK  Closed
    3  26595721    10/31/2013 01:56:23 AM  NYPD    Noise - Vehicle          Car/Truck Horn                10023         NEW YORK  Closed
    4  26590930    10/31/2013 01:53:44 AM  DOHMH   Rodent                   Condition Attracting Rodents  10027         NEW YORK  Pending

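6.1 How do we find dirty data?

There is no particularly clever way, really: you still have to pull out some data and look at it. For example, here we notice that the zip code field may have problems.

One thing worth mentioning is that .unique() comes in very handy. Listing every zip code that ever appears (and perhaps looking at the distribution afterwards) may give us some ideas.

Let's put unique() to work. You will find that there really are some problems, for example:

  • Why were most values parsed as numbers while some were parsed as strings?
  • Lots of missing values (nan)
  • Inconsistent formats: some look like 29616-0759, others like 83
  • Some values that should count as missing but are spelled out as text, such as 'N/A' or 'NO CLUE'

So what can we do?

  • Normalize 'N/A' and 'NO CLUE' into the missing-value "bucket"
  • Find out what on earth 83 is, then decide how to handle it
  • Make everything consistent by treating it all as strings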
    requests['Incident Zip'].unique()

    array([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, ..., nan, ...,
           10573.0, 83.0, 11559.0, 10020.0, 77056.0, 11776.0, 70711.0,
           10282.0, 11109.0, 10044.0, '10452', '11233', '10468', '10310', ...,
           '10039', '00083', '11411', ..., '10000', '02061', '77092-2016',
           '10280', '11109', '14225', '55164-0737', '19711', '07306',
           '000000', 'NO CLUE', '90010', '10281', '11747', '23541', '11776',
           '11697', '11788', '07604', 10112.0, 11788.0, 11563.0, 11580.0,
           7087.0, 11042.0, 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0,
           11797.0, '10803', '11716', '11722', '11549-3650', '10162', '92123',
           '23502', '11518', '07020', '08807', '11577', '07114', '11003',
           '07201', '11563', '61702', '10103', '29616-0759', '35209-3114',
           '11520', '11735', '10129', '11005', '41042', '11590', 6901.0,
           7208.0, 11530.0, 13221.0, 10954.0, 11735.0, 10103.0, 7114.0,
           11111.0, 10107.0], dtype=object)

    (output abbreviated: each "..." stands for a run of ordinary five-digit values)
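To make the "numbers versus strings" problem concrete, here is a minimal sketch on a toy Series. The values are a hand-picked stand-in for the real column, not the actual data, and counting Python types per value is just one way to inspect a mixed column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Incident Zip' column (illustrative values only).
zips = pd.Series(
    [11432.0, 83.0, np.nan, '10452', '29616-0759', 'NO CLUE'],
    dtype=object,
)

# Count how many values of each Python type the column holds.
# Note that nan is itself a float, so it is counted under 'float'.
type_counts = zips.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Look at the distribution of values, keeping missing ones visible.
print(zips.value_counts(dropna=False))
```

A column that reports both float and str here is a sign that the parser could not settle on one type, which is exactly what dtype=object in the output above hints at.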

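6.3 Handling missing values and the string/float mix-up

When reading the data with pd.read_csv, we can pass it na_values to clean out part of the dirty data. We can also state explicitly that the zip code must be a string: no numeric types, please!!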

    na_values = ['NO CLUE', 'N/A', '0']
    requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
    requests['Incident Zip'].unique()

    array(['11432', '11378', '10032', '10023', '10027', ..., nan, ...,
           '10573', '00083', '11559', '10020', '77056', '11776', '70711',
           '10282', '11109', '10044', '02061', '77092-2016', '14225',
           '55164-0737', '19711', '07306', '000000', '90010', '11747',
           '23541', '11788', '07604', '10112', '11563', '11580', '07087',
           '11042', '07093', '11501', '92123', '00000', '11575', '07109',
           '11797', '10803', '11716', '11722', '11549-3650', '10162', '23502',
           '11518', '07020', '08807', '11577', '07114', '11003', '07201',
           '61702', '10103', '29616-0759', '35209-3114', '11520', '11735',
           '10129', '11005', '41042', '11590', '06901', '07208', '11530',
           '13221', '10954', '11111', '10107'], dtype=object)

    (output abbreviated: each "..." stands for a run of ordinary five-digit values)
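The effect of na_values and dtype is easier to see on a tiny input. The sketch below reads a hypothetical six-row CSV from memory instead of 311-service-requests.csv; the rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for 311-service-requests.csv.
csv_text = """Unique Key,Incident Zip
1,11432
2,07087
3,NO CLUE
4,N/A
5,0
6,29616-0759
"""

# Default parsing: 'NO CLUE' survives as ordinary text.
raw = pd.read_csv(io.StringIO(csv_text))

# Extra missing-value markers, and zip codes forced to strings
# so that leading zeros (07087) are preserved.
clean = pd.read_csv(
    io.StringIO(csv_text),
    na_values=['NO CLUE', 'N/A', '0'],
    dtype={'Incident Zip': str},
)

print(raw['Incident Zip'].tolist())
print(clean['Incident Zip'].tolist())
```

A list passed as na_values applies to every column. If that is too broad, read_csv also accepts a dict that maps column names to their own markers.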

6.4 What is going on with the zip codes that contain a "-"?

    requests.loc[requests['Incident Zip'].str.contains('-').fillna(False), 'Incident Zip']

    29136    77092-2016
    30939    55164-0737
    70539    11549-3650
    85821    29616-0759
    89304    35209-3114
    Name: Incident Zip, dtype: object

    rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
    len(requests[rows_with_dashes])

    5

Really annoying, but there are only 5 of them, so let's print them out and see what they are.

    requests[rows_with_dashes]

The output has 5 rows and 52 columns. A selection of the columns:

           Unique Key  Created Date            Agency  Descriptor         Incident Zip  Incident Address        City        Status
    29136  26550551    10/24/2013 06:16:34 PM  DCA     False Advertising  77092         2700 EAST SELTICE WAY   HOUSTON     Assigned
    30939  26548831    10/24/2013 09:35:10 AM  DCA     Harassment         55164         P.O. BOX 64437          ST. PAUL    Assigned
    70539  26488417    10/15/2013 03:40:33 PM  TLC     Driver Complaint   11549         365 HOFSTRA UNIVERSITY  HEMSTEAD    Assigned
    85821  26468296    10/10/2013 12:36:43 PM  DCA     Debt Not Owed      29616         PO BOX 25759            GREENVILLE  Closed
    89304  26461137    10/09/2013 05:23:46 PM  DCA     Harassment         35209         600 BEACON PKWY         BIRMINGHAM  Closed

Since there are only 5, I originally planned to simply set them all to missing (nan): requests.loc[rows_with_dashes, 'Incident Zip'] = np.nan
After looking into it, though, the first 5 digits are probably the real zip code, so we might as well just truncate.

    long_zip_codes = requests['Incident Zip'].str.len() > 5
    requests['Incident Zip'][long_zip_codes].unique()

    array([], dtype=object)

    requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)

Done!
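The dash check and the truncation can be tried out on a few values. This is a minimal sketch on made-up data; passing na=False to str.contains is an alternative to the fillna(False) call used above, not what the original notebook does.

```python
import numpy as np
import pandas as pd

# Illustrative values only: a plain zip, two ZIP+4 codes, a missing value,
# and a six-digit zero string.
zips = pd.Series(
    ['11432', '77092-2016', np.nan, '29616-0759', '000000'],
    dtype=object,
)

# na=False makes missing values count as "no match", so no fillna is needed.
has_dash = zips.str.contains('-', regex=False, na=False)
print(zips[has_dash].tolist())

# Keep only the first five characters; missing values stay missing.
truncated = zips.str.slice(0, 5)
print(truncated.tolist())
```

Note that truncation turns '000000' into '00000', which still is not a usable zip code.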

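Argh. I looked up 00000 and it is not a real US or Canadian zip code at all, so truncation is not the right treatment for it. It really does have to be reset to a missing value.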

    requests[requests['Incident Zip'] == '00000']

    (output: an empty frame showing only the 52 column headers, Unique Key through Location, with no rows)

    zero_zips = requests['Incident Zip'] == '00000'
    requests.loc[zero_zips, 'Incident Zip'] = np.nan
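The same "turn a sentinel into a missing value" step can be written without modifying the Series in place. The sketch below also shows a stricter rule that blanks anything other than exactly five digits. That rule is an assumption added for illustration and is not part of the original cleaning steps.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
zips = pd.Series(['11432', '00000', np.nan, '07087', '0000', 'NO CL'], dtype=object)

# Same idea as the .loc assignment, expressed as a non-mutating replace.
no_zero = zips.replace('00000', np.nan)

# Stricter, assumed rule: keep a value only if it is exactly five digits
# and is not the all-zero placeholder; everything else becomes missing.
is_five_digits = zips.str.fullmatch(r'\d{5}', na=False)
strict = zips.where(is_five_digits & (zips != '00000'))

print(no_zero.tolist())
print(strict.tolist())
```

Whether the stricter rule is appropriate depends on the data: it would also discard ZIP+4 codes unless they are truncated first.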

All finished!! Let's see what the data looks like now.

    unique_zips = requests['Incident Zip'].unique()
    #unique_zips.sort_values()
    unique_zips

    array(['11432', '11378', '10032', '10023', '10027', ..., nan, ...,
           '10573', '00083', '11559', '10020', '77056', '11776', '70711',
           '10282', '11109', '10044', '02061', '77092', '14225', '55164',
           '19711', '07306', '90010', '11747', '23541', '11788', '07604',
           '10112', '11563', '11580', '07087', '11042', '07093', '11501',
           '92123', '11575', '07109', '11797', '10803', '11716', '11722',
           '11549', '10162', '23502', '11518', '07020', '08807', '11577',
           '07114', '11003', '07201', '61702', '10103', '29616', '35209',
           '11520', '11735', '10129', '11005', '41042', '11590', '06901',
           '07208', '11530', '13221', '10954', '11111', '10107'], dtype=object)

    (output abbreviated: each "..." stands for a run of ordinary five-digit values)

That looks much cleaner.

But are we really done?

    zips = requests['Incident Zip']
    # is_close marks zip codes starting with 0 or 1, which are plausibly correct
    is_close = zips.str.startswith('0') | zips.str.startswith('1')
    # Non-missing zip codes that do not start with 0 or 1 are treated as suspicious
    is_far = ~(is_close) & zips.notnull()
    zips[is_far]

    12102    77056
    13450    70711
    29136    77092
    30939    55164
    44008    90010
    47048    23541
    57636    92123
    71001    92123
    71834    23502
    80573    61702
    85821    29616
    89304    35209
    94201    41042
    Name: Incident Zip, dtype: object
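The prefix test can be checked on a handful of values. This is a minimal sketch with made-up data. It compares the first character with isin, an alternative spelling of the startswith test that yields plain booleans even for missing values.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
zips = pd.Series(['11432', '07087', '77056', np.nan, '90010'], dtype=object)

# True when the first character is '0' or '1'; missing values give False.
is_close = zips.str[0].isin(['0', '1'])

# Present, but not starting with 0 or 1: worth a second look.
is_far = ~is_close & zips.notna()

print(zips[is_far].tolist())
```

The 0/1 prefix is only a rough proxy for "near New York"; it flags rows for inspection rather than proving them wrong.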

We can sort them and print a few related fields alongside.

    requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort_values('Incident Zip')

           Incident Zip  Descriptor         City
    71834  23502         Harassment         NORFOLK
    47048  23541         Harassment         NORFOLK
    85821  29616         Debt Not Owed      GREENVILLE
    89304  35209         Harassment         BIRMINGHAM
    94201  41042         Harassment         FLORENCE
    30939  55164         Harassment         ST. PAUL
    80573  61702         Billing Dispute    BLOOMIGTON
    13450  70711         Contract Dispute   CLIFTON
    12102  77056         Debt Not Owed      HOUSTON
    29136  77092         False Advertising  HOUSTON
    44008  90010         Billing Dispute    LOS ANGELES
    57636  92123         Harassment         SAN DIEGO
    71001  92123         Billing Dispute    SAN DIEGO

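Ahem. It suddenly occurs to me that all the work above really just shows one way to clean up and fill in the data.

In practice, you will notice that some of the gaps could be filled simply by looking values up via the City column.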

    requests['City'].str.upper().value_counts()

    BROOKLYN               31662
    NEW YORK               22664
    BRONX                  18438
    STATEN ISLAND           4766
    JAMAICA                 2246
    FLUSHING                1803
    ASTORIA                 1568
    RIDGEWOOD               1073
    CORONA                   707
    OZONE PARK               693
    LONG ISLAND CITY         678
    FAR ROCKAWAY             652
    ELMHURST                 647
    WOODSIDE                 609
    EAST ELMHURST            562
    QUEENS VILLAGE           549
    FOREST HILLS             541
    JACKSON HEIGHTS          541
    SOUTH RICHMOND HILL      521
    MASPETH                  473
    WOODHAVEN                464
    FRESH MEADOWS            435
    SPRINGFIELD GARDENS      434
    BAYSIDE                  411
    SOUTH OZONE PARK         410
    RICHMOND HILL            404
    REGO PARK                402
    MIDDLE VILLAGE           396
    SAINT ALBANS             387
    WHITESTONE               348
                           ...
    NEW YORK CITY              1
    LYNBROOK                   1
    SYRACUSE                   1
    CLIFTON                    1
    VALLEY STREAM              1
    HEMSTEAD                   1
    EAST ROCKAWAY              1
    NEW YOR                    1
    NEWARK AIRPORT             1
    FARMINGDALE                1
    RYEBROOK                   1
    ROSELYN                    1
    NORTH BERGEN               1
    CHEEKTOWAGA                1
    NJ                         1
    BRIDGE WATER               1
    MINEOLA                    1
    BELLEVILLE                 1
    JERSEY CITY                1
    BOHIEMA                    1
    WOODBURY                   1
    NANUET                     1
    ROSLYN                     1
    UNION CITY                 1
    GREENVILLE                 1
    STAMFORD                   1
    ELIZABETH                  1
    ST. PAUL                   1
    LOS ANGELES                1
    HASBROCK HEIGHTS           1
    Name: City, Length: 100, dtype: int64
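One way the "look it up by city" idea could be sketched is to learn the most frequent known zip code per city and use it to fill the gaps. Everything below is hypothetical: the toy frame, the zip_by_city name and the most-frequent-value rule are assumptions for illustration. The rule is crude, since a large city such as BROOKLYN or NEW YORK spans many zip codes.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; only the column names match the real data.
df = pd.DataFrame({
    'City': ['JAMAICA', 'Jamaica', 'JAMAICA', 'MASPETH', 'MASPETH', None],
    'Incident Zip': ['11432', '11432', np.nan, '11378', np.nan, np.nan],
})

# Normalize case so 'Jamaica' and 'JAMAICA' are treated as the same city.
city = df['City'].str.upper()

# Most frequent known zip per city (assumed rule; on a tie the first mode is used).
known = df.dropna(subset=['City', 'Incident Zip'])
zip_by_city = (
    known.groupby(known['City'].str.upper())['Incident Zip']
    .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing zips; rows without a city stay missing.
df['Incident Zip'] = df['Incident Zip'].fillna(city.map(zip_by_city))
print(df)
```

For small towns that map to a single zip code this works reasonably well. For the boroughs it would be better to fall back on the address or the coordinates.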

6.5 Putting it all together

To summarize, this is how we cleaned the zip code field:

    na_values = ['NO CLUE', 'N/A', '0']
    requests = pd.read_csv('311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})

    def fix_zip_codes(zips):
        # Truncate everything to length 5
        zips = zips.str.slice(0, 5)
        # Set 00000 zip codes to nan
        zero_zips = zips == '00000'
        zips[zero_zips] = np.nan
        return zips

    requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
    requests['Incident Zip'].unique()

    array(['11432', '11378', '10032', '10023', '10027', ..., nan, ...,
           '10573', '00083', '11559', '10020', '77056', '11776', '70711',
           '10282', '11109', '10044', '02061', '77092', '14225', '55164',
           '19711', '07306', '90010', '11747', '23541', '11788', '07604',
           '10112', '11563', '11580', '07087', '11042', '07093', '11501',
           '92123', '11575', '07109', '11797', '10803', '11716', '11722',
           '11549', '10162', '23502', '11518', '07020', '08807', '11577',
           '07114', '11003', '07201', '61702', '10103', '29616', '35209',
           '11520', '11735', '10129', '11005', '41042', '11590', '06901',
           '07208', '11530', '13221', '10954', '11111', '10107'], dtype=object)

    (output abbreviated: each "..." stands for a run of ordinary five-digit values)
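A quick way to gain confidence in a helper like fix_zip_codes is to run it on a few hand-written values. The sketch below uses a variant built on mask instead of in-place assignment. The variant and the sample values are assumptions for illustration, and the behavior is meant to match the function above.

```python
import numpy as np
import pandas as pd

def fix_zip_codes(zips):
    """Truncate zip codes to five characters and blank out the 00000 placeholder."""
    zips = zips.str.slice(0, 5)
    # mask() returns a new Series with NaN wherever the condition is True.
    return zips.mask(zips == '00000')

# Illustrative inputs: a ZIP+4 code, two all-zero strings, a missing value, a normal zip.
sample = pd.Series(['77092-2016', '000000', '00000', np.nan, '11432'], dtype=object)
fixed = fix_zip_codes(sample)
print(fixed.tolist())
```

Because the six-digit '000000' is truncated before the comparison, it is caught by the same rule as '00000'.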
