Data Analysis: Data Cleaning | Data Science
Data Cleaning
Data cleaning is the process of modifying data to ensure that it is correct, accurate, and meaningful. The definition is simple, but data cleaning is used in many scenarios and covers a wide range of activities, all of which aim to improve the quality of your data. These tasks are usually accomplished by combining several other operations. This post discusses the most important data cleaning tasks.
Schema Matching and Data Standardization
Schema matching is often the first task you need to perform. Its aim is to align the attributes coming from new datasets with the ones in your existing database.
Existing Customer Schema (Name, Country, Address, Phone)
Incoming Customer Schema (Country, City, Street, Apt, Phone)
To match these schemas and move ahead with your data matching operation, you need to devise a procedure that converts each tuple in the Incoming Customer Schema to the Existing Customer Schema.
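A conversion procedure of this kind might look like the following minimal sketch. The class names, the `to_existing` function, and the choice to join Street, Apt, and City into a single Address string are assumptions made for illustration; the incoming schema carries no Name attribute, so it is left unknown here.

```python
from typing import NamedTuple, Optional


class IncomingCustomer(NamedTuple):
    country: str
    city: str
    street: str
    apt: str
    phone: str


class ExistingCustomer(NamedTuple):
    name: Optional[str]
    country: str
    address: str
    phone: str


def to_existing(rec: IncomingCustomer) -> ExistingCustomer:
    """Map one incoming tuple onto the existing schema (hypothetical rule)."""
    # Join the non-empty address parts into a single Address value.
    parts = [rec.street, rec.apt, rec.city]
    address = ", ".join(p.strip() for p in parts if p and p.strip())
    # The incoming schema has no Name attribute, so it stays None.
    return ExistingCustomer(name=None, country=rec.country,
                            address=address, phone=rec.phone)


incoming = IncomingCustomer("USA", "Springfield", "12 Main St", "Apt 4", "555-0100")
print(to_existing(incoming))
```

In practice the missing Name would have to come from another source or be filled in during record matching.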
Another scenario involves the same two schemas, but assume that the records about your customers do not contain zip codes. If you need to see how many customers there are for a particular code, it is critical to have the correct zip values.
The same principles apply when you need to maintain your product catalog database. You should make sure that all dimensions of a product are expressed in the same units and that these values are not missing. If not, search queries will return incorrect results. The task that ensures all values use the same convention is called data standardization. You should perform it before other data cleaning activities such as data matching and data deduplication. These are by no means trivial activities, and it is often not feasible to perform them manually.
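As a rough illustration of standardization, the sketch below converts product weights to a single unit and flags missing values. The `TO_GRAMS` table, the `standardize_weight` function, and the sample products are all hypothetical; a real catalog would need a conversion table for each dimension it stores.

```python
# Hypothetical conversion table: multiply by the factor to get grams.
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "lb": 453.59237, "oz": 28.349523125}


def standardize_weight(value, unit):
    """Return the weight in grams, or None when the value is missing."""
    if value is None:
        return None
    key = unit.strip().lower()
    if key not in TO_GRAMS:
        raise ValueError(f"unknown unit: {unit!r}")
    return float(value) * TO_GRAMS[key]


# Illustrative catalog rows: (product, weight, unit).
products = [("Camera body", 0.84, "kg"), ("Lens", 305, "g"), ("Strap", None, "g")]

standardized = [(name, standardize_weight(v, u)) for name, v, u in products]
# Products whose weight is missing and needs attention before matching.
missing = [name for name, grams in standardized if grams is None]

print(standardized)
print("missing:", missing)
```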
Data Matching
The aim of record matching is to match each record from one dataset with the records from another dataset. You usually need to perform this operation when you import new data. That way, you make sure the new datasets do not introduce duplicate entities.
Consider a situation in which you need to import a new set of customer records into your business database. You should check whether the same customer is represented in both the incoming batch and the existing database, and keep only one record. Unfortunately, because of typing errors or differences in representation, the same record can look different in the two datasets. As a result, it may not match on the relevant attributes, for example phone, address, and name.
The difficulty often increases for entries where the product description is a concatenation of more than one attribute. The goal of record matching is therefore to find pairs of records, one from each of the two datasets, that refer to the same entity.
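One simple way to tolerate typing errors is to compare attributes by string similarity instead of exact equality. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The field names, the equal weighting of fields, and the 0.85 threshold are assumptions chosen for illustration, not settings taken from any particular product.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase, turn punctuation into spaces, and collapse whitespace.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_score(r1, r2, fields=("name", "address", "phone")):
    # Unweighted average of the per-field similarities (an assumption).
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def match_records(incoming, existing, threshold=0.85):
    """Return (incoming_index, existing_index) pairs judged to be the same entity."""
    matches = []
    for i, new in enumerate(incoming):
        best_j, best_score = None, 0.0
        for j, old in enumerate(existing):
            score = record_score(new, old)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            matches.append((i, best_j))
    return matches


existing = [
    {"name": "John Smith", "address": "12 Main St, Springfield", "phone": "555-0100"},
    {"name": "Maria Garcia", "address": "7 Oak Ave, Portland", "phone": "555-0199"},
]
incoming = [
    {"name": "Jon Smith", "address": "12 Main Street, Springfield", "phone": "5550100"},
    {"name": "Wei Chen", "address": "90 Pine Rd, Austin", "phone": "555-0142"},
]

print(match_records(incoming, existing))
```

Here the misspelled "Jon Smith" record is paired with the existing "John Smith" record, while "Wei Chen" stays unmatched and would be imported as a new customer.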
The most important challenges you need to address at this stage:
Identify the criteria that establish that two records do refer to the same real-world entity.
With the large datasets available today, find the most efficient computation technique, one that can determine these pairs over large volumes of data.
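A common way to address the efficiency challenge is blocking: records are grouped by a cheap key, and only records that share a key are compared in detail. The source does not name a technique, so the sketch below is one possible approach, and the blocking key (country plus the first letter of the name) is a hypothetical choice.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: country plus the first letter of the name.
    return (record["country"], record["name"][:1].lower())


def candidate_pairs(records, key=blocking_key):
    """Return index pairs that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


records = [
    {"name": "John Smith", "country": "US"},
    {"name": "Jon Smith", "country": "US"},
    {"name": "Maria Garcia", "country": "ES"},
    {"name": "Mario Garcia", "country": "ES"},
    {"name": "Wei Chen", "country": "CN"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off is that a typo inside the blocking key itself places two matching records in different blocks, so the key should be built from attributes that are rarely wrong.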
Fortunately, a few applications can help you overcome these obstacles. Using its smart fuzzy matching engine, our product is designed to find the most true matches and the fewest false matches. You can also combine these results with the customizable knowledge base library.
Data Deduplication
Data deduplication aims to group the records in a dataset so that each group represents the same real-world entity. For best results, you should perform this procedure both when you populate the database for the first time and when you add new records. Compared with data matching, deduplication additionally involves grouping the matching records, so that the groups together partition the input dataset.
Consider an example where your database stores several records, such as:
Nikon D750 Camera
Nikon D750 SLR
Nikon D750 Digital SLR
This set contains different records that represent the same entity. You should therefore be able not only to match two of them, but to match all three records to the same real-world entity.
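The grouping step can be sketched as follows, assuming token-level Jaccard similarity and a union-find structure that merges every pair above a threshold. Both the similarity measure and the 0.5 threshold are illustrative assumptions, and the Canon record is added only to show a non-duplicate.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def deduplicate(records, threshold=0.5):
    """Group records so that similar ones land in the same cluster."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


records = [
    "Nikon D750 Camera",
    "Nikon D750 SLR",
    "Nikon D750 Digital SLR",
    "Canon EOS 5D Camera",  # hypothetical extra record, not a duplicate
]
for group in deduplicate(records):
    print(group)
```

"Nikon D750 Camera" and "Nikon D750 Digital SLR" fall below the threshold when compared directly, but both are similar enough to "Nikon D750 SLR", so merging clusters transitively still places all three in one group.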
Data Profiling
Because data cleaning is an iterative process, it is essential to be able to assess the quality of your data, both before and after the data cleaning process, so that you can check its effectiveness. We call this procedure data profiling. Its most important goal is to ensure that your values match your expectations.
For example, you may expect a customer name and address to uniquely identify every customer in your database. In that case, the number of unique (name, address) tuples should be as close as possible to the total number of entries in your database.
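This check can be expressed as a uniqueness ratio: distinct key tuples divided by total rows. The function name, the light normalization (trimming and lowercasing), and the sample rows below are assumptions for illustration.

```python
def uniqueness_ratio(rows, key_fields):
    """Distinct key tuples divided by total rows (1.0 means fully unique)."""
    if not rows:
        return 1.0
    distinct = {
        tuple(str(row[f]).strip().lower() for f in key_fields)
        for row in rows
    }
    return len(distinct) / len(rows)


customers = [
    {"name": "John Smith", "address": "12 Main St"},
    {"name": "Maria Garcia", "address": "7 Oak Ave"},
    {"name": "john smith", "address": "12 Main St "},  # same customer, different casing
    {"name": "Wei Chen", "address": "90 Pine Rd"},
]

ratio = uniqueness_ratio(customers, ("name", "address"))
print(f"uniqueness: {ratio:.2f}")  # 3 distinct tuples over 4 rows
```

A ratio well below 1.0 suggests that the dataset contains duplicates, or that name and address do not identify customers as expected.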
Although you can obtain subsets of this information through a few SQL queries, this approach is inefficient and tedious. Data Profiling/Statistics is an easy-to-use and powerful data profiling tool made to help you find patterns in your datasets. The module can check the quality of your data by examining value counts, types, formats, and completeness, and it provides a complete set of statistical information intended to help clean your data.
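To make these statistics concrete, here is a minimal standalone sketch of a column profile built with `collections.Counter`. It is not the API of the module mentioned above; the function names and the format notation (`9` for a digit, `A` for a letter) are hypothetical.

```python
from collections import Counter


def format_pattern(value):
    # Replace digits with 9 and letters with A; keep other characters as-is.
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in str(value)
    )


def profile_column(values):
    """Summarize value counts, types, formats, and completeness of one column."""
    present = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "completeness": len(present) / len(values) if values else 0.0,
        "types": Counter(type(v).__name__ for v in present),
        "formats": Counter(format_pattern(v) for v in present),
        "top_values": Counter(present).most_common(3),
    }


phones = ["555-0100", "555-0199", "5550142", None, "555-0100", ""]
profile = profile_column(phones)
for key, value in profile.items():
    print(key, value)
```

In this sample, the format counts show one phone number that does not follow the dominant `999-9999` pattern, and the completeness figure shows that two of the six values are missing.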
Translated from: https://www.includehelp.com/data-science/data-cleaning.aspx