
Visualizing Data to Help Clean Data

Published: 2023/11/29

數(shù)據(jù)可視化 信息可視化

The role of a data scientist involves uncovering hidden relationships in massive amounts of structured or unstructured data, with the aim of meeting or adjusting certain business criteria. In recent times this role’s importance has been greatly magnified as businesses look to expand insight about the market and their customers with easily obtainable data.

It is the data scientist’s job to take that data and return a deeper understanding of the business problem or opportunity. This often involves scientific methods such as machine learning (ML) or neural networks (NN). While these kinds of models may find meaning in thousands of data points much faster than a human can, they can be unreliable if the data fed into them is messy.

Messy data can have very negative consequences on your models. It comes in many forms, including:

Missing data:

Represented as ‘NaN’ (an acronym for Not a Number) or as ‘None’, the Python singleton object.

Sometimes the best way to deal with problems is the simplest.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')
df.info()

A quick inspection of the returned values shows that the count of 891 is inconsistent across the different columns, a clear sign of missing information. We also notice some fields are of type “object”; we’ll look at that next.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
survived 891 non-null int64
pclass 891 non-null int64
name 891 non-null object
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
ticket 891 non-null object
fare 891 non-null float64
cabin 204 non-null object
embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
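The simplest remedies can be sketched on a small hypothetical frame (the column names echo the data above, but the values here are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps similar to the age/cabin columns above.
df = pd.DataFrame({
    'age':   [22.0, np.nan, 38.0, np.nan, 35.0],
    'cabin': ['C85', None, None, 'C123', None],
    'fare':  [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count the missing values per column.
missing = df.isnull().sum()

# Option 1: drop any row containing a missing value -- simple but lossy.
dropped = df.dropna()

# Option 2: impute a numeric column with its median instead.
df['age'] = df['age'].fillna(df['age'].median())
```

Whether to drop or impute depends on how much signal the affected rows carry; running df.info() again after either step confirms the counts are consistent.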

Alternatively, you can plot the missing values on a heatmap using seaborn, but this can be very time-consuming when handling big dataframes.

sns.heatmap(df.isnull(), cbar=False)

數(shù)據(jù)不一致(Inconsistent data:)

  • Inconsistent column types: Columns in dataframes can differ, as we saw above. Columns can be of different types such as objects, integers, or floats; a mismatch between a column’s type and the type of the values it holds can be problematic. Among the most important format types is datetime, used for time and date values.
  • Inconsistent value formatting: This problem mainly arises with categorical values when misspellings or typos are present. It can be checked with the following:
df['age'].value_counts()

This returns the number of times each value is repeated throughout the dataset.
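A hedged sketch of cleaning such a column (the values below are invented): normalise case and whitespace first, then fix the known typos that value_counts() exposed.

```python
import pandas as pd

# Hypothetical categorical column with mixed casing, stray
# whitespace and a typo.
s = pd.Series(['male', 'Male', 'MALE ', 'female', 'femal'])

# Normalise case/whitespace, then map known typos explicitly.
clean = s.str.strip().str.lower().replace({'femal': 'female'})

# value_counts() now shows only the two intended categories.
counts = clean.value_counts()
```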

Outlier data:

A dataframe column holds information about a specific feature within the data, so we usually have a basic idea of the range of its values. For age, for example, we know there is going to be a range between 0 and 100. This does not mean that outliers cannot be present within that range.

A simple illustration can be seen by graphing a boxplot:

sns.boxplot(x=df['age'])
plt.show()

The values seen as dots on the right-hand side could be considered outliers in this dataframe, as they fall outside the range of commonly witnessed values.
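The whiskers of a default boxplot follow Tukey’s 1.5 × IQR rule, so the same cutoffs can be computed directly (the ages below are invented):

```python
import pandas as pd

# Hypothetical age column with two implausibly large values.
age = pd.Series([22, 25, 29, 31, 34, 36, 40, 45, 71, 80])

# Tukey's rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = age[(age < lower) | (age > upper)]
```

These are exactly the points the boxplot would draw as dots beyond the whiskers.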

Multicollinearity:

While multicollinearity is not considered messy data, it means that columns or features in the dataframe are correlated. For example, if you had a column for “price”, a column for “weight”, and a third for “price per weight”, we would expect high multicollinearity between these fields. This can be solved by dropping some of the highly correlated columns.

f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)

In this case we can see that the values do not exceed 0.7, either positively or negatively, and hence it can be considered safe to continue.
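To make the price/weight example concrete, here is a sketch with synthetic numbers (the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: price is almost exactly 3 * weight.
rng = np.random.default_rng(0)
weight = rng.uniform(1, 10, 100)
price = 3 * weight + rng.normal(0, 0.1, 100)

df = pd.DataFrame({
    'weight': weight,
    'price': price,
    'price_per_weight': price / weight,
})

# The correlation matrix exposes the near-perfect price/weight link.
corr = df.corr()

# price is almost a linear function of weight, so one of the two can go.
reduced = df.drop(columns=['price'])
```

Which of the correlated columns to drop is a modelling choice; the heatmap only tells you that the redundancy exists.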

Making this process easier:

While data scientists often go through these initial tasks repetitively, the work can be made easier by creating structured functions that allow easy visualisation of this information. Let’s try:

from quickdata import data_viz  # File found in repository

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
print(data['DESCR'][:830])
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

1-Checking Multicollinearity


The function below returns a heatmap of collinearity between independent variables as well as with the target variable.


data = independent variable df X

target = dependent variable list y

remove = list of variables not to be included (default: empty list)

add_target = boolean of whether to view the heatmap with the target included (default: False)

inplace = manipulate your df to save the changes made with remove/add_target (default: False)

*In case remove is passed a column name, a regplot of that column against the target is also presented, to help view the change before proceeding.*

data_viz.multicollinearity_check(data=X, target=y, remove=['Latitude'], add_target=False, inplace=False)
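For readers without access to the repository, here is a minimal sketch of what such a helper might look like. It only assembles the correlation matrix that would feed the heatmap; the real quickdata function also draws the plot and supports inplace, and its exact behaviour may differ.

```python
import pandas as pd

def multicollinearity_check(data, target=None, remove=None, add_target=False):
    """Sketch of a quickdata-style helper: returns the correlation
    matrix after optionally dropping columns and appending the target.
    Plotting (the heatmap) is deliberately left out of this sketch."""
    frame = data.drop(columns=list(remove or []))
    if add_target and target is not None:
        frame = frame.assign(target=list(target))
    return frame.corr()

# Tiny invented frame: b is exactly 2 * a, c is noise we drop.
X = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 1, 0]})
corr = multicollinearity_check(X, remove=['c'])
```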

2- Viewing Outliers: This function returns a side-by-side view of outliers through a regplot and a boxplot visualisation of the input data and target values over a specified split size.

data = independent variable df X

target = dependent variable list y

split = adjust the number of plotted rows, as a decimal between 0 and 1 or as an integer

data_viz.view_outliers(data=X, target=y, split_size=0.3)

It is important that these charts are read by the data scientist and not automated away to the machine. Since not all datasets follow the same rules, it is important that a human interprets the visualisations and acts accordingly.

I hope this short run-through of data visualisation helps provide clearer views of your data, to better fuel your decisions when data cleaning.

The functions used in the examples above are available here:

Feel free to customise these as you see fit!


Translated from: https://medium.com/@rani_64949/visualisations-of-data-for-help-in-data-cleaning-dce15a94b383

數(shù)據(jù)可視化 信息可視化

總結(jié)

以上是生活随笔為你收集整理的数据可视化 信息可视化_可视化数据以帮助清理数据的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò),歡迎將生活随笔推薦給好友。