EDA Analysis: The EDA Theoretical Guide

Published: 2023/11/29 | 豆豆

Most data analysis problems start with understanding the data. It is the most crucial and often the most complicated step. It also affects later decisions in a predictive modeling problem, one of which is the choice of algorithm.

This article is a complete guide to that step.

Content

Reading Data

Variable Identification

Univariate analysis

Bivariate analysis

Missing values: types and analysis

Outlier treatment

Variable Transformation

Reading Data and Variable Identification

Reading the data means answering the following questions:

What is the shape of my data?
How many features does my data contain?
What does it look like?
What are the types of variables?

Guide 1: Types of Variables
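The four questions above can be answered with a few pandas calls. This is a minimal sketch, assuming pandas is the tool in use; the toy CSV text and its column names (`age`, `income`, `city`, `owns_house`) are hypothetical and only stand in for a real dataset.

```python
import io
import pandas as pd

# Hypothetical toy data; with a real file, pass its path to pd.read_csv instead.
csv_text = """age,income,city,owns_house
25,32000,Pune,no
41,58000,Delhi,yes
37,,Mumbai,no
60,41000,Pune,yes
"""
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape          # What is the shape of my data?
feature_names = list(df.columns)   # How many (and which) features does it contain?
preview = df.head()                # What does it look like?
dtypes = df.dtypes                 # What are the types of variables?

# Variable identification: numeric columns versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(preview)
print(dtypes)
print(numeric_cols, categorical_cols)
```

Splitting columns by dtype is only a first pass: a numeric column such as a postal code can still be categorical in meaning, so the result is worth checking by eye.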

Univariate Analysis (UA)

What is UA?

When we explore a single variable at a time from a given list of features, it is called UA. Summarizing each variable helps us better understand the data.

In UA we look at the following:

Central tendency (mean, median, mode) and dispersion of the variable
Distribution of the variable: symmetric, right-skewed, or left-skewed
Missing values and outliers
Count and count percent: observing the frequency of each category in a categorical variable helps us understand and deal with that variable.

Why UA?

UA lets us explore each variable and check it for anomalies such as outliers and missing values, both of which are covered later in this guide.

Methods for UA

For continuous variables:

Tabular method: Used to describe central tendency, dispersion, and missing values.

Graphical method: Used to inspect the distribution and check for outliers. Histograms show the distribution, and box plots help detect outliers.

A violin plot combines a box plot with a kernel density estimate of the distribution, so it shows in one figure roughly what a histogram and a box plot show separately.

Guide 2: Methods of Univariate Analysis for continuous variables
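As an illustration of the tabular method for a continuous variable, the sketch below summarizes a small, hypothetical `income` series with pandas and NumPy. The data and the choice of five histogram bins are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous variable with one missing value and one extreme value.
income = pd.Series([30, 32, 35, 36, 38, 40, 41, 45, np.nan, 120], name="income")

# Tabular method: central tendency, dispersion, and missing values.
summary = {
    "mean": income.mean(),
    "median": income.median(),
    "std": income.std(),
    "min": income.min(),
    "max": income.max(),
    "missing": int(income.isna().sum()),
    "skew": income.skew(),  # a positive value suggests a right-skewed distribution
}

# The numbers behind a histogram: how many observations fall in each bin.
counts, bin_edges = np.histogram(income.dropna(), bins=5)

print(summary)
print(counts, bin_edges)
```

Here the mean sits well above the median and nearly all observations land in the first bin, both signs of the single extreme value on the right.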

For categorical variables:

Tabular method: The ".value_counts()" method of a pandas Series in Python returns a table of frequencies.

Graphical method: A bar plot is usually the best graph for a categorical variable.

Guide 3: Methods of Univariate Analysis for categorical variables
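A short sketch of the tabular method for a categorical variable, using `value_counts()` for the count and the count percent. The `city` series is hypothetical example data.

```python
import pandas as pd

# Hypothetical categorical variable.
city = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"], name="city")

counts = city.value_counts()                       # frequency of each category
percent = city.value_counts(normalize=True) * 100  # count percent

# Combine both into one frequency table.
table = pd.DataFrame({"count": counts, "percent": percent.round(1)})
print(table)
```

If matplotlib is installed, `counts.plot(kind="bar")` draws the corresponding bar plot from the same counts.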

Bivariate Analysis (BA)

What is BA?

When we study the empirical relationship between two variables, it is called BA.

Why BA?

It helps to detect anomalies, understand how two variables depend on each other, and see the impact of each variable on the target variable.

Methods for BA

1. For continuous-continuous types: There are two ways to study the relationship between two continuous variables: a scatter plot and correlation analysis.

Guide 4: Bivariate analysis for continuous-continuous type variables
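Correlation analysis can be sketched as follows. The article does not say which coefficient it has in mind, so this example assumes the Pearson correlation coefficient, computed once from its definition and once with pandas. The `area_sqft` and `price_lakh` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical pair of continuous variables.
df = pd.DataFrame({
    "area_sqft": [500, 750, 900, 1100, 1400, 1800],
    "price_lakh": [25, 36, 44, 52, 70, 88],
})

x = df["area_sqft"].to_numpy(dtype=float)
y = df["price_lakh"].to_numpy(dtype=float)

# Pearson r from its definition: the covariance of x and y divided by
# the product of their standard deviations.
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# The same quantity through pandas (Pearson is the default method).
r_pandas = df["area_sqft"].corr(df["price_lakh"])

print(round(r_manual, 4), round(r_pandas, 4))
```

A value near +1 or -1 indicates a strong linear relationship. A scatter plot of the same two columns remains useful, because a correlation coefficient alone can hide non-linear patterns and outliers.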

2. For categorical-continuous types: Here we can use bar plots and t-tests for the analysis.

The t-test is an inferential statistical test used to determine whether there is a significant difference between the means of two groups/categories; for more than two groups, analysis of variance (ANOVA) is the usual choice. Calculating a t-test requires the difference between the mean values, the standard deviation of each category, and the number of observations in each category.

Guide 5: Bivariate analysis for categorical-continuous type variables
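The ingredients of a t-test (group means, standard deviations, and group sizes) can be put together by hand. The sketch below assumes Welch's form of the two-sample t statistic, which does not require equal variances; the article does not specify a variant. The `owns_house` and `income` data are hypothetical.

```python
import math
import pandas as pd

# Hypothetical data: one categorical variable and one continuous variable.
df = pd.DataFrame({
    "owns_house": ["yes"] * 5 + ["no"] * 5,
    "income": [52, 58, 61, 55, 64, 40, 45, 38, 47, 42],
})

# Per-category mean, sample standard deviation, and number of observations.
stats = df.groupby("owns_house")["income"].agg(["mean", "std", "count"])
yes, no = stats.loc["yes"], stats.loc["no"]

# Welch's t statistic: difference of means divided by its standard error.
standard_error = math.sqrt(
    yes["std"] ** 2 / yes["count"] + no["std"] ** 2 / no["count"]
)
t_stat = (yes["mean"] - no["mean"]) / standard_error

print(stats)
print(round(t_stat, 2))
```

The statistic alone does not give significance; it has to be compared with a t distribution. With SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.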

3. For categorical-categorical types: A two-way table and the chi-square test are used to analyze the relationship between two categorical variables.
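A two-way table and the chi-square statistic can be sketched with `pd.crosstab` and a few array operations. The `gender` and `owns_house` columns are hypothetical, and the statistic below is the plain Pearson chi-square without any continuity correction.

```python
import pandas as pd

# Hypothetical pair of categorical variables.
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F", "M", "F"],
    "owns_house": ["yes", "yes", "yes", "no", "no", "no", "no", "yes", "yes", "no"],
})

# Two-way table of observed counts.
observed = pd.crosstab(df["gender"], df["owns_house"])

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
total = observed.to_numpy().sum()

# Expected count of each cell if the two variables were independent.
expected = row_totals[:, None] * col_totals[None, :] / total

# Pearson chi-square statistic and its degrees of freedom.
chi_square = ((observed.to_numpy() - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(observed)
print(chi_square, dof)
```

A larger statistic means the observed table is further from what independence would predict. SciPy's `scipy.stats.chi2_contingency` also returns a p-value, but it applies Yates' continuity correction to 2x2 tables by default, so its statistic will differ from this hand calculation unless `correction=False` is passed.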

Missing Values

Reasons for Missing Values

Values can be missing from data for various reasons, some of which are:

No response may have been recorded.
There can be an error while recording the data.
There can be an error while reading the data, etc.

Types of Missing Values

Missing Completely at Random (MCAR): The missingness has no relation to any other variable or to the values of the variable in which it occurs.

Missing at Random (MAR): The missingness has no relation to the values of the variable itself, but shows an observable trend with other variables. E.g., income data for people older than 60 years may be missing because people of that age are generally retired.

Missing Not at Random (MNAR): The missingness is related to the values of the variable itself. E.g., houses priced above Rs. 2 crores may be missing from the database because there are few buyers at that price.

Methods of Dealing with Missing Values

There are two basic methods for dealing with missing values:

Deletion: We delete all rows that contain missing values from the dataset before training the model.

Imputation: We fill in the missing values; there are various methods for doing so.

Guide 6: Treating Missing values
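Both approaches can be sketched in pandas. The imputation rules used here (median for a numeric column, most frequent category for a categorical column) are common choices assumed for illustration, not necessarily the ones the original guide lists. The `age` and `city` data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, 41, np.nan, 60, 33],
    "city": ["Pune", None, "Pune", "Delhi", "Pune"],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# Deletion: drop every row that contains at least one missing value.
deleted = df.dropna()

# Imputation: fill numeric gaps with the median and categorical gaps with the mode.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_per_column)
print(deleted)
print(imputed)
```

Deletion is simple but here it discards two of five rows. Imputation keeps every row at the cost of introducing values that were never observed.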

Outliers

Types of Outliers and Their Identification

There are two types of outliers:

Univariate outliers: These can be identified using a box plot.

Bivariate outliers: These can be identified using a scatter plot of the two variables.

Criteria for an Outlier

Criteria for X to be an outlier, with observations sorted in ascending order:
Q1: first quartile (25th percentile), the median of the lower half of the observations
Q2: median of all observations
Q3: third quartile (75th percentile), the median of the upper half of the observations
IQR: interquartile range = Q3 - Q1
X is an outlier if it satisfies: X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
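The IQR rule can be sketched in a few lines of pandas. The `values` series is hypothetical, and `quantile` uses linear interpolation by default, so the quartiles can differ slightly from those of the "median of each half" method on small samples.

```python
import pandas as pd

# Hypothetical observations with one extreme value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Fences beyond which an observation counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(q1, q3, iqr, lower, upper)
print(outliers.tolist())
```

These are the same fences a box plot uses to decide where its whiskers end, which is why a box plot identifies univariate outliers.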

Treatment of Outliers

We can delete the observation.

We can impute the outlier's value using the same methods used for imputing missing values.

We can apply transformations (discussed next).

Variable Transformation

Normalization can improve the accuracy of many models. But what exactly is normalization? It is one of the techniques of variable transformation.

In variable transformation, we replace a variable with a function of it. For example, we replace the variable x with its log value.

Transformation can address the following issues observed in the earlier EDA steps:

We can change the scale of the variable (redefining the limits of a variable).

We can convert a non-linear relationship into a linear one.

Many algorithms are observed to perform better on symmetrically distributed variables than on skewed ones, so we can convert a skewed distribution into a symmetric one.

Methods of Variable Transformation

Non-linear transformation: We can replace the variable with its log value, square root, or cube root. Because these transformations are non-linear, they help with all the points stated above.

Binning: We can divide the continuous values into bins, converting a continuous variable into a categorical one. This can place outliers into categories that the model can handle.
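Both methods can be sketched with NumPy and pandas. The `income` values, the bin edges, and the labels `low`, `medium`, and `high` are hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable with one extreme value.
income = pd.Series([20, 25, 30, 35, 40, 60, 90, 400], name="income")

# Non-linear transformations: each compresses large values more than small ones.
log_income = np.log(income)
sqrt_income = np.sqrt(income)
cbrt_income = np.cbrt(income)

# Binning: convert the continuous variable into ordered categories.
# By default pd.cut makes intervals closed on the right: (0, 30], (30, 60], (60, inf].
bins = [0, 30, 60, np.inf]
labels = ["low", "medium", "high"]
income_band = pd.cut(income, bins=bins, labels=labels)

print(round(income.skew(), 2), round(log_income.skew(), 2))
print(income_band.tolist())
```

The log transform reduces the skew without removing it entirely, and binning places the extreme value 400 in the same `high` band as 90. A log transform only works for strictly positive values, so a variable containing zeros or negatives needs a shift or a different function.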

Summing up

This is an extensive guide to Exploratory Data Analysis. It covers not only how to detect anomalies but also how to deal with them. It is a very basic approach to EDA, so most of the core topics are covered.

Translated from: https://towardsdatascience.com/the-eda-theoretical-guide-b7cef7653f0d
