Learning Pandas Profiling

Published: 2023/12/10 · Programming Q&A · 豆豆

Introduction

Being a data scientist today is an exciting and rewarding career. With the explosion of technology and the immense amount of data and content created daily, data scientists continually need to learn new ways of analysing data efficiently. One of the most crucial parts of any new data project is the exploratory data analysis phase. This phase lets you familiarise yourself with the data at hand: where it was collected from, any gaps in it, any potential outliers and the range of data types used. One tool that has become a staple among data scientists is Pandas Profiling. Pandas Profiling is an open-source tool written in Python that generates interactive HTML reports. These reports detail the types of data within the dataset, highlight missing values, provide descriptive statistics including mean, standard deviation and skewness, create histograms and return any potential correlations.

Installing Pandas Profiling

For this article, we are using PyCharm, an integrated development environment created by JetBrains. PyCharm is an excellent tool to use as it handles tasks including creating a virtual environment for the project and installing the packages referenced in your code.

To get started, open PyCharm and select File > New Project. You will be presented with a dialogue where you can name the project and create an associated virtual environment. Virtual environments allow you to install specific Python packages that your project can reference without having to install them globally on your machine. This is handy when you have multiple projects that require different versions of the same package.

Once the default packages have been installed in the virtual environment, we need to install Pandas Profiling. To do this, navigate to File > Settings > Project > Project Interpreter, select the + button in the top right, search for pandas-profiling and then press Install Package.

Installing Pandas Profiling using PyCharm's Project Interpreter.
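A quick way to confirm that the package is visible to the project interpreter is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name is_installed is hypothetical, and the import name pandas_profiling (with an underscore) is assumed to be the one that matches the pandas-profiling package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# The import name uses an underscore, unlike the hyphenated package name.
print("pandas_profiling available:", is_installed("pandas_profiling"))
```

If this prints False inside the project, the package was most likely installed into a different interpreter than the one the project uses.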

Getting Started

For this example, we have created a simple Python script that you can use to get started. If this is your first time using Python please read Getting Started — Python Pandas where we explain the code within the script below.

A Python script that generates an HTML Pandas Profiling Report using fake data.
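As a minimal sketch of such a script, the example below builds fake employee records and hands them to the profiler. The columns employee_number, name, email and city follow the ones mentioned in this article; the salary column, the sample name lists and the helper names make_fake_employees and write_report are assumptions made for illustration. The fake data comes from the standard library rather than a dedicated fake-data package, and the fallback import reflects that newer releases of the tool appear to be distributed as ydata-profiling.

```python
import random


def make_fake_employees(n_rows: int = 200, seed: int = 42) -> list[dict]:
    """Build fake employee records using only the standard library."""
    rng = random.Random(seed)
    first_names = ["Ava", "Ben", "Chloe", "Dev", "Emma", "Finn", "Grace", "Hugo"]
    last_names = ["Smith", "Jones", "Nguyen", "Patel", "Brown", "Wilson"]
    cities = ["Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide", None]
    records = []
    for employee_number in range(1, n_rows + 1):
        first = rng.choice(first_names)
        last = rng.choice(last_names)
        records.append({
            "employee_number": employee_number,
            "name": f"{first} {last}",
            "email": f"{first}.{last}{employee_number}@example.com".lower(),
            "city": rng.choice(cities),  # None simulates a missing value
            "salary": round(rng.gauss(70_000, 15_000), 2),  # hypothetical column
        })
    return records


def write_report(records: list[dict], path: str = "pandas_profile_text.html") -> None:
    """Profile the records and save the HTML report.

    Requires pandas and the profiling package to be installed.
    """
    import pandas as pd
    try:
        from ydata_profiling import ProfileReport  # newer package name
    except ImportError:
        from pandas_profiling import ProfileReport  # name used in this article

    df = pd.DataFrame(records)
    ProfileReport(df, title="Pandas Profiling Report").to_file(path)
```

Calling write_report(make_fake_employees()) from the project root would then write pandas_profile_text.html next to the script.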

After executing the script, a new HTML file called pandas_profile_text.html will be created in your project root directory. To view the report, right-click the HTML file and select Open in Browser > Default.

Pandas Profiling Report

Overview

Overview section within the Pandas Profiling Report

The Overview section, the first section of the Pandas Profiling Report, shows summarised statistics for the dataset as a whole. It returns the number of variables, which is the number of columns in the passed DataFrame, and the number of observations, which is the number of rows. The Overview also gives the number of missing cells and duplicate rows, each with the percentage of the total that is affected. The missing cell and duplicate row statistics are important to a data scientist, as they may indicate broader data quality issues or problems with the code used to extract the data. The Overview also reports the size of the dataset in memory, the average record size in memory and the data types that were recognised.
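To make these headline numbers concrete, here is a small sketch that computes them by hand in plain Python. It is not how the tool is implemented; the function name overview and the representation of the data as a list of dictionaries (with None for a missing cell) are assumptions chosen to keep the arithmetic visible.

```python
def overview(rows: list[dict]) -> dict:
    """Summarise rows given as a list of dicts that share the same keys."""
    n_rows = len(rows)
    columns = list(rows[0].keys()) if rows else []
    n_cells = n_rows * len(columns)
    missing = sum(1 for row in rows for col in columns if row[col] is None)

    # A row counts as a duplicate if an identical row appeared earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {
        "variables": len(columns),
        "observations": n_rows,
        "missing_cells": missing,
        "missing_cells_pct": 100 * missing / n_cells if n_cells else 0.0,
        "duplicate_rows": duplicates,
        "duplicate_rows_pct": 100 * duplicates / n_rows if n_rows else 0.0,
    }


rows = [
    {"name": "Ava", "city": "Perth"},
    {"name": "Ben", "city": None},
    {"name": "Ava", "city": "Perth"},
    {"name": "Dev", "city": "Sydney"},
]
print(overview(rows))
```

Here one of the eight cells is missing and one of the four rows repeats an earlier row, which is the kind of signal that would prompt a closer look at the extraction code.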

Under the Warnings tab within the Overview section, you can find collated warnings for any of the variables in the dataset. In this example, we received a high cardinality warning for name, email and city. In this context, high cardinality means that the flagged columns contain a very high number of distinct values. In the real world you would expect this for columns such as employee number and email.
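The idea behind a cardinality warning can be sketched as counting the distinct values in each column and flagging the columns that exceed a limit. The function name high_cardinality and the threshold of 50 are assumptions for illustration; the rule and default that the tool itself applies may differ.

```python
def high_cardinality(rows, columns, threshold=50):
    """Return {column: distinct_count} for columns above the threshold.

    The threshold is a hypothetical value, not the tool's documented default.
    """
    flagged = {}
    for col in columns:
        distinct = {row[col] for row in rows if row[col] is not None}
        if len(distinct) > threshold:
            flagged[col] = len(distinct)
    return flagged


rows = [
    {"email": f"user{i}@example.com", "department": f"dept{i % 3}"}
    for i in range(100)
]
print(high_cardinality(rows, ["email", "department"]))
```

In this toy data every email is unique, so that column is flagged, while the three departments are not.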

Variables: Categorical

Pandas Profiling Report results for a categorical variable

The Variables section of the Pandas Profiling Report analyses the columns within the passed DataFrame. A categorical variable is a column whose data is of the Python string type.

A typical metric returned for categorical variables is the length of the strings within the column. To view the generated histogram, select Toggle Details and then navigate to the Length tab. The Length tab also contains the maximum, median, mean and minimum values of the string length.
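The length statistics are straightforward to reproduce by hand, which can help when sanity-checking a report. The sketch below assumes a column held as a plain list of strings; the function name length_stats and the sample names are made up for illustration.

```python
from statistics import mean, median


def length_stats(values):
    """Return max, median, mean and min of the string lengths."""
    lengths = [len(v) for v in values]
    return {
        "max": max(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "min": min(lengths),
    }


names = ["Ava Smith", "Ben Jones", "Chloe Nguyen", "Dev Patel"]
print(length_stats(names))
```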

Variables: Numerical

Pandas Profiling Report results for a numerical variable

Pandas Profiling offers an in-depth analysis of numerical variables covering quantile and descriptive statistics. It returns the minimum and maximum values within the dataset and the range between them. It displays quartile values, which describe how the ordered values are distributed above and below the median by dividing the set into four bins. When considering the quartile values, if the distance between quartile one and the median is greater than the distance between the median and quartile three, we interpret this as the smaller values being more scattered than the larger values. The interquartile range is simply quartile three minus quartile one.
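The quantile statistics described above can be sketched with the standard library. The function name quartile_summary is hypothetical, and the choice of method="inclusive" is an assumption; the report may use a different interpolation rule, so small differences on real data are possible.

```python
from statistics import quantiles


def quartile_summary(values):
    """Return min, quartiles, max, range and interquartile range."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    lowest, highest = min(values), max(values)
    return {
        "min": lowest,
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": highest,
        "range": highest - lowest,
        "iqr": q3 - q1,  # quartile three minus quartile one
    }


print(quartile_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

For this evenly spaced sample the distance from quartile one to the median equals the distance from the median to quartile three, so neither half is more scattered than the other.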

Standard deviation reflects how the values in the dataset are distributed around its mean. A low standard deviation implies that the values are close to the mean, whereas a higher standard deviation implies that the values are spread over a greater range. The coefficient of variation, also known as relative standard deviation, is the ratio of the standard deviation to the mean. Kurtosis describes the shape of the data by measuring how heavy the tails of the distribution are; its value varies with the distribution of the data and the presence of extreme outliers. The median absolute deviation is another measure that reflects the spread of the data around the median, and it is more robust than the standard deviation when an extreme outlier is present. Skewness reflects the level of distortion from a symmetric bell-shaped probability distribution. Positive skewness means the distribution is skewed to the right, with a longer tail on the right, while negative skewness means a longer tail on the left.
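These descriptive statistics can be written out in a few lines, which makes the definitions easier to compare. The sketch below is an illustration under stated assumptions rather than the tool's own formulas: the function name spread_summary is hypothetical, the standard deviation is the sample version, and skewness and excess kurtosis are computed from population moments, so a report that applies bias corrections may show slightly different values.

```python
from statistics import mean, median, stdev


def spread_summary(values):
    """Return std, coefficient of variation, MAD, skewness and excess kurtosis."""
    n = len(values)
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    med = median(values)
    mad = median([abs(v - med) for v in values])  # median absolute deviation

    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n

    return {
        "std": sd,
        "cv": sd / mu,  # ratio of standard deviation to mean
        "mad": mad,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis
    }


print(spread_summary([1, 2, 3, 4, 5]))
print(spread_summary([1, 1, 1, 1, 10]))  # one large value gives a long right tail
```

The second sample shows the effect of a single extreme value: the skewness becomes positive and the standard deviation grows, while the median absolute deviation barely reacts.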

Interaction and Correlations

Interaction graph from the Pandas Profiling Report.

The Interaction and Correlations sections are where Pandas Profiling really sets itself ahead of other exploratory tools. It analyses all the variables as pairs and highlights any highly correlated variables using the Pearson, Spearman, Kendall and Phik measures. It provides a powerful, easy-to-understand visual representation of any data that correlates strongly. For a data scientist, this is a great starting point for asking why these pairs of variables may correlate.
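Two of the measures named above can be sketched by hand to show how they differ. Pearson measures linear association, while Spearman is Pearson applied to the ranks of the values and so captures any monotonic relationship. The function names pearson, ranks and spearman are hypothetical helpers written for this illustration, and Kendall and Phik are left out to keep the sketch short.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values starting at 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # y grows with x, but not in a straight line
print("pearson:", pearson(x, y))
print("spearman:", spearman(x, y))
```

Because y always increases with x, the rank-based measure reports a perfect relationship, while the linear measure is slightly lower.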

Missing Values

Missing values bar chart from the Pandas Profiling Report

The Missing Values section builds on the missing cells metric from the Overview section. It shows visually where the missing values occur across all the columns within the DataFrame. This section may highlight data quality issues, and it may show that missing data needs to be mapped to a default value, which we will cover in a later article.
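A rough text version of the same idea is to count the missing values per column and print one bar per column. This is only a sketch in plain Python: the function name missing_by_column, the sample rows and the use of None to mark a missing cell are assumptions for illustration.

```python
def missing_by_column(rows):
    """Count None values per column for a list of dicts with the same keys."""
    columns = list(rows[0].keys()) if rows else []
    return {col: sum(1 for row in rows if row[col] is None) for col in columns}


sample_rows = [
    {"name": "Ava", "email": "ava@example.com", "city": None},
    {"name": "Ben", "email": None, "city": None},
    {"name": "Chloe", "email": "chloe@example.com", "city": "Perth"},
    {"name": "Dev", "email": "dev@example.com", "city": "Sydney"},
]
missing_counts = missing_by_column(sample_rows)

# One bar per column: '#' marks a present value, '.' a missing one.
for col, n_missing in missing_counts.items():
    present = len(sample_rows) - n_missing
    print(f"{col:<6} {'#' * present}{'.' * n_missing}  ({n_missing} missing)")
```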

Sample Section

The Sample section displays a snapshot of records from the head and tail of the dataset. If the dataset is ordered on a particular column, you can use this section to understand what type of records the minimum and maximum values of that column are associated with.

Summary

Pandas Profiling is an open-source tool that every data scientist should consider adding to their toolbox for the data exploration phase of any project. It is an efficient way to digest and analyse an unfamiliar dataset, providing in-depth descriptive statistics, visual distribution graphs and a powerful set of correlation tools.

Thank you for taking the time to read our article. We hope you have found it valuable.

Translated from: https://towardsdatascience.com/learning-pandas-profiling-fc533336edc7
