Pandas: First Step Towards Data Science (Part 3)

Published: 2023/11/29

Data is almost never perfect. Data scientists often spend more time preprocessing a dataset than building a model. We frequently come across datasets in which some data is missing. In Pandas, such data points are represented as NaN (Not a Number). It is therefore important to discover which columns contain NaN/null values early in the analysis.
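
To make the idea of NaN concrete, here is a minimal sketch. The `sample` dataframe is made up for illustration and is not the Titanic data used in this article; it only assumes that `numpy` and `pandas` are installed.

```python
import numpy as np
import pandas as pd

# A tiny made-up table; both np.nan and None mark missing values.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": ["C85", None, None],
})

# NaN is a special floating-point value that is not equal to anything,
# not even to itself, so `==` cannot be used to find it.
nan_equals_itself = np.nan == np.nan

# pd.isna() is the reliable way to test for missing values.
age_missing = pd.isna(sample["Age"]).tolist()
cabin_missing = pd.isna(sample["Cabin"]).tolist()

print(nan_equals_itself)  # False
print(age_missing)        # [False, True, False]
print(cabin_missing)      # [False, True, True]
```

Because NaN never compares equal to itself, the null-checking methods shown later are needed instead of ordinary comparisons.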

We have covered many Pandas methods in the earlier parts of this series. If you haven't read them, I recommend going through those articles first so that you can follow along. If you have been following from the beginning, let's get started.

In this article, we are going to learn:

What is NaN?

How to find NaN in a dataset?

How to deal with NaN as a beginner?

Finally, some methods to make a dataframe more readable.

How to find NaN in a dataset?

To check for NaN values in a column or in an entire dataframe, we use isnull() or isna(). Both work the same way (isnull() is an alias of isna()), so we will use isnull() in this article. If you want to understand why there are two methods for the same task, you can learn it here. Let's begin by checking for null values in the entire dataset.

>> titanic_data.info()
output :
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

This output contains valuable information about the dataset, but the part we are interested in is the Non-Null Count column, which shows the number of non-null data points in each column. The RangeIndex line shows that there are 891 entries in total, that is, 891 data points. We can also check the number of non-null entries in each column directly with the count() method.

>> print(titanic_data.count())
output :
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
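
Since count() reports only the non-null entries, the number of missing values per column can be derived by subtracting it from the number of rows. The sketch below uses a small made-up dataframe named `sample` (an assumption for illustration, not the Titanic data) that borrows a few of its column names.

```python
import numpy as np
import pandas as pd

# Made-up sample that reuses a few column names from the article's dataset.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# count() skips missing values, so rows minus count gives the number
# of missing entries in each column.
missing_per_column = len(sample) - sample.count()

print(missing_per_column)
```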

From this we can conclude that Age, Cabin and Embarked are the columns with null values. As discussed earlier, there is another way to get this result: the isnull() method.

>> print(titanic_data.isnull().any())
output :
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

>> print(titanic_data.isnull().sum())
output :
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we can see, this result is much easier to read if we are solely interested in null values.
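
The same boolean mask can be summarised in other ways. As a sketch on a made-up `sample` dataframe (not the Titanic data), the share of missing values per column and the list of affected columns could be obtained like this:

```python
import numpy as np
import pandas as pd

# Made-up sample that reuses a few column names from the article's dataset.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

null_mask = sample.isnull()

# The mean of a boolean column is the fraction of True values,
# i.e. the share of missing entries in that column.
missing_share = null_mask.mean()

# Keep only the names of the columns that contain at least one null.
columns_with_nulls = sample.columns[null_mask.any()].tolist()

print(missing_share)
print(columns_with_nulls)  # ['Age', 'Cabin']
```

Looking at the share rather than the raw count makes it easier to compare columns when deciding how to treat them.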

How to deal with NaN as a beginner?

It is important to know the number of null values in a column, because it helps us decide how to deal with them. If there are only a few null values, as in Embarked, we can remove those entries from the dataset. If most of the values are null, as in Cabin, it is better to skip that column when creating a model.
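
Both options can be sketched with dropna() and drop(). The `sample` dataframe below is made up for illustration; only its column names follow the Titanic data.

```python
import pandas as pd

# Made-up sample: Embarked has one null, Cabin is mostly null.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Few nulls (Embarked): drop only the rows where that column is missing.
rows_kept = sample.dropna(subset=["Embarked"])

# Mostly nulls (Cabin): drop the whole column instead.
cleaned = rows_kept.drop(columns=["Cabin"])

print(cleaned)
```

Both calls return new dataframes, so `sample` itself still holds all four rows and three columns.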

There is a third case, where there are too many null values to remove the entries but too few to skip the column, as with Age here. There are many ways to deal with null values in such cases, but as beginners we will learn just one trick: filling them with a value. We will use the fillna() method to do that.

>> # Fill through the dataframe so that the change lands in titanic_data itself
>> titanic_data.fillna({"Age": "Unknown"}, inplace=True)
>> print(titanic_data.Age.isnull().any())
output :
False
# The Age column no longer has any null values

We used the inplace argument so that the changes are made in the dataframe that calls the method. If we do not pass this argument, or leave it as False, the method returns a new object and the changes do not appear in our dataset. We can also check whether a specific column has null values in the same manner as we did for the whole dataset.
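
The effect of leaving out inplace can be seen in a small sketch. It uses a made-up `sample` dataframe and, as an assumption that differs from the example above, fills with the column median rather than a string, which keeps the column numeric.

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing ages.
sample = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 58.0]})

# median() skips NaN values, so this is the median of 22, 35 and 58.
median_age = sample["Age"].median()

# Without inplace=True, fillna() returns a new Series and leaves
# `sample` untouched.
filled = sample["Age"].fillna(median_age)
nulls_before_assignment = int(sample["Age"].isnull().sum())

# Assigning the result back is an alternative to inplace=True.
sample["Age"] = filled
nulls_after_assignment = int(sample["Age"].isnull().sum())

print(nulls_before_assignment)  # 2
print(nulls_after_assignment)   # 0
```

Which fill value is appropriate depends on the data and the model; the median here is only one possible choice.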

We can also replace values in a column that are not NaN, using the replace() method.

>> titanic_data.replace({"Sex": {"male": "M", "female": "F"}}, inplace=True)
>> print(titanic_data.Sex)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex, Length: 891, dtype: object

Some methods to make a dataset more readable

1. rename() : There might be situations where we realize that a column name does not suit our requirements. We can use the rename() method to change the column name.

>> titanic_data.rename(columns={"Sex": "Gender"}, inplace=True)
>> print(titanic_data.Gender)
output :
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Gender, Length: 891, dtype: object
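
rename() can also change several columns in one call, or apply a function to every column name. The following is a sketch on a made-up `sample` dataframe; the new name `TicketClass` is hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up sample that reuses a few column names from the article's dataset.
sample = pd.DataFrame({"Sex": ["M", "F"], "Pclass": [3, 1], "SibSp": [1, 0]})

# Several columns can be renamed at once; columns that are not listed
# keep their names. Without inplace=True a new dataframe is returned.
renamed = sample.rename(columns={"Sex": "Gender", "Pclass": "TicketClass"})

# A function can be passed instead of a dict; here every column name
# is converted to lower case.
lowercased = sample.rename(columns=str.lower)

print(list(renamed.columns))     # ['Gender', 'TicketClass', 'SibSp']
print(list(lowercased.columns))  # ['sex', 'pclass', 'sibsp']
```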

2. rename_axis() : It is a simple method and, as the name suggests, it is used to provide names for the axes.

>> titanic_data.rename_axis("Sr.No", axis='index', inplace=True)
>> titanic_data.rename_axis("Category", axis='columns', inplace=True)
>> print(titanic_data.head(2))
output :
Category  PassengerId  Survived  Pclass  .....
Sr.No
0                   1         0       3
1                   2         1       1
[2 rows x 12 columns]
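
The axis names set by rename_axis() are stored on the index and on the columns. As a sketch on a made-up two-row `sample` dataframe, they can be inspected like this, and a named index can later be turned into an ordinary column:

```python
import pandas as pd

# Made-up sample with two of the article's column names.
sample = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Each call returns a new dataframe, so the two calls can be chained.
labelled = (
    sample
    .rename_axis("Sr.No", axis="index")
    .rename_axis("Category", axis="columns")
)

print(labelled.index.name)    # Sr.No
print(labelled.columns.name)  # Category

# reset_index() moves the named index into a regular column of that name.
flat = labelled.reset_index()
print(list(flat.columns))     # ['Sr.No', 'PassengerId', 'Survived']
```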

With this we come to the end of this article and of this series on Pandas. I believe the methods we came across in this series are very helpful for analyzing data before we start training models on it. However, they are just a small fraction of the methods in the Pandas library, and just the beginning of data exploration and preprocessing. As a beginner, I think these are enough to get started on a Data Science journey. I hope you found this series valuable. Thank you for reading. Keep practicing. Happy Coding! 😄

Translated from: https://medium.com/swlh/pandas-first-step-towards-data-science-part-3-351321c24cc0
