Python's Built-in Datasets: A Roundup

Published: 2023/12/15 python 27 豆豆

Seaborn's Built-in Datasets

When learning about Pandas pivot tables, you may have noticed that the "Titanic" example data comes from seaborn's online dataset repository. The seaborn function load_dataset("dataset name") fetches the corresponding data online and returns it as a pandas DataFrame object.

import seaborn as sns

df = sns.load_dataset('titanic')
df.sample(5)

The returned DataFrame makes it easy to explore the data in more depth. Example code:

df = sns.load_dataset("tips")

print("\n[Dataset overview]\n")
df.info()  # info() prints its report directly and returns None

print("\n[Numeric variables]\n")
print(df.describe())

print("\n[Categorical variables]\n")
# Columns whose dtype is category or object hold the discrete variables
for name in df.dtypes[(df.dtypes == "category") | (df.dtypes == "object")].index:
    print("{} unique values : {}".format(name, str(df[name].unique())))

The code prints the following summary:

[Dataset overview]

RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)

[Numeric variables]

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

[Categorical variables]

sex unique values : [Female, Male]
smoker unique values : [No, Yes]
day unique values : [Sun, Sat, Thur, Fri]
time unique values : [Dinner, Lunch]
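The dtype mask used in the loop above can also be expressed with pandas' select_dtypes. The following is a minimal sketch under that assumption. The helper name categorical_summary and the small inline table are hypothetical stand-ins (the real tips data needs a network download) and are not part of the original example.

```python
import pandas as pd


def categorical_summary(df):
    """Map each category/object column to a list of its distinct values."""
    discrete = df.select_dtypes(include=["category", "object"])
    return {name: list(discrete[name].unique()) for name in discrete.columns}


# Hypothetical miniature table shaped like the tips data, built inline
# so the sketch runs without downloading anything.
sample = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "sex": pd.Categorical(["Female", "Male", "Male"]),
        "time": pd.Series(["Dinner", "Dinner", "Lunch"], dtype=object),
    }
)

summary = categorical_summary(sample)
print(summary)
```

Returning a dictionary instead of printing keeps the result reusable, for example to check that a column contains only the expected labels.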

The full set of datasets bundled with seaborn is listed in the seaborn example data repository: https://github.com/mwaskom/seaborn-data

Sklearn's Built-in Datasets

1. Small datasets

Example of loading and inspecting the data:

from sklearn import datasets
import pandas as pd

dataset = datasets.load_iris()

print("Keys available in the dataset:")
print(" ".join(dataset.keys()))

print("\nDataset description:\n")
print(dataset["DESCR"])

data = dataset["data"]
target = dataset["target"]

# Combine the feature matrix and the labels into one DataFrame
df = pd.DataFrame(data, columns=dataset["feature_names"])
df["target"] = target

df.sample(10)
df.info()
df.describe()

Details on sklearn's small datasets: https://scikit-learn.org/stable/datasets/index.html#toy-datasets
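Assuming a reasonably recent scikit-learn (the as_frame parameter was added in version 0.23), the manual DataFrame construction above can likely be shortened. The sketch below is one way to do it. The extra "species" column is our own addition for readability and does not come from the original example.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects
iris = datasets.load_iris(as_frame=True)

df = iris.frame  # feature columns plus a "target" column

# Map the integer labels to their class names for readability
# ("species" is a column name chosen for this sketch)
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)
print(df["species"].value_counts())
```

The iris data ships with the library, so this runs without a network connection.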

2. Larger datasets (downloaded online)

Example of loading the 20 newsgroups dataset:

from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

# Downloads the data on first use and caches it locally
newsgroups_train = fetch_20newsgroups(subset='train')

pprint(list(newsgroups_train.target_names))
print(newsgroups_train.filenames.shape)  # (11314,)
print(newsgroups_train.target.shape)  # (11314,)
print(newsgroups_train.target[:10])  # [ 7 4 4 1 14 16 13 3 2 4]
print(newsgroups_train['data'][:2])  # first two articles: ["From: lerxst@wam.umd.edu (where's my thin...

Details on sklearn's larger datasets: https://scikit-learn.org/stable/datasets/index.html#real-world-datasets

sklearn news text classification in practice: https://www.jianshu.com/p/244180c064cf
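A common first check on a labelled text dataset like this is the number of documents per class. The sketch below shows one way to compute it from the target and target_names fields. The helper class_counts and the tiny hand-built Bunch are hypothetical, used only so the code runs without the download.

```python
import numpy as np
from sklearn.utils import Bunch


def class_counts(bunch):
    """Return {class name: number of documents} from target and target_names."""
    counts = np.bincount(bunch.target, minlength=len(bunch.target_names))
    return {name: int(n) for name, n in zip(bunch.target_names, counts)}


# Hypothetical stand-in exposing the same fields as the object returned
# by fetch_20newsgroups; the texts and labels here are made up.
fake = Bunch(
    data=["text a", "text b", "text c", "text d"],
    target=np.array([0, 2, 2, 1]),
    target_names=["alt.atheism", "comp.graphics", "sci.space"],
)

print(class_counts(fake))
```

With the real data, class_counts(newsgroups_train) should work the same way once the download has succeeded.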

Other Data Sources

1. UCI Machine Learning Repository

The UCI Machine Learning Repository contains many datasets of different sizes and types that can be used for classification, regression, clustering, and recommender-system tasks. Link: https://archive.ics.uci.edu/ml/index.php

2. Weka datasets: https://www.cs.waikato.ac.nz/ml/weka/datasets.html

3. KDnuggets datasets: https://www.kdnuggets.com/datasets/index.html

4. UCI KDD Archive: http://kdd.ics.uci.edu/
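Files from repositories like these are often plain delimited text without a header row (an assumption; check each dataset's documentation). The sketch below shows how such a file might be read with pandas. The inline rows and the column names are illustrative stand-ins for a downloaded file, not content taken from any of the links above.

```python
import io

import pandas as pd

# Illustrative rows in a headerless, comma-separated layout; this inline
# text stands in for a file downloaded from one of the repositories.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Column names are our own labels (an assumption); the file carries none
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None tells pandas the first line is data, not column names
df = pd.read_csv(raw, header=None, names=columns)

print(df.dtypes)
print(df)
```

For a real file, the StringIO object would be replaced by the local path of the downloaded data.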
