
Obtaining and Using Datasets in Python (sklearn's Built-in Datasets and the UCI Datasets)

Published: 2023/12/20 python 36 豆豆



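- I. Introduction to the UCI datasets
- II. Small datasets bundled with sklearn
  - General usage of sklearn's bundled datasets
    - Iris dataset: load_iris(), a dataset for classification tasks
    - Handwritten digits dataset: load_digits()
    - Breast cancer dataset: load_breast_cancer()
    - Diabetes dataset
    - Boston housing dataset
    - Physical exercise dataset
  - Generated datasets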

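I. Introduction to the UCI datasets

The UCI datasets are a widely used collection of standard benchmark datasets for machine learning, provided by the University of California, Irvine as a machine learning repository. Machine learning algorithms are frequently tested on UCI datasets, and their importance lies in the word "standard": a newly written machine learning program can be tested on a UCI dataset, and similar algorithms can be compared head to head on the same data.

UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.php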
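Some UCI datasets, such as the iris.data file, are distributed as comma-separated text without a header row, with the class label in the last column. The following is a minimal sketch of parsing that layout with the standard library only. The inline SAMPLE text stands in for a downloaded file, and parse_uci_rows is a hypothetical helper name introduced for this illustration, not part of any library.

```python
import csv
import io

# A few rows in the comma-separated, header-less layout used by iris.data.
# This inline text is an assumption standing in for a downloaded file.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_uci_rows(text):
    """Split header-less CSV text into float features and string labels."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines, which some data files end with
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels


features, labels = parse_uci_rows(SAMPLE)
print(features[0], labels[0])
```

For a real file, the same function could be given the contents read from disk. Other UCI datasets may use different delimiters or column orders, so the layout should be checked against the dataset's own description.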

II. Small datasets bundled with sklearn

https://rosefun.blog.csdn.net/article/details/104407193
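Types of datasets in sklearn

Bundled small datasets (packaged datasets): functions prefixed with sklearn.datasets.load_

Datasets that can be downloaded online (downloaded datasets): functions prefixed with sklearn.datasets.fetch_

Computer-generated datasets (generated datasets): functions prefixed with sklearn.datasets.make_

Datasets in svmlight/libsvm format: sklearn.datasets.load_svmlight_file(…)

Datasets downloaded from mldata.org: sklearn.datasets.fetch_mldata(…). Note that fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.datasets.fetch_openml(…) is its replacement.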
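To see which functions fall into each of the three prefix families in an installed release, the module namespace can be inspected directly. This is a small sketch; names_with_prefix is a hypothetical helper introduced here, and the exact lists depend on the installed scikit-learn version.

```python
from sklearn import datasets


def names_with_prefix(prefix):
    """Return the public names in sklearn.datasets that start with prefix."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


loaders = names_with_prefix("load_")      # bundled datasets
fetchers = names_with_prefix("fetch_")    # datasets downloaded on demand
generators = names_with_prefix("make_")   # synthetic data generators

print(len(loaders), len(fetchers), len(generators))
print(loaders)
```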

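Recommended reading:

- w3cschool: https://www.w3cschool.cn/doc_scikit_learn/scikit_learn-modules-generated-sklearn-datasets-load_digits.html
- Chinese translation of the official scikit-learn (sklearn) documentation: https://sklearn.apachecn.org/docs/0.21.3/47.html
- scikit-learn official site (English): https://scikit-learn.org/stable/

Among these, the bundled small datasets are the ones loaded by functions prefixed with sklearn.datasets.load_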

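General usage of sklearn's bundled datasets

Use keys() to inspect the contents

Taking the handwritten digits dataset, load_digits(), as an example:

    from sklearn import datasets

    digits = datasets.load_digits()
    keys = digits.keys()
    print(keys)

Output:

    dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

Newer scikit-learn releases may list additional keys, such as 'frame' and 'feature_names'.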

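Attributes of the digits dataset

Content	Attribute
Dataset description	DESCR
Sample data	data
Label data	target
Label names	target_names
Image data	images

Notes:
1. data and target are both numpy.ndarray arrays.
2. data and target are common to all of sklearn's bundled datasets; target_names is present on the classification datasets but not on every regression dataset (load_diabetes() has none).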

General example:

    # Import sklearn's bundled datasets
    from sklearn import datasets

    digits = datasets.load_digits()  # load the handwritten digits dataset
    feature = digits.data            # feature matrix
    target = digits.target           # target vector
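As a possible shortcut, the load_ functions accept return_X_y=True, which returns the feature matrix and target vector directly instead of the dictionary-like object. The sketch below combines that with train_test_split to hold out a test set; the 25% test size and the random_state value are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# return_X_y=True skips the Bunch object and returns (data, target) directly.
X, y = load_digits(return_X_y=True)

# Hold out a quarter of the samples for testing (an arbitrary choice here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X.shape, y.shape)
print(X_train.shape, X_test.shape)
```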

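Iris dataset: load_iris(), a dataset for classification tasks

About the data:

Generally used for classification tests.

It contains 150 samples divided into 3 classes of 50 samples each. Each sample has 4 features.

The 4 features of each record are Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. All feature values are positive floating-point numbers measured in centimeters.

These 4 features can be used to predict which species an iris flower belongs to: iris-setosa, iris-versicolour or iris-virginica.

Attributes: see the code comments.

All of these datasets are documented on the official site. For iris, for example, a demo is available at:
https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html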

    from sklearn.datasets import load_iris
    import numpy as np
    import matplotlib.pyplot as plt

    # Load the dataset
    iris = load_iris()
    iris.keys()  # dict_keys(['target', 'DESCR', 'data', 'target_names', 'feature_names'])

    # Number of samples and number of features
    n_samples, n_features = iris.data.shape
    print("Number of sample:", n_samples)   # Number of sample: 150
    print("Number of feature", n_features)  # Number of feature 4

    # The first sample
    print(iris.data[0])        # [ 5.1 3.5 1.4 0.2]
    print(iris.data.shape)     # (150, 4)
    print(iris.target.shape)   # (150,)
    print(iris.target)         # 150 labels: fifty 0s, then fifty 1s, then fifty 2s
    print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
    np.bincount(iris.target)   # [50 50 50]

    # Histogram of one feature per class; x_index selects the feature (0, 1, 2 or 3)
    x_index = 3
    colors = ['blue', 'red', 'green']
    for label, color in zip(range(len(iris.target_names)), colors):
        plt.hist(iris.data[iris.target == label, x_index],
                 label=iris.target_names[label], color=color)
    plt.xlabel(iris.feature_names[x_index])
    plt.legend(loc="upper right")  # loc strings are lowercase
    plt.show()

    # Scatter plot: the first feature on the x axis, the second on the y axis
    x_index = 0
    y_index = 1
    for label, color in zip(range(len(iris.target_names)), colors):
        plt.scatter(iris.data[iris.target == label, x_index],
                    iris.data[iris.target == label, y_index],
                    label=iris.target_names[label], c=color)
    plt.xlabel(iris.feature_names[x_index])
    plt.ylabel(iris.feature_names[y_index])
    plt.legend(loc='upper left')
    plt.show()
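To illustrate predicting the species from the 4 features, here is a minimal sketch using a k-nearest-neighbors classifier. The choice of classifier, n_neighbors=5, the 30% test split and the random_state are assumptions made for this example; the original article does not prescribe a model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Stratify so each species keeps the same share in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit a simple k-nearest-neighbors classifier on the four measurements.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
predicted = iris.target_names[clf.predict(X_test[:1])[0]]  # species name
print(accuracy, predicted)
```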

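Handwritten digits dataset: load_digits()

Handwritten digits dataset load_digits(): a dataset for multi-class classification tasks

About the data:

The handwritten digits dataset, load_digits(), is used for classification or dimensionality-reduction tasks. It contains 1797 sample images; each sample has 64 features (an 8*8-pixel image) and an integer label in [0, 9].

Attributes:
see the code comments.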

    from sklearn.datasets import load_digits
    import numpy as np
    import matplotlib.pyplot as plt

    digits = load_digits()
    digits.keys()

    n_samples, n_features = digits.data.shape
    print((n_samples, n_features))
    print(digits.data.shape)
    print(digits.images.shape)

    # Flattening each 8*8 image gives the matching 64-feature row of data
    print(np.all(digits.images.reshape((1797, 64)) == digits.data))

    # Show the first image
    plt.gray()
    plt.matshow(digits.images[0])
    plt.show()

    # Plot the digits: each image is 8*8 pixels
    fig = plt.figure(figsize=(6, 6))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(64):
        ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
        ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
        # Label the image with its target value
        ax.text(0, 7, str(digits.target[i]))
    plt.show()
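Since the digits data is also described as suitable for dimensionality reduction, the following sketch projects the 64 features onto 2 principal components with PCA. PCA and the choice of 2 components are assumptions made for illustration; other reduction methods would work as well.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Project the 64 pixel features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

The two columns of X_2d can then be passed to a scatter plot, colored by digits.target, to see how the classes separate.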

Breast cancer dataset: load_breast_cancer()

Breast cancer dataset load_breast_cancer(): a simple, classic dataset for binary classification tasks

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer
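A short sketch of loading the breast cancer data and checking its size and class balance, following the same pattern as the other bundled datasets:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.data.shape)            # (number of samples, number of features)
print(cancer.target_names)          # names of the two classes
print(np.bincount(cancer.target))   # number of samples per class
```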

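Diabetes dataset

Diabetes dataset load_diabetes(): a classic dataset for regression tasks. Note that each of its 10 features has already been mean-centered and scaled so that the sum of squares of each column equals 1.

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes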
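The preprocessing of the diabetes features can be checked numerically. This sketch loads the data with default arguments and prints the per-column mean and sum of squares; the rounding to 6 decimals is only for readable output.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(X.shape, y.shape)

# Per-feature statistics of the already-preprocessed columns.
column_means = X.mean(axis=0)
column_sum_sq = (X ** 2).sum(axis=0)
print(np.round(column_means, 6))
print(np.round(column_sum_sq, 6))
```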

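Boston housing dataset

About the data:

This dataset poses a regression problem. It has 506 observations, 13 input variables and 1 output variable. Each record describes a house and its surroundings, including the town's crime rate, nitric oxide concentration, average number of rooms per dwelling, weighted distance to central areas and the median value of owner-occupied homes, among others.

Boston housing dataset load_boston(): a classic dataset for regression tasks. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so it is only available in older releases.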

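Physical exercise dataset load_linnerud(): a classic dataset for multivariate regression tasks. It contains two small datasets: exercise holds 20 observations of 3 exercise variables (chin-ups, sit-ups and jumps), and physiological holds 20 observations of 3 physiological variables (weight, waist and pulse).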

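Each sample line in svmlight/libsvm format is stored as:

<label> <feature-id>:<feature-value> <feature-id>:<feature-value> ....

This format is well suited to storing sparse data. sklearn stores X in a scipy sparse CSR matrix and y in a numpy array.

    from sklearn.datasets import load_svmlight_file

    x_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
    # To load several files at once, pass a list of paths to load_svmlight_files instead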
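Because the path above is a placeholder, the format can be tried out with a round trip instead: write a tiny matrix with dump_svmlight_file, then read it back. The toy values, the temporary file name and the explicit zero_based=True on both calls are choices made for this sketch.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# A toy dataset with mostly zero entries (values chosen for illustration).
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 0.0, 3.0]])
y = np.array([1, -1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.svmlight")
    # Write one "<label> <feature-id>:<feature-value> ..." line per sample.
    dump_svmlight_file(X, y, path, zero_based=True)
    # Read it back; n_features fixes the column count of the result.
    X_loaded, y_loaded = load_svmlight_file(
        path, n_features=X.shape[1], zero_based=True
    )

print(sparse.issparse(X_loaded), X_loaded.shape)
print(X_loaded.toarray())
print(y_loaded)
```

Only the non-zero entries are written to the file, which is why the format suits sparse data.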

Physical exercise dataset:

Physical exercise dataset load_linnerud(): a classic dataset for multivariate regression tasks.

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud
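A quick sketch for inspecting which variables load_linnerud() places in data and which in target:

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# data holds the exercise variables, target the physiological ones.
print(linnerud.feature_names, linnerud.data.shape)
print(linnerud.target_names, linnerud.target.shape)
```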

Generated datasets

Generated datasets can be used for classification, regression, clustering, manifold learning and decomposition tasks.
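As an illustration of the regression and manifold-learning generators, the sketch below uses make_regression and make_swiss_roll. These two functions and all parameter values are choices made for this example; the article itself only lists the task types.

```python
from sklearn.datasets import make_regression, make_swiss_roll

# Regression: 5 features, of which 3 actually influence the target.
X_reg, y_reg = make_regression(
    n_samples=100, n_features=5, n_informative=3, noise=0.1, random_state=0
)

# Manifold learning: 3-D points lying on a rolled-up 2-D sheet;
# t is each point's position along the roll.
X_roll, t = make_swiss_roll(n_samples=100, noise=0.05, random_state=0)

print(X_reg.shape, y_reg.shape)
print(X_roll.shape, t.shape)
```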

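For classification and clustering tasks: these functions produce a matrix of sample feature vectors together with the corresponding class labels.

make_blobs: a multi-class, single-label dataset; assigns each class one or more normally distributed clusters of points

make_classification: a multi-class, single-label dataset; assigns each class one or more normally distributed clusters of points, and provides ways to add noise to the data, including correlated dimensions, uninformative features and redundant features

make_gaussian_quantiles: splits a single Gaussian-distributed cluster into classes of near-equal size (the number of classes is set by n_classes)

make_hastie_10_2: generates a similar binary classification dataset with 10 dimensions

make_circles and make_moons: generate two-dimensional binary classification datasets for testing how certain algorithms perform. Noise can be added to the data, and make_circles yields data with a spherical decision boundary for binary classifiers.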

    import numpy as np
    import matplotlib.pyplot as plt
    # sklearn.datasets.samples_generator no longer exists in recent releases;
    # the generators are imported from sklearn.datasets directly
    from sklearn.datasets import make_blobs, make_classification, make_circles

    def plot_classes(X, labels, title):
        # Draw each class in its own color
        unique_labels = set(labels)
        colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
        for k, col in zip(unique_labels, colors):
            x_k = X[labels == k]
            plt.plot(x_k[:, 0], x_k[:, 1], 'o', markerfacecolor=col,
                     markeredgecolor="k", markersize=14)
        plt.title(title)
        plt.show()

    # Generate a multi-class, single-label dataset
    center = [[1, 1], [-1, -1], [1, -1]]
    cluster_std = 0.3
    X, labels = make_blobs(n_samples=200, centers=center, n_features=2,
                           cluster_std=cluster_std, random_state=0)
    print('X.shape', X.shape)
    print("labels", set(labels))
    plot_classes(X, labels, 'data by make_blobs()')

    # Generate a dataset for classification
    X, labels = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                    n_informative=2, random_state=1,
                                    n_clusters_per_class=2)
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    plot_classes(X, labels, 'data by make_classification()')

    # Generate data with a spherical decision boundary
    X, labels = make_circles(n_samples=200, noise=0.2, factor=0.2, random_state=1)
    print("X.shape:", X.shape)
    print("labels:", set(labels))
    plot_classes(X, labels, 'data by make_circles()')
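The remaining generators from the list can be called in the same way. The following sketch runs make_moons, make_gaussian_quantiles and make_hastie_10_2 and prints the shapes and label sets they return; all parameter values are arbitrary choices for illustration, and plotting is left out to keep it short.

```python
from sklearn.datasets import make_gaussian_quantiles, make_hastie_10_2, make_moons

# Two interleaving half circles, with a little Gaussian noise.
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)

# One Gaussian cluster split into 3 classes by distance from the center.
X_gq, y_gq = make_gaussian_quantiles(
    n_samples=300, n_features=2, n_classes=3, random_state=0
)

# Binary classification data with 10 features and labels -1 / +1.
X_hastie, y_hastie = make_hastie_10_2(n_samples=100, random_state=0)

print(X_moons.shape, sorted(set(y_moons)))
print(X_gq.shape, sorted(set(y_gq)))
print(X_hastie.shape, sorted(set(y_hastie)))
```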

