Classifying the MNIST dataset with a support vector machine in Python

Published: 2024/8/23

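A support vector machine constructs a hyperplane, or a set of hyperplanes, in a high-dimensional or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, the decision boundary should lie as far as possible from the nearest training points, because a larger margin generally means a lower generalization error for the classifier.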
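To make the margin idea concrete, here is a minimal sketch using only the standard library. The helper `geometric_margin` and the four toy points are hypothetical and chosen for illustration; the code compares two separating hyperplanes rather than training an SVM.

```python
import math


def geometric_margin(w, b, points, labels):
    """Return the smallest signed distance y * (w.x + b) / ||w|| over all points."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy data: class -1 sits on the line x = 0, class +1 on the line x = 2.
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
labels = [-1, -1, 1, 1]

# Both hyperplanes separate the two classes, but the centered one (x = 1)
# keeps the nearest point farther away than the skewed one (x = 0.5).
margin_centered = geometric_margin((1.0, 0.0), -1.0, points, labels)
margin_skewed = geometric_margin((1.0, 0.0), -0.5, points, labels)

print("centered margin:", margin_centered)
print("skewed margin:", margin_skewed)
```

An SVM searches for the hyperplane that maximizes this quantity, so on this toy data it would prefer the centered boundary.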

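The experiment uses the SVC class from sklearn.svm to classify the MNIST dataset and prints the overall classification accuracy. Two preprocessing methods are compared (binarizing each feature to 0 or 1, and scaling each feature into the [0, 1] interval), and they give different results. Two kernels are also tried: the Gaussian (RBF) kernel and the polynomial kernel. The training set and the test set are each split into 10 equal parts, a classifier is trained and scored on each pair of parts, and the 10 accuracies are averaged, which makes the reported figure less sensitive to any single split. Changing the penalty parameter C shows its effect on the results, which can be plotted for comparison; C=100 gave relatively good results.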
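The kernel and C comparison can be sketched on a small scale as follows. This is an illustration, not the author's experiment: it assumes scikit-learn's bundled `load_digits` dataset (8x8 images, not MNIST) as a stand-in so that it runs without downloading anything, and the split ratio and list of C values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: load_digits stands in for MNIST; its pixel values range from 0 to 16.
digits = load_digits()
X = digits.data / 16.0  # scale features into [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0, stratify=digits.target
)

# Train one classifier per (kernel, C) pair and record its test accuracy.
results = {}
for kernel in ("rbf", "poly"):
    for C in (1, 10, 100):
        clf = SVC(C=C, kernel=kernel)
        clf.fit(X_train, y_train)
        results[(kernel, C)] = clf.score(X_test, y_test)

for (kernel, C), accuracy in sorted(results.items()):
    print(f"kernel={kernel:<4} C={C:<4} accuracy={accuracy:.4f}")
```

The numbers this prints are for the small digits dataset and should not be compared directly with the MNIST figures reported in this article.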

# Task: compare the results of different kernels and plot curves to show the differences.
import struct
import time

import numpy as np
from sklearn.svm import SVC  # C-Support Vector Classification

def read_image(file_name):
    # Read the whole file in binary mode first.
    with open(file_name, "rb") as file_handle:
        file_content = file_handle.read()  # load the bytes into a buffer

    offset = 0
    # Header: four big-endian unsigned ints (magic number, image count, rows, cols).
    head = struct.unpack_from('>IIII', file_content, offset)
    offset += struct.calcsize('>IIII')
    imgNum = head[1]  # number of images
    rows = head[2]  # image height in pixels
    cols = head[3]  # image width in pixels
    image_size = rows * cols  # pixels per image (784 for MNIST)

    # np.empty allocates the array without initializing it, so the contents
    # are arbitrary until each row is filled in below.
    images = np.empty((imgNum, image_size))
    fmt = '>' + str(image_size) + 'B'  # struct format for a single image
    for i in range(imgNum):
        images[i] = np.array(struct.unpack_from(fmt, file_content, offset))
        # To keep the 2-D layout instead, reshape each image to (rows, cols).
        offset += struct.calcsize(fmt)
    return images

# Read the labels.
def read_label(file_name):
    with open(file_name, "rb") as file_handle:  # open in binary mode
        file_content = file_handle.read()  # load the bytes into a buffer

    # Header: two big-endian unsigned ints (magic number, label count).
    head = struct.unpack_from('>II', file_content, 0)
    offset = struct.calcsize('>II')
    labelNum = head[1]  # number of labels
    bitsString = '>' + str(labelNum) + 'B'  # e.g. '>60000B' for the training set
    label = struct.unpack_from(bitsString, file_content, offset)  # returns a tuple
    return np.array(label)

def normalize(data):
    # Binarize the pixels in place: every nonzero value becomes 1, zeros stay 0.
    data[data != 0] = 1
    return data

# Alternative normalization: scale the features into the [0, 1] interval.
def normalize_new(data):
    # In-place division; data is the float array returned by read_image.
    data /= 255.0
    return data

def loadDataSet():
    train_x_filename = "train-images-idx3-ubyte"
    train_y_filename = "train-labels-idx1-ubyte"
    test_x_filename = "t10k-images-idx3-ubyte"
    test_y_filename = "t10k-labels-idx1-ubyte"

    train_x = read_image(train_x_filename)  # 60000 x 784 matrix
    train_y = read_label(train_y_filename)  # 60000 labels
    test_x = read_image(test_x_filename)  # 10000 x 784 matrix
    test_y = read_label(test_y_filename)  # 10000 labels

    # Uncomment one of the pairs below to compare the two preprocessing methods.
    # train_x = normalize(train_x)
    # test_x = normalize(test_x)
    # train_x = normalize_new(train_x)
    # test_x = normalize_new(test_x)
    return train_x, test_x, train_y, test_y

if __name__ == '__main__':
    classNum = 10  # used here as the number of equal chunks
    score_train = 0.0
    score = 0.0

    print("Start reading data...")
    time1 = time.time()
    train_x, test_x, train_y, test_y = loadDataSet()
    time2 = time.time()
    print("read data cost", time2 - time1, "second")

    print("Start training data...")
    # clf = SVC(C=1.0, kernel='poly')  # polynomial kernel
    clf = SVC(C=0.01, kernel='rbf')  # Gaussian (RBF) kernel; vary C and kernel to compare

    # Each block of 6000 training samples contains a roughly equal number of
    # samples per class, so the data is simply split into consecutive chunks.
    for i in range(classNum):
        train_chunk = slice(i * 6000, (i + 1) * 6000)
        test_chunk = slice(i * 1000, (i + 1) * 1000)
        clf.fit(train_x[train_chunk], train_y[train_chunk])
        temp = clf.score(test_x[test_chunk], test_y[test_chunk])
        temp_train = clf.score(train_x[train_chunk], train_y[train_chunk])
        print(temp_train)
        score += temp / classNum
        score_train += temp_train / classNum

    time3 = time.time()
    print("test score:{:.6f}".format(score))
    print("train score:{:.6f}".format(score_train))
    print("train data cost", time3 - time2, "second")
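Unpacking one image at a time with `struct.unpack_from` works, but it runs a Python-level loop over every image. As an alternative sketch, not the author's method, the pixel bytes can be handed to `np.frombuffer` in one call. The function names `parse_idx_images` and `parse_idx_labels` are hypothetical, and the sketch takes raw bytes instead of a file name so that it can be tried on a tiny synthetic buffer.

```python
import struct

import numpy as np


def parse_idx_images(raw):
    """Parse IDX3 image bytes into a float array of shape (count, rows * cols)."""
    magic, count, rows, cols = struct.unpack_from(">IIII", raw, 0)
    if magic != 2051:  # 0x00000803 marks an IDX3 unsigned-byte image file
        raise ValueError("not an IDX3 image buffer")
    header_size = struct.calcsize(">IIII")
    pixels = np.frombuffer(
        raw, dtype=np.uint8, count=count * rows * cols, offset=header_size
    )
    # astype copies the data, so the result is writable and independent of raw.
    return pixels.reshape(count, rows * cols).astype(np.float64)


def parse_idx_labels(raw):
    """Parse IDX1 label bytes into an integer array of shape (count,)."""
    magic, count = struct.unpack_from(">II", raw, 0)
    if magic != 2049:  # 0x00000801 marks an IDX1 unsigned-byte label file
        raise ValueError("not an IDX1 label buffer")
    header_size = struct.calcsize(">II")
    labels = np.frombuffer(raw, dtype=np.uint8, count=count, offset=header_size)
    return labels.astype(np.int64)


# Synthetic buffers for illustration: two 2x3 "images" and two labels.
raw_images = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
raw_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

images = parse_idx_images(raw_images)
labels = parse_idx_labels(raw_labels)
print(images.shape, labels)
```

With the real files, the bytes would come from reading each file in binary mode, for example `open("train-images-idx3-ubyte", "rb").read()`.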

Results: accuracy and training time for different kernels and values of C, measured on the binarized data (normalize). The results are shown in the table below:

C        kernel   accuracy   train time (s)
1        poly     0.4312     558.61
1        rbf      0.9212     163.15
10       poly     0.8802     277.78
10       rbf      0.9354     96.07
100      poly     0.9427     146.43
100      rbf      0.9324     163.99
1000     poly     0.9519     132.59
1000     rbf      0.9325     97.54
10000    poly     0.9518     115.35
10000    rbf      0.9325     115.77
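To read the best setting per kernel off the reported figures, the table can be treated as data. The sketch below copies the accuracy and training-time values from the table above. The helper name `best_by_kernel` is hypothetical, and treating the times as seconds is an assumption based on the script's timing printout.

```python
# (C, kernel, accuracy, train time) copied from the reported results.
reported = [
    (1, "poly", 0.4312, 558.61),
    (1, "rbf", 0.9212, 163.15),
    (10, "poly", 0.8802, 277.78),
    (10, "rbf", 0.9354, 96.07),
    (100, "poly", 0.9427, 146.43),
    (100, "rbf", 0.9324, 163.99),
    (1000, "poly", 0.9519, 132.59),
    (1000, "rbf", 0.9325, 97.54),
    (10000, "poly", 0.9518, 115.35),
    (10000, "rbf", 0.9325, 115.77),
]


def best_by_kernel(rows):
    """Return {kernel: (C, accuracy, train_time)} for the most accurate row per kernel."""
    best = {}
    for C, kernel, accuracy, train_time in rows:
        if kernel not in best or accuracy > best[kernel][1]:
            best[kernel] = (C, accuracy, train_time)
    return best


best = best_by_kernel(reported)
for kernel, (C, accuracy, train_time) in sorted(best.items()):
    print(f"{kernel}: best C={C}, accuracy={accuracy}, train time={train_time}")
```

On these figures the polynomial kernel peaks at C=1000 and the RBF kernel at C=10, and the polynomial kernel is far more sensitive to small values of C.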

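One way to optimize the experiment is to apply principal component analysis (PCA) before training; the author reports that both accuracy and speed improve. The original code listing and result screenshot are not reproduced here.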
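Since the author's PCA code is not available, the following is only a minimal sketch of the idea and not the original implementation. It assumes scikit-learn's `PCA` chained with `SVC` through `make_pipeline`, uses the bundled `load_digits` dataset as a stand-in for MNIST, and picks a 95% explained-variance threshold and C=10 arbitrarily.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumption: load_digits (8x8 images, pixel values 0 to 16) stands in for MNIST.
digits = load_digits()
X = digits.data / 16.0
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0, stratify=digits.target
)

# A float n_components keeps just enough components to explain that
# fraction of the variance; svd_solver="full" supports this mode.
model = make_pipeline(
    PCA(n_components=0.95, svd_solver="full"),
    SVC(C=10, kernel="rbf"),
)
model.fit(X_train, y_train)

pca_accuracy = model.score(X_test, y_test)
n_kept = model.named_steps["pca"].n_components_
print(f"kept {n_kept} of {X.shape[1]} features, accuracy={pca_accuracy:.4f}")
```

Putting PCA inside the pipeline means the projection is fitted on the training data only and then reused unchanged on the test data, which keeps the evaluation honest. Whether PCA improves accuracy on MNIST itself would need to be checked on the real data.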

Original article: https://www.cnblogs.com/BlueBlue-Sky/p/9382702.html
