

Implementing gradient descent and stochastic gradient descent by hand in PyTorch, based on logistic Regression, and exploring the effect of mini-batches


Overview

This write-up grew out of a large project assignment for a convex optimization course.
It centers on handwritten digit recognition on MNIST using logistic Regression,
and uses that task to explore logistic regression, gradient descent, stochastic gradient descent, and the effect of mini-batches.

The core task is to implement gradient descent and stochastic gradient descent, but the supporting pieces also need to be in reasonably good shape.

Imports

import os
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

Loading the data

EPOCH = 1              # train over the training data n times; to save time, we just train 1 epoch
BATCH_SIZE = 1
DOWNLOAD_MNIST = False
LR = 0.001

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # not a mnist dir, or the mnist dir is empty
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,                                    # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)

# Data Loader for easy mini-batch return in training; the image batch shape will be (BATCH_SIZE, 1, 28, 28)
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE)  # , shuffle=True)

The sigmoid function

The sigmoid function maps values from R onto the interval (0, 1):

\sigma(x) = \frac{1}{1 + e^{-x}}
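As a quick sanity check (my own minimal sketch, not part of the original assignment code), torch.sigmoid matches this formula and stays strictly inside (0, 1):

import torch

x = torch.linspace(-5.0, 5.0, steps=5)
manual = 1.0 / (1.0 + torch.exp(-x))                          # the formula above
print(torch.allclose(torch.sigmoid(x), manual))               # True
print(manual.min().item() > 0.0, manual.max().item() < 1.0)   # True True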

The softmax function

Softmax turns n values into a probability distribution according to their relative magnitudes:

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}

  • In general, the maximum value is subtracted first to avoid numerical overflow (a short sketch follows this list).
  • Here, though, the inputs to softmax come from the sigmoid in our logistic regression, so they all lie in (0, 1); they never get large, and overflow is not a concern.
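A minimal sketch of the max-subtraction trick mentioned above (my own illustration, not taken from the assignment code); shifting each row by its maximum leaves softmax unchanged but keeps exp() from overflowing:

import torch

def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    shifted = x - x.max(dim=1, keepdim=True).values   # subtract the per-row maximum
    e = torch.exp(shifted)
    return e / e.sum(dim=1, keepdim=True)

x = torch.tensor([[1000.0, 1001.0, 1002.0]])          # exp() would overflow without the shift
print(stable_softmax(x))
print(torch.softmax(x, dim=1))                        # PyTorch's built-in agrees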

The cross-entropy function

cross_Entropy here is just the cross entropy:

-\sum_i p_i \log(q_i)

Once the ground-truth label is given, we know the true distribution p: it puts probability 1 on a single element and 0 on all the others.

In other words, the loss reduces to

-\log(q_{\text{label}})

so the larger the probability assigned to the correct label, the smaller the loss.
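A small sketch of this collapse (my own illustration): with a one-hot target p, the full sum reduces to the single term for the true label:

import torch

q = torch.tensor([0.1, 0.7, 0.2])       # predicted distribution over 3 classes
label = 1                                # ground-truth class index
p = torch.zeros(3)
p[label] = 1.0                           # one-hot target distribution

full = -(p * q.log()).sum()              # -sum_i p_i * log(q_i)
shortcut = -q[label].log()               # -log(q_label)
print(full.item(), shortcut.item())      # both are about 0.3567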

Task description

\min_{A,b}\; CE\big(SM(SIG(Ax + b)),\ \text{label}\big)

  • SM: softmax
  • SIG: sigmoid
  • CE: cross entropy
  • label: the ground-truth label (a rough code sketch of the full objective follows this list)
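A rough sketch of this composed objective for a single flattened image (hypothetical shapes and values; the actual model class used in the experiments appears in the full code further down):

import torch
import torch.nn.functional as F

A = torch.randn(10, 28 * 28, requires_grad=True)   # weight matrix: 10 classes x 784 pixels
b = torch.zeros(10, requires_grad=True)            # bias
x = torch.rand(1, 28 * 28)                         # one flattened MNIST-sized image
label = torch.tensor([3])                          # its (hypothetical) class

z = torch.sigmoid(x @ A.t() + b)                   # SIG(Ax + b)
q = torch.softmax(z, dim=1)                        # SM(...)
loss = F.nll_loss(q.log(), label)                  # CE(q, label) = -log(q_label)
loss.backward()                                    # autograd fills A.grad and b.grad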

Solving it with SGD and GD

The implementation uses PyTorch, mainly to avoid computing gradients by hand: PyTorch's autograd mechanism does that for us.

A fixed step size is used throughout (a minimal toy sketch of this update pattern follows).
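A minimal sketch of the pattern on a toy one-dimensional problem (my own example, not the MNIST model): autograd supplies the gradient and the parameter moves a fixed step against it.

import torch

w = torch.tensor(0.0, requires_grad=True)
alpha = 0.1                                   # fixed step size
for _ in range(100):
    loss = (w - 3.0) ** 2                     # minimize (w - 3)^2
    if w.grad is not None:
        w.grad.zero_()                        # clear the accumulated gradient
    loss.backward()                           # autograd computes d(loss)/dw
    w.data = w.data - alpha * w.grad.data     # fixed-step descent update
print(w.item())                               # converges to roughly 3.0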

SGD

  • batch = 1

  • (the step size alpha is 0.001)

  • Final accuracy: 0.836

  • How the test accuracy evolves during training

  • How far A and b are from the optimum (measured with the matrix 2-norm)

  • Part of the code implementing SGD

From the logistic regression model we pull out the parameters A and b:

A, b = [i for i in logits.parameters()]
A.cuda()
b.cuda()

Following how PyTorch's built-in optimizers are implemented in the library source, the gradients are zeroed manually before each backward pass; otherwise they would accumulate across steps.

if A.grad is not None:
    A.grad.zero_()
    b.grad.zero_()
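For comparison, Optimizer.zero_grad() essentially performs the same reset for every registered parameter; a rough sketch of that loop (my paraphrase, not the actual library source):

for p in logits.parameters():
    if p.grad is not None:
        p.grad.zero_()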
  • Gradient-descent parameter update

A.data = A.data - alpha * A.grad.data
b.data = b.data - alpha * b.grad.data
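For reference, an equivalent way to write the same fixed-step update without touching .data is to wrap it in torch.no_grad() (just an alternative sketch; the experiments below use the .data form above):

with torch.no_grad():
    A -= alpha * A.grad    # in-place update, kept out of the autograd graph
    b -= alpha * b.grad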

Full code

import os

import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision
import matplotlib.pyplot as plt

EPOCH = 5              # train over the training data n times
BATCH_SIZE = 1
DOWNLOAD_MNIST = False
LR = 0.001

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # not a mnist dir, or the mnist dir is empty
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,                                    # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)

# Data Loader for easy mini-batch return in training
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
# shape from (10000, 28, 28) to (10000, 1, 28, 28), values in range (0, 1)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(torch.FloatTensor).cuda() / 255.
test_y = test_data.test_labels

alpha = 0.001

logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # optimize all parameters
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()                 # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()
b.cuda()
for e in range(EPOCH):
    for step, (x, b_y) in enumerate(train_loader):   # gives batch data
        b_x = x.view(-1, 28 * 28).cuda()             # reshape x to (batch, 28 * 28)
        b_y = b_y.cuda()
        output = logits(b_x)                         # logits output
        loss = loss_func(output, b_y)                # cross entropy loss
        if A.grad is not None:
            A.grad.zero_()
            b.grad.zero_()
        loss.backward()                              # backpropagation, compute gradients
        A.data = A.data - alpha * A.grad.data
        b.data = b.data - alpha * b.grad.data
        if step % 1500 == 0:
            test_output = logits(test_x.view(-1, 28 * 28))
            pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
            Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
            print(Accurate[-1])
            Astore.append(A.detach())
            bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()

print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm()
    bstore[i] = (bstore[i] - bstore[-1]).norm()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()

GD

Setting BATCH_SIZE to 60000 (the size of the MNIST training set) turns this into full-batch gradient descent.

  • The step size should not be too small here, though (GD uses alpha = 0.05).

Everything else is essentially the same. Since the computation runs on the GPU and there is only a single batch, the full dataset is pulled out of the loader once, up front, instead of repeatedly reading MNIST through the loader and copying it to the GPU, which would waste time.

In addition, EPOCH is set to 5000.

  • On the GPU, the computation finishes quickly.

import os

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

EPOCH = 5000           # number of full-batch gradient steps
BATCH_SIZE = 60000     # the whole MNIST training set in one batch
DOWNLOAD_MNIST = False

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # not a mnist dir, or the mnist dir is empty
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,                                    # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)

# Data Loader; with BATCH_SIZE = 60000 each "batch" is the whole training set
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
# shape from (10000, 28, 28) to (10000, 1, 28, 28), values in range (0, 1)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(torch.FloatTensor).cuda() / 255.
test_y = test_data.test_labels

alpha = 0.05

logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # optimize all parameters
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()                 # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()
b.cuda()
# pull the single full batch out of the loader once and keep it on the GPU
x, b_y = [(i, j) for i, j in train_loader][0]
b_x = x.view(-1, 28 * 28).cuda()                  # reshape x to (batch, 28 * 28)
b_y = b_y.cuda()
for e in range(EPOCH):
    output = logits(b_x)                          # logits output
    loss = loss_func(output, b_y)                 # cross entropy loss
    if A.grad is not None:
        A.grad.zero_()
        b.grad.zero_()
    loss.backward()                               # backpropagation, compute gradients
    A.data = A.data - alpha * A.grad.data
    b.data = b.data - alpha * b.grad.data
    test_output = logits(test_x.view(-1, 28 * 28))
    # print(e)
    if e % 10 == 0:
        pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
        Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
        print(e, Accurate[-1])
        Astore.append(A.detach())
        bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()

print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm()
    bstore[i] = (bstore[i] - bstore[-1]).norm()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()

Exploring the batch size

Note that when the batch size is set fairly large (as in the GD run above), choosing a good step size becomes quite demanding (true hyperparameter tuning, heh).

  • With SGD at batch size 1, accuracy is already high by roughly the 25th plotted point; since checkpoints are taken every 1500 steps, that corresponds to about 37500 training samples.
  • With SGD at batch size 20 and checkpoints every 500 steps, comparable accuracy only shows up around the 100th plotted point, i.e. after roughly 1,000,000 samples (see the quick arithmetic sketch below).

    Combined with the earlier GD run, this suggests the mini-batch size should not be too large, so the batch size is reduced further in the experiments that follow; the corresponding code comes after the sketch below.
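The sample counts quoted above are just the product of plotted checkpoints, checkpoint interval, and batch size; a quick sketch of that arithmetic (numbers taken from the text above):

configs = {
    "SGD, batch=1":  dict(checkpoints=25,  step_interval=1500, batch=1),
    "SGD, batch=20": dict(checkpoints=100, step_interval=500,  batch=20),
}
for name, c in configs.items():
    seen = c["checkpoints"] * c["step_interval"] * c["batch"]
    print(name, "has seen about", seen, "training samples")   # 37500 vs 1000000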
import os

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

EPOCH = 100
BATCH_SIZE = 20
DOWNLOAD_MNIST = False
LR = 0.001

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # not a mnist dir, or the mnist dir is empty
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,                                    # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)

# Data Loader for easy mini-batch return in training
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
# shape from (10000, 28, 28) to (10000, 1, 28, 28), values in range (0, 1)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(torch.FloatTensor).cuda() / 255.
test_y = test_data.test_labels

alpha = 0.001

logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # optimize all parameters
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()                 # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()
b.cuda()
# materialize the batches once so they can be reused in every epoch
data = [(step, (x, b_y)) for step, (x, b_y) in enumerate(train_loader)]
for e in range(EPOCH):
    for step, (x, b_y) in data:                      # gives batch data
        b_x = x.view(-1, 28 * 28).cuda()             # reshape x to (batch, 28 * 28)
        b_y = b_y.cuda()
        output = logits(b_x)                         # logits output
        loss = loss_func(output, b_y)                # cross entropy loss
        if A.grad is not None:
            A.grad.zero_()
            b.grad.zero_()
        loss.backward()                              # backpropagation, compute gradients
        A.data = A.data - alpha * A.grad.data
        b.data = b.data - alpha * b.grad.data
        if step % 500 == 0:
            test_output = logits(test_x.view(-1, 28 * 28))
            pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
            Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
            print(Accurate[-1])
            Astore.append(A.detach())
            bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()

print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm()
    bstore[i] = (bstore[i] - bstore[-1]).norm()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()
  • Here the batch size is set to 8.
    • With batchsize=8 and a checkpoint interval of 2000 steps, comparable accuracy appears around the 20th plotted point, i.e. about 320000 samples, which is much better than batch size 20.
    • Similarly, convergence speeds up a bit more at batchsize=4 (mini-batches really should be mini, haha).

import os

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

EPOCH = 20
BATCH_SIZE = 8
DOWNLOAD_MNIST = False
LR = 0.001

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # not a mnist dir, or the mnist dir is empty
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,                                    # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)

# Data Loader for easy mini-batch return in training
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
# shape from (10000, 28, 28) to (10000, 1, 28, 28), values in range (0, 1)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(torch.FloatTensor).cuda() / 255.
test_y = test_data.test_labels

alpha = 0.001

logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # optimize all parameters
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()                 # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()
b.cuda()
# flatten the images once and keep the batches for reuse in every epoch
data = [(step, (x.view(-1, 28 * 28), b_y)) for step, (x, b_y) in enumerate(train_loader)]
for e in range(EPOCH):
    for step, (x, b_y) in data:                      # gives batch data
        b_x = x.cuda()                               # already flattened above, just move to the GPU
        b_y = b_y.cuda()
        output = logits(b_x)                         # logits output
        loss = loss_func(output, b_y)                # cross entropy loss
        if A.grad is not None:
            A.grad.zero_()
            b.grad.zero_()
        loss.backward()                              # backpropagation, compute gradients
        A.data = A.data - alpha * A.grad.data
        b.data = b.data - alpha * b.grad.data
        if step % 2000 == 0:
            test_output = logits(test_x.view(-1, 28 * 28))
            pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
            Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
            print(e, Accurate[-1])
            Astore.append(A.detach())
            bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()

print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm()
    bstore[i] = (bstore[i] - bstore[-1]).norm()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()
  • batchsize=4
  • With the checkpoint interval set to 4000 steps, the plotted indices here should line up with the plot above, and training is clearly much faster.

  • Possibly because the algorithm is still fairly crude (fixed step size), these experiments show smaller batches performing better. In practice, though, a moderate batch size is usually the better choice; generally something around batch = 8.
