
Lesson 15.2 Implementing Learning Rate Scheduling in PyTorch

Published: 2025/4/5


Learning rate scheduling is an important model optimization technique, and it is built into PyTorch's optim module. The scheduler module can be imported as follows.

from torch.optim import lr_scheduler

Next, we start with a fairly basic scheduling method to get familiar with the general idea and workflow of learning rate scheduling in PyTorch.

I. Optimizers and the State Dictionary (state_dict)

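In earlier training examples we covered the basic use of PyTorch optimizers. The optimizer is the component that minimizes the loss function. It holds much of the key information for training, including the model parameters and the learning rate, and during training it is the optimizer that updates the parameters and zeroes the gradients. Learning rate scheduling means adjusting the learning rate dynamically, and the learning rate reaches training only through the optimizer. The core question in PyTorch is therefore how to make the learning rate stored in the optimizer change as the number of iterations grows.
To do that, we first need to look at the optimizer's state dictionary.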

# tensorGenReg, split_loader and net_class2 are helper functions/classes from earlier lessons

# Set the random seed
torch.manual_seed(420)

# Create a polynomial regression dataset whose highest-order term has degree 2
features, labels = tensorGenReg(w=[2, -1, 3, 1, 2], bias=False, deg=2)

# Split and load the dataset
train_loader, test_loader = split_loader(features, labels, batch_size=50)

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model1 = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

# Create the optimizer
optimizer = torch.optim.SGD(tanh_model1.parameters(), lr=0.01)

Once the optimizer has been created, we can inspect its state with the .state_dict() method.

optimizer.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.01,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}

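This method returns a dictionary holding the optimizer's core information. At this point the dictionary has two entries: the optimizer state (state) and the optimizer's parameter groups (param_groups). For now the key item is lr inside the parameter group, which is the learning rate that will be used in the next training step. The value of lr can be extracted as follows: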

optimizer.state_dict()['param_groups']
# [{'lr': 0.01,
#   'momentum': 0,
#   'dampening': 0,
#   'weight_decay': 0,
#   'nesterov': False,
#   'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]

optimizer.state_dict()['param_groups'][0]
# {'lr': 0.01,
#  'momentum': 0,
#  'dampening': 0,
#  'weight_decay': 0,
#  'nesterov': False,
#  'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

optimizer.state_dict()['param_groups'][0]['lr']
# 0.01

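The other entries in the parameter group are the momentum coefficient (momentum), the momentum dampening (dampening), the weight decay coefficient (weight_decay), whether Nesterov momentum is used (nesterov), and the indices of the parameters to be trained (params).

Note that params lists one index per trainable parameter tensor (a whole matrix counts as one parameter), so its length equals the number of parameter tensors. This is easy to verify: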

list(tanh_model1.parameters())
# [Parameter containing:
#  tensor([[ 0.2365, -0.1118, -0.3801, 0.0275, 0.4168],
#          [-0.1995, -0.1456, 0.3497, -0.0622, -0.1708],
#          [-0.0901, 0.0164, -0.3643, -0.1278, 0.4336],
#          [-0.0959, 0.4073, -0.1746, -0.1799, -0.1333]], requires_grad=True),
#  Parameter containing:
#  tensor([-0.3999, -0.2694, 0.2703, -0.3355], requires_grad=True),
#  Parameter containing:
#  tensor([1., 1., 1., 1.], requires_grad=True),
#  Parameter containing:
#  tensor([0., 0., 0., 0.], requires_grad=True),
#  Parameter containing:
#  tensor([[ 0.1708, 0.4704, -0.0635, 0.2187],
#          [ 0.2336, -0.3569, -0.1928, -0.1566],
#          [ 0.4825, -0.4463, 0.3027, 0.4696],
#          [ 0.3953, 0.2131, 0.2226, -0.0267]], requires_grad=True),
#  Parameter containing:
#  tensor([ 0.2516, 0.4558, -0.1608, 0.4831], requires_grad=True),
#  Parameter containing:
#  tensor([1., 1., 1., 1.], requires_grad=True),
#  Parameter containing:
#  tensor([0., 0., 0., 0.], requires_grad=True),
#  Parameter containing:
#  tensor([[ 0.0795, -0.3507, -0.3589, 0.1764]], requires_grad=True),
#  Parameter containing:
#  tensor([-0.0705], requires_grad=True)]

# Verify the number of parameters to be trained
len(list(tanh_model1.parameters()))
# 10

tanh_model2 = net_class3(act_fun=torch.tanh, in_features=5, BN_model='pre')
optimizer1 = torch.optim.SGD(tanh_model2.parameters(), lr=0.05)
optimizer1.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.05,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}]}

len(list(tanh_model2.parameters()))
# 14
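The same check can be reproduced without the course helpers. The sketch below is only an illustration: the nn.Sequential model is a hypothetical stand-in for net_class2 (linear layers followed by batch normalization), not the course's actual class.

```python
import torch
from torch import nn

# Hypothetical stand-in for net_class2: two hidden blocks with BatchNorm, one output layer
model = nn.Sequential(
    nn.Linear(5, 4), nn.BatchNorm1d(4),
    nn.Linear(4, 4), nn.BatchNorm1d(4),
    nn.Linear(4, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Each weight matrix and each bias/scale vector counts as one parameter tensor
n_tensors = len(list(model.parameters()))

# 'params' holds one integer index per parameter tensor
param_ids = optimizer.state_dict()['param_groups'][0]['params']

print(n_tensors)
print(param_ids)
```

Each Linear and BatchNorm1d layer contributes a weight and a bias tensor, which is why five layers give ten indices.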

Saving and Loading Models Locally

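The state_dict() method also lets us save a model or an optimizer locally and load it back. We use the model as the example here; saving an optimizer works in a similar way.
Models have a state_dict() method too. Calling it shows all of the model's parameter information.

Training and saving a model both come down to the model's parameters, and the model's state_dict() holds all of its current parameter values. Saving the model's state_dict() is therefore equivalent to saving the model.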

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model1 = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

tanh_model1.state_dict()
# OrderedDict([('linear1.weight',
#   tensor([[ 0.2365, -0.1118, -0.3801, 0.0275, 0.4168],
#           [-0.1995, -0.1456, 0.3497, -0.0622, -0.1708],
#           [-0.0901, 0.0164, -0.3643, -0.1278, 0.4336],
#           [-0.0959, 0.4073, -0.1746, -0.1799, -0.1333]])),
#  ('linear1.bias', tensor([-0.3999, -0.2694, 0.2703, -0.3355])),
#  ('normalize1.weight', tensor([1., 1., 1., 1.])),
#  ('normalize1.bias', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize1.num_batches_tracked', tensor(0)),
#  ('linear2.weight',
#   tensor([[ 0.1708, 0.4704, -0.0635, 0.2187],
#           [ 0.2336, -0.3569, -0.1928, -0.1566],
#           [ 0.4825, -0.4463, 0.3027, 0.4696],
#           [ 0.3953, 0.2131, 0.2226, -0.0267]])),
#  ('linear2.bias', tensor([ 0.2516, 0.4558, -0.1608, 0.4831])),
#  ('normalize2.weight', tensor([1., 1., 1., 1.])),
#  ('normalize2.bias', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize2.num_batches_tracked', tensor(0)),
#  ('linear3.weight',
#   tensor([[ 0.0795, -0.3507, -0.3589, 0.1764]])),
#  ('linear3.bias', tensor([-0.0705]))])

First, we can assign this dictionary of all model parameters to a variable.

t1 = tanh_model1.state_dict()
t1
# OrderedDict([('linear1.weight',
#   tensor([[ 0.2365, -0.1118, -0.3801, 0.0275, 0.4168],
#           [-0.1995, -0.1456, 0.3497, -0.0622, -0.1708],
#           [-0.0901, 0.0164, -0.3643, -0.1278, 0.4336],
#           [-0.0959, 0.4073, -0.1746, -0.1799, -0.1333]])),
#  ('linear1.bias', tensor([-0.3999, -0.2694, 0.2703, -0.3355])),
#  ('normalize1.weight', tensor([1., 1., 1., 1.])),
#  ('normalize1.bias', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize1.num_batches_tracked', tensor(0)),
#  ('linear2.weight',
#   tensor([[ 0.1708, 0.4704, -0.0635, 0.2187],
#           [ 0.2336, -0.3569, -0.1928, -0.1566],
#           [ 0.4825, -0.4463, 0.3027, 0.4696],
#           [ 0.3953, 0.2131, 0.2226, -0.0267]])),
#  ('linear2.bias', tensor([ 0.2516, 0.4558, -0.1608, 0.4831])),
#  ('normalize2.weight', tensor([1., 1., 1., 1.])),
#  ('normalize2.bias', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize2.num_batches_tracked', tensor(0)),
#  ('linear3.weight',
#   tensor([[ 0.0795, -0.3507, -0.3589, 0.1764]])),
#  ('linear3.bias', tensor([-0.0705]))])

Second, we can save these parameters locally with torch.save.

torch.save(tanh_model1.state_dict(), 'tanh1.pt')

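For torch.save, the first argument is the model parameters to save and the second is the local file name, which conventionally ends in .pt or .pth. To read the saved parameters back, we can use the load_state_dict method, which we will come to shortly.
Next we train the model, that is, we adjust its parameters. Recall from earlier lessons that training means using the loss function and backpropagation to compute gradients, then letting the optimizer update the parameters of each layer according to those gradients.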

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(tanh_model1.parameters(), lr=0.05)

for X, y in train_loader:
    yhat = tanh_model1.forward(X)
    loss = criterion(yhat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

After one epoch of training, we can inspect the model state:

tanh_model1.state_dict()
# OrderedDict([('linear1.weight',
#   tensor([[ 0.0436, -0.3587, -0.3227, 0.0310, 0.4388],
#           [-0.0870, -0.1146, 0.4255, -0.0052, -0.3548],
#           [-0.0154, 0.1517, -0.4181, -0.0605, 0.4350],
#           [-0.0627, 0.5445, 0.0345, -0.1221, 0.1262]])),
#  ('linear1.bias', tensor([-0.3999, -0.2694, 0.2703, -0.3355])),
#  ('normalize1.weight', tensor([1.0497, 0.9741, 1.0267, 1.0508])),
#  ('normalize1.bias', tensor([ 0.0358, -0.1734, -0.1451, 0.0043])),
#  ('normalize1.running_mean',
#   tensor([-0.3789, -0.2839, 0.2689, -0.3484])),
#  ('normalize1.running_var',
#   tensor([0.3839, 0.2907, 0.3761, 0.2507])),
#  ('normalize1.num_batches_tracked', tensor(42)),
#  ('linear2.weight',
#   tensor([[ 0.1514, 0.5047, -0.0870, 0.1669],
#           [ 0.2090, 0.0034, -0.3558, -0.4330],
#           [ 0.4056, -0.3937, 0.3199, 0.5734],
#           [ 0.3083, 0.3801, -0.0587, -0.2878]])),
#  ('linear2.bias', tensor([ 0.2516, 0.4558, -0.1608, 0.4831])),
#  ('normalize2.weight', tensor([1.0229, 0.4936, 0.2831, 0.7715])),
#  ('normalize2.bias', tensor([-0.0817, -1.2150, -1.1698, 0.8213])),
#  ('normalize2.running_mean',
#   tensor([ 0.2384, 0.4661, -0.1415, 0.4703])),
#  ('normalize2.running_var',
#   tensor([0.0720, 0.1388, 0.6376, 0.0972])),
#  ('normalize2.num_batches_tracked', tensor(42)),
#  ('linear3.weight',
#   tensor([[-0.3395, -1.3164, -1.1326, 0.8836]])),
#  ('linear3.bias', tensor([4.8350]))])

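The model's parameters have changed. Note that t1 has changed along with them, because state_dict() returns references to the model's tensors rather than copies.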

t1
# OrderedDict([('linear1.weight',
#   tensor([[ 0.0436, -0.3587, -0.3227, 0.0310, 0.4388],
#           [-0.0870, -0.1146, 0.4255, -0.0052, -0.3548],
#           [-0.0154, 0.1517, -0.4181, -0.0605, 0.4350],
#           [-0.0627, 0.5445, 0.0345, -0.1221, 0.1262]])),
#  ('linear1.bias', tensor([-0.3999, -0.2694, 0.2703, -0.3355])),
#  ('normalize1.weight', tensor([1.0497, 0.9741, 1.0267, 1.0508])),
#  ('normalize1.bias', tensor([ 0.0358, -0.1734, -0.1451, 0.0043])),
#  ('normalize1.running_mean',
#   tensor([-0.3789, -0.2839, 0.2689, -0.3484])),
#  ('normalize1.running_var',
#   tensor([0.3839, 0.2907, 0.3761, 0.2507])),
#  ('normalize1.num_batches_tracked', tensor(0)),
#  ('linear2.weight',
#   tensor([[ 0.1514, 0.5047, -0.0870, 0.1669],
#           [ 0.2090, 0.0034, -0.3558, -0.4330],
#           [ 0.4056, -0.3937, 0.3199, 0.5734],
#           [ 0.3083, 0.3801, -0.0587, -0.2878]])),
#  ('linear2.bias', tensor([ 0.2516, 0.4558, -0.1608, 0.4831])),
#  ('normalize2.weight', tensor([1.0229, 0.4936, 0.2831, 0.7715])),
#  ('normalize2.bias', tensor([-0.0817, -1.2150, -1.1698, 0.8213])),
#  ('normalize2.running_mean',
#   tensor([ 0.2384, 0.4661, -0.1415, 0.4703])),
#  ('normalize2.running_var',
#   tensor([0.0720, 0.1388, 0.6376, 0.0972])),
#  ('normalize2.num_batches_tracked', tensor(0)),
#  ('linear3.weight',
#   tensor([[-0.3395, -1.3164, -1.1326, 0.8836]])),
#  ('linear3.bias', tensor([-0.0705]))])
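Why a previously assigned state dictionary follows the model can be seen in a small standalone sketch. It assumes a single nn.Linear layer in place of the course model, and it simulates a parameter update with an in-place addition instead of a real training step.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)

snapshot = layer.state_dict()           # holds references to the layer's tensors
before = snapshot['weight'].clone()     # an independent copy, kept for comparison

# Simulate a parameter update by modifying the weight in place
with torch.no_grad():
    layer.weight.add_(1.0)

# The dictionary entry and the live parameter point at the same memory
shares_memory = snapshot['weight'].data_ptr() == layer.weight.data_ptr()

# So the "snapshot" has moved together with the parameter
changed = not torch.equal(snapshot['weight'], before)
print(shares_memory, changed)
```

An independent snapshot therefore needs an explicit copy, for example by saving to disk or cloning the tensors.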

To restore the original parameters of tanh_model1 at this point, we can use the load_state_dict method to replace the current parameters of tanh_model1 with the original parameters saved locally:

torch.load('tanh1.pt')
# OrderedDict([('linear1.weight',
#   tensor([[ 0.2365, -0.1118, -0.3801, 0.0275, 0.4168],
#           [-0.1995, -0.1456, 0.3497, -0.0622, -0.1708],
#           [-0.0901, 0.0164, -0.3643, -0.1278, 0.4336],
#           [-0.0959, 0.4073, -0.1746, -0.1799, -0.1333]])),
#  ('linear1.bias', tensor([-0.3999, -0.2694, 0.2703, -0.3355])),
#  ('normalize1.weight', tensor([1., 1., 1., 1.])),
#  ('normalize1.bias', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize1.num_batches_tracked', tensor(0)),
#  ('linear2.weight',
#   tensor([[ 0.1708, 0.4704, -0.0635, 0.2187],
#           [ 0.2336, -0.3569, -0.1928, -0.1566],
#           [ 0.4825, -0.4463, 0.3027, 0.4696],
#           [ 0.3953, 0.2131, 0.2226, -0.0267]])),
#  ('linear2.bias', tensor([ 0.2516, 0.4558, -0.1608, 0.4831])),
#  ('normalize2.weight', tensor([1., 1., 1., 1.])),
#  ('normalize2.bias', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize2.num_batches_tracked', tensor(0)),
#  ('linear3.weight',
#   tensor([[ 0.0795, -0.3507, -0.3589, 0.1764]])),
#  ('linear3.bias', tensor([-0.0705]))])

tanh_model1.load_state_dict(torch.load('tanh1.pt'))
# <All keys matched successfully>

tanh_model1.state_dict()
# OrderedDict([('linear1.weight',
#   tensor([[ 0.2365, -0.1118, -0.3801, 0.0275, 0.4168],
#           [-0.1995, -0.1456, 0.3497, -0.0622, -0.1708],
#           [-0.0901, 0.0164, -0.3643, -0.1278, 0.4336],
#           [-0.0959, 0.4073, -0.1746, -0.1799, -0.1333]])),
#  ('linear1.bias', tensor([-0.3999, -0.2694, 0.2703, -0.3355])),
#  ('normalize1.weight', tensor([1., 1., 1., 1.])),
#  ('normalize1.bias', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize1.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize1.num_batches_tracked', tensor(0)),
#  ('linear2.weight',
#   tensor([[ 0.1708, 0.4704, -0.0635, 0.2187],
#           [ 0.2336, -0.3569, -0.1928, -0.1566],
#           [ 0.4825, -0.4463, 0.3027, 0.4696],
#           [ 0.3953, 0.2131, 0.2226, -0.0267]])),
#  ('linear2.bias', tensor([ 0.2516, 0.4558, -0.1608, 0.4831])),
#  ('normalize2.weight', tensor([1., 1., 1., 1.])),
#  ('normalize2.bias', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_mean', tensor([0., 0., 0., 0.])),
#  ('normalize2.running_var', tensor([1., 1., 1., 1.])),
#  ('normalize2.num_batches_tracked', tensor(0)),
#  ('linear3.weight',
#   tensor([[ 0.0795, -0.3507, -0.3589, 0.1764]])),
#  ('linear3.bias', tensor([-0.0705]))])

This completes the basic workflow of training and saving a model. An optimizer can be saved locally in the same way.
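As a rough sketch of the optimizer case, the example below saves an SGD optimizer's state dictionary and loads it into a freshly created optimizer. The single nn.Linear model, the momentum value and the file name sgd_optimizer.pt are assumptions made for illustration only.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

# Save the optimizer state dictionary to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'sgd_optimizer.pt')
torch.save(optimizer.state_dict(), path)

# A fresh optimizer created with different hyperparameters
restored = torch.optim.SGD(model.parameters(), lr=0.001)
restored.load_state_dict(torch.load(path))

# The saved hyperparameters replace those of the fresh optimizer
restored_lr = restored.state_dict()['param_groups'][0]['lr']
restored_momentum = restored.state_dict()['param_groups'][0]['momentum']
print(restored_lr, restored_momentum)
```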

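Recalling the deep copy concept introduced earlier, could we instead keep a deep copy of the model parameters in the current workspace and later use it to replace the trained parameters? This is left as an exercise.

Next, we use the lr_scheduler classes in the optim module to adjust the optimizer's learning rate dynamically.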
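One possible answer to this exercise is sketched below. It assumes a single nn.Linear layer and random data in place of the course model and dataset, and it uses copy.deepcopy from the standard library to take the in-memory snapshot.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)

# Deep copy: every tensor in the dictionary is duplicated, not referenced
backup = copy.deepcopy(model.state_dict())

# One SGD step on random data so that the parameters move
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
X, y = torch.randn(8, 3), torch.randn(8, 1)
loss = nn.MSELoss()(model(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The model has changed, the backup has not
moved = not torch.equal(model.weight, backup['weight'])

# Roll the model back to the snapshot
model.load_state_dict(backup)
restored = torch.equal(model.weight, backup['weight'])
print(moved, restored)
```

Unlike a plain assignment of state_dict(), the deep copy does not follow later parameter updates, so it can serve as a restore point without writing a file.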

II. Basic Usage of LambdaLR

A class that makes the optimizer adjust its learning rate dynamically is called a learning rate scheduler class, and an instance of it is called a learning rate scheduler. Among all schedulers, LambdaLR is the simplest and most flexible, and also the most general.
To schedule the learning rate with LambdaLR, we first prepare a lambda (anonymous) function, for example:

lr_lambda = lambda epoch: 0.5 ** epoch

Here we create an anonymous function with lambda. It takes one argument, which we normally treat as the number of epochs completed. This particular function is very simple: it returns 0.5 raised to the power epoch.

# First epoch
lr_lambda(0)
# 1.0

# Second epoch
lr_lambda(1)
# 0.5

Note that epoch generally starts from 0, and the function used for scheduling must not return 0 when its argument is 0, since that would make the initial learning rate 0.

With the lambda function ready, we next instantiate a LambdaLR scheduler. Every scheduler works by modifying an optimizer, so we also need to create the corresponding optimizer (which training requires anyway). The optimizer needs no special settings; the link between the optimizer and the scheduler is established when the scheduler is created.

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model1 = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

# Create the optimizer
optimizer = torch.optim.SGD(tanh_model1.parameters(), lr=0.05)

# Inspect the optimizer
optimizer.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.05,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}

# Create the learning rate scheduler
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda)

Note that creating a LambdaLR scheduler requires both a lambda function and the optimizer it is attached to. Once the scheduler has been created, we can look at the state of optimizer again.

optimizer.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.05,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'initial_lr': 0.05,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}

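The optimizer's parameter group now contains an extra 'initial_lr' entry. It is the initial learning rate, that is, the learning rate we passed in when instantiating the optimizer. The lr entry still denotes the learning rate for the next iteration.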

With LambdaLR scheduling, the lr in the optimizer is adjusted as training proceeds according to:
$lr = lr\_lambda(epoch) \times initial\_lr$
When LambdaLR is first instantiated, epoch is 0, so the optimizer's lr at that point is:
$lr_0 = 0.5^0 \times 0.05 = 0.05$
From then on, every call to scheduler.step() increases epoch by 1. We can run the following experiment: once one epoch of training has finished, we call scheduler.step() to update the learning rate for the next epoch.

for X, y in train_loader:
    yhat = tanh_model1.forward(X)
    loss = criterion(yhat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
scheduler.step()

Note that in the training code above, the scheduler step sits outside the mini-batch gradient descent loop. This is because the learning rate is normally updated once per full pass over the training set (one epoch), not after every mini-batch.
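The effect of where scheduler.step() is placed can be checked with a toy setup. The function final_lr below is a hypothetical helper written for this illustration; it uses a single dummy parameter and an assumed 4 batches per epoch rather than the course's data loader.

```python
import torch
from torch import nn
from torch.optim import lr_scheduler


def final_lr(step_per_batch, epochs=3, n_batches=4):
    """Return the learning rate left in the optimizer after training."""
    param = nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=0.05)
    scheduler = lr_scheduler.LambdaLR(optimizer, lambda epoch: 0.5 ** epoch)
    for _ in range(epochs):
        for _ in range(n_batches):
            optimizer.zero_grad()
            (param ** 2).sum().backward()
            optimizer.step()
            if step_per_batch:
                scheduler.step()      # scheduler advanced after every mini-batch
        if not step_per_batch:
            scheduler.step()          # scheduler advanced once per epoch
    return optimizer.param_groups[0]['lr']


lr_epoch = final_lr(step_per_batch=False)   # 3 scheduler steps: 0.05 * 0.5 ** 3
lr_batch = final_lr(step_per_batch=True)    # 12 scheduler steps: 0.05 * 0.5 ** 12
print(lr_epoch, lr_batch)
```

Stepping per batch makes the argument of the lambda function count batches instead of epochs, so the same decay rule shrinks the learning rate far faster.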

optimizer.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.025,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'initial_lr': 0.05,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}

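The value of lr is now 0.025, which is the output of lr_lambda at epoch 1 multiplied by initial_lr:
$lr = 0.5^1 \times 0.05 = 0.025$
If the process above is wrapped in a loop (as in the fit function defined earlier), the next epoch of training uses a learning rate of 0.025.
This shows what scheduler.step() really does: it increases the argument of the lambda function by 1, multiplies the function's output by initial_lr, and passes the result to the optimizer as the learning rate for its next update.
We can also simply repeat optimizer.step() and scheduler.step() to go through this process of computing a new learning rate and handing it to the optimizer again and again.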

optimizer.zero_grad()
optimizer.step()
scheduler.step()

lr_lambda = lambda epoch: 0.5 ** epoch
lr_lambda(2) * 0.05
# 0.0125

optimizer.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.0125,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'initial_lr': 0.05,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}

ss = scheduler.state_dict()
ss
# {'base_lrs': [0.05],
#  'last_epoch': 2,
#  '_step_count': 3,
#  'verbose': False,
#  '_get_lr_called_within_step': False,
#  '_last_lr': [0.0125],
#  'lr_lambdas': [None]}

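As expected, at the third scheduler step (counting the initial step taken when the scheduler is instantiated, as _step_count shows), the lambda function returns $0.5^2$, and multiplying by initial_lr gives 0.0125.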
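The rule lr = lr_lambda(epoch) × initial_lr can also be checked end to end in a standalone sketch. It assumes a single dummy parameter in place of the course model; no gradients are ever computed, so optimizer.step() changes nothing and only the learning rate evolves.

```python
import torch
from torch import nn
from torch.optim import lr_scheduler

lr_lambda = lambda epoch: 0.5 ** epoch
param = nn.Parameter(torch.zeros(1))            # dummy parameter, assumed for illustration
optimizer = torch.optim.SGD([param], lr=0.05)
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Record the learning rate before and after each scheduler step
observed = [optimizer.param_groups[0]['lr']]
for _ in range(4):
    optimizer.step()       # optimizer first
    scheduler.step()       # then the scheduler
    observed.append(optimizer.param_groups[0]['lr'])

# Values predicted by the formula for epoch = 0, 1, 2, 3, 4
initial_lr = optimizer.param_groups[0]['initial_lr']
expected = [lr_lambda(epoch) * initial_lr for epoch in range(5)]
print(observed)
print(expected)
```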

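Note that PyTorch expects the optimizer's step to come before the scheduler's step, so mind the order. Also, the gradients held by the optimizer were zeroed beforehand in the code above so that this experiment would not end up modifying the model's parameters (with zero gradients, the optimizer step leaves the parameters unchanged).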
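That clearing the gradients protects the parameters can be confirmed with a small sketch. It assumes a single nn.Linear layer and plain SGD without momentum or weight decay, as in the experiment above.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Produce gradients first, as if a mini-batch had just been processed
nn.MSELoss()(model(torch.randn(8, 3)), torch.randn(8, 1)).backward()
has_grad = model.weight.grad is not None

weights_before = model.weight.detach().clone()
optimizer.zero_grad()   # clear the gradients (reset to zero or None, depending on the version)
optimizer.step()        # nothing to apply, so the parameters stay as they are
unchanged = torch.equal(model.weight, weights_before)
print(has_grad, unchanged)
```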

Halving the learning rate every epoch is in fact very aggressive. We can plot the learning rate to see how it decays.

import matplotlib.pyplot as plt

# Create the optimizer
optimizer = torch.optim.SGD(tanh_model1.parameters(), lr=0.05)

# Create the learning rate scheduler
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda)

optimizer.state_dict()['param_groups'][0]['lr']
# 0.05

# Record the learning rate after each scheduler step
lr_l = [0.05]
for i in range(10):
    optimizer.step()
    scheduler.step()
    lr = optimizer.state_dict()['param_groups'][0]['lr']
    lr_l.append(lr)

plt.plot(lr_l)
plt.xlabel('epoch')
plt.ylabel('Learning rate')

Next, we slow down the decay rate and run a modeling experiment with learning rate scheduling.

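III. LambdaLR Learning Rate Scheduling Experiment

1. Preparation and Lambda Function Definition

Before the experiment, we need to rewrite the fit_rec function defined earlier so that the new function includes the learning rate scheduling steps.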

def fit_rec_sc(net, criterion, optimizer, train_data, test_data, scheduler,
               epochs=3, cla=False, eva=mse_cal):
    """Model training function with learning rate scheduling
    (records the evaluation metric after every pass over the data).

    :param net: the model to train
    :param criterion: loss function
    :param optimizer: optimization algorithm
    :param train_data: training data
    :param test_data: test data
    :param scheduler: learning rate scheduler
    :param epochs: number of passes over the data
    :param cla: whether this is a classification problem
    :param eva: model evaluation method
    :return: model evaluation results
    """
    train_l = []
    test_l = []
    for epoch in range(epochs):
        net.train()
        for X, y in train_data:
            if cla:
                y = y.flatten().long()  # for classification, convert y to integers
            yhat = net.forward(X)
            loss = criterion(yhat, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # update the learning rate once per epoch
        net.eval()
        train_l.append(eva(train_data, net).detach())
        test_l.append(eva(test_data, net).detach())
    return train_l, test_l

As before, this function should be added to the torchLearning.py file. Next, we define a scheduling function with a much slower decay.

lr_lambda = lambda epoch: 0.95 ** epoch

# First epoch
lr_lambda(0)
# 1.0

# Second epoch
lr_lambda(1)
# 0.95

lr_lambda(100)
# 0.0059205292203339975

This is equivalent to decaying the learning rate by 5% every epoch.
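To get a feel for how slow a 5% decay per epoch is, the short calculation below estimates how many epochs it takes for the multiplier to fall to one half, and what learning rate remains after, say, 60 epochs. The initial learning rate of 0.05 follows the earlier examples; the 60-epoch horizon is chosen here only for illustration.

```python
import math

decay = 0.95
initial_lr = 0.05

# Number of epochs n at which 0.95 ** n equals 0.5
half_life = math.log(0.5) / math.log(decay)

# Learning rate after 60 epochs under lr = decay ** epoch * initial_lr
lr_after_60 = decay ** 60 * initial_lr
print(round(half_life, 2), lr_after_60)
```

The multiplier halves roughly every 13 to 14 epochs, compared with every single epoch for the 0.5 ** epoch rule.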

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model1 = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

# Create the optimizer
optimizer = torch.optim.SGD(tanh_model1.parameters(), lr=0.05)

# Create the learning rate scheduler
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda)

3. Model Training and Result Comparison

# Train the model
train_l, test_l = fit_rec_sc(net=tanh_model1,
                             criterion=nn.MSELoss(),
                             optimizer=optimizer,
                             train_data=train_loader,
                             test_data=test_loader,
                             scheduler=scheduler,
                             epochs=60,
                             cla=False,
                             eva=mse_cal)

plt.plot(train_l, label='train_mse')
plt.xlabel('epochs')
plt.ylabel('MSE')
plt.legend(loc=1)

A quick check of the final learning rate:

optimizer.state_dict()
# {'state': {},
#  'param_groups': [{'lr': 0.002303489949347597,
#    'momentum': 0,
#    'dampening': 0,
#    'weight_decay': 0,
#    'nesterov': False,
#    'initial_lr': 0.05,
#    'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}

lr_lambda(60) * 0.05
# 0.002303489949347597

We can continue the experiment and compare against results obtained with a constant learning rate.

Comparison with a constant learning rate of 0.03

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model1 = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

train_l3, test_l3 = fit_rec(net=tanh_model1,
                            criterion=nn.MSELoss(),
                            optimizer=optim.SGD(tanh_model1.parameters(), lr=0.03),
                            train_data=train_loader,
                            test_data=test_loader,
                            epochs=60,
                            cla=False,
                            eva=mse_cal)

plt.plot(train_l, label='train_l')
plt.plot(train_l3, label='train_l3')
plt.xlabel('epochs')
plt.ylabel('MSE')
plt.legend(loc=1)

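Compared with the model trained at a constant learning rate of 0.03, the model with learning rate scheduling converges to a better result, iterates more smoothly, and converges fairly quickly.

Comparison with a constant learning rate of 0.01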

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model1 = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

train_l1, test_l1 = fit_rec(net=tanh_model1,
                            criterion=nn.MSELoss(),
                            optimizer=optim.SGD(tanh_model1.parameters(), lr=0.01),
                            train_data=train_loader,
                            test_data=test_loader,
                            epochs=60,
                            cla=False,
                            eva=mse_cal)

plt.plot(train_l, label='train_l')
plt.plot(train_l3, label='train_l3')
plt.plot(train_l1, label='train_l1')
plt.xlabel('epochs')
plt.ylabel('MSE')
plt.legend(loc=1)

Compared with the model trained at a constant learning rate of 0.01, the model with learning rate scheduling again gives the better result.

Comparison with the learning rate scheduling model from Lesson 15.1

# Set the random seed
torch.manual_seed(24)

# Instantiate the model
tanh_model = net_class2(act_fun=torch.tanh, in_features=5, BN_model='pre')

# Create empty lists to hold the recorded results
train_mse = []
test_mse = []

# Training workflow that reads manually entered values
while input("Do you want to continue the iteration? [y/n]") == "y":  # ask whether to continue
    epochs = int(input("Number of epochs:"))       # passes over the data in the next round
    lr = float(input("Update learning rate："))      # learning rate for the next round
    train_l0, test_l0 = fit_rec(net=tanh_model,
                                criterion=nn.MSELoss(),
                                optimizer=optim.SGD(tanh_model.parameters(), lr=lr),
                                train_data=train_loader,
                                test_data=test_loader,
                                epochs=epochs,
                                cla=False,
                                eva=mse_cal)
    train_mse.extend(train_l0)
    test_mse.extend(test_l0)
# Do you want to continue the iteration? [y/n] y
# Number of epochs: 30
# Update learning rate： 0.03
# Do you want to continue the iteration? [y/n] y
# Number of epochs: 30
# Update learning rate： 0.01
# Do you want to continue the iteration? [y/n] n

plt.plot(train_l, label='train_l')
plt.plot(train_mse, label='train_mse')
plt.xlabel('epochs')
plt.ylabel('MSE')
plt.legend(loc=1)

Clearly, the model from the previous lesson is just the 0.03 model followed by the 0.01 model. Since both constant-learning-rate models underperform this lesson's model, the scheduling strategy from the previous lesson cannot do better either.
What is surprising is that after 60 epochs the final learning rate of the LambdaLR model is around 0.002, smaller than the 0.01 used above. The experiments show that going from a constant 0.03 to a constant 0.01 already makes the model's results noticeably worse, yet a dynamically adjusted learning rate reaches a better result while ending at an even smaller learning rate.

lr_lambda(60) * 0.05 #0.002303489949347597

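This suggests that the loss surface in high-dimensional space is far more complicated than one usually imagines. The channel does not necessarily get narrower the closer we are to the global minimum, and the learning rate at which iteration falls into a local-minimum trap is a relative notion rather than an absolute one. It is because of this complexity that the internal training of neural networks is so often regarded as a "black box", and that training decisions ultimately rest on the model's results. It is also why neural network optimization has produced many methods that are sound in their basic principles but lack a concrete theoretical proof of their effect.
As with Batch Normalization, introduced earlier, the effect of such methods cannot be proven concretely at the theoretical level, but users still need to build up usage and tuning experience on top of an understanding of the underlying principles. In the following lessons we will therefore keep introducing other learning rate scheduling methods, accumulate experience quickly through extensive practice, and look for ways to explain and understand the methods on the basis of more evidence.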
