Effect of Gradient Descent Optimizers on Neural Net Training

co-authored with Apurva Pathak

Experimenting with Gradient Descent Optimizers

Welcome to another instalment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of using different optimizers:

  • How do different optimizers perform in practice?
  • How sensitive is each optimizer to parameter choices such as learning rate or momentum?
  • How quickly does each optimizer converge?
  • How much of a performance difference does choosing a good optimizer make?

To answer these questions, we evaluate the following optimizers:

  • Stochastic gradient descent (SGD)
  • SGD with momentum
  • SGD with Nesterov momentum
  • RMSprop
  • Adam
  • Adagrad
  • Cyclic Learning Rate

How are the experiments set up?

We train a neural net using different optimizers and compare their performance. The code for these experiments can be found on Github.

  • Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently-sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
  • Evaluation metric: we use the binary cross-entropy loss on the validation data as our primary metric to measure model performance.
Figure 1: Sample images from the Cats and Dogs dataset
  • Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool -> ReLU -> batch-norm -> dropout) operations repeatedly. Then, we flatten the output volume and feed it into two fully-connected layers (dense -> ReLU -> batch-norm) with 256 units each, and dropout after the first FC layer. Finally, we feed the result into a one-neuron layer with a sigmoid activation, resulting in an output between 0 and 1 that tells us whether the model predicts a cat (0) or dog (1). A code sketch of this setup is shown after this list.
Figure 2: Base model architecture (created with NN SVG)
  • Training: we use a batch size of 32 and the default weight initialization (Glorot uniform). The default optimizer is SGD with a learning rate of 0.01. We train until the validation loss fails to improve over 50 iterations.

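As a rough illustration, here is what the base model and training setup described above might look like in Keras. This is a sketch, not the authors' exact code (which lives in the linked repo): the input size, the number of convolutional blocks, the filter counts, and the conv-dropout rate are assumptions, while the layer ordering, the 256-unit FC layers, the 0.6 FC dropout, the batch size, the SGD settings, and the patience of 50 (interpreted here as epochs) follow the text. x_train, y_train, x_val, y_val are placeholders for the prepared data.

from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD

def conv_block(x, filters):
    # convolution -> max-pool -> ReLU -> batch-norm -> dropout
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Activation("relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(0.2)(x)  # conv dropout rate is an assumption

inputs = layers.Input(shape=(128, 128, 3))  # resized image size is an assumption
x = inputs
for filters in [32, 64, 128, 256]:  # block count and filter sizes are assumptions
    x = conv_block(x, filters)
x = layers.Flatten()(x)
x = layers.Dense(256)(x)  # first FC layer: dense -> ReLU -> batch-norm -> dropout
x = layers.Activation("relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.6)(x)  # 0.6 dropout after the first FC layer, as stated above
x = layers.Dense(256)(x)  # second FC layer: dense -> ReLU -> batch-norm
x = layers.Activation("relu")(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # 0 = cat, 1 = dog
model = models.Model(inputs, outputs)

model.compile(optimizer=SGD(learning_rate=0.01), loss="binary_crossentropy")

# stop once the validation loss fails to improve over 50 epochs
early_stop = EarlyStopping(monitor="val_loss", patience=50, restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=32, epochs=1000, callbacks=[early_stop])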

Stochastic Gradient Descent

We first start off with vanilla stochastic gradient descent. This is defined by the following update equation:

Figure 3: SGD update equation (w ← w − α · dw)

where w is the weight vector and dw is the gradient of the loss function with respect to the weights. This update rule takes a step in the direction of greatest decrease in the loss function, helping us find a set of weights that minimizes the loss. Note that in pure SGD, the update is applied per example, but more commonly it is computed on a batch of examples (called a mini-batch).

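As a minimal sketch of this rule in code (the 0.01 learning rate is just the default from our setup):

def sgd_step(w, dw, lr=0.01):
    # w: weight vector (e.g. a NumPy array); dw: mini-batch gradient of the loss w.r.t. w
    return w - lr * dw
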
How does learning rate affect SGD?

First, we explore how learning rate affects SGD. It is well known that choosing a learning rate that is too low will cause the model to converge slowly, whereas a learning rate that is too high may cause it to not converge at all.

Figure 4: Effect of learning rate on convergence (source: Jeremy Jordan's website)

To verify this experimentally, we vary the learning rate along a log scale between 0.001 and 0.1. Let’s first plot the training losses.

Figure 5: Training loss curves for SGD with different learning rates

We indeed observe that performance is optimal when the learning rate is neither too small nor too large (the red line). Initially, increasing the learning rate speeds up convergence, but after learning rate 0.0316, convergence actually becomes slower. This may be because taking a larger step may actually overshoot the minimum loss, as illustrated in figure 4, resulting in a higher loss.

Let’s now plot the validation losses.

Figure 6: Validation loss curves for SGD with different learning rates

We observe that validation performance suffers when we pick a learning rate that is either too small or too big. Too small (e.g. 0.001) and the validation loss does not decrease at all, or does so very slowly. Too large (e.g. 0.1) and the validation loss does not attain as low a minimum as it could with a smaller learning rate.

Let’s now plot the best training and validation loss attained by each learning rate*:

Figure 7: Minimum training and validation losses for SGD at different learning rates

The data above confirm the ‘Goldilocks’ theory of picking a learning rate that is neither too small nor too large, since the best learning rate (3.2e-2) is in the middle of the range of values we tried.

*Typically, we would expect the validation loss to be higher than the training loss, since the model has not seen the validation data before. However, we see above that the validation loss is surprisingly sometimes lower than the training loss. This could be due to dropout, since neurons are dropped only at training time and not during evaluation, resulting in better performance during evaluation than during training. The effect may be particularly pronounced when the dropout rate is high, as it is in our model (0.6 dropout on FC layers).

Best SGD validation loss

  • Best validation loss: 0.1899
  • Associated training loss: 0.1945
  • Epochs to converge to minimum: 535
  • Params: learning rate 0.032

SGD takeaways

  • Choosing a good learning rate (not too big, not too small) is critical for ensuring optimal performance on SGD.

Stochastic Gradient Descent with Momentum

Overview

SGD with momentum is a variant of SGD that typically converges more quickly than vanilla SGD. It is typically defined as follows:

Figure 8: Update equations for SGD with momentum (v ← βv − α · dw; w ← w + v)

Deep Learning by Goodfellow et al. explains the physical intuition behind the algorithm [0]:

Formally, the momentum algorithm introduces a variable v that plays the role of velocity — it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.

In other words, the parameters move through the parameter space at a velocity that changes over time. The change in velocity is dictated by two terms:

  • 𝛼, the learning rate, which determines to what degree the gradient acts upon the velocity
  • 𝛽, the rate at which the velocity decays over time

Thus, the velocity is an exponential average of the gradients, which incorporates new gradients and naturally decays old gradients over time.

One can imagine a ball rolling down a hill, gathering velocity as it descends. Gravity exerts force on the ball, causing it to accelerate or decelerate, as represented by the gradient term 𝛼 * dw. The ball also encounters viscous drag, causing its velocity to decay, as represented by 𝛽.

One effect of momentum is to accelerate updates along dimensions where the gradient direction is consistent. For example, consider the effect of momentum when the gradient is a constant c:

Figure 9: Change in velocity over time when the gradient is a constant c.

Whereas vanilla SGD would make an update of -αc each time, SGD with momentum accelerates over time, eventually reaching a terminal velocity that is 1/(1 − β) times the vanilla update (derived using the formula for an infinite geometric series). For example, if we set the momentum to β = 0.9, then the update eventually becomes 10 times as large as the vanilla update.

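We can sanity-check the terminal-velocity claim numerically. A tiny sketch, assuming the Figure 8 formulation v ← βv − α · dw with a constant gradient c:

alpha, beta, c = 0.01, 0.9, 1.0
v = 0.0
for _ in range(200):
    v = beta * v - alpha * c  # momentum update under a constant gradient
print(v)  # ~= -0.1 = -alpha * c / (1 - beta), ten times the vanilla step of -0.01
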
Another effect of momentum is that it dampens oscillations. For example, consider a case when the gradient zigzags and changes direction often along a certain dimension:

Figure 10: Momentum dampens oscillations (source: More on Optimization Techniques by Ekaba Bisong)

The momentum term dampens the oscillations because the oscillating terms cancel out when we add them into the velocity. This allows the update to be dominated by dimensions where the gradient points consistently in the same direction.

Experiments

Let’s look at the effect of momentum at learning rate 0.01. We try out momentum values [0, 0.5, 0.9, 0.95, 0.99].

Figure 11: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.01.

Above, we can see that increasing momentum up to 0.9 helps model training converge more quickly, since training and validation loss decrease at a faster rate. However, once we go past 0.9, we observe that training loss and validation loss actually suffer, with model training entirely failing to converge at momentum 0.99. Why does this happen? This could be because excessively large momentum prevents the model from adapting to new directions in the gradient updates. Another potential reason is that the weight updates become so large that it overshoots the minima. However, this remains an area for future investigation.

Do we observe the decrease in oscillation that is touted as a benefit of momentum? To measure this, we can compute an oscillation proportion for each update step — i.e. what proportion of parameter updates in the current update have the opposite sign compared to the previous update. Indeed, increasing the momentum decreases the proportion of parameters that oscillate:

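A sketch of how such an oscillation proportion can be computed for one step (the exact bookkeeping in the authors' code may differ):

import numpy as np

def oscillation_proportion(prev_update, curr_update):
    # fraction of parameters whose update flipped sign relative to the previous step
    flipped = np.sign(prev_update) * np.sign(curr_update) < 0
    return float(np.mean(flipped))
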
Figure 12: Effect of momentum on oscillation

What about the size of the updates — does the acceleration property of momentum increase the average size of the updates? Interestingly, the higher the momentum, the larger the initial updates but the smaller the later updates:

Figure 13: Effect of momentum on average update size

Thus, increasing the momentum results in taking larger initial steps but smaller later steps. Why would this be the case? This is likely because momentum initially benefits from acceleration, causing the initial steps to be larger. Later, the momentum causes oscillations to cancel out, which could make the later steps smaller.

One data point that supports this interpretation is the distance traversed per epoch (defined as the Euclidean distance between the weights at the beginning of the epoch and the weights at the end of the epoch). We see that even though larger momentum values take smaller later steps, they actually traverse more distance:

Figure 14: Distance traversed per epoch for each momentum value.

This indicates that even though increasing the momentum values causes the later update steps to become smaller, the distance traversed is actually greater because the steps are more efficient — they do not cancel each other out as often.

Now, let’s look at the effect of momentum on a small learning rate (0.001).

Figure 15: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.001.

Surprisingly, adding momentum at this small learning rate helps training converge when it previously did not! Now, let's look at a large learning rate.

Figure 16: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.1.

When the learning rate is large, increasing the momentum degrades performance, and can even result in the model failing to converge (see flat lines above corresponding to momentum 0.9 and 0.95).

Now, to generalize our observations, let’s look at the minimum training loss and validation loss across all learning rates and momentums:

Figure 17: Minimum training loss (left) and validation loss (right) at different learning rates and momentums. The minimum value in each row is highlighted in green.

We see that the learning rate and the momentum are closely linked: the higher the learning rate, the narrower the range of 'acceptable' momentum values (i.e. values that don't cause model training to diverge). Conversely, the higher the momentum, the narrower the range of acceptable learning rates.

Altogether, the behavior across all the learning rates suggests that increasing momentum has an effect akin to increasing the learning rate. It helps smaller learning rates converge (Figure 15) but may cause larger ones to diverge (Figure 16). This makes sense if we consider the terminal velocity interpretation from Figure 9: adding momentum can cause the updates to reach a terminal velocity much greater than the vanilla updates themselves.

Note, however, that this does not mean that increasing momentum is the same as increasing the learning rate — there are simply some similarities in terms of convergence/divergence behavior between increasing momentum and increasing the learning rate. More concretely, as we can see in Figures 12 and 13, momentum also decreases oscillations, and front-loads the large updates at the beginning of training — we would not observe the same behaviors if we simply increased the learning rate.

Alternative formulation of momentum

There is another way to define momentum, expressed as follows:

Figure 18: Alternative definition of momentum (v ← βv + (1 − β) · dw; w ← w − αv)

Andrew Ng uses this definition of momentum in his Deep Learning Specialization on Coursera. In this formulation, the velocity term is an exponentially moving average of the gradients, controlled by the parameter beta. The update is applied to the weights, with the size of the update controlled by the learning rate alpha. Note that this formulation is mathematically the same as the first formulation when expanded, except that all the terms are multiplied by 1-beta.

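Running both formulations on a constant gradient c makes the difference concrete. A small sketch (same α, β, and c as before):

alpha, beta, c = 0.01, 0.9, 1.0

# Formulation 1 (Figure 8): the velocity accumulates toward -alpha * c / (1 - beta)
v1 = 0.0
for _ in range(200):
    v1 = beta * v1 - alpha * c
print(v1)  # ~= -0.1, ten times the vanilla step

# Formulation 2 (Figure 18): the velocity is an exponential moving average of the gradient
v2 = 0.0
for _ in range(200):
    v2 = beta * v2 + (1 - beta) * c
print(-alpha * v2)  # ~= -0.01, the same size as a vanilla SGD step: no acceleration
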
How does this formulation of momentum work in practice?

Figure 19: Effect of momentum (alternative formulation) on training loss (left) and validation loss (right)

Surprisingly, using this alternative formulation, it looks like increasing the momentum actually slows down convergence!

Why would this be the case? This formulation of momentum, while dampening oscillations, does not enjoy the same benefit of acceleration that the other formulation does. If we consider a toy example where the gradient is always a constant c, we see that the velocity never accelerates:

Figure 20: Change in velocity over time with repeated gradients of constant c

Indeed, Andrew Ng suggests that the main benefit of this formulation of momentum is not acceleration, but the fact that it dampens oscillations, allowing you to use a larger learning rate and therefore converge more quickly. Based on our experiments, increasing momentum by itself (in this formulation) without increasing the learning rate is not enough to guarantee faster convergence.

Best validation loss on SGD with momentum

  • Best validation loss: 0.2046
  • Associated training loss: 0.2252
  • Epochs to converge to minimum: 402
  • Params: learning rate 0.01, momentum 0.5

SGD with momentum takeaways

  • Momentum causes model training to converge more quickly, but is not guaranteed to improve the final training or validation loss, based on the parameters we tested.
  • The higher the learning rate, the narrower the range of acceptable momentum values (ones where model training converges).

Stochastic Gradient Descent with Nesterov Momentum

One issue with momentum is that while the gradient always points in the direction of greatest loss decrease, the momentum may not. To correct for this, Nesterov momentum computes the gradient at a lookahead point (w + velocity) instead of w. This gives the gradient a chance to correct for the momentum term.

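A minimal sketch of one Nesterov step, assuming the Figure 8 momentum formulation; grad_fn is a placeholder for whatever function computes the gradient:

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    dw = grad_fn(w + beta * v)  # gradient at the lookahead point, not at w
    v = beta * v - lr * dw
    return w + v, v
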
Figure 21: Nesterov update. Left: illustration. Right: equations.

To illustrate how Nesterov can help training converge more quickly, let’s look at a dummy example where the optimizer tries to descend a bowl-shaped loss surface, with the minimum at the center of the bowl.

Figure 22. Left: regular momentum. Right: Nesterov momentum.

As the illustrations show, Nesterov converges more quickly because it computes the gradient at a lookahead point, thus ensuring that the update approaches the minimizer more quickly.

Let’s try out Nesterov on a subset of the learning rates and momentums we used for regular momentum, and see if it speeds up convergence. Let’s take a look at learning rate 0.001 and momentum 0.95:

Figure 23: Effect of Nesterov momentum at learning rate 0.001 and momentum 0.95.

Here, Nesterov does indeed seem to speed up convergence rapidly! How about if we increase the momentum to 0.99?

Figure 24: Effect of Nesterov momentum at learning rate 0.001 and momentum 0.99.

Now, Nesterov actually converges more slowly on the training loss, and though it initially converges more quickly on validation loss, it slows down and is overtaken by momentum after around 50 epochs.

How should we measure speed of convergence over all the training runs? Let's take the loss that regular momentum achieves after 50 epochs, then determine how many epochs Nesterov takes to reach that same loss. We define the convergence ratio as this number of epochs divided by 50. If it is less than one, then Nesterov converges more quickly than regular momentum; conversely, if it is greater, then Nesterov converges more slowly.

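A sketch of this metric, given the per-epoch validation-loss histories of the two runs (names are illustrative):

def convergence_ratio(nesterov_losses, momentum_losses, horizon=50):
    # epochs Nesterov needs to match the loss regular momentum reaches at `horizon` epochs
    target = momentum_losses[horizon - 1]
    epochs = next((i + 1 for i, loss in enumerate(nesterov_losses) if loss <= target), None)
    return None if epochs is None else epochs / horizon  # < 1 means Nesterov was faster
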
Figure 25. Ratio of epochs for Nesterov's loss to converge to the regular momentum's loss after 50 epochs. Training runs where Nesterov was faster are highlighted in green; slower runs in red; and runs where neither Nesterov nor regular momentum converged in yellow.

We see that in most cases (10/14) adding Nesterov causes the training loss to decrease more quickly, as seen in Figure 25. The same applies, to a lesser extent (8/12), to the validation loss.

There does not seem to be a clear relationship between the speedup from adding Nesterov and the other parameters (learning rate and momentum), though this can be an area for future investigation.

Best validation loss on SGD with Nesterov momentum

  • Best validation loss: 0.2020
  • Associated training loss: 0.1945
  • Epochs to converge to minimum: 414
  • Params: learning rate 0.003, momentum 0.95
Figure 26. Minimum training and validation losses achieved by each training run. The minimum in each row is highlighted in green.

SGD with Nesterov momentum takeaways

  • Nesterov momentum computes the gradient at a lookahead point in order to account for the effect of momentum.
  • Nesterov generally converges more quickly compared to regular momentum.

RMSprop

The main idea of RMSProp is to divide the gradient by an exponential average of its recent magnitude. The update equations are as follows:

Figure 27: RMSprop update equations (from the Deep Learning Specialization by Andrew Ng)

RMSprop tries to normalize the size of the updates across different weights — in other words, reducing the update size when the gradient is large, and increasing it when the gradient is small. As an example, consider a weight parameter where the gradients are [5, 5, 5] (and assume that 𝛼=1). The denominator in the second equation is then 5, so the updates applied would be -[1, 1, 1]. Now, consider a weight parameter where the gradients are [0.5, 0.5, 0.5]; the denominator would be 0.5, giving the same updates -[1, 1, 1] as the previous case! In other words, RMSprop cares more about the direction (+ or -) of each weight than the magnitude, and tries to normalize the size of the update step for each of these weights.

This is different from vanilla SGD, which applies larger updates for weight parameters with larger gradients. Considering the above example where the gradient is [5, 5, 5], we can see that the resulting updates would be -[5, 5, 5], whereas for the [0.5, 0.5, 0.5] case the updates would be -[0.5, 0.5, 0.5]. Vanilla SGD thus is different from RMSprop in that the larger the gradient, the larger the update.

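A per-parameter sketch of the RMSprop step (eps is the usual small stability constant, an addition for numerical safety). After repeated steps with a constant gradient c, sqrt(s) approaches |c|, so the update size approaches ±α regardless of the gradient's magnitude, matching the [5, 5, 5] vs. [0.5, 0.5, 0.5] example above:

import numpy as np

def rmsprop_step(w, dw, s, lr=0.001, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * dw ** 2  # exponential average of squared gradients
    w = w - lr * dw / (np.sqrt(s) + eps)  # normalize by the gradient's recent magnitude
    return w, s
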
How do learning rate and rho affect RMSprop?

Let’s try out RMSprop while varying the learning rate 𝛼 (default 0.001) and the coefficient 𝜌 (default 0.9). Let’s first try setting 𝜌 = 0 and vary the learning rate:

Figure 28: RMSprop training loss at different learning rates, with rho = 0.

First lesson learned — don’t use RMSProp with 𝜌=0! This results in the update being as follows:

Figure 29: RMSprop update when rho = 0 (the update reduces to −α · dw / |dw|, i.e. −α · sign(dw))

Let’s try again over nonzero rho values. We first plot the train and validation losses for a small learning rate (1e-3).

讓我們再次嘗試非零的rho值。 我們首先以小學(xué)習(xí)率(1e-3)繪制火車和驗(yàn)證損失。

Figure 30: RMSprop at different rho values, with learning rate 1e-3.

Increasing rho seems to reduce both the training loss and validation loss, but with diminishing returns — the validation loss ceases to improve when increasing rho from 0.95 to 0.99.

Let’s now take a look at what happens when we use a larger learning rate.

Figure 31: RMSprop at different rho values, with learning rate 3e-2.

Here, the training and validation losses entirely fail to converge!

Let’s take a look at the minimum training and validation losses across all parameters:

Figure 32: Minimum training loss (left) and minimum validation loss (right) on RMSprop across different learning rates and rho values. The minimum value in each row is highlighted in green.

From the plots above, we find that once the learning rate reaches 0.01 or higher, RMSprop fails to converge. Thus, the optimal learning rate found here is about ten times smaller than the optimal learning rate for SGD! One hypothesis is that the denominator term is much smaller than one, so it effectively scales up the update. Thus, we need to adjust the learning rate downward to compensate.

Regarding 𝜌, we can see from the graphs above that RMSprop performs best on our data with high 𝜌 values (0.9 to 1). Even though the Keras docs recommend using the default value of 𝜌 = 0.9, it's worth exploring other values as well: when we increased rho from 0.9 to 0.95, it substantially improved the best attained validation loss from 0.2226 to 0.2061.

Best validation loss on RMSprop

  • Best validation loss: 0.2061
  • Associated training loss: 0.2408
  • Epochs to converge to minimum: 338
  • Params: learning rate 0.001, rho 0.95

RMSprop takeaways

  • RMSprop seems to work at much smaller learning rates than vanilla SGD (about ten times smaller). This is likely because we divide the original update (dw) by the averaged gradient.
  • Additionally, it seems to pay off to explore different values of 𝜌, contrary to the Keras docs' recommendation to use the default value.

Adam

Adam is sometimes regarded as the optimizer of choice, as it has been shown to converge more quickly than SGD and other optimization methods [1]. It is essentially a combination of SGD with momentum and RMSprop. It uses the following update equations:

Figure 33: Adam update equations

Essentially, we keep a velocity term similar to the one in momentum: it is an exponential average of the gradient updates. We also keep a squared term, which is an exponential average of the squares of the gradients, similar to RMSprop. We correct both terms by (1 − beta^t); otherwise, the exponential averages would start off too low, since there are no previous terms to average over at the beginning. Then we divide the corrected velocity by the square root of the corrected square term, and use that as our update.

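A sketch of a single Adam step following this description; t is the 1-indexed step count used for the bias correction, and the epsilon placement follows the common convention:

import numpy as np

def adam_step(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dw  # velocity term, as in momentum
    v = beta2 * v + (1 - beta2) * dw ** 2  # squared term, as in RMSprop
    m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
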
How does learning rate affect Adam?

It has been suggested that the learning rate is more important than the β1 and β2 parameters, so let’s try varying the learning rate first, on a log scale from 1e-4 to 1:

Figure 34: Training loss (left) and validation loss (right) on Adam across learning rates.

We did not plot learning rates above 0.03, since they failed to converge. We see that as we increase the learning rate, the training and validation loss decrease more quickly — but only up to a certain point. Once we increase the learning rate beyond 0.001, the training and validation loss both start to become worse. This could be due to the ‘overshooting’ behavior illustrated in Figure 4.

So, which of the learning rates is the best? Let’s find out by plotting the best validation loss of each one.

Figure 35: Minimum training and validation loss on Adam across different learning rates.

We see that the validation loss on learning rate 0.001 (which happens to be the default learning rate) seems to be the best, at 0.2059. The corresponding training loss is 0.2077. However, this is still worse than the best SGD run, which achieved a validation loss of 0.1899 and training loss of 0.1945. Can we somehow beat that? Let’s try varying β1 and β2 and see.

How do β1 and β2 affect Adam?

We try the following values for β1 and β2:

beta_1_values = [0.5, 0.9, 0.95]
beta_2_values = [0.9, 0.99, 0.999]

Figure 36: Training loss (left) and validation loss (right) across different values for beta_1 and beta_2.

Figure 37: Minimum training losses (left) and minimum validation losses (right). The minimum value in each row is highlighted in green.

The best run is β1=0.5 and β2=0.999, which achieves a training loss of 0.2071 and validation loss of 0.2021. We can compare this against the default Keras params for Adam (β1=0.9 and β2=0.999), which achieves 0.2077 and 0.2059, respectively. Thus, it pays off slightly to experiment with different values of beta_1 and beta_2, contrary to the recommendation in the Keras docs — but the improvement is not large.

Surprisingly, we were not able to beat the best SGD performance! It turns out that others have noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. While the reasons for this are beyond the scope of this article, it suggests that it pays off to experiment with different optimizers to find the one that works the best for your data.

Best Adam validation loss

  • Best validation loss: 0.2021
  • Associated training loss: 0.2071
  • Epochs to converge to minimum: 255
  • Params: learning rate 0.001, β1=0.5, and β2=0.999

Adam takeaways

  • Adam is not guaranteed to achieve the best training and validation performance compared to other optimizers, as we found that SGD outperforms Adam on our data.
  • Trying out non-default values for β1 and β2 can slightly improve the model's performance.

Adagrad

Adagrad accumulates the squares of gradients, and divides the update by the square root of this accumulator term.

Figure 38: Adagrad update equation [3]

This is similar to RMSprop, but the difference is that it simply accumulates the squares of the gradients, without using an exponential average. This should result in the size of the updates decaying over time.

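A sketch of the Adagrad step; because the accumulator only grows, the effective step size shrinks over time:

import numpy as np

def adagrad_step(w, dw, acc, lr=0.01, eps=1e-8):
    acc = acc + dw ** 2  # accumulates forever: no exponential decay
    w = w - lr * dw / (np.sqrt(acc) + eps)
    return w, acc
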
Let’s try Adagrad at different learning rates, from 0.001 to 1.

Figure 39: Adagrad at different learning rates. Left: training loss. Right: validation loss.

The best training and validation loss are 0.2057 and 0.2310, using a learning rate of 3e-1. Interestingly, if we compare with SGD using the same learning rates, we notice that Adagrad keeps pace with SGD initially but starts to fall behind in later epochs.

使用3e-1的學(xué)習(xí)率,最佳訓(xùn)練和驗(yàn)證損失為0.2057和0.2310。 有趣的是,如果我們使用相同的學(xué)習(xí)率與SGD進(jìn)行比較,我們會注意到Adagrad最初與SGD保持同步,但在隨后的時代開始落后。

Figure 40: Adagrad vs SGD at the same learning rate. Left: training loss. Right: validation loss.

This is likely because Adagrad initially is dividing by a small number, since the gradient accumulator term has not accumulated many gradients yet. This makes the update comparable to that of SGD in the initial epochs. However, as the accumulator term accumulates more gradient, the size of the Adagrad updates decreases, and so the loss begins to flatten or even rise as it becomes more difficult to reach the minimizer.

Surprisingly, we observe the opposite effect when we use a large learning rate (3e-1):

Figure 41: Adagrad vs SGD at a large learning rate (0.316). Left: training loss. Right: validation loss.

At large learning rates, Adagrad actually converges more quickly than SGD! One possible explanation is that while large learning rates cause SGD to take excessively large update steps, Adagrad divides the updates by the accumulator terms, essentially making the updates smaller and more ‘optimal.’

Let’s look at the minimum training and validation losses across all params:

Figure 42: Minimum training and validation losses for Adagrad.

We can see that the best learning rate for Adagrad, 0.316, is significantly larger than that for SGD, which was 0.03. As mentioned above, this is most likely because Adagrad divides by the accumulator terms, causing the effective size of the updates to be smaller.

Best validation loss on Adagrad

  • Best validation loss: 0.2310
  • Associated training loss: 0.2057
  • Epochs to converge to minimum: 406
  • Params: learning rate 0.316

Adagrad takeaways

  • Adagrad accumulates the squares of gradients, then divides the update by the square root of the accumulator term.
  • The size of Adagrad updates decreases over time.
  • The optimal learning rate for Adagrad is larger than for SGD (at least 10x in our case).

Cyclic Learning Rate

Cyclic Learning Rate is a method that lets the learning rate vary cyclically between a min and max value [4]. It claims to eliminate the need to tune the learning rate, and can help the model training converge more quickly.

Figure 43: Cyclic learning rate using a triangular cycle

We try the cyclic learning rate with reasonable learning rate bounds (base_lr=0.1, max_lr=0.4), and a step size equal to 4 epochs, which is within the 4–8 range suggested by the author.

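A sketch of this triangular schedule, following the formula from the CLR paper [4], with the bounds and step size used here (step_size measured in epochs):

import numpy as np

def triangular_clr(epoch, base_lr=0.1, max_lr=0.4, step_size=4):
    # one full cycle takes 2 * step_size epochs: the lr ramps up, then back down
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = np.abs(epoch / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
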
Figure 44: Cyclic learning rate. Left: training loss. Right: validation loss.

We observe cyclic oscillations in the training loss, due to the cyclic changes in the learning rate. We also see these oscillations, to a lesser extent, in the validation loss.

Best CLR training and validation loss

  • Best validation loss: 0.2318
  • Associated training loss: 0.2267
  • Epochs to converge to minimum: 280
  • Params: used the settings mentioned above. However, we may be able to obtain better performance by tuning the cycle policy (e.g. by allowing the max and min bounds to decay) or by tuning the max and min bounds themselves. Note that this tuning may offset the time savings that CLR purports to offer.

CLR takeaways

  • CLR varies the learning rate cyclically between a min and max bound.
  • CLR may potentially eliminate the need to tune the learning rate while attaining similar performance. However, we did not attain similar performance.

Comparison

So, after all the experiments above, which optimizer ended up working the best? Let’s take the best run from each optimizer, i.e. the one with the lowest validation loss:

Figure 45: Best validation loss achieved by each optimizer.

Surprisingly, SGD achieves the best validation loss, and by a significant margin. Then, we have SGD with Nesterov momentum, Adam, SGD with momentum, and RMSprop, which all perform similarly to one another. Finally, Adagrad and CLR come in last, with losses significantly higher than the others.

What about training loss? Let’s plot the training loss for the runs selected above:

Figure 46: Training loss achieved by each optimizer for the best runs selected above.

Here, we see some correlation with the validation loss, but Adagrad and CLR perform better than their validation losses would imply.

What about convergence? Let’s first take a look at how many epochs it takes each optimizer to converge to its minimum validation loss:

Figure 47: Number of epochs for each optimizer to converge to its minimum validation loss.

Adam is clearly the fastest, while SGD is the slowest.

However, this may not be a fair comparison, since the minimum validation loss for each optimizer is different. How about measuring how many epochs it takes each optimizer to reach a fixed validation loss? Let’s take the worst minimum validation loss of 0.2318 (the one achieved by CLR), and compute how many epochs it takes each optimizer to reach that loss.

Figure 48: Number of epochs to converge to the worst minimum validation loss (0.2318, achieved by CLR).

Again, we can see that Adam does converge more quickly to the given loss than any other optimizer, which is one of its purported advantages. Surprisingly, SGD with momentum seems to converge more slowly than vanilla SGD! This is because the learning rate used by the best SGD with momentum run is lower than that used by the best vanilla SGD run. If we hold the learning rate constant, we see that momentum does in fact speed up convergence:

Figure 49: Comparing SGD and SGD with momentum.

As seen above, the best vanilla SGD run (blue) converges more quickly than the best SGD with momentum run (orange), since its learning rate is higher (0.03 vs. the latter's 0.01). However, when we hold the learning rate constant by comparing against vanilla SGD at learning rate 0.01 (green), we see that adding momentum does indeed speed up convergence.

Why does Adam fail to beat vanilla SGD?

As mentioned in the Adam section, others have also noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. To quote Vitaly Bushaev’s article on Adam, “after a while people started noticing that despite superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.” [2] Though the exact reasons are beyond the scope of this article, others have shown that Adam may converge to sub-optimal solutions, even on convex functions.

Conclusions

Overall, we can conclude that:

  • You should tune your learning rate — it makes a large difference in your model's performance, even more so than the choice of optimizer.
  • On our data, vanilla SGD performed the best, but Adam achieved performance that was almost as good, while converging more quickly.
  • It is worth trying out different values for rho in RMSprop and the beta values in Adam, even though Keras recommends using the default params.

Translated from: https://towardsdatascience.com/effect-of-gradient-descent-optimizers-on-neural-net-training-d44678d27060
