Gradient Descent
Introduction
Gradient Descent is a first-order iterative optimization algorithm, where optimization, often in Machine Learning, refers to minimizing a cost function J(w) parameterized by the predictive model's parameters. By first-order we mean that Gradient Descent only takes the first derivative into account when performing updates of the parameters.
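Concretely, each first-order update moves the parameters a small step against the gradient of the cost, w ← w − α ∇J(w), where α is the learning rate that controls the size of the step.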
No matter what happens, at some point in your Machine Learning or Deep Learning journey you will hear about something called Gradient Descent. It is a vital piece of the puzzle for many Machine Learning algorithms, and I highly recommend that practitioners do not treat it as a black box.
In order to minimize the cost function, we aim to find the global minimum, which is quite feasible if the objective function is convex. In many scenarios, however, such as deep learning tasks, the objective function tends to be non-convex, so finding a sufficiently low value of the objective function (often a good local minimum) is regarded as an acceptable solution.
Figure 1: Convex function and non-convex function examples.
To find a local minimum of the function, we take steps proportional to the negative of the gradient of the function at the current point (Source: Wikipedia). Put simply, we start at a random point on the objective function and move in the negative gradient direction towards the global/local minimum.
Figure 2 (Source: MLextend).
There are many different adaptations that can be made to Gradient Descent to make it run more efficiently in different scenarios. Each adaptation has its own pros and cons, as we share below:
Batch Gradient Descent
Batch Gradient Descent sums the error over all observations on each iteration. In other words, Batch Gradient Descent calculates the error for each observation in the batch (remember, this is the full training data) and updates the predictive model only after all observations have been evaluated. A more technical way to say this is that Batch Gradient Descent performs parameter updates at the end of each epoch (one epoch refers to one pass through the entire training data).
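To make this concrete, here is a minimal NumPy sketch of Batch Gradient Descent for a linear model trained with mean squared error. The function name, learning rate, epoch count, and toy data are illustrative assumptions for this example, not something prescribed by the article or any particular library.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    """One parameter update per epoch, computed from the full training set."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        predictions = X @ w                         # evaluate ALL observations
        gradient = X.T @ (predictions - y) / m      # average MSE gradient over the full batch
        w -= lr * gradient                          # single update at the end of the epoch
    return w

# Toy usage: recover the weights of a noisy linear relationship.
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]  # bias column plus two features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)
print(batch_gradient_descent(X, y))                 # approaches true_w
```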
Figure 3: 2D representation of Batch Gradient Descent approaching and converging at the global minimum.
Pros
- More stable convergence and error gradient than Stochastic Gradient Descent
- Embraces the benefits of vectorization
- A more direct path is taken towards the minimum
- Computationally efficient, since only one update is performed per epoch
Cons
- Can converge at local minima and saddle points
- Slower learning, since an update is performed only after we go through all observations
Mini-Batch Gradient Descent
Whereas Batch Gradient Descent sums over all observations on each iteration, Mini-Batch Gradient Descent sums over a smaller number of samples (a mini-batch of the samples) on each iteration. This variant reduces the variance of the gradient estimate, since we sum over a designated number of samples (depending on the mini-batch size) on each update.
Note: This variation of Gradient Descent is often the recommended technique among Deep Learning practitioners, but we must consider that there is an extra hyperparameter, the batch size.
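A minimal sketch of the mini-batch variant, under the same assumed linear model and mean-squared-error setup as the batch example above; the `batch_size` argument is the extra hyperparameter the note refers to, while the shuffling scheme and default values are illustrative choices.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, lr=0.1, epochs=50, batch_size=32):
    """Several parameter updates per epoch: one per mini-batch."""
    m, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)                      # shuffle so mini-batches differ each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            gradient = Xb.T @ (Xb @ w - yb) / len(idx)  # gradient on this mini-batch only
            w -= lr * gradient                          # update after each mini-batch
    return w
```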
Figure 4: 2D representation of Mini-Batch Gradient Descent approaching the minimum. (Source: https://engmrk.com/mini-batch-gd/)
Pros
- Convergence is more stable than Stochastic Gradient Descent
- Computationally efficient
- Fast learning, since we perform more updates
Cons
- We have to configure the mini-batch size hyperparameter
Stochastic Gradient Descent
Stochastic Gradient Descent computes the error of an individual observation and performs an update to the model for each observation. This is the same as setting the number of mini-batches equal to m, where m is the number of observations (in other words, a batch size of 1).
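Setting the batch size to 1 gives the following sketch, again under the assumed linear model and mean-squared-error setup; the per-observation update is the defining feature, while the names and default values are illustrative.

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=20):
    """One parameter update per observation, i.e. a batch size of 1."""
    m, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):        # visit observations in random order
            error = X[i] @ w - y[i]         # error of a single observation
            w -= lr * error * X[i]          # immediate (and noisy) update
    return w
```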
Figure 5: 2D representation of Mini-Batch Gradient Descent approaching the minimum. (Source: https://engmrk.com/mini-batch-gd/)
Pros
- Only a single observation is processed at a time, so it is easier to fit into memory
- Will likely reach the neighborhood of the minimum (and begin to oscillate) faster than Batch Gradient Descent on a large dataset
- The frequent updates create plenty of oscillations, which can be helpful for getting out of local minima
Cons
- Can veer off in the wrong direction due to the frequent updates
- Loses the benefits of vectorization, since we process one observation at a time
- Frequent updates are computationally expensive, because each update processes only a single training sample
Wrap Up
Optimization is a major part of Machine Learning and Deep Learning. A simple and very popular optimization procedure employed by many Machine Learning algorithms is called Gradient Descent, and there are three ways we can adapt it to perform in a specific way that suits our needs.
Let’s continue the conversation on LinkedIn!
Translated from: https://towardsdatascience.com/gradient-descent-811efcc9f1d5