New Ways for Optimizing Gradient Descent


The new era of machine learning and artificial intelligence is the deep learning era. It offers not only remarkable accuracy but also an enormous hunger for data. Using neural networks, functions of far greater complexity can be fitted to a given set of data points.

But there are a few specific details that make the experience of working with neural networks far better and easier to reason about.

Xavier Initialization

Let us assume that we are training a huge neural network. For simplicity, the bias (constant) term is zero and the activation function is the identity.

Under these conditions, we can write the gradient descent update equations and the expression of the target variable in terms of the weights of all the layers and the input a[0], as sketched below.

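The original post presents these equations as images; a minimal LaTeX reconstruction under the stated assumptions (zero bias, identity activation, layer weights W[l], learning rate α) would be:

```latex
\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} a^{[0]},
\qquad
W^{[l]} := W^{[l]} - \alpha \, \frac{\partial J}{\partial W^{[l]}}
```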

For ease of understanding, let us consider all the hidden-layer weights to be equal, i.e. W[1] = W[2] = ... = W[L-1] = W.

Here the last layer's weight is treated separately because it produces the output value; in the case of binary classification, the final activation may be a sigmoid or ReLU function rather than the identity.

When we substitute these equal weights into the expression of the target variable, we obtain a new expression for y, the prediction of the target variable, in which the shared weight appears raised to the power L-1 (see the sketch below).

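The original post shows the resulting expression as an image; under the equal-weights assumption it collapses to something like:

```latex
\hat{y} = W^{[L]} \, W^{\,L-1} \, a^{[0]}
```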

Let us consider two different situations for the weights.

In case 1, where the shared weight is slightly greater than 1, raising it to the power of L-1 in a very deep network makes the value of y explode. Conversely, in case 2, where the weight is slightly less than 1, the value of y becomes exponentially small. These are the exploding and vanishing gradient problems, respectively. They hurt the accuracy of gradient descent and demand much more time for training.

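A minimal NumPy sketch (my own illustration, not code from the original post) of how the output scale behaves in the two cases, assuming a shared diagonal weight matrix, identity activations and zero biases:

```python
import numpy as np

def forward(scale, n_layers=50, width=4, seed=0):
    """Propagate an input through L-1 identical linear layers."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)   # input a[0]
    W = scale * np.eye(width)        # shared hidden-layer weight matrix
    for _ in range(n_layers - 1):
        a = W @ a
    return np.linalg.norm(a)

print(forward(1.5))  # case 1: weights slightly > 1 -> output explodes
print(forward(0.5))  # case 2: weights slightly < 1 -> output vanishes
```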

To avoid these situations we need to initialize our weights more carefully and more systematically. One way of doing this is Xavier initialization.

If we consider a single neuron, as in logistic regression, the dimension of the weight vector is determined by the dimension of a single input example. Hence, with n input features, we can set the variance of each weight to 1/n. As the dimension of the input example increases, the dimension of the weight vector must increase with it in order to train the model.

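The intuition behind the 1/n choice (my own summary of the standard argument, not an equation taken from the original post): the neuron's pre-activation is a sum of n terms, so scaling the weight variance by 1/n keeps the output variance roughly constant regardless of n:

```latex
z = \sum_{i=1}^{n} w_i x_i,
\qquad
\operatorname{Var}(z) \approx n \,\operatorname{Var}(w_i)
\;\Rightarrow\;
\operatorname{Var}(w_i) = \frac{1}{n} \text{ keeps } \operatorname{Var}(z) \approx 1
```

(assuming independent, roughly unit-variance inputs).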

Once we apply this technique to deeper neural networks, the weight initialization for each layer can be expressed in terms of the number of units feeding into that layer, as sketched below.

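The original post shows this initialization as an image; a common NumPy sketch of the idea is below. The layer sizes are hypothetical, and the exact scaling convention varies (for instance, He initialization uses 2/n for ReLU activations):

```python
import numpy as np

def xavier_init(layer_sizes, seed=0):
    """Draw each layer's weights with variance 1 / n_prev; biases start at zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(1.0 / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = xavier_init([1024, 256, 64, 1])   # hypothetical layer sizes
print(params["W1"].std())                  # ~ sqrt(1/1024) ~ 0.031
```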

Similarly, there are various other ways to define the variance used to scale the randomly initialized weights.

改進梯度計算 (Improvising Gradient Computation)

Let us consider the function f(x) = x³ and compute its gradient at x = 1. We use this simple function because it makes the concept easy to follow and verify. By differentiation, f′(x) = 3x², so the slope of the function at x = 1 is 3.

Now, let us approximate the slope at x = 1 numerically. We evaluate the function at x = 1 + delta, where delta is a very small quantity (say 0.001), and take (f(1 + delta) - f(1)) / delta: the slope of the hypotenuse of the small (yellow) triangle in the original figure.

Hence the approximate slope is 3.003, with an error of 0.003. Now, let us define the approximation differently and calculate the slope again.

Now we calculate the slope of a bigger triangle whose boundaries are 1 - delta and 1 + delta, i.e. (f(1 + delta) - f(1 - delta)) / (2 · delta). Calculating the slope in this manner reduces the error dramatically, to about 0.000001.

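A small Python sketch (my own illustration, not code from the original post) that reproduces these numbers:

```python
def f(x):
    return x ** 3

x, delta = 1.0, 0.001

# One-sided (forward) difference: slope of the small triangle.
forward = (f(x + delta) - f(x)) / delta

# Two-sided (centered) difference: slope of the bigger triangle
# spanning x - delta to x + delta.
centered = (f(x + delta) - f(x - delta)) / (2 * delta)

true_slope = 3 * x ** 2                      # f'(x) = 3x^2
print(forward, abs(forward - true_slope))    # ~3.003,    error ~0.003
print(centered, abs(centered - true_slope))  # ~3.000001, error ~1e-6
```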

Hence, we can infer that defining the slope in this two-sided manner helps us approximate the gradient of a function more accurately. This demonstration shows how to improve the gradient computation, and thereby gradient descent itself.

One thing to note is that, although this way of computing gradients is more accurate, it increases the time required to calculate them, since it needs additional function evaluations.

Translated from: https://towardsdatascience.com/new-ways-for-optimizing-gradient-descent-42ce313fccae
