How to compute the Hessian matrix of a neural network trained with momentum (a literature survey)
According to [4] ("Though results on the Hessian of individual layers were not included in this study"), it seems that each layer has its own corresponding Hessian matrix.
According to [5], the Hessian of the last layer is easy to compute, but for the layers before it the computation becomes much harder.
The following theoretical treatments of the Hessian matrix may be helpful, so I record them here first:
------------------------------------------
[6] explains very clearly how transposing the denominator affects the shape of the derivative, as follows:

For $x=\left(x_{1}, \dots, x_{N}\right)^{T}$:

$$\frac{\partial f(x)}{\partial x}=\begin{pmatrix}\frac{\partial f(x)}{\partial x_{1}}\\ \frac{\partial f(x)}{\partial x_{2}}\\ \vdots\\ \frac{\partial f(x)}{\partial x_{N}}\end{pmatrix}$$

$$\left(\frac{\partial f(x)}{\partial x}\right)^{T}=\frac{\partial f(x)}{\partial x^{T}}=\begin{pmatrix}\frac{\partial f(x)}{\partial x_{1}} & \frac{\partial f(x)}{\partial x_{2}} & \cdots & \frac{\partial f(x)}{\partial x_{N}}\end{pmatrix}$$

$$\frac{\partial^{2} f(x)}{\partial x\,\partial x^{T}}=\begin{pmatrix}\frac{\partial^{2} f(x)}{\partial x_{1}^{2}} & \frac{\partial^{2} f(x)}{\partial x_{1}\partial x_{2}} & \cdots & \frac{\partial^{2} f(x)}{\partial x_{1}\partial x_{N}}\\ \frac{\partial^{2} f(x)}{\partial x_{2}\partial x_{1}} & \frac{\partial^{2} f(x)}{\partial x_{2}^{2}} & & \vdots\\ \vdots & & \ddots & \vdots\\ \frac{\partial^{2} f(x)}{\partial x_{N}\partial x_{1}} & \cdots & \cdots & \frac{\partial^{2} f(x)}{\partial x_{N}^{2}}\end{pmatrix}$$
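As a concrete check of the layout convention above, here is a minimal pure-Python sketch (the function names and the toy $f$ are my own illustration, not from [6]) that builds the $N \times N$ matrix with entry $(i,j)$ equal to $\partial^{2} f/\partial x_{i}\partial x_{j}$ by central differences:

```python
def numerical_hessian(f, x, eps=1e-4):
    """N x N matrix H with H[i][j] ~= d^2 f / (dx_i dx_j):
    rows follow x, columns follow x^T, matching d^2 f / (dx dx^T)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # four-point central-difference stencil for the mixed partial
            xpp = list(x); xpm = list(x); xmp = list(x); xmm = list(x)
            xpp[i] += eps; xpp[j] += eps
            xpm[i] += eps; xpm[j] -= eps
            xmp[i] -= eps; xmp[j] += eps
            xmm[i] -= eps; xmm[j] -= eps
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * eps * eps)
    return H

# f(x) = x1^2 * x2 has Hessian [[2*x2, 2*x1], [2*x1, 0]]
f = lambda x: x[0] ** 2 * x[1]
H = numerical_hessian(f, [1.0, 2.0])
# expected approximately [[4, 2], [2, 0]]; note H[0][1] == H[1][0] (symmetry)
```

For a smooth $f$ the result is symmetric, which is a quick sanity check on the convention.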
------------------------------------------
Snipping tool used: Mathpix Snipping Tool. This is the first time I have seen it convert a screenshot inaccurately, sigh…
------------------------------------------
Equations (2.16)~(2.18) in [1] could not be verified.
In (3.1)~(3.3) a strange symbol $\delta$ appears without any explanation of its meaning.
The definition of $b_{ni}$ in (2.8) is odd. Comparing (2.15) with (2.12) shows that [1] is discussing a neural network for binary classification. The author of [1] could not be reached, so I eventually gave up reading it.
[3] uses a spring oscillator to model the repeated oscillation of a neural network during training, and argues from both the differential-equation and the difference-equation viewpoints why the momentum optimizer can accelerate convergence.
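To make the difference-equation viewpoint concrete, here is a small self-contained sketch of the heavy-ball updates $v_{t+1}=\mu v_{t}-\eta\nabla f(x_{t})$, $x_{t+1}=x_{t}+v_{t+1}$ on an ill-conditioned quadratic (the learning rate, $\mu$, and the test function are illustrative choices of mine, not taken from [3]):

```python
def momentum_descent(grad, x0, lr=0.04, mu=0.9, steps=300):
    # heavy-ball update: the velocity accumulates past gradients,
    # which damps oscillation along the steep axis and speeds up the flat one
    x = list(x0)
    v = [0.0] * len(x0)
    for _ in range(steps):
        g = grad(x)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x

# ill-conditioned quadratic f(x) = 0.5*(x1^2 + 25*x2^2), grad = (x1, 25*x2)
grad = lambda x: [x[0], 25.0 * x[1]]
x = momentum_descent(grad, [1.0, 1.0])
# converges close to the minimizer (0, 0)
```

The oscillatory decay of the iterates (complex eigenvalues of the two-step recurrence) is exactly the spring-oscillator behavior [3] analyzes.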
I contacted the author of [4]; the reply was that reproducing the results requires a large amount of Google's hardware plus dedicated scripts, that it cannot be done at home, and that even he no longer has the code.
------------------------------------------
"Hessian-free" means computing the product $Hv$ rather than computing $H$ directly, which avoids the enormous cost of forming $H$.
The goal of computing $H^{-1}v$ is to use it in the update term of the second-order Newton method when training a neural network.
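The $Hv$ idea can be sketched without any framework: two gradient evaluations give $Hv \approx (\nabla f(x+\epsilon v)-\nabla f(x-\epsilon v))/(2\epsilon)$, so the full $H$ is never materialized. A minimal pure-Python illustration (my own toy example, not code from any of the cited repos):

```python
def grad(f, x, eps=1e-5):
    # central-difference gradient of a scalar function f at point x
    g = []
    for i in range(len(x)):
        xp = list(x); xm = list(x)
        xp[i] += eps; xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def hessian_vector_product(f, x, v, eps=1e-4):
    # Hv ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps); H itself never formed
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    xm = [xi - eps * vi for xi, vi in zip(x, v)]
    gp, gm = grad(f, xp), grad(f, xm)
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]

# toy quadratic f(x) = 0.5 x^T A x with A = [[2,1],[1,3]], so Hv = A v exactly
A = [[2.0, 1.0], [1.0, 3.0]]
def f(x):
    return 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

hv = hessian_vector_product(f, [0.3, -0.7], [1.0, 2.0])
# exact answer is A v = [4.0, 7.0]
```

In practice the libraries below use exact autodiff (Pearlmutter's R-operator) instead of finite differences, but the cost structure is the same: each $Hv$ costs only a constant multiple of one gradient.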
------------------------------------------
################## The following GitHub links relate to hessian-free ####################################
The author of [7] stopped replying; abandoned.
https://github.com/drasmuss/hessianfree
The code here mainly implements conjugate gradient, and simply drops the Jacobian- and Hessian-related operations.
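For context, conjugate gradient is the natural partner of the $Hv$ trick: it solves $Hd=b$ using only matrix-vector products, never $H$ itself. A minimal sketch of my own (not this repo's code), assuming $H$ is symmetric positive definite:

```python
def conjugate_gradient(hvp, b, iters=50, tol=1e-10):
    # solve H x = b given only the product oracle hvp(v) = H v
    x = [0.0] * len(b)
    r = list(b)          # residual b - H x (x starts at 0)
    d = list(r)          # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        hd = hvp(d)
        alpha = rs / sum(di * hi for di, hi in zip(d, hd))
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, hd)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return x

# H = [[2,1],[1,3]] (SPD); solve H x = [1, 2]
hvp = lambda v: [2 * v[0] + v[1], v[0] + 3 * v[1]]
x = conjugate_gradient(hvp, [1.0, 2.0])
# exact solution is H^{-1} b = [0.2, 0.6]
```

This is why a hessian-free optimizer only needs an $Hv$ (or $Gv$) routine plus CG to get the Newton-style step $H^{-1}g$.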
https://github.com/NithinTangellamudi/HessianFreeImplementation
The code is full of syntax errors; abandoned.
☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆☆
The following are still under investigation:
#------------------------------------------------------------------------------------------
The code in [8] accompanies the paper [9].
The hessian-free part of the code is as follows:
I emailed the author asking for the reasoning behind it, but got no reply.
#------------------------------------------------------------------------------------------
The code in [10] is part of the paper [11] below.
The hessian-free part of the code is as follows:
def gauss_newton_product(cost, p, v, s):
    # this computes the product Gv = J'HJv (G is the Gauss-Newton matrix)
    Jv = T.Rop(s, p, v)
    HJv = T.grad(T.sum(T.grad(cost, s) * Jv), s,
                 consider_constant=[Jv], disconnected_inputs='ignore')
    Gv = T.grad(T.sum(HJv * s), p,
                consider_constant=[HJv, Jv], disconnected_inputs='ignore')
    Gv = map(T.as_tensor_variable, Gv)  # for CudaNdarray
    return Gv

I emailed the author asking for the reasoning behind it, but got no reply.
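What this Theano snippet computes can be checked by hand in the plain least-squares case, where the output-space Hessian $H$ is the identity and therefore $Gv=J^{T}(Jv)$. A pure-Python sketch (the toy Jacobian is my own, only illustrating that $G=J^{T}J$ is never formed):

```python
def gauss_newton_vector_product(J, v):
    # Gv = J^T (J v): two matvecs; the matrix G = J^T J is never built
    Jv = [sum(J[i][k] * v[k] for k in range(len(v))) for i in range(len(J))]
    return [sum(J[i][j] * Jv[i] for i in range(len(J))) for j in range(len(v))]

# toy Jacobian of a 3-output, 2-parameter model
J = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
gv = gauss_newton_vector_product(J, [1.0, -1.0])
# J^T J = [[10, 5], [5, 6]], so Gv = [10 - 5, 5 - 6] = [5, -1]
```

In the snippet above, `T.Rop` plays the role of `Jv` and the two `T.grad` calls with `consider_constant` implement the transposed products, with a general $H$ sandwiched in between.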
#------------------------------------------------------------------------------------------
[12] involves meta-learning.
References:
[1]Exact Calculation of the Hessian Matrix for the Multilayer Perceptron
[2]A fast procedure for re-training the multilayer perceptron
[3]On the Momentum Term in Gradient Descent Learning Algorithms
[4]Negative eigen values of the hessian in deep neural networks
[5]Most efficient way to calculate hessian of cost function in neural network
[6]https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470173862.app4
[7]https://github.com/moonl1ght/HessianFreeOptimization/issues/1
[8]https://github.com/doomie/HessianFree
[9]Improved Preconditioner for Hessian Free Optimization
[10]https://github.com/boulanni/theano-hf
[11]Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription
[12]https://github.com/ozzzp/MLHF
總結
以上是生活随笔為你收集整理的如何计算一个神经网络在使用momentum时的hessian矩阵(论文调研)的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 矩阵行列式的几何意义验证
- 下一篇: 牛顿法中为何出现hessian矩阵