Why Deep Learning Ensembles Outperform Bayesian Neural Networks
Recently I came across an interesting paper titled "Deep Ensembles: A Loss Landscape Perspective" by Lakshminarayanan et al. In this article, I will break down the paper, summarise its findings, and delve into some of the techniques and strategies the authors used, which are useful for understanding models and their learning process. I will also go over some possible extensions to the paper. You can find my annotations on the paper down below.
The Theory
The authors conjectured (correctly) that deep ensembles (ensembles of deep learning models) outperform Bayesian neural networks because "popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space."
In simple words, when a Bayesian neural network is run from a single initialization, it will reach one of the modes of the loss landscape and stop. Deep ensembles will explore different modes, therefore reducing error when put into practice. In picture form:
Taken from the paper

Depending on its hyperparameters, a single run of a Bayesian network will find one of the paths (colors) and its mode. Therefore it won't explore the full set of solutions. On the other hand, a deep ensemble will explore all the paths, and therefore get a better understanding of the weight space (and the solutions it contains). To understand why this translates to better performance, consider the following illustration.
3 possible solutions. The area colored red is what each gets wrong

In the diagram, we have 3 possible solution spaces, corresponding to each of the trajectories. The optimized mode for each gives us a score of, say, 90%. Each mode is unable to solve a certain kind of problem (highlighted in red). A Bayesian network will get to either A, B, or C in a run, while a deep ensemble will be able to train over all 3.
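To make the contrast concrete, here is a minimal sketch of a deep ensemble: the same architecture is trained from several random seeds, and the softmax outputs are averaged at prediction time. The toy model and data below are my own placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

def train_member(seed, X, y, epochs=50):
    """Train one ensemble member from its own random initialization."""
    torch.manual_seed(seed)  # a different seed finds a different mode of the loss landscape
    model = nn.Sequential(nn.Linear(X.shape[1], 32), nn.ReLU(), nn.Linear(32, 3))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return model

def ensemble_predict(models, X):
    """Average the softmax outputs of all members, then take the argmax class."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(X), dim=1) for m in models]).mean(0)
    return probs.argmax(dim=1)

# Toy data: 200 points, 4 features, 3 classes (placeholders only).
X = torch.randn(200, 4)
y = torch.randint(0, 3, (200,))
ensemble = [train_member(seed, X, y) for seed in range(5)]
preds = ensemble_predict(ensemble, X)
```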
The Techniques
They proved their hypothesis using various strategies, which allowed them to approach the problem from several perspectives. I will go through the details of each below.
Cosine Similarity
Cosine similarity is defined as the "measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1." It is derived from the dot product between vectors. Imagine 3 texts: A, B, and C. A and C are large documents on a similar topic, and B is a very short summary of A. A and C might end up having a low Euclidean distance because they have a lot of overlapping words or phrases, while A and B will have a larger distance because of the difference in size. The cosine similarities would paint a different picture, however, since A and B would have a small angle (and thus high similarity) between them.
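As a quick illustration (my own sketch, not code from the paper), the cosine similarity between two flattened checkpoint weight vectors follows directly from the dot-product definition:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||) for two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical flattened weight vectors from two training checkpoints.
w_ckpt_5 = np.random.randn(10_000)
w_ckpt_30 = w_ckpt_5 + 0.1 * np.random.randn(10_000)  # a nearby point on the same trajectory

print(cosine_similarity(w_ckpt_5, w_ckpt_30))  # close to 1: same trajectory, similar weights
```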
This diagram is used to show that "checkpoints along a trajectory are largely similar both in the weight space and the function space." Checkpoint 30 and checkpoint 25 have high similarity (red), while checkpoints 30 and 5 have relatively low similarity (grey). An interesting thing to note is that the lowest labeled point is a similarity of 0.68. This goes to show how quickly models settle into a single mode during training.
Disagreement in Function Space
The opposite of the similarity

Disagreement is defined as "the fraction of points the checkpoints disagree on", or:

$$\mathrm{disagreement} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[f(x_i;\theta_1) \neq f(x_i;\theta_2)\big]$$

where $f(x;\theta)$ denotes the class label predicted by the network for input $x$. Disagreement is like the complement to the similarity scores and serves to showcase the difference between checkpoints along the same trajectory in a more direct manner. Both similarity and disagreement are calculated over a single run with a single model. The highest band for disagreement starts at 0.45, which shows that there is relatively low disagreement, consistent with the findings of the similarity map.
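A minimal sketch of this computation, assuming two checkpointed models evaluated on the same inputs (the names and sizes are placeholders):

```python
import numpy as np

def disagreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of inputs on which two checkpoints predict different class labels."""
    return float(np.mean(preds_a != preds_b))

# Hypothetical predicted labels from checkpoints 5 and 30 on the same 1,000 inputs.
labels_ckpt_5 = np.random.randint(0, 10, size=1000)
labels_ckpt_30 = np.random.randint(0, 10, size=1000)
print(disagreement(labels_ckpt_5, labels_ckpt_30))
```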
The use of disagreement and similarity shows that points along the same trajectory have very similar predictions. The third technique is used to prove that different trajectories can take very different paths, and thus each end up unable to solve certain kinds of problems.
Plotting Different Random Initializations Using t-SNE
This makes a comeback

As stated before, this diagram is used to show how different random initializations differ in function space. t-SNE is used to project high-dimensional data down into lower (human-understandable) dimensions. It is an alternative to PCA. Unlike PCA, t-SNE is non-linear and probabilistic in nature. It is also far more computationally expensive (especially with many samples and high dimensionality). In this context, t-SNE makes more sense than PCA because deep learning is not a linear process either, so t-SNE suits it better. The researchers applied some preprocessing to keep the costs down. In their words: "for each checkpoint we take the softmax output for a set of examples, flatten the vector and use it to represent the model's predictions. The t-SNE algorithm is then used to reduce it to a 2D point in the t-SNE plot. Figure 2(c) shows that the functions explored by different trajectories (denoted by circles with different colors) are far away, while functions explored within a single trajectory (circles with the same color) tend to be much more similar."
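A rough sketch of that preprocessing pipeline, assuming you already have per-checkpoint softmax outputs on a fixed evaluation set (all names and sizes here are my own placeholders):

```python
import numpy as np
from sklearn.manifold import TSNE

def checkpoint_embedding(softmax_outputs: np.ndarray) -> np.ndarray:
    """Flatten the (num_examples, num_classes) softmax matrix into one long
    vector that represents the checkpoint's predictions as a single point."""
    return softmax_outputs.reshape(-1)

# Hypothetical: 30 checkpoints, each with softmax outputs for 500 examples, 10 classes.
checkpoints = [np.random.dirichlet(np.ones(10), size=500) for _ in range(30)]
points = np.stack([checkpoint_embedding(c) for c in checkpoints])

# Reduce each checkpoint's prediction vector to a 2D point for plotting.
coords_2d = TSNE(n_components=2, perplexity=5).fit_transform(points)
```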
Subspace Sampling and Diversity Measurement
The last 2 things the paper implemented were subspace sampling and diversity measurement. Subspace sampling constructs additional solutions by sampling parameters from a subspace around each trained solution (for example, from a diagonal Gaussian, a low-rank Gaussian, or a dropout subspace). These sampled solutions can be combined into an ensemble that performs better than the base model alone. The details of the sampling methods are in the paper. The samples validated the results of the t-SNE analysis above, with different random initializations going down different paths.
3 different subspace sampling methods, all leading to distinct neighborhoods

The diversity score quantifies the difference between two functions (a base solution and a sampled one) by measuring the fraction of data points on which their predictions differ. This simple approach is enough to validate the premise.
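As a rough illustration of one such scheme (a diagonal Gaussian around the trained solution; the scales and sizes below are made up), noting that the diversity score is just the disagreement fraction from earlier applied to the base and sampled solutions:

```python
import numpy as np

def sample_diagonal_gaussian(theta_star: np.ndarray, sigma: np.ndarray, n: int) -> list:
    """Draw n parameter vectors from a diagonal Gaussian centered on the
    trained solution theta_star -- one style of subspace sampling."""
    return [theta_star + sigma * np.random.randn(theta_star.size) for _ in range(n)]

# Hypothetical trained weight vector and per-parameter standard deviations.
theta_star = np.random.randn(10_000)
sigma = 0.05 * np.abs(theta_star)

subspace_ensemble = sample_diagonal_gaussian(theta_star, sigma, n=10)
# Each sampled vector would be loaded back into the network; its diversity versus
# the base solution is the disagreement fraction defined in the previous section.
```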
Both the sampling and the accuracy-diversity plots are further proof of the hypothesis.
Combined, these techniques prove two things: points along a single training trajectory are largely similar in function space, while different random initializations explore diverse modes that make different mistakes.
Extensions
This paper was great at popping the hood on the learning processes of deep ensembles and Bayesian neural networks. An analysis of the learning curves and validation curves would have been interesting. Furthermore, it would be interesting to see how deep learning ensembles stack up against Random Forests or other ensembles. Performing a similar analysis on their learning processes would allow us to create mixed ensembles that might be good for solving complex problems.
Highlighted Paper
Below is the paper. I have highlighted what I thought was important and added definitions for some key concepts. Hope it helps.
Please leave your feedback on this article below. If this was useful to you, please share it and follow me here. Additionally, check out my YouTube channel, where I will be posting videos breaking down different concepts. I will also be streaming on Twitch, answering questions and having discussions; please leave a follow there. If you would like to work with me, email me at devanshverma425@gmail.com or reach out to me on LinkedIn.
翻譯自: https://medium.com/swlh/why-deep-learning-ensembles-outperform-bayesian-neural-networks-dba2cd34da24