Deep Residual Learning for Image Recognition(ResNet)论文翻译及学习笔记
【論文翻譯】:Deep Residual Learning for Image Recognition
【論文來源】:Deep Residual Learning for Image Recognition
【翻譯人】:莫墨莫陌
Deep Residual Learning for Image Recognition
基于深度殘差學習的圖像識別
2016 IEEE Conference on Computer Vision and Pattern Recognition(2016年IEEE計算機視覺與模式識別會議)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
摘要
更深的神經網絡更難訓練。我們提出了一個殘差學習框架(residual learning framework)來簡化網絡的訓練,這些網絡比以前使用的網絡要深得多。我們顯式地將這些層重新表述為參照層輸入學習殘差函數(residual function),而不是學習無參照的函數。我們提供了全面的經驗證據,表明這些殘差網絡更容易優化,並且可以從大幅增加的深度中獲得準確率提升。在ImageNet數據集上,我們評估了深度多達152層的殘差網絡,它比VGG網絡[40]深8倍,但複雜度仍然更低。這些殘差網絡的集成在ImageNet測試集上取得了3.57%的錯誤率,該結果在ILSVRC 2015分類任務中獲得第一名。我們還給出了在CIFAR-10上對100層和1000層網絡的分析。
對於許多視覺識別任務來說,表徵的深度至關重要。僅憑藉極深的表徵,我們就在COCO目標檢測數據集上獲得了28%的相對提升。深度殘差網絡是我們提交ILSVRC和COCO 2015競賽的基礎,在這些競賽中我們還在ImageNet檢測、ImageNet定位、COCO檢測和COCO分割任務上獲得了第一名。
1 Introduction
1 引言
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21,49, 39]. Deep networks naturally integrate low/mid/highlevel features [49] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16]. Many other nontrivial visual recognition tasks [7, 11, 6, 32, 27] have also greatly benefited from very deep models.
深度卷積神經網絡[22,21]在圖像分類方面取得了一系列突破[21,49,39]。深度網絡以端到端的多層方式自然地集成了低/中/高層特徵[49]和分類器,並且特徵的「層級」可以通過堆疊的層數(深度)來豐富。最新證據[40,43]表明網絡深度至關重要,在具有挑戰性的ImageNet數據集[35]上的領先結果[40,43,12,16]都採用了「非常深」[40]的模型,深度為十六[40]到三十[16]層。許多其他重要的視覺識別任務[7,11,6,32,27]也從非常深的模型中受益匪淺。
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing /exploding gradients [14, 1, 8], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 8, 36, 12] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
在深度重要性的驅動下,出現了一個問題:學習更好的網絡是否像堆疊更多的層一樣容易?回答這個問題的一個障礙是眾所周知的梯度消失/爆炸問題[14,1,8],它從一開始就阻礙收斂。但是,這一問題已在很大程度上通過歸一化初始化[23,8,36,12]和中間歸一化層[16]得到解決,使得具有數十層的網絡能夠在基於反向傳播[22]的隨機梯度下降(SGD)下開始收斂。
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10, 41] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
當更深的網絡能夠開始收斂時,一個退化問題就暴露出來了:隨著網絡深度的增加,準確率達到飽和(這可能不足為奇),然後迅速退化。出乎意料的是,這種退化不是由過擬合引起的;正如[10,41]中所報道並由我們的實驗充分驗證的那樣,向適當深度的模型中添加更多層會導致更高的訓練誤差。圖1顯示了一個典型示例。
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
訓練準確率的退化表明並非所有系統都同樣容易優化。讓我們考慮一個較淺的體系結構,以及在其上添加更多層得到的較深的對應結構。對於這個較深的模型,存在一個通過構造得到的解:添加的層是恒等映射,其他層則從已學習的較淺模型中複製而來。這個構造解的存在表明,較深的模型產生的訓練誤差不應高於較淺的模型。但實驗表明,我們現有的求解器無法找到與該構造解相當或更好的解(或者無法在可行時間內做到)。
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
在本文中,我們通過引入深度殘差學習框架來解決退化問題。我們不是期望每幾個堆疊層直接擬合所需的基礎映射,而是顯式地讓這些層擬合殘差映射。形式上,將所需的基礎映射表示為H(x),我們讓堆疊的非線性層擬合另一個映射F(x):=H(x)-x,原始映射則重寫為F(x)+x。我們假設優化殘差映射比優化原始的、無參照的映射更容易。在極端情況下,如果恒等映射是最優的,那麼將殘差推向零要比用一疊非線性層去擬合恒等映射更容易。
The formulation of F(x)+x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 33, 48] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
F(x)+x的公式可通過具有“快捷連接”的前饋神經網絡來實現(圖2)。快捷連接[2、33、48]是跳過一層或多層的連接。在我們的例子中,快捷連接僅執行恒等映射,并將其輸出添加到堆疊層的輸出中(圖2)。恒等快捷連接既不增加額外的參數,也不增加計算復雜度。整個網絡仍然可以通過SGD反向傳播進行端到端訓練,并且可以使用通用庫(例如Caffe [19])輕松實現,而無需修改求解器。
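To make the building block concrete, here is a minimal sketch of a two-layer residual block. PyTorch is an assumption on my part (the paper's own experiments were implemented in Caffe); the class name and channel count are illustrative only. The stacked 3×3 convolutions play the role of F(x), and the identity shortcut adds the input back before the second nonlinearity.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers learning F(x); the output is F(x) + x followed by ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first nonlinearity, inside F
        out = self.bn2(self.conv2(out))            # F(x), no ReLU yet
        return self.relu(out + x)                  # identity shortcut, then the second nonlinearity

# quick shape check: the identity shortcut requires matching dimensions
y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))
assert y.shape == (1, 64, 56, 56)
```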
We present comprehensive experiments on ImageNet [35] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
我們在ImageNet[35]上進行了全面的實驗,以展示退化問題並評估我們的方法。我們表明:1)我們極深的殘差網絡很容易優化,而對應的「普通」網絡(簡單地堆疊層)在深度增加時訓練誤差更高;2)我們的深度殘差網絡可以很容易地從大幅增加的深度中獲得精度提升,產生的結果大大優於以前的網絡。
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
在CIFAR-10數據集[20]上也出現了類似的現象,這表明優化困難以及我們方法的效果並不僅限於某個特定的數據集。我們在這個數據集上成功地訓練了超過100層的模型,並探索了超過1000層的模型。
On the ImageNet classification dataset [35], we obtain excellent results by extremely deep residual nets. Our 152- layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [40]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
在ImageNet分類數據集[35]上,我們通過極深的殘差網絡得到了出色的結果。我們的152層殘差網絡是迄今在ImageNet上提出的最深的網絡,同時其複雜度仍低於VGG網絡[40]。我們的集成模型在ImageNet測試集上取得了3.57%的top-5錯誤率,並在ILSVRC 2015分類競賽中獲得第一名。極深的表徵在其他識別任務上也具有出色的泛化性能,使我們在ILSVRC和COCO 2015競賽中進一步贏得了多項第一名:ImageNet檢測、ImageNet定位、COCO檢測和COCO分割。這些有力的證據表明,殘差學習原理是通用的,我們期望它也適用於其他視覺和非視覺問題。
退化問題:隨著網絡深度的增加,準確率達到飽和然后迅速退化。即網絡達到一定層數后繼續加深模型會導致模型表現下降。
意外的是,這種退化并不是由過擬合造成的,也不是由梯度消失和爆炸造成的,在一個合理的深度模型中增加更多的層卻導致了更高的錯誤率。
解決思路:一是創造新的優化方法,二是化簡現有的優化問題。論文作者選擇了第二種方法,即通過重新表述問題,使求解更深的神經網絡模型變得更容易。
2 Related Work
2 相關工作
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
殘差表示。 在圖像識別中,VLAD[18]是一種相對於字典用殘差向量進行編碼的表示,Fisher Vector[30]可以表述為VLAD的概率版本[18]。它們都是用於圖像檢索和分類的強大的淺層表示[4,47]。對於向量量化,對殘差向量進行編碼[17]被證明比對原始向量進行編碼更有效。
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44, 45], which relies on variables that represent residual vectors between two scales. It has been shown [3, 44, 45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
在低級視覺和計算機圖形學中,為了求解偏微分方程(PDEs),廣泛使用的多網格方法[3]將系統重新形成為多個尺度的子問題,其中每個子問題負責粗尺度和細尺度之間的殘差解。多網格的替代方法是分層基礎預處理[44,45],它依賴于表示兩個尺度之間殘差矢量的變量。已經證明[3,44,45],這些求解器的收斂速度比不知道解決方案殘差性質的標準求解器要快得多。這些方法表明,良好的重構或預處理可以簡化優化過程。
Shortcut Connections. Practices and theories that lead to shortcut connections [2, 33, 48] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [33, 48]. In [43, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [38, 37, 31, 46] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [43], an “inception” layer is composed of a shortcut branch and a few deeper branches.
快捷連接。 導致快捷連接[2,33,48]的實踐和理論已經被研究了很長時間。訓練多層感知器(MLP)的一個早期實踐是添加一個從網絡輸入連接到輸出的線性層[33,48]。在[43,24]中,一些中間層被直接連接到輔助分類器,以解決梯度消失/爆炸問題。[38,37,31,46]等論文提出了通過快捷連接實現的對層響應、梯度和傳播誤差進行中心化的方法。在[43]中,一個「inception」層由一個快捷分支和幾個更深的分支組成。
Concurrent with our work, “highway networks” [41, 42] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
與我們的工作同時,「高速公路網絡」(highway networks)[41,42]提出了帶有門控函數[15]的快捷連接。與我們不帶參數的恒等快捷連接相反,這些門依賴於數據並帶有參數。當帶門的快捷連接「關閉」(趨近於零)時,高速公路網絡中的層表示的是非殘差函數。相反,我們的表述始終學習殘差函數;我們的恒等快捷連接永遠不會關閉,所有信息始終被傳遞,同時還要學習額外的殘差函數。另外,高速公路網絡尚未證明在深度極大增加時(例如超過100層)能夠獲得準確率提升。
3 Deep Residual Learning
3 深度殘差學習
3.1 Residual Learning
3.1 殘差學習
Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) - x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x. The original function thus becomes F(x)+x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
讓我們將H(x)視為由一些堆疊層(不一定是整個網絡)擬合的基礎映射,其中x表示這些層中第一層的輸入。如果假設多個非線性層可以漸近地逼近復雜函數2,則等效于假設它們可以漸近地近似殘差函數,即H(x)-x(假設輸入和輸出的維數相同)。因此,不是讓堆疊的層逼近H(x),而是明確讓這些層逼近殘差函數F(x):=H(x)-x。因此,原始函數變為F(x)+x。盡管兩種形式都應能夠漸近地逼近所需的函數(如假設),但學習的難易程度可能有所不同。
This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
這種重新表述的動機來自關於退化問題的反直覺現象(圖1,左)。正如我們在引言中討論的,如果添加的層可以被構造為恒等映射,那麼較深模型的訓練誤差應該不大於其較淺的對應模型。退化問題表明,求解器在用多個非線性層逼近恒等映射時可能存在困難。藉助殘差學習的重新表述,如果恒等映射是最優的,求解器可以簡單地將多個非線性層的權重驅向零來逼近恒等映射。
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
在實際情況下,恒等映射不太可能是最優的,但我們的重新表述可能有助於對問題進行預處理(precondition)。如果最優函數比起零映射更接近恒等映射,那麼對求解器來說,參照恒等映射尋找擾動應該比把函數當作全新的函數來學習更容易。我們通過實驗(圖7)表明,學到的殘差函數通常具有較小的響應,這說明恒等映射提供了合理的預處理。
3.2 Identity Mapping by Shortcuts
3.2 通過快捷方式進行恒等映射
We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:
y = F(x, {Wi}) + x.    (1)
Here x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F = W2σ(W1x) in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).
我們對每幾個堆疊的層採用殘差學習。構建塊如圖2所示。形式上,在本文中,我們考慮的構建塊定義為:
y = F(x, {Wi}) + x.    (1)
這裡的x和y是所考慮層的輸入和輸出向量。函數F(x, {Wi})表示要學習的殘差映射。對於圖2中具有兩層的示例,F = W2σ(W1x),其中σ表示ReLU[29],為簡化符號省略了偏置。F+x操作通過快捷連接和逐元素加法執行。在加法之後我們採用第二個非線性(即σ(y),見圖2)。
The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
公式(1)中,快捷連接既沒有引入額外的參數,也沒有引入計算復雜性。這不僅在實踐中具有吸引力,而且在我們比較普通網絡和殘差網絡時也很重要。我們可以公平地比較同時具有相同數量的參數、深度、寬度和計算成本(除了可以忽略的逐元素加法)的普通/殘差網絡。
The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions:
y = F(x, {Wi}) + Ws x.    (2)
在等式(1)中,x和F的維數必須相等。如果不是這種情況(例如,在更改輸入/輸出通道時),我們可以通過快捷連接執行線性投影Ws來匹配維度:
y = F(x, {Wi}) + Ws x.    (2)
We can also use a square matrix Ws in Eqn.(2). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus Ws is only used when matching dimensions.
在等式(2)中,我們也可以使用一個方陣Ws。但是我們將通過實驗證明,恒等映射足以解決退化問題並且更經濟,因此Ws僅在匹配維度時使用。
The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: y = W1x+x, for which we have not observed advantages.
殘差函數F的形式是靈活的。本文中的實驗涉及一個具有兩層或三層的函數F(圖5),而更多的層是可能的。但是,如果F僅具有一層,則等式(1)類似于線性層:y =W1x + x,對此我們沒有觀察到優勢。
We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function F(x, {Wi}) can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.
我們還注意到,盡管為簡化起見,上述符號是關于全連接層的,但它們也適用于卷積層。函數F(x,{Wi})可以表示多個卷積層。在兩個特征映射上逐個通道執行逐元素加法。
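When the dimensions of F(x) and x differ, Eqn.(2) uses a linear projection Ws on the shortcut. For convolutional feature maps, one common way to realize Ws is a 1×1 convolution, as in the hypothetical sketch below (again PyTorch; the names, channel sizes, and the stride-2 downsampling that mirrors the dotted-line shortcuts of Fig. 3 are my assumptions).

```python
import torch
import torch.nn as nn

class ProjectionShortcutBlock(nn.Module):
    """Residual block across a dimension change: y = F(x, {Wi}) + Ws x, with Ws a 1x1 conv (Eqn.(2))."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.f = nn.Sequential(                       # F(x): two 3x3 conv layers
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.ws = nn.Sequential(                      # Ws: 1x1 conv matching channels and spatial size
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.ws(x))

out = ProjectionShortcutBlock(64, 128)(torch.randn(1, 64, 56, 56))
assert out.shape == (1, 128, 28, 28)
```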
3.3 Network Architectures
3.3 網絡體系結構
We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.
Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [40] (Fig. 3,left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).
我們測試了各種普通/殘差網絡,並觀察到了一致的現象。為了提供討論的實例,我們描述ImageNet的兩個模型如下。
普通網絡。 我們的普通網絡基線(圖3,中)主要受VGG網絡[40](圖3,左)的理念啟發。卷積層大多具有3×3的濾波器,並遵循兩個簡單的設計規則:(1)對於相同的輸出特徵圖大小,各層具有相同數量的濾波器;(2)如果特徵圖大小減半,則濾波器數量加倍,以保持每層的時間複雜度。我們直接通過步幅為2的卷積層執行下採樣。網絡以一個全局平均池化層和一個帶softmax的1000路全連接層結束。圖3(中)中帶權層的總數為34。
It is worth noticing that our model has fewer filters and lower complexity than VGG nets [40] (Fig. 3, left). Our 34- layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).
Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
值得注意的是,我們的模型比VGG網絡[40]具有更少的過濾器和更低的復雜性(圖3,左)。我們的34層基準具有36億個FLOP(乘加),僅占VGG-19(196億個FLOP)的18%。
殘差網絡。 在上述普通網絡的基礎上,我們插入快捷連接(圖3,右),將網絡變成其對應的殘差版本。當輸入和輸出的維度相同時,可以直接使用恒等快捷連接(等式(1))(圖3中的實線快捷連接)。當維度增加時(圖3中的虛線快捷連接),我們考慮兩個選項:(A)快捷連接仍然執行恒等映射,對增加的維度用零進行填充,此選項不引入額外的參數;(B)使用等式(2)中的投影快捷連接來匹配維度(通過1×1卷積完成)。對於這兩個選項,當快捷連接跨越兩種尺寸的特徵圖時,以步幅2執行。
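Option A above (identity shortcut with zero-padding for the extra channels, performed with a stride of 2) can be sketched as a parameter-free function. This is an illustrative PyTorch sketch, not the authors' code; the tensor shapes are example values.

```python
import torch
import torch.nn.functional as F

def option_a_shortcut(x, out_channels):
    """Option A: parameter-free shortcut across a dimension change.
    Subsample the identity with stride 2 and pad the extra channels with zeros."""
    x = x[:, :, ::2, ::2]                    # stride-2 subsampling of the identity path
    pad = out_channels - x.shape[1]          # number of zero channels to append
    return F.pad(x, (0, 0, 0, 0, 0, pad))    # pad order: W, H, then channels

shortcut = option_a_shortcut(torch.randn(1, 64, 56, 56), 128)
assert shortcut.shape == (1, 128, 28, 28)
```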
3.4 Implementation
3.4 實施
Our implementation for ImageNet follows the practice in [21, 40]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [40]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [12] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60×10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [13], following the practice in [16].
我們在ImageNet上的實現遵循[21,40]中的做法。圖像的較短邊在[256,480]中隨機採樣並據此縮放,以進行尺度增強[40]。從圖像或其水平翻轉中隨機採樣224×224的裁剪塊,並減去每像素均值[21]。使用[21]中的標準顏色增強。按照[16],我們在每次卷積之後、激活之前採用批歸一化(BN)[16]。我們按照[12]中的方法初始化權重,並從頭開始訓練所有普通/殘差網絡。我們使用mini-batch大小為256的SGD。學習率從0.1開始,當誤差達到平台期時除以10,模型最多訓練60×10^4次迭代。我們使用0.0001的權重衰減和0.9的動量。按照[16]中的做法,我們不使用dropout[13]。
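The training recipe in this paragraph maps onto a standard SGD setup. The sketch below is an assumption about how one might reproduce it in PyTorch (the original work used Caffe); `model` is a stand-in module, and the plateau-based learning-rate drop is approximated with ReduceLROnPlateau.

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for a plain/residual net built as in Fig. 3
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,              # learning rate starts from 0.1
                            momentum=0.9,
                            weight_decay=1e-4)   # weight decay of 0.0001

# divide the learning rate by 10 when the error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# training-loop sketch (up to 60x10^4 iterations):
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   scheduler.step(current_error)   # call once per evaluation with the current error
```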
In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [40, 12], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
在測試中,為了進行比較研究,我們採用標準的10-crop測試[21]。為了獲得最佳結果,我們採用[40,12]中的全卷積形式,並在多個尺度上對分數取平均(圖像被縮放,使得較短邊在{224,256,384,480,640}中)。
4 Experiments
4 實驗
4.1 ImageNet Classification
4.1 ImageNet分類
We evaluate our method on the ImageNet 2012 classification dataset [35] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.
Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.
我們在ImageNet 2012分類數據集[35]上評估了我們的方法,該數據集包含1000個類。在128萬張訓練圖像上訓練模型,并在50k驗證圖像上進行評估。我們還將在測試服務器報告的10萬張測試圖像上獲得最終結果。我們評估了top-1和top-5的錯誤率。
普通網絡。 我們首先評估18層和34層普通網絡。34層普通網絡在圖3中(中)。18層普通網絡具有類似的形式。有關詳細架構,請參見表1。
The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.
表2中的結果表明,較深的34層普通網絡比較淺的18層普通網絡具有更高的驗證誤差。為了揭示原因,在圖4(左)中,我們比較了它們在訓練過程中的訓練/驗證誤差。我們觀察到了退化問題:即使18層普通網絡的解空間是34層普通網絡解空間的子空間,34層普通網絡在整個訓練過程中仍具有更高的訓練誤差。
We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error3. The reason for such optimization difficulties will be studied in the future.
我們認為,這種優化困難不太可能是由梯度消失引起的。這些普通網絡使用BN[16]進行訓練,可確保前向傳播的信號具有非零方差。我們還驗證了在使用BN時,反向傳播的梯度表現出健康的範數。因此,前向和反向信號都沒有消失。實際上,34層普通網絡仍然能夠達到有競爭力的準確率(表3),這表明求解器在一定程度上是有效的。我們推測,深的普通網絡可能具有指數級低的收斂速度,這影響了訓練誤差的降低。這種優化困難的原因將在未來研究。
Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.
殘差網絡。 接下來,我們評估18層和34層殘差網絡(ResNets)。基線架構與上述普通網絡相同,區別僅在於像圖3(右)那樣,在每對3×3濾波器之間添加了快捷連接。在第一個比較中(表2和圖4右),我們對所有快捷連接使用恒等映射,對增加的維度使用零填充(選項A)。因此,與對應的普通網絡相比,它們沒有額外的參數。
We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning – the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.
我們從表2和圖4中得到三個主要觀察結果。首先,情況在殘差學習下發生了逆轉:34層ResNet優於18層ResNet(好2.8%)。更重要的是,34層ResNet表現出低得多的訓練誤差,並且可以推廣到驗證數據。這表明在這種情況下退化問題得到了很好的解決,我們能夠從增加的深度中獲得準確率提升。
Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.
其次,與其對應的普通網絡相比,34層ResNet將top-1錯誤率降低了3.5%(表2),這得益於訓練誤差的成功降低(圖4右與左對比)。這一比較驗證了殘差學習在極深系統上的有效性。
Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.
最後,我們還注意到18層普通網絡和殘差網絡的準確率相當(表2),但18層ResNet收斂更快(圖4右與左對比)。當網絡「不是太深」時(這裡是18層),當前的SGD求解器仍然能夠為普通網絡找到好的解。在這種情況下,ResNet通過在早期提供更快的收斂來簡化優化。
Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.
恒等與投影快捷連接。 我們已經證明無參數的恒等快捷連接有助於訓練。接下來,我們研究投影快捷連接(等式(2))。在表3中,我們比較了三個選項:(A)零填充快捷連接用於增加維度,所有快捷連接都是無參數的(與表2和圖4右相同);(B)投影快捷連接用於增加維度,其他快捷連接是恒等的;(C)所有快捷連接都是投影。
Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.
表3顯示,這三個選項都比對應的普通網絡好得多。B比A稍好,我們認為這是因為A中零填充的維度確實沒有進行殘差學習。C比B略好,我們將其歸因於許多(十三個)投影快捷連接引入的額外參數。但A/B/C之間的細微差異表明,投影快捷連接對於解決退化問題並不是必不可少的。因此,為了減少內存/時間複雜度和模型大小,在本文的其餘部分中我們不使用選項C。恒等快捷連接對於不增加下面介紹的瓶頸架構的複雜度尤為重要。
Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design4. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.
更深的瓶頸架構。 接下來,我們描述用於ImageNet的更深的網絡。考慮到我們所能負擔的訓練時間,我們將構建塊修改為瓶頸設計4。對於每個殘差函數F,我們使用3層堆疊而不是2層(圖5)。這三層分別是1×1、3×3和1×1卷積,其中1×1層負責先減小再增加(恢復)維度,使3×3層成為具有較小輸入/輸出維度的瓶頸。圖5給出了一個示例,其中兩種設計具有相似的時間複雜度。
The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.
無參數的恒等快捷連接對於瓶頸架構尤為重要。如果將圖5(右)中的恒等快捷連接替換為投影,可以證明時間複雜度和模型大小都會翻倍,因為快捷連接連到了兩個高維端。因此,恒等快捷連接為瓶頸設計帶來了更高效的模型。
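A bottleneck block following the 1×1, 3×3, 1×1 pattern described above can be sketched as follows (hypothetical PyTorch code; the 256/64 channel numbers follow the 256-d example of Fig. 5). Note that the identity shortcut connects the two high-dimensional 1×1 ends, which is why replacing it with a projection would be expensive.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, with an identity shortcut around all three layers."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, 1, bias=False),   # 1x1: reduce dimensions
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, 1, bias=False),   # 1x1: restore dimensions
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # parameter-free identity shortcut between high-dim ends

y = BottleneckBlock(256, 64)(torch.randn(1, 256, 56, 56))
assert y.shape == (1, 256, 56, 56)
```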
50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.
50層ResNet: 我們將34層網絡中的每個2層塊替換為這種3層瓶頸塊,從而得到一個50層的ResNet(表1)。對於增加維度的情況我們使用選項B。該模型具有38億個FLOP。
101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).
101層和152層ResNet: 我們通過使用更多的3層塊來構建101層和152層ResNet(表1)。值得注意的是,儘管深度顯著增加,152層ResNet(113億個FLOP)的複雜度仍低於VGG-16/19網絡(153/196億個FLOP)。
The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).
50/101/152層ResNet比34層ResNet準確度高(表3和表4)。我們沒有觀察到退化問題,因此深度的增加大大提高了精度。所有評估指標都證明了深度的好處(表3和表4)。
Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.
與最新方法的比較。 在表4中,我們與以前最好的單模型結果進行了比較。我們的基準34層ResNet已經取得了非常有競爭力的準確率。我們的152層ResNet的單模型top-5驗證錯誤率為4.49%,這一單模型結果優於所有之前的集成結果(表5)。我們將六個不同深度的模型組合成一個集成模型(提交時只包含兩個152層模型),在測試集上取得了3.57%的top-5錯誤率(表5)。該參賽結果在ILSVRC 2015中獲得第一名。
4.2 CIFAR-10 and Analysis
4.2 CIFAR-10與分析
We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.
我們對CIFAR-10數據集[20]進行了更多研究,該數據集包含10個類別的50k訓練圖像和10k測試圖像。我們介紹在訓練集上訓練的實驗,并在測試集上進行評估。我們的重點是極度深度的網絡的行為,而不是推動最先進的結果,因此我們有意使用了如下的簡單架構。
The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:
普通/殘差架構遵循圖3(中/右)中的形式。網絡輸入為32×32的圖像,並減去每像素均值。第一層是3×3卷積。然後,我們在大小分別為{32,16,8}的特徵圖上使用共6n層帶3×3卷積的堆疊,每種特徵圖大小對應2n層。濾波器的數量分別為{16,32,64}。下採樣通過步幅為2的卷積執行。網絡以全局平均池化、一個10路全連接層和softmax結束。總共有6n+2個堆疊的帶權層。下表總結了該架構:
When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally 3n shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.
使用快捷連接時,它們連接到每一對3×3層上(總共3n個快捷連接)。在此數據集上,我們在所有情況下都使用恒等快捷連接(即選項A),因此我們的殘差模型與對應的普通模型具有完全相同的深度、寬度和參數數量。
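Putting the pieces together, the 6n+2 CIFAR-10 architecture can be sketched as below. This is an illustrative PyTorch reconstruction under the stated rules (3×3 convolutions, {16, 32, 64} filters, 2n blocks per feature-map size, option A shortcuts); the names and exact downsampling details are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockA(nn.Module):
    """3x3 + 3x3 residual block with a parameter-free (option A) shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.stride, self.pad = stride, out_ch - in_ch

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        if self.stride > 1:                       # subsample the identity to match spatially
            x = x[:, :, ::self.stride, ::self.stride]
        if self.pad > 0:                          # zero-pad the extra channels (option A)
            x = F.pad(x, (0, 0, 0, 0, 0, self.pad))
        return F.relu(out + x)

def cifar_resnet(n, num_classes=10):
    """6n+2 weighted layers: 1 conv + three stages of n blocks (2n conv layers each) + 1 FC."""
    layers = [nn.Conv2d(3, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16), nn.ReLU(inplace=True)]
    in_ch = 16
    for out_ch in (16, 32, 64):
        for i in range(n):
            stride = 2 if (out_ch != 16 and i == 0) else 1   # downsample when the width doubles
            layers.append(BlockA(in_ch, out_ch, stride))
            in_ch = out_ch
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

logits = cifar_resnet(n=3)(torch.randn(2, 3, 32, 32))   # n=3 gives the 20-layer net
assert logits.shape == (2, 10)
```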
We use a weight decay of 0.0001 and momentum of 0.9,and adopt the weight initialization in [12] and BN [16] but with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
我們使用0.0001的權重衰減和0.9的動量,採用[12]中的權重初始化和BN[16],但不使用dropout。這些模型在兩塊GPU上以128的mini-batch大小進行訓練。我們從0.1的學習率開始,在32k和48k次迭代時將其除以10,並在64k次迭代時終止訓練,這一設置是在45k/5k的訓練/驗證集劃分上確定的。我們按照[24]中的簡單數據增強進行訓練:在每側填充4個像素,並從填充後的圖像或其水平翻轉中隨機採樣32×32的裁剪塊。測試時,我們只評估原始32×32圖像的單一視圖。
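The CIFAR-10 training setup described above (pad 4 pixels, random 32×32 crop, horizontal flip, SGD with momentum 0.9 and weight decay 0.0001, learning rate divided by 10 at 32k and 48k iterations) could be expressed roughly as follows. This is a hedged sketch: the torchvision transforms are an assumption, and the channel-mean normalization only approximates the per-pixel mean subtraction used in the paper.

```python
import torch
from torchvision import transforms

# pad 4 pixels per side, random 32x32 crop, random horizontal flip;
# channel-wise normalization stands in for the paper's per-pixel mean subtraction (assumption)
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(1.0, 1.0, 1.0)),
])

model = torch.nn.Linear(10, 10)   # stand-in for the 6n+2 CIFAR net sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# divide the learning rate by 10 at 32k and 48k iterations, stop at 64k iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[32000, 48000], gamma=0.1)
# call scheduler.step() once per iteration so the milestones count iterations, not epochs
```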
We compare n = {3, 5, 7, 9}, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [41]), suggesting that such an optimization difficulty is a fundamental problem.
我們比較n={3,5,7,9},得到20、32、44和56層的網絡。圖6(左)顯示了普通網絡的行為。深的普通網絡受深度增加的影響,層數越深訓練誤差越高。這種現象類似於ImageNet(圖4,左)和MNIST(參見[41])上的現象,表明這種優化困難是一個根本性的問題。
We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging5. So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [34] and Highway [41] (Table 6),yet is among the state-of-the-art results (6.43%, Table 6).
我們進一步探索n=18,得到一個110層的ResNet。在這種情況下,我們發現0.1的初始學習率略大,難以開始收斂5。因此,我們用0.01來預熱訓練,直到訓練誤差低於80%(約400次迭代),然後回到0.1繼續訓練。其餘的學習計劃與之前相同。這個110層網絡收斂良好(圖6,中)。它的參數比FitNet[34]和Highway[41](表6)等其他又深又窄的網絡更少,但其結果仍屬於最先進之列(6.43%,表6)。
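The warm-up trick for the 110-layer net (train at 0.01 until the training error drops below 80%, then return to 0.1) is simple enough to sketch directly. The helper below is hypothetical and stateless; a real run would latch the switch once and keep the higher rate afterwards.

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the 110-layer CIFAR ResNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

def apply_warmup(optimizer, train_error_percent):
    """Warm up at lr=0.01; switch to 0.1 once the training error falls below 80%."""
    lr = 0.1 if train_error_percent < 80.0 else 0.01
    for group in optimizer.param_groups:
        group['lr'] = lr

apply_warmup(optimizer, train_error_percent=92.3)   # still warming up -> lr stays at 0.01
apply_warmup(optimizer, train_error_percent=74.8)   # below 80% -> lr becomes 0.1
```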
Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.
層響應分析。 圖7顯示了各層響應的標準差(std)。這些響應是每個3×3層在BN之後、其他非線性(ReLU/加法)之前的輸出。對於ResNet,這一分析揭示了殘差函數的響應強度。圖7顯示,ResNet的響應通常比其對應的普通網絡小。這些結果支持我們的基本動機(第3.1節),即殘差函數通常可能比非殘差函數更接近於零。我們還注意到,更深的ResNet具有更小的響應幅度,如圖7中ResNet-20、56和110之間的比較所示。當層數更多時,ResNet的單個層往往對信號的修改更少。
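The layer-response analysis of Fig. 7 (standard deviation of each 3×3 output after BN, before the nonlinearity) can be approximated with forward hooks. The sketch below is an assumption about one way to collect these statistics, not the authors' analysis code.

```python
import torch
import torch.nn as nn

def layer_response_stds(model, batch):
    """Collect the std of each BatchNorm output (the 3x3 responses after BN and before the
    nonlinearity), mirroring the analysis behind Fig. 7."""
    stds, hooks = [], []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(
                lambda m, inp, out: stds.append(out.std().item())))
    with torch.no_grad():
        model(batch)
    for h in hooks:
        h.remove()
    return stds

# usage sketch with a small stand-in net; the 6n+2 CIFAR sketch above works the same way
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
print(layer_response_stds(net, torch.randn(8, 3, 32, 32)))
```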
Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200 that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this 10^3-layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).
探索超過1000層的網絡。 我們探索了一個超過1000層的極深模型。我們設置n=200,得到一個1202層的網絡,並按上述方式訓練。我們的方法沒有表現出優化困難,這個10^3層的網絡能夠達到小於0.1%的訓練誤差(圖6,右)。其測試誤差仍然相當不錯(7.93%,表6)。
But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout [9] or dropout [13] is applied to obtain the best results ([9, 25, 24, 34]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.
但是,這種極深的模型仍然存在未解決的問題。儘管兩者的訓練誤差相近,這個1202層網絡的測試結果仍比我們110層網絡的差。我們認為這是由於過擬合:對於這個小數據集來說,1202層網絡可能過大(19.4M參數)。在此數據集上,為獲得最佳結果([9,25,24,34])通常會使用較強的正則化,例如maxout[9]或dropout[13]。在本文中,我們不使用maxout/dropout,只是簡單地通過設計又深又窄的架構來施加正則化,以免分散對優化困難這一重點的關注。但結合更強的正則化可能會改善結果,我們將在未來研究。
4.3 Object Detection on PASCAL and MS COCO
4.3 基于PASCAL和MS-COCO的目標檢測
Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [40] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO’s standard metric (mAP@[.5,.95]), which is a 28% relative improvement. This gain is solely due to the learned representations.
我們的方法在其他識別任務上具有良好的泛化性能。表7和8顯示了PASCAL VOC 2007和2012 [5]和COCO [26]上的對象檢測基線結果。我們采用Faster R-CNN [32]作為檢測方法。在這里,我們對用ResNet-101替換VGG-16 [40]的改進感興趣。使用這兩種模型的檢測實現方式(請參閱附錄)是相同的,因此只能將收益歸因于更好的網絡。最值得注意的是,在具有挑戰性的COCO數據集上,我們的COCO標準指標(mAP @ [.5,.95])增加了6.0%,相對提高了28%。該收益完全歸因于所學的表示。
Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.
基於深度殘差網絡,我們在ILSVRC和COCO 2015競賽的多個項目中獲得了第一名:ImageNet檢測、ImageNet定位、COCO檢測和COCO分割。詳細信息見附錄。
總結
殘差網絡事實上可以看作由多個較淺的網絡集成而成。它並沒有從根本上解決梯度消失問題,而是繞開了這個問題:這些較淺的路徑在訓練時不會出現梯度消失,因此殘差網絡能夠加速收斂。