A Detailed Reading of the Fast R-CNN Paper


I have only just started working through the R-CNN series of papers, so if my understanding is off anywhere, corrections are very welcome!

Fast R-CNN

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

Fast R-CNN is a region-based convolutional network method for object detection. It builds on network architectures that prior work has already studied in depth, and it improves both speed and accuracy.

1. Introduction

Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches
(e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

Object detection is harder than image classification, and the multi-stage pipelines currently used to solve it are both slow and not particularly accurate.

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.

The current difficulties in object localization are: (1) the number of proposals is very large; (2) these proposals provide only rough localization and must be refined.

1.1 R-CNN and SPPnet

The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

The drawbacks of the original R-CNN are:

  • Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
  • R-CNN first fine-tunes a ConvNet on the proposals with a log loss, then feeds the learned ConvNet features into SVMs that act as object detectors, replacing the softmax classifier learned during fine-tuning; only in a third stage are the bounding-box regressors learned. This leads to many redundant steps and unnecessary computation.

  • Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
  • Training is expensive in both time and disk space.

  • Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).
  • Object detection is slow.

    R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10× to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.

    R-CNN is slow because it runs a full ConvNet forward pass for every proposal, so regions shared by overlapping proposals may be recomputed many times, wasting a great deal of computation. SPPnet instead computes the convolutional feature map once for the whole image and shares that result across the network; each proposal's features are then extracted from the shared map by max pooling the corresponding region into a fixed-size output that feeds the rest of the computation.

    SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

    SPPnet has drawbacks of its own. Although sharing the conv computation saves work, the pipeline of feature extraction, fine-tuning with a log loss, training SVMs, and fitting the bounding-box regressors is unchanged, and features are still written to disk. Moreover, its fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling layer, which caps the accuracy achievable with very deep networks.

    1.2. Contributions

    We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:

  • Higher detection quality (mAP) than R-CNN, SPPnet
  • Training is single-stage, using a multi-task loss
  • Training can update all network layers
  • No disk storage is required for feature caching
    Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
  • To address the problems above, Fast R-CNN offers the following advantages:
    1. Higher detection accuracy (mAP) than R-CNN and SPPnet;
    2. Training is single-stage rather than multi-stage, using a multi-task loss;
    3. Training can update all network layers, since computation is shared;
    4. No disk storage is needed to cache extracted features.

    2. Fast R-CNN architecture and training

    Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

    The input is an entire image together with a set of proposals. Several conv and max-pooling layers produce a conv feature map. The RoI of each object is then processed by an RoI pooling layer, which outputs a fixed-size feature vector. Each feature vector passes through fully connected layers and branches into two outputs: softmax probabilities over the classes, and bounding-box regression offsets.

    2.1. The RoI pooling layer

    The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

    The RoI pooling layer uses max pooling to convert the features inside any valid RoI into a small feature map with a fixed spatial extent of H×W (e.g., 7×7, the fixed-size output mentioned earlier), where H and W are layer hyper-parameters independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map, defined by a four-tuple (r, c, h, w) giving its top-left corner (r, c) and its height and width (h, w).

    As noted above, the conv features of the whole image are computed once; the RoI pooling layer then extracts a fixed-size feature for each RoI from the shared map, and the resulting feature vectors go to the two fully connected heads that learn object classification and box regression.

    RoI max pooling works by dividing the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].

    RoI max pooling divides the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W, then max-pools the values in each sub-window into the corresponding output grid cell. As in standard max pooling, pooling is applied independently to each feature-map channel. The RoI layer is simply the special case of the SPPnet spatial pyramid pooling layer with a single pyramid level.
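
    To make the pooling arithmetic concrete, here is a minimal numpy sketch of RoI max pooling over one RoI; it assumes the RoI coordinates (r, c, h, w) are already expressed in feature-map cells and lie inside the map, and all names are illustrative rather than taken from the official code.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI of a conv feature map into a fixed H x W grid.

    feature_map: array of shape (C, H_feat, W_feat)
    roi: (r, c, h, w) -- top-left corner (row, col), height, width,
         assumed to be given in feature-map coordinates and in bounds
    output_size: (H, W) layer hyper-parameters, e.g. (7, 7) for VGG16
    """
    C = feature_map.shape[0]
    r, c, h, w = roi
    H, W = output_size
    out = np.empty((C, H, W), dtype=feature_map.dtype)

    for i in range(H):              # sub-window rows, each roughly h/H high
        r0 = r + int(np.floor(i * h / H))
        r1 = r + int(np.ceil((i + 1) * h / H))
        for j in range(W):          # sub-window cols, each roughly w/W wide
            c0 = c + int(np.floor(j * w / W))
            c1 = c + int(np.ceil((j + 1) * w / W))
            # max over the sub-window, independently for each channel
            out[:, i, j] = feature_map[:, r0:max(r1, r0 + 1),
                                          c0:max(c1, c0 + 1)].max(axis=(1, 2))
    return out

# Example: a 512-channel feature map and one RoI pooled to 7x7
fmap = np.random.randn(512, 38, 50).astype(np.float32)
pooled = roi_max_pool(fmap, roi=(5, 8, 20, 30), output_size=(7, 7))
print(pooled.shape)  # (512, 7, 7)
```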

    2.2. Initializing from pre-trained networks

    We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

    The authors experiment with three pre-trained ImageNet networks, each with five max pooling layers and between five and thirteen conv layers. When a pre-trained network is used to initialize a Fast R-CNN network, it undergoes three transformations.

    First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).

    First, the last max pooling layer is replaced by an RoI pooling layer whose output size (H, W) is set to be compatible with the network's first fully connected layer (e.g., H = W = 7 for VGG16).

    Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).

    Second, the network's last fully connected layer and softmax (originally trained for 1000-way ImageNet classification) are replaced with the two sibling layers described above: a classifier over K + 1 categories and the category-specific bounding-box regressors.

    Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

    Third, the network is modified to take two inputs: a list of images and a list of RoIs in those images.

    2.3. Fine-tuning for detection

    Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

    Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let us explain why SPPnet cannot update the weights below the spatial pyramid pooling layer:

    The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

    The root cause is that each training RoI comes from a different image, so each RoI may have a very large receptive field, often covering the entire input image. Since the forward pass must process the whole receptive field, the training inputs become very large (often the entire image), which makes back-propagation through the SPP layer highly inefficient.

    We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

    The improvement builds on sharing the extracted features. In Fast R-CNN training, SGD mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs from each image. Crucially, RoIs from the same image share computation and memory in the forward and backward passes, so shrinking N reduces the per-mini-batch computation. For example, with N = 2 and R = 128, this scheme is roughly 64× faster than taking one RoI from each of 128 different images.
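
    A minimal sketch of the hierarchical sampling scheme, assuming a hypothetical roi_db mapping from image id to its candidate RoIs (the real implementation uses a roidb structure, but the sampling logic shown here is the same idea):

```python
import random

def sample_minibatch(roi_db, N=2, R=128):
    """Hierarchical sampling: pick N images, then R/N RoIs from each.

    roi_db: dict mapping image_id -> list of candidate RoIs for that image
            (a hypothetical structure; assumes each image has >= R/N proposals)
    """
    images = random.sample(list(roi_db.keys()), N)
    rois_per_image = R // N                 # 64 RoIs per image when N=2, R=128
    minibatch = []
    for img_id in images:
        rois = random.sample(roi_db[img_id], rois_per_image)
        # All 64 RoIs of img_id share a single conv forward/backward pass,
        # instead of 64 separate full-image passes (the R-CNN/SPPnet strategy),
        # which is where the roughly 64x speed-up comes from.
        minibatch.append((img_id, rois))
    return minibatch
```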

    One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

    In theory, RoIs from the same image are correlated, so convergence could be slower; in practice this does not happen, and good results are obtained with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

    In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

    Besides hierarchical sampling, Fast R-CNN streamlines training into a single fine-tuning stage that jointly optimizes the softmax classifier and the bounding-box regressors, instead of the earlier approach of training the softmax classifier, the SVMs, and the regressors in three separate stages. The components of this procedure (the loss, the mini-batch sampling strategy, back-propagation through the RoI pooling layer, and the SGD hyper-parameters) are described below.

    Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p0, ..., pK), over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.
    Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:
    L(p, u, t^u, v) = Lcls(p, u) + λ[u ≥ 1] Lloc(t^u, v),    (1)
    in which Lcls(p, u) = −log p_u is log loss for true class u.

    This paragraph describes the two final output layers: one for classification and one for bounding-box regression. The first outputs a discrete probability distribution over the K + 1 categories (per RoI), p = (p0, ..., pK); as usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second outputs bounding-box regression offsets t^k = (t^k_x, t^k_y, t^k_w, t^k_h) for each class k, using the parameterization of t^k given in

    R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014,

    in which t^k specifies a scale-invariant translation and a log-space height/width shift relative to an object proposal.
    Every training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v, and the multi-task loss L of Eq. (1) above is used to train classification and bounding-box regression jointly.

    The second task loss, Lloc, is defined over a tuple of true bounding-box regression targets for class u, v = (vx, vy, vw, vh), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence Lloc is ignored. For bounding-box regression, we use the loss
    Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smoothL1(t^u_i − v_i),    (2)
    in which
    smoothL1(x) = 0.5 x^2 if |x| < 1, and |x| − 0.5 otherwise,    (3)
    is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.
    The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets v_i to have zero mean and unit variance. All experiments use λ = 1.
    We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).

    For class u, the second loss Lloc is defined over the tuple of true bounding-box regression targets v = (vx, vy, vw, vh) and the predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h). The Iverson bracket [u ≥ 1] equals 1 when u ≥ 1 and 0 otherwise; by convention the background class is labeled u = 0, and for background RoIs there is no notion of a ground-truth box, so Lloc is ignored. For bounding-box regression the loss is Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smoothL1(t^u_i − v_i) (Eq. 2), with smoothL1(x) = 0.5 x^2 if |x| < 1 and |x| − 0.5 otherwise (Eq. 3). This L1 loss is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet: when the regression targets are unbounded, training with an L2 loss can require careful tuning of the learning rate to prevent exploding gradients, and Eq. (3) removes this sensitivity.

    The hyper-parameter λ in Eq. (1) balances the two task losses. The ground-truth regression targets v_i are normalized to zero mean and unit variance, and all experiments use λ = 1.

    One related paper uses a similar loss to train a class-agnostic object proposal network; unlike the approach here, it advocates a two-network system that separates localization from classification. OverFeat, R-CNN, and SPPnet also train classifiers and bounding-box localizers, but they use stage-wise training, which Section 5.1 shows is suboptimal for Fast R-CNN.
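
    The two losses are simple enough to write down directly. Below is a small numpy sketch of Eqs. (1)-(3) for a single RoI; the function names are mine, and the example values at the end are made up purely for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Robust L1 loss from Eq. 3: 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Eq. 1: L = L_cls + lambda * [u >= 1] * L_loc, for one labeled RoI.

    p   : softmax probabilities over K+1 classes (index 0 = background)
    u   : ground-truth class index
    t_u : predicted box offsets (tx, ty, tw, th) for class u
    v   : ground-truth regression targets (vx, vy, vw, vh)
    """
    l_cls = -np.log(p[u])                                      # log loss for the true class
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()   # Eq. 2
    return l_cls + lam * (1 if u >= 1 else 0) * l_loc

# Illustrative RoI: class 3 of K = 20, lambda = 1 as in all the paper's experiments
p = np.full(21, 0.01); p[3] = 0.8
print(multi_task_loss(p, u=3, t_u=[0.1, -0.2, 0.05, 0.3], v=[0.0, 0.0, 0.0, 0.0]))
```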

    Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.

    Mini-batch sampling. During fine-tuning, each SGD mini-batch is built from N = 2 images chosen uniformly at random (in practice, iterating over permutations of the dataset). The mini-batch size is R = 128, i.e., 64 RoIs sampled from each image. As in [9], 25% of the RoIs are taken from proposals whose IoU with a ground-truth box is at least 0.5; these are the foreground examples, labeled u ≥ 1. The remaining RoIs are sampled from proposals whose maximum IoU with ground truth lies in [0.1, 0.5); these are the background examples, labeled u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard negative mining. During training, images are horizontally flipped with probability 0.5; no other data augmentation is used.
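
    A rough numpy sketch of the foreground/background labeling and the 25%/75% split described above; it assumes the per-proposal max IoUs with ground truth have already been computed and that each image has at least one proposal in each IoU range, and all names are illustrative.

```python
import numpy as np

def label_and_sample_rois(proposal_ious, gt_classes, rois_per_image=64, fg_fraction=0.25):
    """Assign labels to proposals by their max IoU with ground truth and
    sample the foreground/background split used for one image's 64 RoIs.

    proposal_ious: (num_proposals,) max IoU of each proposal with any gt box
    gt_classes:    (num_proposals,) class index of the best-matching gt box
    """
    fg_inds = np.where(proposal_ious >= 0.5)[0]                  # foreground, u >= 1
    bg_inds = np.where((proposal_ious >= 0.1) &
                       (proposal_ious < 0.5))[0]                 # background, u = 0
    num_fg = min(int(round(fg_fraction * rois_per_image)), len(fg_inds))
    num_bg = min(rois_per_image - num_fg, len(bg_inds))
    fg = np.random.choice(fg_inds, num_fg, replace=False)
    bg = np.random.choice(bg_inds, num_bg, replace=False)
    keep = np.concatenate([fg, bg])
    labels = gt_classes[keep].copy()
    labels[num_fg:] = 0                                          # background label
    return keep, labels
```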

    Back-propagation through RoI pooling layers. Back-propagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.
    Let x_i ∈ R be the i-th activation input into the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r,j)}, in which i*(r, j) = argmax_{i' ∈ R(r,j)} x_{i'}. R(r, j) is the index set of inputs in the sub-window over which the output unit y_rj max pools. A single x_i may be assigned to several different outputs y_rj.
    The RoI pooling layer's backwards function computes the partial derivative of the loss function with respect to each input variable x_i by following the argmax switches:
    ∂L/∂x_i = Σ_r Σ_j [i = i*(r, j)] ∂L/∂y_rj.    (4)

    Back-propagation through RoI pooling layers. Back-propagation routes derivatives through the RoI pooling layer. For clarity, assume only one image per mini-batch (N = 1); the extension to N > 1 is straightforward because the forward pass treats each image independently.

    Let x_i ∈ R be the i-th activation input to the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r,j)}, where i*(r, j) = argmax_{i' ∈ R(r,j)} x_{i'} and R(r, j) is the index set of inputs in the sub-window over which output unit y_rj max-pools. A single x_i may be assigned to several different outputs y_rj.

    The backward function of the RoI pooling layer computes the partial derivative of the loss with respect to each input x_i by following the argmax switches: ∂L/∂x_i = Σ_r Σ_j [i = i*(r, j)] ∂L/∂y_rj (Eq. 4). In other words, for each mini-batch RoI r and each pooled output unit y_rj, the partial derivative ∂L/∂y_rj is accumulated into ∂L/∂x_i whenever i is the argmax selected for y_rj by max pooling. The derivatives ∂L/∂y_rj themselves have already been computed by the backward function of the layer on top of the RoI pooling layer.
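
    Eq. (4) amounts to scatter-adding each output gradient back onto the input activation that won the max. A small numpy sketch, assuming the argmax indices were recorded during the forward pass (names are illustrative):

```python
import numpy as np

def roi_pool_backward(d_out, argmax_idx, num_inputs):
    """Route gradients through RoI max pooling as in Eq. 4.

    d_out:      (num_rois, H*W) gradients dL/dy_rj from the layer above
    argmax_idx: (num_rois, H*W) flat index i*(r, j) of the input x_i that won
                the max in each pooling sub-window (saved in the forward pass)
    num_inputs: number of activations x_i feeding the RoI pooling layer
    """
    d_in = np.zeros(num_inputs, dtype=d_out.dtype)
    for r in range(d_out.shape[0]):            # sum over RoIs ...
        for j in range(d_out.shape[1]):        # ... and over pooled output units
            d_in[argmax_idx[r, j]] += d_out[r, j]   # accumulate where [i = i*(r, j)]
    return d_in
```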

    SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.

    SGD hyper-parameters. The fully connected layers for softmax classification and bounding-box regression are initialized from zero-mean Gaussians with standard deviations 0.01 and 0.001, respectively; biases are initialized to 0. All layers use a learning-rate multiplier of 1 for weights and 2 for biases, with a global learning rate of 0.001. When training on VOC07 or VOC12 trainval, SGD runs for 30k mini-batch iterations, after which the learning rate is lowered to 0.0001 for another 10k iterations. Larger datasets use more iterations, as described later. A momentum of 0.9 and a parameter decay of 0.0005 (on weights and biases) are used.
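
    As a rough illustration only, the hyper-parameters above could be wired into a plain momentum-SGD update like the following; the per-layer lr_mult convention mirrors Caffe, and the function and variable names are assumptions of mine rather than the paper's code.

```python
import numpy as np

# Values quoted from the paper's VOC07/VOC12 fine-tuning setup
base_lr, momentum, weight_decay = 0.001, 0.9, 0.0005

def init_fc(shape, std):
    """Zero-mean Gaussian init: std=0.01 for the cls head, 0.001 for the bbox head."""
    return np.random.normal(0.0, std, size=shape)

def sgd_step(param, grad, velocity, lr_mult):
    """One momentum-SGD update; weights use lr_mult=1, biases lr_mult=2,
    and decay is applied to both weights and biases as stated above."""
    lr = base_lr * lr_mult
    velocity[:] = momentum * velocity - lr * (grad + weight_decay * param)
    param += velocity
    return param
```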

    2.4. Scale invariance

    We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
    The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.

    Two ways of achieving scale-invariant object detection are explored: (1) "brute force" learning and (2) image pyramids, following the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing, so the network must learn scale-invariant detection directly from the training data.

    The multi-scale approach instead gives the network approximate scale invariance through an image pyramid. At test time, the pyramid is used to approximately scale-normalize each proposal. During multi-scale training, a pyramid scale is sampled at random each time an image is sampled, as a form of data augmentation. Due to GPU memory limits, multi-scale training is only used for the smaller networks.
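
    A tiny sketch of the multi-scale training augmentation: one pyramid scale is drawn at random each time an image is sampled. The concrete scale values below are an assumption (an SPPnet-style five-scale pyramid); the paper defers the exact settings to its experiments section.

```python
import random

# Hypothetical five-scale pyramid (shorter-side lengths in pixels); treat these
# exact values as an assumption, not a quote from the text above.
PYRAMID_SCALES = [480, 576, 688, 864, 1200]

def sample_training_scale():
    """Pick one pyramid scale per sampled image, as a form of data augmentation."""
    return random.choice(PYRAMID_SCALES)
```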

    3. Fast R-CNN detection

    Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (about 45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224^2 pixels in area.
    For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k | r) = p_k. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN.

    Once a Fast R-CNN network has been fine-tuned, detection amounts to little more than a forward pass (assuming the proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R proposals to score. At test time, R is typically around 2000, though larger cases (about 45k) are also considered. When an image pyramid is used, each RoI is assigned to the scale at which the scaled RoI is closest to 224^2 pixels in area.

    For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined box prediction). A detection confidence Pr(class = k | r) = p_k is assigned to r for each class k, and non-maximum suppression is then performed independently per class, using the algorithm and settings from R-CNN.
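
    The per-class post-processing is standard greedy non-maximum suppression. A minimal numpy version follows; the IoU threshold here is chosen for illustration and is not quoted from the text above.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS for one class: keep the highest-scoring box, drop boxes that
    overlap it by more than iou_thresh, and repeat on the remainder."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

# Run independently for each of the K classes, scoring box r with p_k
boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 160, 160]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))   # -> [0, 2]: the second box is suppressed by the first
```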

    3.1. Truncated SVD for faster detection

    For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].
    In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as
    W ≈ U Σ_t V^T    (5)

    using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, Σ_t is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.

    For whole-image classification, the time spent in the fully connected layers is small compared with the conv layers. For detection, by contrast, the number of RoIs to process is large, and nearly half of the forward-pass time goes into the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD.

    In this technique, the u × v weight matrix W of a layer is approximately factorized by SVD as W ≈ U Σ_t V^T (Eq. 5). Here U is a u × t matrix made up of the first t left-singular vectors of W, Σ_t is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix of the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), a large saving when t is much smaller than min(u, v). To compress the network, the single fully connected layer corresponding to W is replaced by two fully connected layers with no non-linearity between them: the first uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases of W). This simple compression gives a good speed-up when the number of RoIs is large.
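
    A short numpy sketch of the compression in Eq. (5): factor W with SVD, keep the top t singular values, and split the layer in two. The shapes and the value of t below are illustrative.

```python
import numpy as np

def truncated_svd_fc(W, b, t):
    """Split one u x v fully connected layer (y = W x + b) into two layers
    using the top-t singular values, as in Eq. 5: W ~= U Sigma_t V^T.

    Returns (W1, W2, b): the first new layer uses W1 = Sigma_t V^T (no biases),
    the second uses W2 = U with the original biases b.
    Parameter count drops from u*v to t*(u + v).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(S[:t]) @ Vt[:t, :]      # shape (t, v)
    W2 = U[:, :t]                        # shape (u, t)
    return W1, W2, b

# Example: compress a 1024 x 1024 fc layer to rank t = 256 and check the error
W = np.random.randn(1024, 1024).astype(np.float32)
b = np.zeros(1024, dtype=np.float32)
W1, W2, b2 = truncated_svd_fc(W, b, t=256)
x = np.random.randn(1024).astype(np.float32)
err = np.linalg.norm(W @ x - W2 @ (W1 @ x)) / np.linalg.norm(W @ x)
print(err)   # relative error of the rank-t approximation
```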

    4. Main results

    Three main results support this paper’s contributions:

  • State-of-the-art mAP on VOC07, 2010, and 2012
  • Fast training and testing compared to R-CNN, SPPnet
  • Fine-tuning conv layers in VGG16 improves mAP
  • Three main results support the paper's contributions:
    1. State-of-the-art mAP on VOC07, 2010, and 2012;
    2. Fast training and testing compared with R-CNN and SPPnet;
    3. Fine-tuning the conv layers in VGG16 improves mAP.

    5. Design evaluation

    The paper also uses controlled experiments to evaluate how various design choices affect training, including:
    5.1. Does multitask training help?
    5.2. Scale invariance: to brute force or finesse?
    5.3. Do we need more training data?
    5.4. Do SVMs outperform softmax?
    5.5. Are more proposals always better?
    5.6. Preliminary MS COCO results
