Arcface v1 论文翻译与解读


神羅Noctis 2019-10-13 16:14:39
論文地址:http://arxiv.org/pdf/1801.07698v1.pdf
最新版本v3的論文翻譯:Arcface v3 論文翻譯與解讀

Arcface v1 論文的篇幅比較長,花費了本人3天的時間進行翻譯解讀,希望能夠幫助讀者更好地理解論文。

ArcFace: Additive Angular Margin Loss for Deep Face Recognition
目錄
Abstract

1. Introduction

2. From Softmax to ArcFace

2.1. Softmax

2.2.Weights Normalisation

2.3. Multiplicative Angular Margin

2.4. Feature Normalisation

2.5. Additive Cosine Margin

2.6. Additive Angular Margin

2.7. Comparison under Binary Case

2.8. Target Logit Analysis

3. Experiments

3.1. Data

3.1.1 Training data

3.1.2 Validation data

3.1.3 Test data

3.2. Network Settings

3.2.1 Input setting

3.2.2 Output setting

3.2.3 Block Setting

3.2.4 Backbones

3.2.5 Network Setting Conclusions

3.3. Loss Setting

3.4. MegaFace Challenge1 on FaceScrub

3.5. Further Improvement by Triplet Loss

4. Conclusions


Abstract
Convolutional neural networks have significantly boosted the performance of face recognition in recent years due to its high capacity in learning discriminative features. To enhance the discriminative power of the Softmax loss, multiplicative angular margin [23] and additive cosine margin [44, 43] incorporate angular margin and cosine margin into the loss functions, respectively. In this paper, we propose a novel supervisor signal, additive angular margin (ArcFace), which has a better geometrical interpretation than supervision signals proposed so far. Specifically, the proposed ArcFace cos(θ + m) directly maximises the decision boundary in angular (arc) space based on the L2 normalised weights and features. Compared to multiplicative angular margin cos(mθ) and additive cosine margin cosθ − m, ArcFace can obtain more discriminative deep features. We also emphasise the importance of network settings and data refinement in the problem of deep face recognition. Extensive experiments on several relevant face recognition benchmarks, LFW, CFP and AgeDB, prove the effectiveness of the proposed ArcFace. Most importantly, we get state-of-the-art performance in the MegaFace Challenge in a totally reproducible way. We make data, models and training/test code public.

摘要
近年來，卷積神經網絡因其學習判別性特徵的強大能力，顯著提高了人臉識別的性能。為了提高Softmax損失的判別能力，乘法角度間隔（multiplicative angular margin）[23]和加法余弦間隔（additive cosine margin）[44,43]分別將角度間隔和余弦間隔加入到損失函數中。在本文中，我們提出了一種新穎的監督信號，即加法角度間隔（additive angular margin，ArcFace），它比目前提出的監督信號具有更好的幾何解釋。具體來說，所提出的ArcFace cos(θ + m)直接最大化基於L2歸一化權重和特徵的角度（弧）空間中的決策邊界。與乘法角度間隔 cos(mθ) 和加法余弦間隔 cosθ − m 相比，ArcFace可以獲得更具判別能力的深層特徵。我們也強調了網絡設置和數據清洗在深度人臉識別問題中的重要性。在相關的人臉識別基準（LFW、CFP和AgeDB）上進行的大量實驗證明了ArcFace的有效性。最重要的是，我們在MegaFace挑戰賽中以完全可復現的方式獲得了最先進的性能。我們公開了數據、模型和訓練/測試代碼。

圖1. ArcFace的幾何解釋。(a) 藍點和綠點代表來自兩個不同類的嵌入特征。ArcFace可以直接在類之間增加角度間隔(angular (arc) margin)。(b)我們發現角度和角度間隔(arc margin)之間有一種直觀的對應關系。Arcface的angular margin對應超球面上的arc?margin(測地距離)。

?

1. Introduction
Face representation through the deep convolutional network embedding is considered the state-of-the-art method for face verification, face clustering, and face recognition?[42, 35, 31]. The deep convolutional network is responsible?for mapping the face image, typically after a pose normalisation?step, into an embedding feature vector such that?features of the same person have a small distance while features?of different individuals have a considerable distance.?The various face recognition approaches by deep convolutional network embedding differ along three primary?attributes.

1. 介紹
通過深度卷積網絡嵌入的人臉表徵，被認為是目前最先進的人臉驗證、人臉聚類和人臉識別方法[42,35,31]。深度卷積網絡負責將人臉圖像（通常在姿態歸一化步驟之後）映射為嵌入特徵向量，使得同一個人的特徵距離小，不同人的特徵距離大。各種基於深度卷積網絡嵌入的人臉識別方法，主要在以下三個屬性上有所不同。

?

The first attribute is the training data employed to train?the model. The identity number of public available training?data, such as VGG-Face [31], VGG2-Face [7], CAISAWebFace?[48], UMDFaces [6], MS-Celeb-1M [11], and?MegaFace [21], ranges from several thousand to half million. Although MS-Celeb-1M and MegaFace have a significant?number of identities, they suffer from annotation?noises [47] and long tail distributions [50]. By comparison,?private training data of Google [35] even has several million?identities. As we can check from the latest performance report?of Face Recognition Vendor Test (FRVT) [4], Yitu, a
start-up company from China, ranks first based on their private?1.8 billion face images [5]. Due to orders of magnitude
difference on the training data scale, face recognition models?from industry perform much better than models from?academia. The difference of training data also makes some?deep face recognition results [2] not fully reproducible.?

第一個屬性是用于訓練模型的訓練數據。公開人臉訓練數據集,如VGG-Face [31],VGG2-Face [7],CAISAWebFace [48],UMDFaces [6],MS-Celeb-1M [11]和MegaFace [21],包含了數量范圍從幾千到五十萬的身份。盡管MS-Celeb-1M和MegaFace具有相當數量的身份,但它們受到注釋噪聲[47]和長尾分布[50]的影響。相比之下,Google [35]的私人訓練數據集甚至擁有數百萬個身份。從人臉識別供應商Test (FRVT)[4]的最新業績報告中可以看出,中國初創企業依圖公司擁有18億張私有的人臉圖像[5],排名第一。由于訓練數據規模上的數量級差異,來自工業界的人臉識別模型比來自學術界的模型表現要好得多。訓練數據的差異也使得一些深度人臉識別結果[2]不能完全復現。

?

The second attribute is the network architecture and?settings. High capacity deep convolutional networks, such as ResNet [14, 15, 46, 50, 23] and Inception-ResNet [40,3], can obtain better performance compared to VGG network?[37, 31] and Google Inception V1 network [41, 35].Different applications of deep face recognition prefer different?trade-off between speed and accuracy [16, 51]. For?face verification on mobile devices, real-time running speed?and compact model size are essential for slick customer experience.For billion level security system, high accuracy is?as important as efficiency.?

第二個屬性是網絡架構和設置。高容量的深卷積網絡,如ResNet[14, 15, 46, 50, 23]和Inception-ResNet[40,3],與VGG網絡[37,31]和Google Inception V1網絡[41,35]相比,可以獲得更好的性能。不同的深度人臉識別應用在速度和精度之間的取舍是不同的[16,51]。對于移動設備上的人臉驗證,實時運行速度和緊湊的模型大小對于流暢的客戶體驗是至關重要的。對于十億級安全系統,精度和效率是同等重要的。

?

The third attribute is the design of the loss functions.

第三個屬性是損失函數的設計。

(1) Euclidean margin based loss.?

In [42] and [31], a Softmax classification layer is trained?over a set of known identities. The feature vector is then?taken from an intermediate layer of the network and used?to generalise recognition beyond the set of identities used?in training. Centre loss [46] Range loss [50] and Marginal?loss [10] add extra penalty to compress intra-variance or enlarge?inter-distance to improve the recognition rate, but all?of them still combine Softmax to train recognition models.However, the classification-based methods [42, 31] suffer?from massive GPU memory consumption on the classification?layer when the identity number increases to million?level, and prefer balanced and sufficient training data for?each identity.

The contrastive loss [39] and the Triplet loss [35] utilise?pair training strategy. The contrastive loss function consists?of positive pairs and negative pairs. The gradients of the loss?function pull together positive pairs and push apart negative?pairs.??Triplet loss minimises the distance between an anchor?and a positive sample and maximises the distance between?the anchor and a negative sample from a different identity.?However, the training procedure of the contrastive loss [39]?and the Triplet loss [35] is tricky due to the selection of?effective training samples.

(1) 基于歐幾里德距離的損失。

在[42]和[31]中,Softmax分類層是在一組已知身份上訓練的。然后,從網絡的中間層提取特征向量,用于訓練中使用的一組身份之外的泛化識別。中心損失[46]范圍損失[50]和margin損失[10]增加了額外的懲罰來減小類內方差或增大類間距離,從而提高識別率,但它們仍然結合Softmax來訓練識別模型。然而,當身份數量增加到百萬級別時,基于分類(classi?cation-based)方法[42,31]會在分類層上消耗大量的GPU內存,并且每個身份都需要均衡且充足的訓練數據。

對比損失[39]和三重損失[35]采用配對訓練策略。對比損失函數由正對和負對組成。損失函數的梯度將正對(positive pairs)拉攏在一起,將負對(negative pairs)分開。三重損失最小化錨點(anchor)與正樣本之間的距離,最大化錨點(anchor)與不同身份的負樣本之間的距離。然而,很難選擇有效的訓練樣本,導致對比損失[39]和三重損失[35]的訓練過程比較復雜。

?

(2) Angular and cosine margin based loss.

Liu et al. [24] proposed a large margin Softmax (L-Softmax)?by adding multiplicative angular constraints to?each identity to improve feature discrimination. SphereFace?cos(mθ) [23] applies L-Softmax to deep face recognition?with weights?normalisation. Due to the non-monotonicity?of the cosine function, a piece-wise function is applied in?SphereFace to guarantee the monotonicity. During training?of SphereFace, Softmax loss is combined to facilitate?and ensure the convergence. To overcome the optimisation?difficulty of SphereFace, additive cosine margin [44, 43]?cos(θ) - m moves the angular margin into cosine space. The?implementation and optimisation of additive cosine margin?are much easier than SphereFace. Additive cosine margin?is easily reproducible and achieves state-of-the-art performance?on MegaFace (TencentAILab_FaceCNN_v1) [2].
Compared to Euclidean margin based loss, angular and?cosine margin based loss explicitly adds discriminative?constraints on a hypershpere manifold, which intrinsically?matches the prior that human face lies on a manifold.

As is well known that the above mentioned three attributes, data, network and loss, have a high-to-low influence?on the performance of face recognition models. In?this paper, we contribute to improving deep face recognition?from all of these three attributes.

(2)基于角度間隔和余弦間隔的損失

Liu等人在[24]中提出了large margin Softmax（L-Softmax），通過在每個身份上添加乘法角度約束來提高特徵的判別能力。SphereFace cos(mθ) [23]將L-Softmax應用於權重歸一化的深度人臉識別。由於余弦函數的非單調性，SphereFace中採用了分段函數來保證單調性。在SphereFace的訓練中，結合了Softmax損失，以促進和確保收斂。為了克服SphereFace的優化困難，加法余弦間隔（additive cosine margin）[44, 43] cos(θ) − m 將角度間隔移動到余弦空間中。加法余弦間隔的實現和優化比SphereFace容易得多。加法余弦間隔很容易復現，並在MegaFace（TencentAILab_FaceCNN_v1）[2]上達到了最先進的性能。與基於歐幾里德距離的損失相比，基於角度間隔和余弦間隔的損失在超球面流形上顯式地增加了判別約束，這本質上與人臉分布在流形上的先驗相匹配。

眾所周知,上述三個屬性,數據,網絡和損失,對人臉識別模型的性能有高低的影響。在這篇論文中,我們從這三個屬性來改進深度人臉識別。

?

Data. We refined the largest public available training?data, MS-Celeb-1M [11], in both automatic and manual?way. We have checked the quality of the refined MS1M?dataset with the Resnet-27 [14, 50, 10] network and the?marginal loss [10] on the NIST Face Recognition Prize Challenge?. We also find that there are hundreds of overlap?face images between the MegaFace one million distractors?and the FaceScrub dataset, which significantly affects the?evaluation results. We manually find these overlap face images?from the MegaFace distractors. Both the refinement of?training data and test data will be public available.

數據。我們以自動和手動兩種方式對目前規模最大的公開人臉訓練數據集MS-Celeb-1M[11]進行清洗。我們在NIST人臉識別獎挑戰賽（Face Recognition Prize Challenge）上，使用Resnet-27[14, 50, 10]網絡和邊際損失（marginal loss）[10]檢驗了清洗後的MS1M數據集的質量。我們還發現在MegaFace的一百萬個干擾集和FaceScrub數據集之間存在數百張重複的人臉圖像，這對評估結果有顯著影響。（MegaFace挑戰將從Flickr Dataset中挑選的百萬張人臉圖像作為測試時的干擾集（distractors），而使用的搜索測試集（probes）來自於FaceScrub數據集。）我們從MegaFace的干擾集中手動找出了這些重複的人臉圖像。清洗後的訓練數據和測試數據都將公開。

?

Network. Taking VGG2 [7] as the training data, we conduct?extensive contrast experiments regarding the convolutional
network settings and report the verification accuracy?on LFW, CFP and AgeDB. The proposed network settings?have been confirmed robust under large pose and age variations.?We also explore the trade-off between the speed and?accuracy based on the most recent network structures.

網絡。以VGG2[7]作為訓練數據集,我們對卷積網絡設置進行了大量對比實驗,并報告了LFW、CFP和AgeDB的驗證精度。在大的面部姿態變動和年齡變化下,提出的網絡設置已被證實具有魯棒性。基于最新的網絡結構,我們還探討了速度和精度之間的權衡。

?

Loss. We propose a new loss function, additive angular?margin (ArcFace), to learn highly discriminative features?for robust face recognition. As shown in Figure 1, the proposed loss function cos(θ + m) directly maximise?decision boundary in angular (arc) space based on the L2?normalised weights and features. We show that ArcFace?not only has a more clear geometrical interpretation but also?outperforms the baseline methods, e.g. multiplicative angular?margin [23] and additive cosine margin [44, 43]. We?innovatively explain why ArcFace is better than Softmax,?SphereFace [23] and CosineFace [44, 43] from the view of
semi-hard sample distributions.

損失。我們提出了一種新的損失函數——加法角度間隔（ArcFace），學習具有高判別性的特徵，以實現具有魯棒性的人臉識別。如圖1所示，所提出的損失函數cos(θ + m)直接最大化基於L2歸一化權重和特徵的角度空間中的決策邊界。我們分析表明，ArcFace不僅有更清晰的幾何解釋，而且比乘法角度間隔[23]和加法余弦間隔[44,43]這些baseline方法更好。我們還創新地從半困難（semi-hard）樣本分布的角度解釋了為什麼ArcFace優於Softmax、SphereFace[23]和CosineFace[44,43]。

?

Performance. The proposed ArcFace achieves state-ofthe-art results on the MegaFace Challenge [21], which is?the largest public face benchmark with one million faces?for recognition. We make these results totally reproducible?with data, trained models and training/test code public?available.

性能。所提出的ArcFace在MegaFace挑戰賽[21]上取得了優異的成績,[21]是目前世界上規模最大和公開的百萬規模級別的人臉識別算法的測試基準。我們將這些結果與數據、經過訓練的模型和訓練/測試代碼公開。

?

2. From Softmax to ArcFace
2.1. Softmax
The most widely used classification loss function, Softmax loss, is presented as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i+b_j}}$$

where $x_i\in\mathbb{R}^{d}$ denotes the deep feature of the $i$-th sample, belonging to the $y_i$-th class. The feature dimension $d$ is set as 512 in this paper following [46, 50, 23, 43]. $W_j\in\mathbb{R}^{d}$ denotes the $j$-th column of the weights $W\in\mathbb{R}^{d\times n}$ in the last fully connected layer and $b_j$ is the bias term. The batch size and the class number are $N$ and $n$, respectively. Traditional Softmax loss is widely used in deep face recognition [31, 7]. However, the Softmax loss function does not explicitly optimise the features to have higher similarity score for positive pairs and lower similarity score for negative pairs, which leads to a performance gap.

2.?從Softmax到ArcFace
2.1. Softmax
最廣泛使用的分類損失函數Softmax損失如下：

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i+b_j}}$$

其中 $x_i\in\mathbb{R}^{d}$ 表示第 $i$ 個樣本的深層特徵，屬於第 $y_i$ 類。本文根據[46,50,23,43]將特徵維度 $d$ 設置為512。$W_j\in\mathbb{R}^{d}$ 表示最後一個全連接層中權重 $W\in\mathbb{R}^{d\times n}$ 的第 $j$ 列，$b_j$ 為偏置項。批大小和類的數量分別為 $N$ 和 $n$。傳統的Softmax損失在深度人臉識別中得到了廣泛的應用[31,7]。然而，Softmax損失函數沒有明確地優化特徵，使正對的相似性得分更高並且負對的相似性得分更低，這導致性能差距。
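補充（譯者注）：下面給出一段示意性的PyTorch代碼（並非論文官方實現，類別數等均為示例值），用於說明上式中Softmax損失的計算方式：先經過全連接層得到 $W^{T}x+b$，再做交叉熵。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxHead(nn.Module):
    """傳統 Softmax 分類頭：logit = W^T x + b，再接交叉熵損失。"""
    def __init__(self, feat_dim=512, num_classes=10000):  # 類別數為示例值
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes, bias=True)

    def forward(self, x, labels):
        logits = self.fc(x)                       # W^T x + b
        return F.cross_entropy(logits, labels)    # 即上式中的 Softmax 損失

# 用法示例
head = SoftmaxHead()
x = torch.randn(8, 512)                 # 批大小 N=8 的深層特徵
labels = torch.randint(0, 10000, (8,))
loss = head(x, labels)
```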

?

2.2. Weights Normalisation
For simplicity, we fix the bias $b_j = 0$ as in [23]. Then, we transform the target logit [32] as follows:

$$W_{j}^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$$

Following [23, 43, 45], we fix $\|W_j\| = 1$ by L2 normalisation, which makes the predictions only depend on the angle $\theta_j$ between the feature vector and the weight.

In the experiments of SphereFace, L2 weight normalisation only improves little on performance.

2.2. 權重歸一化
為了簡單起見，我們像[23]那樣固定偏置 $b_j = 0$。然後，我們將目標logit[32]變換如下：

$$W_{j}^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$$

按照[23,43,45]，我們通過L2歸一化將權重固定為 $\|W_j\| = 1$，這使得預測只依賴於特徵向量和權重之間的角度 $\theta_j$。

在SphereFace的實驗中,L2權重歸一化對性能的改善微乎其微。
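補充（譯者注）：權重歸一化後（偏置 $b_j=0$、$\|W_j\|=1$），logit 退化為 $\|x_i\|\cos\theta_j$。下面是一個示意性寫法（非官方實現，類別數為假設值）：

```python
import torch
import torch.nn.functional as F

def wnorm_logits(x, weight):
    """b=0 且每個類的權重向量 L2 歸一化後，logit = ||x|| * cos(theta_j)。"""
    w = F.normalize(weight, dim=1)   # 按行（每個類）歸一化到單位長度
    return x @ w.t()                 # 未歸一化特徵與單位權重的內積

x = torch.randn(8, 512)
weight = torch.randn(10000, 512)     # 假設 10000 個類
logits = wnorm_logits(x, weight)
```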

?

2.3. Multiplicative Angular Margin
In SphereFace [23, 24], angular margin $m$ is introduced by multiplication on the angle:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\cos(m\theta_{y_i})}}{e^{\|x_i\|\cos(m\theta_{y_i})}+\sum_{j\neq y_i}e^{\|x_i\|\cos\theta_j}}$$

where $\theta_{y_i}\in[0,\pi/m]$. In order to remove this restriction, $\cos(m\theta_{y_i})$ is substituted by a piece-wise monotonic function $\psi(\theta_{y_i})$. The SphereFace is formulated as:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\psi(\theta_{y_i})}}{e^{\|x_i\|\psi(\theta_{y_i})}+\sum_{j\neq y_i}e^{\|x_i\|\cos\theta_j}}$$

where $\psi(\theta_{y_i})=(-1)^{k}\cos(m\theta_{y_i})-2k$, $\theta_{y_i}\in[\frac{k\pi}{m},\frac{(k+1)\pi}{m}]$, $k\in[0,m-1]$, and $m\geq 1$ is the integer that controls the size of angular margin. However, during the implementation of SphereFace, Softmax supervision is incorporated to guarantee the convergence of training, and the weight is controlled by a dynamic hyper-parameter λ. With the additional Softmax loss, $\psi(\theta_{y_i})$ in fact is:

$$\psi(\theta_{y_i})=\frac{(-1)^{k}\cos(m\theta_{y_i})-2k+\lambda\cos(\theta_{y_i})}{1+\lambda}$$

where λ is an additional hyper-parameter to facilitate the training of SphereFace. λ is set to 1,000 at the beginning and decreases to 5 to make the angular space of each class more compact [23]. This additional dynamic hyper-parameter λ makes the training of SphereFace relatively tricky.

2.3. 乘法角度間隔
在SphereFace[23,24]中，角度間隔 $m$ 通過乘法引入到角度中：

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\cos(m\theta_{y_i})}}{e^{\|x_i\|\cos(m\theta_{y_i})}+\sum_{j\neq y_i}e^{\|x_i\|\cos\theta_j}}$$

其中 $\theta_{y_i}\in[0,\pi/m]$。為了消除這個限制，用一個分段單調函數 $\psi(\theta_{y_i})$ 代替 $\cos(m\theta_{y_i})$。SphereFace表示為：

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\psi(\theta_{y_i})}}{e^{\|x_i\|\psi(\theta_{y_i})}+\sum_{j\neq y_i}e^{\|x_i\|\cos\theta_j}}$$

其中 $\psi(\theta_{y_i})=(-1)^{k}\cos(m\theta_{y_i})-2k$，$\theta_{y_i}\in[\frac{k\pi}{m},\frac{(k+1)\pi}{m}]$，$k\in[0,m-1]$，$m\geq 1$ 是一個整數，它控制角度間隔的大小。然而，在SphereFace的實現過程中，引入了Softmax監督項用於確保訓練收斂，其權重由動態超參數 λ 控制。加上額外的Softmax損失後，$\psi(\theta_{y_i})$ 實際變為：

$$\psi(\theta_{y_i})=\frac{(-1)^{k}\cos(m\theta_{y_i})-2k+\lambda\cos(\theta_{y_i})}{1+\lambda}$$

其中 λ 是一個額外的超參數，用於促進SphereFace的訓練。λ在開始時設置為1,000，並逐漸減小到5，使得每個類的角度空間更緊湊[23]。這個額外的動態超參數 λ 使得SphereFace的訓練相對棘手。

?

2.4. Feature Normalisation
Feature normalisation is widely used for face verification, e.g. L2-normalised Euclidean distance and cosine distance [29]. Parde et al. [30] observe that the L2-norm of features learned using Softmax loss is informative of the quality of the face. Features for good quality frontal faces have a high L2-norm while blurry faces with extreme pose have low L2-norm. Ranjan et al. [33] add the L2-constraint to the feature descriptors and restrict features to lie on a hypersphere of a fixed radius. L2 normalisation on features can be easily implemented using existing deep learning frameworks and significantly boosts the performance of face verification. Wang et al. [44] point out that the gradient norm may be extremely large when the feature norm from a low-quality face image is very small, which potentially increases the risk of gradient explosion. The advantages of feature normalisation are also revealed in [25, 26, 43, 45] and the feature normalisation is explained from analytic, geometric and experimental perspectives.

2.4. 特征歸一化
特征歸一化被廣泛用于人臉驗證,例如,L2歸一化的歐幾里德距離和余弦距離[29]。Parde等人[30]觀察到,使用的softmax損失學習的歸一化特征,對于得到關于人臉質量的信息是有幫助的。高質量正面臉的特征具有較高的L2范數,而姿態極端的模糊臉的特征具有較低的L2范數。Ranjan等人將L2約束添加到特征描述符中,并限制特征分布在一個半徑固定的超球面上。使用現有的深度學習框架可以很容易地實現特征的L2歸一化,并顯著提高人臉驗證的性能。Wang等人[44]指出,當來自低質量人臉圖像的特征范數非常小時,梯度范數可能非常大,這可能增加了梯度爆炸的風險。在[25,26,43,45]揭示了特征歸一化的優點,并從分析、幾何和實驗的角度對特征歸一化進行了解釋。

?

As we can see from the above works, L2 normalisation on features and weights is an important step for hypersphere metric learning. The intuitive insight behind feature and weight normalisation is to remove the radial variation and push every feature to distribute on a hypersphere manifold.

從以上工作可以看出，特徵和權重的L2歸一化是超球面度量學習的重要步驟。特徵和權重歸一化背後的直覺是去除徑向變化，使每個特徵都分布在超球面流形上。

?

Following [33, 43, 45, 44], we fix $\|x_i\|$ by L2 normalisation and re-scale $\|x_i\|$ to $s$, which is the hypersphere radius and whose lower bound is given in [33]. In this paper, we use $s = 64$ for face recognition experiments [33, 43]. Based on feature and weight normalisation, we can get:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

按照[33,43,45,44]，我們通過L2歸一化將 $\|x_i\|$ 固定，並將其重新縮放到 $s$，也就是超球面的半徑，其下界在[33]中給出。本文採用 $s = 64$ 進行人臉識別實驗[33,43]。基於特徵和權重歸一化，我們可以得到：

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$
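補充（譯者注）：特徵與權重都做L2歸一化並把特徵縮放到 $s$ 之後，logit 變為 $s\cos\theta_j$。以下為示意代碼（非官方實現，$s=64$ 按論文設置，類別數為假設值）：

```python
import torch
import torch.nn.functional as F

def normalized_logits(x, weight, s=64.0):
    """特徵與權重均 L2 歸一化後，logit = s * cos(theta_j)，s 為超球面半徑。"""
    x = F.normalize(x, dim=1)
    w = F.normalize(weight, dim=1)
    return s * (x @ w.t())

logits = normalized_logits(torch.randn(8, 512), torch.randn(10000, 512))
loss = F.cross_entropy(logits, torch.randint(0, 10000, (8,)))
```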

?

If the feature normalisation is applied to SphereFace, we can get the feature normalised SphereFace, denoted as SphereFace-FNorm:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\,\psi(\theta_{y_i})}}{e^{s\,\psi(\theta_{y_i})}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

如果將特征歸一化應用于SphereFace,則可以得到特征歸一化的SphareFace,表示為SphereFace-FNorm

?
2.5. Additive Cosine Margin
In [44, 43], the angular margin $m$ is moved to the outside of $\cos\theta$, thus they propose the cosine margin loss function:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

In this paper, we set the cosine margin $m$ as 0.35 [44, 43]. Compared to SphereFace, additive cosine margin (CosineFace) has three advantages: (1) extremely easy to implement without tricky hyper-parameters; (2) more clear and able to converge without the Softmax supervision; (3) obvious performance improvement.

2.5. 加法余弦間隔
在[44,43]中，角度間隔 $m$ 被移到 $\cos\theta$ 的外面，由此提出了余弦間隔損失函數：

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

在本文中，我們將余弦間隔 $m$ 設為0.35[44,43]。與SphereFace相比，加法余弦間隔（CosineFace）具有三個優點：(1)無需複雜的超參數即可輕鬆實現；(2)更清晰，不需要Softmax監督即可收斂；(3)性能明顯提高。
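補充（譯者注）：加法余弦間隔只對目標類的logit減去 $m$，其餘類保持 $s\cos\theta_j$。以下為示意實現（非官方代碼，$m=0.35$、$s=64$ 按論文設置，類別數為假設值）：

```python
import torch
import torch.nn.functional as F

def cosface_loss(x, weight, labels, s=64.0, m=0.35):
    """CosineFace：目標類 logit 為 s*(cos(theta)-m)，其餘類為 s*cos(theta)。"""
    cos_theta = F.normalize(x, dim=1) @ F.normalize(weight, dim=1).t()
    onehot = F.one_hot(labels, num_classes=weight.size(0)).float()
    logits = s * (cos_theta - onehot * m)     # 僅目標類減去余弦間隔 m
    return F.cross_entropy(logits, labels)

loss = cosface_loss(torch.randn(8, 512), torch.randn(10000, 512),
                    torch.randint(0, 10000, (8,)))
```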

?

2.6. Additive Angular Margin
Although the cosine margin in [44, 43] has a one-to-one mapping from the cosine space to the angular space, there is still a difference between these two margins. In fact, the angular margin has a more clear geometric interpretation compared to cosine margin, and the margin in angular space corresponds to the arc distance on the hypersphere manifold.

We add an angular margin $m$ within $\cos\theta$. Since $\cos(\theta+m)$ is lower than $\cos(\theta)$ when $\theta\in[0,\pi-m]$, the constraint is more stringent for classification. We define the proposed ArcFace as:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

If we expand the proposed additive angular margin $\cos(\theta+m)$, we get $\cos(\theta+m)=\cos\theta\cos m-\sin\theta\sin m$. Compared to the additive cosine margin $\cos\theta-m$ proposed in [44, 43], the proposed ArcFace is similar but the margin is dynamic due to $\sin\theta$.

In Figure 2, we illustrate the proposed ArcFace, and the angular margin corresponds to the arc margin. Compared to SphereFace and CosineFace, our method has the best geometric interpretation.

2.6. 加法角度間隔
雖然[44,43]中的余弦間隔從余弦空間到角度空間是一對一映射的,但這兩個間隔(margin)之間仍然存在差異。事實上,與余弦間隔相比,角度間隔有更清晰的幾何解釋,角度空間中的間隔(margin)對應于超球面流形上的弧距(arc distance)。

我們在 $\cos\theta$ 裡面增加一個角度間隔 $m$。由於當 $\theta\in[0,\pi-m]$ 時 $\cos(\theta+m)$ 小於 $\cos(\theta)$，因此該約束對分類更為嚴格。我們將提出的ArcFace定義為：

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

?

如果我們將提出的加法角度間隔 $\cos(\theta+m)$ 展開，得到 $\cos(\theta+m)=\cos\theta\cos m-\sin\theta\sin m$。與[44,43]中提出的加法余弦間隔 $\cos\theta-m$ 相比，提出的ArcFace與之類似，但由於 $\sin\theta$ 的存在，其間隔是動態的。
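補充（譯者注）：ArcFace的核心就是把間隔加在角度上，實現時利用 $\cos(\theta+m)=\cos\theta\cos m-\sin\theta\sin m$ 展開。以下為示意實現（非官方代碼，$s=64$、$m=0.5$ 按論文設置，未包含論文之外的數值穩定技巧）：

```python
import math
import torch
import torch.nn.functional as F

def arcface_loss(x, weight, labels, s=64.0, m=0.5):
    """ArcFace：目標類 logit 為 s*cos(theta+m)，其餘類為 s*cos(theta)。"""
    cos_t = (F.normalize(x, dim=1) @ F.normalize(weight, dim=1).t()).clamp(-1.0, 1.0)
    sin_t = torch.sqrt(1.0 - cos_t ** 2)
    cos_t_m = cos_t * math.cos(m) - sin_t * math.sin(m)   # cos(theta + m)
    onehot = F.one_hot(labels, num_classes=weight.size(0)).bool()
    logits = s * torch.where(onehot, cos_t_m, cos_t)
    return F.cross_entropy(logits, labels)

loss = arcface_loss(torch.randn(8, 512), torch.randn(10000, 512),
                    torch.randint(0, 10000, (8,)))
```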

在圖2中，我們展示了所提出的ArcFace，其角度間隔（angular margin）對應於弧線間隔（arc margin）。與SphereFace和CosineFace相比，我們的方法具有最佳的幾何解釋。

?

圖2. ArcFace的幾何解釋。不同的顏色區域代表不同類的特征空間。ArcFace不僅可以壓縮特征區域,而且可以對應超球面上的測地線距離。

?

2.7. Comparison under Binary Case
To better understand the process from Softmax to the proposed ArcFace, we give the decision boundaries under the binary classification case in Table 1 and Figure 3. Based on the weights and features normalisation, the main difference among these methods is where we put the margin.

2.7. 二分類情景下的比較
為了更好地理解從Softmax到所提出的ArcFace的過程,我們在表1和圖3中給出了二元分類情況下的決策邊界。基于權重和特征歸一化,這些方法之間的主要區別是margin的擺放位置。

表1. 二分類情景下類1的決策邊界。注意，$\theta_i$ 是 $W_i$ 和 $x$ 之間的角度，$s$ 是超球面半徑，$m$ 是間隔（margin）。

圖3. 二分類情景下不同損失函數的決策間隔(decision margins)。虛線表示決策邊界,灰色區域是決策間隔。

?

2.8. Target Logit Analysis
To investigate why the face recognition performance can be improved by SphereFace, CosineFace and ArcFace, we analyse the target logit curves and the θ distributions during training. Here, we use the LResNet34E-IR (refer to Sec. 3.2) network and the refined MS1M dataset (refer to Sec. 3.1).

2.8. 目標Logit分析
補充一點:target logit按照字面翻譯是目標邏輯,但實際上跟論文想表達的意思不符。target logit代表的是全連接層輸出矩陣中預測類別為真實類別的輸出,應該翻譯成目標分數比較好。

為了研究為什么SphereFace,CosineFace和ArcFace可以改善人臉識別性能,我們分析了目標logit曲線和訓練期間的θ分布。在這里,我們使用LResNet34E-IR(參見3.2節)網絡和修改后的MS1M數據集(參見3.1節)。

圖4. 目標logit分析。 (a) Softmax,SphereFace,CosineFace和ArcFace的目標logit曲線。 (b)?對Softmax,CosineFace和ArcFace進行批訓練,估算的目標logit 收斂曲線。 (c)?在訓練期間,θ分布從大角度移動到小角度(開始,中間和結束)。最好通過放大查看。

?

In Figure 4(a), we plot the target logit curves for Softmax, SphereFace, CosineFace and ArcFace. For SphereFace, the best setting is m = 4 and λ = 5, which is similar to the curve with m = 1.5 and λ = 0. However, the implementation of SphereFace requires m to be an integer. When we try the minimum multiplicative margin, m = 2 and λ = 0, the training can not converge. Therefore, decreasing the target logit curve slightly from Softmax is able to increase the training difficulty and improve the performance, but decreasing too much may cause the training divergence.

在圖4(a)中，我們繪制了Softmax、SphereFace、CosineFace和ArcFace的目標logit曲線。對於SphereFace，最佳設置為 m = 4 和 λ = 5，它類似於 m = 1.5 和 λ = 0 的曲線。但是，SphereFace的實現要求m為整數。當我們嘗試最小的乘法間隔 m = 2 和 λ = 0 時，訓練無法收斂。因此，與Softmax相比，稍微降低目標logit曲線可以增加訓練難度、提高性能，但降低太多會導致訓練發散。
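補充（譯者注）：圖4(a)中的目標logit曲線可以直接按定義畫出來，便於直觀比較各損失在不同 θ 下對目標類打分的差異。以下為示意代碼（非論文官方繪圖腳本，各間隔取論文中的設置）：

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.0, np.pi, 181)          # 0° 到 180°
softmax = np.cos(theta)                       # 歸一化後的 Softmax
cosface = np.cos(theta) - 0.35                # 加法余弦間隔 m=0.35
arcface = np.cos(theta + 0.5)                 # 加法角度間隔 m=0.5

plt.plot(np.degrees(theta), softmax, label='Softmax')
plt.plot(np.degrees(theta), cosface, label='CosineFace (m=0.35)')
plt.plot(np.degrees(theta), arcface, label='ArcFace (m=0.5)')
plt.xlabel('theta (degree)')
plt.ylabel('target logit')
plt.legend()
plt.show()
```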

?

Both CosineFace and ArcFace follow this insight. As we can see from Figure 4(a), CosineFace moves the target logit curve along the negative direction of the y-axis, while ArcFace moves the target logit curve along the negative direction of the x-axis. Now, we can easily understand the performance improvement from Softmax to CosineFace and ArcFace.

CosineFace和ArcFace都遵循這一觀點。從圖4(a)可以看出,CosineFace將目標logit曲線沿y軸負方向移動,ArcFace將目標logit曲線沿x軸負方向移動。現在,我們可以很容易地理解從Softmax到CosineFace和ArcFace的性能改進。

?

For ArcFace with the margin m = 0.5, the target logit curve is not monotonically decreasing over θ ∈ [0°, 180°]. In fact, the target logit curve increases when θ exceeds 180° − m (about 151.35° for m = 0.5). However, as shown in Figure 4(c), when starting from the randomly initialised network, θ has a Gaussian distribution centred at 90°, and the largest angle stays far below this increasing interval. The increasing interval of ArcFace is almost never reached during training. Therefore, we do not need to deal with this explicitly.

對于margin m = 0.5的ArcFace,當??時,目標logit曲線不是單調遞減的。事實上,當?時,目標logit曲線會增加。然而,如圖4(c)所示,當從隨機初始化網絡開始時,θ具有高斯分布,其中心位于,最大角度小于。在訓練期間,ArcFace逐漸增大的間隔幾乎從未達到??。因此,我們不需要明確地處理這個問題。

?

In Figure 4(c), we show the θ distributions of CosineFace and ArcFace in three phases of training, e.g. start, middle and end. The distribution centres gradually move from about 90° to smaller angles. In Figure 4(a), we find the target logit curve of ArcFace is lower than that of CosineFace roughly between 30° and 90°. Therefore, the proposed ArcFace puts a more strict margin penalty compared to CosineFace in this interval. In Figure 4(b), we show the target logit convergence curves estimated on training batches for Softmax, CosineFace and ArcFace. We can also find that the margin penalty of ArcFace is heavier than that of CosineFace at the beginning, as the red dotted line is lower than the blue dotted line. At the end of training, ArcFace converges better than CosineFace, as the histogram of θ is more to the left (Figure 4(c)) and the target logit convergence curve is higher (Figure 4(b)). From Figure 4(c), we can find that almost all of the θs are smaller than 60° at the end of training. The samples beyond this field are the hardest samples as well as the noise samples of the training dataset. Even though CosineFace puts a more strict margin penalty when θ ∈ [0°, 30°] (Figure 4(a)), this field is seldom reached even at the end of training (Figure 4(c)). Therefore, we can also understand why SphereFace can obtain very good performance even with a relatively small margin in this section.

在圖4(c)中，我們展示了CosineFace和ArcFace在訓練開始、中間和結束三個階段的θ分布。θ值的分布中心逐漸從90°附近移動到更小的角度。在圖4(a)中，我們發現在大約30°到90°之間，ArcFace的目標logit曲線低於CosineFace的目標logit曲線，因此與CosineFace相比，本文提出的ArcFace在此區間內施加了更嚴格的margin懲罰。在圖4(b)中，我們展示了對Softmax、CosineFace和ArcFace按訓練批次估算出的目標logit收斂曲線。我們還可以發現在開始時，ArcFace的margin懲罰比CosineFace重，因為紅色虛線低於藍色虛線。在訓練結束時，ArcFace收斂得比CosineFace好，因為θ的直方圖更靠左（圖4(c)），目標logit收斂曲線更高（圖4(b)）。從圖4(c)中可以發現，在訓練結束時幾乎所有的θ都小於60°。超出這個範圍的樣本是最困難的樣本以及訓練數據集中的噪聲樣本。儘管CosineFace在 θ ∈ [0°, 30°] 時施加更嚴格的margin懲罰（圖4(a)），但即使在訓練結束時也很少達到這個區域（圖4(c)）。因此，我們也可以理解為什麼SphereFace即使在這一區間只有相對較小的margin也能獲得非常好的性能。

?

In conclusion, adding too much margin penalty when θ ∈ [60°, 90°] may cause training divergence, e.g. SphereFace (m = 2 and λ = 0). Adding margin when θ ∈ [30°, 60°] can potentially improve the performance, because this section corresponds to the most effective semi-hard negative samples [35]. Adding margin when θ ∈ [0°, 30°] can not obviously improve the performance, because this section corresponds to the easiest samples. When we go back to Figure 4(a) and rank the curves within [30°, 60°], we can understand why the performance can improve from Softmax, SphereFace, CosineFace to ArcFace under their best parameter settings. Note that 30° and 60° here are the roughly estimated thresholds for easy and hard training samples.

總之，當 θ ∈ [60°, 90°] 時添加太大的margin懲罰，可能會導致訓練發散，例如SphereFace（m = 2 和 λ = 0）。當 θ ∈ [30°, 60°] 時添加margin有可能提高性能，因為這部分對應最有效的半困難（semi-hard）負樣本[35]。當 θ ∈ [0°, 30°] 時添加margin無法明顯改善性能，因為這部分對應於最簡單的樣本。當我們回到圖4(a)，對 [30°, 60°] 之間的曲線進行排序時，我們可以理解為什麼在各自的最佳參數設置下，性能會從Softmax、SphereFace、CosineFace到ArcFace依次提高。請注意，此處的30°和60°是對簡單和困難訓練樣本粗略估計的閾值。

?

3. Experiments
In this paper, we target to obtain state-of-the-art performance?on MegaFace Challenge [21], the largest face?identification and verification benchmark, in a totally reproducible?way. We take Labelled Faces in the Wild?(LFW) [19], Celebrities in Frontal Profile (CFP) [36], Age?Database (AgeDB) [27] as the validation datasets, and conduct?extensive experiments regarding network settings and?loss function designs. The proposed ArcFace achieves?state-of-the-art performance on all of these four datasets.

3. 實驗
在本文中,我們的目標是在MegaFace Challenge [21]中以完全可復制的方式獲得最先進的性能,其中MegaFace Challenge是目前世界上規模最大的人臉識別和人臉驗證測試基準。我們采用 Labelled Faces in the Wild?(LFW) [19], Celebrities in Frontal Profile (CFP) [36], Age?Database (AgeDB) [27] 作為驗證數據集,并對有關網絡設置和損失函數設計進行大量的實驗。所提出的ArcFace在這四個數據集上實現了最先進的性能。

?

3.1. Data
3.1.1 Training data
We use two datasets, VGG2 [7] and MS-Celeb-1M [11], as?our training data.

VGG2. VGG2 dataset contains a training set with 8,631?identities (3,141,890 images) and a test set with 500 identities?(169,396 images). VGG2 has large variations in pose,?age, illumination, ethnicity and profession. Since VGG2 is
a high-quality dataset, we use it directly without data refinement.

MS-Celeb-1M. The original MS-Celeb-1M dataset contains?about 100k identities with 10 million images. To decrease?the noise of MS-Celeb-1M and get a high-quality?training data, we rank all face images of each identity by?their distances to the identity centre. For a particular identity,?the face image whose feature vector is too far from the?identity’s feature centre is automatically removed [10]. We?further manually check the face images around the threshold?of the first automatic step for each identity. Finally, we?obtain a dataset which contains 3.8M images of 85k unique?identities. To facilitate other researchers to reproduce all of?the experiments in this paper, we make the refined MS1M?dataset public available within a binary file, but please cite?the original paper [11] and follow the original license [11]?when using this dataset. Our contribution here is only training?data refinement, not release.

3.1. 數據
3.1.1 訓練數據集
我們使用兩個數據集,VGG2 [7]和MS-Celeb-1M [11]作為我們的訓練數據集。

VGG2。 VGG2數據集包含具有8,631個身份(3,141,890個圖像)的訓練集和具有500個身份(169,396個圖像)的測試集。VGG2在姿勢,年齡,光照,種族和職業方面有很大差異。由于VGG2是一個高質量的數據集,我們直接使用它,無需對數據進行清洗。

MS-Celeb-1M。最初的MS-Celeb-1M數據集包含大約10萬個身份和1000萬張圖像。為了降低MS-Celeb-1M的噪聲并獲得高質量的訓練數據,我們將每個身份的所有面部圖像按照它們到身份中心的距離進行排序。對于一個特定的身份,如果其特征向量距離身份特征中心太遠,則該人臉圖像將被自動清洗[10]。在第一個自動步驟中,我們進一步為每個身份手動檢查閾值附近的人臉圖像。最后,我們得到一個包含3.8M張圖像(85k個唯一身份)的數據集。為了方便其他研究人員復制本文中的所有實驗,我們用一個二進制文件,將清洗過的MS1M數據集公開,但是在使用該數據集時,請引用原始論文[11]并遵循原始許可證[11]。我們在這里的貢獻只是對訓練數據進行修改,而不是發布。
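補充（譯者注）：論文所述的自動清洗步驟，核心是把某一身份的所有特徵按與該身份特徵中心的余弦相似度排序，剔除離中心太遠的圖像。以下為示意代碼（非官方實現，閾值0.5僅為假設值）：

```python
import numpy as np

def auto_clean_identity(feats, thresh=0.5):
    """feats: (N, 512)，某一身份所有人臉的特徵（已 L2 歸一化）。
    返回保留的樣本索引；與身份中心余弦相似度低於 thresh 的樣本被剔除。"""
    center = feats.mean(axis=0)
    center = center / np.linalg.norm(center)
    cos_sim = feats @ center                 # 與身份中心的余弦相似度
    order = np.argsort(-cos_sim)             # 由高到低排序，便於人工複查閾值附近的樣本
    keep = order[cos_sim[order] >= thresh]
    return keep, cos_sim

feats = np.random.randn(100, 512)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
keep, sim = auto_clean_identity(feats)
```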

?

3.1.2 Validation data
We employ Labelled Faces in the Wild (LFW) [19],Celebrities in Frontal Profile (CFP) [36] and Age Database?(AgeDB) [27] as the validation datasets.

LFW. [19] LFW dataset contains 13,233 web-collected?images from 5749 different identities, with large variations?in pose, expression and illuminations. Following the standard?protocol of unrestricted with labelled outside data, we?give the verification accuracy on 6,000 face pairs.??

CFP. [36]. CFP dataset consists of 500 subjects, each?with 10 frontal and 4 profile images. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP)?face verification, each having 10 folders with 350 sameperson?pairs and 350 different-person pairs. In this paper,?we only use the most challenging subset, CFP-FP, to report?the performance.

AgeDB. [27, 10] AgeDB dataset is an in-the-wild dataset?with large variations in pose, expression, illuminations, and?age. AgeDB contains 12,240 images of 440 distinct subjects,?such as actors, actresses, writers, scientists, and politicians.?Each image is annotated with respect to the identity,?age and gender attribute. The minimum and maximum ages are 3 and 101, respectively. The average age range for each?subject is 49 years. There are four groups of test data with different year gaps (5 years, 10 years, 20 years and 30 years,respectively) [10]. Each group has ten split of face images,?and each split contains 300 positive examples and 300 negative?examples. The face verification evaluation metric is?the same as LFW. In this paper, we only use the most challenging?subset, AgeDB-30, to report the performance.

3.1.2 驗證數據集
我們采用?Labelled Faces in the Wild (LFW) [19],Celebrities in Frontal Profile (CFP) [36] and Age Database?(AgeDB) [27] 作為驗證數據集。

LFW。[19] LFW數據集包含來自5,749個不同身份的13,233張網絡收集的圖像，其姿態、表情和光照有很大變化。遵循「不受限制、可使用外部標注數據（unrestricted with labelled outside data）」的標準協議，我們給出了6,000對人臉上的驗證精度。

CFP。[36]。CFP數據集包含500名受試者,每個受試者有10張正面圖和4張側面圖。評估方案包括正面對正面(FF)和正面對側面(FP)的人臉驗證,每個都有10個文件夾,包含350對相同的人和350對不同的人。在本文中,我們僅使用最具挑戰性的子集CFP-FP來報告性能。

補充:收集和注釋在無約束條件下捕獲的面部圖像,通常被稱為“in-the-wild”

AgeDB。[27,10] AgeDB數據集是一種in-the-wild的數據集，在姿態、表情、光照和年齡方面有很大的變化。AgeDB包含440個不同對象的12,240張圖片，這些對象包括男女演員、作家、科學家和政治家。每張圖像都標注了身份、年齡和性別屬性。最小年齡為3歲，最大年齡為101歲。每個對象的平均年齡跨度為49年。有四組測試數據，具有不同的年份差距（分別為5年、10年、20年和30年）[10]。每組有10個劃分（split）的人臉圖像，每個劃分包含300個正例和300個負例。人臉驗證評價指標與LFW相同。在本文中，我們僅使用最具挑戰性的子集AgeDB-30來報告性能。

?

3.1.3 Test data
MegaFace. MegaFace datasets [21] are released as the?largest public available testing benchmark, which aims at?evaluating the performance of face recognition algorithms?at the million scale of distractors. MegaFace datasets include?gallery set and probe set. The gallery set, a subset of?Flickr photos from Yahoo, consists of more than one million?images from 690k different individuals. The probe sets?are two existing databases: FaceScrub [28] and FGNet [1].?FaceScrub is a publicly available dataset that containing?100k photos of 530 unique individuals, in which 55,742?images are males, and 52,076 images are females. FGNet?is a face ageing dataset, with 1002 images from 82 identities.?Each identity has multiple face images at different ages?(ranging from 1 to 69).

It is quite understandable that data collection of?MegaFace is very arduous and time-consuming thus data?noise is inevitable. For FaceScrub dataset, all of the face?images from one particular identity should have the same?identity. For the one million distractors, there should not?be any overlap with the FaceScrub identities. However, we?find noisy face images not only exist in FaceScrub dataset?but also exist in the one million distractors, which significantly?affect the performance.

In Figure 5, we give the noisy face image examples from?the Facesrub dataset. As shown in Figure 8(c), we rank all?of the faces according to the cosine distance to the identity?centre. In fact, face image 221 and 136 are not Aaron Eckhart.?We manually clean the FaceScrub dataset and finally?find 605 noisy face images. During testing, we change the?noisy face to another right face, which can increase the identification?accuracy by about 1%. In Figure 6(b), we give the?noisy face image examples from the MegaFace distractors.?All of the four face images from the MegaFace distractors?are Alec Baldwin. We manually clean the MegaFace distractors?and finally find 707 noisy face images. During testing,?we add one additional feature dimension to distinguish?these noisy faces, which can increase the identification accuracy?by about 15%.

Even though the noisy face images are double checked?by seven annotators who are very familiar with these?celebrities, we still can not promise these images are 100%?noisy. We put the noise lists of the FaceScrub dataset and?the MegaFace distractors online. We believe the masses?have sharp eyes and we will update these lists based on other?researchers’ feedback.

3.1.3 測試數據集
MegaFace。MegaFace數據集[21]作為世界上規模最大的公開測試基準,旨在評估人臉識別算法在百萬級干擾項干擾下的性能。MegaFace數據集包括圖庫集和探測集。圖庫集是來自雅虎Flickr照片的一個子集,由來自69萬不同個體的100多萬張照片組成。探測集是兩個現有的數據庫:FaceScrub[28]和FGNet[1]。FaceScrub是一個公開數據集,包含530個獨立個體的100k張照片,其中55,742張是男性,52,076張是女性。FGNet是一個面部老化數據集,包含來自82個身份的1002張圖像。每個身份在不同年齡(從1歲到69歲)都有多個人臉圖像。

可以理解的是,MegaFace的數據采集是非常艱巨和耗時的,因此數據噪聲是不可避免的。對于FaceScrub數據集,來自一個特定身份的所有人臉圖像應該具有相同的身份。對于數量為一百萬的干擾集,不應該與FaceScrub身份有任何重復。然而,我們發現噪聲人臉圖像不僅存在于FaceScrub數據集中,而且還存在于數量為一百萬的干擾集中,這對性能有很大的影響。

在圖5中，我們給出了來自FaceScrub數據集的噪聲人臉圖像示例。如圖8(c)所示，我們根據到身份中心的余弦距離對所有的人臉進行排序。事實上，人臉圖像221和136並不是Aaron Eckhart。我們手動清洗FaceScrub數據集，最終找到605張有噪聲的人臉圖像。在測試過程中，我們將有噪聲的人臉替換為另一張正確的人臉，這可以使識別精度提高約1%。在圖6(b)中，我們給出了來自MegaFace干擾集的噪聲人臉圖像示例。這四張來自MegaFace干擾集的人臉圖像都是Alec Baldwin（亞歷克·鮑德溫）。我們手動清洗MegaFace的干擾集，最終找到707張有噪聲的人臉圖像。在測試過程中，我們增加了一個額外的特徵維度來區分這些有噪聲的人臉，這可以將識別精度提高約15%。

盡管這些有噪聲的人臉圖像被7位非常熟悉這些名人的注釋者反復檢查,我們仍然不能保證這些圖像100%沒有噪聲。我們將FaceScrub數據集和MegaFace干擾集的噪聲列表放到了網上。我們相信大眾有敏銳的眼睛,我們將根據其他研究人員的反饋更新這些列表。

圖5. 來自FaceScrub數據集的噪聲人臉圖像示例。在(a)中,圖像id放在左上角,到身份中心的余弦距離放在左下角。

圖6. (a)用于注釋器從FaceScrub數據集學習身份。(b)顯示從MegaFace干擾集中選取的重復人臉。

?

3.2. Network Settings
We first evaluate the face verification performance based?on different network settings by using VGG2 as the training?data and Softmax as the loss function. All experiments in?this paper are implemented by MxNet [8]. We set the batch?size as 512 and train models on four or eight NVIDIA Tesla?P40 (24GB) GPUs. The learning rate is started from 0.1?and divided by 10 at the 100k, 140k, 160k iterations. Total?iteration step is set as 200k. We set momentum at 0.9 and?weight decay at 5e -?4 (Table 5).

3.2. 網絡設置
我們首先使用VGG2作為訓練數據和Softmax作為損失函數,根據不同的網絡設置,評估人臉驗證的性能。本文中的所有實驗均由MxNet [8]實現。我們將批大小(batch?size)設置為512,在4個或8個NVIDIA Tesla P40(24GB)GPU上訓練模型。學習速率從0.1開始,并在100k、140k、160k個迭代(iterations)時除以10。總迭代步長設置為200k。我們設定動量為0.9,權重衰減為5e - 4(表5)。
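補充（譯者注）：按上述訓練超參數（SGD、momentum 0.9、權重衰減5e-4、初始學習率0.1，在100k/140k/160k步時除以10），用PyTorch可以寫成如下示意配置（非官方實現，模型僅為占位）：

```python
import torch

model = torch.nn.Linear(512, 1000)    # 占位模型，實際應為人臉識別骨幹網絡
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100000, 140000, 160000], gamma=0.1)

for step in range(200000):             # 總迭代步數 200k
    optimizer.zero_grad()
    # 此處省略前向與反向傳播
    optimizer.step()
    scheduler.step()                    # 按迭代步更新學習率
```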

?

3.2.1 Input setting
Following [46, 23], we use five facial landmarks (eye centres,nose tip and mouth corners) [49] for similarity transformation?to normalise the face images. The faces are cropped?and resized to 112 × 112, and each pixel (ranged between?[0,255]) in RGB images is normalised by subtracting 127.5?then divided by 128.?

As most of the convolutional networks are designed for the Image-Net [34] classification task, the input image size?is usually set as 224 × 224 or larger. However, the size?of our face crops is only 112 × 112. To preserve higher?feature map resolution, we use conv3 × 3 and stride = 1?in the first convolutional layer instead of using conv7 × 7?and stride = 2. For these two settings, the output size of?the convolutional networks is 7×7 (denoted as “L” in front?of the network names) and 3 × 3, respectively.

3.2.1 輸入設置
按照[46,23],我們使用五個面部關鍵點landmarks(眼睛中心,鼻尖和嘴角)[49]進行相似性變換,來標準化人臉圖像。將人臉裁剪并調整為112×112,在RGB圖像中,通過對每個像素(范圍在[0,255]之間)減去127.5再除以128,來進行歸一化。

由于大多數卷積網絡都是針對Image-Net [34]分類任務而設計的,因此輸入圖像大小通常設置為224 × 224或更大。但是,我們的剪裁后人臉圖像的大小只有112×112。為了保持更高的feature map分辨率,我們在第一個convolutional layer中使用了conv3×3 and stride = 1來代替conv7×7 and stride = 2。對于這兩種設置,卷積網絡的輸出大小分別為7×7(在網絡名稱前面用“L”表示)和3×3。

?
3.2.2 Output setting
In the last several layers, some different options can be investigated to check how the embedding settings affect the model performance. All feature embedding dimensions are set to 512 except for Option-A, as the embedding size in Option-A is determined by the channel size of the last convolutional layer.

?Option-A: Use global pooling layer(GP).

?Option-B: Use one fully connected (FC) layer after GP.

?Option-C: Use FC-Batch Normalisation (BN) [20] after?GP.

?Option-D: Use FC-BN-Parametric Rectified Linear Unit (PReLu) [13] after GP.

?Option-E: Use BN-Dropout [38]-FC-BN after the last?convolutional layer.

During testing, the score is computed by the Cosine Distance?of two feature vectors. Nearest neighbour and threshold?comparison are used for face identification and verification?tasks.

3.2.2 輸出設置
在最后幾層中,可以探討一些不同的選項,來檢測嵌入設置是如何影響模型的性能。對于Option-A,所有feature的嵌入維數設置為512,其中Option-A中的嵌入維數由最后一個convolutional layer的通道大小決定。

?選項-A:使用全局池化層(GP)。

?選項-B:在GP之后使用一個全連接(FC)層。

?選項-C:在GP之后使用FC-Batch 標準化(BN)[20]。

?選項-D:在GP之后使用FC-BN-Parametric 整流線性單元(PReLu)[13]。

?選項-E:在最后一個卷積層之后使用BN-Dropout [38] -FC-BN。

在測試過程中，通過兩個特徵向量的余弦距離來計算分數。最近鄰比較和閾值比較分別用於人臉辨識（identification）和人臉驗證（verification）任務。
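補充（譯者注）：選項E（BN-Dropout-FC-BN）以及測試時的余弦距離打分，可以用如下示意代碼表達（非官方實現，輸入特徵圖尺寸7×7對應前文帶「L」的設置，dropout=0.4按論文設置）：

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionE(nn.Module):
    """BN-Dropout-FC-BN，輸出 512 維嵌入特徵。"""
    def __init__(self, in_ch=512, spatial=7, feat_dim=512, p=0.4):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.drop = nn.Dropout(p)
        self.fc = nn.Linear(in_ch * spatial * spatial, feat_dim)
        self.bn2 = nn.BatchNorm1d(feat_dim)

    def forward(self, x):                     # x: (N, in_ch, 7, 7)
        x = self.drop(self.bn1(x)).flatten(1)
        return self.bn2(self.fc(x))

def cosine_score(f1, f2):
    """驗證時的打分：兩個特徵向量的余弦相似度。"""
    return F.cosine_similarity(f1, f2, dim=1)

head = OptionE()
emb = head(torch.randn(4, 512, 7, 7))
score = cosine_score(emb[:2], emb[2:])
```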

?

3.2.3 Block Setting
Besides the original ResNet [14] unit, we also investigate?a more advanced residual unit setting [12] for the training?of face recognition model. In Figure 7, we show the improved?residual unit (denoted as “IR” in the end of model?names), which has a BN-Conv-BN-PReLu-Conv-BN structure.?Compared to the residual unit proposed by [12], we?set stride = 2 for the second convolutional layer instead of?the first one. In addition, PReLu [13] is used to substitute?the original ReLu.

3.2.3 模塊設置
在原有的ResNet[14]單元的基礎上,我們還研究了一種更高級的用于人臉識別模型訓練的殘差單元設置[12]。在圖7中,我們展示了改進后的殘差單元(模型名稱末尾用“IR”表示),其結構為BN-Conv-BN-PReLu-Conv-BN。與[12]提出的殘差單位相比,我們將第二個卷積層的步長設置為2,而不是第一個卷積層?(如下圖第二個藍色框中,步長設置為2)。另外,使用PReLu[13]代替原來的ReLu。
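補充（譯者注）：改進殘差單元（IR）的結構可以用如下示意代碼表達（非官方實現，shortcut分支的具體形式為假設的常見做法，僅用於說明主分支 BN-Conv(s=1)-BN-PReLU-Conv(s=2)-BN 的順序）：

```python
import torch
import torch.nn as nn

class IRBlock(nn.Module):
    """主分支：BN-Conv(stride=1)-BN-PReLU-Conv(stride=2)-BN。"""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.res = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1, bias=False),  # 第二個卷積層下採樣
            nn.BatchNorm2d(out_ch),
        )
        # 當尺寸或通道數變化時，shortcut 用 1x1 卷積對齊（此處為假設的常見做法）
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return self.res(x) + self.shortcut(x)

y = IRBlock(64, 128)(torch.randn(2, 64, 56, 56))   # 輸出 (2, 128, 28, 28)
```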

3.2.4 Backbones
Based on recent advances on the model structure designs,we also explore MobileNet [16], Inception-Resnet-V2 [40], Densely connected convolutional networks?(DenseNet) [18], Squeeze and excitation networks?(SE) [17] and Dual path Network (DPN) [9] for deep face?recognition. In this paper, we compare the differences between?these networks from the aspects of accuracy, speed?and model size.

3.2.4 骨干
基于模型結構設計的最新進展,我們還探索了MobileNet [16], Inception-Resnet-V2 [40], ?DenseNet[18], Squeeze,?SE[17] 和DPN [9],用于深度人臉識別。本文從精度、速度和模型大小三個方面比較了這些網絡之間的差異。


3.2.5 Network Setting Conclusions
Input selects L. In Table 2, we compare two networks with?and without the setting of “L”. When using conv3 × 3 and?stride = 1 as the first convolutional layer, the network output?is 7×7. By contrast, if we use conv7×7 and stride = 2?as the first??convolutional layer, the network output is only?3×3. It is obvious from Table 2 that choosing larger feature?maps during training obtains higher verification accuracy.

表2. 驗證精度(%)在不同的輸入條件下(Softmax@VGG2)。

3.2.5 網絡設置結論
輸入選擇L。在表2中,我們比較了兩個有和沒有設置“L”的網絡。當使用conv3×3和stride = 1作為第一個卷積層時,網絡輸出為7×7。相比之下,如果我們使用conv7×7和stride = 2作為第一個卷積層,網絡輸出只有3×3。從表2可以看出,在訓練過程中選擇較大的feature map可以獲得較高的驗證精度。

?

Output selects E. In Table 3, we give the detailed comparison between different output settings. The option E (BN-Dropout-FC-BN) obtains the best performance. In this paper, the dropout parameter is set as 0.4. Dropout can effectively act as a regularisation term to avoid over-fitting and obtain better generalisation for deep face recognition.

表3. 驗證精度(%)在不同的輸出設置(Softmax@VGG2)。

輸出選擇E。在表3中,我們給出了不同輸出設置的詳細比較。選項E (BN-Dropout-FC-BN)的性能最好。本文將dropout參數設置為0.4。Dropout可以有效地作為正則化項,避免過擬合,獲得更好的深度人臉識別泛化效果。

?

Block selects IR. In Table 4, we give the comparison?between the original residual unit and the improved?residual unit. As we can see from the results, the proposed?BN-Conv(stride=1)-BN-PReLu-Conv(stride=2)-BN?unit can obviously improve the verification performance.

表4. 驗證精度(%)原殘差單元與改進殘差單元的比較(Softmax@VGG2)。

殘差模塊選擇 IR。表4給出了原殘差單元與改進殘差單元的比較。從結果可以看出,提出的BN-Conv(stride=1)-BN-PReLu-Conv(stride=2)-BN 單元可以明顯提高驗證性能。

?

Backbones Comparisons. In Table 8, we give the verification accuracy, test speed and model size of different backbones. The running time is estimated on the P40 GPU. As?the performance on LFW is almost saturated, we focus on?the more challenging test sets, CFP-FP and AgeDB-30, to?compare these network backbones. The Inception-Resnet-V2 network obtains the best performance with long running?time (53.6ms) and largest model size (642MB). By contrast,MobileNet can finish face feature embedding within?4.2ms with a model of 112MB, and the performance only?drops slightly. As we can see from Table 8, the performance?gaps between these large networks, e.g. ResNet-100,Inception-Resnet-V2, DenseNet, DPN and SE-Resnet-100,
are relatively small. Based on the trade-off between accuracy,speed and model size, we choose LResNet100E-IR to?conduct experiments on the Megaface challenge.

表8.??不同骨干之間的準確性(%)、速度(ms)和模型大小(MB)的比較(Softmax@VGG2)

骨干比較。在表8中,我們給出了不同骨架的驗證精度、測試速度和模型尺寸。運行時間在P40 GPU上估算。由于LFW的性能已經接近飽和,我們將重點放在更具挑戰性的測試集CFP-FP和AgeDB-30上,來比較這些網絡骨架。Inception-Resnet-V2網絡獲得最佳的性能,其運行時間長為(53.6ms),最大的模型大小為(642MB)。相比之下,MobileNet可以使用大小為112MB的模型,在4.2ms內完成人臉特征的嵌入,性能略有下降。從表8可以看出,這些大型網絡,如ResNet-100,Inception-Resnet-V2, DenseNet, DPN 和?SE-Resnet-100,它們之間的性能差距相對較小。基于精度、速度和模型尺寸之間的權衡,我們選擇LResNet100E-IR來進行Megaface challenge實驗。

?

Weight decay. Based on the SE-LResNet50E-IR network,we also explore how the weight decay (WD) value?affects the verification performance. As we can see from?Table 5, when the weight decay value is set as 5e -?4, the?verification accuracy reaches the highest point. Therefore,?we fix the weight decay at 5e -?4 in all other experiments.

表5. 不同權重衰減(WD)值的驗證性能(%)(SE-LResNet50E-IR,Softmax@VGG2)。

權重衰減。基于SE-LResNet50E-IR網絡,我們還探討了權重衰減(WD)值如何影響驗證性能。從表5可以看出,當權重衰減值設置為5e - 4時,驗證精度達到最高點。因此,在所有其他實驗中,我們將權重衰減的值固定為5e - 4。

?

3.3. Loss Setting
Since the margin parameter m plays an important role?in the proposed ArcFace, we first conduct experiments to?search the best angular margin. By varying m from 0.2?to 0.8, we use the LMobileNetE network and the ArcFace?loss to train models on the refined MS1M dataset. As?illustrated in Table 6, the performance improves consistently?from m = 0.2 on all datasets and gets saturated at?m = 0.5. Then, the verification accuracy turns to decrease?from m = 0.5. In this paper, we fix the additive angular?margin m as 0.5.

表6. 不同的角度間隔 m (LMobileNetE,ArcFace@MS1M)對應的ArcFace驗證性能(%)。

3.3. 損失設計
由于margin參數 m 在提出的ArcFace中起著重要的作用,我們首先進行實驗來尋找最佳的角度間隔。通過將m從0.2變化到0.8,我們使用LMobileNetE網絡和ArcFace損失在清洗完的MS1M數據集上訓練模型。如表6所示,在所有數據集上,從m = 0.2開始,性能不斷提高,在m = 0.5時達到飽和。驗證精度從m = 0.5之后開始下降。本文將加法角度間隔 m 固定為0.5。

?

Based on the LResNet100E-IR network and the refined?MS1M dataset, we compare the performance of different?loss functions, e.g. Softmax, SphereFace [23], Cosine-Face [44, 43] and ArcFace. In Table 7, we give the detailed?verification accuracy on the LFW, CFP-FP, and AgeDB-30?datasets. As LFW is almost saturated, the performance improvement?is not obvious. We find that (1) Compared to?Softmax, SphereFace, CosineFace and ArcFace improve the?performance obviously, especially under large pose and age?variations. (2) CosineFace and ArcFace obviously outperform?SphereFace with much easier implementation. Both CosineFace and ArcFace can converge easily without additional?supervision from Softmax. By contrast, additional?supervision from Softmax is indispensable for SphereFace?to avoid divergence during training. (3) ArcFace is slightly?better than CosineFace. However, ArcFace is more intuitive?and has a more clear geometric interpretation on the hypersphere?manifold as shown in Figure 1.

表7. 不同損失函數下的驗證性能(%)(LResNet100E-IR@MS1M)。

基于LResNet100E-IR網絡和MS1M數據集清洗,我們比較了不同損失函數的性能,如Softmax、SphereFace[23]、Cosine-Face[44、43]和ArcFace。在表7中,我們給出了LFW、CFP-FP和AgeDB-30數據集的詳細驗證精度。由于LFW接近飽和,性能改善不明顯。我們發現(1)與Softmax相比,SphereFace、CosineFace和ArcFace明顯提高了性能,特別是在較大的姿態和年齡變化情況下。(2) CosineFace和ArcFace明顯優于SphereFace,實現更簡單。CosineFace和ArcFace可以很容易地收斂,而不需要額外的Softmax監督。相比之下,為了避免在訓練中出現發散,額外的Softmax監督對于SphereFace來說是必不可少的。(3) ArcFace略優于CosineFace。但是ArcFace更加直觀,對超球面流形的幾何解釋更加清晰,如圖1所示。

?

3.4. MegaFace Challenge1 on FaceScrub
For the experiments on the MegaFace challenge, we?use the LResNet100E-IR network and the refined MS1M?dataset as the training data. In both Table 9 and 10, we?give the identification and verification results on the original?MegaFace dataset and the refined MegaFace dataset.

In Table 9, we use the whole refined MS1M dataset to?train models. We compare the performance of the proposed?ArcFace with related baseline methods, e.g. Softmax,Triplet, SphereFace, and CosineFace. The proposed Arc-Face obtains the best performance before and after the distractors?refinement. After the overlapped face images are?removed from the one million distractors, the identification?performance significantly improves. We believe that the results?on the manually refined MegaFace dataset are more?reliable, and the performance of face identification under?million distractors is better than we think [2].

To strictly follow the evaluation instructions on?MegaFace, we need to remove all of the identities appearing?in the FaceScrub dataset from our training data. We calculate?the feature centre for each identity in the refined MS1M?dataset and the FaceScrub dataset. We find that 578 identities from the refined MS1M dataset have a close distance?(cosine similarity is higher than 0.45) with the identities?from the FaceScrub dataset. We remove these 578 identities?from the refined MS1M dataset and compare the proposed?ArcFace to other baseline methods in Table 10. ArcFace?still outperforms CosineFace with a slight performance drop?compared to Table 9. But for Softmax, the identification?rate drops obviously from 78.89% to 73.66% after the suspectable?overlap identities are removed from the training?data. On the refined MegaFace testset, the verification result?of CosineFace is slightly higher than that of ArcFace.This is because we read the verification results which are?closest to FAR=1e-6 from the outputs of the devkit. As we?can see from Figure 8, the proposed ArcFace always outperforms?CosineFace under both identification and verification?metric.

3.4. 在FaceScrub上的MegaFace Challenge1
對于MegaFace挑戰的實驗,我們使用LResNet100E-IR網絡和清洗完的MS1M數據集作為訓練數據。在表9和表10中,我們給出了原始MegaFace數據集和清洗完的MegaFace數據集的識別和驗證結果。

在表9中,我們使用整個清洗完的MS1M數據集來訓練模型。我們將提出的ArcFace與相關baseline方法(Softmax、Triplet、SphereFace和CosineFace)的性能進行了比較。在修改干擾項之前和之后,提出的ArcFace都獲得最佳的性能。從數量為一百萬的干擾集中去除重復的人臉圖像后,識別性能顯著提高。我們認為在手動清洗完的MegaFace數據集上的結果更可靠,在百萬級別的干擾集下,人臉識別的性能比我們認為的[2]更好。

為了嚴格遵守MegaFace的評估說明，我們需要從訓練數據中刪除FaceScrub數據集中出現的所有身份。我們計算了清洗後的MS1M數據集和FaceScrub數據集中每個身份的特徵中心，並發現清洗後的MS1M數據集中有578個身份與FaceScrub數據集中的身份距離很近（余弦相似度高於0.45）。我們從清洗後的MS1M數據集中刪除了這578個身份，並在表10中將提出的ArcFace與其他baseline方法進行比較。ArcFace仍然優於CosineFace，但與表9相比性能略有下降。而對於Softmax，從訓練數據中去除疑似重複的身份後，識別率明顯下降，從78.89%下降到73.66%。在清洗後的MegaFace測試集上，CosineFace的驗證結果略高於ArcFace，這是因為我們從devkit的輸出中讀取的是最接近FAR=1e-6的驗證結果。從圖8可以看出，在識別和驗證兩項指標上，提出的ArcFace始終優於CosineFace。

以下補充2個概念

rank-1 :https://blog.csdn.net/sinat_42239797/article/details/93651594

TAR和FAR:https://blog.csdn.net/liuweiyuxiang/article/details/81259492

表9. MegaFace Challenge1（LResNet100E-IR@MS1M）中不同方法的識別和驗證結果。「Rank 1」指的是rank-1人臉識別的精度，「VR」指的是在FAR（錯誤接受率）為1e-6時人臉驗證的TAR（正確接受率）。(R)表示清洗後的MegaFace數據集。

表10. MegaFace Challenge1（Methods@MS1M - FaceScrub）中不同方法的識別和驗證結果。「Rank 1」指的是rank-1人臉識別的精度，「VR」指的是在FAR（錯誤接受率）為1e-6時人臉驗證的TAR（正確接受率）。(R)表示清洗後的MegaFace數據集。

圖8.?(a) 和?(c) 報告了在附帶1M干擾項的MegaFace數據集上不同方法的CMC曲線 。(b) 和?(d) 報告了在附帶1M干擾集的MegaFace數據集上不同方法的ROC曲線 。(a) 和?(b)在原始的MegaFace數據集上評估,(c) 和?(d)則在清洗完的MegaFace數據集上評估。

?

3.5. Further Improvement by Triplet Loss
Due to the limitation of GPU memory, it is hard to train?Softmax-based methods,e.g. SphereFace, CosineFace and?ArcFace, with millions of identities. One practical solution?is to employ metric learning methods, and the most widely?used method is the Triplet loss [35, 22]. However, the converging?speed of Triplet loss is relatively slow. To this end,?we explore Triplet loss to fine-turn exist face recognition?models which are trained with Softmax based methods.

For Triplet loss fine-tuning, we use the LResNet100EIR?network and set learning rate at 0.005, momentum at 0?and weight decay at 5e -?4. As shown in Table 11, we?give the verification accuracy by Triplet loss fine-tuning?on the AgeDB-30 dataset. We find that (1) The Softmax?model trained on a dataset with fewer identity numbers (e.g.VGG2 with 8,631 identities) can be obviously improved?by Triplet loss fine-tuning on a dataset with more identity?numbers (e.g. MS1M with 85k identities). This improvement?confirms the effectiveness of the two-step training?strategy, and this strategy can significantly accelerate the
whole model training compared to training Triplet loss from?scratch. (2) The Softmax model can be further improved by?Triplet loss fine-tuning on the same dataset, which proves?that the local refinement can improve the global model. (3)?The excellence of margin improved Softmax methods, e.g.SphereFace, CosineFace, and ArcFace, can be kept and further?improved by Triplet loss fine-tuning, which also verifies?that local metric learning method, e.g. Triplet loss, is complementary to global hypersphere metric learning based?methods.

As the margin used in Triplet loss is the Euclidean distance,we will investigate Triplet loss with the angular margin?recently.

表11.??通過三重損失微調(LResNet100E-IR)提高驗證精度。

3.5. 進一步改進三重損失
由于GPU內存的限制,很難訓練使用softmax-based的方法(SphereFace, CosineFace和ArcFace),去訓練百萬級別的身份。一種實用的解決方法是使用度量學習方法,最廣泛使用的方法是三重損失[35,22]。然而,三重態損失的收斂速度相對較慢。為此,我們探索三重損失的微調現,存在的softmax based方法訓練的人臉識別模型。

對于三重損失的微調,我們使用LResNet100EIR網絡,并設置學習率為0.005,動量為0,權重衰減為5e - 4。如表11所示,我們通過對AgeDB-30數據集進行三重損失微調來給出驗證精度。我們發現 (1)用較少身份數量的數據集(例如具有8,631個身份的vgg2)訓練的Softmax模型可以顯著得到提升,通過使用在較多身份數量的數據集(例如具有85k身份的MS1M)上微調過的三重損失。這一改進證實了兩步訓練策略的有效性,與從頭開始訓練的三重損失相比,這種策略可以顯著加速整個模型訓練。(2)通過對同一數據集上的三重損失進行微調,可以進一步改進Softmax模型,證明局部改進可以提升全局模型。(3)margin的有點提升了的Softmax方法,如sphereface、CosineFace和ArcFace。這個優點可以通過三重損失的微調來保持和進一步改進,這也驗證了局部度量學習方法,如三重損失,是對全局超球度量學習基本方法的補充。

由于三重損失使用的間隔是歐幾里德距離,所以我們最近將用研究帶有角度間隔的三重損失。

?

4. Conclusions
In this paper, we contribute to improving deep face?recognition from data refinement, network settings and?loss function?designs. We have (1) refined the largest?public available training dataset (MS1M) and test dataset?(MegaFace); (2) explored different network settings and?analysed the trade-off between accuracy and speed; (3) proposed?a geometrically interpretable loss function called ArcFaceand explained why the proposed ArcFace is better?than Softmax, SphereFace and CosineFace from the view?of semi-hard sample distributions; (4) obtained state-of-theart?performance on the MegaFace dataset in a totally reproducible way.

4. 結論
在本文中,我們從數據清洗、網絡設置和損失函數設計三個方面來提升深度人臉識別的效果。我們有(1)清洗了規模最大的公開訓練數據集(MS1M)和測試數據集(MegaFace);(2)探索不同的網絡設置,分析準確性與速度之間的權衡;(3)提出了一種稱為ArcFace的幾何可解釋損失函數,從semi-hard樣本分布的角度解釋了為什么提出的ArcFace要優于Softmax、SphereFace和CosineFace;(4)以完全可復制的方式,在MegaFace數據集中獲得最先進的性能
————————————————
版權聲明:本文為CSDN博主「神羅Noctis」的原創文章,遵循CC 4.0 BY-SA版權協議,轉載請附上原文出處鏈接及本聲明。
原文鏈接:https://blog.csdn.net/qq_39937396/article/details/102523945
