A ConvNet for the 2020s
作者:Zhuang Liu1,2* Hanzi Mao1 Chao-Yuan Wu1 Christoph Feichtenhofer1 Trevor Darrell2 Saining Xie1†
機(jī)構(gòu):1Facebook AI Research (FAIR) 2UC Berkeley
*Work done during an internship at Facebook AI Research. —— 在Facebook人工智能研究部實(shí)習(xí)期間完成的工作。
†Corresponding author. —— 通訊作者。
Abstract
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
視覺識(shí)別的 "咆哮的20年代" 始于視覺Transformer(ViTs)的引入,它迅速取代了ConvNets成為最先進(jìn)的圖像分類模型。另一方面,普通的(vanilla)ViT在應(yīng)用于一般的計(jì)算機(jī)視覺任務(wù)(如目標(biāo)檢測(cè)和語(yǔ)義分割)時(shí)面臨著困難。正是分層Transformer(如Swin Transformers)重新引入了幾個(gè)ConvNet先驗(yàn) (priors),使得Transformer作為通用視覺骨干實(shí)際上是可行的,并在各種視覺任務(wù)中表現(xiàn)出顯著的性能。然而,這種混合方法的有效性仍然主要?dú)w功于Transformers的內(nèi)在優(yōu)勢(shì),而不是卷積固有的歸納偏置(the inherent inductive biases of convolutions)。在這項(xiàng)工作中,我們重新審視了設(shè)計(jì)空間(design spaces),并測(cè)試了純ConvNet所能實(shí)現(xiàn)的極限。我們逐步將一個(gè)標(biāo)準(zhǔn)的ResNet “現(xiàn)代化(modernize)”,使之靠近視覺Transformer的設(shè)計(jì),并在這一過(guò)程中發(fā)現(xiàn)了幾個(gè)造成性能差異的關(guān)鍵組件。這一探索的結(jié)果是一個(gè)被稱為ConvNeXt的純ConvNet模型系列。ConvNeXt完全由標(biāo)準(zhǔn)的ConvNet模塊構(gòu)成,在準(zhǔn)確性和可擴(kuò)展性方面可與Transformer媲美,達(dá)到了87.8%的ImageNet top-1準(zhǔn)確率,并在COCO檢測(cè)和ADE20K分割上超過(guò)了Swin Transformers,同時(shí)保持了標(biāo)準(zhǔn)ConvNets的簡(jiǎn)單性和效率。
這里面的ConvNets指的是基于CNN的網(wǎng)絡(luò)。
1. Introduction
Looking back at the 2010s, the decade was marked by the monumental progress and impact of deep learning. The primary driver was the renaissance of neural networks, particularly convolutional neural networks (ConvNets). Through the decade, the field of visual recognition successfully shifted from engineering features to designing (ConvNet) architectures. Although the invention of back-propagation-trained ConvNets dates all the way back to the 1980s [42], it was not until late 2012 that we saw its true potential for visual feature learning. The introduction of AlexNet [40] precipitated the “ImageNet moment” [59], ushering in a new era of computer vision. The field has since evolved at a rapid speed. Representative ConvNets like VGGNet [64], Inceptions [68], ResNe(X)t [28, 87], DenseNet [36], MobileNet [34], EfficientNet [71] and RegNet [54] focused on different aspects of accuracy, efficiency and scalability, and popularized many useful design principles.
回顧2010年代,這十年的特點(diǎn)是深度學(xué)習(xí)的巨大進(jìn)步和影響。主要驅(qū)動(dòng)力是神經(jīng)網(wǎng)絡(luò)的復(fù)興,特別是卷積神經(jīng)網(wǎng)絡(luò)(ConvNets)。在這十年中,視覺識(shí)別領(lǐng)域成功地從工程特征轉(zhuǎn)向設(shè)計(jì)(ConvNet)架構(gòu)。雖然反向傳播訓(xùn)練的ConvNets的發(fā)明可以追溯到20世紀(jì)80年代,但直到2012年底,我們才看到它在視覺特征學(xué)習(xí)方面的真正潛力。AlexNet的引入催生了 “ImageNet時(shí)刻”,開創(chuàng)了計(jì)算機(jī)視覺的新時(shí)代。此后,該領(lǐng)域以極快的速度發(fā)展起來(lái)。代表性的ConvNets如
- VGGNet
- Inceptions
- ResNe(X)t
- DenseNet
- MobileNet
- EfficientNet
- RegNet
- …
專注于準(zhǔn)確性、效率和可擴(kuò)展性的不同方面,并推廣了許多有用的設(shè)計(jì)原則。
The full dominance of ConvNets in computer vision was not a coincidence: in many application scenarios, a “sliding window” strategy is intrinsic to visual processing, particularly when working with high-resolution images. ConvNets have several built-in inductive biases that make them well-suited to a wide variety of computer vision applications. The most important one is translation equivariance, which is a desirable property for tasks like object detection. ConvNets are also inherently efficient due to the fact that when used in a sliding-window manner, the computations are shared [62]. For many decades, this has been the default use of ConvNets, generally on limited object categories such as digits [43], faces [58, 76] and pedestrians [19, 63]. Entering the 2010s, the region-based detectors [23, 24, 27, 57] further elevated ConvNets to the position of being the fundamental building block in a visual recognition system.
ConvNets在計(jì)算機(jī)視覺中的全面主導(dǎo)地位并不是一個(gè)巧合:在許多應(yīng)用場(chǎng)景中,"滑動(dòng)窗口(sliding window)"策略是視覺處理的內(nèi)在因素,特別是在處理高分辨率圖像時(shí)。ConvNets有幾個(gè)內(nèi)置的歸納偏置,使它們非常適合于各種計(jì)算機(jī)視覺應(yīng)用。最重要的是平移等變性 (translation equivariant),這是目標(biāo)檢測(cè)等任務(wù)的一個(gè)理想屬性。ConvNets本身也是高效的,因?yàn)楫?dāng)以滑動(dòng)窗口的方式使用時(shí),計(jì)算是共享的(也就是常說(shuō)的卷積第二個(gè)特征——權(quán)值共享)。幾十年來(lái),這一直是ConvNets的默認(rèn)用法,一般用于有限的對(duì)象類別,如數(shù)字、人臉和行人。進(jìn)入2010年代,基于區(qū)域的檢測(cè)器(region-based detectors)進(jìn)一步提升了ConvNets的地位,成為視覺識(shí)別系統(tǒng)的基本構(gòu)件。
translation equivariance: 卷積操作具有平移等變性(translation equivariance),即輸入圖像的平移會(huì)使輸出特征圖產(chǎn)生相同的平移(卷積"保留"了平移這一變換);而CNN整體上的平移不變性(translation invariance)通常是通過(guò)適當(dāng)?shù)?即與空間特征相關(guān)的)下采樣/池化來(lái)實(shí)現(xiàn)的。
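As a quick illustration of this property, the short PyTorch sketch below (not from the paper; the random 3×3 convolution and the one-pixel shift are illustrative assumptions) checks numerically that shifting the input of a convolution shifts its output by the same amount, away from the borders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)  # an arbitrary convolution

x = torch.randn(1, 1, 8, 8)
x_shifted = torch.roll(x, shifts=1, dims=-1)  # shift the input right by one pixel

y = conv(x)
y_shifted = conv(x_shifted)

# Away from the (padded / wrapped-around) borders, convolving the shifted input
# equals shifting the convolved output: translation equivariance.
print(torch.allclose(torch.roll(y, shifts=1, dims=-1)[..., 2:-2],
                     y_shifted[..., 2:-2], atol=1e-6))  # True
```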
Around the same time, the odyssey of neural network design for natural language processing (NLP) took a very different path, as the Transformers replaced recurrent neural networks to become the dominant backbone architecture. Despite the disparity in the task of interest between language and vision domains, the two streams surprisingly converged in the year 2020, as the introduction of Vision Transformers (ViT) completely altered the landscape of network architecture design. Except for the initial “patchify” layer, which splits an image into a sequence of patches, ViT introduces no image-specific inductive bias and makes minimal changes to the original NLP Transformers. One primary focus of ViT is on the scaling behavior: with the help of larger model and dataset sizes, Transformers can outperform standard ResNets by a significant margin. Those results on image classification tasks are inspiring, but computer vision is not limited to image classification. As discussed previously, solutions to numerous computer vision tasks in the past decade depended significantly on a sliding-window, fully convolutional paradigm. Without the ConvNet inductive biases, a vanilla ViT model faces many challenges in being adopted as a generic vision backbone. The biggest challenge is ViT’s global attention design, which has a quadratic complexity with respect to the input size. This might be acceptable for ImageNet classification, but quickly becomes intractable with higher-resolution inputs.
大約在同一時(shí)間,自然語(yǔ)言處理(NLP)領(lǐng)域的神經(jīng)網(wǎng)絡(luò)設(shè)計(jì)走上了一條非常不同的道路:Transformer取代了循環(huán)神經(jīng)網(wǎng)絡(luò)(RNN),成為了主流的骨干架構(gòu)。盡管語(yǔ)言和視覺領(lǐng)域所關(guān)注的任務(wù)不盡相同,但這兩股潮流在2020年出人意料地匯合了:Vision Transformers(ViT)的引入完全改變了網(wǎng)絡(luò)架構(gòu)設(shè)計(jì)的格局。除了最初的 "patchify" 層(將圖像分割成一連串的patches),ViT沒有引入圖像特定的歸納偏置,對(duì)原始的NLP Transformer的改動(dòng)也很小。ViT的一個(gè)主要關(guān)注點(diǎn)是擴(kuò)展行為:在更大的模型和數(shù)據(jù)集規(guī)模的幫助下,Transformers可以大幅超過(guò)標(biāo)準(zhǔn)ResNets的表現(xiàn)。這些關(guān)于圖像分類任務(wù)的結(jié)果是鼓舞人心的,但計(jì)算機(jī)視覺并不限于圖像分類。如前所述,在過(guò)去十年中,許多計(jì)算機(jī)視覺任務(wù)的解決方案在很大程度上依賴于滑動(dòng)窗口、全卷積范式(fully convolutional paradigm)。如果沒有ConvNet的歸納偏置,普通的(vanilla)ViT模型在作為通用視覺骨干時(shí)面臨許多挑戰(zhàn)。最大的挑戰(zhàn)是ViT的全局注意力設(shè)計(jì),它的復(fù)雜度與輸入大小呈二次方關(guān)系。這對(duì)于ImageNet分類來(lái)說(shuō)可能是可以接受的,但對(duì)于更高分辨率的輸入來(lái)說(shuō)很快就變得難以處理。
Hierarchical Transformers employ a hybrid approach to bridge this gap. For example, the “sliding window” strategy (e.g. attention within local windows) was reintroduced to Transformers, allowing them to behave more similarly to ConvNets. Swin Transformer [45] is a milestone work in this direction, demonstrating for the first time that Transformers can be adopted as a generic vision backbone and achieve state-of-the-art performance across a range of computer vision tasks beyond image classification. Swin Transformer’s success and rapid adoption also revealed one thing: the essence of convolution is not becoming irrelevant; rather, it remains much desired and has never faded.
分層Transformer采用了一種混合方法來(lái)彌補(bǔ)這一差距。例如,"滑動(dòng)窗口 "策略(如在局部窗口內(nèi)的注意)被重新引入Transformers,使其行為與ConvNets更加相似。Swin Transformer是這個(gè)方向上的一個(gè)里程碑式的工作,首次證明了Transformer可以作為通用的視覺骨干,并在圖像分類之外的一系列計(jì)算機(jī)視覺任務(wù)中取得最先進(jìn)的性能。Swin Transformer的成功和快速采用也揭示了一件事:卷積的本質(zhì)并沒有變得不重要;相反,它仍然備受期待,從未褪色。
Under this perspective, many of the advancements of Transformers for computer vision have been aimed at bringing back convolutions. These attempts, however, come at a cost: a naive implementation of sliding window self-attention can be expensive [55]; with advanced approaches such as cyclic shifting [45], the speed can be optimized but the system becomes more sophisticated in design. On the other hand, it is almost ironic that a ConvNet already satisfies many of those desired properties, albeit in a straightforward, no-frills way. The only reason ConvNets appear to be losing steam is that (hierarchical) Transformers surpass them in many vision tasks, and the performance difference is usually attributed to the superior scaling behavior of Transformers, with multi-head self-attention being the key component.
在這種觀點(diǎn)下,許多用于計(jì)算機(jī)視覺的Transformer的進(jìn)步都是為了讓卷積回歸。然而,這些嘗試是有代價(jià)的:滑動(dòng)窗口self-attention的樸素(naive)實(shí)現(xiàn)可能代價(jià)高昂;用先進(jìn)的方法,如循環(huán)移位(cyclic shifting),速度可以被優(yōu)化,但系統(tǒng)的設(shè)計(jì)變得更加復(fù)雜。另一方面,具有諷刺意味的是,ConvNet已經(jīng)滿足了許多這些期望的特性,盡管是以一種直接的、不加修飾的方式。ConvNets似乎正在失去動(dòng)力的唯一原因是(分層的)Transformers在許多視覺任務(wù)中超過(guò)了它們,而性能差異通常歸因于Transformers卓越的擴(kuò)展行為,其中多頭自注意力是關(guān)鍵的組成部分。
Unlike ConvNets, which have progressively improved over the last decade, the adoption of Vision Transformers was a step change. In recent literature, system-level comparisons (e.g. a Swin Transformer vs. a ResNet) are usually adopted when comparing the two. ConvNets and hierarchical vision Transformers become different and similar at the same time: they are both equipped with similar inductive biases, but differ significantly in the training procedure and macro/micro-level architecture design. In this work, we investigate the architectural distinctions between ConvNets and Transformers and try to identify the confounding variables when comparing the network performance. Our research is intended to bridge the gap between the pre-ViT and post-ViT eras for ConvNets, as well as to test the limits of what a pure ConvNet can achieve.
與在過(guò)去十年中逐步改進(jìn)的ConvNets不同,Vision Transformers的采用是一次跨越式的轉(zhuǎn)變(step change)。在最近的文獻(xiàn)中,在比較兩者時(shí)通常采用系統(tǒng)級(jí)的比較(如Swin Transformer與ResNet)。ConvNets和分層視覺Transformer同時(shí)變得既不同又相似:它們都配備了類似的歸納偏置,但在訓(xùn)練程序和宏觀/微觀層面的架構(gòu)設(shè)計(jì)上有很大的不同。在這項(xiàng)工作中,我們研究了ConvNets和Transformers之間的架構(gòu)區(qū)別,并試圖確定比較網(wǎng)絡(luò)性能時(shí)的混雜變量。我們的研究旨在彌合ConvNets在前ViT時(shí)代和后ViT時(shí)代之間的差距,并測(cè)試純ConvNet能夠?qū)崿F(xiàn)的極限。
To do this, we start with a standard ResNet (e.g. ResNet50) trained with an improved procedure. We gradually “modernize” the architecture to the construction of a hierarchical vision Transformer (e.g. Swin-T). Our exploration is directed by a key question: How do design decisions in Transformers impact ConvNets’ performance? We discover several key components that contribute to the performance difference along the way. As a result, we propose a family of pure ConvNets dubbed ConvNeXt. We evaluate ConvNeXts on a variety of vision tasks such as ImageNet classification [17], object detection/segmentation on COCO [44], and semantic segmentation on ADE20K [92]. Surprisingly, ConvNeXts, constructed entirely from standard ConvNet modules, compete favorably with Transformers in terms of accuracy, scalability and robustness across all major benchmarks. ConvNeXt maintains the efficiency of standard ConvNets, and the fully-convolutional nature for both training and testing makes it extremely simple to implement.
為了做到這一點(diǎn),我們從一個(gè)標(biāo)準(zhǔn)的ResNet(例如ResNet-50)開始,用改進(jìn)的程序進(jìn)行訓(xùn)練。我們逐步將該架構(gòu) “現(xiàn)代化”,使其向分層視覺Transformer(例如Swin-T)的構(gòu)造靠攏。我們的探索由一個(gè)關(guān)鍵問(wèn)題引導(dǎo):Transformer中的設(shè)計(jì)決策如何影響ConvNets的性能?我們?cè)谶@一過(guò)程中發(fā)現(xiàn)了幾個(gè)造成性能差異的關(guān)鍵組件。因此,我們提出了一個(gè)被稱為ConvNeXt的純ConvNets系列。我們?cè)诟鞣N視覺任務(wù)上評(píng)估了ConvNeXts,如ImageNet分類、COCO上的目標(biāo)檢測(cè)/分割,以及ADE20K上的語(yǔ)義分割。令人驚訝的是,完全由標(biāo)準(zhǔn)ConvNet模塊構(gòu)建的ConvNeXts在所有主要基準(zhǔn)的準(zhǔn)確性、可擴(kuò)展性和魯棒性方面都能與Transformers一較高下。ConvNeXt保持了標(biāo)準(zhǔn)ConvNets的效率,而且訓(xùn)練和測(cè)試的全卷積性質(zhì)使其實(shí)現(xiàn)起來(lái)非常簡(jiǎn)單。
We hope the new observations and discussions can challenge some common beliefs and encourage people to rethink the importance of convolutions in computer vision.
我們希望新的觀察和討論可以挑戰(zhàn)一些常見的信念,鼓勵(lì)人們重新思考計(jì)算機(jī)視覺中卷積的重要性。
2. Modernizing a ConvNet: a Roadmap —— 將ConvNet現(xiàn)代化:一份路線圖
In this section, we provide a trajectory going from a ResNet to a ConvNet that bears a resemblance to Transformers. We consider two model sizes in terms of FLOPs, one is the ResNet-50 / Swin-T regime with FLOPs around $4.5\times 10^9$ and the other being ResNet-200 / Swin-B regime which has FLOPs around $15.0\times 10^9$. For simplicity, we will present the results with the ResNet-50 / Swin-T complexity models. The conclusions for higher capacity models are consistent and results can be found in Appendix C.
在這一節(jié)中,我們提供了一條從ResNet演變到一個(gè)與Transformer有相似之處的ConvNet的軌跡。我們考慮了兩種FLOPs量級(jí)的模型,一種是ResNet-50 / Swin-T體系,FLOPs約為 $4.5\times 10^9$,另一種是ResNet-200 / Swin-B體系,FLOPs約為 $15.0\times 10^9$。為了簡(jiǎn)單起見,我們將展示ResNet-50 / Swin-T復(fù)雜度模型的結(jié)果。更高容量模型的結(jié)論是一致的,結(jié)果可以在附錄C中找到。
At a high level, our explorations are directed to investigate and follow different levels of designs from a Swin Transformer while maintaining the network’s simplicity as a standard ConvNet. The roadmap of our exploration is as follows. Our starting point is a ResNet-50 model. We first train it with similar training techniques used to train vision Transformers and obtain much improved results compared to the original ResNet-50. This will be our baseline. We then study a series of design decisions which we summarized as 1) macro design, 2) ResNeXt, 3) inverted bottleneck, 4) large kernel size, and 5) various layer-wise micro designs. In Figure 2, we show the procedure and the results we are able to achieve with each step of the “network modernization”. Since network complexity is closely correlated with the final performance, the FLOPs are roughly controlled over the course of the exploration, though at intermediate steps the FLOPs might be higher or lower than the reference models. All models are trained and evaluated on ImageNet-1K.
在高層(high level)上,我們的探索方向是研究和遵循Swin Transformer的不同層次(level)的設(shè)計(jì),同時(shí)保持網(wǎng)絡(luò)作為一個(gè)標(biāo)準(zhǔn)ConvNet的簡(jiǎn)單性。我們探索的路線圖如下。
Figure 2. We modernize a standard ConvNet (ResNet) towards the design of a hierarchical vision Transformer (Swin), without introducing any attention-based modules. The foreground bars are model accuracies in the ResNet-50/Swin-T FLOP regime; results for the ResNet-200/Swin-B regime are shown with the gray bars. A hatched bar means the modification is not adopted. Detailed results for both regimes are in the appendix. Many Transformer architectural choices can be incorporated in a ConvNet, and they lead to increasingly better performance. In the end, our pure ConvNet model, named ConvNeXt, can outperform the Swin Transformer.
圖2. 我們將一個(gè)標(biāo)準(zhǔn)的ConvNet(ResNet)朝著分層視覺Transformer(Swin)的設(shè)計(jì)進(jìn)行現(xiàn)代化改造,而不引入任何基于注意力的模塊。前景條形圖是ResNet-50/Swin-T FLOP體系中的模型精度;ResNet-200/Swin-B體系的結(jié)果用灰色條形圖表示。帶陰影線(hatched)的條形圖表示未采用該修改。兩種體系的詳細(xì)結(jié)果見附錄。許多Transformer架構(gòu)的選擇可以被納入ConvNet中,而且它們會(huì)帶來(lái)越來(lái)越好的性能。最后,我們的純ConvNet模型,名為ConvNeXt,可以超過(guò)Swin Transformer。
我們的起點(diǎn)是一個(gè)ResNet-50模型。我們首先用類似于訓(xùn)練視覺Transformer的訓(xùn)練技巧來(lái)訓(xùn)練它,并獲得了比原始ResNet-50好得多的結(jié)果。這將是我們的基線(baseline)。然后,我們研究了一系列的設(shè)計(jì)決策,我們總結(jié)為:1) 宏觀設(shè)計(jì),2) ResNeXt化,3) 逆瓶頸(inverted bottleneck),4) 大卷積核,以及 5) 各種逐層的微觀設(shè)計(jì)。
在圖2中,我們展示了 "網(wǎng)絡(luò)現(xiàn)代化 "的每一步的程序和我們能夠?qū)崿F(xiàn)的結(jié)果。由于網(wǎng)絡(luò)的復(fù)雜性與最終的性能密切相關(guān),在探索的過(guò)程中,FLOPs被大致控制,盡管在中間步驟,FLOPs可能高于或低于參考模型。所有模型都是在ImageNet-1K上訓(xùn)練和評(píng)估的。
2.1 Training Techniques —— 訓(xùn)練技巧
Apart from the design of the network architecture, the training procedure also affects the ultimate performance. Not only did vision Transformers bring a new set of modules and architectural design decisions, but they also introduced different training techniques (e.g. AdamW optimizer) to vision. This pertains mostly to the optimization strategy and associated hyper-parameter settings. Thus, the first step of our exploration is to train a baseline model with the vision Transformer training procedure, in this case, ResNet50/200. Recent studies [7, 81] demonstrate that a set of modern training techniques can significantly enhance the performance of a simple ResNet-50 model. In our study, we use a training recipe that is close to DeiT’s [73] and Swin Transformer’s [45]. The training is extended to 300 epochs from the original 90 epochs for ResNets. We use the AdamW optimizer [46], data augmentation techniques such as Mixup [90], Cutmix [89], RandAugment [14], Random Erasing [91], and regularization schemes including Stochastic Depth [36] and Label Smoothing [69]. The complete set of hyper-parameters we use can be found in Appendix A.1. By itself, this enhanced training recipe increased the performance of the ResNet-50 model from 76.1% [1] to 78.8% (+2.7%), implying that a significant portion of the performance difference between traditional ConvNets and vision Transformers may be due to the training techniques. We will use this fixed training recipe with the same hyperparameters throughout the “modernization” process. Each reported accuracy on the ResNet-50 regime is an average obtained from training with three different random seeds.
除了網(wǎng)絡(luò)架構(gòu)的設(shè)計(jì),訓(xùn)練程序也會(huì)影響最終的性能。視覺Transformer不僅帶來(lái)了一套新的模塊和架構(gòu)設(shè)計(jì)決策,而且還為視覺引入了不同的訓(xùn)練技術(shù)(如AdamW優(yōu)化器)。這主要涉及到優(yōu)化策略和相關(guān)的超參數(shù)設(shè)置。因此,我們探索的第一步是用視覺Transformer的訓(xùn)練程序來(lái)訓(xùn)練一個(gè)基線模型(baseline),在這里即ResNet-50/200。最近的研究表明,一套現(xiàn)代訓(xùn)練技術(shù)可以顯著提高一個(gè)簡(jiǎn)單的ResNet-50模型的性能。在我們的研究中,我們使用了與DeiT和Swin Transformer相近的訓(xùn)練配置。相比ResNets原來(lái)的90個(gè)epochs,訓(xùn)練被延長(zhǎng)到300個(gè)epochs。我們使用AdamW優(yōu)化器,數(shù)據(jù)增強(qiáng)技術(shù)如Mixup、Cutmix、RandAugment、Random Erasing,以及包括Stochastic Depth和Label Smoothing在內(nèi)的正則化方案。我們使用的完整的超參數(shù)集可以在附錄A.1中找到。
就其本身而言,這個(gè)增強(qiáng)的訓(xùn)練配置將ResNet-50模型的性能從76.1%提高到78.8%(+2.7%),這意味著傳統(tǒng)ConvNets和視覺Transformer之間的性能差異的很大一部分可能是由于訓(xùn)練技巧造成的。我們將在整個(gè) "現(xiàn)代化"過(guò)程中使用這個(gè)固定的訓(xùn)練配置,并使用相同的超參數(shù)。ResNet-50制度上的每個(gè)報(bào)告的準(zhǔn)確度是用三個(gè)不同的隨機(jī)種子訓(xùn)練得到的平均值。
2.2 Macro Design —— 宏觀設(shè)計(jì)
We now analyze Swin Transformers’ macro network design. Swin Transformers follow ConvNets [28, 65] to use a multi-stage design, where each stage has a different feature map resolution. There are two interesting design considerations: the stage compute ratio, and the “stem cell” structure.
我們現(xiàn)在分析一下Swin Transformers的宏觀網(wǎng)絡(luò)設(shè)計(jì)。Swin Transformers跟隨ConvNets使用多階段設(shè)計(jì),每個(gè)階段有不同的特征圖分辨率。有兩個(gè)有趣的設(shè)計(jì)考慮:階段計(jì)算比和 "干細(xì)胞(stem cell)"結(jié)構(gòu)。
Changing stage compute ratio. The original design of the computation distribution across stages in ResNet was largely empirical. The heavy “res4” stage was meant to be compatible with downstream tasks like object detection, where a detector head operates on the 14×14 feature plane. Swin-T, on the other hand, followed the same principle but with a slightly different stage compute ratio of 1:1:3:1. For larger Swin Transformers, the ratio is 1:1:9:1. Following the design, we adjust the number of blocks in each stage from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3), which also aligns the FLOPs with Swin-T. This improves the model accuracy from 78.8% to 79.4%. Notably, researchers have thoroughly investigated the distribution of computation [53, 54], and a more optimal design is likely to exist.
From now on, we will use this stage compute ratio.
2.2.1 改變各階段的計(jì)算比例 (Changing stage compute ratio)
ResNet中各階段的計(jì)算分布的最初設(shè)計(jì)主要是經(jīng)驗(yàn)性的。沉重的 "res4 "階段是為了與下游任務(wù)兼容,如目標(biāo)檢測(cè),其中一個(gè)檢測(cè)器頭(detector head)在14×14的特征平面上操作。另一方面,Swin-T也遵循同樣的原則,但階段計(jì)算比例略有不同,為1:1:3:1。對(duì)于較大的Swin Transformers,比例為1:1:9:1。按照設(shè)計(jì),我們將每個(gè)階段的塊數(shù)(blocks)從ResNet-50的(3,4,6,3)調(diào)整為(3,3,9,3),這也使FLOPs與Swin-T一致。這使模型的準(zhǔn)確性從78.8%提高到79.4%。值得注意的是,研究人員已經(jīng)徹底調(diào)查了計(jì)算的分布情況,而且很可能存在一個(gè)更理想的設(shè)計(jì)。
從現(xiàn)在開始,我們將使用這個(gè)階段的計(jì)算比例。
Changing stem to “Patchify”. Typically, the stem cell design is concerned with how the input images will be processed at the network’s beginning. Due to the redundancy inherent in natural images, a common stem cell will aggressively downsample the input images to an appropriate feature map size in both standard ConvNets and vision Transformers. The stem cell in standard ResNet contains a 7×7 convolution layer with stride 2, followed by a max pool, which results in a 4× downsampling of the input images. In vision Transformers, a more aggressive “patchify” strategy is used as the stem cell, which corresponds to a large kernel size (e.g. kernel size = 14 or 16) and non-overlapping convolution. Swin Transformer uses a similar “patchify” layer, but with a smaller patch size of 4 to accommodate the architecture’s multi-stage design. We replace the ResNet-style stem cell with a patchify layer implemented using a 4×4, stride 4 convolutional layer. The accuracy has changed from 79.4% to 79.5%. This suggests that the stem cell in a ResNet may be substituted with a simpler “patchify” layer à la ViT which will result in similar performance.
We will use the “patchify stem” (4×4 non-overlapping convolution) in the network.
2.2.2 將"stem"改為 “Patchify” (Changing stem to “Patchify”)。
通常情況下,stem的設(shè)計(jì)關(guān)注的是在網(wǎng)絡(luò)開始時(shí)如何處理輸入圖像。由于自然圖像中固有的冗余,在標(biāo)準(zhǔn)ConvNets和視覺Transformer中,常見的stem都會(huì)大幅度地對(duì)輸入圖像進(jìn)行下采樣,得到合適大小的特征圖。標(biāo)準(zhǔn)ResNet中的stem包含一個(gè)7×7、步長(zhǎng)為2的卷積層,然后是一個(gè)最大池化層,這使輸入圖像被4倍下采樣。在視覺Transformer中,一個(gè)更激進(jìn)的 "patchify" 策略被用作stem,它對(duì)應(yīng)于一個(gè)大的核尺寸(例如kernel size=14或16)和非重疊卷積。Swin Transformer使用類似的 "patchify" 層,但patch尺寸較小,為4,以適應(yīng)架構(gòu)的多階段設(shè)計(jì)。我們用一個(gè)4×4、步長(zhǎng)為4的卷積層實(shí)現(xiàn)的patchify層取代ResNet式的stem。準(zhǔn)確率從79.4%變?yōu)?9.5%。這表明ResNet中的stem可以用一個(gè)更簡(jiǎn)單的、ViT式的 "patchify" 層來(lái)代替,并得到類似的性能。
我們將在網(wǎng)絡(luò)中使用 “patchify stem”(4×4非重疊卷積)。
非重疊卷積就是說(shuō)卷積核大小 ≤ 步長(zhǎng)(kernel size ≤ stride)。
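For concreteness, here is a minimal PyTorch sketch of the two stem variants discussed above (the 64/96 channel counts follow ResNet-50 and ConvNeXt-T/Swin-T; this is an illustrative sketch rather than the official implementation):

```python
import torch
import torch.nn as nn

# ResNet-style stem: 7x7 stride-2 conv + stride-2 max pooling (overall 4x downsampling).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single 4x4, stride-4 (non-overlapping) convolution, also 4x downsampling.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```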
2.3. ResNeXt-ify —— ResNeXT化
In this part, we attempt to adopt the idea of ResNeXt [87], which has a better FLOPs/accuracy trade-off than a vanilla ResNet. The core component is grouped convolution, where the convolutional filters are separated into different groups. At a high level, ResNeXt’s guiding principle is to “use more groups, expand width”. More precisely, ResNeXt employs grouped convolution for the 3×3 conv layer in a bottleneck block. As this significantly reduces the FLOPs, the network width is expanded to compensate for the capacity loss.
在這一部分,我們?cè)噲D采用ResNeXt的思想,它比普通的ResNet有更好的FLOPs/準(zhǔn)確性權(quán)衡。其核心部分是分組卷積(grouped convolution),其中卷積核被分成不同的組。在高層次上,ResNeXt的指導(dǎo)原則是 “使用更多的組,擴(kuò)大寬度”。更確切地說(shuō),ResNeXt對(duì)Bottleneck中的3×3卷積層采用了分組卷積。由于這大大減少了FLOPs,網(wǎng)絡(luò)寬度被擴(kuò)大以補(bǔ)償容量的損失。
In our case we use depthwise convolution, a special case of grouped convolution where the number of groups equals the number of channels. Depthwise conv has been popularized by MobileNet [34] and Xception [11]. We note that depthwise convolution is similar to the weighted sum operation in self-attention, which operates on a per-channel basis, i.e., only mixing information in the spatial dimension. The combination of depthwise conv and 1 × 1 convs leads to a separation of spatial and channel mixing, a property shared by vision Transformers, where each operation either mixes information across spatial or channel dimension, but not both. The use of depthwise convolution effectively reduces the network FLOPs and, as expected, the accuracy. Following the strategy proposed in ResNeXt, we increase the network width to the same number of channels as Swin-T’s (from 64 to 96). This brings the network performance to 80.5% with increased FLOPs (5.3G). We will now employ the ResNeXt design.
在我們的案例中,我們使用深度卷積(depthwise convolution),這是分組卷積的一個(gè)特例,其中分組的數(shù)量等于通道的數(shù)量。深度卷積已被MobileNet和Xception所推廣。我們注意到,深度卷積與自注意力中的加權(quán)和操作類似,后者也是在每個(gè)通道的基礎(chǔ)上操作的,即只在空間維度上混合信息。深度卷積和1×1卷積的結(jié)合實(shí)現(xiàn)了空間混合和通道混合的分離,這是視覺Transformer所共有的屬性:每個(gè)操作要么在空間維度、要么在通道維度上混合信息,而不會(huì)同時(shí)混合兩者。深度卷積的使用有效地減少了網(wǎng)絡(luò)的FLOPs,同時(shí)正如預(yù)期的那樣,也降低了準(zhǔn)確率。按照ResNeXt提出的策略,我們將網(wǎng)絡(luò)寬度增加到與Swin-T相同的通道數(shù)量(從64到96)。這使得網(wǎng)絡(luò)性能達(dá)到80.5%,同時(shí)FLOPs增加到5.3G。我們現(xiàn)在將采用ResNeXt的設(shè)計(jì)。
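The difference between a regular (dense) convolution and a depthwise convolution is just the `groups` argument; the toy comparison below (assuming a 96-channel stage width) shows how drastically it cuts the parameter count, which is why the network width is then expanded to compensate:

```python
import torch.nn as nn

dim = 96  # assumed stage width after widening to match Swin-T

# Dense convolution: every output channel mixes all input channels (spatial + channel mixing).
dense_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

# Depthwise convolution: groups == channels, so each channel is filtered independently,
# mixing information only along the spatial dimensions.
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

print(sum(p.numel() for p in dense_conv.parameters()))      # 96*96*3*3 + 96 = 83,040
print(sum(p.numel() for p in depthwise_conv.parameters()))  # 96*1*3*3  + 96 = 960
```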
2.4. Inverted Bottleneck —— 逆殘差模塊
One important design in every Transformer block is that it creates an inverted bottleneck, i.e., the hidden dimension of the MLP block is four times wider than the input dimension (see Figure 4). Interestingly, this Transformer design is connected to the inverted bottleneck design with an expansion ratio of 4 used in ConvNets. The idea was popularized by MobileNetV2 [61], and has subsequently gained traction in several advanced ConvNet architectures [70, 71].
每個(gè)Transformer塊中的一個(gè)重要設(shè)計(jì)是,它創(chuàng)造了一個(gè)逆殘差瓶頸模塊,即MLP塊的隱藏維度比輸入維度寬四倍(見圖4)。有趣的是,這種Transformer設(shè)計(jì)與ConvNets中使用的擴(kuò)展率為4的逆殘差瓶頸模塊設(shè)計(jì)有聯(lián)系。這個(gè)想法被MobileNetV2所推廣,隨后在一些先進(jìn)的ConvNet架構(gòu)中得到推廣[MnasNet, EfficientNet]。
Figure 3. Block modifications and resulted specifications. (a) is a ResNeXt block; in (b) we create an inverted bottleneck block and in (c) the position of the spatial depthwise conv layer is moved up.
圖3. 塊的修改以及相應(yīng)的規(guī)格。(a)是一個(gè)ResNeXt塊;在(b)中,我們創(chuàng)建了一個(gè)倒置的瓶頸塊;在(c)中,空間深度卷積(depthwise conv)層的位置被上移。
Here we explore the inverted bottleneck design. Figure 3 (a) to (b) illustrate the configurations. Despite the increased FLOPs for the depthwise convolution layer, this change reduces the whole network FLOPs to 4.6G, due to the significant FLOPs reduction in the downsampling residual blocks’ shortcut 1×1 conv layer. Interestingly, this results in slightly improved performance (80.5% to 80.6%). In the ResNet-200 / Swin-B regime, this step brings even more gain (81.9% to 82.6%) also with reduced FLOPs.
We will now use inverted bottlenecks.
在這里,我們探討了逆殘差模塊設(shè)計(jì)。圖3(a)至(b)說(shuō)明了配置。盡管深度卷積層的FLOPs增加了,但由于下采樣殘差塊的捷徑(shortcut)1×1卷積層的FLOPs大幅減少,這種改變使整個(gè)網(wǎng)絡(luò)的FLOPs減少到4.6G。有趣的是,這樣做的結(jié)果是性能略有提高(80.5%到80.6%)。在ResNet-200/Swin-B體系中,這一步帶來(lái)了更多的收益(81.9%到82.6%),同時(shí)也減少了FLOPs。
我們現(xiàn)在將使用逆殘差模塊。
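A rough sketch of the two block layouts compared here (Figure 3 (a) vs. (b)), assuming a 96-channel block input/output and a 4× expansion ratio; normalization and activation layers are omitted for brevity and the exact channel numbers are illustrative:

```python
import torch.nn as nn

dim = 96

# Bottleneck ordering (Figure 3 (a), ResNeXt-ified): wide -> narrow -> wide.
bottleneck = nn.Sequential(
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 reduce, 384 -> 96
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise 3x3 at 96 channels
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 expand, 96 -> 384
)

# Inverted bottleneck (Figure 3 (b)): narrow -> wide -> narrow, like a Transformer MLP block.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                 # 1x1 expand, 96 -> 384
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),  # depthwise 3x3 at 384 channels
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                 # 1x1 reduce, 384 -> 96
)
```

Note that in the inverted layout the depthwise layer now runs at the wide dimension, which is exactly what the next step (moving the depthwise conv layer up) addresses.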
2.5. Large Kernel Sizes —— 大卷積核
In this part of the exploration, we focus on the behavior of large convolutional kernels. One of the most distinguishing aspects of vision Transformers is their non-local self-attention, which enables each layer to have a global receptive field. While large kernel sizes have been used in the past with ConvNets [40, 68], the gold standard (popularized by VGGNet [65]) is to stack small kernel-sized (3×3) conv layers, which have efficient hardware implementations on modern GPUs [41]. Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7×7, significantly larger than the ResNe(X)t kernel size of 3×3. Here we revisit the use of large kernel-sized convolutions for ConvNets.
在這一部分的探索中,我們重點(diǎn)關(guān)注大卷積核的效果。視覺Transformer最突出的一個(gè)方面是它們的非局部自注意力(non-local self-attention),這使得每一層都有一個(gè)全局感受野(global receptive field)。雖然過(guò)去在ConvNets[AlexNet, Inception v1]中使用過(guò)大卷積核,但黃金標(biāo)準(zhǔn)(由VGGNet推廣)是堆疊小卷積核(3×3)的卷積層,它們?cè)诂F(xiàn)代GPU上有高效的硬件實(shí)現(xiàn)。雖然Swin Transformers在自注意力模塊中重新引入了局部窗口(local window),但窗口大小至少是7×7,明顯大于3×3的ResNe(X)t卷積核。在此,我們重新審視大卷積核在ConvNets中的使用。
2.5.1 Moving up depthwise conv layer —— 上移深度卷積層
To explore large kernels, one prerequisite is to move up the position of the depthwise conv layer (Figure 3 (b) to ?). That is a design decision also evident in Transformers: the MSA block is placed prior to the MLP layers. As we have an inverted bottleneck block, this is a natural design choice — the complex/inefficient modules (MSA, large-kernel conv) will have fewer channels, while the efficient, dense 1×1 layers will do the heavy lifting. This intermediate step reduces the FLOPs to 4.1G, resulting in a temporary performance degradation to 79.9%.
為了探索大的內(nèi)核,一個(gè)前提條件是將深度卷積層的位置上移(圖3(b)到(c))。這是一個(gè)在Transformer中也很明顯的設(shè)計(jì)決定:MSA塊被放在MLP層之前。由于我們有一個(gè)逆殘差模塊,這是一個(gè)自然的設(shè)計(jì)選擇——復(fù)雜/低效的模塊(MSA,大核conv)將有較少的通道,而高效、密集的1×1層將完成重任。這個(gè)中間步驟將FLOPs減少到4.1G,導(dǎo)致性能暫時(shí)下降到79.9%。
2.5.2 Increasing the kernel size —— 增大卷積核尺寸
With all of these preparations, the benefit of adopting larger kernel-sized convolutions is significant. We experimented with several kernel sizes, including 3, 5, 7, 9, and 11. The network’s performance increases from 79.9% (3×3) to 80.6% (7×7), while the network’s FLOPs stay roughly the same. Additionally, we observe that the benefit of larger kernel sizes reaches a saturation point at 7×7. We verified this behavior in the large capacity model too: a ResNet-200 regime model does not exhibit further gain when we increase the kernel size beyond 7×7.
We will use 7×7 depthwise conv in each block.
At this point, we have concluded our examination of network architectures on a macro scale. Intriguingly, a significant portion of the design choices taken in a vision Transformer may be mapped to ConvNet instantiations.
有了所有這些準(zhǔn)備工作,采用較大卷積核的好處是顯著的。我們?cè)囼?yàn)了幾種卷積核大小,包括3、5、7、9和11。網(wǎng)絡(luò)的性能從79.9%(3×3)提高到80.6%(7×7),而網(wǎng)絡(luò)的FLOPs大致保持不變。此外,我們觀察到,更大卷積核的好處在7×7時(shí)達(dá)到了飽和點(diǎn)。我們?cè)诖笕萘磕P椭幸沧C實(shí)了這種行為:當(dāng)我們將卷積核增加到7×7以上時(shí),ResNet-200體系的模型沒有表現(xiàn)出進(jìn)一步的收益。
我們將在每個(gè)塊中使用7×7的深度卷積(depthwise conv)。
至此,我們結(jié)束了對(duì)宏觀規(guī)模上的網(wǎng)絡(luò)結(jié)構(gòu)的研究。耐人尋味的是,在視覺Transformer中采取的相當(dāng)一部分設(shè)計(jì)選擇可以映射到ConvNet實(shí)例中。
2.6. Micro Design —— 微觀設(shè)計(jì)
In this section, we investigate several other architectural differences at a micro scale — most of the explorations here are done at the layer level, focusing on specific choices of activation functions and normalization layers.
在本節(jié)中,我們?cè)谖⒂^層面上研究了其他幾個(gè)架構(gòu)上的差異——這里的大部分探索都是在層級(jí)上完成的,重點(diǎn)是激活函數(shù)和歸一化層的具體選擇。
2.6.1 Replacing ReLU with GELU
One discrepancy between NLP and vision architectures is the specifics of which activation functions to use. Numerous activation functions have been developed over time, but the Rectified Linear Unit (ReLU) [49] is still extensively used in ConvNets due to its simplicity and efficiency. ReLU is also used as an activation function in the original Transformer paper [77]. The Gaussian Error Linear Unit, or GELU [32], which can be thought of as a smoother variant of ReLU, is utilized in the most advanced Transformers, including Google’s BERT [18] and OpenAI’s GPT-2 [52], and, most recently, ViTs. We find that ReLU can be substituted with GELU in our ConvNet too, although the accuracy stays unchanged (80.6%).
NLP和視覺架構(gòu)之間的一個(gè)差異是具體使用何種激活函數(shù)。隨著時(shí)間的推移,許多激活函數(shù)已經(jīng)被開發(fā)出來(lái),但整流線性單元(ReLU)由于其簡(jiǎn)單和高效,仍然被廣泛用于ConvNets。ReLU也被用作原始Transformer論文中的激活函數(shù)。高斯誤差線性單元,即GELU,可以被認(rèn)為是ReLU的平滑變體,在最先進(jìn)的Transformer中被使用,包括谷歌的BERT和OpenAI的GPT-2,以及最近的ViTs。我們發(fā)現(xiàn),在我們的ConvNet中,ReLU也可以用GELU代替,盡管準(zhǔn)確率保持不變(80.6%)。
2.6.2 Fewer activation functions —— 更少的激活函數(shù)
One minor distinction between a Transformer and a ResNet block is that Transformers have fewer activation functions. Consider a Transformer block with key/query/value linear embedding layers, the projection layer, and two linear layers in an MLP block. There is only one activation function present in the MLP block. In comparison, it is common practice to append an activation function to each convolutional layer, including the 1 × 1 convs. Here we examine how performance changes when we stick to the same strategy. As depicted in Figure 4, we eliminate all GELU layers from the residual block except for one between two 1 × 1 layers, replicating the style of a Transformer block. This process improves the result by 0.7% to 81.3%, practically matching the performance of Swin-T.
We will now use a single GELU activation in each block.
Transformer和ResNet塊之間的一個(gè)小區(qū)別是,Transformer的激活函數(shù)較少。考慮一個(gè)帶有鍵(Key)/查詢(Query)/值(Value)線性嵌入層(Embedding層)的Transformer塊,投影層(projection layer),以及MLP塊中的兩個(gè)線性層(linear layer)。在MLP塊中只有一個(gè)激活函數(shù)存在。相比之下,通常的做法是在每個(gè)卷積層(包括1×1卷積層)上附加一個(gè)激活函數(shù)。在這里,我們研究了當(dāng)我們堅(jiān)持使用相同的策略時(shí),性能如何變化,如圖4所示。
Figure 4. Block designs for a ResNet, a Swin Transformer, and a ConvNeXt. Swin Transformer’s block is more sophisticated due to the presence of multiple specialized modules and two residual connections. For simplicity, we note the linear layers in Transformer MLP blocks also as “1×1 convs” since they are equivalent.
圖4. 一個(gè)ResNet、一個(gè)Swin Transformer和一個(gè)ConvNeXt的模塊設(shè)計(jì)。由于存在多個(gè)專門的模塊和兩個(gè)殘差連接,Swin Transformer的模塊更加復(fù)雜。為了簡(jiǎn)單起見,我們把Transformer MLP塊中的線性層也記為 “1×1 convs”,因?yàn)樗鼈兪堑葍r(jià)的。
我們從殘差塊中去掉了所有的GELU層,只保留兩個(gè)1×1層之間的一個(gè),復(fù)制了Transformer塊的風(fēng)格。這個(gè)過(guò)程將結(jié)果提高了0.7%,達(dá)到81.3%,實(shí)際上與Swin-T的性能相匹配。
現(xiàn)在我們將在每個(gè)塊中使用單一的GELU激活。
2.6.3 Fewer normalization layers —— 更少的歸一化層
Transformer blocks usually have fewer normalization layers as well. Here we remove two BatchNorm (BN) layers, leaving only one BN layer before the conv 1 × 1 layers. This further boosts the performance to 81.4%, already surpassing Swin-T’s result. Note that we have even fewer normalization layers per block than Transformers, as empirically we find that adding one additional BN layer at the beginning of the block does not improve the performance.
Transformer塊通常也有較少的歸一化層。這里我們?nèi)サ袅藘蓚€(gè)BatchNorm(BN)層,在Conv 1×1層之前只留下一個(gè)BN層。這進(jìn)一步將性能提高到81.4%,已經(jīng)超過(guò)了Swin-T的結(jié)果。請(qǐng)注意,我們每個(gè)區(qū)塊的歸一化層數(shù)甚至比Transformers還要少,因?yàn)?font color="red">根據(jù)經(jīng)驗(yàn),我們發(fā)現(xiàn)在區(qū)塊的開始增加一個(gè)額外的BN層并不能提高性能。
2.6.4 Substituting BN with LN —— 使用LN替換BN
BatchNorm [38] is an essential component in ConvNets as it improves the convergence and reduces overfitting. However, BN also has many intricacies that can have a detrimental effect on the model’s performance [84]. There have been numerous attempts at developing alternative normalization [60, 75, 83] techniques, but BN has remained the preferred option in most vision tasks. On the other hand, the simpler Layer Normalization [5] (LN) has been used in Transformers, resulting in good performance across different application scenarios.
BatchNorm是ConvNets中的一個(gè)重要組成部分,因?yàn)?font color="red">它可以提高收斂性并減少過(guò)擬合。然而,BN也有許多錯(cuò)綜復(fù)雜的問(wèn)題,會(huì)對(duì)模型的性能產(chǎn)生不利的影響[84]。已經(jīng)有很多人嘗試開發(fā)替代的歸一化技術(shù)[60, 75, 83],但在大多數(shù)視覺任務(wù)中,BN仍然是首選。另一方面,更簡(jiǎn)單的層歸一化(LN)已被用于Transformer,在不同的應(yīng)用場(chǎng)景中產(chǎn)生了良好的性能。
Directly substituting LN for BN in the original ResNet will result in suboptimal performance [83]. With all the modifications in network architecture and training techniques, here we revisit the impact of using LN in place of BN. We observe that our ConvNet model does not have any difficulties training with LN; in fact, the performance is slightly better, obtaining an accuracy of 81.5%.
From now on, we will use one LayerNorm as our choice of normalization in each residual block.
在原ResNet中直接用LN代替BN會(huì)導(dǎo)致次優(yōu)的性能[83]。隨著網(wǎng)絡(luò)結(jié)構(gòu)和訓(xùn)練技術(shù)的所有修改,這里我們重新審視了使用LN來(lái)代替BN的影響。我們觀察到,我們的ConvNet模型在使用LN訓(xùn)練時(shí)沒有任何困難;事實(shí)上,性能略好,獲得了81.5%的準(zhǔn)確性。
從現(xiàn)在開始,我們將在每個(gè)殘差塊中使用一個(gè)LayerNorm作為歸一化的選擇。
2.6.5 Separate downsampling layers —— 獨(dú)立的下采樣層
In ResNet, the spatial downsampling is achieved by the residual block at the start of each stage, using 3×3 conv with stride 2 (and 1×1 conv with stride 2 at the shortcut connection). In Swin Transformers, a separate downsampling layer is added between stages. We explore a similar strategy in which we use 2×2 conv layers with stride 2 for spatial downsampling. This modification surprisingly leads to diverged training. Further investigation shows that, adding normalization layers wherever spatial resolution is changed can help stabilize training. These include several LN layers also used in Swin Transformers: one before each downsampling layer, one after the stem, and one after the final global average pooling. We can improve the accuracy to 82.0%, significantly exceeding Swin-T’s 81.3%.
在ResNet中,空間下采樣是由每個(gè)階段開始時(shí)的殘差塊實(shí)現(xiàn)的,使用步長(zhǎng)為2的3×3卷積(在捷徑連接處使用步長(zhǎng)為2的1×1卷積)。在Swin Transformers中,在各階段之間增加了一個(gè)單獨(dú)的下采樣層。我們探索了一種類似的策略,即使用步長(zhǎng)為2的2×2卷積層進(jìn)行空間下采樣。這種修改出人意料地導(dǎo)致了訓(xùn)練發(fā)散。進(jìn)一步的研究顯示,在空間分辨率改變的地方添加歸一化層,有助于穩(wěn)定訓(xùn)練。這包括幾個(gè)同樣用于Swin Transformers的LN層:一個(gè)在每個(gè)下采樣層之前,一個(gè)在stem之后,一個(gè)在最后的全局平均池化之后。我們可以將精度提高到82.0%,大大超過(guò)Swin-T的81.3%。
We will use separate downsampling layers. This brings us to our final model, which we have dubbed ConvNeXt. A comparison of ResNet, Swin, and ConvNeXt block structures can be found in Figure 4. A comparison of ResNet-50, Swin-T and ConvNeXt-T’s detailed architecture specifications can be found in Table 9.
我們將使用單獨(dú)的下采樣層。這給我們帶來(lái)了最終的模型,我們將其稱為ConvNeXt。圖4是ResNet、Swin和ConvNeXt塊結(jié)構(gòu)的比較。ResNet-50、Swin-T和ConvNeXt-T的詳細(xì)架構(gòu)規(guī)格的比較可以在表9中找到。
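Putting the preceding steps together, here is a simplified, self-contained PyTorch sketch of the resulting block and of the separate downsampling layer. This is an illustrative reimplementation based on the description above, not the official code; the official implementation differs in details (e.g. it uses Linear layers in a channels-last layout and includes Layer Scale and Stochastic Depth from the training recipe), which are omitted here:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dimension of (N, C, H, W) tensors (channels-first)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(1, keepdim=True)
        var = x.var(1, keepdim=True, unbiased=False)
        x = (x - mean) / torch.sqrt(var + self.eps)
        return x * self.weight[:, None, None] + self.bias[:, None, None]

class ConvNeXtBlock(nn.Module):
    """Simplified block: 7x7 depthwise conv -> LN -> 1x1 expand -> GELU -> 1x1 reduce, plus residual."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # spatial mixing
        self.norm = LayerNorm2d(dim)                                             # single LN per block
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)                    # channel mixing (expand)
        self.act = nn.GELU()                                                     # single activation per block
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)                    # channel mixing (reduce)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        return shortcut + x

# Separate downsampling layer between stages: LN followed by a 2x2, stride-2 conv.
def downsample_layer(in_dim, out_dim):
    return nn.Sequential(LayerNorm2d(in_dim), nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2))

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtBlock(96)(x).shape)          # torch.Size([1, 96, 56, 56])
print(downsample_layer(96, 192)(x).shape)  # torch.Size([1, 192, 28, 28])
```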
2.6.6 Closing remarks —— 閉幕詞(總結(jié))
We have finished our first “playthrough” and discovered ConvNeXt, a pure ConvNet, that can outperform the Swin Transformer for ImageNet-1K classification in this compute regime. It is worth noting that all design choices discussed so far are adapted from vision Transformers. In addition, these designs are not novel even in the ConvNet literature — they have all been researched separately, but not collectively, over the last decade. Our ConvNeXt model has approximately the same FLOPs, #params., throughput, and memory use as the Swin Transformer, but does not require specialized modules such as shifted window attention or relative position biases.
我們已經(jīng)完成了第一輪的 "探索",發(fā)現(xiàn)了ConvNeXt,一個(gè)純粹的ConvNet,在這一計(jì)算量級(jí)下的ImageNet-1K分類上可以超過(guò)Swin Transformer。值得注意的是,到目前為止討論的所有設(shè)計(jì)選擇都改編自視覺Transformer。此外,這些設(shè)計(jì)即使在ConvNet文獻(xiàn)中也并不新穎——它們?cè)谶^(guò)去十年中都被單獨(dú)研究過(guò),但沒有被放在一起研究過(guò)。我們的ConvNeXt模型的FLOPs、參數(shù)量、吞吐量和內(nèi)存使用量與Swin Transformer大致相同,但不需要專門的模塊,如移位窗口注意力(shifted window attention)或相對(duì)位置偏置(relative position biases)。
These findings are encouraging but not yet completely convincing — our exploration thus far has been limited to a small scale, but vision Transformers’ scaling behavior is what truly distinguishes them. Additionally, the question of whether a ConvNet can compete with Swin Transformers on downstream tasks such as object detection and semantic segmentation is a central concern for computer vision practitioners. In the next section, we will scale up our ConvNeXt models both in terms of data and model size, and evaluate them on a diverse set of visual recognition tasks.
這些發(fā)現(xiàn)令人鼓舞,但還不能完全令人信服——迄今為止,我們的探索僅限于小規(guī)模,但視覺Transformer的擴(kuò)展行為才是它們真正的區(qū)別所在。此外,ConvNet能否在下游任務(wù)(如物體檢測(cè)和語(yǔ)義分割)上與Swin Transformers競(jìng)爭(zhēng)的問(wèn)題是計(jì)算機(jī)視覺從業(yè)者的核心關(guān)注點(diǎn)。在下一節(jié)中,我們將在數(shù)據(jù)和模型大小方面擴(kuò)大我們的ConvNeXt模型,并在一組不同的視覺識(shí)別任務(wù)上對(duì)它們進(jìn)行評(píng)估。
3. Empirical Evaluations on ImageNet —— 在ImageNet上的經(jīng)驗(yàn)評(píng)估
We construct different ConvNeXt variants, ConvNeXt-T/S/B/L, to be of similar complexities to Swin-T/S/B/L [45]. ConvNeXt-T/B is the end product of the “modernizing” procedure on ResNet-50/200 regime, respectively. In addition, we build a larger ConvNeXt-XL to further test the scalability of ConvNeXt. The variants only differ in the number of channels C, and the number of blocks B in each stage. Following both ResNets and Swin Transformers, the number of channels doubles at each new stage. We summarize the configurations below:
我們構(gòu)建了不同的ConvNeXt變體,ConvNeXt-T/S/B/L,其復(fù)雜程度與Swin-T/S/B/L相似。ConvNeXt-T/B分別是在ResNet-50/200體系上進(jìn)行 "現(xiàn)代化" 程序的最終產(chǎn)品。此外,我們構(gòu)建了一個(gè)更大的ConvNeXt-XL來(lái)進(jìn)一步測(cè)試ConvNeXt的可擴(kuò)展性。這些變體只在通道數(shù)C和每個(gè)階段的塊數(shù)B上有所不同。按照ResNets和Swin Transformers的做法,通道的數(shù)量在每個(gè)新階段都會(huì)翻倍。我們把這些配置總結(jié)如下(列表之后附有一個(gè)簡(jiǎn)單的代碼梳理):
- ConvNeXt-T: C = (96, 192, 384, 768), B = (3, 3, 9, 3)
- ConvNeXt-S: C = (96, 192, 384, 768), B = (3, 3, 27, 3)
- ConvNeXt-B: C = (128, 256, 512, 1024), B = (3, 3, 27, 3)
- ConvNeXt-L: C = (192, 384, 768, 1536), B = (3, 3, 27, 3)
- ConvNeXt-XL: C = (256, 512, 1024, 2048), B = (3, 3, 27, 3)
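Restated as a small Python dictionary (copied directly from the list above), which can be handy when instantiating the block sketch from Section 2:

```python
# Per-variant stage widths C and block counts B, copied from the list above.
convnext_configs = {
    "ConvNeXt-T":  dict(dims=(96, 192, 384, 768),    depths=(3, 3, 9, 3)),
    "ConvNeXt-S":  dict(dims=(96, 192, 384, 768),    depths=(3, 3, 27, 3)),
    "ConvNeXt-B":  dict(dims=(128, 256, 512, 1024),  depths=(3, 3, 27, 3)),
    "ConvNeXt-L":  dict(dims=(192, 384, 768, 1536),  depths=(3, 3, 27, 3)),
    "ConvNeXt-XL": dict(dims=(256, 512, 1024, 2048), depths=(3, 3, 27, 3)),
}
```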
3.1. Settings
The ImageNet-1K dataset consists of 1000 object classes with 1.2M training images. We report ImageNet-1K top-1 accuracy on the validation set. We also conduct pre-training on ImageNet-22K, a larger dataset of 21841 classes (a superset of the 1000 ImageNet-1K classes) with ~14M images for pre-training, and then fine-tune the pre-trained model on ImageNet-1K for evaluation. We summarize our training setups below. More details can be found in Appendix A.
ImageNet-1K數(shù)據(jù)集由1000個(gè)物體類別和120萬(wàn)張訓(xùn)練圖像組成。我們報(bào)告ImageNet-1K驗(yàn)證集上的top-1準(zhǔn)確率。我們還在ImageNet-22K上進(jìn)行了預(yù)訓(xùn)練,這是一個(gè)由21841個(gè)類組成的更大的數(shù)據(jù)集(1000個(gè)ImageNet-1K類的超集),約有1400萬(wàn)張圖像用于預(yù)訓(xùn)練,然后在ImageNet-1K上對(duì)預(yù)訓(xùn)練模型進(jìn)行微調(diào)以進(jìn)行評(píng)估。我們?cè)谙旅婵偨Y(jié)了我們的訓(xùn)練設(shè)置。更多的細(xì)節(jié)可以在附錄A中找到。
Training on ImageNet-1K
We train ConvNeXts for 300 epochs using AdamW [46] with a learning rate of 4e-3. There is a 20-epoch linear warmup and a cosine decaying schedule afterward. We use a batch size of 4096 and a weight decay of 0.05. For data augmentations, we adopt common schemes including Mixup [90], Cutmix [89], RandAugment [14], and Random Erasing [91]. We regularize the networks with Stochastic Depth [37] and Label Smoothing [69]. Layer Scale [74] of initial value 1e-6 is applied. We use Exponential Moving Average (EMA) [51] as we find it alleviates larger models’ overfitting.
我們使用AdamW對(duì)ConvNeXts進(jìn)行了300個(gè)epochs的訓(xùn)練,學(xué)習(xí)率為 $4\times 10^{-3}$。先是20個(gè)epoch的線性預(yù)熱,之后按余弦衰減計(jì)劃調(diào)整學(xué)習(xí)率。我們使用4096的批次大小和0.05的權(quán)重衰減。對(duì)于數(shù)據(jù)增強(qiáng),我們采用常見的方案,包括Mixup、Cutmix、RandAugment和Random Erasing。我們用Stochastic Depth和Label Smoothing對(duì)網(wǎng)絡(luò)進(jìn)行正則化。采用了初始值為 $1\times 10^{-6}$ 的Layer Scale。我們使用指數(shù)移動(dòng)平均(EMA),因?yàn)槲覀儼l(fā)現(xiàn)它可以減輕較大模型的過(guò)擬合。
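A minimal sketch of the optimizer and learning-rate schedule described above (AdamW, lr $4\times 10^{-3}$, weight decay 0.05, 20-epoch linear warmup followed by cosine decay over 300 epochs). The placeholder `model` and the per-epoch scheduler granularity are assumptions for illustration; data augmentation (Mixup/CutMix/RandAugment/Random Erasing), Stochastic Depth, Label Smoothing, Layer Scale and EMA are not shown:

```python
import math
import torch

model = torch.nn.Conv2d(3, 96, kernel_size=4, stride=4)  # placeholder module for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 20, 300

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs                 # linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```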
Pre-training on ImageNet-22K
We pre-train ConvNeXts on ImageNet-22K for 90 epochs with a warmup of 5 epochs. We do not use EMA. Other settings follow ImageNet-1K.
我們?cè)贗mageNet-22K上對(duì)ConvNeXts進(jìn)行了90個(gè)epochs的預(yù)訓(xùn)練,并進(jìn)行了5個(gè)epochs的預(yù)熱。我們不使用EMA。其他設(shè)置遵循ImageNet-1K。
Fine-tuning on ImageNet-1K
We fine-tune ImageNet-22K pre-trained models on ImageNet-1K for 30 epochs. We use AdamW, a learning rate of 5e-5, cosine learning rate schedule, layer-wise learning rate decay [6, 12], no warmup, a batch size of 512, and weight decay of 1e-8. The default pre-training, fine-tuning, and testing resolution is $224^2$. Additionally, we fine-tune at a larger resolution of $384^2$, for both ImageNet-22K and ImageNet-1K pre-trained models.
我們?cè)贗mageNet-1K上對(duì)ImageNet-22K的預(yù)訓(xùn)練模型進(jìn)行了30個(gè)epochs的微調(diào)。我們使用AdamW,學(xué)習(xí)率為 $5\times 10^{-5}$,余弦學(xué)習(xí)率計(jì)劃,逐層學(xué)習(xí)率衰減[6, 12],無(wú)預(yù)熱,批次大小為512,權(quán)重衰減為 $1\times 10^{-8}$。默認(rèn)的預(yù)訓(xùn)練、微調(diào)和測(cè)試分辨率為 $224^2$。此外,對(duì)于ImageNet-22K和ImageNet-1K的預(yù)訓(xùn)練模型,我們還在更大的 $384^2$ 分辨率下進(jìn)行微調(diào)。
Compared with ViTs/Swin Transformers, ConvNeXts are simpler to fine-tune at different resolutions, as the network is fully-convolutional and there is no need to adjust the input patch size or interpolate absolute/relative position biases.
與ViTs/Swin Transformers相比,ConvNeXts在不同分辨率下的微調(diào)更簡(jiǎn)單,因?yàn)榫W(wǎng)絡(luò)是完全卷積的,不需要調(diào)整輸入補(bǔ)丁大小或插值絕對(duì)/相對(duì)位置偏差。
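A toy, self-contained check of this point (the tiny model below is only a stand-in, not ConvNeXt itself): because every layer is convolutional and the classifier sits on global average pooling, the same weights accept 224×224 and 384×384 inputs without changing any patch size or interpolating position biases:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=4, stride=4),               # patchify stem
    nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96),   # a depthwise-block stand-in
    nn.AdaptiveAvgPool2d(1),                                  # global average pooling
    nn.Flatten(),
    nn.Linear(96, 1000),                                      # classification head
)

for size in (224, 384):
    print(model(torch.randn(1, 3, size, size)).shape)  # torch.Size([1, 1000]) in both cases
```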
3.2. Results
ImageNet-1K
Table 1 (upper) shows the result comparison with two recent Transformer variants, DeiT [73] and Swin Transformers [45], as well as ConvNets from architecture search - RegNets [54], EfficientNets [71] and EfficientNetsV2 [72]. ConvNeXt competes favorably with two strong ConvNet baselines (RegNet [54] and EfficientNet [71]) in terms of the accuracy-computation trade-off, as well as the inference throughputs. ConvNeXt also outperforms Swin Transformer of similar complexities across the board, sometimes with a substantial margin (e.g. 0.8% for ConvNeXt-T). Without specialized modules such as shifted windows or relative position bias, ConvNeXts also enjoy improved throughput compared to Swin Transformers.
表1(上)顯示了與最近的兩個(gè)Transformer變體(DeiT和Swin Transformers),以及來(lái)自架構(gòu)搜索的ConvNets(RegNets、EfficientNets和EfficientNetsV2)的結(jié)果比較。
Table 1. Classification accuracy on ImageNet-1K. Similar to Transformers, ConvNeXt also shows promising scaling behavior with higher-capacity models and a larger (pre-training) dataset. Inference throughput is measured on a V100 GPU, following [45]. On an A100 GPU, ConvNeXt can have a much higher throughput than Swin Transformer. See Appendix E. (†) ViT results with 90-epoch AugReg [67] training, provided through personal communication with the authors.
表1. ImageNet-1K的分類精度。與Transformers類似,ConvNeXt也顯示了在更高容量的模型和更大的(預(yù)訓(xùn)練)數(shù)據(jù)集下有希望的擴(kuò)展行為。推理吞吐量是在V100 GPU上測(cè)量的,遵循[Swin-Transformer]。在A100 GPU上,ConvNeXt的吞吐量可以比Swin Transformer高得多。見附錄E。(†) ViT在90個(gè)epoch的AugReg[67]訓(xùn)練下的結(jié)果,通過(guò)與作者的個(gè)人交流提供。
ConvNeXt與兩個(gè)強(qiáng)大的ConvNet基線(RegNet和EfficientNet)在準(zhǔn)確性-計(jì)算權(quán)衡以及推理吞吐量方面進(jìn)行了良好的競(jìng)爭(zhēng)。ConvNeXt也全面超越了復(fù)雜程度相似的Swin Transformer,有時(shí)還有很大的差距(例如ConvNeXt-T的0.8%)。如果沒有專門的模塊,如移位窗口(shifted windows)或相對(duì)位置偏差(relative position bias),ConvNeXt也享有比Swin Transformer更好的吞吐量(throughput )。
A highlight from the results is ConvNeXt-B at $384^2$: it outperforms Swin-B by 0.6% (85.1% vs. 84.5%), but with 12.5% higher inference throughput (95.7 vs. 85.1 image/s). We note that the FLOPs/throughput advantage of ConvNeXt-B over Swin-B becomes larger when the resolution increases from $224^2$ to $384^2$. Additionally, we observe an improved result of 85.5% when further scaling to ConvNeXt-L.
結(jié)果中的一個(gè)亮點(diǎn)是 $384^2$ 分辨率下的ConvNeXt-B:它比Swin-B高出0.6%(85.1%對(duì)84.5%),但推理吞吐量高出12.5%(95.7對(duì)85.1圖像/秒)。我們注意到,當(dāng)分辨率從 $224^2$ 增加到 $384^2$ 時(shí),ConvNeXt-B相對(duì)于Swin-B的FLOPs/吞吐量?jī)?yōu)勢(shì)變得更大。此外,當(dāng)進(jìn)一步擴(kuò)展到ConvNeXt-L時(shí),我們觀察到85.5%的改進(jìn)結(jié)果。
ImageNet-22K. We present results with models fine-tuned from ImageNet-22K pre-training at Table 1 (lower). These experiments are important since a widely held view is that vision Transformers have fewer inductive biases thus can perform better than ConvNets when pre-trained on a larger scale. Our results demonstrate that properly designed ConvNets are not inferior to vision Transformers when pre-trained with large dataset — ConvNeXts still perform on par or better than similarly-sized Swin Transformers, with slightly higher throughput. Additionally, our ConvNeXt-XL model achieves an accuracy of 87.8% — a decent improvement over ConvNeXt-L at $384^2$, demonstrating that ConvNeXts are scalable architectures.
ImageNet-22K
我們?cè)诒?(下)展示了從ImageNet-22K預(yù)訓(xùn)練中微調(diào)的模型結(jié)果。這些實(shí)驗(yàn)是很重要的,因?yàn)橛幸环N廣泛的觀點(diǎn)認(rèn)為,視覺Transformer的歸納偏置較少,因此在進(jìn)行大規(guī)模的預(yù)訓(xùn)練時(shí)可以比ConvNets表現(xiàn)更好。我們的結(jié)果表明,當(dāng)用大型數(shù)據(jù)集進(jìn)行預(yù)訓(xùn)練時(shí),適當(dāng)設(shè)計(jì)的ConvNets并不遜于視覺Transformer——ConvNeXts的性能仍然與類似規(guī)模的Swin Transformers相當(dāng)或更好,而且吞吐量略高。此外,我們的ConvNeXt-XL模型達(dá)到了87.8%的準(zhǔn)確率——比 $384^2$ 分辨率下的ConvNeXt-L有不錯(cuò)的提升,這表明ConvNeXts是可擴(kuò)展的架構(gòu)。
On ImageNet-1K, EfficientNetV2-L, a searched architecture equipped with advanced modules (such as Squeeze-and-Excitation [35]) and progressive training procedure achieves top performance. However, with ImageNet-22K pre-training, ConvNeXt is able to outperform EfficientNetV2, further demonstrating the importance of large-scale training.
In Appendix B, we discuss robustness and out-of-domain generalization results for ConvNeXt.
在ImageNet-1K上,EfficientNetV2-L,一個(gè)配備了高級(jí)模塊(如Squeeze-and-Excitation[35])和漸進(jìn)式訓(xùn)練程序的搜索架構(gòu)取得了頂級(jí)性能。然而,在ImageNet-22K的預(yù)訓(xùn)練下,ConvNeXt能夠超越EfficientNetV2,進(jìn)一步證明了大規(guī)模訓(xùn)練的重要性。
在附錄B中,我們討論了ConvNeXt的魯棒性(robustness)和域外泛化結(jié)果(out-of-domain generalization results)。
3.3. Isotropic ConvNeXt vs. ViT —— 各向同性研究
In this ablation, we examine if our ConvNeXt block design is generalizable to ViT-style [20] isotropic architectures which have no downsampling layers and keep the same feature resolutions (e.g. 14×14) at all depths. We construct isotropic ConvNeXt-S/B/L using the same feature dimensions as ViT-S/B/L (384/768/1024). Depths are set at 18/18/36 to match the number of parameters and FLOPs. The block structure remains the same (Fig. 4). We use the supervised training results from DeiT [73] for ViT-S/B and MAE [26] for ViT-L, as they employ improved training procedures over the original ViTs [20]. ConvNeXt models are trained with the same settings as before, but with longer warmup epochs. Results for ImageNet-1K at $224^2$ resolution are in Table 2. We observe ConvNeXt can perform generally on par with ViT, showing that our ConvNeXt block design is competitive when used in non-hierarchical models.
在這個(gè)消融實(shí)驗(yàn)中,我們研究了我們的ConvNeXt塊設(shè)計(jì)是否可以推廣到ViT式(ViT-style)的各向同性(isotropic)架構(gòu),這種架構(gòu)沒有下采樣層,在所有深度都保持相同的特征分辨率(如14×14)。我們使用與ViT-S/B/L相同的特征維度(384/768/1024)構(gòu)建各向同性的ConvNeXt-S/B/L。深度設(shè)置為18/18/36,以匹配參數(shù)量和FLOPs。塊結(jié)構(gòu)保持不變(圖4)。
我們對(duì)ViT-S/B使用DeiT的監(jiān)督訓(xùn)練結(jié)果,對(duì)ViT-L使用MAE[26]的監(jiān)督訓(xùn)練結(jié)果,因?yàn)樗鼈儾捎昧吮仍糣iTs[20]更好的訓(xùn)練程序。ConvNeXt模型的訓(xùn)練設(shè)置與之前相同,但有更長(zhǎng)的預(yù)熱周期。表2列出了 $224^2$ 分辨率下ImageNet-1K的結(jié)果。我們觀察到ConvNeXt的表現(xiàn)基本與ViT持平,這表明我們的ConvNeXt塊設(shè)計(jì)在用于非層次化模型(non-hierarchical models)時(shí)具有競(jìng)爭(zhēng)力。
Table 2. Comparing isotropic ConvNeXt and ViT. Training memory is measured on V100 GPUs with 32 per-GPU batch size.
表2. 比較各向同性的ConvNeXt和ViT。訓(xùn)練內(nèi)存是在V100 GPU上測(cè)量的,每個(gè)GPU的批量大小為32。
4. Empirical Evaluation on Downstream Tasks —— 下游任務(wù)的實(shí)證評(píng)估
Object detection and segmentation on COCO
We finetune Mask R-CNN [27] and Cascade Mask R-CNN [9] on the COCO dataset with ConvNeXt backbones. Following Swin Transformer [45], we use multi-scale training, AdamW optimizer, and a 3× schedule. Further details and hyperparameter settings can be found in Appendix A.3.
我們?cè)贑OCO數(shù)據(jù)集上用ConvNeXt骨干網(wǎng)絡(luò)對(duì)Mask R-CNN和Cascade Mask R-CNN進(jìn)行微調(diào)。遵循Swin Transformer的做法,我們使用了多尺度訓(xùn)練、AdamW優(yōu)化器和3×訓(xùn)練計(jì)劃(3× schedule)。進(jìn)一步的細(xì)節(jié)和超參數(shù)設(shè)置可以在附錄A.3中找到。
Table 3 shows object detection and instance segmentation results comparing Swin Transformer, ConvNeXt, and traditional ConvNet such as ResNeXt. Across different model complexities, ConvNeXt achieves on-par or better performance than Swin Transformer. When scaled up to bigger models (ConvNeXt-B/L/XL) pre-trained on ImageNet-22K, in many cases ConvNeXt is significantly better (e.g. +1.0 AP) than Swin Transformers in terms of box and mask AP.
表3顯示了Swin Transformer、ConvNeXt和ResNeXt等傳統(tǒng)ConvNet的物體檢測(cè)和實(shí)例分割結(jié)果的比較。
Table 3. COCO object detection and segmentation results using Mask-RCNN and Cascade Mask-RCNN. † indicates that the model is pre-trained on ImageNet-22K. ImageNet-1K pre-trained Swin results are from their Github repository [3]. AP numbers of the ResNet-50 and X101 models are from [45]. We measure FPS on an A100 GPU. FLOPs are calculated with image size (1280, 800).
表3. 使用Mask-RCNN和Cascade Mask-RCNN進(jìn)行COCO物體檢測(cè)和分割的結(jié)果。†表示該模型是在ImageNet-22K上預(yù)訓(xùn)練的。ImageNet-1K的預(yù)訓(xùn)練Swin結(jié)果來(lái)自其Github資源庫(kù)[3]。ResNet-50和X101模型的AP數(shù)值來(lái)自[45]。我們?cè)贏100 GPU上測(cè)量FPS。FLOPs是以圖像尺寸(1280,800)計(jì)算的。
在不同的模型復(fù)雜性中,ConvNeXt取得了與Swin Transformer相當(dāng)或更好的性能。當(dāng)擴(kuò)大到在ImageNet-22K上預(yù)訓(xùn)練的更大的模型(ConvNeXt-B/L/XL)時(shí),在許多情況下,ConvNeXt在box 和mask AP方面明顯優(yōu)于Swin Transformer(例如+1.0AP)。
Semantic segmentation on ADE20K
We also evaluate ConvNeXt backbones on the ADE20K semantic segmentation task with UperNet [85]. All model variants are trained for 160K iterations with a batch size of 16. Other experimental settings follow [6] (see Appendix A.3 for more details). In Table 4, we report validation mIoU with multi-scale testing. ConvNeXt models can achieve competitive performance across different model capacities, further validating the effectiveness of our architecture design.
我們還使用UperNet[85]在ADE20K語(yǔ)義分割任務(wù)上評(píng)估了ConvNeXt骨干網(wǎng)絡(luò)。所有的模型變體都訓(xùn)練了16萬(wàn)次迭代,批次大小為16。其他實(shí)驗(yàn)設(shè)置遵循[6](更多細(xì)節(jié)見附錄A.3)。在表4中,我們報(bào)告了多尺度測(cè)試的驗(yàn)證集mIoU。
Table 4. ADE20K validation results using UperNet [85]. † indicates IN-22K pre-training. Swins’ results are from its GitHub repository [2]. Following Swin, we report mIoU results with multiscale testing. FLOPs are based on input sizes of (2048, 512) and (2560, 640) for IN-1K and IN-22K pre-trained models, respectively.
表4. 使用UperNet[85]的ADE20K驗(yàn)證結(jié)果。†表示IN-22K預(yù)訓(xùn)練。Swin的結(jié)果來(lái)自其GitHub倉(cāng)庫(kù)[2]。遵循Swin的做法,我們報(bào)告了多尺度測(cè)試的mIoU結(jié)果。對(duì)于IN-1K和IN-22K預(yù)訓(xùn)練模型,F(xiàn)LOPs分別基于(2048,512)和(2560,640)的輸入尺寸。
ConvNeXt模型可以在不同的模型容量下取得有競(jìng)爭(zhēng)力的性能,進(jìn)一步驗(yàn)證了我們架構(gòu)設(shè)計(jì)的有效性。
Remarks on model efficiency
Under similar FLOPs, models with depthwise convolutions are known to be slower and consume more memory than ConvNets with only dense convolutions. It is natural to ask whether the design of ConvNeXt will render it practically inefficient. As demonstrated throughout the paper, the inference throughputs of ConvNeXts are comparable to or exceed that of Swin Transformers. This is true for both classification and other tasks requiring higher-resolution inputs (see Table 1,3 for comparisons of throughput/FPS). Furthermore, we notice that training ConvNeXts requires less memory than training Swin Transformers. For example, training Cascade Mask-RCNN using ConvNeXt-B backbone consumes 17.4GB of peak memory with a per-GPU batch size of 2, while the reference number for Swin-B is 18.5GB. In comparison to vanilla ViT, both ConvNeXt and Swin Transformer exhibit a more favorable accuracy-FLOPs trade-off due to the local computations. It is worth noting that this improved efficiency is a result of the ConvNet inductive bias, and is not directly related to the self-attention mechanism in vision Transformers.
在類似的FLOPs下,已知具有深度卷積的模型比只有密集卷積的ConvNets更慢,消耗更多的內(nèi)存。我們很自然地會(huì)問(wèn),ConvNeXt的設(shè)計(jì)是否會(huì)使其在實(shí)際中效率低下。正如本文所展示的那樣,ConvNeXt的推理吞吐量與Swin Transformers相當(dāng),甚至超過(guò)了Swin Transformers。這對(duì)于分類和其他需要高分辨率輸入的任務(wù)來(lái)說(shuō)都是如此(吞吐量/FPS的比較見表1、3)。此外,我們注意到,訓(xùn)練ConvNeXts需要的內(nèi)存比訓(xùn)練Swin Transformers少。例如,使用ConvNeXt-B骨干訓(xùn)練Cascade Mask-RCNN,在每個(gè)GPU批次大小為2的情況下,消耗17.4GB的峰值內(nèi)存,而Swin-B的參考數(shù)字是18.5GB。與vanilla ViT相比,由于局部計(jì)算,ConvNeXt和Swin Transformer都表現(xiàn)出更有利的精度-FLOPs權(quán)衡。值得注意的是,這種效率的提高是ConvNet歸納偏置的結(jié)果,而與視覺Transformer中的自注意力機(jī)制沒有直接關(guān)系。
5. Related Work
5.1 Hybrid models
In both the pre- and post-ViT eras, the hybrid model combining convolutions and self-attentions has been actively studied. Prior to ViT, the focus was on augmenting a ConvNet with self-attention/non-local modules [8, 55, 66, 79] to capture long-range dependencies. The original ViT [20] first studied a hybrid configuration, and a large body of follow-up works focused on reintroducing convolutional priors to ViT, either in an explicit [15, 16, 21, 82, 86, 88] or implicit [45] fashion.
在ViT出現(xiàn)之前和之后,結(jié)合卷積與自注意力的混合模型一直被積極研究。在ViT之前,重點(diǎn)是用自注意力/非局部(non-local)模塊增強(qiáng)ConvNet[8, 55, 66, 79],以捕捉長(zhǎng)距離依賴。最初的ViT[20]首次研究了一種混合配置,大量后續(xù)工作則專注于以顯式[15, 16, 21, 82, 86, 88]或隱式[45]的方式將卷積先驗(yàn)重新引入ViT。
5.2 Recent convolution-based approaches
Han et al. [25] show that local Transformer attention is equivalent to inhomogeneous dynamic depthwise conv. The MSA block in Swin is then replaced with a dynamic or regular depthwise convolution, achieving comparable performance to Swin. A concurrent work ConvMixer [4] demonstrates that, in small-scale settings, depthwise convolution can be used as a promising mixing strategy. ConvMixer uses a smaller patch size to achieve the best results, making the throughput much lower than other baselines. GFNet [56] adopts Fast Fourier Transform (FFT) for token mixing. FFT is also a form of convolution, but with a global kernel size and circular padding. Unlike many recent Transformer or ConvNet designs, one primary goal of our study is to provide an in-depth look at the process of modernizing a standard ResNet and achieving state-of-the-art performance.
Han等人[25]表明,局部Transformer注意力等同于非均勻的動(dòng)態(tài)深度卷積;隨后,他們用動(dòng)態(tài)或常規(guī)深度卷積替換Swin中的MSA塊,取得了與Swin相當(dāng)?shù)男阅堋M谶M(jìn)行的工作ConvMixer[4]表明,在小規(guī)模設(shè)置下,深度卷積可以作為一種有前途的混合(mixing)策略。ConvMixer使用較小的patch尺寸來(lái)達(dá)到最佳效果,這使其吞吐量遠(yuǎn)低于其他基線。GFNet[56]采用快速傅里葉變換(FFT)進(jìn)行token混合。FFT也是一種卷積形式,但具有全局的核大小和循環(huán)填充(circular padding)。與許多最近的Transformer或ConvNet設(shè)計(jì)不同,我們研究的一個(gè)主要目標(biāo)是深入考察將標(biāo)準(zhǔn)ResNet現(xiàn)代化并達(dá)到最先進(jìn)性能的過(guò)程。
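For intuition on the GFNet-style mixing mentioned above, the sketch below applies a learnable filter in the 2D Fourier domain; by the convolution theorem this is a circular convolution with a kernel as large as the feature map. Shapes and initialization are illustrative assumptions, not GFNet's released implementation.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """GFNet-style token mixing: multiply the 2D FFT of the feature map by a
    learnable complex filter, i.e. a global-kernel convolution with circular padding."""
    def __init__(self, dim, h, w):
        super().__init__()
        # rfft2 keeps w // 2 + 1 frequency columns; store (real, imag) pairs.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):  # x: (B, H, W, C), channels-last token grid
        freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=x.shape[1:3], dim=(1, 2), norm="ortho")

tokens = torch.randn(2, 14, 14, 384)            # (batch, H, W, channels)
print(GlobalFilter(384, 14, 14)(tokens).shape)  # torch.Size([2, 14, 14, 384])
```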
6. Conclusions
In the 2020s, vision Transformers, particularly hierarchical ones such as Swin Transformers, began to overtake ConvNets as the favored choice for generic vision backbones. The widely held belief is that vision Transformers are more accurate, efficient, and scalable than ConvNets. We propose ConvNeXts, a pure ConvNet model that can compete favorably with state-of-the-art hierarchical vision Transformers across multiple computer vision benchmarks, while retaining the simplicity and efficiency of standard ConvNets. In some ways, our observations are surprising while our ConvNeXt model itself is not completely new — many design choices have all been examined separately over the last decade, but not collectively. We hope that the new results reported in this study will challenge several widely held views and prompt people to rethink the importance of convolution in computer vision.
在2020年代,視覺Transformer,特別是層次化的Transformer,如Swin Transformers,開始超越ConvNets,成為通用視覺骨干的首選。人們普遍認(rèn)為,視覺Transformer比ConvNets更準(zhǔn)確、更高效、更可擴(kuò)展。我們提出了ConvNeXts,一個(gè)純ConvNet模型,它可以在多個(gè)計(jì)算機(jī)視覺基準(zhǔn)中與最先進(jìn)的分層視覺Transformer競(jìng)爭(zhēng),同時(shí)保留了標(biāo)準(zhǔn)ConvNets的簡(jiǎn)單性和效率。在某些方面,我們的觀察結(jié)果令人驚訝,而我們的ConvNeXt模型本身并不是全新的——許多設(shè)計(jì)選擇都在過(guò)去十年中被單獨(dú)研究過(guò),但沒有集體研究過(guò)。我們希望本研究報(bào)告的新結(jié)果將挑戰(zhàn)幾個(gè)廣泛持有的觀點(diǎn),并促使人們重新思考計(jì)算機(jī)視覺中卷積的重要性。
Acknowledgments
We thank Kaiming He, Eric Mintun, Xingyi Zhou, Ross Girshick, and Yann LeCun for valuable discussions and feedback.
我們感謝Kaiming He、Eric Mintun、Xingyi Zhou、Ross Girshick和Yann LeCun的寶貴討論和反饋。
Appendix —— 附錄
In this Appendix, we provide further experimental details (§A), robustness evaluation results (§B), more modernization experiment results (§C), and a detailed network specification (§D). We further benchmark model throughput on A100 GPUs (§E). Finally, we discuss the limitations (§F) and societal impact (§G) of our work.
在這個(gè)附錄中,我們提供了進(jìn)一步的實(shí)驗(yàn)細(xì)節(jié)(§A),魯棒性評(píng)估結(jié)果(§B),更多的現(xiàn)代化實(shí)驗(yàn)結(jié)果(§C),以及詳細(xì)的網(wǎng)絡(luò)規(guī)范(§D)。我們進(jìn)一步對(duì)A100 GPU上的模型吞吐量進(jìn)行了基準(zhǔn)測(cè)試(§E)。最后,我們討論了我們工作的局限性(§F)和社會(huì)影響(§G)。
A. Experimental Settings
A.1. ImageNet (Pre-)training
We provide ConvNeXts’ ImageNet-1K training and ImageNet-22K pre-training settings in Table 5. The settings are used for our main results in Table 1 (Section 3.2). All ConvNeXt variants use the same setting, except the stochastic depth rate is customized for model variants.
我們?cè)诒?中提供了ConvNeXts的ImageNet-1K訓(xùn)練和ImageNet-22K預(yù)訓(xùn)練設(shè)置。這些設(shè)置用于我們?cè)诒?(第3.2節(jié))的主要結(jié)果。所有的ConvNeXt變體都使用相同的設(shè)置,只是隨機(jī)深度率是為模型變體定制的。
Table 5. ImageNet-1K/22K (pre-)training settings. Multiple stochastic depth rates (e.g., 0.1/0.4/0.5/0.5) are for each model (e.g., ConvNeXt-T/S/B/L) respectively.
表5. ImageNet-1K/22K(預(yù))訓(xùn)練設(shè)置。多個(gè)隨機(jī)深度率(如0.1/0.4/0.5/0.5)分別為每個(gè)模型(如ConvNeXt-T/S/B/L)。
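The per-variant stochastic depth rate in Table 5 is a maximum drop-path rate; a common way to apply it (assumed here rather than quoted from the released training code) is to increase the per-block rate linearly from 0 to that maximum across all blocks:

```python
import torch

def drop_path_schedule(depths, max_rate):
    """Linearly increasing per-block drop-path (stochastic depth) rates, split by stage."""
    rates = torch.linspace(0, max_rate, steps=sum(depths)).tolist()
    schedule, i = [], 0
    for d in depths:
        schedule.append(rates[i:i + d])
        i += d
    return schedule

# ConvNeXt-T uses stage depths (3, 3, 9, 3); its maximum rate in Table 5 is 0.1.
for stage, r in enumerate(drop_path_schedule((3, 3, 9, 3), 0.1)):
    print(f"stage {stage}: {[round(v, 3) for v in r]}")
```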
Table 1. Classification accuracy on ImageNet-1K. Similar to Transformers, ConvNeXt also shows promising scaling behavior with higher-capacity models and a larger (pre-training) dataset. Inference throughput is measured on a V100 GPU, following [45]. On an A100 GPU, ConvNeXt can have a much higher throughput than Swin Transformer. See Appendix E. (?)ViT results with 90-epoch AugReg [67] training, provided through personal communication with the authors.
表1. ImageNet-1K上的分類精度。與Transformer類似,ConvNeXt在更高容量的模型和更大的(預(yù)訓(xùn)練)數(shù)據(jù)集下也顯示出有希望的擴(kuò)展行為。推理吞吐量在V100 GPU上測(cè)量,遵循[45]。在A100 GPU上,ConvNeXt的吞吐量可以比Swin Transformer高得多,見附錄E。(?)ViT在90個(gè)epoch的AugReg[67]訓(xùn)練下的結(jié)果,由作者通過(guò)個(gè)人交流提供。
For experiments in “modernizing a ConvNet” (Section 2), we also use Table 5’s setting for ImageNet-1K, except EMA is disabled, as we find using EMA severely hurts models with BatchNorm layers. For isotropic ConvNeXts (Section 3.3), the setting for ImageNet-1K in Table 5 is also adopted, but warmup is extended to 50 epochs, and layer scale is disabled for isotropic ConvNeXt-S/B. The stochastic depth rates are 0.1/0.2/0.5 for isotropic ConvNeXt-S/B/L.
在“ConvNet現(xiàn)代化”的實(shí)驗(yàn)中(第2節(jié)),我們同樣使用表5中ImageNet-1K的設(shè)置,只是禁用了EMA,因?yàn)槲覀儼l(fā)現(xiàn)使用EMA會(huì)嚴(yán)重?fù)p害帶有BatchNorm層的模型。對(duì)于各向同性的ConvNeXt(第3.3節(jié)),同樣采用表5中ImageNet-1K的設(shè)置,但預(yù)熱(warmup)延長(zhǎng)到50個(gè)epoch,并且各向同性的ConvNeXt-S/B禁用了layer scale。各向同性的ConvNeXt-S/B/L的隨機(jī)深度率為0.1/0.2/0.5。
A.2. ImageNet Fine-tuning
We list the settings for fine-tuning on ImageNet-1K in Table 6. The fine-tuning starts from the final model weights obtained in pre-training, without using the EMA weights, even if in pre-training EMA is used and EMA accuracy is reported. This is because we do not observe improvement if we fine-tune with the EMA weights (consistent with observations in [73]). The only exception is ConvNeXt-L pre-trained on ImageNet-1K, where the model accuracy is significantly lower than the EMA accuracy due to overfitting, and we select its best EMA model during pre-training as the starting point for fine-tuning.
我們?cè)诒?中列出了ImageNet-1K的微調(diào)設(shè)置。微調(diào)是從預(yù)訓(xùn)練中得到的最終模型權(quán)重開始的,沒有使用EMA權(quán)重,即使在預(yù)訓(xùn)練中使用了EMA,并且報(bào)告了EMA精度。這是因?yàn)槿绻褂肊MA權(quán)重進(jìn)行微調(diào),我們并沒有觀察到改進(jìn)(與[73]中的觀察一致)。唯一的例外是在ImageNet-1K上預(yù)訓(xùn)練的ConvNeXt-L,由于過(guò)擬合,其模型精度明顯低于EMA精度,我們?cè)陬A(yù)訓(xùn)練中選擇其最佳EMA模型作為微調(diào)的起點(diǎn)。
In fine-tuning, we use layer-wise learning rate decay [6, 12] with every 3 consecutive blocks forming a group. When the model is fine-tuned at 384² resolution, we use a crop ratio of 1.0 (i.e., no cropping) during testing following [2, 74, 80], instead of 0.875 at 224².
在微調(diào)中,我們使用逐層學(xué)習(xí)率衰減[6, 12],每3個(gè)連續(xù)的塊構(gòu)成一組。當(dāng)模型在384²分辨率下微調(diào)時(shí),我們?cè)跍y(cè)試中遵循[2, 74, 80]使用1.0的裁剪比例(即不裁剪),而不是224²分辨率下的0.875。
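A minimal sketch of layer-wise learning rate decay with every 3 consecutive blocks sharing one scale is given below; only the grouping rule and the geometric decay come from the text, while the toy model, base learning rate, and decay factor are illustrative assumptions.

```python
import torch.nn as nn
from torch.optim import AdamW

def layerwise_lr_groups(blocks, base_lr, decay=0.9, blocks_per_group=3):
    """Earlier groups of blocks get geometrically smaller learning rates."""
    n_groups = (len(blocks) + blocks_per_group - 1) // blocks_per_group
    groups = []
    for i in range(0, len(blocks), blocks_per_group):
        scale = decay ** (n_groups - 1 - i // blocks_per_group)  # last group keeps base_lr
        params = [p for b in blocks[i:i + blocks_per_group] for p in b.parameters()]
        groups.append({"params": params, "lr": base_lr * scale})
    return groups

blocks = [nn.Linear(8, 8) for _ in range(12)]  # stand-in for the backbone's blocks
optimizer = AdamW(layerwise_lr_groups(blocks, base_lr=5e-5, decay=0.9))
print([round(g["lr"], 7) for g in optimizer.param_groups])  # smallest lr for the earliest group
```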
A.3. Downstream Tasks
For ADE20K and COCO experiments, we follow the training settings used in BEiT [6] and Swin [45]. We also use MMDetection [10] and MMSegmentation [13] toolboxes. We use the final model weights (instead of EMA weights) from ImageNet pre-training as network initializations.
對(duì)于ADE20K和COCO的實(shí)驗(yàn),我們遵循BEiT[6]和Swin[45]中使用的訓(xùn)練設(shè)置。我們還使用了MMDetection[10]和MMSegmentation[13]工具箱。我們使用ImageNet預(yù)訓(xùn)練的最終模型權(quán)重(而不是EMA權(quán)重)作為網(wǎng)絡(luò)初始化。
We conduct a lightweight sweep for COCO experiments including learning rate {1e-4, 2e-4}, layer-wise learning rate decay [6] {0.7, 0.8, 0.9, 0.95}, and stochastic depth rate {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. We fine-tune the ImageNet-22K pre-trained Swin-B/L on COCO using the same sweep. We use the official code and pre-trained model weights [3].
我們對(duì)COCO實(shí)驗(yàn)進(jìn)行了輕量級(jí)的超參數(shù)搜索,包括學(xué)習(xí)率{1e-4, 2e-4}、逐層學(xué)習(xí)率衰減[6]{0.7, 0.8, 0.9, 0.95}以及隨機(jī)深度率{0.3, 0.4, 0.5, 0.6, 0.7, 0.8}。我們使用同樣的搜索范圍在COCO上微調(diào)ImageNet-22K預(yù)訓(xùn)練的Swin-B/L,并使用官方代碼和預(yù)訓(xùn)練模型權(quán)重[3]。
The hyperparameters we sweep for ADE20K experiments include learning rate {8e-5, 1e-4}, layer-wise learning rate decay {0.8, 0.9}, and stochastic depth rate {0.3, 0.4, 0.5}. We report validation mIoU results using multi-scale testing. Additional single-scale testing results are in Table 7.
我們?yōu)锳DE20K實(shí)驗(yàn)搜索的超參數(shù)包括學(xué)習(xí)率{8e-5, 1e-4}、逐層學(xué)習(xí)率衰減{0.8, 0.9}以及隨機(jī)深度率{0.3, 0.4, 0.5}。我們報(bào)告使用多尺度測(cè)試的驗(yàn)證集mIoU結(jié)果,其他單尺度測(cè)試結(jié)果見表7。
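The ADE20K sweep above is a small 2 x 2 x 3 grid; a sketch of enumerating it is shown below (the training command printed is hypothetical and only stands in for launching one job per configuration, e.g. through MMSegmentation):

```python
from itertools import product

learning_rates = [8e-5, 1e-4]
layer_decays = [0.8, 0.9]
drop_path_rates = [0.3, 0.4, 0.5]

# 2 * 2 * 3 = 12 configurations in total.
for lr, ld, dp in product(learning_rates, layer_decays, drop_path_rates):
    print(f"train.py --lr {lr} --layer-decay {ld} --drop-path {dp}")
```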
B. Robustness Evaluation
Additional robustness evaluation results for ConvNeXt models are presented in Table 8. We directly test our ImageNet-1K trained/fine-tuned classification models on several robustness benchmark datasets such as ImageNet-A [33], ImageNet-R [30], ImageNet-Sketch [78] and ImageNet-C/C? [31, 48]. We report mean corruption error (mCE) for ImageNet-C, corruption error for ImageNet-C?, and top-1 accuracy for all other datasets.
表8列出了ConvNeXt模型的更多魯棒性評(píng)估結(jié)果。我們直接在多個(gè)魯棒性基準(zhǔn)數(shù)據(jù)集上測(cè)試我們?cè)贗mageNet-1K上訓(xùn)練/微調(diào)的分類模型,如ImageNet-A [33]、ImageNet-R [30]、ImageNet-Sketch [78]和ImageNet-C/C? [31, 48]。我們報(bào)告ImageNet-C的平均損壞誤差(mCE)、ImageNet-C?的損壞誤差,以及所有其他數(shù)據(jù)集的top-1準(zhǔn)確率。
Table 8. Robustness evaluation of ConvNeXt. We do not make use of any specialized modules or additional fine-tuning procedures.
表8. ConvNeXt的魯棒性評(píng)估。我們沒有使用任何專門的模塊或額外的微調(diào)程序。
ConvNeXt (in particular the large-scale model variants) exhibits promising robustness behaviors, outperforming state-of-the-art robust transformer models [47] on several benchmarks. With extra ImageNet-22K data, ConvNeXt-XL demonstrates strong domain generalization capabilities (e.g. achieving 69.3%/68.2%/55.0% accuracy on ImageNet-A/R/Sketch benchmarks, respectively). We note that these robustness evaluation results were acquired without using any specialized modules or additional fine-tuning procedures.
ConvNeXt(尤其是大規(guī)模模型變體)表現(xiàn)出了很好的魯棒性,在若干基準(zhǔn)上超過(guò)了最先進(jìn)的魯棒Transformer模型[47]。利用額外的ImageNet-22K數(shù)據(jù),ConvNeXt-XL展示了強(qiáng)大的領(lǐng)域泛化能力(例如,在ImageNet-A/R/Sketch基準(zhǔn)上分別達(dá)到69.3%/68.2%/55.0%的準(zhǔn)確率)。我們注意到,這些魯棒性評(píng)估結(jié)果是在沒有使用任何專門模塊或額外微調(diào)程序的情況下獲得的。
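Since no specialized modules or extra fine-tuning are involved, evaluation on such a benchmark reduces to a plain top-1 accuracy loop over the trained classifier. The sketch below assumes an ImageFolder-style dataset whose class indices already match the model's output head (e.g. ImageNet-Sketch); ImageNet-A/R use 200-class subsets and would additionally need a logit mask.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

@torch.no_grad()
def top1_accuracy(model, root, device="cuda"):
    """Top-1 accuracy of a trained classifier on an ImageFolder-style benchmark."""
    tf = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
    loader = DataLoader(datasets.ImageFolder(root, tf), batch_size=128, num_workers=8)
    model = model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```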
C. Modernizing ResNets: detailed results
Here we provide detailed tabulated results for the modernization experiments, at both ResNet-50 / Swin-T and ResNet-200 / Swin-B regimes. The ImageNet-1K top-1 accuracies and FLOPs for each step are shown in Table 10 and 11. ResNet-50 regime experiments are run with 3 random seeds.
這里我們提供在ResNet-50 / Swin-T和ResNet-200 / Swin-B兩種計(jì)算量級(jí)(regime)下現(xiàn)代化實(shí)驗(yàn)的詳細(xì)表格結(jié)果。表10和表11給出了每一步的ImageNet-1K top-1準(zhǔn)確率和FLOPs。ResNet-50量級(jí)的實(shí)驗(yàn)使用3個(gè)隨機(jī)種子運(yùn)行。
For ResNet-200, the initial number of blocks at each stage is (3, 24, 36, 3). We change it to Swin-B’s (3, 3, 27, 3) at the step of changing stage ratio. This drastically reduces the FLOPs, so at the same time, we also increase the width from 64 to 84 to keep the FLOPs at a similar level. After the step of adopting depthwise convolutions, we further increase the width to 128 (same as Swin-B’s) as a separate step.
對(duì)于ResNet-200,每個(gè)階段的初始?jí)K數(shù)是(3, 24, 36, 3)。在改變階段比例的步驟中,我們將其改為Swin-B的(3, 3, 27, 3)。這大大減少了FLOPs,所以同時(shí)我們也將寬度從64增加到84,以保持FLOPs在一個(gè)類似的水平。在采用深度卷積的步驟后,我們進(jìn)一步將寬度增加到128(與Swin-B的相同),作為一個(gè)單獨(dú)的步驟。
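Written out explicitly, the intermediate configurations described above look as follows ("width" is the base channel count; the labels are just a restatement of the text):

```python
# ResNet-200 modernization steps: (stage depths, base width)
configs = {
    "resnet200_baseline":   {"depths": (3, 24, 36, 3), "width": 64},
    "swin_b_stage_ratio":   {"depths": (3, 3, 27, 3),  "width": 84},   # width raised to keep FLOPs similar
    "after_depthwise_conv": {"depths": (3, 3, 27, 3),  "width": 128},  # matched to Swin-B's width
}
for name, cfg in configs.items():
    print(f"{name}: depths={cfg['depths']}, width={cfg['width']}")
```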
The observations on the ResNet-200 regime are mostly consistent with those on ResNet-50 as described in the main paper. One interesting difference is that inverting dimensions brings a larger improvement at ResNet-200 regime than at ResNet-50 regime (+0.79% vs. +0.14%). The performance gained by increasing kernel size also seems to saturate at kernel size 5 instead of 7. Using fewer normalization layers also has a bigger gain compared with the ResNet-50 regime (+0.46% vs. +0.14%).
在ResNet-200量級(jí)上的觀察與主論文中描述的ResNet-50量級(jí)上的觀察基本一致。一個(gè)有趣的區(qū)別是,反轉(zhuǎn)維度(inverting dimensions)在ResNet-200量級(jí)上帶來(lái)的改進(jìn)比在ResNet-50量級(jí)上更大(+0.79% vs. +0.14%)。增大卷積核尺寸帶來(lái)的性能收益似乎也在核尺寸為5(而不是7)時(shí)就飽和了。與ResNet-50量級(jí)相比,使用更少的歸一化層也有更大的收益(+0.46% vs. +0.14%)。
D. Detailed Architectures
We present a detailed architecture comparison between ResNet-50, ConvNeXt-T and Swin-T in Table 9. For differently sized ConvNeXts, only the number of blocks and the number of channels at each stage differ from ConvNeXt-T (see Section 3 for details). ConvNeXts enjoy the simplicity of standard ConvNets, but compete favorably with Swin Transformers in visual recognition.
我們?cè)诒?中給出了ResNet-50、ConvNeXt-T和Swin-T之間的詳細(xì)架構(gòu)比較。對(duì)于不同大小的ConvNeXt,只有每個(gè)階段的塊數(shù)和通道數(shù)與ConvNeXt-T不同(詳見第3節(jié))。ConvNeXt享有標(biāo)準(zhǔn)ConvNet的簡(jiǎn)單性,但在視覺識(shí)別方面可與Swin Transformer相媲美。
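For reference, the ConvNeXt block (a 7x7 depthwise convolution, LayerNorm, a 1x1 expansion to 4x width with GELU, a 1x1 projection, layer scale, and a residual connection) can be sketched as below. Drop path is omitted and the layer-scale initial value of 1e-6 follows the paper's description; treat this as an illustrative re-implementation rather than the released code.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 (4x expand) -> GELU -> 1x1, with
    layer scale and a residual connection; the pointwise convs are Linear layers
    applied in channels-last layout."""
    def __init__(self, dim, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x):                          # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)     # to channels-last (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = (self.gamma * x).permute(0, 3, 1, 2)   # back to (B, C, H, W)
        return shortcut + x

print(ConvNeXtBlock(96)(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```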
E. Benchmarking on A100 GPUs
Following Swin Transformer [45], the ImageNet models’ inference throughputs in Table 1 are benchmarked using a V100 GPU, where ConvNeXt is slightly faster in inference than Swin Transformer with a similar number of parameters. We now benchmark them on the more advanced A100 GPUs, which support the TensorFloat32 (TF32) tensor cores. We employ PyTorch [50] version 1.10 to use the latest “Channel Last” memory layout [22] for further speedup.
按照Swin Transformer[45]的做法,表1中ImageNet模型的推理吞吐量是使用V100 GPU進(jìn)行基準(zhǔn)測(cè)試的,在參數(shù)數(shù)量相似的情況下,ConvNeXt的推理速度略高于Swin Transformer。現(xiàn)在我們?cè)诟冗M(jìn)的A100 GPU上對(duì)它們進(jìn)行基準(zhǔn)測(cè)試,它支持TensorFloat32(TF32)張量核心。我們采用PyTorch[50]1.10版本,使用最新的 "Channel Last "內(nèi)存布局[22],以進(jìn)一步提高速度。
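The two A100-specific settings mentioned here, TF32 and the channels-last memory layout, can be enabled in PyTorch as shown below; the model and batch size used for timing are illustrative stand-ins rather than the paper's benchmark configuration.

```python
import time
import torch
import torchvision

# Allow TF32 tensor cores on Ampere GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Any ConvNet works for illustration; the paper benchmarks ConvNeXt and Swin variants.
model = torchvision.models.resnet50().cuda().eval()
model = model.to(memory_format=torch.channels_last)
x = torch.randn(64, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.no_grad():
    for _ in range(10):        # warmup
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()

print(f"throughput: {50 * x.shape[0] / (time.time() - start):.0f} images/s")
```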
We present the results in Table 12. Swin Transformers and ConvNeXts both achieve faster inference throughput than on V100 GPUs, but ConvNeXts' advantage is now significantly greater, sometimes up to 49% faster. This preliminary study shows promising signals that ConvNeXt, employed with standard ConvNet modules and simple in design, could be a practically more efficient model on modern hardware.
我們?cè)诒?2中給出結(jié)果。Swin Transformer和ConvNeXt的推理吞吐量都比在V100 GPU上更快,但ConvNeXt的優(yōu)勢(shì)現(xiàn)在明顯更大,有時(shí)快達(dá)49%。這項(xiàng)初步研究給出了有希望的信號(hào):采用標(biāo)準(zhǔn)ConvNet模塊、設(shè)計(jì)簡(jiǎn)單的ConvNeXt,在現(xiàn)代硬件上實(shí)際可能是更高效的模型。
Table 12. Inference throughput comparisons on an A100 GPU. Using TF32 data format and “channel last” memory layout, ConvNeXt enjoys up to ~49% higher throughput compared with a Swin Transformer with similar FLOPs.
表12. A100 GPU上的推理吞吐量比較。使用TF32數(shù)據(jù)格式和“channels last”內(nèi)存布局,與FLOPs相近的Swin Transformer相比,ConvNeXt的吞吐量最高可提升約49%。
F. Limitations
We demonstrate ConvNeXt, a pure ConvNet model, can perform as good as a hierarchical vision Transformer on image classification, object detection, instance and semantic segmentation tasks. While our goal is to offer a broad range of evaluation tasks, we recognize computer vision applications are even more diverse. ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. We believe the architecture choice should meet the needs of the task at hand while striving for simplicity.
我們證明了純ConvNet模型ConvNeXt在圖像分類、目標(biāo)檢測(cè)、實(shí)例分割和語(yǔ)義分割等任務(wù)上的表現(xiàn)不亞于分層視覺Transformer。雖然我們的目標(biāo)是提供廣泛的評(píng)估任務(wù),但我們認(rèn)識(shí)到計(jì)算機(jī)視覺的應(yīng)用更加多樣。ConvNeXt可能更適合某些任務(wù),而Transformer對(duì)另一些任務(wù)可能更靈活。一個(gè)典型的例子是多模態(tài)學(xué)習(xí),其中交叉注意力模塊可能更適合對(duì)多種模態(tài)之間的特征交互進(jìn)行建模。此外,當(dāng)用于需要離散的(discretized)、稀疏的(sparse)或結(jié)構(gòu)化(structured)輸出的任務(wù)時(shí),Transformer可能更靈活。我們認(rèn)為,架構(gòu)選擇應(yīng)當(dāng)滿足手頭任務(wù)的需求,同時(shí)力求簡(jiǎn)單。
G. Societal Impact
In the 2020s, research on visual representation learning began to place enormous demands on computing resources. While larger models and datasets improve performance across the board, they also introduce a slew of challenges. ViT, Swin, and ConvNeXt all perform best with their huge model variants. Investigating those model designs inevitably results in an increase in carbon emissions. One important direction, and a motivation for our paper, is to strive for simplicity — with more sophisticated modules, the network’s design space expands enormously, obscuring critical components that contribute to the performance difference. Additionally, large models and datasets present issues in terms of model robustness and fairness. Further investigation on the robustness behavior of ConvNeXt vs. Transformer will be an interesting research direction. In terms of data, our findings indicate that ConvNeXt models benefit from pre-training on large-scale datasets. While our method makes use of the publicly available ImageNet-22K dataset, individuals may wish to acquire their own data for pre-training. A more circumspect and responsible approach to data selection is required to avoid potential concerns with data biases.
在2020年代,關(guān)于視覺表征學(xué)習(xí)的研究開始對(duì)計(jì)算資源提出了巨大的要求。雖然更大的模型和數(shù)據(jù)集全面提高了性能,但也帶來(lái)了一系列的挑戰(zhàn)。ViT、Swin和ConvNeXt都在其巨大的模型變體中表現(xiàn)最好。研究這些模型設(shè)計(jì)不可避免地會(huì)導(dǎo)致碳排放的增加。一個(gè)重要的方向,也是我們論文的動(dòng)機(jī),就是力求簡(jiǎn)單——隨著更復(fù)雜的模塊,網(wǎng)絡(luò)的設(shè)計(jì)空間會(huì)極大地?cái)U(kuò)展,掩蓋了造成性能差異的關(guān)鍵部件。此外,大型模型和數(shù)據(jù)集在模型魯棒性和公平性方面存在問(wèn)題。對(duì)ConvNeXt與Transformer的魯棒性行為的進(jìn)一步調(diào)查將是一個(gè)有趣的研究方向。在數(shù)據(jù)方面,我們的發(fā)現(xiàn)表明ConvNeXt模型得益于大規(guī)模數(shù)據(jù)集的預(yù)訓(xùn)練。雖然我們的方法利用了公開的ImageNet-22K數(shù)據(jù)集,但個(gè)人可能希望獲得自己的數(shù)據(jù)進(jìn)行預(yù)訓(xùn)練。為避免潛在的數(shù)據(jù)偏差問(wèn)題,需要采取更加謹(jǐn)慎和負(fù)責(zé)任的方法來(lái)選擇數(shù)據(jù)。
References
[1] PyTorch Vision Models. https://pytorch.org/vision/stable/models.html. Accessed: 2021-10-01.
[2] GitHub repository: Swin transformer. https://github.com/microsoft/Swin-Transformer, 2021.
[3] GitHub repository: Swin transformer for object detection. https://github.com/SwinTransformer/Swin-Transformer-Object-Detection, 2021.
[4] Anonymous. Patches are all you need? Openreview, 2021.
[5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016.
[6] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021.
[7] Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. NeurIPS, 2021.
[8] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In ICCV, 2019.
[9] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[10] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155, 2019.
[11] Fran?ois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[12] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
[13] MMSegmentation contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[14] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.
[15] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 2021.
[16] Stéphane d’Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. ConViT: Improving vision transformers with soft convolutional inductive biases. ICML, 2021.
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[19] Piotr Dollár, Serge Belongie, and Pietro Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[21] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. ICCV, 2021.
[22] Vitaly Fedyunin. Tutorial: Channel last memory format in PyTorch. https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html, 2021. Accessed: 2021-10-01.
[23] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[24] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[25] Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv:2106.04263, 2021.
[26] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
[27] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[30] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
[31] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2018.
[32] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016.
[33] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
[34] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[35] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[36] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[37] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[38] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In NeurIPS, 2017.
[39] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In ECCV, 2020.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
[41] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In CVPR, 2016.
[42] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
[43] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[44] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV. 2014.
[45] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[46] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[47] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision transformer. arXiv preprint arXiv:2105.07926, 2021.
[48] Eric Mintun, Alexander Kirillov, and Saining Xie. On interaction between augmentations and corruptions in natural corruption robustness. NeurIPS, 2021.
[49] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
[50] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[51] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992.
[52] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[53] Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. In ICCV, 2019.
[54] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020.
[55] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. NeurIPS, 2019.
[56] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. NeurIPS, 2021.
[57] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[58] Henry A Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection. TPAMI, 1998.
[59] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[60] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
[61] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
[62] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[63] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multistage feature learning. In CVPR, 2013.
[64] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[65] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[66] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In CVPR, 2021.
[67] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[68] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[69] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[70] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
[71] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[72] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In ICML, 2021.
[73] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv:2012.12877, 2020.
[74] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. ICCV, 2021.
[75] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
[76] Régis Vaillant, Christophe Monrocq, and Yann Le Cun. Original approach for the localisation of objects in images. Vision, Image and Signal Processing, 1994.
[77] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[78] Haohan Wang, Songwei Ge, Eric P Xing, and Zachary C Lipton. Learning robust global representations by penalizing local predictive power. NeurIPS, 2019.
[79] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[80] Ross Wightman. GitHub repository: Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[81] Ross Wightman, Hugo Touvron, and Hervé Jégou. Resnet strikes back: An improved training procedure in timm. arXiv:2110.00476, 2021.
[82] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. ICCV, 2021.
[83] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[84] Yuxin Wu and Justin Johnson. Rethinking “batch” in batchnorm. arXiv:2105.07576, 2021.
[85] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[86] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021.
[87] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[88] Weijian Xu, Yifan Xu, Tyler Chang, and Zhuowen Tu. Coscale conv-attentional image transformers. ICCV, 2021.
[89] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[90] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[91] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, 2020.
[92] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019