ConvNeXt

Paper: A ConvNet for the 2020s. Authors: Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

Published: 2022

Official Code

In one sentence: ResNet, modernized in the style of a Transformer.

Modernizing a ConvNet: a Roadmap


Figure: dark bars are ResNet-50/Swin-T; gray bars are ResNet-200/Swin-B; hatched bars mean the modification is not adopted.

Detailed Architectures


ConvNeXt detailed architectures (figure)

Detailed results for modernizing a ResNet-50 / a ResNet-200 (figures)

Macro design

  • Changing stage compute ratio (78.8%—>79.4%)

    Number of blocks per stage: (3,4,6,3) -> (3,3,9,3), matching Swin-T's 1:1:3:1 stage ratio.

  • Changing stem to “Patchify” (79.4%—>79.5%)

    Input size 224; the stem downsamples by $4\times$ to 56. Convolution output size: $(W-F+2P)/S+1$ (the division is rounded down).

    Traditional ResNet: a stride-2 $7\times7$ conv (padding 3) followed by a stride-2 $3\times3$ max pooling (padding 1): $(224-7+2\cdot3)/2+1=112$, then $(112-3+2\cdot1)/2+1=56$ (PyTorch rounds the division down)

    Swin-T: a stride-4 $4\times4$ conv: $(224-4)/4+1=56$

    ConvNeXt: a stride-4 $4\times4$ conv (the same "patchify" stem)

    ```python
    # Standard ResNet stem
    stem = nn.Sequential(
        nn.Conv2d(in_chans, dims[0], kernel_size=7, stride=2, padding=3),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
    )
    # ConvNeXt "patchify" stem
    stem = nn.Sequential(
        nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
        LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
    )
    ```
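
    As a quick check of the arithmetic above (a sketch of mine, not from the paper; the channel widths 64 and 96 are just the usual stage-1 values), both stems map a $224\times224$ input to a $56\times56$ feature map:

    ```python
    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 224, 224)

    # ResNet stem: 7x7/2 conv + 3x3/2 max pool
    resnet_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )
    print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])

    # ConvNeXt "patchify" stem: 4x4/4 conv (the LayerNorm is omitted here; it does not change the shape)
    patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)
    print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
    ```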

ResNeXt-ify (79.5%—>80.5%)

Use more groups, expand width

The $3\times3$ conv in the bottleneck becomes a depthwise conv (number of groups equals number of channels).

Widen the network to match Swin-T's channel count (from 64 to 96).
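
A minimal sketch (mine, not the paper's code) of what that change means at the stage-1 width of 96 channels; the depthwise version has far fewer parameters and FLOPs, and that saved budget is spent on the wider network:

```python
import torch.nn as nn

dim = 96  # stage-1 width after the 64 -> 96 change

dense = nn.Conv2d(dim, dim, kernel_size=3, padding=1)              # standard 3x3 conv
dwise = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise: groups == channels

print(sum(p.numel() for p in dense.parameters()))  # 83040  (96*96*9 weights + 96 biases)
print(sum(p.numel() for p in dwise.parameters()))  # 960    (96*9 weights + 96 biases)
```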

Inverted bottleneck (80.5%—>80.6%)


Block modifications and resulted specifications

(a) ResNeXt block; (b) inverted bottleneck block; (c) inverted bottleneck with the depthwise conv moved up

d = 4 (expansion factor of the hidden dimension)
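
A rough sketch (mine) of the three layer orderings, with channel widths assumed from the paper's Figure 3: (a) operates on a 384-d stream with a 96-d bottleneck, while (b) and (c) operate on a 96-d stream with a 384-d hidden layer (d = 4):

```python
import torch.nn as nn

dim, hidden = 96, 4 * 96

# (a) ResNeXt-style bottleneck: wide stream -> narrow -> wide, depthwise conv in the middle
block_a = nn.Sequential(
    nn.Conv2d(hidden, dim, kernel_size=1),
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
    nn.Conv2d(dim, hidden, kernel_size=1),
)

# (b) inverted bottleneck: narrow stream -> wide hidden layer -> narrow (as in Transformer MLP blocks)
block_b = nn.Sequential(
    nn.Conv2d(dim, hidden, kernel_size=1),
    nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
    nn.Conv2d(hidden, dim, kernel_size=1),
)

# (c) like (b), but the depthwise conv is moved up to the front and runs on the narrow width
block_c = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
    nn.Conv2d(dim, hidden, kernel_size=1),
    nn.Conv2d(hidden, dim, kernel_size=1),
)
```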

Large kernel size

  • Use the inverted bottleneck with the depthwise conv moved up, i.e. block (c) from the Inverted bottleneck section above (accuracy temporarily drops to 79.9%)
  • Enlarge the depthwise conv to $7\times7$: 79.9% ($3\times3$) —> 80.6% ($7\times7$)

Various layer-wise micro designs


ConvNeXt_Block
```python
import torch
import torch.nn as nn
from timm.models.layers import DropPath  # stochastic depth helper from timm

class Block(nn.Module):
    r""" ConvNeXt Block. There are two equivalent implementations:
    (1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
    (2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear; Permute back
    We use (2) as we find it slightly faster in PyTorch

    Args:
        dim (int): Number of input channels.
        drop_path (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim) # depthwise conv
        self.norm = LayerNorm(dim, eps=1e-6)  # channels-aware LayerNorm (sketched further down)
        self.pwconv1 = nn.Linear(dim, 4 * dim) # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        # gamma implements the Layer Scale training strategy
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)),
                                    requires_grad=True) if layer_scale_init_value > 0 else None
        # drop_path implements the stochastic depth training strategy
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        input = x
        x = self.dwconv(x)
        # the 1x1 convs are implemented as Linear layers, so move channels to the last dim
        x = x.permute(0, 2, 3, 1) # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        x = x.permute(0, 3, 1, 2) # (N, H, W, C) -> (N, C, H, W)

        x = input + self.drop_path(x)
        return x
```
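
A quick usage check (it assumes the channels-aware LayerNorm helper sketched after the downsampling code below): the block preserves both the channel count and the spatial size, so it can be stacked freely within a stage:

```python
import torch

blk = Block(dim=96, drop_path=0.1)
x = torch.randn(2, 96, 56, 56)   # hypothetical stage-1 feature map
print(blk(x).shape)              # torch.Size([2, 96, 56, 56])
```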
  • GELU instead of ReLU (80.6%, unchanged)

    As in Swin-T, keep only a single GELU per block (80.6%—>81.3%)

  • Keep only one BN layer per block (even fewer normalization layers than Swin-T; adding an extra BN at the start of the block does not improve performance) (81.3%—>81.4%)

    Replace BN with LN (81.4%—>81.5%)

    Directly substituting LN for BN in the original ResNet does not work well; it only pays off together with the other modifications. (A sketch of the channels-aware LayerNorm used throughout the code in this post is given after the downsampling snippet below.)

  • Separate downsampling layers (81.5%—>82%)

    ResNet: downsampling is done by a stride-2 $3\times3$ conv inside the block, with a stride-2 $1\times1$ conv on the shortcut connection of residual blocks

    Swin-T: a separate downsampling layer between stages

    ConvNeXt: a stride-2 $2\times2$ conv (with a LayerNorm in front, as in the code below)

    ```python
    # https://github.com/facebookresearch/ConvNeXt/blob/e4e7eb2fbd22d58feae617a8c989408824aa9eda/models/convnext.py#L72
    self.downsample_layers = nn.ModuleList() # stem and 3 intermediate downsampling conv layers
    stem = nn.Sequential(
        nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
        LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
    )
    self.downsample_layers.append(stem)
    for i in range(3):
        downsample_layer = nn.Sequential(
            LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
            nn.Conv2d(dims[i], dims[i+1], kernel_size=2, stride=2),
        )
        self.downsample_layers.append(downsample_layer)
    ```
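
The stem, downsampling, and Block snippets above all rely on a LayerNorm that also accepts (N, C, H, W) tensors. Here is a sketch of such a helper; it mirrors the one defined in the official repo's models/convnext.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm supporting (N, H, W, C) "channels_last" and (N, C, H, W) "channels_first" inputs."""
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape, )

    def forward(self, x):
        if self.data_format == "channels_last":
            # normalize over the last dim, exactly like nn.LayerNorm
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            # manual normalization over the channel dim for (N, C, H, W) tensors
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x
```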

Empirical Evaluations on ImageNet

ConvNeXt variant configurations

| ConvNeXt variant | Channels C | Blocks per stage B | IN-1K top-1 acc (224 input) |
|---|---|---|---|
| ConvNeXt-T | (96, 192, 384, 768) | (3, 3, 9, 3) | 82.1 |
| ConvNeXt-S | (96, 192, 384, 768) | (3, 3, 27, 3) | 83.1 |
| ConvNeXt-B | (128, 256, 512, 1024) | (3, 3, 27, 3) | 83.8 |
| ConvNeXt-L | (192, 384, 768, 1536) | (3, 3, 27, 3) | 84.3 |
| ConvNeXt-XL | (256, 512, 1024, 2048) | (3, 3, 27, 3) | 87.0 (IN-22K pre-trained) |
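
All variants share the same block and differ only in the per-stage depths and widths; a sketch of how they could be instantiated, assuming the ConvNeXt class and its depths/dims arguments from the official repo's models/convnext.py:

```python
# Depth (blocks per stage) and width (channels per stage) for each variant, from the table above
convnext_settings = {
    "tiny":   dict(depths=[3, 3, 9, 3],  dims=[96, 192, 384, 768]),
    "small":  dict(depths=[3, 3, 27, 3], dims=[96, 192, 384, 768]),
    "base":   dict(depths=[3, 3, 27, 3], dims=[128, 256, 512, 1024]),
    "large":  dict(depths=[3, 3, 27, 3], dims=[192, 384, 768, 1536]),
    "xlarge": dict(depths=[3, 3, 27, 3], dims=[256, 512, 1024, 2048]),
}

# e.g. ConvNeXt-T, assuming the ConvNeXt class from the official models/convnext.py:
# model = ConvNeXt(num_classes=1000, **convnext_settings["tiny"])
```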

Training Techniques

ImageNet-1K

| (Pre)-training config | ResNet50 (standard) | ResNet50 (timm) | ResNet50 (torchvision) | ConvNeXt-T |
|---|---|---|---|---|
| optimizer | SGD | LAMB | SGD | AdamW |
| base learning rate | 0.1 | 5e-3 | 0.5 | 4e-3 |
| weight decay | 1e-4 | 0.01 | 2e-5 | 0.05 |
| optimizer momentum | 0.9 | - | 0.9 | $\beta_1,\beta_2=0.9,0.999$ |
| batch size | $8\times32=256$ | $4\times512=2048$ | $8\times128=1024$ | $4\times8\times128=4096$ |
| training epochs | 90 | 600 | 600 | 300 |
| learning rate schedule | StepLR (step=30, gamma=0.1) | cosine decay | cosine decay | cosine decay |
| warmup epochs | - | 5 | 5 | 20 |
| warmup schedule | - | linear | linear | linear |

The effective batch size = --nodes * --ngpus * --batch_size * --update_freq. In the example above, the effective batch size is 4*8*128*1 = 4096
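
A minimal sketch of the ConvNeXt-T column in plain PyTorch (my simplification: per-epoch scheduling and a stand-in model; the official code steps the schedule per iteration and handles weight decay per parameter group):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in for the actual ConvNeXt-T network

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, betas=(0.9, 0.999), weight_decay=0.05)

warmup_epochs, total_epochs = 20, 300
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        # linear warmup for the first 20 epochs ...
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        # ... then cosine decay for the remaining 280 epochs
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)
```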

Data augmentation

| (Pre)-training config | ResNet50 (standard) | ResNet50 (timm) | ResNet50 (torchvision) | ConvNeXt-T |
|---|---|---|---|---|
| Mixup | - | 0.2 | 0.2 | 0.8 |
| CutMix | - | 1.0 | 1.0 | 1.0 |
| RandAugment | - | (7, 0.5) | auto_augment='ta_wide' | (9, 0.5) |
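
A sketch of the ConvNeXt-T augmentation column using timm; the auto_augment string is timm's usual encoding of RandAugment with magnitude 9 and magnitude-noise std 0.5, and everything outside the timm calls is a placeholder:

```python
from timm.data import Mixup, create_transform

# RandAugment (9, 0.5)
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1",
    interpolation="bicubic",
)

# Mixup alpha 0.8, CutMix alpha 1.0, with 0.1 label smoothing folded into the soft targets
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, label_smoothing=0.1, num_classes=1000)
# inside the training loop: images, targets = mixup_fn(images, targets)
```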

Regularization

| (Pre)-training config | ResNet50 (standard) | ResNet50 (timm) | ResNet50 (torchvision) | ConvNeXt-T |
|---|---|---|---|---|
| Stochastic Depth | - | 0.05 | - | 0.1 |
| Label Smoothing | - | 0.1 | 0.1 | 0.1 |
| Layer Scale | - | - | - | 1e-6 |
| EMA | - | - | 0.99998 | 0.9999 |
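
Stochastic Depth and Layer Scale correspond to the drop_path and gamma pieces of the Block code above. EMA keeps an exponential moving average of the weights for evaluation; a sketch with timm's helper, using the ConvNeXt-T decay from the table (the stand-in model is a placeholder):

```python
import torch.nn as nn
from timm.utils import ModelEmaV2

model = nn.Linear(10, 10)                    # stand-in for the network being trained
model_ema = ModelEmaV2(model, decay=0.9999)  # shadow copy of the weights

# in the training loop, after each optimizer.step():
#     model_ema.update(model)
# evaluate with model_ema.module instead of model
```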

Top-1 acc

| (Pre)-training config | ResNet50 (standard) | ResNet50 (timm) | ResNet50 (torchvision) | ConvNeXt-T |
|---|---|---|---|---|
| Top-1 acc | 75.3 | 80.4 | 80.674 | 82.1 |

Further reading

ConvNeXt:手把手教你改模型 (in Chinese: a step-by-step walkthrough of the modifications)

ResNet strikes back: An improved training procedure in timm

How to Train State-Of-The-Art Models Using TorchVision’s Latest Primitives
