EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba论文精读(逐段解析)
论文地址:https://arxiv.org/abs/2403.09977
CVPR 2024
Abstract. Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution while Transformers offer global reach but escalate computational demands $\mathcal{O}(N^{2})$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduces a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates an atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates the model performance. Experimental results show that EfficientVMamba scales down the computational complexity while yielding competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with 1.3G FLOPs improves Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet. Code is available at: https://github.com/TerryPei/EfficientVMamba .
【翻译】摘要。先前在轻量级模型开发方面的努力主要集中在基于CNN和Transformer的设计上,但仍面临持续的挑战。CNN擅长局部特征提取但会损害分辨率,而Transformer提供全局覆盖但计算成本急剧上升,形成了一个显著的障碍。最近,状态空间模型(SSMs),如Mamba,在语言建模和计算机视觉等各种任务中表现出卓越的性能和竞争力,同时将全局信息提取的时间复杂度降低到O(N)\mathcal O(N)O(N)。受此启发,本工作提出探索视觉状态空间模型在轻量级模型设计中的潜力,并引入了一种名为EfficientVMamba的新型高效模型变体。具体而言,我们的EfficientVMamba通过高效跳跃采样集成了基于空洞的选择性扫描方法,构成了旨在利用全局和局部表征特征的构建块。此外,我们研究了SSM块与卷积之间的集成,并引入了一个与额外卷积分支相结合的高效视觉状态空间块,进一步提升了模型性能。实验结果表明,EfficientVMamba在缩小计算复杂度的同时,在各种视觉任务中产生了具有竞争力的结果。例如,我们的EfficientVMamba-S仅需1.3G FLOPs,相比需要1.5G FLOPs的VimTi在ImageNet上提升了5.6%5.6\%5.6%的准确率。代码可在以下链接获取:https://github.com/TerryPei/EfficientVMamba。
【解析】当我们要构建一个既准确又高效的视觉模型时,传统的两大主流架构都有各自的明显缺点。CNN虽然在提取图像的局部特征方面表现出色(比如边缘、纹理等),但这种局部性导致它在处理需要全局理解的任务时力不从心,而且为了获得更大的感受野往往需要增加网络深度,这又会降低特征图的分辨率。Transformer架构恰恰相反,它的自注意力机制天生具备全局建模能力,可以让图像中任意两个位置的像素直接交互,但这种全连接的特性导致计算复杂度随输入长度的平方增长,在高分辨率图像上变得极其昂贵。
状态空间模型为这个困境提供了一个很好的解决方案。它其实源自控制理论,最初用于描述动态系统的状态演化过程。在深度学习的语境下,SSM可以被理解为一种序列建模工具,它通过维护一个隐藏状态来捕获序列中的长程依赖关系。关键的突破在于,SSM能够以线性复杂度O(N)\mathcal O(N)O(N)实现全局信息的传播和聚合,这打破了Transformer中二次复杂度的桎梏。Mamba作为SSM的一个重要实现,通过引入选择性机制和硬件友好的算法设计,进一步提升了这种架构的实用性。
EfficientVMamba的核心创新体现在两个方面:一是提出了基于空洞卷积思想的选择性扫描策略,通过跳跃采样减少需要处理的token数量,在保持全局感受野的同时大幅降低计算开销;二是设计了一种双路径架构,将SSM的全局建模能力与卷积的局部特征提取优势有机结合,通过通道注意力机制实现两者的动态融合。
Keywords: Light-weight Architecture Efficient Network State Space Model
【翻译】关键词:轻量级架构 高效网络 状态空间模型
1 Introduction
Convolutional networks, exemplified by models such as ResNet [ 14 ], Inception [ 38 , 39 ], and EfficientNet [ 41 ], and Transformer-based networks, such as SwinTransformer [ 25 ], Beit [ 1 ], and Resformer [ 59 ] have been extensively applied to visual tasks including image classification, detection, and segmentation, achieving remarkable results. Recently, Mamba [ 7 ], a network based on state-space models (SSMs) [ 9 – 11 , 20 , 31 ], has demonstrated competitive performance to Transformers [ 47 ] in sequence modeling tasks such as language modeling. Inspired by this, some works [ 22 , 24 , 33 , 55 , 61 ] are pioneering in introducing SSMs into vision tasks. Among these methods, Vmamba [ 24 ] stands out by introducing an SS2D method to preserve 2D spatial dependencies by scanning images from multiple directions.
【翻译】卷积网络,以ResNet [14]、Inception [38, 39]和EfficientNet [41]等模型为代表,以及基于Transformer的网络,如SwinTransformer [25]、Beit [1]和Resformer [59],已被广泛应用于包括图像分类、检测和分割在内的视觉任务,取得了显著的成果。最近,基于状态空间模型(SSMs)[9-11, 20, 31]的网络Mamba [7]在语言建模等序列建模任务中展现出了与Transformers [47]相竞争的性能。受此启发,一些工作[22, 24, 33, 55, 61]开始将SSMs引入视觉任务。在这些方法中,Vmamba [24]通过引入SS2D方法来保持2D空间依赖性,通过从多个方向扫描图像而脱颖而出。
【解析】这段话说明了当前视觉任务的主流方法和最新进展。在深度学习的视觉领域,长期以来存在两大主流架构:CNN和Transformer。CNN通过卷积操作能够很好地捕获图像的局部特征和空间结构,而Transformer则通过自注意力机制实现了全局信息的建模能力。但是,这两种架构都有各自的局限性——CNN难以建模长程依赖关系,而Transformer的计算复杂度过高。状态空间模型作为一种新兴的序列建模方法,最初在自然语言处理领域取得了成功,特别是Mamba模型在保持全局建模能力的同时大幅降低了计算复杂度。因此激发了研究者将SSM引入计算机视觉领域的兴趣。Vmamba是这一方向的重要代表,它通过SS2D(二维选择性扫描)方法解决了如何将原本设计用于一维序列的SSM适配到二维图像数据的关键问题。
However, the impressive performance achieved by these various architectures usually comes from the scaling up of model sizes, making it a critical challenge to apply them on resource-constrained devices. In pursuit of light-weight models, many studies have been conducted to reduce the resource consumption of vision models while keeping a competitive performance. Early works on efficient CNNs mainly focus on narrowing the original convolutional block with efficient group convolutions [4, 15, 34], light skipping connections [12, 27], etc. More recently, due to the remarkable success of including the global representation ability of Transformers into vision tasks, some works are proposed to reduce the computation complexity of ViTs [17, 25, 46, 49] and fuse ViTs with CNNs in light-weight models [19, 28, 46]. However, the lightening of ViTs is usually obtained at the loss of the global capture capability in self-attention. Due to the $\mathcal{O}(N^{2})$ time complexity of global self-attention, its computation and memory costs increase dramatically at large resolutions. As a result, existing efficient ViT methods have to perform local self-attentions within partitioned windows [17, 25, 46], or only conduct global self-attentions in deeper stages with low resolutions [19, 28]. The embarrassing trade-off and rollback of ViTs to CNNs hinders the ability of improving the light-weight models further.
【翻译】然而,这些各种架构所取得的令人印象深刻的性能通常来自于模型规模的扩大,这使得在资源受限的设备上应用它们成为一个关键挑战。为了追求轻量级模型,许多研究已经进行,以减少视觉模型的资源消耗,同时保持竞争性能。早期关于高效CNN的工作主要集中在通过高效的分组卷积[4, 15, 34]、轻量跳跃连接[12, 27]等来缩小原始卷积块。而最近,由于将Transformers的全局表示能力引入视觉任务取得了显著成功,一些工作被提出来降低ViTs的计算复杂度[17, 25, 46, 49],并将ViTs与CNNs融合到轻量级模型中[19, 28, 46]。然而,ViTs的轻量化通常是以失去自注意力中的全局捕获能力为代价获得的。由于全局自注意力的O(N2)\mathcal O(N^2)O(N2)时间复杂度,其计算和内存成本在大分辨率下急剧增加。因此,现有的高效ViT方法必须在分割的窗口内执行局部自注意力[17, 25, 46],或只在具有低分辨率的深层阶段进行全局自注意力[19, 28]。ViTs向CNNs的尴尬权衡和回退阻碍了进一步改进轻量级模型的能力。
【解析】这段话点出了当前轻量级模型设计面临的难题。模型性能的提升往往依赖于参数量和计算复杂度的增加,但这与移动设备、边缘计算等实际应用场景的资源限制产生了根本性矛盾。在轻量化的演进过程中,CNN领域的研究相对成熟,通过分组卷积、深度可分离卷积、跳跃连接等技术手段实现了效率与性能的较好平衡。Transformer在视觉任务中的成功主要源于其强大的全局建模能力,但这种能力的代价是二次复杂度的自注意力计算。当图像分辨率增加时,这种复杂度会呈现爆炸式增长,使得模型在资源受限环境下难以实用。为了解决这个问题,现有的轻量化ViT方法不得不做出妥协:要么将注意力限制在局部窗口内(如Swin Transformer, https://blog.csdn.net/weixin_46248968/article/details/149156228?spm=1001.2014.3001.5502),要么只在网络的深层使用全局注意力(此时特征图分辨率已经较低)。这种妥协实际上是在向CNN的设计理念倒退,违背了Transformer全局建模的初衷,也限制了轻量级模型进一步突破的可能性。
Fig. 1: Lightweight Model Performance Comparison on ImageNet. EfficientV- Mamba outperforms previous work across various model variants in terms of both accuracy and computational complexity.
【翻译】图1:ImageNet上轻量级模型性能比较。EfficientVMamba在各种模型变体中,在准确率和计算复杂度方面都优于之前的工作。
In this paper, recalling the previously mentioned linear scaling complexity in SSMs, we are inspired to obtain efficient global capture ability in light-weight vision models by involving SSMs into model design. Its outstanding performance is demonstrated in Figure 1. We achieve this by first introducing a skip-sampling mechanism, which reduces the number of tokens that need to be scanned in the spatial dimension, and saves multiple times of computation cost in sequence modeling of SSMs while keeping the global receptive field among tokens, as illustrated in Figure 2. On the other hand, acknowledging that convolutions provide a more efficient way for feature extraction in the case when only local representations suffice, we introduce a convolution branch to supplement the original global SSM branch, and perform feature fusion of them through the channel attention module, SE [16]. Finally, for an optimal allocation of capabilities of various block types, we construct our network with global SSM blocks in the shallow and high-resolution layers, while adopting efficient convolution blocks (MobileNetV2 blocks [34]) in the deeper layers. The final network, achieving efficient SSM computation and efficient integration of convolutions, has showcased significant improvements compared to previous CNN and ViT based light-weight models through our experiments on image classification, object detection, and semantic segmentation tasks.
【翻译】在本文中,回顾前面提到的SSMs中的线性缩放复杂度,我们受到启发,通过将SSMs引入模型设计来在轻量级视觉模型中获得高效的全局捕获能力。其卓越性能在图1中得到了展示。我们通过首先引入跳跃采样机制来实现这一点,该机制减少了在空间维度中需要扫描的标记数量,并在保持标记间全局感受野的同时节省了SSMs序列建模中数倍的计算成本,如图2所示。另一方面,认识到卷积在仅需要局部表示的情况下提供了更高效的特征提取方式,我们引入了一个卷积分支来补充原始的全局SSM分支,并通过通道注意力模块SE [16]对它们进行特征融合。最后,为了优化各种块类型的能力分配,我们在浅层和高分辨率层构建了具有全局SSM块的网络,而在深层采用了高效的卷积块(MobileNetV2块[34])。最终的网络实现了高效的SSM计算和卷积的高效集成,通过我们在图像分类、目标检测和语义分割任务上的实验,与之前基于CNN和ViT的轻量级模型相比展现了显著的改进。
【解析】这段话指出了EfficientVMamba的设计思路。作者从SSM的线性复杂度特性出发,提出了一个多层次的解决方案。首先是跳跃采样机制,通过有策略地跳过某些token来减少计算量,但同时保持全局感受野不变。这就像在保持能看到整个画面的前提下,选择性地关注一些关键点,从而大幅降低计算开销。其次是双分支架构的设计,SSM擅长全局建模但在局部特征提取上可能不如卷积高效,而卷积在局部特征提取上有天然优势。通过将两者结合并用通道注意力进行动态权重分配,实现了优势互补。最后是网络层次化设计,这其实是进一步优化了特征提取过程:在浅层高分辨率阶段,全局信息的建模更为重要,而在深层低分辨率阶段,局部特征的进一步抽象和整合更加关键。这种分层设计提高效率的同时,更符合视觉特征处理的层次化规律。
Fig. 2: Illustration of efficient 2D scan methods (ES2D). (a.) Vmamba [24] employs the SS2D method in vision tasks, traversing entire row or column axes, which incurs heavy computational resources. (b.) We present an efficient 2D scanning method, ES2D, which organizes patches by omitting sampling steps, and then proceeds with an intra-group traversal (with a skipping step of 2 in the Figure). The proposed scan approach reduces computational demands ($4N \rightarrow N$) while preserving global feature maps (e.g. each group contains eye-related patches).
【翻译】图2:高效2D扫描方法(ES2D)的图示。(a.) Vmamba [24]在视觉任务中采用SS2D方法,遍历整个行或列轴,这会产生大量计算资源开销。(b.) 我们提出了一种高效的2D扫描方法ES2D,通过省略采样步骤来组织补丁,然后进行组内遍历(图中跳跃步长为2)。所提出的扫描方法在保持全局特征图的同时减少了计算需求(4N→N4N\rightarrow N4N→N)(例如,每组都包含与眼部相关的补丁)。
In summary, the contributions of this paper are as follows.
- We propose an atrous-based selective scanning strategy, which is realized through a novel skip sampling and regrouping of patches in the spatial receptive field. The strategy refines the building blocks to efficiently extract global dependencies while reducing computation complexity ($\mathcal{O}(N)\to\mathcal{O}(N/p^{2})$ with step $p$).
- We introduce a dual-pathway module that combines our efficient scanning strategy for global feature capture and a convolution branch for efficient local feature extraction, along with a channel attention module to balance the integration of both global and local features. Besides, we propose a better allocation of SSM and CNN blocks by promoting SSMs in early stages with high resolutions for better global capture, while adopting CNNs in low resolutions for better efficiency.
- We conduct extensive experiments on image classification, object detection, and semantic segmentation tasks. The results and illustration shown in Figure 1 demonstrate that, our EfficientVMamba effectively reduces the FLOPs of the models while achieving significant performance improvements compared to existing light-weight models.
【翻译】总结而言,本文的贡献如下:
- 我们提出了一种基于空洞的选择性扫描策略,通过在空间感受野中新颖的跳跃采样和重新分组补丁来实现。该策略优化了构建块,以高效提取全局依赖关系,同时降低计算复杂度(从$\mathcal{O}(N)$到$\mathcal{O}(N/p^{2})$,步长为$p$)。
- 我们引入了一个双路径模块,结合我们的高效扫描策略进行全局特征捕获和卷积分支进行高效局部特征提取,以及一个通道注意力模块来平衡全局和局部特征的集成。此外,我们提出了SSM和CNN块的更好分配,通过在高分辨率的早期阶段推广SSMs以获得更好的全局捕获,而在低分辨率中采用CNNs以获得更好的效率。
- 我们在图像分类、目标检测和语义分割任务上进行了广泛的实验。图1中显示的结果和说明表明,我们的EfficientVMamba有效降低了模型的FLOPs,同时与现有轻量级模型相比实现了显著的性能改进。
【解析】EfficientVMamba三个贡献点。第一个贡献是算法层面的创新,空洞扫描策略的数学表达O(N)→O(N/p2)\mathcal{O}(N)\to\mathcal{O}(N/p^{2})O(N)→O(N/p2)说明了计算复杂度的显著降低——当跳跃步长为ppp时,计算量会按p2p^2p2的比例减少,这在大分辨率图像处理中具有巨大价值。第二个贡献是架构层面的创新,双路径设计考虑了不同计算模式特点的高效利用,而层次化的块分配策略则改善了视觉特征处理过程。第三个贡献验证了理论设计的实际效果,跨多个视觉任务的实验结果证明了其方法的通用性,也展示了在效率和性能双重指标上的优越性。可以说,EfficientVMamba是轻量级视觉模型的优秀成果。
2 Related Work
2.1 轻量级视觉模型
In recent years, the realm of vision tasks has been predominantly governed by Convolutional Neural Networks (CNNs) and Visual Transformer (ViT) architectures. The focus on making these architectures lightweight to enhance efficiency has emerged as a pragmatic and promising direction in research. For CNNs, notable advancements have been made in improving image classification accuracy, as evidenced by the development of influential architectures like ResNet [ 14 ], RegNet [ 35 ], and DenseNet [ 18 ]. These advancements have set new benchmarks in accuracy but also introduced a need for lightweight architectures [ 51 , 52 ]. This need has been addressed through various factorization-based methods, making CNNs more mobile-friendly. For instance, separable convolutions introduced by Xception have been instrumental in this regard, leading to the development of state-of-the-art lightweight CNNs, such as MobileNets [ 15 ], ShuffleNetv2 [ 27 ], ESPNetv2 [ 29 ], MixConv [ 42 ], MNASNet [ 40 ], and GhostNets [ 13 ]. These models are not only versatile but also relatively simpler to train. Following CNNs, Transformers have gained significant traction in various vision tasks, such as image classification, object detection, and autonomous driving, rapidly becoming the mainstream approach. The lightweight versions of Transformers have been achieved through diverse methods. On the training front, sophisticated data augmentation strategies and techniques like Mixup [ 60 ], CutMix [ 58 ], and RandAugment [ 6 ] have been employed, as seen in models like CaiT [ 45 ] and DeiTIII [ 44 ], which demonstrate exceptional performance without the need for large proprietary datasets. From the architectural design perspective, efforts have been concentrated on optimizing self-attention input resolution and devising attention mechanisms that incur lower computational costs. Innovations like PVT-v1 [ 49 ]'s emulation of CNN’s feature map pyramid, Swin-T [ 25 ] and LightViT [ 17 ]'s hierarchical feature map and shifted-window mechanisms, and the introduction of (multi-scale) deformable attention modules in Deformable DETR [ 62 ] exemplify these advancements. There is also NAS for ViTs [ 37 ].
【翻译】近年来,视觉任务领域主要由卷积神经网络(CNNs)和视觉Transformer(ViT)架构主导。专注于使这些架构轻量化以提高效率已成为研究中一个实用且有前景的方向。对于CNNs,在提高图像分类准确性方面取得了显著进步,这体现在ResNet [14]、RegNet [35]和DenseNet [18]等有影响力架构的发展上。这些进步在准确性方面设立了新的基准,但也引入了对轻量级架构的需求[51, 52]。这种需求已通过各种基于分解的方法得到解决,使CNNs更适合移动设备。例如,Xception引入的可分离卷积在这方面发挥了重要作用,导致了最先进的轻量级CNNs的发展,如MobileNets [15]、ShuffleNetv2 [27]、ESPNetv2 [29]、MixConv [42]、MNASNet [40]和GhostNets [13]。这些模型不仅多功能,而且训练相对简单。继CNNs之后,Transformers在各种视觉任务中获得了显著关注,如图像分类、目标检测和自动驾驶,迅速成为主流方法。Transformers的轻量级版本通过多种方法实现。在训练方面,采用了复杂的数据增强策略和技术,如Mixup [60]、CutMix [58]和RandAugment [6],在CaiT [45]和DeiTIII [44]等模型中可以看到,这些模型在不需要大型专有数据集的情况下展现出卓越性能。从架构设计角度来看,努力集中在优化自注意力输入分辨率和设计产生较低计算成本的注意力机制上。PVT-v1 [49]模拟CNN特征图金字塔、Swin-T [25]和LightViT [17]的层次化特征图和移位窗口机制,以及在Deformable DETR [62]中引入(多尺度)可变形注意力模块等创新都体现了这些进步。还有针对ViTs的NAS [37]。
【解析】早期的ResNet、DenseNet等模型虽然在精度上取得突破,但参数量和计算量的增长使得实际部署变得困难。为解决这个问题,研究者们开发了多种分解技术,其中可分离卷积是重要的突破之一。可分离卷积将标准卷积分解为深度卷积和逐点卷积两个步骤,大幅降低了参数量和计算量。基于这一思想,MobileNet系列、ShuffleNet系列等轻量级模型应运而生,它们在保持相当精度的同时将模型大小压缩到了可以在移动设备上运行的程度。Transformer在视觉领域的轻量化则面临着不同的挑战。研究者从两个维度来解决这个问题:训练策略的优化和架构设计的改进。在训练方面,通过更好的数据增强技术可以让模型在较小的数据集上达到更好的性能,从而降低了对大型专有数据集的依赖。在架构方面,研究者们设计了各种巧妙的注意力机制变种,比如分层注意力、局部窗口注意力、可变形注意力等,这些设计在保持Transformer全局建模能力的同时显著降低了计算复杂度。神经架构搜索(NAS)技术的引入进一步推动了这一发展,它能够自动发现更优的网络架构设计。
2.2 状态空间模型
The State Space Model (SSM) [9-11, 20, 31] is a family of architectures that encapsulates a sequence-to-sequence transformation and has the potential to handle tokens with long dependencies, but it is challenging to train due to its high computational and memory usage. Nevertheless, recent works [7-9, 11, 36] have enabled deep State Space Models to become progressively more competitive with CNN and Transformer. In particular, S4 [9] employs a Normal Plus Low-Rank (NPLR) representation to efficiently compute the convolution kernel by leveraging the Woodbury identity for matrix inversion. Then Mamba [7] enhances SSMs with input-specific parameterization and a scalable, hardware-optimized algorithm, achieving simpler design and superior efficiency in processing long sequences for language and genomics. Following the success of SSM, there has been a surge in applying the framework to computer vision tasks. S4ND [30] first introduces the SSM blocks into vision tasks, facilitating the modeling of visual data across 1D, 2D, and 3D as continuous signals. Vmamba [24] pioneers a Mamba-based vision backbone with a cross-scan module to address the direction-sensitivity issue arising from the differences between 1D sequences and multi-channel images. Similarly, Vim [61] introduces an efficient state space model for vision tasks by leveraging bidirectional state space modeling for data-dependent global visual context without image-specific biases. The impressive performance of the Mamba backbone in various vision tasks has inspired a wave of research [2, 22, 33, 48] focusing on adapting Mamba-based models for specialized vision applications. Recent works like Vm-unet [33], U-Mamba [22], and SegMamba [55] have adapted Mamba-based backbones for medical image segmentation, integrating unique features such as a U-shaped architecture in Vm-unet, an encoder-decoder framework in U-Mamba, and whole volume feature modeling in SegMamba. In the domain of graph representation, GraphMamba [48] integrates Graph Guided Message Passing (GMB) with Message Passing Neural Networks (MPNN) within the Graph GPS architecture, which enhances the training and contextual filtration for graph embeddings. Furthermore, GMNs [2] present a comprehensive framework that encompasses tokenization, optional positional or structural encoding, localized encoding, sequencing of tokens, and utilizes a series of bidirectional Mamba layers for processing graphs.
【翻译】状态空间模型(SSM)[9-11, 20, 31]是一类架构族,封装了序列到序列的变换,具有处理长依赖标记的潜力,但由于其高计算和内存使用量而难以训练。然而,最近的工作[7-9,11,36]使深度状态空间模型在与CNN和Transformer的竞争中变得越来越具有竞争力。特别是,S4 [9]采用正态加低秩(NPLR)表示,通过利用Woodbury恒等式进行矩阵求逆来高效计算卷积核。然后Mamba [7]通过输入特定的参数化和可扩展的硬件优化算法增强了SSMs,在处理语言和基因组学的长序列方面实现了更简单的设计和卓越的效率。随着SSM的成功,将该框架应用于计算机视觉任务的研究激增。S4ND [30]首先将SSM块引入视觉任务,促进了将1D、2D和3D视觉数据建模为连续信号。Vmamba [24]开创了基于mamba的视觉骨干网络,采用交叉扫描模块来解决1D序列和多通道图像差异产生的方向敏感性问题。类似地,Vim [61]通过利用双向状态空间建模为视觉任务引入了高效的状态空间模型,实现了数据依赖的全局视觉上下文而无图像特定偏差。Mamba骨干网络在各种视觉任务中的卓越性能激发了一波研究浪潮[2, 22, 33, 33, 48],专注于将基于Mamba的模型适配到专门的视觉应用中。最近的工作如Vm-unet [33]、U-Mamba [22]和SegMamba [55]已将基于Mamba的骨干网络适配用于医学图像分割,集成了独特的特征,如Vm-unet中的U形架构、U-Mamba中的编码器-解码器框架,以及SegMamba中的全体积特征建模。在图表示领域,GraphMamba [48]在Graph GPS架构中集成了图引导消息传递(GMB)和消息传递神经网络(MPNN),这增强了图嵌入的训练和上下文过滤。此外,GMNs [2]提出了一个综合框架,包括标记化、可选的位置或结构编码、局部编码、标记排序,并利用一系列双向Mamba层来处理图。
【解析】状态空间模型本质上是一种特殊的序列建模架构,它最初来源于控制理论中的状态空间概念。在深度学习中,SSM的核心思想是通过隐藏状态来建模序列数据的长期依赖关系。与RNN和LSTM相比,SSM具有更强的数学理论基础和更好的长序列处理能力,但传统SSM的训练确实存在计算复杂度高的问题。S4模型的突破在于引入了NPLR(正态加低秩)分解技术,这种技术巧妙地利用了矩阵的结构特性,通过Woodbury矩阵求逆恒等式将原本复杂的矩阵运算转化为更高效的形式。Mamba进一步优化了这个过程,不仅在算法层面进行了改进,还考虑了硬件实现的效率,特别是针对现代GPU的并行计算特性进行了专门的优化。当SSM开始应用到计算机视觉领域时,面临的主要挑战是如何将原本设计用于一维序列的模型适配到二维图像数据。S4ND通过将图像视为连续信号的多维扩展解决了这个问题,而Vmamba则提出了更为精妙的交叉扫描机制,通过多个方向的扫描来确保图像的二维结构信息不会丢失。
3 预备知识
3.1 State Space Models (S4)
State Space Models (SSMs) are a general family of sequence models used in deep learning that are influenced by systems capable of mapping one-dimensional sequences in a continuous manner. These models transform an input $D$-dimensional sequence $x(t)\in\mathbb{R}^{L\times D}$ into an output sequence $y(t)\in\mathbb{R}^{L\times D}$ by utilizing a learnable latent state $h(t)\in\mathbb{R}^{N\times D}$ that is not directly observable. The mapping process can be denoted as:
【翻译】状态空间模型(SSMs)是深度学习中使用的序列模型的一个通用族群,受能够以连续方式映射一维序列的系统影响。这些模型通过利用不可直接观察的可学习潜在状态h(t)∈RN×Dh(t)\in\mathbb{R}^{N\times D}h(t)∈RN×D,将输入的DDD维序列x(t)∈RL×Dx(t)\in\mathbb{R}^{L\times D}x(t)∈RL×D转换为输出序列y(t)∈RL×Dy(t)\in\mathbb{R}^{L\times D}y(t)∈RL×D。映射过程可以表示为:
【解析】在传统的序列建模中,模型需要直接处理输入序列到输出序列的映射,这在处理长序列时会遇到梯度消失和计算复杂度高的问题。SSM引入了一个中间的"隐藏状态"概念,这个隐藏状态就像是系统的内部记忆,它能够捕获和保持序列中的重要信息。这个隐藏状态h(t)h(t)h(t)的维度是N×DN\times DN×D,其中NNN是状态维度,DDD是特征维度。通过这种设计,模型不再需要直接处理复杂的长距离依赖关系,而是通过状态的连续演化来间接实现这种建模能力。这种方法的优势在于它能够以线性的计算复杂度处理任意长度的序列,这是相对于Transformer的二次复杂度的巨大优势。
$$h^{\prime}(t) = A h(t) + B x(t), \qquad y(t) = C h(t),$$
where $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{D\times N}$ and $C\in\mathbb{R}^{D\times N}$.
【翻译】其中A∈RN×N\pmb{A}\in\mathbb{R}^{N\times N}A∈RN×N,B∈RD×N\pmb{{B}}\in\mathbb{R}^{D\times N}B∈RD×N和C∈RD×N\pmb{C}\in\mathbb{R}^{D\times N}C∈RD×N。
【解析】这三个矩阵是SSM的核心参数,它们定义了系统的动态行为。矩阵AAA是状态转移矩阵,它控制着隐藏状态如何随时间演化,可以理解为系统的"记忆衰减"机制。矩阵BBB是输入矩阵,它决定了当前输入如何影响隐藏状态的更新。矩阵CCC是输出矩阵,它控制如何从隐藏状态中提取有用信息来生成最终输出。这种设计将复杂的序列建模问题分解为三个相对独立的子问题:状态演化、输入处理和输出生成。通过学习这三个矩阵的参数,模型能够自适应地发现序列数据中的模式和规律。
Discretization. Discretization aims to convert the continuous differential equations into discrete functions, aligning the model to the input signal's sampling frequency for more efficient computation [10]. Following the work [11], the continuous parameters $(A, B)$ can be discretized by the zero-order hold (ZOH) method into the discrete parameters $(\bar{A}, \bar{B})$ with a time step $\Delta$:
【翻译】离散化。离散化旨在将连续微分方程转换为离散函数,使模型与输入信号的采样频率对齐以实现更高效的计算[10]。根据工作[11],连续参数(A,B)(A,B)(A,B)可以通过零阶保持(ZOH)方法离散化为具有时间步长Δ\DeltaΔ的离散参数(Aˉ,Bˉ)(\bar{A},\bar{B})(Aˉ,Bˉ):
【解析】在实际的深度学习应用中,我们处理的都是离散的数据序列,而不是连续的信号。因此需要将连续时间的状态空间模型转换为离散时间版本。零阶保持方法是信号处理中的一种标准技术,它假设在每个采样间隔内,输入信号保持常数值。这种方法的核心思想是在时间步长Δ\DeltaΔ内,系统的输入保持不变,从而可以通过积分得到精确的离散化形式。这种离散化不仅保持了原始连续系统的数学性质,还使得模型能够直接处理数字化的序列数据,同时保证了数值计算的稳定性和效率。
$$\bar{A}=\exp(\Delta A), \qquad \bar{B}=(\Delta A)^{-1}(\exp(\Delta A)-I)\,\Delta B,$$
where $\bar{A}\in\mathbb{R}^{N\times N}$, $\bar{B}\in\mathbb{R}^{D\times N}$ and $\bar{C}\in\mathbb{R}^{D\times N}$.
【翻译】其中Aˉ∈RN×N\bar{\pmb{A}}\in\mathbb{R}^{N\times N}Aˉ∈RN×N,Bˉ∈RD×N\bar{B}\in\mathbb{R}^{D\times N}Bˉ∈RD×N和Cˉ∈RD×N\bar{C}\in\mathbb{R}^{D\times N}Cˉ∈RD×N。
【解析】这个公式保证了即使在离散化后,系统仍然保持其原有的稳定性和收敛性质。矩阵Aˉ\bar{A}Aˉ通过矩阵指数来计算,这确保了状态转移的平滑性。而Bˉ\bar{B}Bˉ的计算涉及到矩阵逆和矩阵指数的组合,这个公式实际上是对连续时间内输入对状态影响的精确积分结果。
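下面用一个极简的数值示例演示零阶保持(ZOH)离散化公式的计算方式,仅供理解公式本身;其中的维度 N、矩阵 A、B 的取值以及标量步长 delta 都是随意假设的,与论文或 Mamba 官方实现中逐通道、逐位置的参数化并不相同。

```python
import torch

# 假设的状态维度 N 与离散化步长 Δ(实际实现中 Δ 通常是逐通道、逐位置学习得到的)
N = 4
A = -0.5 * torch.eye(N)          # 连续时间状态转移矩阵 A (N x N),取稳定的负定矩阵作示例
B = torch.randn(N, 1)            # 输入矩阵 B (N x 1),对应单通道输入
delta = 0.1                      # 离散化步长 Δ

# 零阶保持(ZOH)离散化:
#   A_bar = exp(Δ A)
#   B_bar = (Δ A)^{-1} (exp(Δ A) - I) Δ B
A_bar = torch.matrix_exp(delta * A)
B_bar = torch.linalg.inv(delta * A) @ (A_bar - torch.eye(N)) @ (delta * B)

print(A_bar.shape, B_bar.shape)  # torch.Size([4, 4]) torch.Size([4, 1])
```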
To simplify calculations, the repeated application of Equation 2 can be efficiently performed simultaneously using a global convolution approach.
【翻译】为了简化计算,方程2的重复应用可以使用全局卷积方法同时高效地执行。
【解析】这里提到的全局卷积方法是SSM计算效率的关键突破。传统的递归计算需要逐步计算每个时间步的状态,这种顺序计算无法充分利用现代GPU的并行计算能力。通过将递归形式转换为卷积形式,我们可以利用高度优化的卷积算子来并行处理整个序列。这种转换不仅大幅提升了计算速度,还使得SSM能够享受到深度学习框架中针对卷积操作的各种优化技术。
$$y = x \circledast \overline{K}, \qquad \text{with } \overline{K} = (C\overline{B},\, C\overline{A}\overline{B},\, \ldots,\, C\overline{A}^{L-1}\overline{B}),$$
where $\circledast$ denotes the convolution operation, and $\overline{K}\in\mathbb{R}^{L}$ is the SSM kernel.
【翻译】⊛\circledast⊛表示卷积操作,K‾∈RL\overline{{\pmb K}}\in\mathbb{R}^{L}K∈RL是SSM核。
【解析】SSM核K‾\overline{K}K是整个状态空间模型的核心,它将复杂的递归计算压缩成了一个简单的卷积核。这个核的每个元素都对应着不同时间步长的影响权重,从CB‾C\overline{B}CB(当前时间步)到CA‾L−1B‾C\overline{A}^{L-1}\overline{B}CAL−1B(最远的历史时间步)。通过这种方式,模型能够在一次卷积操作中同时考虑所有历史信息对当前输出的影响。这种设计的好处在于它将时间维度上的复杂依赖关系转化为空间维度上的卷积操作,从而实现了计算效率和建模能力的完美平衡。这也解释了为什么SSM能够以线性复杂度处理长序列——因为卷积操作的复杂度是线性的。
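为了说明"递归形式"与"卷积形式"在数学上的等价性,下面给出一个单通道、单样本的教学性示例:先按时间步递推计算输出,再用显式构造的SSM核 $\overline{K}$ 做因果卷积,两者在数值误差范围内应当一致。这里的 A_bar、B_bar、C 等取值均为随意假设,且并非高效实现(真实实现会用并行扫描或FFT卷积)。

```python
import torch

torch.manual_seed(0)
L, N = 8, 4                                     # 序列长度 L,状态维度 N
A_bar = torch.matrix_exp(-0.3 * torch.eye(N))   # 离散化后的 A_bar(示例取值)
B_bar = 0.1 * torch.randn(N, 1)
C = torch.randn(1, N)
x = torch.randn(L)

# (1) 递归形式:h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
h = torch.zeros(N, 1)
y_rec = []
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).squeeze())
y_rec = torch.stack(y_rec)

# (2) 卷积形式:K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)
K, Ak_B = [], B_bar.clone()
for _ in range(L):
    K.append((C @ Ak_B).squeeze())
    Ak_B = A_bar @ Ak_B
K = torch.stack(K)

# 因果卷积:y_t = sum_{k<=t} K_k * x_{t-k}
y_conv = torch.stack([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

print(torch.allclose(y_rec, y_conv, atol=1e-5))  # True,两种形式等价
```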
3.2 Selective State Space Models (S6)
Mamba [ 7 ] improves the performance of SSM by introducing Selective State Space Models (S6), allowing the continuous parameters to vary with the input enhances selective information processing across sequences, which extend the discretization process by selection mechanism:
【翻译】Mamba [7] 通过引入选择性状态空间模型(S6)来改善SSM的性能,允许连续参数随输入变化,增强了跨序列的选择性信息处理,通过选择机制扩展了离散化过程:
【解析】这里介绍了Mamba模型的核心创新点。传统的SSM使用固定的参数矩阵来处理所有输入,就像用一把万能钥匙去开所有的锁。但Mamba意识到这种"一刀切"的方法并不高效,因为不同的输入内容应该有不同的处理方式。比如在处理一张图片时,如果图片中有重要的物体,我们希望模型能够重点关注这些区域;如果某些区域是背景噪声,我们希望模型能够适当忽略。S6的核心思想就是让模型的参数能够根据当前的输入内容动态调整,实现"因材施教"的效果。这种选择性机制使得模型能够更智能地决定哪些信息需要重点处理,哪些可以轻度处理,从而大幅提升了模型的表达能力和计算效率。
$$\bar{B} = s_{B}(x), \qquad \bar{C} = s_{C}(x), \qquad \Delta = \tau_{A}(\mathrm{Parameter} + s_{A}(x)),$$
where $s_{B}(x)$ and $s_{C}(x)$ are linear functions that project the input $x$ into an $N$-dimensional space, while $s_{A}(x)$ broadens a $D$-dimensional linear projection to the necessary dimensions. In terms of visual tasks, VMamba proposed the 2D Selective Scan (SS2D) [24], which maintains the integrity of 2D image structures by scanning four directed feature sequences. Each sequence is processed independently within an S6 block and then combined to form a comprehensive 2D feature map.
【翻译】其中sB(x)s_{B}(x)sB(x)和sC(x)s_{C}(x)sC(x)是将输入xxx投影到N维空间的线性函数,而sA(x)s_{A}(x)sA(x)将D维线性投影扩展到必要的维度。在视觉任务方面,VMamba提出了二维选择性扫描(SS2D)[24],通过扫描四个方向的特征序列来保持2D图像结构的完整性。每个序列在S6块内独立处理,然后组合形成综合的2D特征图。
【解析】这组公式展示了选择性机制的具体实现方式。与传统SSM中固定的参数矩阵不同,这里的Bˉ\bar{B}Bˉ、Cˉ\bar{C}Cˉ和Δ\DeltaΔ都变成了输入xxx的函数,说明这些关键参数现在能够根据输入内容进行自适应调整。函数sB(x)s_B(x)sB(x)和sC(x)s_C(x)sC(x)负责处理输入到状态和状态到输出的映射关系,它们通过线性变换将输入特征投影到合适的维度空间。而sA(x)s_A(x)sA(x)则控制时间步长的选择,这个参数特别重要,因为它决定了模型在处理序列时的"采样密度"。当遇到重要信息时,模型可以选择更小的时间步长来精细处理;当遇到不那么重要的信息时,可以选择更大的时间步长来快速跳过。对于视觉任务,VMamba将这种一维的选择性扫描扩展到二维空间,通过四个方向(通常是从左到右、从右到左、从上到下、从下到上)的扫描来全面捕获图像的空间信息。这种设计保证了模型既能处理局部细节,又能建立全局的空间关联,最终通过融合四个方向的信息来构建完整的特征表示。
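下面是选择性机制(S6)中"参数随输入变化"这一点的极简示意:用线性层由输入 x 生成 B、C 和 Δ(Δ 经 softplus 保证为正)。其中模块名 SelectiveParams、维度命名 d_model/d_state 以及把 τ 取为 softplus 都是示意性假设,与 Mamba 官方实现的细节(共享投影、低秩分解、硬件优化的并行扫描等)有明显差异。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """由输入动态生成 SSM 参数 B, C, Δ 的示意模块(非官方实现)。"""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)        # s_B(x): 输入 -> N 维
        self.s_C = nn.Linear(d_model, d_state)        # s_C(x): 输入 -> N 维
        self.s_Delta = nn.Linear(d_model, d_model)    # s_Δ(x): 逐通道的步长偏移
        self.delta_bias = nn.Parameter(torch.zeros(d_model))  # 公式中的 "Parameter" 项

    def forward(self, x):                              # x: (batch, L, d_model)
        B = self.s_B(x)                                # (batch, L, N),随输入变化
        C = self.s_C(x)                                # (batch, L, N)
        # Δ = τ_Δ(Parameter + s_Δ(x)),此处 τ 取 softplus 保证步长为正
        Delta = F.softplus(self.delta_bias + self.s_Delta(x))
        return B, C, Delta

params = SelectiveParams(d_model=96, d_state=16)
B, C, Delta = params(torch.randn(2, 64, 96))
print(B.shape, C.shape, Delta.shape)  # (2, 64, 16) (2, 64, 16) (2, 64, 96)
```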
4 Method
In order to design light-weight models that are friendly to resource-limited devices, we propose EfficientVMamba, which is summarized in Figure 3. We introduce an efficient selective scan approach to reduce the computational complexity in Section 4.1, and build a block considering both global and local feature extraction with the integration of SSMs and CNNs in Section 4.2. Regarding the design of the architecture, Section 4.4 then offers an in-depth look at various architectural variations tailored to different model sizes.
【翻译】为了设计对资源受限设备友好的轻量化模型,我们提出了EfficientVMamba,如图3所示。我们在4.1节中引入了一种高效的选择性扫描方法来降低计算复杂度,并在4.2节中构建了一个同时考虑全局和局部特征提取的块,该块集成了SSM和CNN。关于架构设计,4.4节深入探讨了针对不同模型大小量身定制的各种架构变体。
【解析】作者提出的解决方案包含三个关键层面:首先是算法层面的创新,通过改进选择性扫描机制来降低计算复杂度;其次是架构层面的融合,将SSM的全局建模能力与CNN的局部特征提取优势相结合,实现优势互补;最后是系统层面的优化,针对不同的应用场景和硬件约束提供多种模型规格。
4.1 Efficient 2D Scanning (ES2D)
In deep neural networks, downsampling via pooling or strided convolution is employed to broaden the receptive field with a lower computational cost; however, this comes at the expense of spatial resolution. Previous works [46, 57] demonstrate that applying an atrous-based strategy benefits broadening the receptive field without sacrificing resolution. Inspired by this observation and aiming to alleviate and lighten the computational complexity of selective scanning, we propose an efficient 2D scanning (ES2D) method to scale down the visual selective scan block (SS2D) via skip sampling of the patches on the feature map. Given an input feature map $X\in\mathbb{R}^{C\times H\times W}$, instead of cross-scanning all patches, we skip-scan patches with a step size $p$ and partition them into selected spatial dimensional features $\{O_{i}\}_{i=1}^{4}$:
【翻译】在深度神经网络中,通过池化或带步长的卷积进行下采样可以以较低的计算成本扩大感受野;然而,这是以牺牲空间分辨率为代价的。先前的工作[46, 57]证明了应用基于空洞(atrous)的策略有利于在不牺牲分辨率的情况下扩大感受野。受此观察启发,并旨在减轻和简化选择性扫描的计算复杂度,我们提出了一种高效的2D扫描(ES2D)方法,通过对特征图中每个补丁进行跳跃采样来缩减视觉选择性扫描块(SS2D)。给定输入特征图X∈RC×H×W\pmb{X}\in\mathbb{R}^{C\times H\times W}X∈RC×H×W,我们不是交叉扫描整个补丁,而是以步长ppp跳跃扫描补丁并分割成选定的空间维度特征{Oi}i=14\{O_{i}\}_{i=1}^{4}{Oi}i=14:
【解析】这段话介绍了ES2D方法的思路。传统的下采样方法虽然能够减少计算量,但会损失图像的细节信息,这对于需要精确空间信息的视觉任务来说是不可接受的。空洞卷积技术提供了一个解决方案,它通过在卷积核中插入空洞来扩大感受野,同时保持输出分辨率不变。ES2D方法借鉴了这种思想,但应用在状态空间模型的扫描机制上。与传统SSM需要逐个处理所有空间位置不同,ES2D采用跳跃采样策略,按照固定步长ppp选择性地处理特征图中的部分位置。这种策略的巧妙之处在于它不是随机丢弃信息,而是有规律地采样,确保在减少计算量的同时尽可能保留重要的空间结构信息。通过将完整的特征图分解为四个不同方向的子特征图,ES2D能够在降低计算复杂度的同时维持对空间信息的全面捕获。
$$O_{i} \xleftarrow{\text{scan}} X[:, m::p, n::p],$$
$$\{\tilde{O}_{i}\}_{i=1}^{4} \leftarrow \operatorname{SS2D}(\{O_{i}\}_{i=1}^{4}),$$
$$Y[:, m::p, n::p] \xleftarrow{\text{merge}} \tilde{O}_{i},$$
$$(m, n) = \left(\left\lfloor\tfrac{1}{2} + \tfrac{1}{2}\sin\left(\tfrac{\pi}{2}(i-2)\right)\right\rfloor,\; \left\lfloor\tfrac{1}{2} + \tfrac{1}{2}\cos\left(\tfrac{\pi}{2}(i-2)\right)\right\rfloor\right),$$
where $O_{i}, \tilde{O}_{i}\in\mathbb{R}^{C\times\frac{H}{p}\times\frac{W}{p}}$ and the operation $[:, m::p, n::p]$ represents slicing the matrix for each channel, starting at $m$ on the height ($H$) and $n$ on the width ($W$), skipping every $p$ steps. The process decomposes the full scanning method into both local and global sparse forms. Skip sampling for local receptive fields reduces computational complexity by selectively scanning smaller patches of the feature map. With a step size $p$, we sample the $(C, H/p, W/p)$ patches at intervals of $p$, compared to $(C, H, W)$ in SS2D, decreasing the number of tokens processed from $N$ to $\frac{N}{p^{2}}$ for each scan and merge operation, which improves feature extraction efficiency. Re-grouping for global spatial feature maps in ES2D involves combining the processed patches to reconstruct the global structure of the feature map. This integration captures broader contextual information, balancing local detail and global context in feature extraction. Accordingly, our design is intended to streamline the scanning and merging modules while maintaining the essential benefit of global integration in the state-space architecture, with the aim of ensuring that the feature extraction remains comprehensive on the spatial axis.
【翻译】其中Oi,O~i∈RC×Hp×Wp\boldsymbol{O}_{i},\tilde{\boldsymbol{O}}_{i}\in\mathbb{R}^{C\times\frac{H}{p}\times\frac{W}{p}}Oi,O~i∈RC×pH×pW,操作[:,m::p,n::p][:,m::p,n::p][:,m::p,n::p]表示对每个通道的矩阵进行切片,从高度(H)(H)(H)上的mmm和宽度(W)(W)(W)上的nnn开始,每ppp步跳跃一次。该过程将完全扫描方法分解为局部和全局稀疏形式。局部感受野的跳跃采样通过选择性扫描特征图的较小补丁来降低计算复杂度。使用步长ppp,我们以ppp的间隔采样(C,H/p,W/p)(C,H/p,W/p)(C,H/p,W/p)补丁,与SS2D中的(C,H,W)(C,H,W)(C,H,W)相比,将每次扫描和合并操作处理的标记数量从NNN减少到Np2\frac{N}{p^{2}}p2N,这提高了特征提取效率。ES2D中全局空间特征图的重新分组涉及组合处理过的补丁以重建特征图的全局结构。这种集成捕获更广泛的上下文信息,在特征提取中平衡局部细节和全局上下文。因此,我们的设计旨在简化扫描和合并模块,同时保持状态空间架构中全局集成的基本优势,目标是确保特征提取在空间轴上保持全面性。
【解析】这组公式展示了ES2D方法的具体实现细节。第一个公式说明如何从原始特征图中按照特定模式提取子区域,这里$(m,n)$的计算方式为四个分组提供了不同的起始偏移。在步长$p=2$时,四个分组的起始偏移意在取遍$(0,0)$、$(0,1)$、$(1,0)$、$(1,1)$,确保了对整个特征图的均匀覆盖。第二个公式将提取的子区域输入到标准的SS2D模块中进行处理,这保证了状态空间模型的核心功能得以保留。第三个公式则将处理后的结果重新组装回原始的空间布局。计算复杂度的大幅降低:原本需要处理$N=H\times W$个位置,现在每组只需要处理$\frac{N}{p^2}$个位置,当$p=2$时,每次扫描的计算量就降低到原来的四分之一。但这种降低不是简单的信息丢失,而是通过智能的采样和重组策略,确保重要的空间关系得以保留。重组过程通过将四个方向的处理结果融合,能够重建出接近完整分辨率的特征表示,从而在效率和性能之间找到了良好的平衡点。
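下面用张量切片直观演示 ES2D 的"跳跃采样—分组处理—重组"过程(步长 p=2)。其中对每组子特征的处理用一个占位函数 ss2d_placeholder 代替真实的选择性扫描;四个起始偏移取 $\{0,1\}\times\{0,1\}$ 的组合,是对上面公式意图的一种解读。函数名与代码组织方式均为示意性假设,目的只是展示 X[:, :, m::p, n::p] 的采样与按原位置的合并如何把 N 个 token 拆成 4 组、每组 N/p² 个。

```python
import torch

def ss2d_placeholder(o: torch.Tensor) -> torch.Tensor:
    """占位:真实实现中这里是对每组子特征做四方向选择性扫描(SS2D)。"""
    return o  # 示例中直接恒等返回

def es2d_skip_scan(x: torch.Tensor, p: int = 2) -> torch.Tensor:
    """ES2D 示意:以步长 p 跳跃采样出若干组子特征,分别处理后按原位置合并。"""
    B, C, H, W = x.shape
    assert H % p == 0 and W % p == 0
    y = torch.zeros_like(x)
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]        # p=2 时的四个起始偏移 (m, n)
    for m, n in offsets:
        o = x[:, :, m::p, n::p]                        # 采样:每组形状 (C, H/p, W/p)
        o_tilde = ss2d_placeholder(o)                  # 组内做选择性扫描(此处为占位)
        y[:, :, m::p, n::p] = o_tilde                  # 合并:写回原空间位置
    return y

x = torch.randn(1, 96, 56, 56)
y = es2d_skip_scan(x, p=2)
print(y.shape)   # torch.Size([1, 96, 56, 56])
# 每组 token 数从 N = 56*56 降到 N/p^2 = 28*28,四组合计仍覆盖全部空间位置
```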
4.2 Efficient Visual State Space Block (EVSS)
Based on the efficient selective scan approach, we introduce the Efficient Visual State Space (EVSS) block, which is designed to synergistically merge global and local feature representations while maintaining computational efficiency. It leverages the ES2D module for global information capture and a convolutional branch tailored to extract critical local features, with both branches undergoing a subsequent Squeeze-Excitation (SE) block [16]. The ES2D module aims to efficiently abstract global contextual information by implementing the intelligent skipping mechanism presented in Section 4.1. It selectively scans the map with a step size $p$, reducing redundancy without sacrificing the representational quality of the global context in the resultant spatial dimensional features. Parallel to this, empirical evidence concurs that convolutional operations offer a more proficient approach to feature extraction, particularly in scenarios where local representations are adequate. The added convolutional branch concentrates on discerning fine-grained local details through a $3\times3$ convolution of stride 1. The subsequent SE block adaptively recalibrates the features, allowing the network to automatically re-balance the local and global receptive fields on the feature map.
【翻译】基于高效选择性扫描方法,我们引入了高效视觉状态空间(EVSS)块,该块旨在协同融合全局和局部特征表示,同时保持计算效率。它利用ES2D模块进行全局信息捕获,以及一个专门用于提取关键局部特征的卷积分支,两个分支都经过后续的压缩激励(SE)块[16]。ES2D模块旨在通过实施4.1节中提出的智能跳跃机制来高效地抽象全局上下文信息。它以步长$p$选择性地扫描特征图,在不牺牲结果空间维度特征中全局上下文表示质量的情况下减少冗余。与此同时,经验证据表明卷积操作在特征提取方面提供了更高效的方法,特别是在局部表示足够的场景中。我们添加的卷积分支专注于通过步长为1的$3\times3$卷积来识别细粒度的局部细节。后续的SE块自适应地重新校准特征,允许网络自动重新平衡特征图上的局部和全局感受野。
【解析】EVSS块是作者提出的核心创新模块,传统的网络架构往往偏向于某一种特征提取方式,要么专注于局部细节(如CNN),要么擅长全局建模(如Transformer),很难同时兼顾两者的优势。EVSS块通过双分支并行设计巧妙地解决了这个问题。第一个分支使用改进的ES2D模块,它继承了状态空间模型强大的全局序列建模能力,能够捕获图像中远距离像素之间的依赖关系,这对于理解图像的整体结构和语义至关重要。第二个分支采用传统的3×33\times33×3卷积操作,专门负责提取局部纹理、边缘等细节特征,这些特征对于图像的精确识别不可或缺。更重要的是,每个分支后面都配备了SE注意力机制,能够根据输入内容的特点动态调整全局和局部特征的重要性权重。当图像内容需要更多全局理解时(比如场景识别),SE模块会增强全局分支的输出;当需要关注细节时(比如纹理分析),则会强化局部分支的贡献。这种自适应的特征融合机制使得EVSS块能够根据不同的视觉任务和输入内容自动调整其行为模式,实现真正的智能化特征提取。
Fig. 3: Architecture overview of EfficientVMamba. We highlight our contributions with corresponding colors in the Figure. (1) ES2D (Section 4.1): Atrous-based selective scanning strategy via skip sampling and regrouping in the spatial space. (2) EVSS (Section 4.2): The EVSS block merges global and local feature extraction with modified ES2D and convolutional approaches enhanced by Squeeze-Excitation blocks for refined dual-pathway feature representation. (3) Inverted Fusion (Section 4.3): Inverted Fusion places local-representation modules in deep layers, deviating from traditional designs by utilizing EVSS blocks early for global representation and inverted residual blocks later for local feature extraction.
【翻译】图3:EfficientVMamba的架构概览。我们在图中用相应的颜色突出显示了我们的贡献。(1)ES2D 4.1:通过在空间中进行跳跃采样和重新分组的基于空洞的选择性扫描策略。(2)EVSS 4.2:EVSS块将全局和局部特征提取与修改的ES2D和卷积方法相结合,通过压缩激励块增强,以实现精细的双路径特征表示。倒置融合4.3:倒置融合将局部表示模块放置在深层,通过在早期利用EVSS块进行全局表示和在后期使用倒置残差块进行局部特征提取,偏离了传统设计。
【解析】这个架构图展示了EfficientVMamba的整体设计思路和三个主要创新点。从图中可以看出,作者采用了一种与传统轻量级网络截然不同的设计策略。传统的轻量级网络通常在前面几层使用计算效率高的卷积操作来快速降低特征图尺寸,然后在后面几层引入全局建模模块。但EfficientVMamba反其道而行之,在网络的前期就引入了具有全局建模能力的EVSS块,这样做的好处是能够在高分辨率特征图上就开始建立全局的空间关联,为后续的特征处理奠定良好的基础。而在网络的后期,当特征图尺寸已经较小时,使用计算高效的倒置残差块来进行局部特征的精细化处理。这种"倒置"的设计理念充分利用了状态空间模型计算复杂度为线性的优势,使得在高分辨率下进行全局建模变得可行。同时,ES2D的跳跃采样策略进一步降低了计算成本,使得这种设计在实际应用中具备了可行性。
The outputs of the respective SE blocks are combined via element-wise summation to construct the EVSS’s output and the dual pathway could be denoted as:
【翻译】各自SE块的输出通过逐元素求和组合来构建EVSS的输出,双路径可以表示为:
【解析】这里描述的是EVSS块中两个并行分支的融合机制。逐元素求和是最简单也是最有效的特征融合方式之一,不需要额外的参数,计算成本极低,但能够有效地整合来自不同路径的信息,保持特征的维度和空间结构不变,同时允许两个分支的特征在每个空间位置上进行直接的信息交换和增强。相比于其他融合方式如拼接或复杂的注意力机制,逐元素求和既保证了计算效率,又避免了参数量的增加,有利于轻量级网络的设计。
$$X^{l+1} = \mathrm{SE}(\mathrm{ES2D}(X^{l})) + \mathrm{SE}(\mathrm{Conv}(X^{l})),$$
where $X^{l}$ represents the feature map of the $l$-th layer and $\operatorname{SE}(\cdot)$ is the Squeeze-Excitation operation. With each pathway utilizing an SE block, the EVSS ensures that the respective features of global and local information are dynamically re-balanced to emphasize the most salient features. This fusion aims to preserve the integrity of both the expansive global perspective and the intricate local specifics, facilitating a comprehensive feature representation.
【翻译】其中XlX^{l}Xl表示第lll层的特征图,SE(⋅)\operatorname{SE}(\cdot)SE(⋅)是压缩激励操作。通过每个路径都使用SE块,EVSS确保全局和局部信息的相应特征被动态重新平衡,以强调最显著的特征。这种融合旨在保持广阔的全局视角和复杂的局部细节的完整性,促进全面的特征表示。
【解析】ES2D分支负责捕获全局上下文信息,它通过跳跃采样策略在保持计算效率的同时获得长距离的空间依赖关系。Conv分支则专注于局部特征提取,使用标准的卷积操作来识别纹理、边缘等局部模式。两个分支分别通过SE模块进行特征重要性的自适应调节,这种设计确保了网络能够根据输入内容的特点来动态调整全局和局部特征的贡献权重。最终的逐元素相加操作不仅实现了特征融合,更重要的是创建了一种互补的特征表示,其中全局信息为局部特征提供上下文指导,而局部细节为全局理解提供精确的基础。这种双向的信息增强机制使得EVSS块能够产生既具有全局一致性又富含局部细节的综合特征表示,为后续的视觉任务提供了高质量的特征基础。SE模块的引入进一步增强了这种协同效应,通过通道级别的注意力机制来突出最重要的特征通道,抑制噪声和冗余信息。
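下面给出 EVSS 双路径数据流的一个 PyTorch 风格示意:全局分支用 nn.Identity 占位代替真实的 ES2D 状态空间扫描,局部分支是 3×3、步长 1 的卷积,两路各接一个 SE 通道注意力后逐元素相加。SE 的压缩比例、是否使用归一化/激活等细节均为假设,只用于说明公式 $X^{l+1} = \mathrm{SE}(\mathrm{ES2D}(X^l)) + \mathrm{SE}(\mathrm{Conv}(X^l))$ 的结构,并非论文官方实现。

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation 通道注意力(压缩比例 r 为假设值)。"""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)   # 按通道重新加权

class EVSSBlockSketch(nn.Module):
    """EVSS 块的数据流示意:全局分支(ES2D 占位)+ 局部分支(3x3 卷积),各接 SE 后逐元素相加。"""
    def __init__(self, channels: int):
        super().__init__()
        self.global_branch = nn.Identity()   # 占位:真实实现为 ES2D 选择性扫描模块
        self.local_branch = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.se_global = SEBlock(channels)
        self.se_local = SEBlock(channels)

    def forward(self, x):
        return self.se_global(self.global_branch(x)) + self.se_local(self.local_branch(x))

block = EVSSBlockSketch(96)
print(block(torch.randn(1, 96, 56, 56)).shape)   # torch.Size([1, 96, 56, 56])
```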
4.3 Inverted Insertion of EfficientNet Blocks(EfficientNet块的倒置插入)
As a well-established consensus, the computational efficiency of convolutional operations is higher than that of global-based blocks such as Transformer. Prior light-weight work has predominantly employed computation-efficient convolutions in the former stages to scale down the token numbers and reduce computational complexity, subsequently integrating global-based blocks (e.g., Transformer with a computational complexity of $\mathcal{O}(N^{2})$) to capture global context in the latter stages. For example, MobileViT [28] adopts pure MobileNetV2 blocks in the first two downsampling stages, while only integrating self-attention operations in the latter stages at low resolutions. EfficientFormer [19] introduces two types of base blocks: the convolution-based blocks with local pooling are used in the first three stages, and the transformer-like self-attention blocks are only leveraged in the last stage.
【翻译】作为一个已确立的共识,卷积操作的计算效率比基于全局的块(如Transformer)更高效。先前的轻量级工作主要在前期阶段采用计算高效的卷积来缩减标记数量以减少计算复杂度,随后在后期阶段集成基于全局的块(例如,具有O(N2)\mathcal{O}(N^{2})O(N2)计算复杂度的Transformer)来捕获全局上下文。例如,MobileViT在前两个下采样阶段采用纯MobileNetV2块,仅在低分辨率的后期阶段集成自注意力操作。EfficientFormer[19]引入了两种类型的基础块,基于卷积的块与局部池化在前三个阶段使用,而类似transformer的自注意力块仅在最后阶段使用。
【解析】卷积操作的计算复杂度相对较低,特别是在处理高分辨率特征图时,而全局建模模块(如Transformer的自注意力机制)的计算复杂度往往是平方级别的,在高分辨率下会产生巨大的计算开销。因此,几乎所有的轻量级网络都采用了"前卷积后全局"的设计策略:在网络的前几层使用高效的卷积操作快速降低特征图的空间分辨率,减少后续处理的数据量,然后在较低分辨率的特征图上应用计算密集的全局建模模块。MobileViT和EfficientFormer就是这种设计理念的典型代表,它们在网络的早期阶段大量使用MobileNet的深度可分离卷积或普通卷积来进行特征提取和尺寸压缩,只在网络的最后几层引入自注意力机制来建立全局的特征关联。这种设计在计算资源受限的环境下是合理的,但也限制了网络在高分辨率下进行全局建模的能力。
However, the observation is the opposite for the Mamba-based block. In the SSM framework, the computational complexity for global representation is $\mathcal{O}(N)$, indicating that placing local representation modules at either the front or the back of the stages could be reasonable. Through the empirical observations in Table 6, we found that positioning these local-representation modules towards the latter layers of the network yields better results. This discovery significantly deviates from the design principles of previous CNN-based and Transformer-based lightweight models, thereby we call it inverted insertion. Consequently, our designed $L$-stage architecture is an inverted insertion of EfficientNet blocks (MobileNetV2 blocks with SE modules), which utilizes EVSS blocks (Section 4.2) in the former two stages to capture global representations and Inverted Residual blocks $\mathrm{InRes}(\cdot)$ [34] in the subsequent stages to extract local feature maps:
【翻译】然而,在基于Mamba的块上观察结果是相反的。在SSM框架中,全局表示的计算复杂度是O(N)\mathcal O(N)O(N),这表明将局部表示模块放置在阶段的前面或后面都可能是合理的。通过表6中的经验观察,我们发现将这些局部表示模块定位在阶段的后层会产生更好的结果。这一发现显著偏离了先前基于CNN和基于Transformer的轻量级模型的设计原则,因此我们称之为倒置插入。因此,我们设计的LLL阶段架构是EfficientNet块(带有SE模块的MobileNetV2块)的倒置插入,它在前两个阶段利用EVSS块4.2来捕获全局表示,在后续阶段使用倒置残差块InRes(⋅)(\cdot)(⋅)[34]来提取局部特征图:
【解析】状态空间模型的最大优势在于其线性的计算复杂度O(N)\mathcal O(N)O(N),线性复杂度说明即使在高分辨率的特征图上进行全局建模,计算成本也是可以接受的,这就为网络设计提供了全新的可能性。作者通过大量实验发现,将全局建模模块放在网络前期、局部特征提取模块放在网络后期,能够获得更好的性能表现。这种"倒置"的设计,背后的原理在于:在网络前期,特征图分辨率较高,包含丰富的细节信息,此时进行全局建模能够更好地建立像素间的长距离依赖关系,为后续的特征处理提供全局的上下文指导;而在网络后期,特征图已经经过多次抽象,空间尺寸较小,此时使用高效的卷积操作进行局部特征的精细化处理更为合适。
$$X^{l+1} = \begin{cases} \mathrm{EVSS}(X^{l}) & \text{if } X^{l} \in \{\text{stage1}, \text{stage2}\}; \\ \mathrm{InRes}(X^{l}) & \text{otherwise}, \end{cases}$$
【解析】这个分段函数清晰地定义了EfficientVMamba的倒置插入策略。在网络的前两个阶段(stage1和stage2),使用EVSS块进行特征处理,这些阶段对应于较高分辨率的特征图,EVSS块中的ES2D组件能够高效地进行全局特征建模,建立长距离的空间依赖关系。在其余阶段,使用倒置残差块(InRes)进行局部特征提取,这些阶段的特征图分辨率较低,使用计算高效的卷积操作更为合适。不同网络深度处有着不同的特征特性:浅层特征富含空间细节信息,适合进行全局关联建模;深层特征已经高度抽象,更适合进行局部模式的精细化识别。这种设计充分利用了状态空间模型和卷积操作各自的优势,在不同的网络阶段发挥最适合的特征处理能力。
where $X^{l}$ is the feature map in the $l$-th layer. The inverted insertion design of using the shortcuts directly between the bottlenecks is considerably more memory efficient [34].
【翻译】其中XlX^{l}Xl是第lll层的特征图。使用瓶颈间直接快捷连接的倒置插入设计在内存方面相当高效[34]。
【解析】倒置残差块的核心设计理念是在低维瓶颈层之间建立快捷连接,而不是在高维的扩展层之间。这种设计的内存优势:首先,快捷连接建立在维度较小的特征图之间,需要保存的中间结果更少,显著降低了内存占用;其次,这种设计允许在前向传播过程中更早地释放一些中间特征图的内存,提高了内存的使用效率;最后,在反向传播过程中,梯度可以更直接地传播,减少了需要缓存的中间梯度信息。这种内存高效的设计对于资源受限的移动设备和边缘计算场景尤为重要,结合EVSS块的全局建模能力和倒置残差块的内存效率,EfficientVMamba实现了性能和效率的良好平衡。
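下面用几行骨架代码表示"倒置插入"的阶段划分:前两个阶段堆叠 EVSS 类型的块(全局建模,高分辨率下复杂度仍为线性),后面的阶段换成倒置残差块(局部特征、低分辨率)。阶段数、每阶段块数与通道数均为示意性假设;InvertedResidualSketch 是对 MobileNetV2 倒置残差块的简化写法,evss_block 参数以 nn.Identity 占位,并非论文的真实模块实现。

```python
import torch
import torch.nn as nn

class InvertedResidualSketch(nn.Module):
    """MobileNetV2 倒置残差块的简化示意(扩展 -> 深度卷积 -> 压缩,带残差连接)。"""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        return x + self.block(x)

def build_stages(channels=(48, 96, 192, 384), depths=(2, 2, 4, 2), evss_block=nn.Identity):
    """倒置插入示意:stage1/2 用 EVSS(此处以 evss_block 占位),后续阶段用倒置残差块。"""
    stages = []
    for i, (c, d) in enumerate(zip(channels, depths)):
        if i < 2:
            blocks = [evss_block() for _ in range(d)]               # 早期阶段:全局 SSM 块
        else:
            blocks = [InvertedResidualSketch(c) for _ in range(d)]  # 深层阶段:局部卷积块
        stages.append(nn.Sequential(*blocks))
    return nn.ModuleList(stages)

stages = build_stages()
print([len(s) for s in stages])   # [2, 2, 4, 2]
```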
4.4 模型变体(Model Variants)
To sufficiently demonstrate the effectiveness of our proposed model, we detail architectural variants rooted in plain structures as referenced in [61]. These variants are designated as EfficientVMamba-T, EfficientVMamba-S, and EfficientVMamba-B, shown in Table 1, corresponding to different scales of the model. EfficientVMamba-T is the most lightweight with 6M parameters, followed by EfficientVMamba-S with 11M, and EfficientVMamba-B being the most complex with 33M. In terms of computational load, measured in FLOPs, the models exhibit a parallel increase with 0.8G for EfficientVMamba-T, 1.3G for EfficientVMamba-S, and 4.0G for EfficientVMamba-B, correlating directly with their complexity and feature size.
【翻译】为了充分证明我们提出模型的有效性,我们详细介绍了基于[61]中所引用的简单结构的架构变体。这些变体被命名为EfficientVMamba-T、EfficientVMamba-S和EfficientVMamba-B,如表1所示,对应于模型的不同规模。EfficientVMamba-T是最轻量级的,具有6M参数,其次是EfficientVMamba-S的11M参数,而EfficientVMamba-B是最复杂的,具有33M参数。在以FLOPs衡量的计算负载方面,模型呈现平行增长,EfficientVMamba-T为0.8G,EfficientVMamba-S为1.3G,EfficientVMamba-B为4.0G,这与它们的复杂性和特征大小直接相关。
Table 1: Model variants of EfficientVMamba.
表1:EfficientVMamba的模型变体。
5 Experiments
To rigorously evaluate the performance of our diverse model variants, we demonstrate the results of image classification task in Section 5.1 , investigate object detection performance in Section 5.2 and explore the image semantic segmentation in Section 5.3 . In section 5.4 We further pursued ablation study to comprehensively examine the effects of atrous selective scanning , the impact of SSM-Conv fusion blocks, and the implications of incorporating convolution blocks at different stages of the models.
【翻译】为了严格评估我们多样化模型变体的性能,我们在第5.1节展示了图像分类任务的结果,在第5.2节研究了目标检测性能,在第5.3节探索了图像语义分割。在第5.4节中,我们进一步进行了消融研究,以全面检验空洞选择性扫描的效果、SSM-Conv融合块的影响,以及在模型不同阶段引入卷积块的意义。
5.1 ImageNet图像分类
Training strategies. Following previous works [24, 25, 43, 61], we train our models for 300 epochs with a base batch size of 1024 and an AdamW optimizer; a cosine annealing learning rate schedule is adopted with an initial value of $10^{-3}$ and a 20-epoch warmup. For training data augmentation, we use random cropping, AutoAugment [5] with policy rand-m9-mstd0.5, and random erasing of pixels with a probability of 0.25 on each image; then a MixUp [60] strategy with ratio 0.2 is adopted in each batch. An exponential moving average of the model is adopted with decay rate 0.9999.
【翻译】训练策略。遵循先前的工作[24, 25, 43, 61],我们使用基础批次大小为1024的AdamW优化器训练我们的模型300个epochs,采用余弦退火学习率调度,初始值为10−310^{-3}10−3,并进行20个epoch的预热。对于训练数据增强,我们使用随机裁剪、带有策略rand-m9-mstd0.5的AutoAugment[5],以及在每张图像上以0.25的概率随机擦除像素,然后在每个批次中采用比例为0.2的MixUp[60]策略。采用衰减率为0.9999的模型指数移动平均。
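为方便查阅,下面把上述训练策略汇总为一个常见训练脚本风格的超参数字典;字段名是示意性的,并不对应某个特定代码库的真实参数名,数值均取自上文描述。

```python
# 训练超参数汇总(字段名为示意,数值来自上文的训练策略描述)
train_config = {
    "epochs": 300,
    "batch_size": 1024,
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "base_lr": 1e-3,
    "warmup_epochs": 20,
    "augmentation": {
        "random_crop": True,
        "auto_augment": "rand-m9-mstd0.5",
        "random_erasing_prob": 0.25,
        "mixup_ratio": 0.2,
    },
    "model_ema_decay": 0.9999,
}
```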
Tiny Models ($\mathrm{FLOPs(G)}\in[0,1]$). In the pursuit of efficiency, the results of tiny models are shown in Table 2. EfficientVMamba-T achieves state-of-the-art performance with a Top-1 accuracy of 76.5%, rivalling its counterparts that demand higher computational costs. With a modest expenditure of only 0.8 GFLOPs, our model surpasses PVTv2-B0 by a 6% margin in accuracy and outperforms MobileViT-XS by 1.7%, all with less computational demand.
【翻译】微型模型(FLOPs(G)∈[0,1]FLOPs(G)\in[0,1]FLOPs(G)∈[0,1])。在追求效率的过程中,微型模型的结果如表2所示。EfficientVMamba-T以76.5%的Top-1准确率实现了最先进的性能,与需要更高计算成本的同类模型相匹敌。仅需0.8 GFLOPs的适度开销,我们的模型在准确率上以6%的优势超越PVTv2-B0,并以1.7%的优势超越MobileViT-XS,所有这些都是在更低的计算需求下实现的。
Small Models ($\mathrm{FLOPs(G)}\in[1,2]$). Our model, EfficientVMamba-S, exhibits a significant improvement in accuracy, achieving a Top-1 accuracy of 78.7%. This represents a substantial increase over DeiT-Ti and MobileViT-S, which achieve 72.2% and 78.4% respectively. Notably, EfficientVMamba-S maintains this high accuracy level with computational efficiency, requiring only 1.3 GFLOPs, which is on par with DeiT-Ti and lower than MobileViT-S's 2.0 GFLOPs.
【翻译】小型模型(FLOPs(G)∈[1,2]FLOPs(G)\in[1,2]FLOPs(G)∈[1,2])。我们的模型EfficientVMamba-S在准确率上表现出显著改进,实现了78.7%的Top-1准确率。这相对于分别达到72.2%和78.4%的DeiT-Ti和MobileViT-S来说是大幅提升。值得注意的是,EfficientVMamba-S在保持高准确率水平的同时具有计算效率,仅需要1.3 GFLOPs,这与DeiT-Ti相当,并且低于MobileViT-S的2.0 GFLOPs。
Table 2: Comparison of different backbones on ImageNet-1K classification.
【翻译】表2:不同骨干网络在ImageNet-1K分类上的比较。
Base Models ($\mathrm{FLOPs(G)}\in[4,5]$). EfficientVMamba-B achieves an impressive Top-1 accuracy of 81.8%, surpassing DeiT-S by 2% and Vim-S by 1.5%, as indicated in the third group of Table 2. This base model demonstrates the feasibility of coupling a substantial parameter count of 33M with a modest computational demand of 4.0 GFLOPs. In comparison, VMamba-T, with a similar parameter count of 22M, requires a higher 5.6 GFLOPs.
【翻译】基础模型(FLOPs(G)∈[4,5]FLOPs(G)\in[4,5]FLOPs(G)∈[4,5])。EfficientVMamba-B实现了令人印象深刻的81.8%的Top-1准确率,如表2第三组所示,超越DeiT-S 2%,超越Vim-S 1.5%。这个基础模型证明了将33M的大量参数数量与4.0 GFLOPs的适度计算需求相结合的可行性。相比之下,具有类似22M参数数量的VMamba-T需要更高的5.6 GFLOPs。
5.2 Object Detection
Training strategies. We evaluate the efficacy of our EfficientVMamba model for object detection tasks on the MSCOCO 2017 [21] dataset. Our evaluation framework relies on the mmdetection library [3]. For comparisons with lightweight backbones, we follow PVT [49] to use RetinaNet as the detector and adopt the 1× training schedule. For comparisons with larger backbones, our experiment follows the hyperparameter settings detailed in Swin [25]. We use the AdamW optimization method to refine the weights of our networks pre-trained on ImageNet-1K for durations of 12 and 36 epochs. We apply drop path rates of 0.2% across the board for the EfficientVMamba-T/S/B variants. The learning rate begins at $1\times10^{-5}$ and is decreased tenfold at epochs 9 and 11. Multi-scale training and random flipping are implemented during training with a batch size of 16, adhering to standard procedures for evaluating object detection systems.
【翻译】训练策略。我们在MSCOCO 2017 [21]数据集上评估EfficientVMamba模型在目标检测任务中的有效性。我们的评估框架依赖于mmdetection库[3]。对于与轻量级骨干网络的比较,我们遵循PvT [49]使用RetinaNet作为检测器并采用1×训练计划。而对于与较大骨干网络的比较,我们的实验遵循Swin [25]中详述的超参数设置。我们使用AdamW优化方法在ImageNet-1K上细化预训练网络的权重,训练持续12和36个epoch。我们对EfficientVMamba-T/S/B变体全面应用0.2%的drop path率。学习率从1e-5开始,在第9和第11个epoch时降低十倍。在训练过程中实施多尺度训练和随机翻转,批次大小为16,遵循评估目标检测系统的标准程序。
Table 3: COCO detection results on RetinaNet.
【翻译】表3:RetinaNet上的COCO检测结果。
Results. We summarize the results of the RetinaNet detector in Table 3. Remarkably, each variant competitively reduces the model size while simultaneously exhibiting a performance enhancement. The EfficientVMamba-T model stands out with 13M parameters and an AP of 37.5%, higher by 5.7% compared to ResNet-18, which has 21.3M parameters. The performance of EfficientVMamba-T also surpasses PVTv1-Tiny by 0.8% while matching it in terms of parameter count. EfficientVMamba-S, with only 19M parameters, achieves a commendable AP of 39.1%, outstripping the larger ResNet-50 model, which shows a lower AP of 36.3% despite having 37.7M parameters. In the higher echelons, EfficientVMamba-B, which boasts 44M parameters, secures an AP of 42.8%, signifying a significant lead over both ResNet-101 and ResNeXt101-32x4d, highlighting the efficiency of our models even with a smaller parameter footprint. Notably, PVTv2-b0 with 13M parameters achieves an AP of 37.2%, which EfficientVMamba-T closely follows, indicating competitive performance with a similar parameter budget. For the comparisons with other backbones on Mask R-CNN, see the Appendix.
【翻译】结果。我们在表3中总结了RetinaNet检测器的结果。值得注意的是,每个变体在竞争性地减小尺寸的同时都表现出性能提升。EfficientVMamba-T模型以13M参数和37.5%的AP表现突出,相比具有21.3M参数的ResNet-18高出5.7%。EfficientVMamba-T的性能也超越PVTv1-Tiny 0.8%,同时在参数数量上与其匹配。EfficientVMamba-S仅用19M参数就实现了令人称赞的39.1%的AP,超越了更大的ResNet50模型,后者尽管有37.7M参数但AP仅为36.3%。在更高层次上,拥有44M参数的EfficientVMamba-B获得了42.8%的AP,相对于ResNet101和ResNeXt101-32x4d都有显著领先,突出了我们模型即使在较小参数占用下的效率。值得注意的是,具有13M参数的PVTv2-b0实现了37.2%的AP,EfficientVMamba-T紧随其后,表明在相似参数预算下具有竞争性能。关于与其他骨干网络在Mask R-CNN上的比较,请参见附录。
5.3 Semantic Segmentation
Training strategies. Aligning with the VMamba [24] settings, we integrate an UperHead into the pre-trained model structure. Utilizing the AdamW optimizer, we initiate the learning rate at $6\times10^{-5}$. The fine-tuning stage consists of 160k iterations, using a batch size of 16. While the standard input resolution stands at $512\times512$, we also conduct experiments with $640\times640$ inputs and apply multi-scale (MS) testing to broaden our evaluation.
【翻译】训练策略。与Vmamba [24]设置保持一致,我们将UperHead集成到预训练模型结构中。使用AdamW优化器,我们将学习率初始化为6×10−56\times10^{-5}6×10−5。微调阶段包含160k160k160k次迭代,使用批次大小为16。虽然标准输入分辨率为512×512512\times512512×512,我们也使用640×640640\times640640×640输入进行实验,并应用多尺度(MS)测试来扩大我们的评估范围。
Results. The EfficientVMamba-T model yields mIoUs of 38.9% (SS) and 39.3% (MS), compared to ResNet-50's 42.1% mIoU while using far fewer parameters. EfficientVMamba-S achieves 41.5% (SS) and 42.1% (MS) mIoUs, bettering DeiT-S + MLN despite having a lower computational footprint. EfficientVMamba-B reaches 46.5% (SS) and 47.3% (MS), outperforming the heavier VMamba-S. These findings attest to the EfficientVMamba series' balance of accuracy and computational efficiency in semantic segmentation.
【翻译】结果。EfficientVMamba-T模型产生38.9%(SS)和39.3%(MS)的mIoU,与ResNet-50的42.1% mIoU相比所用参数要少得多。EfficientVMamba-S实现了41.5%(SS)和42.1%(MS)的mIoU,尽管计算占用更低,但仍优于DeiT-S+MLN。EfficientVMamba-B达到46.5%(SS)和47.3%(MS),超越了更重的VMamba-S。这些发现证明了EfficientVMamba系列在语义分割中准确性和计算效率的平衡。
Table 4: Results of semantic segmentation on ADE20K using UperNet [ 53 ]. We measure the mIoU with single-scale (SS) and multi-scale (MS) testings on the val set. The FLOPs are measured with an input size of 512×2048512\times2048512×2048 . MLN: multi-level neck.
【翻译】表4:使用UperNet [53]在ADE20K上的语义分割结果。我们在验证集上使用单尺度(SS)和多尺度(MS)测试来测量mIoU。FLOPs是在输入大小为512×2048512\times2048512×2048时测量的。MLN:多级颈部。
Table 5: Ablation Analysis: Evaluating the Efficacy of Enhanced Spatially Selective Dilatation (ES2D), Assessing the Synergistic Effect of Convolutional Branch Fusion Enhanced with Squeeze-and-Excitation (SE) Techniques, and Investigating the Role of Inverted Residual Blocks in Model Performance. For comparison with the baseline VMamba, we adjust the dimensions and number of layers of it to match the FLOPs.
【翻译】表5:消融分析:评估增强空间选择性扩张(ES2D)的功效,评估通过挤压激励(SE)技术增强的卷积分支融合的协同效应,并研究倒残差块在模型性能中的作用。为了与基线VMamba进行比较,我们调整其维度和层数以匹配FLOPs。
5.4 Ablation Study
Effect of atrous selective scan. We implement experiments to validate the efficacy of atrous selective scan in Table 5. The upgrade from SS2D to ES2D significantly reduces the computational complexity (0.8 GFLOPs for the tiny variant) while retaining competitive accuracy at 73.6%, a 1.5% improvement on the tiny variant. Similarly, in the case of the base variant, the model utilizing ES2D not only reduces the GFLOPs to 4.0 from VMamba-B's 4.2 but also exhibits an increase in accuracy from 80.2% to 80.9%. The results suggest that the incorporation of ES2D in our EfficientVMamba models is one of the key factors in achieving the reduction of computational complexity by skip sampling while preserving the global receptive field to keep competitive performance. The reduction of GFLOPs also reveals the potency of ES2D in maintaining, and even enhancing, model accuracy while significantly reducing computational overhead, demonstrating its viability for resource-constrained scenarios.
【翻译】空洞选择性扫描的效果。我们在表5中实施实验来验证空洞选择性扫描的有效性。从SS2D升级到ES2D显著降低了计算复杂度(tiny变体为0.8 GFLOPs),同时保持了73.6%的竞争性准确率,在tiny变体上提升了1.5%。同样,在base变体的情况下,利用ES2D的模型不仅将GFLOPs从VMamba-B的4.2降低到4.0,还表现出准确率从80.2%增加到80.9%。结果表明,在我们的EfficientVMamba模型中纳入ES2D是通过跳跃采样实现计算复杂度降低同时保持全局感受野以保持竞争性能的关键因素之一。GFLOPs的减少也揭示了ES2D在保持甚至增强模型准确性的同时显著减少计算开销的效力,证明了其在资源受限场景中的可行性。
Table 6: Comparisons of injecting convolution blocks at different stages on ImageNet dataset. We use EfficientVMamba-T in the experiments.
【翻译】表6:在ImageNet数据集上不同阶段注入卷积块的比较。我们在实验中使用EfficientVMamba-T。
Effect of SSM-Conv fusion block. The integration of a convolutional branch followed by an SE block enhances the performance of our model. For the tiny variant, adding the local fusion feature extraction improves accuracy from 73.6% to 75.1%. In the case of EfficientVMamba-B, introducing the fusion mechanism increases accuracy from 80.9% to 81.2%. The observed performance gains reveal that the additional convolutional branch enhances local feature extraction. By integrating the fusion, the models likely benefit from a more diversified feature set that captures a wider range of spatial details, improving the model's ability to generalize and thus boosting accuracy. This suggests that the strategic addition of such branches can effectively enhance the model's performance by providing a comprehensive and more nuanced receptive field over the input feature map.
【翻译】SSM-Conv融合块的效果。集成一个后跟SE块的卷积分支增强了我们模型的性能。对于tiny变体,添加局部融合特征提取将准确率从73.6%提高到75.1%。在EfficientVMamba-B的情况下,引入融合机制将准确率从80.9%增加到81.2%。观察到的性能提升表明额外的卷积分支增强了局部特征提取。通过集成融合,模型可能受益于更多样化的特征集,捕获更广泛的空间细节,提高模型的泛化能力,从而提升准确率。这表明这种分支的战略性添加可以通过提供输入特征图的全面而更细致的感受野有效增强模型的性能。
Comparisons of injecting convolution blocks at different stages. In this paper, we obtain an interesting observation that our SSM-based block, EVSS, is more beneficial in the early stages of the network. In contrast, previous works on light-weight ViTs usually inject the convolution blocks in the early stages and adopt Transformer blocks in the deep stages. As shown in Table 6, we compare the performance of injecting convolution blocks in different stages of EfficientVMamba-T, and the results indicate that adopting Inverted Residual blocks in the deep stages performs better than in the early stages. An explanation for the opposite phenomena between our light-weight VSSMs and ViTs is that the self-attention in Transformers has a higher computation complexity and thus its computation at high resolutions is inefficient; while the SSMs, tailored for efficient modeling of long sequences, are more efficient and beneficial at capturing information globally at high resolutions.
【翻译】在不同阶段注入卷积块的比较。在本文中,我们得到了一个有趣的观察,即我们基于SSM的块EVSS在网络的早期阶段更有益。相比之下,以往关于轻量级ViTs的工作通常在早期阶段注入卷积块,在深层阶段采用Transformer块。如表6所示,我们比较了在EfficientVMamba-T的不同阶段注入卷积块的性能,结果表明,在深层阶段采用倒残差块比在早期阶段表现更好。我们的轻量级VSSMs和ViTs之间相反现象的解释是,Transformers中的自注意力具有更高的计算复杂度,因此其在高分辨率下的计算效率低下;而SSMs专为长序列的高效建模而定制,在高分辨率下全局捕获信息更高效且更有益。
6 Conclusion
This paper proposed EfficientVMamba, a lightweight state-space network architecture that adeptly combines the strengths of global and local information extraction, addressing the trade-off between model accuracy and computational efficiency. By incorporating an atrous-based selective scan with efficient skip sampling, EfficientVMamba ensures comprehensive global receptive field coverage while minimizing computational load. The integration of this scanning approach with a convolutional branch, followed by optimization through a Squeeze-and-Excitation module, allows for a robust re-balancing of global and local features. Additionally, the innovative use of inverted residual insertion further refines the model's multi-layer stages, enhancing its depth and effectiveness. Experimental results affirm that EfficientVMamba not only scales down the computational complexity to $\mathcal{O}(N)$ but also delivers competitive performance across various vision tasks. The achievements of EfficientVMamba highlight its potential as a formidable framework in the evolution of lightweight, efficient, and general-purpose visual models.
【翻译】本文提出了EfficientVMamba,一种轻量级状态空间网络架构,巧妙地结合了全局和局部信息提取的优势,解决了模型准确性和计算效率之间的权衡。通过采用基于多孔的选择性扫描和高效跳跃采样,EfficientVMamba确保了全面的全局感受野覆盖,同时最小化计算负载。将这种扫描方法与卷积分支集成,然后通过挤压激励模块进行优化,允许对全局和局部特征进行稳健的重新平衡。此外,倒残差插入的创新使用进一步细化了模型的多层阶段,增强了其深度和有效性。实验结果证实,EfficientVMamba不仅将计算复杂度缩减到O(N)\mathcal O(N)O(N),还在各种视觉任务中提供了竞争性能。EfficientVMamba的成就突出了其作为轻量级、高效和通用视觉模型演进中强大框架的潜力。
Table 7: Object detection and instance segmentation results on COCO val set.
【翻译】表7:COCO验证集上的目标检测和实例分割结果。
Appendix
与Mask R-CNN上其他骨干网络的比较
We also investigate the performance dynamics of our EfficientVMamba as a lightweight backbone within the Mask R-CNN schedule, as shown in Table 7. For the Mask R-CNN 1× schedule, our EfficientVMamba-T model, with 11M parameters and 60G FLOPs, achieves an Average Precision (AP) of 35.6%. This is 1.6% higher than ResNet-18, which has 31M parameters and 207G FLOPs. EfficientVMamba-S, with a greater number of parameters at 31M and 197G FLOPs, reaches an AP of 39.3%, which is 0.5% above the ResNet-50 model with 44M parameters and 260G FLOPs. Our largest model, EfficientVMamba-B, shows a superior AP of 43.7% with 53M parameters and a reduced computational requirement of 252G FLOPs, outperforming VMamba-T by 2.8%. In terms of the Mask R-CNN 3× MS schedule, EfficientVMamba-T maintains an AP of 38.3%, surpassing ResNet-18's performance by 1.4%. The small variant records an AP of 41.5%, which is a 0.5% improvement over PVT-T with a similar parameter count. Finally, EfficientVMamba-B achieves an AP of 45.0%, indicating a notable advancement of 2.2% over VMamba-T.
【翻译】我们还研究了EfficientVMamba作为Mask R-CNN调度中轻量级骨干网络的性能动态,如表7所示。对于Mask R-CNN 1×调度,我们的EfficientVMamba-T模型具有11M参数和60G FLOPs,实现了35.6%的平均精度(AP)。这比具有31M参数和207G FLOPs的ResNet-18高1.6%。EfficientVMamba-S具有更多参数31M和197G FLOPs,达到39.3%的AP,比具有44M参数和260G FLOPs的ResNet-50模型高0.5%。我们最大的模型EfficientVMamba-B显示出卓越的43.7% AP,具有53M参数和更低的252G FLOPs计算需求,超越VMamba-T 2.8%。在Mask R-CNN 3×MS调度方面,EfficientVMamba-T保持38.3%的AP,超越ResNet-18的性能1.4%。小变体记录了41.5%的AP,比具有相似参数数量的PVT-T提高0.5%。最后,EfficientVMamba-B实现了45.0%的AP,表明比VMamba-T显著提升2.2%。
与MobileNetV2骨干网络的比较
We compare variant architectures and reveal a significant performance difference based on the integration of our innovative block, EVSS, versus Inverted Residual (InRes) blocks at specific stages. The results in Table 8 show that using InRes consistently across all stages in both the tiny and base variants achieves a good performance, with the base variant notably reaching an accuracy of 81.4%. When EVSS is applied across all stages (the all-stage strategy of MobileNetV2 [34]), we observe a slight decrease in accuracy for both variants, suggesting a nuanced balance between architectural consistency and computational efficiency. Our fusion approach that combines EVSS in the initial stages with InRes in the later stages enhances accuracy to 76.5% and 81.8% for the tiny and base variants, respectively. This strategy benefits from the early-stage efficiency of EVSS and the advanced-stage convolutional capabilities of InRes, thus optimizing network performance by leveraging the strengths of both block types with limited computational resources.
【翻译】我们比较了不同的架构变体,揭示了基于我们创新的EVSS块与倒残差(InRes)块在特定阶段集成的显著性能差异。表8中的结果显示,在tiny和base变体的所有阶段一致使用InRes都能取得良好性能,其中base变体显著达到了81.4%的准确率。当EVSS应用于所有阶段时(MobileNetV2的策略[34]),我们观察到两个变体的准确率都略有下降,这表明架构一致性和计算效率之间存在微妙的平衡。我们的融合方法在初始阶段结合EVSS,在后期阶段结合InRes,将tiny和base变体的准确率分别提升到76.5%和81.8%。这种策略受益于EVSS的早期阶段效率和InRes的高级阶段卷积能力,从而通过在有限的计算资源下利用两种块类型的优势来优化网络性能。
Table 8: Comparisons of MobileNetV2 (All stages composed with EVSS.) on ImageNet dataset. We assess both tiny and base models on the ImageNet.
【翻译】表8:MobileNetV2(所有阶段均由EVSS组成)在ImageNet数据集上的比较。我们在ImageNet上评估tiny和base模型。
局限性
Visual state space models that operate with a linear-time complexity $\mathcal{O}(N)$ relative to the sequence length demonstrate marked enhancements, particularly in high-resolution downstream tasks, in contrast to prior CNN-based and Transformer-based models. However, the computational design of SSMs inherently exhibits greater computational sophistication than both convolutional and self-attention mechanisms, which complicates efficient parallel processing. There remains promising potential for future investigation into optimizing the computational efficiency and scalability of visual state space models (SSMs).
【翻译】相对于序列长度具有线性时间复杂度O(N)\mathcal O(N)O(N)的视觉状态空间模型展现出显著的增强效果,特别是在高分辨率下游任务中,这与之前基于CNN和Transformer的模型形成对比。然而,SSMs的计算设计本质上比卷积和自注意力机制表现出更高的计算复杂性,这使得高效并行处理的性能变得复杂。在优化视觉状态空间模型(SSMs)的计算效率和可扩展性方面,仍有很大的研究潜力。
【解析】SSMs通过状态传播机制实现了线性复杂度,既能处理长序列又能保持全局建模能力,这在高分辨率视觉任务中特别有价值,因为图像被展开成序列后往往非常长。然而,这段话也指出了SSMs当前面临的主要问题:虽然理论复杂度更优,但实际计算过程比传统方法更加复杂。其实这种复杂性主要体现在状态更新的递归性质上,使得并行化变得困难。