
Vision Transformer with Sparse Scan Prior

Abstract

https://arxiv.org/pdf/2405.13335v1
In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye’s efficient information processing. Inspired by the human eye’s sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism ( \left.\mathrm{S}^{3} \mathrm{~A}\right) . This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye’s functionality and significantly reduces the computational load of vision models. Building on \mathrm{S}^{3} \mathrm{~A} , we introduce the Sparse Scan Vision Transformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of \mathbf{8 4 . 4 % / 8 5 . 7 %} with 4.4G/18.2G FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets. Code will be available at https:// github. com/qhfan/SSViT.
1 Introduction
Since its inception, the Vision Transformer (ViT) [12] has attracted considerable attention from the research community, primarily owing to its exceptional capability in modeling long-range dependencies. However, the self-attention mechanism [61], as the core of ViT, imposes significant computational overhead, thus constraining its broader applicability. Several strategies have been proposed to alleviate this limitation of self-attention. For instance, methods such as Swin-Transformer [40, 11] group tokens for attention, reducing computational costs and enabling the model to focus more on local information. Techniques like PVT [63, 64, 18, 16, 29] down-sample tokens to shrink the size of the QK matrix, thus lowering computational demands while retaining global information. Meanwhile, approaches such as UniFormer [35, 47] forgo attention operations in the early stages of visual modeling, opting instead for lightweight convolution. Furthermore, some models [50] enhance computational efficiency by pruning redundant tokens.
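For concreteness, the following is a minimal PyTorch sketch of the token-downsampling route mentioned above (the PVT-style spatial-reduction idea): queries keep full resolution while keys and values are produced from a spatially reduced feature map, so the attention matrix shrinks from N×N to N×(N/r²). The module and parameter names (e.g. `sr_ratio`) are illustrative assumptions, not the exact implementation of any cited model.

```python
import torch
import torch.nn as nn


class DownsampledKVAttention(nn.Module):
    """Sketch of spatial-reduction attention: full-resolution queries attend to
    keys/values computed from a downsampled copy of the feature map."""

    def __init__(self, dim=64, num_heads=4, sr_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Strided conv reduces the key/value tokens by sr_ratio along each spatial dim.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                    # N = H * W full-resolution tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Downsample before producing K and V: global context, but coarse.
        x_ = self.sr(x.transpose(1, 2).reshape(B, C, H, W))  # (B, C, H/r, W/r)
        x_ = x_.flatten(2).transpose(1, 2)                   # (B, N/r^2, C)
        k, v = self.kv(x_).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = ((q @ k.transpose(-2, -1)) * self.head_dim ** -0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    H = W = 16
    tokens = torch.randn(2, H * W, 64)
    print(DownsampledKVAttention()(tokens, H, W).shape)      # torch.Size([2, 256, 64])
```

The efficiency gained here comes at the cost of every query seeing only a coarse global summary, which is the "indistinct global information modeling" the next paragraph contrasts with how human vision operates.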
Despite these advancements, the majority of methods primarily focus on reducing the token count in self-attention operations to boost ViT efficiency, often neglecting the manner in which human eyes process visual information. The human visual system operates in a notably less intricate yet highly efficient manner compared to ViT models. Unlike the fine-grained local spatial information modeling in models like Swin [40], NAT [20], and LVT [69], or the indistinct global information modeling seen in models like PVT [63], PVTv2 [64], and CMT [18], human vision employs a sparse scanning mechanism.
