
[Daily Paper] Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

news 2025/8/20 21:21:14

To download the PDF or read the paper, see: LlamaFactory - huggingface daily paper - Daily Paper Digest | LlamaFactory

Abstract


In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference because of their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework along the same technical route. However, 3D visual grounding requires the 3D scene representation to interact deeply with text features, and sparse convolution-based architectures are inefficient for this interaction because of the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA), which deeply fuse the 3D scene representation and text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation, so that the voxel features can interact efficiently with text features through cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively repairs over-pruned regions by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves the highest inference speed and surpasses the previous fastest method by 100% FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, leading by +1.13 in Acc@0.5 on ScanRefer, and by +2.6 and +3.2 on NR3D and SR3D respectively. The code is available at https://github.com/GWxuan/TSP3D.
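
The prune-then-complete idea in the abstract can be illustrated with a toy sketch. This is not the authors' TSP3D implementation (see the repository above for that). Everything here is an assumption made for illustration: the function names (`cross_attend`, `relevance`, `text_guided_prune`, `completion_based_add`) are hypothetical, plain Python lists stand in for sparse voxel tensors, the cross-attention has a single head and no learned projections, and a top-k cut and a fixed threshold stand in for whatever learned components the paper uses to decide which voxels to drop or restore.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(voxel_feats, text_feats):
    """Single-head cross-attention with a residual connection.

    Voxels act as queries, text tokens as keys and values. No learned
    projections are applied (an assumption to keep the sketch short).
    """
    dim = len(text_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in voxel_feats:
        weights = softmax([dot(v, t) * scale for t in text_feats])
        context = [sum(w * t[i] for w, t in zip(weights, text_feats))
                   for i in range(dim)]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


def relevance(feat, text_feats):
    """Hypothetical relevance score: best dot-product match to any text token."""
    return max(dot(feat, t) for t in text_feats)


def text_guided_prune(coords, voxel_feats, text_feats, keep_ratio=0.5):
    """Fuse voxels with text, then keep the top `keep_ratio` fraction by relevance."""
    fused = cross_attend(voxel_feats, text_feats)
    scores = [relevance(f, text_feats) for f in fused]
    k = max(1, math.ceil(keep_ratio * len(coords)))
    ranked = sorted(range(len(coords)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original voxel order
    return [coords[i] for i in keep], [fused[i] for i in keep]


def completion_based_add(kept_coords, kept_feats, all_coords, all_feats,
                         text_feats, threshold):
    """Re-insert pruned voxels whose unpruned features score above `threshold`.

    Restored voxels reuse their unpruned features (a simplification).
    """
    present = set(kept_coords)
    out_coords, out_feats = list(kept_coords), list(kept_feats)
    for c, f in zip(all_coords, all_feats):
        if c not in present and relevance(f, text_feats) > threshold:
            out_coords.append(c)
            out_feats.append(list(f))
    return out_coords, out_feats


# Toy scene: four voxels with 2-D features and a single text token.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text = [[1.0, 0.0]]

# An aggressive pruning step keeps only the single most text-relevant voxel.
pruned_coords, pruned_feats = text_guided_prune(coords, feats, text, keep_ratio=0.25)

# Completion restores a neighbouring voxel that also matches the text well.
restored_coords, _ = completion_based_add(
    pruned_coords, pruned_feats, coords, feats, text, threshold=0.8)
```

In this toy run, pruning keeps only the voxel at `(0, 0, 0)`, and the completion step adds back `(1, 0, 0)`, whose features also align with the text token. The point of the design is that each cross-attention pass runs over fewer voxels than the one before, while the completion step guards against losing parts of the target.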