FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection, a Close Reading (Paragraph-by-Paragraph Analysis)

news 2025/7/15 6:45:07

Paper: https://arxiv.org/abs/2504.20670
Yao Xiao, Tingfa Xu, Yu Xin, Jianan Li

Beijing Institute of Technology
AAAI 2025

Abstract

Embedded flight devices with visual capabilities have become essential for a wide range of applications. In aerial image detection, while many existing methods have partially addressed the issue of small target detection, challenges remain in optimizing small target detection and balancing detection accuracy with efficiency. These issues are key obstacles to the advancement of real-time aerial image detection. In this paper, we propose a new family of real-time detectors for aerial image detection, named FBRT-YOLO, to address the imbalance between detection accuracy and efficiency. Our method comprises two lightweight modules: Feature Complementary Mapping Module (FCM) and Multi-Kernel Perception Unit (MKP), designed to enhance object perception for small targets in aerial images. FCM focuses on alleviating the problem of information imbalance caused by the loss of small target information in deep networks. It aims to integrate spatial positional information of targets more deeply into the network, better aligning with semantic information in the deeper layers to improve the localization of small targets. We introduce MKP, which leverages convolutions with kernels of different sizes to enhance the relationships between targets of various scales and improve the perception of targets at different scales. Extensive experimental results on three major aerial image datasets, including VisDrone, UAVDT, and AI-TOD, demonstrate that FBRT-YOLO outperforms various real-time detectors in terms of performance and speed. Code will be available at https://github.com/galaxy-oss/FCM.

【翻译】具有视觉能力的嵌入式飞行设备已成为广泛应用的必需品。在航空图像检测中，虽然许多现有方法已部分解决了小目标检测问题，但在优化小目标检测和平衡检测精度与效率方面仍存在挑战。这些问题是实时航空图像检测发展的关键障碍。在本文中，我们提出了一个新的实时航空图像检测器系列，名为FBRT-YOLO，以解决检测精度与效率之间的不平衡问题。我们的方法包含两个轻量级模块：特征互补映射模块(FCM)和多核感知单元(MKP)，旨在增强航空图像中小目标的目标感知能力。FCM专注于缓解深度网络中小目标信息丢失导致的信息不平衡问题。它旨在将目标的空间位置信息更深入地集成到网络中，更好地与深层的语义信息对齐，以改善小目标的定位。我们引入了MKP，它利用不同尺寸核的卷积来增强不同尺度目标之间的关系，并改善不同尺度目标的感知。在三个主要航空图像数据集（包括Visdrone、UAVDT和AI-TOD）上的广泛实验结果表明，FBRT-YOLO在性能和速度方面均优于各种实时检测器。代码将在 https://github.com/galaxy-oss/FCM 上提供。

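【Analysis】In aerial scenes, targets are usually small and densely distributed, which creates two main difficulties for conventional detectors: small targets tend to lose important information during feature extraction in deep networks, and real-time requirements conflict with detection accuracy. FBRT-YOLO addresses both with two modules. FCM fuses shallow features, which are rich in spatial position information, with deep features, which are rich in semantics, so that the spatial information of small targets is not lost as it propagates through the network. MKP uses convolution kernels of different sizes to capture multi-scale features, which matters for detecting targets of different sizes. The overall design goal is to keep computational cost as low as possible while preserving accuracy, so that the detector can run in real time.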

Introduction

Recent advancements in deep neural networks have significantly improved object detection in low-resolution natural images (2022; 2023). However, these methods struggle with efficiency and accuracy on high-resolution aerial images, especially in resource-constrained flight equipment. Key challenges include: i) detecting objects that are small or obscured by backgrounds in aerial images, and ii) balancing accuracy with real-time detection requirements on devices with limited computational resources.

【翻译】深度神经网络的最新进展显著改善了低分辨率自然图像的目标检测(2022; 2023)。然而，这些方法在高分辨率航空图像上的效率和精度方面存在困难，特别是在资源受限的飞行设备上。主要挑战包括：i)检测航空图像中较小或被背景遮挡的目标，以及ii)在计算资源有限的设备上平衡精度与实时检测要求。

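【Analysis】Aerial images pose their own difficulties: the shooting altitude makes targets very small in the image, and the top-down view makes them easy to confuse with complex ground backgrounds. In addition, the platforms are usually drones or other aircraft whose computing power is far below that of ground-based servers, so the algorithm must be lightweight enough to run in real time on these devices while still detecting well.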

To improve small object detection, increasing image resolution (2017; 2020) is common but adds computational burden, hampering real-time performance. A key challenge is the mismatch between low-resolution semantic information from deep networks and high-resolution spatial information from shallow networks. Feature pyramids (2017) address this by integrating deep and shallow features, enhancing small object localization and multi-scale feature expression while improving computational efficiency. However, as shown in Fig. 1(a), backbone networks still struggle with integrating and preserving shallow information, leading to feature mismatch issues.

【翻译】为了改善小目标检测，增加图像分辨率(2017; 2020)是常见做法，但这会增加计算负担，妨碍实时性能。一个关键挑战是深度网络的低分辨率语义信息与浅层网络的高分辨率空间信息之间的不匹配。特征金字塔(2017)通过整合深层和浅层特征来解决这个问题，增强小目标定位和多尺度特征表达，同时提高计算效率。然而，如图1(a)所示，骨干网络在整合和保留浅层信息方面仍然存在困难，导致特征不匹配问题。

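【Analysis】Shallow layers capture rich spatial detail such as edges and textures, which is important for localizing small targets, while deep layers extract high-level semantics and can tell what an object is. As the network deepens, however, spatial information is gradually lost, and the semantic feature maps have low resolution. Feature pyramid networks try to fuse features from different levels, but passing fine shallow spatial information into the deep layers and combining it well with semantic information remains difficult in practice.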

Figure 1: The previous method overlooked the embedding of spatial information in deeper layers of the backbone network during feature extraction, leading to spatial semantic inconsistencies. Our method aims to transfer shallow spatial location information into deeper layers of the network during the feature extraction process, thereby enhancing the expression of semantic information.

【翻译】图1：以前的方法在特征提取过程中忽略了在骨干网络的深层嵌入空间信息，导致空间语义不一致。我们的方法旨在在特征提取过程中将浅层空间位置信息传递到网络的深层，从而增强语义信息的表达。

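【Analysis】While extracting features layer by layer, conventional deep networks tend to focus on producing more abstract semantic features and neglect preserving and propagating spatial information. It is like looking at a small distant object: you can roughly tell what kind of object it is (semantic information) but cannot point to exactly where it is (spatial information). The authors design a mechanism that actively carries shallow spatial position information into the deep layers during feature extraction, so that the deep layers gain rich semantics while keeping an accurate sense of target position.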

To address the challenges associated with object detection in aerial images, we aim to achieve a more effective network design that meets the requirements for both accuracy and efficiency in real-time aerial image analysis. In this paper, we propose a novel network that includes two lightweight modules: the Feature Complementary Mapping Module (FCM) and the Multi-Kernel Perception Unit (MKP). Firstly, to alleviate information imbalance within the backbone network and promote better integration of semantic and spatial location information, we introduce the Feature Complementary Mapping Module (FCM). FCM implicitly encodes the target’s spatial location information into high-dimensional vectors, guiding the complementary learning of spatial and channel information across different stages of the backbone network. This facilitates the fusion of shallow spatial location information with deep semantic information, enhancing the consistency of spatial and semantic representations. This approach helps transfer shallow spatial location information to deeper layers of the network, improving feature alignment and enhancing the localization of small objects, as shown in Fig. 1(b).

【翻译】为了解决航空图像目标检测相关的挑战，我们旨在实现更有效的网络设计，满足实时航空图像分析中精度和效率的要求。在本文中，我们提出了一个包含两个轻量级模块的新型网络：特征互补映射模块(FCM)和多核感知单元(MKP)。首先，为了缓解骨干网络内的信息不平衡并促进语义和空间位置信息的更好集成，我们引入了特征互补映射模块(FCM)。FCM将目标的空间位置信息隐式编码为高维向量，指导骨干网络不同阶段的空间和通道信息的互补学习。这促进了浅层空间位置信息与深层语义信息的融合，增强了空间和语义表示的一致性。这种方法有助于将浅层空间位置信息传递到网络的深层，改善特征对齐并增强小目标的定位，如图1(b)所示。

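【Analysis】This paragraph describes the core idea of the solution. FCM targets information imbalance in deep networks: as depth increases, a conventional network gradually "forgets" shallow spatial detail and keeps only abstract semantics. Rather than simply concatenating or adding shallow and deep features, FCM encodes spatial position information into high-dimensional vectors and passes it on, so that the information can reach the deep layers without much additional computation. This complementary learning lets each stage of the network consider spatial and semantic information together, which improves feature alignment.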

Figure 2: Our FBRT-YOLO is compared with other real-time detectors in terms of accuracy and efficiency on the VisDrone dataset. The radius of the circle represents GFLOPs.

【翻译】图2：我们的FBRT-YOLO与其他实时检测器在VisDrone数据集上的精度和效率比较。圆圈的半径表示GFLOPs。

Secondly, due to the minimal representation of small objects in aerial images, which often comprise just a few pixels, these objects are susceptible to feature disappearance during convolutional neural network (CNN) feature extraction. To fully utilize the limited feature information and enhance the network’s perception of targets at different scales, we investigate the network’s receptive field and propose a Multi-Kernel Perception Unit (MKP). MKP consists of convolutional kernels of different sizes and incorporates spatial point convolutions between these sizes to focus on details at various scales and highlight multi-scale feature representation. We replace the final downsampling layer of the network with MKP. This approach enables multi-scale perception of targets, improving the network’s ability to capture features across different scales while further simplifying the network structure.

【翻译】其次，由于航空图像中小目标的表示非常有限，通常只包含几个像素，这些目标在卷积神经网络(CNN)特征提取过程中容易出现特征消失。为了充分利用有限的特征信息并增强网络对不同尺度目标的感知能力，我们研究了网络的感受野并提出了多核感知单元(MKP)。MKP由不同尺寸的卷积核组成，并在这些尺寸之间融合空间点卷积，以关注各种尺度的细节并突出多尺度特征表示。我们用MKP替换网络的最终下采样层。这种方法实现了目标的多尺度感知，提高了网络捕获不同尺度特征的能力，同时进一步简化了网络结构。

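【Analysis】Small targets in aerial images occupy very few pixels, so their weak signal is easily diluted or masked as the CNN extracts features layer by layer. A convolution with a single fixed kernel size may work well for large targets but often fails to capture the features of small ones. MKP uses kernels of several sizes so that the same target is observed at multiple scales. Kernels of different sizes have different receptive fields: small kernels focus on local detail, and large kernels cover wider context. Fusing the information from these scales gives the network a richer and more robust feature representation.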

To meet the requirements of real-time detection in aerial images, we propose FBRT-YOLO, which boasts fewer training parameters and reduced computational load compared to the baseline YOLOv8 model (2023). Extensive experiments conducted on widely-used aerial image benchmarks such as VisDrone (2018), UAVDT (2018), and AI-TOD (2021) demonstrate that our FBRT-YOLO significantly outperforms previous state-of-the-art YOLO series models in terms of the trade-off between computation and accuracy across various model scales. The results are displayed in Fig. 2. Our contributions can be summarized as follows:

【翻译】为了满足航空图像实时检测的要求，我们提出了FBRT-YOLO，与基线YOLOv8模型(2023)相比，它具有更少的训练参数和更低的计算负载。在广泛使用的航空图像基准测试上进行的大量实验，如VisDrone (2018)、UAVDT (2018)和AI-TOD (2021)，证明我们的FBRT-YOLO在各种模型规模下的计算量与精度权衡方面显著优于以前的最先进YOLO系列模型。结果显示在图2中。我们的贡献可以总结如下：

• We introduce a new family of real-time detectors for aerial image detection across different model scales, named FBRT-YOLO, achieving a highly balanced trade-off between accuracy and efficiency.
• We propose a Feature Complementary Mapping Module (FCM) that enhances feature matching for small targets in deep networks by integrating rich semantic information with precise spatial positional information.
• We introduce Multi-Kernel Perception Unit (MKP) to replace the final downsampling operation, enhancing multi-scale target perception, and simplifying the network for high efficiency.

【翻译】• 我们引入了一个新的实时检测器系列，用于不同模型规模的航空图像检测，命名为FBRT-YOLO，实现了精度和效率之间的高度平衡权衡。
• 我们提出了特征互补映射模块(FCM)，通过整合丰富的语义信息与精确的空间位置信息，增强深度网络中小目标的特征匹配。
• 我们引入了多核感知单元(MKP)来替代最终的下采样操作，增强多尺度目标感知，并简化网络以获得高效率。

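【Analysis】The first contribution emphasizes that the method is systematic and practical, covering a family of models at several scales. The second, FCM, addresses how to preserve shallow spatial detail as the network deepens. Conventional approaches often just fuse features from different layers, whereas FCM uses complementary mapping so that semantic and spatial information guide and reinforce each other during feature learning instead of being merely concatenated. The third, MKP, optimizes the network structure from another angle: replacing the final downsampling operation with multi-kernel convolutions strengthens multi-scale perception and reduces network complexity.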

Related Work

Real-time Object Detectors. Real-time object detectors are crucial for resource-constrained platforms, emphasizing model size, memory, and computational efficiency. Currently, YOLO (2016) and FCOS (2020) are mainstream frameworks for state-of-the-art real-time object detection. While existing real-time detectors have shown significant performance improvements on public benchmarks such as COCO (2014; 2024) for low-resolution natural images, their performance on high-resolution aerial images remains unsatisfactory. We introduce FBRT-YOLO, a specialized real-time object detector designed to excel in high-resolution aerial settings, demonstrating superior performance compared to existing models.

【翻译】实时目标检测器。实时目标检测器对于资源受限的平台至关重要，强调模型大小、内存和计算效率。目前，YOLO (2016)和FCOS (2020)是最先进实时目标检测的主流框架。虽然现有的实时检测器在COCO (2014; 2024)等公共基准测试中对低分辨率自然图像显示出显著的性能改进，但它们在高分辨率航空图像上的性能仍然不令人满意。我们引入了FBRT-YOLO，这是一个专门设计用于在高分辨率航空环境中表现出色的实时目标检测器，与现有模型相比展现出优越的性能。

Small Object Detection. Detecting small objects has long been challenging. Recent solutions include augmenting small object datasets (2019) and using high-resolution images to retain detailed features. However, these methods often result in more complex models and slower detection speeds. ClusDet (2019) employs a cluster-based object scale estimation network to effectively detect small objects. DMNet (2020a) utilizes a density map-based cropping method to leverage spatial and contextual information among objects for improved detection performance. Despite their effectiveness in small object detection, these methods suffer from long inference times and low detection efficiency. QueryDet (2022), while leveraging high-resolution features, incorporates a novel query mechanism to accelerate inference speed for object detectors based on feature pyramids. CEASC (2023) introduces a context-enhanced sparse convolution to capture global information and enhance focal features, striking a balance between detection accuracy and efficiency. These works propose lightweight decoupled heads that to some extent accelerate networks. However, achieving real-time detection remains challenging.

【翻译】小目标检测。检测小目标一直是一个挑战。最近的解决方案包括增强小目标数据集(2019)和使用高分辨率图像来保留详细特征。然而，这些方法通常导致更复杂的模型和更慢的检测速度。ClusDet (2019)采用基于聚类的目标尺度估计网络来有效检测小目标。DMNet (2020a)利用基于密度图的裁剪方法来利用目标之间的空间和上下文信息以改善检测性能。尽管这些方法在小目标检测方面有效，但它们存在推理时间长和检测效率低的问题。QueryDet (2022)在利用高分辨率特征的同时，结合了一种新颖的查询机制来加速基于特征金字塔的目标检测器的推理速度。CEASC (2023)引入了上下文增强稀疏卷积来捕获全局信息并增强焦点特征，在检测精度和效率之间取得平衡。这些工作提出了轻量级解耦头，在一定程度上加速了网络。然而，实现实时检测仍然具有挑战性。

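【Analysis】Small object detection is hard because so little information is available. A small target may occupy only a few to a few dozen pixels, and this weak signal is easily drowned in noise or discarded outright by downsampling as it passes through the network. Existing solutions fall into a few groups: data augmentation generates more small-object samples to help the network learn; high-resolution methods keep the original image detail to avoid losing information; clustering and density-map methods exploit how small targets are distributed to improve accuracy. All of them share one problem: to improve accuracy they tend to increase model complexity or use larger images, which drives up computational cost and makes real-time detection hard. The resulting conflict between accuracy and speed is especially pronounced in aerial image detection.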

Multi-scale Information Extraction and Representation. Small objects are often represented in feature maps by only a few pixels, so multi-scale information is needed to enhance their feature representation. Many works have addressed this aspect (2018; 2024). Feature Pyramid Network (FPN) integrates deep features, which carry the richest semantic information, with shallow features, which carry spatial location information, alleviating feature imbalance to a certain extent. PANet (2018) adds a bottom-up path on top of FPN, which promotes the propagation of bottom-layer information and enhances information exchange. IPG-Net (2020) introduces an image pyramid into the backbone network to solve the problem of information imbalance. The whole process consumes a lot of computing resources, which is not conducive to real-time detection. In our work, we focus on integrating deep semantic information with shallow spatial positional information in the backbone network. This integration alleviates the imbalance in information extraction during feature extraction, thereby enhancing the representation of small objects. We employ multi-scale convolutional kernels to strengthen the feature representation of targets across various scales.

【翻译】多尺度信息提取和表示。小目标在特征图中通常只由几个像素表示，这需要多尺度信息来增强这些小目标的特征表示。许多工作也从这个方面开展(2018; 2024)。特征金字塔网络(FPN)整合了具有最丰富语义信息的深层特征和具有空间位置信息的浅层特征，在一定程度上缓解了特征不平衡的问题。PANet (2018)在FPN的基础上增加了自下而上的路径，促进了底层信息的传播并增强了信息交换。IPG-Net (2020)将图像金字塔引入骨干网络来解决信息不平衡问题。整个过程消耗大量计算资源，不利于实时检测。在我们的工作中，我们专注于在骨干网络中整合深层语义信息与浅层空间位置信息。这种整合缓解了特征提取过程中信息提取的不平衡，从而增强了小目标的表示。我们采用多尺度卷积核来加强各种尺度目标的特征表示。

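【Analysis】The core idea of multi-scale information extraction is that features at different scales carry different kinds of information. In a deep convolutional network, feature-map resolution decreases with depth while semantic content grows. Shallow layers keep plenty of spatial detail such as edges, textures, and precise positions, but are weak at understanding what an object is; deep layers are the opposite. FPN-style methods try to combine the two, but their fusion is often simple concatenation or addition, which limits how deeply the information is integrated. Moreover, to obtain multi-scale information these methods often need complex structures or must process images at several resolutions, which adds considerable computation. The authors instead aim to perform this fusion inside the backbone itself and avoid extra computational overhead.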

Figure 3: Framework of FBRT-YOLO. FCM module is embedded into each stage of the backbone network to integrate spatial positional information into deeper semantic information. In the final (fourth) stage of the backbone network, MKP units are introduced along with multi-scale convolutions to enhance perception of targets at various scales. It’s worth noting that MKP replaces the final downsampling layer while also reducing the corresponding detection heads.

【翻译】图3：FBRT-YOLO的框架。FCM模块嵌入到骨干网络的每个阶段，将空间位置信息整合到更深层的语义信息中。在骨干网络的最后(第四)阶段，引入MKP单元以及多尺度卷积来增强对各种尺度目标的感知。值得注意的是，MKP替换了最终的下采样层，同时也减少了相应的检测头。

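【Analysis】FCM is embedded in every stage of the network, so spatial position information is integrated progressively throughout feature extraction rather than in a single step. This keeps spatial information propagating as the network deepens instead of being lost abruptly at some stage. MKP is placed in the final stage because by then the network has extracted rich semantic features and needs multi-scale perception to capture targets of different sizes. MKP is not just an extra module: it takes over the role of the original downsampling layer, which strengthens multi-scale perception without extra computational overhead. Removing the corresponding detection head makes the model lighter still, lowering complexity while maintaining detection performance.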

Method

We present the entire structure of FBRT-YOLO in Fig. 3. This includes two core lightweight modules: the Feature Complementary Mapping Module and the Multi-Kernel Perception Unit. FCM aims to integrate more spatial positional information into rich semantic features, enhancing the representation of small objects. MKP utilizes diverse convolutional kernels to capture target information across multiple scales. Additionally, for aerial image detection, we streamline the baseline network by removing non-critical or redundant computations, further refining the network.

【翻译】我们在图3中展示了FBRT-YOLO的完整结构。这包括两个核心轻量级模块：特征互补映射模块和多核感知单元。FCM旨在将更多的空间位置信息整合到丰富的语义特征中，增强小目标的表示。MKP利用多样化的卷积核来捕获多个尺度的目标信息。此外，对于航空图像检测，我们通过移除非关键或冗余的计算来简化基线网络，进一步优化网络。

Feature Complementary Mapping Module

Insufficient integration of spatial positional and semantic information can lead to mismatches and misalignments in target information. To address this limitation, we propose the Feature Complementary Mapping Module. This module implicitly encodes more low-level spatial information into high-dimensional vectors, transmitting it to deeper layers of the network. This enables the detector to capture stronger structural information, thereby enhancing the expression of semantic information. The detailed structure of FCM is shown in Fig. 3, which utilizes a split, transformation, complementary mapping strategy and feature aggregation. The following is a detailed introduction to this module.

【翻译】空间位置信息和语义信息的整合不足会导致目标信息的不匹配和不对齐。为了解决这一限制，我们提出了特征互补映射模块。该模块将更多的低级空间信息隐式编码到高维向量中，并将其传输到网络的更深层。这使得检测器能够捕获更强的结构信息，从而增强语义信息的表达。FCM的详细结构如图3所示，它采用分割、变换、互补映射策略和特征聚合。以下是对该模块的详细介绍。

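【Analysis】Information imbalance is especially harmful for small targets because they carry so little information to begin with. If spatial and semantic information are not integrated effectively, the detector either cannot localize the target or cannot recognize it. FCM builds a bridge that lets spatial information travel with semantic information into the deep layers, while semantic information in turn uses spatial information to improve localization. This is not a simple superposition: a dedicated mapping mechanism lets the two kinds of information reinforce each other.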

Channel Split. We first split the channels of the input feature ($X^{input}\in\mathbb{R}^{C\times H\times W}$) into two parts with $\alpha C$ channels and $(1-\alpha)C$ channels, where $0\le\alpha\le1$ is the split ratio. The value of $\alpha$ is quite important in the network. As the network deepens, the branch with lower-level spatial information becomes more prominent, with increasing amounts of low-level spatial information being implicitly encoded into high-dimensional vectors. Enhancing the acquisition of low-level information at appropriate times can improve performance. The split stage can be formulated as:

【翻译】通道分割。我们首先将输入特征 $X^{input}\in\mathbb{R}^{C\times H\times W}$ 的通道分割为分别具有 $\alpha C$ 个通道和 $(1-\alpha)C$ 个通道的两部分，其中 $0\le\alpha\le1$ 是分割比例。 $\alpha$ 的值在网络中非常重要。随着网络的加深，具有较低级空间信息的分支变得更加突出，越来越多的低级空间信息被隐式编码到高维向量中。在适当的时候增强低级信息的获取可以提高性能。分割阶段可以表述为：

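【Analysis】Channel split is the first step of FCM. In a conventional convolutional network all channels follow the same processing path, whereas FCM routes different kinds of information along different paths. The ratio $\alpha$ determines how many channels go to the branch that extracts semantic information and how many go to the branch that preserves spatial information. According to the paper, the emphasis changes with depth: as the network deepens, the branch carrying low-level spatial information becomes more prominent, so that more low-level spatial information is encoded into the deeper features.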

$(X^{1},X^{2})=\mathrm{Split}(X^{\mathrm{input}}),$

$X^{1}\in\mathbb{R}^{\alpha C\times H\times W},\quad X^{2}\in\mathbb{R}^{(1-\alpha)C\times H\times W}.$
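The split can be illustrated with a minimal pure-Python sketch. The feature map is stored as a list of channels, and the helper name `split_channels` and the rounding of $\alpha C$ to an integer are assumptions made for illustration, not details taken from the paper.

```python
def split_channels(x, alpha):
    """Split a feature map stored as a list of C channels into two groups.

    Returns (x1, x2): x1 holds the first round(alpha * C) channels and x2
    holds the remaining ones. The rounding rule is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    c = len(x)
    c1 = int(round(alpha * c))
    return x[:c1], x[c1:]


# Toy feature map: C = 8 channels, each a 2 x 2 grid filled with the channel index.
feature = [[[float(ch)] * 2 for _ in range(2)] for ch in range(8)]
x1, x2 = split_channels(feature, alpha=0.75)
print(len(x1), len(x2))
```

With `alpha=0.75` and 8 channels, the first group receives 6 channels and the second receives 2.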

Orientation Transformation. To separately obtain spatial mappings of semantic and positional information, we send the obtained $X^{1}$ to the branch composed of a standard $3\times3$ convolution, where richer feature information is extracted on each channel; the result is represented as $X^{C}$ in Fig. 3. $X^{2}$ is sent to the branch composed of a point-wise convolution, which extracts relatively weak information and preserves a large amount of shallow spatial position information; the result is represented as $X^{S}$. This transformation process is represented by the formula:

【翻译】方向变换。为了分别获得语义和位置信息的空间映射，我们将获得的 $X^{1}$ 发送到由标准 $3\times3$ 卷积组成的分支，在每个通道上提取更丰富的特征信息，在图3中表示为 $X^{C}$ 。 $X^{2}$ 被发送到由逐点卷积组成的分支，逐点卷积提取相对较弱的信息，保留大量浅层空间位置信息，表示为 $X^{S}$ 。这个变换过程由公式表示：

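【Analysis】Orientation transformation is the second key step of FCM: each kind of information is sent down the path that suits it. $X^{1}$ passes through a standard $3\times3$ convolution, whose larger kernel captures a wider context and can learn more complex and abstract semantic features; the result is $X^{C}$, which can be read as the channel-rich feature. $X^{2}$ passes through a $1\times1$ point-wise convolution, which sees only one pixel at a time and therefore cannot mix in spatial context. It mainly acts as a feature transform that keeps the original spatial position information intact; the result is $X^{S}$, which can be read as the spatially preserved feature. This divide-and-conquer strategy lets both semantic and spatial information be retained and strengthened along the path best suited to each.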

$(X^{C},X^{S})=\phi_{1}(X^{1},X^{2}),$

where $\phi_1$ represents learning a mapping relationship between spatial and semantic information, $X^C \in \mathbb{R}^{C \times H \times W}$ contains rich channel information, and $X^S \in \mathbb{R}^{C \times H \times W}$ retains more original spatial location information.

【翻译】其中 $\phi_1$ 表示学习空间和语义信息之间的映射关系， $X^C \in \mathbb{R}^{C \times H \times W}$ 包含丰富的通道信息， $X^S \in \mathbb{R}^{C \times H \times W}$ 保留更多原始空间位置信息。

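【Analysis】The function $\phi_1$ is in essence a feature transformer that maps the input feature to specialized representations. The mapping is implemented with learnable convolutions, so during training the network learns how best to separate and strengthen the two kinds of information. Although $X^C$ and $X^S$ have the same shape, $C \times H \times W$, they differ in content: each channel of $X^C$ carries semantic features extracted by the $3\times3$ convolution, which express high-level information such as object class and attributes, while each spatial position of $X^S$ retains as much of the original positional information as possible, which is essential for precise localization. This sets up the complementary mapping that follows.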

Complementary Mapping. Currently, the features we obtain, $X^C$ and $X^S$ , while effective, are discrete. This can lead to imprecise matching of target features. Therefore, we perform complementary mapping between them to compensate for their respective missing feature mappings, achieving efficient feature matching. We take $X^C$ , which has richer channel information, into channel interaction. It can assign unique weights to the important information on each channel. This is then mapped to $X^S$ , which has low-level spatial location information features, for complementary feature fusion. This allows the information after interaction to obtain higher-level features. Similarly, $X^S$ with richer low-level spatial position information, through spatial interaction, assigns unique weights to the important information on each position, and maps it to $X^C$ with rich channel information features, to achieve complementary integration and obtain higher-level features. This process achieves the guidance of stronger features to guide weaker ones, thereby alleviating the problem of information imbalance.

【翻译】互补映射。目前，我们获得的特征 $X^C$ 和 $X^S$ 虽然有效，但是离散的。这可能导致目标特征的不精确匹配。因此，我们在它们之间执行互补映射来补偿各自缺失的特征映射，实现高效的特征匹配。我们将具有更丰富通道信息的 $X^C$ 进行通道交互。它可以为每个通道上的重要信息分配独特的权重。然后将其映射到具有低级空间位置信息特征的 $X^S$ ，进行互补特征融合。这使得交互后的信息获得更高级的特征。类似地，具有更丰富低级空间位置信息的 $X^S$ ，通过空间交互，为每个位置上的重要信息分配独特的权重，并将其映射到具有丰富通道信息特征的 $X^C$ ，以实现互补整合并获得更高级的特征。这个过程实现了更强特征指导较弱特征，从而缓解信息不平衡问题。

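【Analysis】Complementary mapping is the core of FCM. Semantic and spatial features each have strengths, but if they are not integrated effectively they work in isolation and detection performance hits a bottleneck. The problem is especially severe for small targets, where information is scarce and every bit must be used. Complementary mapping lets the stronger feature guide the weaker one. Channel interaction has the semantically rich $X^C$ produce channel attention weights that express which channels matter most; applying these weights to the spatial feature $X^S$ lets semantic information tell the spatial branch which channels to emphasize. Conversely, spatial interaction has $X^S$ produce spatial attention weights that tell $X^C$ which positions deserve attention. This two-way guidance lets each feature keep its own strengths and obtain what it lacks from the other.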

Channel Interaction: First, we use a depthwise convolution to perform convolution operations on each channel, cutting off the information between the channels, which is calculated as:

【翻译】通道交互：首先，我们使用深度卷积对每个通道执行卷积操作，切断通道之间的信息，计算如下：

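【Analysis】The first step of channel interaction is a depthwise convolution, in which each kernel handles exactly one input channel. This "cuts off the information between the channels" so that every channel is processed independently. The benefit is that each channel's feature stays separate and is not mixed with others too early. A standard convolution mixes all input channels, which learns cross-channel relations but can dilute or mask an important single-channel feature. Processing channels independently gives the subsequent global pooling and weight generation a cleaner per-channel signal.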

$X_{i}^{D}=\phi_{2}(k_{i},X_{i}^{C}),$

where $\phi_{2}$ represents the mapping of each feature layer channel, $k_{i}$ is the $i$-th convolution kernel, $X_{i}^{C}$ is the $i$-th input channel, and $X_{i}^{D}$ is the corresponding single output channel. The output result after depthwise convolution is $X^{D}\in\mathbb{R}^{C\times H\times W}$.

【翻译】其中 $\phi_{2}$ 表示每个特征层通道的映射， $k_{i}$ 是第 $i$ 个卷积核， $X_{i}^{C}$ 是第 $i$ 个输入通道， $X_{i}^{D}$ 是对应的单个输出通道，深度卷积后的输出结果是 $X^{D}\in\mathbb{R}^{C\times H\times W}$ 。

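【Analysis】This formula describes the depthwise convolution: each kernel $k_i$ is convolved only with its input channel $X_i^C$ to produce the output channel $X_i^D$. The function $\phi_2$ denotes this one-to-one mapping, which guarantees that channels are processed independently. The output $X^D$ has the same number of channels $C$ as the input, but each channel has undergone its own convolution, giving the subsequent global information extraction a finer-grained representation.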
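The one-kernel-per-channel behaviour can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the function name `depthwise_conv`, the zero padding, and the stride of 1 are assumptions, and the operation is the cross-correlation form that deep-learning frameworks call convolution.

```python
def depthwise_conv(x, kernels):
    """Apply one k x k kernel per channel with zero padding and stride 1.

    x: list of C channels, each an H x W list of lists.
    kernels: list of C kernels, each k x k with k odd. Channel i is only
    ever combined with kernel i, so nothing is mixed across channels.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w = len(channel), len(channel[0])
        k = len(kernel)
        pad = k // 2
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the map count as zero (zero padding).
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[di][dj] * channel[ii][jj]
                result[i][j] = acc
        out.append(result)
    return out


ones = [[1.0] * 3 for _ in range(3)]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box = [[1.0] * 3 for _ in range(3)]

x_c = [ones, ones]                      # two identical input channels
x_d = depthwise_conv(x_c, [identity, box])
print(x_d[0])                           # channel 0 passes through unchanged
print(x_d[1])                           # channel 1 is summed over each 3 x 3 window
```

Although both input channels are identical, the two outputs differ because each channel is filtered only by its own kernel.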

We then perform global average pooling to obtain the global information on each channel, and finally obtain the key information weights through the sigmoid layer. The unique weight $\omega_{1}\in\mathbb{R}^{C\times1\times1}$ generated on the channel can be represented as:

【翻译】然后执行全局平均池化以获得每个通道上的全局信息，最后通过sigmoid层获得关键信息权重。在通道上生成的唯一权重 $\omega_{1}\in\mathbb{R}^{C\times1\times1}$ 可以表示为：

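【Analysis】Global average pooling is the key step of the channel attention mechanism. It compresses the $H \times W$ spatial feature of each channel into a single value, the channel's average activation over the whole spatial extent. This is a global statistic that reflects how important the channel is for the current input. The sigmoid then converts these statistics into weights between 0 and 1 that express the relative importance of each channel. $\omega_1$ has shape $C\times1\times1$: every channel has its own weight, which scales that channel's contribution to the final feature representation.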

$\omega_{1}=\mathcal{R}\left(\frac{1}{H\times W}\sum_{i=0}^{H}\sum_{j=0}^{W}X^{D}(i,j)\right),$

where $\mathcal{R}$ represents an activation function.

【翻译】其中 $\mathcal{R}$ 表示激活函数。

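【Analysis】The formula gives the channel weights. The term $\frac{1}{H\times W}\sum_{i=0}^{H}\sum_{j=0}^{W}X^{D}(i,j)$ is the mean of each channel over the spatial dimensions, which is equivalent to global average pooling. The activation function $\mathcal{R}$ (a sigmoid) then turns the means into weights. The rationale is that a channel with high activation across the whole spatial extent probably carries important semantic information and should receive a larger weight, whereas a channel with low average activation contributes less to the task and should receive a smaller one. Weighting by global statistics highlights important features and suppresses redundant ones.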
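The channel-weight formula can be checked numerically with a short sketch. The function name `channel_weights` and the nested-list layout are illustrative assumptions; the sigmoid is used for $\mathcal{R}$ because the text names a sigmoid layer.

```python
import math


def channel_weights(x_d):
    """Global average pooling per channel followed by a sigmoid.

    x_d: list of C channels, each an H x W list of lists.
    Returns a list of C weights, one per channel, each in (0, 1).
    """
    weights = []
    for channel in x_d:
        h, w = len(channel), len(channel[0])
        mean = sum(sum(row) for row in channel) / (h * w)
        weights.append(1.0 / (1.0 + math.exp(-mean)))
    return weights


x_d = [
    [[0.0, 0.0], [0.0, 0.0]],      # mean 0
    [[2.0, 2.0], [2.0, 2.0]],      # mean 2
    [[-4.0, 0.0], [0.0, -4.0]],    # mean -2
]
omega_1 = channel_weights(x_d)
print(omega_1)
```

A channel with zero mean activation receives a weight of exactly 0.5, while channels with positive or negative mean activation receive weights above or below 0.5.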

Spatial interaction: To further aggregate spatial information, we adopt a simple design, as shown in Fig. 3, which consists of a $1\times1$ spatial convolution layer, BN (2015) and sigmoid. Finally, we generate a spatial attention map, which is similar to channel interaction, and map it to the branch that goes through the $3\times3$ standard convolution, making it more focused on spatial information. The spatial information weight $\omega_{2}$ generated can be calculated as:

【翻译】空间交互：为了进一步聚合空间信息，我们采用简单的设计，如图3所示，它由 $1\times1$ 空间卷积层、BN (2015)和sigmoid组成。最后，我们生成一个空间注意力图，这与通道交互类似，并将其映射到经过 $3\times3$ 标准卷积的分支，使其更专注于空间信息。生成的空间信息权重 $\omega_{2}$ 可以计算为：

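【Analysis】Spatial interaction uses a simple but effective design. The $1\times1$ convolution compresses and transforms the channel information, fusing the $C$ channels into a single-channel spatial attention map. Batch normalization (BN) stabilizes the features and helps training converge, and the sigmoid limits the output to between 0 and 1 so that it acts as an attention weight. The idea is that spatial positions differ in importance for detection: the spatial attention learns to highlight regions that contain target information and suppress background noise. Compared with more elaborate spatial attention designs, this simple structure balances computational cost and effectiveness well, which suits real-time detection.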

$\omega_{2}=\mathcal{R}(\mathcal{F}(X^{S})),$

where $\mathcal{F}$ represents convolution mapping with spatial aggregation, $\omega_2 \in \mathbb{R}^{1 \times H \times W}$.

【翻译】其中 $\mathcal{F}$ 表示具有空间聚合的卷积映射， $\omega_2 \in \mathbb{R}^{1 \times H \times W}$ 。

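【Analysis】The function $\mathcal{F}$ covers the convolutional part of spatial weight generation, that is, the $1\times1$ convolution followed by batch normalization, and $\mathcal{R}$ is the sigmoid applied afterwards. The input is the spatially preserved feature $X^S$, and the output is the spatial attention weight $\omega_2$. Its shape, $1 \times H \times W$, means that every spatial position receives one weight reflecting its importance for detection. In contrast to the $C \times 1 \times 1$ shape of the channel attention, spatial attention is about where rather than what. These spatial weights guide the semantic feature $X^C$ by indicating which regions deserve more attention.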
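As a rough illustration of how a $1\times1$ convolution collapses $C$ channels into one spatial map, the sketch below computes $\omega_2$ on a toy input. The name `spatial_weights`, the hand-picked convolution weights, and the omission of batch normalization are simplifying assumptions of this sketch; the paper's design includes BN between the convolution and the sigmoid.

```python
import math


def spatial_weights(x_s, conv_w, bias=0.0):
    """1 x 1 convolution that collapses C channels to one map, then a sigmoid.

    x_s: list of C channels, each an H x W list of lists.
    conv_w: one scalar weight per input channel (a 1 x 1 kernel).
    Batch normalization is deliberately left out of this sketch.
    Returns an H x W map of weights in (0, 1).
    """
    h, w = len(x_s[0]), len(x_s[0][0])
    omega = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            z = bias + sum(wc * ch[i][j] for wc, ch in zip(conv_w, x_s))
            omega[i][j] = 1.0 / (1.0 + math.exp(-z))
    return omega


x_s = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, -2.0]],
]
omega_2 = spatial_weights(x_s, conv_w=[1.0, 1.0])
print(omega_2)
```

Every spatial position receives a single weight regardless of the number of input channels, which matches the $1 \times H \times W$ shape stated for $\omega_2$.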

Feature Aggregation. After obtaining the channel information weight $\omega_1$ and spatial information weight $\omega_2$, they are respectively mapped to the features containing $X^S$ and $X^C$. Then the two branches are connected together to obtain the feature $X^{FCM}$, which contains features with dual mappings of spatial and semantic relationships. $X^{FCM}$ is calculated as:

$X^{FCM} = (X^C \otimes \omega_2) \oplus (X^S \otimes \omega_1)$

【翻译】特征聚合。在获得通道信息权重 $\omega_1$ 和空间信息权重 $\omega_2$ 后，它们分别映射到包含 $X^S$ 和 $X^C$ 的特征。然后将两个分支连接在一起，获得特征 $X^{FCM}$ ，它包含空间和语义关系双重映射的特征。 $X^{FCM}$ 计算为：

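【Analysis】Feature aggregation is the final step of FCM, where spatial and semantic information are fused. The key is that the weights are applied crosswise: the semantically rich feature $X^C$ is modulated by the spatial weight $\omega_2$, so spatial information tells the semantic feature which positions to attend to, and the spatially rich feature $X^S$ is modulated by the channel weight $\omega_1$, so semantic information tells the spatial feature which channels matter more. The symbol $\otimes$ denotes element-wise multiplication, the standard operation in attention mechanisms for scaling feature strength, and $\oplus$ denotes the fusion of the two branches. The resulting $X^{FCM}$ contains both spatially guided semantic information and semantically guided spatial information.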
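The crosswise weighting can be traced on a tiny example. The sketch below assumes that $\oplus$ is element-wise addition; the text only says the two branches are "connected together", so this choice, like the function name `fcm_aggregate`, is an assumption made for illustration.

```python
def fcm_aggregate(x_c, x_s, omega_1, omega_2):
    """Compute (x_c * omega_2) + (x_s * omega_1) element by element.

    x_c, x_s: C x H x W nested lists.
    omega_1: list of C channel weights, broadcast over all positions.
    omega_2: H x W spatial weights, broadcast over all channels.
    The '+' stands in for the fusion operator and is an assumption.
    """
    c, h, w = len(x_c), len(x_c[0]), len(x_c[0][0])
    return [
        [
            [x_c[ch][i][j] * omega_2[i][j] + x_s[ch][i][j] * omega_1[ch]
             for j in range(w)]
            for i in range(h)
        ]
        for ch in range(c)
    ]


x_c = [[[1.0, 2.0]], [[3.0, 4.0]]]        # C = 2, H = 1, W = 2
x_s = [[[10.0, 10.0]], [[20.0, 20.0]]]
omega_1 = [0.5, 0.25]                     # one weight per channel
omega_2 = [[1.0, 0.0]]                    # one weight per position
x_fcm = fcm_aggregate(x_c, x_s, omega_1, omega_2)
print(x_fcm)
```

At the second position, where the spatial weight is 0, only the channel-weighted $X^S$ term survives, which shows how each branch is gated by the weight derived from the other.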

Overall, the FCM module adopts an information complementary fusion method with relatively low computational resources. It propagates shallow-level spatial positional information into deeper layers of the network, alleviating the loss of object spatial location information in the backbone network downsampling process.

【翻译】总的来说，FCM模块采用了计算资源相对较低的信息互补融合方法。它将浅层空间位置信息传播到网络的更深层，缓解了骨干网络下采样过程中目标空间位置信息的丢失。

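【Analysis】By building a path from shallow to deep layers, FCM keeps spatial position information available across multiple network stages, which is essential for maintaining detection accuracy on small targets. Because the design is simple, it does not add much to the network's computational load.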

Multi-Kernel Perception Unit

Small targets in aerial images are often obscured by background noise, resulting in limited effective information. To fully leverage the available feature information, we employ a multi-kernel perception unit to detect targets at different scales and establish spatial relationships across these scales, thereby enhancing the feature representation of contextual and small target information. As illustrated in Fig. 3, the Multi-Kernel Perception Unit (MKP) concatenates convolutional kernels of various sizes sequentially, incorporating point-wise convolutions between kernels of different scales. The entire process can be mathematically represented as follows:

【翻译】航空图像中的小目标经常被背景噪声遮挡，导致有效信息有限。为了充分利用可用的特征信息，我们采用多核感知单元来检测不同尺度的目标，并建立这些尺度之间的空间关系，从而增强上下文和小目标信息的特征表示。如图3所示，多核感知单元（MKP）按顺序连接不同大小的卷积核，在不同尺度的卷积核之间融入逐点卷积。整个过程可以用数学表示如下：

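【Analysis】Small targets in aerial images face a particular challenge: because of the shooting altitude and complex backgrounds, they are often disturbed by background noise, so their effective feature information is scarce. MKP uses kernels of different sizes to capture features at different scales. Kernels of different sizes have different receptive fields: small kernels capture detail, and large kernels capture wider context. Chaining these kernels in a specific order and inserting point-wise ($1\times1$) convolutions between them fuses the multi-scale features. The point-wise convolutions transform features and exchange information across channels between scales. The design improves small-target detection accuracy and also establishes spatial relationships across scales, giving later processing a richer basis.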

$X^{\prime}=\mathcal{T}_{2k+1}\big(\mathcal{A}(\cdots\mathcal{A}(\mathcal{T}_{k}(X))\cdots)\big),$

where $X$ represents local features of the input, while $X^{\prime}$ represents globally mapped features across multiple scales. $\mathcal{T}_{k}$ represents depthwise convolution with kernel size $k$. In our experiment, we set $k = 3$. $\mathcal{A}$ represents point-wise convolution transformation.

【翻译】其中 $X$ 表示输入的局部特征，而 $X^{\prime}$ 表示跨多个尺度的全局映射特征。 $\mathcal{T}_{k}$ 表示核大小为 $k$ 的深度卷积。在我们的实验中，我们设置 $k = 3$ 。 $\mathcal{A}$ 表示逐点卷积变换。

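【Analysis】This formula describes the computation in MKP. The input feature $X$ first passes through the depthwise convolution $\mathcal{T}_k$ with kernel size $k$ and then through the point-wise convolution $\mathcal{A}$, and this pattern is repeated; the $\cdots$ in the formula stands for the repetition. The output $X^{\prime}$ is finally produced by the depthwise convolution $\mathcal{T}_{2k+1}$ with kernel size $2k+1$. The result is a progressive feature extraction chain whose kernels grow from $k=3$ to $2k+1=7$. The depthwise convolutions extract spatial features while keeping channels independent, and the point-wise convolutions fuse information across channels. Alternating the two keeps computation low while integrating multi-scale features, so $X^{\prime}$ contains feature information ranging from local detail to wider context.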
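To see how chaining kernels enlarges the receptive field, the sketch below applies the standard receptive-field formula for stride-1, dilation-1 convolutions. The kernel schedule 3, 5, 7 is an assumption: the formula only fixes the first kernel ($k$) and the last ($2k+1$), and the intermediate sizes are elided. Point-wise convolutions are ignored because a $1\times1$ kernel does not change the receptive field.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1, dilation-1 convolutions applied in sequence.

    Each k x k layer extends the receptive field by k - 1 pixels.
    """
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


k = 3
# Assumed schedule: odd kernel sizes from k up to 2k + 1, i.e. 3, 5, 7.
kernel_sizes = list(range(k, 2 * k + 2, 2))
fields = [
    stacked_receptive_field(kernel_sizes[:n])
    for n in range(1, len(kernel_sizes) + 1)
]
print(kernel_sizes, fields)
```

Under these assumptions the receptive field grows from 3 to 7 to 13 pixels along the chain, so the final output mixes evidence gathered at several spatial extents.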

Network Structure Design for Redundancy Reduction

Currently, real-time detection models are primarily designed for traditional low-resolution image detection, but this does not apply well to high-resolution aerial image detection, resulting in significant structural redundancy. For spatial downsampling in feature extraction, channel expansion precedes depthwise convolution sampling (2024). Post depthwise convolution, there is interference between channels, leading to a loss of spatial information, which is disadvantageous for detecting aerial images in complex environments.

【翻译】目前，实时检测模型主要是为传统的低分辨率图像检测而设计的，但这不能很好地适用于高分辨率航空图像检测，导致显著的结构冗余。在特征提取的空间下采样中，通道扩展先于深度卷积采样（2024）。深度卷积后，通道间存在干扰，导致空间信息丢失，这对在复杂环境中检测航空图像是不利的。

Table 1: Comparison of $\mathrm{AP}(\%)$ and Params/FPS on VisDrone by using our methods with different real-time object detectors.

【翻译】表1：在VisDrone上使用我们的方法与不同实时目标检测器的 $\mathrm{AP}(\%)$ 和参数/FPS比较。

Table 2: Comparison of $\mathrm{AP}(\%)$ with state-of-the-art detectors on VisDrone.

【翻译】表2：在VisDrone上与最先进检测器的 $\mathrm{AP}(\%)$ 比较。

However, we decouple this process by first applying group convolutions for spatial downsampling and then using point convolutions for channel expansion. The parameter calculations for both approaches are as follows:

【翻译】然而，我们通过首先应用组卷积进行空间下采样，然后使用点卷积进行通道扩展来解耦这个过程。两种方法的参数计算如下：

$P^{'}=3\times3\times C_{1}\times C_{2},$

$P=3\times3\times C_{1}\times\frac{C_{1}}{g}+1\times1\times C_{1}\times C_{2},$

where $P^{'}$ represents the parameter count for standard convolution, and $P$ represents the parameter count for our method. $C_{1}$ and $C_{2}$ denote the input and output channel numbers, respectively. During network downsampling, the channel expansion typically results in $C_{2}=2C_{1}$. $g$ represents the number of groups.

【翻译】其中 $P^{'}$ 表示标准卷积的参数数量， $P$ 表示我们方法的参数数量。 $C_{1}$ 和 $C_{2}$ 分别表示输入和输出通道数。在网络下采样过程中，通道扩展通常导致 $C_{2}=2C_{1}$ 。 $g$ 表示组数。
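The two parameter counts can be evaluated directly. The sketch below implements both formulas as written; the function names and the example width $C_{1}=64$ are illustrative assumptions, and biases are ignored as in the formulas.

```python
def standard_conv_params(c1: int, c2: int, k: int = 3) -> int:
    """P' = k * k * C1 * C2: a single standard convolution that downsamples and expands."""
    return k * k * c1 * c2


def decoupled_params(c1: int, c2: int, g: int, k: int = 3) -> int:
    """P = k * k * C1 * (C1 / g) + 1 * 1 * C1 * C2: group conv, then point-wise expansion."""
    if c1 % g != 0:
        raise ValueError("the number of groups g must divide C1")
    return k * k * c1 * (c1 // g) + c1 * c2


c1 = 64        # hypothetical input width
c2 = 2 * c1    # channel expansion during downsampling, C2 = 2 * C1
p_std = standard_conv_params(c1, c2)
for g in (1, c1):  # g = C1 corresponds to a depthwise convolution
    p = decoupled_params(c1, c2, g)
    print(f"g={g:>2}: P={p}, P'={p_std}, P/P'={p / p_std:.3f}")
```

With $C_{2}=2C_{1}$ the ratio simplifies to $P/P^{'}=\frac{1}{2g}+\frac{1}{9}$, so it does not depend on the channel width: about 0.61 for $g=1$ and approaching $1/9\approx0.11$ as $g$ grows.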

Experiments

Implementation Details

We conduct extensive experiments on three object detection benchmarks based on aerial images, i.e., VisDrone, UAVDT, and AI-TOD. All experiments are conducted on an NVIDIA GeForce RTX 4090 GPU, except that the inference speed is tested on a single RTX 3080 GPU. Our network is trained for 300 epochs using the stochastic gradient descent (SGD) optimizer with a momentum of 0.937, a weight decay of 0.0005, a batch size of 4, and an initial learning rate of 0.01.

【翻译】我们在三个基于航空图像的目标检测基准上进行了广泛的实验，即Visdrone、UAVDT和AI-TOD。所有实验都在NVIDIA GeForce RTX 4090 GPU上进行，除了推理速度是在单个RTX 3080 GPU上测试的。我们的网络使用随机梯度下降（SGD）优化器训练300个周期，动量为0.937，权重衰减为0.0005，批量大小为4，初始学习率为0.01。
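For reproduction, the reported settings can be gathered into one configuration object. This is only a sketch: the class and field names are hypothetical and not tied to any particular training framework; the values are the ones stated in the paragraph above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as reported in the text; field names are illustrative."""
    epochs: int = 300
    optimizer: str = "SGD"
    momentum: float = 0.937
    weight_decay: float = 0.0005
    batch_size: int = 4
    initial_lr: float = 0.01
    train_gpu: str = "NVIDIA GeForce RTX 4090"
    speed_test_gpu: str = "RTX 3080"  # inference speed is measured on a different GPU


cfg = TrainConfig()
for name, value in asdict(cfg).items():
    print(f"{name}: {value}")
```

Keeping the speed-test GPU as a separate field is a reminder that the FPS figures and the training runs come from different hardware.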

Results on Visdrone Dataset

State-of-the-art Comparison. As shown in Table 1, we compare FBRT-YOLO with existing real-time detectors. Our FBRT-YOLO achieves superior performance and faster detection efficiency across various model scales. For resource-limited aerial operation equipment, we demonstrate results of FBRT-YOLO models at various scales compared to other real-time state-of-the-art object detectors. For small models, FBRT-YOLO-N/S reduces parameter count by $72\%$ and $74\%$ respectively compared to YOLOv8-N/S, while achieving an improved detection accuracy of $0.6\%$ and $2.3\%$ in average precision (AP). For medium models, FBRT-YOLO-M reduces GFLOPs by $26\%$ and $23\%$ compared to YOLOv8-M and YOLOv9-M, respectively, while achieving improvements in AP of $1.3\%$ and $1.2\%$, respectively. For large models, compared to YOLOv8-X and YOLOv10-X, our FBRT-YOLO-X has $66\%$ and $23\%$ fewer parameters, respectively, and achieves a significant improvement in AP of $1.2\%$ and $1.4\%$. Moreover, compared to RT-DETR-R34/R50, FBRT-YOLO-M/L achieves fewer parameters, lower GFLOPs, higher detection speed, and better detection performance. These experimental results demonstrate the superiority of our FBRT-YOLO as a real-time aerial image detector.

【翻译】最先进方法比较。如表1所示，我们将FBRT-YOLO与现有的实时检测器进行比较。我们的FBRT-YOLO在各种模型规模上都实现了卓越的性能和更快的检测效率。对于资源有限的航空操作设备，我们展示了FBRT-YOLO模型在各种规模下与其他实时最先进目标检测器相比的结果。对于小模型，FBRT-YOLO-N/S相比YOLOv8-N/S分别减少了 $72\%$ 和 $74\%$ 的参数数量，同时在平均精度（AP）上分别实现了 $0.6\%$ 和 $2.3\%$ 的检测精度提升。对于中等模型，FBRT-YOLO-M相比YOLOv8-M和YOLOv9-M分别减少了 $26\%$ 和 $23\%$ 的GFLOPs，同时分别实现了 $1.3\%$ 和 $1.2\%$ 的AP改进。对于大型模型，相比YOLOv8-X和YOLOv10-X，我们的FBRT-YOLO-X分别显示出 $66\%$ 和 $23\%$ 更少的参数，并分别实现了 $1.2\%$ 和 $1.4\%$ 的AP显著改进。此外，相比RT-DETR-R34/R50，FBRT-YOLO-M/L实现了更少的参数、更低的GFLOPs、更高的检测速度和更好的检测性能。这些实验结果证明了我们的FBRT-YOLO作为实时航空图像检测器的优越性。

Table 2 compares our method with other state-of-the-art methods on VisDrone. The results indicate that our FBRT-YOLO can effectively detect objects in aerial images.

【翻译】如表2所示，它显示了我们的方法与其他最先进方法在VisDrone上的比较结果。这表明我们的FBRT-YOLO能够有效地检测航空图像。

Figure 4: Visualization of the detection results and heatmaps on VisDrone. The highlighted areas represent the regions that the network is focusing on.

【翻译】图4：在VisDrone上检测结果和热图的可视化。突出显示的区域表示网络关注的区域。

Table 3: Comparison of $\mathrm{AP}(\%)$ with state-of-the-art detectors on UAVDT.

【翻译】表3：在UAVDT上与最先进检测器的 $\mathrm{AP}(\%)$ 比较。

Table 4: Comparison of $\mathrm{AP}(\%)$ and Params/FPS on AI-TOD by using our methods with baseline.

【翻译】表4：在AI-TOD上使用我们的方法与基线的 $\mathrm{AP}(\%)$ 和参数/FPS比较。

Qualitative Results. To better demonstrate the superior performance of FBRT-YOLO in detecting aerial images, we visualize the heatmaps of both the baseline model and our method in Fig. 4. From the results, we observe that FBRT-YOLO enhances focus on small and densely packed targets, showcasing the method’s superiority in enhancing spatial and multiscale information within the network.

【翻译】定性结果。为了更好地展示FBRT-YOLO在检测航空图像方面的卓越性能，我们在图4中可视化了基线模型和我们方法的热图。从结果中，我们观察到FBRT-YOLO增强了对小型和密集目标的关注，展示了该方法在增强网络内空间和多尺度信息方面的优越性。

Results on UAVDT Dataset

Quantitative Result. Table 3 reports our comparison results on the UAVDT dataset. Our proposed method surpasses existing methods, such as GLSAN (2020) and CEASC (2023). The results clearly show that our proposed FBRT-YOLO achieves superior performance with an AP of $18.4\%$, outperforming other state-of-the-art methods in aerial image detection. This demonstrates the effectiveness of our detection framework.

【翻译】定量结果。表3报告了我们在UAVDT数据集上的比较结果。我们提出的方法超越了现有方法，如GLSAN（2020）和CEASC（2023）。结果清楚地表明，我们提出的FBRT-YOLO以 $18.4\%$ 的AP实现了卓越性能，在航空图像检测中优于其他最先进的方法。这证明了我们检测框架的有效性。

Table 5: Ablation on FCM, MKP and RR Module on VisDrone. ‘RR’ represents operations aimed at reducing inherent redundancy in the network. Replace the final downsampling layer with MKP and remove the corresponding detection head.

【翻译】表5：在VisDrone上对FCM、MKP和RR模块的消融实验。'RR’表示旨在减少网络固有冗余的操作。用MKP替换最终下采样层并移除相应的检测头。

Qualitative Results. A complex background can significantly limit the effective information about the target. Our method focuses on effectively propagating the spatial information of the target through network layers to enhance feature representation. Visualization of detection results, as shown in Fig. 5, indicates that our method significantly improves detection performance in complex backgrounds.

【翻译】定性结果。复杂的背景会显著限制关于目标的有效信息。我们的方法专注于通过网络层有效传播目标的空间信息以增强特征表示。如图5所示的检测结果可视化证明，我们的方法显著改善了复杂背景下的检测性能。

Results on AI-TOD Dataset

The AI-TOD dataset contains a significant proportion of small objects. To better validate the superiority of our method in small object detection, we also evaluate FBRT-YOLO on AI-TOD. As reported in Table 4, our method reduces the parameter count by $74\%$ and the GFLOPs by $20\%$, while achieving a $2.2\%$ increase in $\mathrm{AP}_{50}$ and a $1.1\%$ increase in AP compared to the baseline.

【翻译】AI-TOD数据集包含很大比例的小目标。为了更好地验证我们的方法在小目标检测方面的优越性，我们还在AI-TOD上评估了FBRT-YOLO。如表4所示，与基线相比，我们的方法减少了 $74\%$ 的参数数量，减少了 $20\%$ 的GFLOPs，同时在 $\mathrm{AP}_{50}$ 上实现了 $2.2\%$ 的增长，在AP上实现了 $1.1\%$ 的增长。

Ablation Study

To validate the effectiveness of the core module design in FBRT-YOLO, we design a series of ablation experiments on the VisDrone dataset. We use YOLOv8-S as the baseline model in all the ablation experiments.

【翻译】为了验证FBRT-YOLO中核心模块设计的有效性，我们在VisDrone数据集上设计了一系列消融实验。在所有消融实验中，我们使用YOLOv8-S作为基线模型。

Effect of Key Components. Experimental results in Table 5 exhibit the effectiveness of all contributions in this work. We reduce inherent redundancy in the baseline model, optimize it, and achieve an $18\%$ reduction in parameters and an $11\%$ decrease in computational load, albeit with a slight decrease in accuracy. Introducing the FCM module into various stages of the backbone network incorporates spatial positional information in deeper layers, resulting in a $1.4\%$ increase in $\mathrm{AP}_{50}$ and further reducing network computational resources. We replace the downsampling operation of the backbone network’s final layer with MKP units to detect targets at multiple scales, thereby increasing AP by $1.6\%$. It is worth noting that our network converges faster during the training process compared to the baseline network.

【翻译】关键组件的效果。表5中的实验结果展示了本工作中所有贡献的有效性。我们减少了基线模型中的固有冗余，对其进行优化，实现了 $18\%$ 的参数减少和 $11\%$ 的计算负载降低，尽管准确性略有下降。将FCM模块引入骨干网络的各个阶段在更深层中纳入了空间位置信息，导致 $\mathrm{AP}_{50}$ 增加 $1.4\%$ 并进一步减少网络计算资源。我们用MKP单元替换骨干网络最终层的下采样操作以检测多尺度目标，从而将AP增加 $1.6\%$ 。值得注意的是，与基线网络相比，我们的网络在训练过程中收敛更快。

Table 6: The experiment validates the optimal configuration of the mapping relationship.

【翻译】表6：实验验证了映射关系的最优配置。

Figure 5: Visualizations of the detection results of baseline and our proposed method under low light and similar background conditions on UAVDT. The blue boxes represent the prediction results using the baseline model, while the red boxes represent the prediction results using our method.

【翻译】图5：在UAVDT上低光照和相似背景条件下基线和我们提出方法的检测结果可视化。蓝色框代表使用基线模型的预测结果，而红色框代表使用我们方法的预测结果。

Effect of Mapping Relationship. Table 6 shows the results of the proposed channel and spatial complementary mapping. In order to obtain the optimal configuration of two mapping relationships, we design a series of variant experiments. According to the experimental results, we find that models using channel or spatial mapping are superior to models without mapping relationships. Combining the two can achieve better results. Compared with the model without mapping relationships, this optimal configuration has improved $\mathrm{AP}_{50}$ by $2.0\%$.

【翻译】映射关系的效果。表6显示了提出的通道和空间互补映射的结果。为了获得两种映射关系的最优配置，我们设计了一系列变体实验。根据实验结果，我们发现使用通道或空间映射的模型优于没有映射关系的模型。将两者结合可以获得更好的结果。与没有映射关系的模型相比，这种最优配置将 $\mathrm{AP}_{50}$ 提高了 $2.0\%$ 。

Effect of Split Ratio. Table 7 shows the impact of different parameters $\alpha$ on the experimental results, where $\alpha$ represents the split ratio of spatial feature information and channel feature information. From the experimental results, we can see that as the downsampling process progresses, the proportion of the spatial feature part (undergoing point-wise convolution) increases, and the experimental effect will be better. We speculate that the reason for this phenomenon is that when $\alpha$ takes the values 0.75, 0.75, 0.25, 0.25, it retains more spatial location information in deeper networks, which is beneficial for the localization and matching of target features. Retaining more spatial position information in a deeper network is also consistent with the original intention of the FCM module design.

【翻译】分割比例的效果。表7显示了不同参数 $\alpha$ 对实验结果的影响，其中 $\alpha$ 表示空间特征信息和通道特征信息的分割比例。从实验结果中，我们可以看到随着下采样过程的进行，空间特征部分（经过逐点卷积）的比例增加，实验效果会更好。我们推测这种现象的原因是当 $\alpha$ 取值0.75、0.75、0.25、0.25时，它在更深的网络中保留了更多的空间位置信息，这有利于目标特征的定位和匹配。在更深的网络中保留更多的空间位置信息也符合FCM模块设计的初衷。
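The effect of the split ratio on branch widths can be tabulated with a few lines of code. This sketch rests on two assumptions that the passage does not pin down: that $\alpha$ is the fraction of channels routed to the channel-information branch, with the remaining $1-\alpha$ going to the spatial (point-wise) branch, and that the four backbone stages have widths 64, 128, 256, and 512. The first assumption is chosen because it matches the observation that the spatial share grows as $\alpha$ drops from 0.75 to 0.25; the stage widths and the function name are purely hypothetical.

```python
def split_channels(channels: int, alpha: float) -> tuple[int, int]:
    """Split a channel count into (channel-branch width, spatial-branch width).

    Assumes alpha is the share given to the channel branch; the remainder
    goes to the spatial (point-wise) branch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    channel_part = int(round(alpha * channels))
    return channel_part, channels - channel_part


stage_widths = (64, 128, 256, 512)   # hypothetical backbone stage widths
alphas = (0.75, 0.75, 0.25, 0.25)    # per-stage split ratios quoted in the text
splits = [split_channels(c, a) for c, a in zip(stage_widths, alphas)]
for c, a, (ch, sp) in zip(stage_widths, alphas, splits):
    print(f"C={c:>3}, alpha={a}: channel branch={ch:>3}, spatial branch={sp:>3} ({sp / c:.0%})")
```

Under these assumptions the spatial branch holds 25% of the channels in the first two stages and 75% in the last two, which is the trend the paragraph describes.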

Table 7: Experimental verification of the impact of the spatial and channel feature partition ratio of $\alpha$ in the FCM module at each stage of the backbone network.

【翻译】表7：在骨干网络各阶段FCM模块中空间和通道特征分割比例 $\alpha$ 影响的实验验证。

Table 8: Experiments with different kernel size in MKP.

【翻译】表8：MKP中不同核大小的实验。

Effect of Kernel Size. Table 8 shows the experimental results of different kernel sizes in MKP. From the experimental results, it can be observed that smaller kernels provide limited receptive fields for the network, failing to establish strong contextual associations, while larger kernels introduce significant background noise, which is detrimental to detection. By using convolutional kernels of varying sizes, we capture multi-scale features of targets spanning different sizes. Additionally, we introduce point-wise convolutions between different kernel sizes to integrate spatial information across scales, achieving optimal performance.

【翻译】核大小的效果。表8显示了MKP中不同核大小的实验结果。从实验结果中可以观察到，较小的核为网络提供了有限的感受野，无法建立强的上下文关联，而较大的核引入了显著的背景噪声，这对检测有害。通过使用不同大小的卷积核，我们捕获了跨越不同尺寸的目标的多尺度特征。此外，我们在不同核大小之间引入逐点卷积以整合跨尺度的空间信息，实现最优性能。
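The receptive-field argument can be checked with the standard recurrence for stacked convolutions. The sketch below assumes stride-1, undilated convolutions and kernel sizes 3, 5, and 7 for the mixed stack (the middle value is an assumption); point-wise $1\times1$ layers are omitted because they do not enlarge the receptive field.

```python
def receptive_field(kernels, strides=None) -> int:
    """Receptive field (in input pixels) of a stack of convolutions without dilation."""
    if strides is None:
        strides = [1] * len(kernels)
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


for name, kernels in [
    ("single 3x3", (3,)),
    ("three 3x3 layers", (3, 3, 3)),
    ("single 7x7", (7,)),
    ("3x3 -> 5x5 -> 7x7", (3, 5, 7)),  # assumed mixed-kernel stack
]:
    print(f"{name:<20} receptive field = {receptive_field(kernels)}")
```

With stride 1, three $3\times3$ layers reach a $7\times7$ field, the same as one $7\times7$ layer, while the mixed stack reaches $13\times13$ and still exposes the intermediate scales to the point-wise layers in between.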

Conclusion

In this paper, we propose a new family of real-time detectors for aerial image detection, named FBRT-YOLO. Specifically, it introduces two lightweight modules: the Feature Complementary Mapping Module (FCM), which aims to improve the fusion of rich semantic information with precise spatial location details, and the Multi-Kernel Perception Unit (MKP), which enhances multi-scale target perception and improves the network’s ability to capture features across varying scales. For aerial image detection, we also reduce the inherent redundancies in conventional detectors, further accelerating the network. Extensive experimental results on the VisDrone, UAVDT, and AI-TOD datasets demonstrate that FBRT-YOLO achieves a highly balanced trade-off between accuracy and efficiency in aerial image detection.

【翻译】在本文中，我们提出了一种新的航空图像检测实时检测器系列，命名为FBRT-YOLO。具体而言，它引入了两个轻量级模块：特征互补映射模块（FCM），旨在改善丰富语义信息与精确空间位置细节的融合，以及多核感知单元（MKP），增强多尺度目标感知并改善网络捕获不同尺度特征的能力。对于航空图像检测，我们还减少了传统检测器中的固有冗余，进一步加速了网络。在VisDrone、UAVDT和AI-TOD数据集上的大量实验结果表明，FBRT-YOLO在航空图像检测的准确性和效率之间实现了高度平衡的权衡。
