
P$^2$HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion

19 May 2025


Main: 9 pages · 6 figures · 6 tables · Bibliography: 3 pages · Appendix: 1 page

Abstract

Feature fusion plays a pivotal role in achieving high performance in vision models, yet existing attention-based fusion techniques often suffer from substantial computational overhead and implementation complexity, particularly in resource-constrained settings. To address these limitations, we introduce the Plug-and-Play Hierarchical C2F Transformer (P$^2$HCT), a lightweight module that combines coarse-to-fine token selection with shared attention parameters to preserve spatial details while reducing inference cost. P$^2$HCT can be trained using coarse attention alone and seamlessly activated at inference to enhance accuracy without retraining. Integrated into real-time detectors such as YOLOv11-N/S/M, P$^2$HCT achieves mAP gains of 0.9%, 0.5%, and 0.4%, respectively, on MS COCO with minimal latency increase. Similarly, embedding P$^2$HCT into ResNet-18/50/101 backbones improves ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results underscore P$^2$HCT's effectiveness as a hardware-friendly and general-purpose enhancement for both detection and classification tasks.
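The abstract does not spell out the mechanism, so the following is only a rough sketch of what coarse-to-fine token selection with shared attention parameters could look like, not the authors' implementation. It assumes a 1D token sequence, average pooling to form coarse tokens, top-k selection of coarse blocks, and single-head scaled dot-product attention. The names `c2f_attention`, `window`, `top_k`, and `use_fine` are hypothetical. The `use_fine` flag loosely mirrors the idea of running coarse attention alone and switching on the fine path later with the same weights.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _matvec(W, x):
    # Apply a (d_out x d_in) projection matrix to one token.
    return [_dot(row, x) for row in W]

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def _attend(q, keys, values):
    # Single-head scaled dot-product attention for one query.
    scale = 1.0 / math.sqrt(len(q))
    w = _softmax([_dot(q, k) * scale for k in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

def c2f_attention(tokens, Wq, Wk, Wv, window=2, top_k=1, use_fine=True):
    """Hypothetical coarse-to-fine attention over a 1D token list.

    The same Wq/Wk/Wv are used on both levels (shared parameters).
    """
    n = len(tokens)
    # Coarse tokens: average-pool consecutive blocks of `window` tokens.
    blocks = [tokens[i:i + window] for i in range(0, n, window)]
    coarse = [[sum(col) / len(b) for col in zip(*b)] for b in blocks]

    Kc = [_matvec(Wk, c) for c in coarse]
    Vc = [_matvec(Wv, c) for c in coarse]
    Kf = [_matvec(Wk, t) for t in tokens]
    Vf = [_matvec(Wv, t) for t in tokens]

    out = []
    for t in tokens:
        q = _matvec(Wq, t)
        if not use_fine:
            # Coarse-only path: attend over pooled tokens.
            out.append(_attend(q, Kc, Vc))
            continue
        # Rank coarse blocks by query-key score and keep the top_k.
        scores = [_dot(q, k) for k in Kc]
        top = sorted(range(len(Kc)), key=lambda i: scores[i],
                     reverse=True)[:top_k]
        # Fine path: attend only over tokens inside the selected blocks.
        idx = [j for b in sorted(top)
               for j in range(b * window, min((b + 1) * window, n))]
        out.append(_attend(q, [Kf[j] for j in idx], [Vf[j] for j in idx]))
    return out

if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    print(c2f_attention(toks, eye, eye, eye, use_fine=False))
    print(c2f_attention(toks, eye, eye, eye, use_fine=True))
```

In this sketch each query in the fine path attends over `top_k * window` tokens rather than all `n`, which is where a cost reduction would come from. Setting `top_k` to the number of blocks recovers full attention.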
