
P²HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion

Main: 9 pages
6 figures
6 tables
Bibliography: 3 pages
Appendix: 1 page
Abstract

Feature fusion plays a pivotal role in achieving high performance in vision models, yet existing attention-based fusion techniques often incur substantial computational overhead and implementation complexity, particularly in resource-constrained settings. To address these limitations, we introduce the Plug-and-Play Hierarchical C2F Transformer (P²HCT), a lightweight module that combines coarse-to-fine token selection with shared attention parameters to preserve spatial detail while reducing inference cost. P²HCT can be trained using coarse attention alone, and the fine stage can be seamlessly activated at inference to enhance accuracy without retraining. Integrated into real-time detectors such as YOLOv11-N/S/M, P²HCT achieves mAP gains of 0.9%, 0.5%, and 0.4% on MS COCO with minimal added latency. Similarly, embedding P²HCT into ResNet-18/50/101 backbones improves ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results underscore P²HCT's effectiveness as a hardware-friendly, general-purpose enhancement for both detection and classification.
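The paper itself is not reproduced here, but the mechanism the abstract describes — coarse attention over pooled tokens, with a fine stage that refines only the most attended regions while reusing the same projection weights — can be illustrated with a minimal sketch. Everything below (class name, pooling factor, top-k selection rule) is a hypothetical reconstruction for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CoarseToFineAttention:
    """Hypothetical sketch of coarse-to-fine attention with shared weights."""

    def __init__(self, dim, pool=4, topk=2):
        self.pool = pool  # tokens averaged into each coarse token
        self.topk = topk  # coarse regions refined at fine resolution
        # One shared set of Q/K/V projections, reused at both scales
        self.Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def _attend(self, queries, keys_values):
        q = queries @ self.Wq
        k = keys_values @ self.Wk
        v = keys_values @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return attn, attn @ v

    def __call__(self, x, use_fine=True):
        n, d = x.shape
        # Coarse stage: average-pool groups of `pool` tokens
        coarse = x.reshape(n // self.pool, self.pool, d).mean(axis=1)
        attn, out_coarse = self._attend(x, coarse)
        if not use_fine:
            # Training-time path: coarse attention only
            return out_coarse
        # Fine stage: refine only the top-k coarse regions,
        # reusing the same Wq/Wk/Wv at full token resolution
        top = np.argsort(attn.mean(axis=0))[-self.topk:]
        fine = np.concatenate(
            [x[i * self.pool:(i + 1) * self.pool] for i in top]
        )
        _, out_fine = self._attend(x, fine)
        return out_coarse + out_fine
```

Usage mirrors the train/inference split the abstract claims: call with `use_fine=False` during training, then flip the flag at inference — the output shape and the learned weights are unchanged either way, which is what makes the module "plug-and-play".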
