T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

28 February 2025
Yifei Qian
Zhongliang Guo
Bowen Deng
Chun Tong Lei
Shuai Zhao
Chun Pong Lau
Xiaopeng Hong
Michael P. Pound
Abstract

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at this https URL.
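
The abstract gives no implementation details, but the core mechanism it describes, reading a cross-attention map off a single denoising pass and using it as a spatial supervision signal for counting, can be sketched roughly as follows. This is a minimal, hypothetical illustration in plain PyTorch, not the authors' code: the function names, tensor shapes, and the simplified region-coherence term are assumptions made for exposition only.

import torch
import torch.nn.functional as F

def cross_attention_map(image_feats, text_feats, head_dim=64):
    # image_feats: (B, HW, C) spatial features from one U-Net denoising step
    # text_feats:  (B, T, C)  prompt token embeddings from a text encoder
    # Returns (B, T, HW): how strongly each text token attends to each location.
    attn_logits = text_feats @ image_feats.transpose(1, 2) / head_dim ** 0.5
    return torch.softmax(attn_logits, dim=-1)

def regional_coherence_loss(density_pred, attn_map, token_idx, hw):
    # density_pred: (B, 1, H, W) predicted density map from the counting head
    # attn_map:     (B, T, H*W)  output of cross_attention_map()
    # token_idx:    index of the object-category token in the prompt
    # Penalize density mass that falls outside the region the object token attends to.
    h, w = hw
    region = attn_map[:, token_idx].reshape(-1, 1, h, w)
    region = region / (region.amax(dim=(2, 3), keepdim=True) + 1e-6)  # normalize to [0, 1]
    return F.mse_loss(density_pred * (1.0 - region), torch.zeros_like(density_pred))

In a training loop along these lines, this region term would be combined with a standard counting loss, so that the cross-attention map steers the density prediction toward the prompted category.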

@article{qian2025_2502.20625,
  title={T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting},
  author={Yifei Qian and Zhongliang Guo and Bowen Deng and Chun Tong Lei and Shuai Zhao and Chun Pong Lau and Xiaopeng Hong and Michael P. Pound},
  journal={arXiv preprint arXiv:2502.20625},
  year={2025}
}