SPKLIP: Aligning Spike Video Streams with Natural Language

19 May 2025
Yongchang Gao, Meiling Jin, Zhaofei Yu, Tiejun Huang, Guozhang Chen
CLIP · VLM
Abstract

Spike cameras offer unique sensing capabilities, but their sparse, asynchronous output challenges semantic understanding, especially for Spike Video-Language Alignment (Spike-VLA), where models like CLIP underperform due to modality mismatch. We introduce SPKLIP, the first architecture designed specifically for Spike-VLA. SPKLIP employs a hierarchical spike feature extractor that adaptively models multi-scale temporal dynamics in event streams, and uses spike-text contrastive learning to directly align spike video with language, enabling effective few-shot learning. A full-spiking visual encoder variant, integrating SNN components into our pipeline, demonstrates enhanced energy efficiency. Experiments show state-of-the-art performance on benchmark spike datasets and strong few-shot generalization on a newly contributed real-world dataset. SPKLIP's energy efficiency highlights its potential for neuromorphic deployment, advancing event-based multimodal research. The source code and dataset are available at [link removed for anonymity].
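
The abstract describes spike-text contrastive learning as a CLIP-style alignment between spike-video and caption embeddings. The sketch below is a minimal, generic symmetric InfoNCE objective in PyTorch, assuming hypothetical spike_emb and text_emb tensors produced by the visual and text encoders; it illustrates the kind of alignment objective involved, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def spike_text_contrastive_loss(spike_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired spike-video and text embeddings.

    spike_emb: (B, D) embeddings from a spike visual encoder (hypothetical).
    text_emb:  (B, D) embeddings from a text encoder (hypothetical).
    """
    # L2-normalize so the dot product becomes cosine similarity.
    spike_emb = F.normalize(spike_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = spike_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (spike->text and text->spike) and average.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)

if __name__ == "__main__":
    batch, dim = 8, 512
    spike_emb = torch.randn(batch, dim)  # stand-in for hierarchical spike features
    text_emb = torch.randn(batch, dim)   # stand-in for caption embeddings
    print(spike_text_contrastive_loss(spike_emb, text_emb))
```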

@article{gao2025_2505.12656,
  title={SPKLIP: Aligning Spike Video Streams with Natural Language},
  author={Yongchang Gao and Meiling Jin and Zhaofei Yu and Tiejun Huang and Guozhang Chen},
  journal={arXiv preprint arXiv:2505.12656},
  year={2025}
}