Mamba Fusion: Learning Actions Through Questioning

17 September 2024

Irfan Essa

Papers citing "Mamba Fusion: Learning Actions Through Questioning"

ChromouVQA: Benchmarking Vision-Language Models under Chromatic Camouflaged Images

193

30 Nov 2025

Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models

Wangjiaxuan Xin

LLMAG

300

24 Nov 2025

Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering

Elman Ghazaei

Erchan Aptoula

275

12 Aug 2025

HierSum: A Global and Local Attention Mechanism for Video Summarization

Apoorva Beedu

Irfan Essa

918

25 Apr 2025

Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition -- And Ways to Overcome Them

AAAI Conference on Artificial Intelligence (AAAI), 2024

317

21 Aug 2024

Fusion-Mamba for Cross-modality Object Detection

Xuhui Liu

373

14 Apr 2024

MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

222

12 Apr 2024

VideoMamba: State Space Model for Efficient Video Understanding

European Conference on Computer Vision (ECCV), 2024

Yu Qiao

353

459

11 Mar 2024

On the Efficacy of Text-Based Input Modalities for Action Anticipation

Apoorva Beedu

Karan Samel

Irfan Essa

458

23 Jan 2024

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

International Conference on Machine Learning (ICML), 2024

547

1,627

17 Jan 2024

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu

Tri Dao

Mamba

809

6,333

01 Dec 2023

Training a Large Video Model on a Single Machine in a Day

Yue Zhao

Philipp Krähenbühl

VLM

309

28 Sep 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

IEEE International Conference on Computer Vision (ICCV), 2023

427

149

11 Jul 2023

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

International Conference on Machine Learning (ICML), 2023

...

Christoph Feichtenhofer

3DH

448

366

01 Jun 2023

Learning Video Representations from Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2022

442

245

08 Dec 2022

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

227

21 Nov 2022

End-to-End Multimodal Representation Learning for Video Dialog

251

26 Oct 2022

Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

240

23 Oct 2022

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

ACM Multimedia (ACM MM), 2022

Ji Zhang

309

432

15 Jul 2022

Long Movie Clip Classification with State-Space Video Models

European Conference on Computer Vision (ECCV), 2022

Md. Mohaiminul Islam

Gedas Bertasius

VLM

481

145

04 Apr 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Computer Vision and Pattern Recognition (CVPR), 2022

Christoph Feichtenhofer

ViT

522

261

20 Jan 2022

Omnivore: A Single Model for Many Visual Modalities

Computer Vision and Pattern Recognition (CVPR), 2022

Rohit Girdhar

Mannat Singh

Nikhil Ravi

Laurens van der Maaten

Armand Joulin

Ishan Misra

705

299

20 Jan 2022

Efficiently Modeling Long Sequences with Structured State Spaces

International Conference on Learning Representations (ICLR), 2021

Albert Gu

Karan Goel

Christopher Ré

1.2K

3,295

31 Oct 2021

Wav2CLIP: Learning Robust Audio Representations From CLIP

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021

410

336

21 Oct 2021

Ego4D: Around the World in 3,000 Hours of Egocentric Video

...

Antonio Torralba

Mingfei Yan

1.2K

1,646

13 Oct 2021

Object-Region Video Transformers

414

100

13 Oct 2021

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Neural Information Processing Systems (NeurIPS), 2021

Ishan Misra

Florian Metze

Christoph Feichtenhofer

Andrea Vedaldi

João F. Henriques

389

349

09 Jun 2021

Learning Transferable Visual Models From Natural Language Supervision

International Conference on Machine Learning (ICML), 2021

...

2.2K

46,392

26 Feb 2021

Rescaling Egocentric Vision

International Journal of Computer Vision (IJCV), 2020

Dima Damen

...

641

629

23 Jun 2020

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Luke Zettlemoyer

6.0K

28,988

26 Jul 2019

Attention Is All You Need

Neural Information Processing Systems (NeurIPS), 2017

8.3K

172,602

12 Jun 2017