arXiv:2405.07451
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
13 May 2024
Yuanyuan Jiang
Jianqin Yin
Papers citing "CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering"
6 / 6 papers shown
Title
Question-Aware Gaussian Experts for Audio-Visual Question Answering
Hongyeob Kim
Inyoung Jung
Dayoon Suh
Youjia Zhang
Sangmin Lee
Sungeun Hong
58
0
0
06 Mar 2025
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization
Yuanyuan Jiang
Jianqin Yin
Yonghao Dang
27
4
0
11 Oct 2022
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
398
532
0
21 Jul 2020
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky
Jia Deng
Hao Su
J. Krause
S. Satheesh
...
A. Karpathy
A. Khosla
Michael S. Bernstein
Alexander C. Berg
Li Fei-Fei
VLM
ObjD
279
39,083
0
01 Sep 2014
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
228
29,632
0
16 Jan 2013