Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2310.05863
Cited By
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
9 October 2023
Guangzhi Sun
Wenyi Yu
Changli Tang
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models"
15 / 15 papers shown
Title
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
Zhenghao Xing
Xiaowei Hu
Chi-Wing Fu
W. Wang
Jifeng Dai
Pheng-Ann Heng
MLLM
OffRL
VLM
LRM
45
0
0
07 May 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
58
2
0
02 Apr 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Y. Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
50
0
0
08 Mar 2025
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng
Yun Xing
Zesen Cheng
Yang Zhou
Hang Zhang
Xin Li
Deli Zhao
Shijian Lu
Chunyan Miao
Lidong Bing
25
7
0
16 Oct 2024
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun
Wenyi Yu
Changli Tang
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Yuxuan Wang
Chao Zhang
45
22
0
22 Jun 2024
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models
Guangzhi Sun
Potsawee Manakul
Adian Liusie
Kunat Pipatanakul
Chao Zhang
P. Woodland
Mark J. F. Gales
HILM
MLLM
16
7
0
22 May 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
77
4
0
08 Feb 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
47
81
0
29 Dec 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
25
19
0
09 Nov 2023
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang
Wenyi Yu
Guangzhi Sun
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
LM&MA
AuLLM
28
195
0
20 Oct 2023
VideoLLM: Modeling Video Sequence with Large Language Models
Guo Chen
Yin-Dong Zheng
Jiahao Wang
Jilan Xu
Yifei Huang
...
Yi Wang
Yali Wang
Yu Qiao
Tong Lu
Limin Wang
MLLM
92
76
0
22 May 2023
Instruction Tuning with GPT-4
Baolin Peng
Chunyuan Li
Pengcheng He
Michel Galley
Jianfeng Gao
SyDa
ALM
LM&MA
157
576
0
06 Apr 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
224
1,017
0
13 Oct 2021
1