2309.05519
Cited By
v1
v2
v3 (latest)
NExT-GPT: Any-to-Any Multimodal LLM
International Conference on Machine Learning (ICML), 2024
11 September 2023
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
HuggingFace (78 upvotes)
Papers citing
"NExT-GPT: Any-to-Any Multimodal LLM"
50 / 240 papers shown
Title
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Longji Xu
Shengqiong Wu
Yujiao Shi
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
481
98
0
16 Mar 2025
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
354
27
0
13 Mar 2025
TA-V2A: Textually Assisted Video-to-Audio Generation
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Yuhuan You
Xihong Wu
T. Qu
DiffM
219
3
0
12 Mar 2025
Learning to Match Unpaired Data with Minimum Entropy Coupling
Mustapha Bounoua
Giulio Franzese
Pietro Michiardi
275
2
0
11 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi Han
Yandong Tang
Liangqiong Qu
340
0
0
10 Mar 2025
LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Bangyan Li
Hao Wu
Chunjiang Ge
Longji Xu
Shaohui Lin
...
Ling You
Yinqi Zhang
Ke Li
Xing Sun
Yan Sun
172
3
0
10 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Sihao Lin
Chunwei Wang
Xiuwei Chen
Hongbin Xu
Jiawei Han
Xiandan Liang
J. N. Han
Hang Xu
Xiaodan Liang
VLM
611
14
0
09 Mar 2025
ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task
Vittorio Pippi
Matthieu Guillaumin
S. Cascianelli
Rita Cucchiara
M. Jaritz
Loris Bazzani
171
0
0
06 Mar 2025
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
Zhifei Xie
Mingbao Lin
Ziqiang Liu
Pengcheng Wu
Shuicheng Yan
Chunyan Miao
AuLLM
OffRL
LRM
321
63
0
04 Mar 2025
Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Tianjie Ju
Yi Hua
Hao Fei
Zhenyu Shao
Yubin Zheng
Haodong Zhao
Yang Deng
Wynne Hsu
Zhuosheng Zhang
Gongshen Liu
370
1
0
03 Mar 2025
Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models
Yi Wang
Mushui Liu
Wanggui He
Longxiang Zhang
...
Weilong Dai
Mingli Song
Hao Jiang
Jie Song
MLLM
MoE
LRM
325
13
0
03 Mar 2025
GPIoT: Tailoring Small Language Models for IoT Program Synthesis and Development
ACM International Conference on Embedded Networked Sensor Systems (SenSys), 2025
Leming Shen
Qiang Yang
Xinyu Huang
Zijing Ma
Yuanqing Zheng
201
13
0
02 Mar 2025
How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition
Yao Yao
Yifei Yang
Xinbei Ma
Dongjie Yang
Zhuosheng Zhang
Zuchao Li
Hai Zhao
169
1
0
01 Mar 2025
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiao Wang
Jingyun Hua
Weihong Lin
Yujiao Shi
Fuzheng Zhang
Yue Yu
Di Zhang
Liqiang Nie
VLM
582
1
0
28 Feb 2025
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
L. Chen
S. Bai
Wenhao Chai
Weichu Xie
Haozhe Zhao
Leon Vinci
Junyang Lin
Baobao Chang
DiffM
261
15
0
27 Feb 2025
Chain-of-Description: What I can understand, I can put into words
Jiaxin Guo
Daimeng Wei
Tianying Wang
Hengchao Shang
Yuanchang Luo
Hao Yang
207
0
0
22 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou
Changye Li
Yalan Qin
Yaodong Yang
286
6
0
22 Feb 2025
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
L. Yang
Xinchen Zhang
Ye Tian
Chenming Shang
Minghao Xu
Wentao Zhang
Tengjiao Wang
296
9
0
17 Feb 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Zhenxing Mi
Kuan-Chieh Wang
Guocheng Qian
Hanrong Ye
Runtao Liu
Sergey Tulyakov
Kfir Aberman
Dan Xu
LRM
281
7
0
12 Feb 2025
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
639
27
0
12 Feb 2025
UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
Weijia Mao
Zhiyong Yang
Mike Zheng Shou
MoE
607
2
0
10 Feb 2025
Parameter-Efficient Fine-Tuning for Foundation Models
Dan Zhang
Tao Feng
Lilong Xue
Yuandong Wang
Yuxiao Dong
J. Tang
465
30
0
23 Jan 2025
LASER: Lip Landmark Assisted Speaker Detection for Robustness
Le Thien Phuc Nguyen
Xiaohua Xie
Yong Jae Lee
226
2
0
21 Jan 2025
Towards Advancing Code Generation with Large Language Models: A Research Roadmap
Haolin Jin
Huaming Chen
Qinghua Lu
Liming Zhu
LLMAG
202
5
0
20 Jan 2025
A Comprehensive Survey of Foundation Models in Medicine
IEEE Reviews in Biomedical Engineering (RBME), 2024
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
586
61
0
17 Jan 2025
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
International Conference on Machine Learning (ICML), 2024
Hao Fei
Shengqiong Wu
Wei Ji
Hao Zhang
Yang Deng
Wynne Hsu
LRM
VGen
361
141
0
08 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Neural Information Processing Systems (NeurIPS), 2024
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
407
70
0
31 Dec 2024
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Yeyuan Wang
D. Gao
Bin Li
Rujiao Long
Lei Yi
Xiaoyan Cai
Libin Yang
Jinxia Zhang
Jinsong Chen
Qi Xuan
199
1
0
22 Dec 2024
Do Language Models Understand Time?
The Web Conference (WWW), 2024
Xi Ding
Lei Wang
664
9
0
18 Dec 2024
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Zihui Cheng
Qiguang Chen
Jin Zhang
Hao Fei
Xiaocheng Feng
Wanxiang Che
Min Li
L. Qin
VLM
MLLM
LRM
374
26
0
17 Dec 2024
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Shengqiong Wu
Hao Fei
Liangming Pan
William Yang Wang
Shuicheng Yan
Tat-Seng Chua
LRM
332
14
0
15 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2024
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Juil Sock
VLM
ObjD
1.1K
2
0
12 Dec 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
DiffM
VGen
VLM
519
25
0
28 Nov 2024
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Rong-Cheng Tu
Zi-Ao Ma
Tian Lan
Yuehao Zhao
Heyan Huang
Xian-Ling Mao
MLLM
VLM
EGVM
317
9
0
23 Nov 2024
Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge
Ruiyang Qin
Dancheng Liu
Gelei Xu
Zheyu Yan
Chenhui Xu
Yuting Hu
Xiaolin Hu
Jinjun Xiong
Yiyu Shi
AuLLM
431
1
0
21 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
452
4
0
14 Nov 2024
Autoregressive Models in Vision: A Survey
Jing Xiong
Gongye Liu
Lun Huang
Chengyue Wu
Taiqiang Wu
...
Hao Fei
Guillermo Sapiro
Jiebo Luo
Ping Luo
Ngai Wong
VGen
410
36
0
08 Nov 2024
CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
Jingwei Xu
Chenyu Wang
Zibo Zhao
Wen Liu
Yi-An Ma
Shenghua Gao
302
33
0
07 Nov 2024
Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey
Ao Fu
Yi Zhou
Tao Zhou
Yue Yang
Bojun Gao
Qun Li
Guobin Wu
Ling Shao
VGen
225
5
0
05 Nov 2024
Generative Emotion Cause Explanation in Multimodal Conversations
International Conference on Multimedia Retrieval (ICMR), 2024
Lin Wang
Xiaocui Yang
Shi Feng
Daling Wang
Yifei Zhang
Zhitao Zhang
418
1
0
01 Nov 2024
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
Neural Information Processing Systems (NeurIPS), 2024
Xiufeng Song
Xiao Guo
Junxuan Zhang
Qirui Li
Lei Bai
Xiaoming Liu
Guangtao Zhai
Xiaohong Liu
VGen
DiffM
548
28
0
31 Oct 2024
Analyzing Multimodal Interaction Strategies for LLM-Assisted Manipulation of 3D Scenes
IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR), 2024
Junlong Chen
Jens Grubert
Per Ola Kristensson
142
8
0
29 Oct 2024
A Hierarchical Language Model For Interpretable Graph Reasoning
Sambhav Khurana
Xiner Li
Shurui Gui
Shuiwang Ji
LRM
300
0
0
29 Oct 2024
Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
Maohao Shen
Shun Zhang
Jilong Wu
Zhiping Xiu
Ehab AlBadawy
Yiting Lu
M. Seltzer
Qing He
145
6
0
27 Oct 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
287
1
0
26 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Liwen Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
413
14
0
24 Oct 2024
Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Neural Information Processing Systems (NeurIPS), 2024
Yu Zhao
Hao Fei
Xiangtai Li
L. Qin
Jiayi Ji
Erik Cambria
Meishan Zhang
Jianguo Wei
DiffM
214
2
0
20 Oct 2024
Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu
Yuhao Zhang
Xiang Wang
Benyou Wang
Qiang Liu
Haoyang Li
LM&MA
ELM
AuLLM
665
2
0
17 Oct 2024
SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition
Zechen Li
Shohreh Deldari
Linyao Chen
Hao Xue
Flora D. Salim
439
16
0
14 Oct 2024
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
International Conference on Learning Representations (ICLR), 2024
Qingni Wang
Tiantian Geng
Zhiyuan Wang
Teng Wang
Bo Fu
Feng Zheng
369
13
0
10 Oct 2024