The Evolution of Multimodal Model Architectures
arXiv: 2405.17927

28 May 2024
S. Wadekar
Abhishek Chaurasia
Vasu Sharma
Eugenio Culurciello

Papers citing "The Evolution of Multimodal Model Architectures"

19 / 19 papers shown
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Chiori Hori
Yoshiki Masuyama
Siddarth Jain
Radu Corcodel
Devesh K. Jha
Diego Romeres
Jonathan Le Roux
64
0
0
21 Nov 2025
QuAnTS: Question Answering on Time Series
Felix Divo
Maurice Kraus
Anh Q. Nguyen
Hao Xue
Imran Razzak
Flora D. Salim
Kristian Kersting
Devendra Singh Dhami
72
0
0
07 Nov 2025
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou
Di Lu
Yizhou Wang
Yibo Yan
Yuanhuiyi Lyu
Xu Zheng
Linfeng Zhang
Xuming Hu
VLM
229
5
0
03 Oct 2025
InstructPLM-mu: 1-Hour Fine-Tuning of ESM2 Beats ESM3 in Protein Mutation Predictions
Junde Xu
Yapin Shi
Lijun Lang
Taoyong Cui
Z. Zhang
Guangyong Chen
Jiezhong Qiu
Pheng-Ann Heng
127
0
0
03 Oct 2025
DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models
Zhiyi Shi
Binjie Wang
Chongjie Si
Yichen Wu
Junsik Kim
Hanspeter Pfister
KELM, VLM
268
1
0
16 Jun 2025
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran
Adrien Bardes
David Fan
Q. Garrido
Russell Howes
...
Sarath Chandar
Franziska Meier
Yann LeCun
Michael G. Rabbat
Nicolas Ballas
248
117
0
11 Jun 2025
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
International Journal of Computer Vision (IJCV), 2024
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
255
3
0
23 Oct 2024
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
Xin Zou
Yizhou Wang
Yibo Yan
Yuanhuiyi Lyu
Kening Zheng
...
Junkai Chen
Peijie Jiang
Qingbin Liu
Chang Tang
Xuming Hu
372
24
0
04 Oct 2024
On-Device Language Models: A Comprehensive Review
Jiajun Xu
Zhiyuan Li
Wei Chen
Qun Wang
Xin Gao
Qi Cai
Ziyuan Ling
452
89
0
26 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
270
128
0
22 Aug 2024
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
AAAI Conference on Artificial Intelligence (AAAI), 2024
Qizhou Chen
Taolin Zhang
Chengyu Wang
Xiaofeng He
Dakan Wang
Tingting Liu
KELM
566
5
0
19 Aug 2024
Are Bigger Encoders Always Better in Vision Large Models?
Bozhou Li
Hao Liang
Zimo Meng
Wentao Zhang
VLM
169
5
0
01 Aug 2024
Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos
Iosif Tsangko
Alexander Gebhard
A. Mesaros
Maria Sandsten
B. Schuller
344
6
0
22 Jul 2024
Audio-visual training for improved grounding in video-text LLMs
Shivprasad Sagare
Hemachandran S
Kinshuk Sarabhai
Prashant Ullegaddi
SA Rajeshkumar
131
1
0
21 Jul 2024
General Vision Encoder Features as Guidance in Medical Image Registration
Fryderyk Kogl
Anna Reithmeir
Vasiliki Sideri-Lampretsa
Ines P. Machado
R. Braren
Daniel Rückert
Julia A. Schnabel
Veronika A. Zimmer
MedIm
179
3
0
18 Jul 2024
MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models
Neural Information Processing Systems (NeurIPS), 2024
Zunnan Xu
Yukang Lin
Haonan Han
Sicheng Yang
Ronghui Li
Yachao Zhang
Xiu Li
Mamba
521
38
0
14 Mar 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Jiaming Song
Yu Qiao
Shiyang Feng
MLLM
437
135
0
08 Feb 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLM, LLMAG
329
40
0
19 Jan 2024
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
MLLM, VLM
741
308
0
07 Jul 2023