Multimodal Large Language Models: A Survey

29 May 2025
Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker
Main: 27 pages, 3 figures, 2 tables; appendix: 16 pages
Abstract

Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation and now span diverse output modalities, including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorizes six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Architectural innovations such as transformers and diffusion models underpin this convergence, enabling cross-modal transfer and modular specialization. We highlight emerging patterns of synergy and identify open challenges in evaluation, modularity, and structured reasoning. This survey offers a unified perspective on MLLM development and identifies critical paths toward more general-purpose, adaptive, and interpretable multimodal systems.
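As a rough illustration of one of the foundational techniques named in the abstract, the sketch below implements a minimal top-k Mixture-of-Experts feed-forward layer in PyTorch. It is not code from the paper or from any surveyed model; the class name TopKMoE, the expert architecture, and all hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts feed-forward layer (illustrative sketch only)."""

    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # Router: produces one score per expert for each token.
        self.gate = nn.Linear(dim, num_experts)
        # Each expert is a small feed-forward block.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        scores = self.gate(x)                                  # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)    # keep only k experts per token
        weights = F.softmax(topk_scores, dim=-1)               # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TopKMoE(dim=64)
    tokens = torch.randn(8, 64)
    print(layer(tokens).shape)  # torch.Size([8, 64])

The loop-based dispatch keeps the routing logic readable; production MoE layers typically batch tokens per expert and add a load-balancing loss, details the survey discusses at a higher level.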

@article{han2025_2506.10016,
  title={A Survey of Generative Categories and Techniques in Multimodal Large Language Models},
  author={Longzhen Han and Awes Mubarak and Almas Baimagambetov and Nikolaos Polatidis and Thar Baker},
  journal={arXiv preprint arXiv:2506.10016},
  year={2025}
}