
Multimodal Learning with Transformers: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

13 June 2022

Papers citing "Multimodal Learning with Transformers: A Survey"

50 / 305 papers shown

Handwritten Text Recognition for Low Resource Languages

104

01 Dec 2025

Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment

Libo Wang

104

30 Nov 2025

Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition

118

21 Nov 2025

LMM-IR: Large-Scale Netlist-Aware Multimodal Framework for Static IR-Drop Prediction

Design Automation Conference (DAC), 2025

16 Nov 2025

Learning Time in Static Classifiers

124

15 Nov 2025

Point Cloud Quantization through Multimodal Prompting for 3D Understanding

429

15 Nov 2025

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

132

09 Nov 2025

Towards Scalable Meta-Learning of near-optimal Interpretable Models via Synthetic Model Generations

417

06 Nov 2025

Caption Injection for Optimization in Generative Search Engine

Xiaolu Chen

Yong Liao

DiffM

132

06 Nov 2025

Enhancing Multimodal Reasoning via Latent Refocusing

178

04 Nov 2025

Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

136

28 Oct 2025

MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning

Alejandro Guerra-Manzanares

Farah E. Shamout

128

20 Oct 2025

Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

16 Oct 2025

FedMMKT: Co-Enhancing a Server Text-to-Image Model and Client Task Models in Multi-Modal Federated Learning

14 Oct 2025

ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

Haochen You

Baojing Liu

152

02 Oct 2025

MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series

29 Sep 2025

InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

246

28 Sep 2025

PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction

124

24 Sep 2025

SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction

Runfei Chen

Shuyang Jiang

Wei Huang

24 Sep 2025

Single-Branch Network Architectures to Close the Modality Gap in Multimodal Recommendation

124

23 Sep 2025

Orchestrate, Generate, Reflect: A VLM-Based Multi-Agent Collaboration Framework for Automated Driving Policy Learning

116

21 Sep 2025

DAFTED: Decoupled Asymmetric Fusion of Tabular and Echocardiographic Data for Cardiac Hypertension Diagnosis

143

19 Sep 2025

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

172

18 Sep 2025

Music4All A+A: A Multimodal Dataset for Music Information Retrieval Tasks

18 Sep 2025

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

...

127

17 Sep 2025

From Embeddings to Equations: Genetic-Programming Surrogates for Interpretable Transformer Classification

124

16 Sep 2025

Video Understanding by Design: How Datasets Shape Architectures and Insights

237

11 Sep 2025

IMDMR: An Intelligent Multi-Dimensional Memory Retrieval System for Enhanced Conversational AI

10 Sep 2025

XSRD-Net: EXplainable Stroke Relapse Detection

...

09 Sep 2025

Testing chatbots on the creation of encoders for audio conditioned image generation

Jorge E. León

Miguel Carrasco

152

09 Sep 2025

Effectively obtaining acoustic, visual and textual data from videos

Jorge E. León

Miguel Carrasco

VGen

135

06 Sep 2025

AIVA: An AI-based Virtual Companion for Emotion-aware Interaction

Chenxi Li

03 Sep 2025

On Transferring, Merging, and Splitting Task-Oriented Network Digital Twins

02 Sep 2025

LightVLM: Acceleraing Large Multimodal Models with Pyramid Token Merging and KV Cache Compression

132

30 Aug 2025

A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension

Information Fusion (Inf. Fusion), 2025

Mohammad Zia Ur Rehman

108

22 Aug 2025

MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs

140

20 Aug 2025

Separating Shared and Domain-Specific LoRAs for Multi-Domain Learning

154

05 Aug 2025

Parameter-Efficient Single Collaborative Branch for Recommendation

ACM Conference on Recommender Systems (RecSys), 2025

157

05 Aug 2025

Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence

Matthieu Queloz

138

29 Jul 2025

T$^3$SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms

135

27 Jul 2025

Principled Multimodal Representation Learning

219

23 Jul 2025

Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

269

14 Jun 2025

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

317

13 Jun 2025

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

294

13 Jun 2025

Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation

John Waithaka

Moise Busogi

SSL

160

07 Jun 2025

SatelliteFormula: Multi-Modal Symbolic Regression from Remote Sensing Imagery for Physics Discovery

Zhenyu Yu

Mohd Yamani Idna Idris

174

06 Jun 2025

CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection

Knowledge-Based Systems (KBS), 2025

David Ortiz-Perez

Manuel Benavent-Lledo

Javier Rodriguez-Juan

José García Rodríguez

David Tomás

307

02 Jun 2025

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

299

01 Jun 2025

Revisiting Self-attention for Cross-domain Sequential Recommendation

Knowledge Discovery and Data Mining (KDD), 2025

184

27 May 2025

Residual Cross-Attention Transformer-Based Multi-User CSI Feedback with Deep Joint Source-Channel Coding

IEEE Wireless Communications Letters (WCL), 2025

135

26 May 2025