
Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Hiroshi Sasaki
Main: 8 pages
4 figures
3 tables
Bibliography: 1 page
Abstract

Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations.
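For readers unfamiliar with the alignment objective the abstract refers to, the following is a minimal numpy sketch of a symmetric CLIP-style contrastive (InfoNCE) loss: matched image-text pairs sit on the diagonal of a cosine-similarity matrix and are pulled together, while mismatched pairs are pushed apart. This illustrates the standard CLIP objective only, not the paper's proposed pseudo contrastive method; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a
    matched image-text pair. Names and the default temperature are
    illustrative, not taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs lie on the diagonal
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With perfectly matched embeddings the diagonal dominates and the loss approaches zero; with unrelated embeddings it stays near log N. The paper's concern is that for diagrams, where small visual differences change the meaning, this objective alone gives weak gradients for fine-grained structural distinctions.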
