
Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Hiroshi Sasaki
Main: 8 pages
4 figures
3 tables
Bibliography: 1 page
Abstract

Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations.
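For readers unfamiliar with the alignment objective the abstract refers to, the following is a minimal numpy sketch of a symmetric CLIP-style contrastive (InfoNCE) loss: matched image-text pairs sit on the diagonal of a cosine-similarity matrix and are pulled together, while mismatched pairs are pushed apart. This illustrates the standard CLIP objective only, not the paper's proposed pseudo contrastive method; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a
    matched image-text pair. Names and the default temperature are
    illustrative, not taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs lie on the diagonal
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With perfectly matched embeddings the diagonal dominates and the loss approaches zero; with unrelated embeddings it stays near log N. The paper's concern is that for diagrams, where small visual differences change the meaning, this objective alone gives weak gradients for fine-grained structural distinctions.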
