Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models
Hiroshi Sasaki
- VLM
Main: 8 pages
Figures: 4
Tables: 3
Bibliography: 1 page
Abstract
Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations.
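The CLIP objective the abstract refers to aligns matched image and text embeddings via a symmetric contrastive (InfoNCE) loss, treating each image-text pair in a batch as the positive and all other pairings as negatives. A minimal numpy sketch of that standard objective (not the paper's pseudo-contrastive variant, whose details are not given here; function and variable names are illustrative):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style of CLIP.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(lg):
        # Cross-entropy with the diagonal (matched pair) as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Toy check: 4 matched pairs. Identical embeddings should score a lower
# loss than randomly mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss_aligned = clip_contrastive_loss(emb, emb)
loss_random = clip_contrastive_loss(emb, rng.normal(size=(4, 8)))
```

The abstract's critique applies directly to this loss: two diagrams differing in one small structural detail yield nearly identical embeddings, so the softmax over similarities barely distinguishes them, which is the fine-grained insensitivity the paper targets.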
