Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions-where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.
View on arXiv