We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on the pixel level, it leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities from a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM), the capability of generating additional artificial data during finetuning and inference to improve the model output, and (iii) TerraMind achieves performance beyond the state of the art on community-standard EO benchmarks such as PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
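To make the "Thinking-in-Modalities" idea concrete, the sketch below illustrates it in a minimal, self-contained form: before predicting the downstream target, the model first generates an auxiliary modality (e.g. a land-cover map) from the input and then conditions the task head on both the input and the generated modality. The classes and method names here (ToyAnyToAnyBackbone, TiMSegmenter, generate_aux, encode) are hypothetical stand-ins for illustration, not the released TerraMind code or API.

```python
import torch
import torch.nn as nn

class ToyAnyToAnyBackbone(nn.Module):
    """Stand-in for a generative any-to-any backbone: it can encode an input
    modality and synthesize an auxiliary modality from that encoding."""
    def __init__(self, in_ch=4, aux_ch=1, dim=32):
        super().__init__()
        self.encoder = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)
        self.aux_decoder = nn.Conv2d(dim, aux_ch, kernel_size=3, padding=1)
        self.aux_encoder = nn.Conv2d(aux_ch, dim, kernel_size=3, padding=1)

    def generate_aux(self, x):
        # "Think" in another modality: synthesize e.g. a land-cover map.
        return self.aux_decoder(self.encoder(x))

    def encode(self, x, aux):
        # Fuse the original input with the generated auxiliary modality.
        return self.encoder(x) + self.aux_encoder(aux)

class TiMSegmenter(nn.Module):
    """Downstream head conditioned on both the input and the generated modality."""
    def __init__(self, backbone, dim=32, n_classes=2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Conv2d(dim, n_classes, kernel_size=1)

    def forward(self, x):
        aux = self.backbone.generate_aux(x)   # step 1: generate artificial modality
        fused = self.backbone.encode(x, aux)  # step 2: re-encode with the extra context
        return self.head(fused)               # step 3: predict the downstream target

# Usage: a random Sentinel-2-like input with 4 bands.
model = TiMSegmenter(ToyAnyToAnyBackbone())
logits = model(torch.randn(1, 4, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```

The same pattern applies during finetuning, where the generated modality acts as additional, model-produced context rather than requiring extra ground-truth inputs.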
@article{jakubik2025_2504.11171,
  title   = {TerraMind: Large-Scale Generative Multimodality for Earth Observation},
  author  = {Johannes Jakubik and Felix Yang and Benedikt Blumenstiel and Erik Scheurer and Rocco Sedona and Stefano Maurogiovanni and Jente Bosmans and Nikolaos Dionelis and Valerio Marsocci and Niklas Kopp and Rahul Ramachandran and Paolo Fraccaro and Thomas Brunschwiler and Gabriele Cavallaro and Juan Bernabe-Moreno and Nicolas Longépé},
  journal = {arXiv preprint arXiv:2504.11171},
  year    = {2025}
}