Existing Vision Language Models (VLMs) often struggle to preserve logic, entity identity, and artistic style during extended, interleaved image-text interactions. We identify this limitation as "Multimodal Context Drift", which stems from the inherent tendency of implicit neural representations to decay or become entangled over long sequences. To bridge this gap, we propose IUT-Plug, a model-agnostic Neuro-Symbolic Structured State Tracking mechanism. Unlike purely neural approaches that rely on transient attention maps, IUT-Plug introduces the Image Understanding Tree (IUT) as an explicit, persistent memory module. The framework operates by (1) parsing visual scenes into hierarchical symbolic structures (entities, attributes, and relationships); (2) performing incremental state updates to logically lock invariant properties while modifying changing elements; and (3) guiding generation through topological constraints. We evaluate our approach on a novel benchmark comprising 3,000 human-annotated samples. Experimental results demonstrate that IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines. This confirms that explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation.

View on arXiv

Comments on this paper