WISE-TTT: Worldwide Information Segmentation Enhancement

Multi-target video segmentation remains a major challenge in long sequences, mainly because existing architectures are inherently limited in capturing global temporal dependencies. We introduce WISE-TTT, a synergistic architecture that integrates Test-Time Training (TTT) mechanisms with the Transformer architecture through co-design. The TTT layer systematically compresses historical temporal data into hidden states that carry worldwide information (a lossless memory that maintains long-context integrity), and achieves multi-stage contextual aggregation through splicing. Crucially, our framework provides the first empirical validation that implementing worldwide information across multiple network layers is essential for optimal dependency modeling. Our studies show that TTT modules applied to high-level features boost global modeling. This translates to a 3.1% accuracy improvement (J&F metric) on the DAVIS 2017 long-term benchmarks, the first proof of hierarchical context superiority in video segmentation. We provide the first systematic evidence that worldwide information critically impacts segmentation performance.
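To make the idea of compressing history into a hidden state more concrete, here is a minimal sketch of a generic TTT-style linear layer. It is not the WISE-TTT implementation: the function name `ttt_linear_scan`, the plain reconstruction loss, the learning rate, and the zero initialization are all assumptions chosen for illustration. The hidden state is a weight matrix `W` that takes one gradient step per incoming feature vector, so earlier frames influence later outputs only through `W`.

```python
import numpy as np


def ttt_linear_scan(tokens, lr=0.1):
    """Toy TTT-style layer (illustrative assumption, not the paper's code).

    tokens: array of shape (T, dim), one feature vector per time step.
    The hidden state W is updated by one gradient step per token on the
    self-supervised loss 0.5 * ||W x - x||^2, then used to produce the output.
    """
    dim = tokens.shape[1]
    W = np.zeros((dim, dim))            # hidden state: compressed history
    outputs = []
    for x in tokens:
        err = W @ x - x                 # reconstruction error for this token
        grad = np.outer(err, x)         # gradient of the loss with respect to W
        W = W - lr * grad               # "training" step performed at test time
        outputs.append(W @ x)           # output computed with the updated state
    return np.stack(outputs), W


# Hypothetical input: the same unit-norm feature repeated over 100 steps.
x = np.array([1.0, 0.0, 0.0, 0.0])
outputs, W = ttt_linear_scan(np.tile(x, (100, 1)), lr=0.1)
```

In this toy setting the reconstruction error shrinks by a factor of `1 - lr` at every step, so the state gradually absorbs the repeated feature. A practical TTT layer would typically use learned projections of the input and mini-batched updates, which are omitted here.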
@article{hao2025_2504.00879,
  title={WISE-TTT: Worldwide Information Segmentation Enhancement},
  author={Fenglei Hao and Yuliang Yang and Ruiyuan Su and Zhengran Zhao and Yukun Qiao and Mengyu Zhu},
  journal={arXiv preprint arXiv:2504.00879},
  year={2025}
}