ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2503.07493
54
0

V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation

10 March 2025
Guiwei Zhang
Tianyu Zhang
Mohan Zhou
Yalong Bai
Biye Li
ArXivPDFHTML
Abstract

We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLMs vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing.this https URL

View on arXiv
@article{zhang2025_2503.07493,
  title={ V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation },
  author={ Guiwei Zhang and Tianyu Zhang and Mohan Zhou and Yalong Bai and Biye Li },
  journal={arXiv preprint arXiv:2503.07493},
  year={ 2025 }
}
Comments on this paper