MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization

18 March 2025
Binjie Liu
Lina Liu
Sanyi Zhang
Songen Gu
Yihao Zhi
Tianyi Zhu
Lei Yang
Long Ye
Abstract

This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages the text and audio embeddings of pre-trained WavCaps to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling over continuous motion embeddings through diffusion, without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid-granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
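To make the key idea concrete, below is a minimal, hypothetical PyTorch sketch of masked autoregressive modeling over continuous motion latents with a small diffusion head in place of a vector-quantized codebook, in the spirit of the MMAG component described above. It is not the authors' implementation: all module names, dimensions, the simplified noising step, and the epsilon-prediction loss are illustrative assumptions.

# Illustrative sketch only (not the paper's code): masked prediction over
# *continuous* motion latents, with a tiny diffusion head instead of a VQ codebook.
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Small MLP that denoises a continuous latent given a per-position context."""
    def __init__(self, latent_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden),  # +1 for the timestep
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, noisy_latent, cond, t):
        # Epsilon-prediction: estimate the noise that was added to the latent.
        t = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([noisy_latent, cond, t], dim=-1))

class MaskedContinuousAR(nn.Module):
    """Masked transformer over continuous motion latents; no discrete tokens."""
    def __init__(self, latent_dim=64, cond_dim=128, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, cond_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, cond_dim))
        layer = nn.TransformerEncoderLayer(cond_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = DiffusionHead(latent_dim, cond_dim)

    def forward(self, latents, mask, noisy_targets, t):
        # latents: (B, T, latent_dim) continuous latents from a (VAE) motion encoder
        # mask:    (B, T) bool, True where the latent is hidden and must be predicted
        x = self.in_proj(latents)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        ctx = self.backbone(x)                    # per-position conditioning vectors
        return self.head(noisy_targets, ctx, t)   # predicted noise at each position

# Toy usage: one training step with the diffusion loss applied at masked positions.
B, T, D = 2, 16, 64
model = MaskedContinuousAR(latent_dim=D)
latents = torch.randn(B, T, D)                   # stand-in for encoded motion latents
mask = torch.rand(B, T) < 0.5
t = torch.randint(0, 1000, (B, T))
noise = torch.randn_like(latents)
noisy = latents + noise                          # schematic noising (a real DDPM scales by alphas)
pred = model(latents, mask, noisy, t)
loss = ((pred - noise)[mask] ** 2).mean()
loss.backward()

The point of the sketch is that the transformer only produces a conditioning vector per masked position; the continuous latent itself is recovered by the diffusion head, so nothing is lost to discrete tokenization.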

@article{liu2025_2503.14040,
  title={MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization},
  author={Binjie Liu and Lina Liu and Sanyi Zhang and Songen Gu and Yihao Zhi and Tianyi Zhu and Lei Yang and Long Ye},
  journal={arXiv preprint arXiv:2503.14040},
  year={2025}
}