M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

25 May 2024
Mingshuang Luo
Ruibing Hou
Hong Chang
Zimo Liu
Yaowei Wang
Shiguang Shan
Abstract

This paper presents M³GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M³GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal control and generation signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with discrete tokenizers, resulting in more detailed and comprehensive motion generation. Third, M³GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M³GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M³GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks.
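
The unified-vocabulary idea in the abstract can be pictured with a short sketch: continuous motion features are vector-quantized against a learned codebook, and the resulting discrete indices are offset past the text vocabulary so text and motion tokens share a single LLM token space. This is a minimal illustration under assumed names and sizes (a `MotionQuantizer` class, a 512-entry codebook, a 32k text vocabulary), not the authors' implementation.

```python
# Minimal sketch of discrete vector quantization into a shared LLM vocabulary.
# All class names, codebook sizes, and the 32k text-vocabulary size are
# illustrative assumptions, not details taken from the paper.
import torch
import torch.nn as nn


class MotionQuantizer(nn.Module):
    """Maps continuous motion features to discrete codebook indices (VQ-VAE style)."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    @torch.no_grad()
    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, code_dim) -> token ids: (batch, frames)
        flat = features.reshape(-1, features.size(-1))
        # Squared Euclidean distance from each frame feature to every codebook entry.
        dists = (
            flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )
        return dists.argmin(dim=-1).reshape(features.shape[:-1])


def to_shared_vocab(motion_codes: torch.Tensor, text_vocab_size: int) -> torch.Tensor:
    """Shift motion code indices past the text vocabulary so both modalities
    live in a single token space for the LLM."""
    return motion_codes + text_vocab_size


if __name__ == "__main__":
    quantizer = MotionQuantizer()
    motion_features = torch.randn(2, 64, 256)           # dummy motion features
    motion_tokens = quantizer(motion_features)           # discrete motion tokens
    llm_tokens = to_shared_vocab(motion_tokens, 32000)   # assumed 32k text vocab
    print(llm_tokens.shape)                               # torch.Size([2, 64])
```

In the full framework the codebook would be trained (e.g., with reconstruction and commitment losses) and music would be tokenized analogously; those pieces are omitted here for brevity.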
