M-VADER: A Model for Diffusion with Multimodal Context

6 December 2022
Samuel Weinbach
Marco Bellagente
C. Eichenberg
Andrew M. Dai
R. Baldock
Souradeep Nanda
Björn Deiseroth
Koen Oostermeijer
H. Teufel
Andres Felipe Cruz Salinas
Abstract

We introduce M-VADER: a diffusion model (DM) for image generation whose output can be specified using arbitrary combinations of images and text. We show how M-VADER enables the generation of images specified using combinations of an image and text, as well as combinations of multiple images. Previously, a number of successful DM image-generation algorithms have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and guided by the notion that language developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13-billion-parameter multimodal decoder combining components from the autoregressive vision-language model MAGMA with biases finetuned for semantic search.
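The abstract describes conditioning a diffusion model on a single embedding sequence produced from an arbitrary mixture of images and text. The sketch below illustrates that conditioning pattern in PyTorch: a stand-in multimodal encoder concatenates projected image patches and text tokens into one context sequence, and a toy denoiser attends to it via cross-attention. All module names, dimensions, and the patch/token interface here are illustrative assumptions, not the actual M-VADER or S-MAGMA implementation.

```python
# Minimal sketch (assumed architecture, not the paper's code): a multimodal
# encoder maps images + text into one embedding sequence, which conditions a
# diffusion denoiser through cross-attention.
import torch
import torch.nn as nn


class MultimodalContextEncoder(nn.Module):
    """Stand-in for S-MAGMA: projects image patches and text tokens into a
    shared space and concatenates them into a single context sequence."""

    def __init__(self, d_model=512, vocab_size=32000, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(patch_dim, d_model)

    def forward(self, text_tokens, image_patches):
        # text_tokens: (B, T_txt) int64; image_patches: (B, T_img, patch_dim)
        txt = self.text_embed(text_tokens)    # (B, T_txt, d_model)
        img = self.image_proj(image_patches)  # (B, T_img, d_model)
        return torch.cat([img, txt], dim=1)   # (B, T_img + T_txt, d_model)


class CrossAttentionDenoiser(nn.Module):
    """Toy denoiser: noisy latent tokens attend to the multimodal context,
    mimicking how text-to-image DMs inject their conditioning signal."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, noisy_latents, context):
        # Queries come from the latents; keys/values from the context.
        attended, _ = self.attn(noisy_latents, context, context)
        return self.out(attended)  # predicted noise, same shape as latents


encoder = MultimodalContextEncoder()
denoiser = CrossAttentionDenoiser()
text = torch.randint(0, 32000, (2, 16))   # dummy text tokens
patches = torch.randn(2, 64, 768)         # dummy image patch features
latents = torch.randn(2, 256, 512)        # dummy noisy image latents
context = encoder(text, patches)
eps_pred = denoiser(latents, context)
print(eps_pred.shape)  # torch.Size([2, 256, 512])
```

Because the image and text embeddings live in one concatenated sequence, the same cross-attention path handles a lone text prompt, an image plus text, or several images, which is the flexibility the abstract claims for M-VADER.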

View on arXiv: https://arxiv.org/abs/2212.02936