arXiv:2405.16759

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

27 May 2024
C. N. Vasconcelos
Abdullah Rashwan
Austin Waters
Trevor Walker
Keyang Xu
Jimmy Yan
Rui Qian
Shixin Luo
Zarana Parekh
Andrew Bunner
Hongliang Fei
Roopal Garg
Mandy Guo
Ivana Kajić
Yeqing Li
Henna Nandwani
Jordi Pont-Tuset
Yasumasa Onoe
Sarah Rosston
Su Wang
Wenlei Zhou
Kevin Swersky
David J. Fleet
Jason Baldridge
Oliver Wang
Abstract

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for the stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of the core components, namely those responsible for text-to-image alignment versus those responsible for high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, which has no downsampling encoder or upsampling decoder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without a super-resolution cascade. Our key results rely on public datasets and show that we can train non-cascaded models of up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model, trained on internal datasets to produce 1024x1024 images without cascades, is preferred by human evaluators over SDXL by 44.0% to 21.4%.
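The abstract does not spell out the growing procedure, so the following is only a minimal, hypothetical PyTorch sketch of the general idea it describes: pre-train a single-resolution core with no down/up-sampling, then greedily wrap it in new resolution stages while reusing the core's weights so the pre-trained representation is preserved. The module names (`ShallowUNetCore`, `grow`), the zero-initialization of the new layers, and the residual wiring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ShallowUNetCore(nn.Module):
    """Sketch of a single-resolution core: a stack of residual blocks with
    no down/up-sampling, pre-trained for text-to-image alignment."""
    def __init__(self, channels: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.GroupNorm(8, channels),
                nn.SiLU(),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            )
            for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual blocks keep the core easy to scale
        return x

def grow(core: nn.Module, channels: int) -> nn.Module:
    """One hypothetical greedy growing step: wrap the pre-trained core with
    a new downsampling stage and a new upsampling stage, doubling the
    resolution the model handles. The core's weights are reused unchanged."""
    down = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
    up = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
    # Assumption: zero-init the new output layer so the grown model starts
    # as a near-identity around the pre-trained core, stabilizing training.
    nn.init.zeros_(up.weight)
    nn.init.zeros_(up.bias)

    class Grown(nn.Module):
        def __init__(self):
            super().__init__()
            self.down, self.core, self.up = down, core, up

        def forward(self, x):
            h = self.down(x)       # new high-resolution encoder stage
            h = self.core(h)       # pre-trained core, representation intact
            return x + self.up(h)  # skip connection keeps high-freq detail

    return Grown()

# Greedily grow the single-stage model, one resolution step at a time
# (e.g., two doublings from 256x256 features toward 1024x1024).
model = ShallowUNetCore(channels=128, depth=4)
for _ in range(2):
    model = grow(model, channels=128)
```

In this sketch each `grow` call adds exactly one resolution stage, so training can proceed stage by stage rather than end-to-end from scratch; how the paper actually schedules growing, freezing, and fine-tuning is not stated in the abstract.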
