InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

Current learning-based subject customization approaches, which predominantly rely on U-Net architectures, suffer from limited generalization and compromised image quality. Optimization-based methods, meanwhile, require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter offers three fundamental advantages. First, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, it introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and interacts seamlessly with the latent space of modern diffusion transformers. Third, to train the framework effectively, we construct a large-scale character dataset containing on the order of ten million samples, systematically organized into paired (multi-view character) and unpaired (text-image combination) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate InstantCharacter's advanced capability to generate high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at this https URL.
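To make the adapter design more concrete, the sketch below shows one way a stack of transformer encoders could refine reference-character features and inject them into a diffusion transformer's latent stream via cross-attention. This is a minimal illustration under assumptions, not the authors' released code: the module names, dimensions, and the residual injection point are all hypothetical.

```python
# Illustrative sketch of a character adapter: stacked transformer encoder
# blocks refine image-encoder features of the reference character, and the
# refined tokens condition the DiT tokens through cross-attention.
# All names and hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CharacterAdapter(nn.Module):
    def __init__(self, feat_dim=1024, latent_dim=3072, depth=4, heads=16):
        super().__init__()
        # Stacked transformer encoders that process open-domain character
        # features extracted by a (frozen) image encoder.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Project refined character tokens into the DiT latent width.
        self.proj = nn.Linear(feat_dim, latent_dim)
        # Cross-attention: DiT latent tokens attend to character tokens.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=heads, batch_first=True
        )

    def forward(self, latents, char_feats):
        # char_feats: (B, N, feat_dim) patch features of the reference image
        # latents:    (B, M, latent_dim) tokens inside a DiT block
        char_tokens = self.proj(self.encoder(char_feats))
        injected, _ = self.cross_attn(latents, char_tokens, char_tokens)
        return latents + injected  # residual injection into the DiT stream
```

A residual cross-attention injection of this kind is a common adapter pattern (e.g., IP-Adapter-style conditioning); whether InstantCharacter injects at every DiT block or only a subset is not stated in the abstract.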
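The dual-data training could be organized along the following lines: paired multi-view batches supervise identity consistency, while unpaired text-image batches preserve textual editability. This is a hedged sketch; the per-batch pathway switch, the `reference` argument, and the shared diffusion objective are assumptions made for illustration.

```python
# Sketch of the two learning pathways driven by the dataset structure.
# `model` and `diffusion_loss` are hypothetical stand-ins for the
# conditioned diffusion transformer and its training objective.
def training_step(batch, model, diffusion_loss):
    if batch["kind"] == "paired":
        # A reference view conditions generation of a different view of
        # the same character, forcing the adapter to carry identity.
        ref, target, caption = batch["ref"], batch["target"], batch["caption"]
        pred = model(target, caption, reference=ref)
    else:
        # Unpaired text-image data: the image itself serves as reference,
        # keeping the text prompt in control of content and style.
        img, caption = batch["image"], batch["caption"]
        pred = model(img, caption, reference=img)
    return diffusion_loss(pred, batch)
```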
@article{tao2025_2504.12395,
  title={InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework},
  author={Jiale Tao and Yanbing Zhang and Qixun Wang and Yiji Cheng and Haofan Wang and Xu Bai and Zhengguang Zhou and Ruihuang Li and Linqing Wang and Chunyu Wang and Qin Lin and Qinglin Lu},
  journal={arXiv preprint arXiv:2504.12395},
  year={2025}
}