Unsupervised Novel View Synthesis from a Single Image

Novel view synthesis from a single image has recently achieved remarkable results, although the requirement for some form of 3D, pose, or multi-view supervision at training time limits deployment in real-world scenarios. This work aims to relax these assumptions, enabling the training of conditional generative models for novel view synthesis in a completely unsupervised manner. We first pre-train a purely generative decoder using a 3D-aware GAN formulation while simultaneously training an encoder network to invert the mapping from latent space to images. Then, we swap the encoder and decoder and train the network as a conditional GAN with a mixture of an autoencoder-like objective and self-distillation. At test time, given a view of an object, our model first embeds the image content into a latent code and regresses its pose, then generates novel views of it by keeping the code fixed and varying the pose. We test our framework both on synthetic datasets such as ShapeNet and on unconstrained collections of natural images, where no competing method can be trained.
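The test-time procedure described in the abstract (encode a single view into a content code and a pose, then decode with the code fixed and the pose varied) can be sketched as follows. This is a minimal illustrative PyTorch sketch under assumed names and shapes; `ContentEncoder`, `PoseAwareGenerator`, the latent dimensionality, and the azimuth/elevation pose parametrization are placeholders, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Hypothetical encoder: embeds one view into a latent code and regresses its pose."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_latent = nn.Linear(128, latent_dim)  # content code z
        self.to_pose = nn.Linear(128, 2)             # assumed (azimuth, elevation) pose

    def forward(self, image):
        feat = self.backbone(image)
        return self.to_latent(feat), self.to_pose(feat)

class PoseAwareGenerator(nn.Module):
    """Stand-in for the pre-trained 3D-aware decoder: (z, pose) -> image."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim + 2, 3 * 64 * 64)

    def forward(self, z, pose):
        out = torch.tanh(self.fc(torch.cat([z, pose], dim=-1)))
        return out.view(-1, 3, 64, 64)

@torch.no_grad()
def synthesize_novel_views(encoder, generator, image, azimuths):
    """Keep the latent code fixed and vary only the pose."""
    z, pose = encoder(image)
    views = []
    for az in azimuths:
        new_pose = pose.clone()
        new_pose[:, 0] = az  # overwrite azimuth, keep the regressed elevation
        views.append(generator(z, new_pose))
    return views

# Usage: one 64x64 input view, three target azimuths (radians).
encoder, generator = ContentEncoder(), PoseAwareGenerator()
image = torch.rand(1, 3, 64, 64)
novel = synthesize_novel_views(encoder, generator, image, azimuths=[0.0, 0.5, 1.0])
print([v.shape for v in novel])
```

The key design point the abstract implies is that only the pose input changes between renders; the content code stays fixed so the identity of the object is preserved across views.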