Generative Models of Visually Grounded Imagination

International Conference on Learning Representations (ICLR), 2017
Abstract

Consider how easy it is for people to imagine what a "purple hippo" would look like, even though such an animal does not exist. If we instead said "purple hippo with wings", they could just as easily create a different internal mental representation, to represent this more specific concept. To assess whether the person has correctly understood the concept, we can ask them to draw a few sketches, to illustrate their thoughts. We call the ability to map text descriptions of concepts to latent representations and then to images (or vice versa) visually grounded semantic imagination. We propose a latent variable model for images and attributes, based on variational auto-encoders, which can perform this task. Our method uses a novel training objective, and a novel product-of-experts inference network, which can handle partially specified (abstract) concepts in a principled and efficient way. We also propose a set of easy-to-compute evaluation metrics that capture our intuitive notions of what it means to have good imagination, namely correctness, coverage, and compositionality (the 3 C's). Finally, we perform a detailed comparison (in terms of the 3 C's) of our method with two existing joint image-attribute VAE methods (the JMVAE method of Suzuki et al. (2017) and the bi-VCCA method of Wang et al. (2016)) by applying them to two simple datasets based on MNIST, where it is easy to objectively evaluate performance in a controlled way.
