Caveats for information bottleneck in deterministic scenarios

23 August 2018

Abstract

Information bottleneck (IB) is a method for extracting information from one random variable $X$ that is relevant for predicting another random variable $Y$. To do so, IB identifies an intermediate "bottleneck" variable $T$ that has low mutual information $I(X;T)$ and high mutual information $I(Y;T)$. The "IB curve" characterizes the set of bottleneck variables that achieve maximal $I(Y;T)$ for a given $I(X;T)$, and is typically explored by optimizing the "IB Lagrangian", $I(Y;T) - \beta I(X;T)$. In some cases, $Y$ is a deterministic function of $X$, including many supervised classification scenarios where the output class $Y$ is a deterministic function of the input $X$. We demonstrate several caveats when using IB in any situation where $Y$ is a deterministic function of $X$: (1) the IB curve cannot be recovered by optimizing the IB Lagrangian for different values of $\beta$; (2) there are "uninteresting" trivial solutions at all points of the IB curve; and (3) for multi-layer classifiers that achieve low error rates, different layers cannot exhibit a strict trade-off between compression and prediction, contrary to a recent proposal. We also demonstrate that when $Y$ is a small perturbation away from being a deterministic function of $X$, these issues arise in an approximate way. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. We demonstrate these issues on the MNIST dataset.
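To make caveats (1) and (2) concrete, the following is a minimal numerical sketch on a toy problem, not the paper's experimental setup. It assumes four equiprobable inputs, a hypothetical two-class labelling `f`, and an illustrative bottleneck family (`partial_copy_of_y`) in which $T$ equals $Y$ with probability $\alpha$ and is otherwise a "blank" symbol. All names, and the choice of family, are assumptions made for illustration.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits for a joint probability table p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal of A, shape (na, 1)
    p_b = p_joint.sum(axis=0, keepdims=True)   # marginal of B, shape (1, nb)
    mask = p_joint > 0                         # skip zero cells (0 * log 0 = 0)
    ratio = p_joint[mask] / (p_a @ p_b)[mask]
    return float(np.sum(p_joint[mask] * np.log2(ratio)))

def ib_values(p_x, f, p_t_given_x, n_y):
    """Return (I(X;T), I(Y;T)) when Y = f(X) and T depends only on X."""
    p_xt = p_x[:, None] * p_t_given_x          # joint p(x, t)
    p_yt = np.zeros((n_y, p_t_given_x.shape[1]))
    for x, y in enumerate(f):
        p_yt[y] += p_xt[x]                     # p(y, t) sums p(x, t) over f(x) = y
    return mutual_information(p_xt), mutual_information(p_yt)

# Toy problem (assumed): 4 equiprobable inputs, 2 balanced classes, so H(Y) = 1 bit.
p_x = np.full(4, 0.25)
f = np.array([0, 0, 1, 1])
n_y = 2

def partial_copy_of_y(alpha):
    """Illustrative bottleneck: T = f(x) with prob. alpha, else a 'blank' symbol 2."""
    p = np.zeros((len(f), n_y + 1))
    p[np.arange(len(f)), f] = alpha
    p[:, n_y] = 1.0 - alpha
    return p

for alpha in (0.0, 0.25, 0.5, 1.0):
    ixt, iyt = ib_values(p_x, f, partial_copy_of_y(alpha), n_y)
    for beta in (0.5, 2.0):
        lagrangian = iyt - beta * ixt
        print(f"alpha={alpha:.2f} beta={beta:.1f} "
              f"I(X;T)={ixt:.3f} I(Y;T)={iyt:.3f} L={lagrangian:.3f}")
```

In this toy family, $I(X;T) = I(Y;T) = \alpha H(Y)$, so every value of $\alpha$ lands on the line $I(Y;T) = I(X;T)$ even though $T$ does nothing more than reveal or hide the label. The Lagrangian then equals $(1-\beta)\,\alpha H(Y)$, which is linear in $\alpha$: within this family it is maximized only at $\alpha = 1$ (for $\beta < 1$) or $\alpha = 0$ (for $\beta > 1$), so sweeping $\beta$ does not select the intermediate points.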
