
Caveats for information bottleneck in deterministic scenarios

Abstract

Information bottleneck (IB) is a method for extracting information from one random variable $X$ that is relevant for predicting another random variable $Y$. To do so, IB identifies an intermediate "bottleneck" variable $T$ that has low mutual information $I(X;T)$ and high mutual information $I(Y;T)$. The "IB curve" characterizes the set of bottleneck variables that achieve maximal $I(Y;T)$ for a given $I(X;T)$, and is typically explored by optimizing the "IB Lagrangian", $I(Y;T) - \beta I(X;T)$. In some cases, $Y$ is a deterministic function of $X$, including many supervised classification scenarios where the output class $Y$ is a deterministic function of the input $X$. We demonstrate several caveats when using IB in any situation where $Y$ is a deterministic function of $X$: (1) the IB curve cannot be recovered by optimizing the IB Lagrangian for different values of $\beta$; (2) there are "uninteresting" trivial solutions at all points of the IB curve; and (3) for multi-layer classifiers that achieve low error rates, different layers cannot exhibit a strict trade-off between compression and prediction, contrary to a recent proposal. We also demonstrate that when $Y$ is a small perturbation away from being a deterministic function of $X$, these issues arise in an approximate way. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. We demonstrate these issues on the MNIST dataset.
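Caveats (1) and (2) can be illustrated numerically on a toy problem. The sketch below is not taken from the paper: the distribution ($X$ uniform on four values, $Y = X \bmod 2$), the erasure-style bottleneck `T_alpha` (which copies $Y$ with probability `alpha` and otherwise emits a constant symbol), and all function names are assumptions chosen for illustration. Because $T$ depends on $X$ only through $X$ itself, the data-processing inequality gives $I(Y;T) \le I(X;T)$, so any $T$ with $I(Y;T) = I(X;T)$ is optimal for its level of compression.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal over rows, shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal over columns, shape (1, m)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

def trivial_bottleneck(alpha, p_x, f, n_classes):
    """Return (I(X;T), I(Y;T)) for the hypothetical erasure-style T_alpha.

    With probability alpha, T equals the label f(x); otherwise T takes an
    extra constant symbol (index n_classes) that carries no information.
    """
    n_x = len(p_x)
    p_t_given_x = np.zeros((n_x, n_classes + 1))
    for x in range(n_x):
        p_t_given_x[x, f(x)] = alpha
        p_t_given_x[x, n_classes] = 1.0 - alpha
    joint_xt = p_x[:, None] * p_t_given_x
    # Y = f(X) is deterministic, so p(y, t) sums p(x, t) over x with f(x) = y.
    joint_yt = np.zeros((n_classes, n_classes + 1))
    for x in range(n_x):
        joint_yt[f(x)] += joint_xt[x]
    return mutual_information(joint_xt), mutual_information(joint_yt)

# Toy setup (an assumption): X uniform on {0, 1, 2, 3}, Y = X mod 2, so H(Y) = 1 bit.
p_x = np.full(4, 0.25)
f = lambda x: x % 2
alphas = np.linspace(0.0, 1.0, 5)
points = [trivial_bottleneck(a, p_x, f, 2) for a in alphas]

def lagrangian_values(beta):
    """IB Lagrangian I(Y;T) - beta * I(X;T) evaluated at each T_alpha."""
    return [i_yt - beta * i_xt for i_xt, i_yt in points]

for a, (i_xt, i_yt) in zip(alphas, points):
    print(f"alpha={a:.2f}  I(X;T)={i_xt:.3f}  I(Y;T)={i_yt:.3f}")

for beta in (0.5, 1.0, 2.0):
    best = int(np.argmax(lagrangian_values(beta)))
    print(f"beta={beta}: Lagrangian maximized at alpha={alphas[best]:.2f}")
```

In this toy setting every `T_alpha` satisfies $I(X;T) = I(Y;T) = \alpha H(Y)$, so these trivial solutions trace out the whole linear segment between the origin and $(H(Y), H(Y))$. Along that segment the Lagrangian equals $(1 - \beta)\,\alpha H(Y)$: it is maximized at $\alpha = 1$ when $\beta < 1$, at $\alpha = 0$ when $\beta > 1$, and is constant in $\alpha$ when $\beta = 1$, so no single value of $\beta$ picks out an intermediate point.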
