Caveats for information bottleneck in deterministic scenarios

Information bottleneck (IB) is a method for extracting information from one random variable $X$ that is relevant for predicting another random variable $Y$. To do so, IB identifies an intermediate "bottleneck" variable $T$ that has low mutual information $I(X;T)$ and high mutual information $I(Y;T)$. The "IB curve" characterizes the set of bottleneck variables that achieve maximal $I(Y;T)$ for a given $I(X;T)$, and is typically explored by maximizing the "IB Lagrangian", $\mathcal{L}_{\mathrm{IB}} = I(Y;T) - \beta I(X;T)$. In some cases, $Y$ is a deterministic function of $X$, including many classification problems in supervised learning where the output class $Y$ is a deterministic function of the input $X$. We demonstrate three caveats when using IB in any situation where $Y$ is a deterministic function of $X$: (1) the IB curve cannot be recovered by maximizing the IB Lagrangian for different values of $\beta$; (2) there are "uninteresting" trivial solutions at all points of the IB curve; and (3) for multi-layer classifiers that achieve low prediction error, different layers cannot exhibit a strict trade-off between compression and prediction, contrary to a recent proposal. We also show that when $Y$ is a small perturbation away from being a deterministic function of $X$, these three caveats arise in an approximate way. To address problem (1), we propose a functional that, unlike the IB Lagrangian, can recover the IB curve in all cases. We demonstrate the three caveats on the MNIST dataset.
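As a toy illustration of the quantities the abstract defines (not the paper's method or its proposed functional), the sketch below computes $I(X;T)$, $I(Y;T)$, and the IB Lagrangian for two encoders in a small deterministic setting where $Y = f(X)$. The joint distribution, the two encoders, and all names here are illustrative assumptions.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a discrete joint distribution p_xy[i, j] = P(X=i, Y=j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Toy deterministic setting: 4 equally likely inputs X mapped by a
# deterministic function f onto 2 output classes, so H(Y|X) = 0.
p_x = np.full(4, 0.25)
f = np.array([0, 0, 1, 1])

def ib_quantities(q_t_given_x):
    """Given an encoder q(t|x) (rows: x, cols: t), return I(X;T) and I(Y;T)."""
    p_xt = p_x[:, None] * q_t_given_x                    # joint P(X, T)
    p_yt = np.zeros((2, q_t_given_x.shape[1]))           # joint P(Y, T)
    for x in range(4):
        p_yt[f[x]] += p_xt[x]                            # Y = f(X) deterministically
    return mutual_information(p_xt), mutual_information(p_yt)

# Two encoders: T = X (no compression) vs. T = f(X) (maximal compression).
identity = np.eye(4)
labels = np.zeros((4, 2)); labels[np.arange(4), f] = 1
for beta in (0.1, 0.5, 1.0):
    for name, enc in (("T=X   ", identity), ("T=f(X)", labels)):
        ixt, iyt = ib_quantities(enc)
        print(f"beta={beta:.1f} {name}: I(X;T)={ixt:.2f}, I(Y;T)={iyt:.2f}, "
              f"L_IB={iyt - beta * ixt:.2f}")
```

In this deterministic example $T = f(X)$ attains the same $I(Y;T)$ as $T = X$ at strictly lower $I(X;T)$, so it maximizes the Lagrangian for every $\beta > 0$; intermediate points on the IB curve are never selected, consistent with caveat (1).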