Kernelized Classification in Deep Networks

We propose a kernelized classification layer for deep networks. Although conventional deep networks introduce an abundance of nonlinearity for representation (feature) learning, they almost universally use a linear classifier on the learned feature vectors. We advocate a nonlinear classification layer obtained by applying the kernel trick to the softmax cross-entropy loss function during training and to the scorer function during testing. However, the choice of the kernel remains a challenge. To tackle this, we theoretically show the possibility of optimizing over all possible positive definite kernels applicable to our problem setting. This theory is then used to devise a new kernelized classification layer that learns the optimal kernel function for a given problem automatically within the deep network itself. We show the usefulness of the proposed nonlinear classification layer on several datasets and tasks.
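
To make the idea of kernelizing the softmax classifier concrete, the following is a minimal NumPy sketch, not the method proposed here. It replaces the usual inner-product logit between a feature vector and a class weight vector with a kernel evaluation. The fixed Gaussian RBF kernel, the `gamma` and `scale` parameters, and all function names are assumptions made for illustration; the proposed layer learns the kernel instead of fixing it.

```python
import numpy as np


def rbf_kernel(features, prototypes, gamma=1.0):
    """Gaussian RBF kernel between features (n, d) and class vectors (c, d).

    Assumption: a fixed RBF kernel stands in for the learned kernel.
    Returns an (n, c) matrix of kernel values in (0, 1].
    """
    sq_dist = (
        np.sum(features ** 2, axis=1, keepdims=True)
        - 2.0 * features @ prototypes.T
        + np.sum(prototypes ** 2, axis=1)
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def kernel_softmax_cross_entropy(features, prototypes, labels, gamma=1.0, scale=10.0):
    """Softmax cross-entropy where each logit is a scaled kernel value.

    `scale` is a hypothetical temperature-like factor; since RBF values
    lie in (0, 1], unscaled logits would give a nearly flat softmax.
    """
    logits = scale * rbf_kernel(features, prototypes, gamma)
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def predict(features, prototypes, gamma=1.0):
    """Test-time scorer: pick the class with the largest kernel value."""
    return rbf_kernel(features, prototypes, gamma).argmax(axis=1)
```

In a real network, `features` would come from the preceding layers and `prototypes` would be trainable class vectors updated by backpropagation; an automatic-differentiation framework would replace plain NumPy for that purpose.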