We prove a precise geometric description of all one layer ReLU networks $z(x;\theta)$ with a single linear unit and input/output dimensions equal to one that interpolate a given dataset $\mathcal{D} = \{(x_i, f(x_i))\}$ and, among all such interpolants, minimize the $\ell_2$-norm of the neuron weights. Such networks can intuitively be thought of as those that minimize the mean-squared error over $\mathcal{D}$ plus an infinitesimal weight decay penalty. We therefore refer to them as ridgeless ReLU interpolants. Our description proves that, to extrapolate values $z(x;\theta)$ for inputs $x \in (x_i, x_{i+1})$ lying between two consecutive datapoints, a ridgeless ReLU interpolant simply compares the signs of the discrete estimates for the curvature of $f$ at $x_i$ and $x_{i+1}$ derived from the dataset $\mathcal{D}$. If the curvature estimates at $x_i$ and $x_{i+1}$ have different signs, then $z(x;\theta)$ must be linear on $(x_i, x_{i+1})$. If in contrast the curvature estimates at $x_i$ and $x_{i+1}$ are both positive (resp. negative), then $z(x;\theta)$ is convex (resp. concave) on $(x_i, x_{i+1})$. Our results show that ridgeless ReLU interpolants achieve the best possible generalization for learning Lipschitz functions, up to universal constants.
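For concreteness, one natural way to write the curvature comparison is as a sign of a difference of consecutive secant slopes; the symbol $\epsilon_i$ and this second-divided-difference convention are illustrative assumptions here, not necessarily the paper's exact definitions:
\[
  \epsilon_i \;=\; \operatorname{sgn}\!\left(
    \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}
    \;-\;
    \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}}
  \right).
\]
Under this convention, the rule described above says that on the gap $(x_i, x_{i+1})$ the interpolant $z(\cdot;\theta)$ is linear when $\epsilon_i \, \epsilon_{i+1} < 0$, convex when $\epsilon_i = \epsilon_{i+1} = +1$, and concave when $\epsilon_i = \epsilon_{i+1} = -1$.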