
Piecewise-Linear Activations or Analytic Activation Functions: Which Produce More Expressive Neural Networks?

Abstract

Many currently available universal approximation theorems affirm that deep feedforward networks defined using any suitable activation function can approximate any integrable function locally in $L^1$-norm. Though different approximation rates are available for deep neural networks defined using other classes of activation functions, there is little explanation for the empirically confirmed advantage that ReLU networks exhibit over their classical (e.g. sigmoidal) counterparts. Our main result demonstrates that deep networks with piecewise linear activation functions (e.g. ReLU or PReLU) are fundamentally more expressive than deep feedforward networks with analytic activation functions (e.g. sigmoid, Swish, GeLU, or Softplus). More specifically, we construct a strict refinement of the topology on the space $L^1_{\operatorname{loc}}(\mathbb{R}^d,\mathbb{R}^D)$ of locally Lebesgue-integrable functions, in which the set of deep ReLU networks with (bilinear) pooling $\operatorname{NN}^{\operatorname{ReLU}+\operatorname{Pool}}$ is dense (i.e. universal) but the set of deep feedforward networks defined using any combination of analytic activation functions with (or without) pooling layers $\operatorname{NN}^{\omega+\operatorname{Pool}}$ is not dense (i.e. not universal). Our main result is further explained by \textit{quantitatively} demonstrating this "separation phenomenon" between the networks in $\operatorname{NN}^{\operatorname{ReLU}+\operatorname{Pool}}$ and those in $\operatorname{NN}^{\omega+\operatorname{Pool}}$: the networks in $\operatorname{NN}^{\operatorname{ReLU}}$ are capable of approximating any compactly supported Lipschitz function while \textit{simultaneously} approximating its essential support, whereas the networks in $\operatorname{NN}^{\omega+\operatorname{Pool}}$ cannot.
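The claim about essential supports can be illustrated with a small numerical sketch. It is not the paper's construction, only the underlying intuition: a ReLU network can be exactly zero outside a compact set, whereas a nonzero real-analytic function cannot vanish on any open set. The names `relu_hat` and `softplus_hat` are hypothetical, and Softplus is picked here only as one example of an analytic activation.

```python
import math

def relu(x):
    return max(x, 0.0)

def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu_hat(x):
    # One hidden layer, three ReLU units: a "hat" that is nonzero only on (0, 2).
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

def softplus_hat(x):
    # Same weights and biases, with ReLU swapped for the analytic Softplus.
    return softplus(x) - 2.0 * softplus(x - 1.0) + softplus(x - 2.0)

# Sample points outside the interval [0, 2] (illustrative choice).
outside = [-5.0, -1.0, 3.0, 10.0]
relu_outside = [relu_hat(x) for x in outside]
softplus_outside = [softplus_hat(x) for x in outside]

print("ReLU hat outside [0, 2]:    ", relu_outside)
print("Softplus hat outside [0, 2]:", softplus_outside)
```

The ReLU network vanishes identically outside $[0, 2]$, so its support matches that of the hat function. The Softplus analogue is strictly positive at every sampled point, because it is a second difference of a strictly convex function.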
