
Stochastic activations

Main: 10 pages · Appendix: 6 pages · Bibliography: 4 pages · 7 figures · 8 tables
Abstract

We introduce stochastic activations, a novel strategy that randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely its constant value for negative inputs, which blocks gradient flow. We leverage this strategy in two ways:
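As a concrete illustration (not the authors' implementation), a minimal PyTorch sketch of a feed-forward block with a stochastic activation might look as follows. The selection probability `p_silu`, the granularity of the Bernoulli draw (once per forward pass), and the fixed RELU fallback at inference are all assumptions made for the sketch; the abstract does not specify these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticActivationFFN(nn.Module):
    """Feed-forward block whose non-linearity is drawn at random.

    Minimal sketch: with probability `p_silu` the block applies SiLU,
    otherwise ReLU. `p_silu` and the draw granularity (one draw per
    forward pass) are illustrative assumptions, not taken from the paper.
    """

    def __init__(self, d_model: int, d_ff: int, p_silu: float = 0.5):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.p_silu = p_silu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.w_in(x)
        if self.training:
            # Bernoulli draw selects the non-linearity for this pass.
            use_silu = bool(torch.rand(()) < self.p_silu)
            h = F.silu(h) if use_silu else F.relu(h)
        else:
            # Fixed activation at inference; RELU is one plausible choice
            # (assumption), yielding sparse intermediate states.
            h = F.relu(h)
        return self.w_out(h)

# Usage: the activation is re-drawn on every training-mode call.
ffn = StochasticActivationFFN(d_model=512, d_ff=2048)
ffn.train()
y = ffn(torch.randn(4, 16, 512))
```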
