Stochastic activations
Main: 10 pages, 7 figures, 8 tables; bibliography: 4 pages; appendix: 6 pages
Abstract
We introduce stochastic activations, a novel strategy that randomly selects among several non-linear functions in the feed-forward layers of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely its constant zero output for negative inputs, which blocks gradient flow. We leverage this strategy in two ways:
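The selection mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): the function names and the probability parameter `p_relu` are assumptions, and a real model would apply the choice inside each feed-forward layer during training.

```python
import math
import random

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x); smooth, non-zero gradient for x < 0.
    return x / (1.0 + math.exp(-x))

def relu(x):
    # ReLU: zero (and zero gradient) for all negative inputs.
    return max(0.0, x)

def stochastic_activation(x, p_relu=0.5, rng=None):
    """Apply RELU with probability p_relu, otherwise SILU (one Bernoulli draw).

    Hypothetical sketch: the paper's exact sampling scheme and
    hyperparameters are not shown here.
    """
    rng = rng or random
    return relu(x) if rng.random() < p_relu else silu(x)
```

In a training loop, the draw would typically be resampled per step (or per layer), so the network sees gradients through SILU part of the time even where RELU would have zeroed them out.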
