Self-Adjust Softmax

The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one and achieves superior performance over alternative normalization functions. However, softmax can suffer from vanishing gradients when some elements of the attention scores approach extreme values, i.e., probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying softmax(x) to x * softmax(x), together with a normalized variant of this form. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax Attention can be seamlessly integrated into the attention mechanisms of existing Transformer models with only minor adjustments. We conduct experiments to evaluate the empirical performance of Transformer models using SA-Softmax against the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, span diverse datasets, language tasks, and positional encoding methods.
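
The abstract leaves the exact formulas implicit. The sketch below shows one way such a reweighting could be dropped into standard scaled dot-product attention, assuming the modification takes the form scores * softmax(scores) with a min-max normalization of the scores; the function name sa_softmax_attention and the normalization details are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def sa_softmax_attention(q, k, v, eps: float = 1e-6):
    """Scaled dot-product attention with a hypothetical SA-Softmax reweighting.

    The reweighting (range-normalized scores times softmax(scores)) is an
    assumption based on the abstract, not the paper's exact formulation.
    """
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    probs = F.softmax(scores, dim=-1)  # vanilla softmax attention weights

    # Rescale each weight by its shifted, range-normalized score so the
    # multiplicative factor keeps a nonzero gradient path even when `probs`
    # saturates near 0 or 1.
    s_min = scores.amin(dim=-1, keepdim=True).clamp(max=0.0)
    s_max = scores.amax(dim=-1, keepdim=True).clamp(min=0.0)
    weights = (scores - s_min) / (s_max - s_min + eps) * probs

    return weights @ v
```

Under these assumptions, a model would substitute this for the softmax-weighted aggregation inside each attention head; the rest of the Transformer block is unchanged.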
@article{zheng2025_2502.18277,
  title   = {Self-Adjust Softmax},
  author  = {Chuanyang Zheng and Yihang Gao and Guoxuan Chen and Han Shi and Jing Xiong and Xiaozhe Ren and Chao Huang and Xin Jiang and Zhenguo Li and Yu Li},
  journal = {arXiv preprint arXiv:2502.18277},
  year    = {2025}
}