VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

While Transformers are dominated by Floating-Point (FP) matrix multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation that leverages a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7x lower latency and 74.3x lower energy than the baseline cluster, achieving an 8.2x performance improvement and 4.1x higher energy efficiency for the FlashAttention-2 kernel in the GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3, and ViT, achieving up to 5.8x and 3.6x reductions in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
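For intuition, below is a minimal C sketch of a Schraudolph-style exponential specialized to Bfloat16 operands (emulated here as the upper 16 bits of an IEEE-754 float32). The constants, range handling, and absence of correction terms are illustrative assumptions; the paper's hardware block refines the approximation beyond this basic form.

#include <stdint.h>

/* Schraudolph-style fast exp for Bfloat16 (a sketch, not the paper's exact
 * algorithm). Since e^x = 2^(x/ln 2), the Bfloat16 bit pattern of the result
 * is built directly in integer arithmetic:
 *   exponent|mantissa field ~= x * 2^7/ln(2) + 127 * 2^7.
 * Bfloat16 = sign(1) | exponent(8) | mantissa(7), i.e. the top 16 bits of a
 * float32, which is how the result is returned here. */
static inline float bf16_fast_exp(float x) {
    const float scale = 184.6650390625f;   /* 2^7 / ln(2) */
    const float bias  = 16256.0f;          /* 127 * 2^7   */
    /* Valid only for roughly -88 < x < 88; no overflow/underflow checks. */
    int32_t bits = (int32_t)(scale * x + bias);
    union { uint32_t u; float f; } v;
    v.u = (uint32_t)bits << 16;            /* bf16 bits occupy the fp32 high half */
    return v.f;
}

In the proposed system, this kind of bit-manipulation sequence is not run as scalar code: the abstract describes a dedicated arithmetic block inside the cores' FPU, exposed to software through the custom ISA extension.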
@article{wang2025_2504.11227,
  title   = {VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers},
  author  = {Run Wang and Gamze Islamoglu and Andrea Belano and Viviane Potocnik and Francesco Conti and Angelo Garofalo and Luca Benini},
  journal = {arXiv preprint arXiv:2504.11227},
  year    = {2025}
}