KV-weights are all you need for skipless transformers

Main: 4 Pages
6 Figures
Bibliography: 2 Pages
1 Table
Abstract

He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme applies only to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention), which are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). Watch our explainer video at this https URL and see this https URL for code and more transformer tricks.
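
The 15% figure can be sanity-checked with a rough weight count. The sketch below is not taken from the paper: the hyperparameters are assumed from the publicly released Mistral-7B configuration, normalization weights are ignored, and the function names are hypothetical helpers chosen for illustration.

```python
# Rough weight count for Mistral-7B, to check the "15%" claim.
# ASSUMPTION: all hyperparameters below come from the public Mistral-7B
# model config, not from this abstract. Normalization weights are ignored.
DIM = 4096        # model (embedding) dimension
N_LAYERS = 32     # number of transformer blocks
N_HEADS = 32      # query heads
N_KV_HEADS = 8    # key/value heads (GQA: fewer than query heads)
HEAD_DIM = 128    # dimension per head
FFN_DIM = 14336   # hidden dimension of the gated FFN
VOCAB = 32000     # vocabulary size


def attention_weights():
    """Per-layer weight counts of the four attention projections."""
    return {
        "Q": DIM * N_HEADS * HEAD_DIM,     # square under these settings
        "K": DIM * N_KV_HEADS * HEAD_DIM,  # smaller because of GQA
        "V": DIM * N_KV_HEADS * HEAD_DIM,  # smaller because of GQA
        "P": N_HEADS * HEAD_DIM * DIM,     # post-attention projection
    }


def total_weights():
    """Total weights: all layers plus input and output embeddings."""
    attn = sum(attention_weights().values())
    ffn = 3 * DIM * FFN_DIM        # gate, up, and down projections
    embeddings = 2 * VOCAB * DIM   # input embedding and output head
    return N_LAYERS * (attn + ffn) + embeddings


def removed_fraction(names):
    """Fraction of all weights held by the named projections."""
    per_layer = attention_weights()
    removed = N_LAYERS * sum(per_layer[name] for name in names)
    return removed / total_weights()


if __name__ == "__main__":
    print(f"total weights: {total_weights() / 1e9:.2f} B")
    print(f"share of Q and P: {removed_fraction(['Q', 'P']):.1%}")
```

Under these assumptions, Q and P together account for about 14.8% of roughly 7.24 billion weights, which is consistent with the 15% stated in the abstract. The count also shows that K and V are four times smaller than Q and P under GQA, so Q and P are the larger projections to eliminate.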
