KV-weights are all you need for skipless transformers
He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention), but not for MQA (multi-query attention) and GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). Watch our explainer videothis https URLand seethis https URLfor code and more transformer tricks.
View on arXiv