Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
7 November 2023
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Papers citing "Prompt Cache: Modular Attention Reuse for Low-Latency Inference" (2 of 52 papers shown)
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, M. Lewis
27 Aug 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019