Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative
Model Inference with Unstructured Sparsity

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

19 September 2023

Zhen Zheng

Yong Li

Papers citing "Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity"

6 / 6 papers shown

Title
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin ... Joseph E. Gonzalez Percy Liang Christopher Ré Ion Stoica Ce Zhang 144 366 0 13 Mar 2023
Efficient Quantized Sparse Matrix Operations on Tensor Cores Shigang Li Kazuki Osawa Torsten Hoefler 72 31 0 14 Sep 2022
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning Zihao Ye Ruihang Lai Junru Shao Tianqi Chen Luis Ceze 76 91 0 11 Jul 2022
Can Foundation Models Wrangle Your Data? A. Narayan Ines Chami Laurel J. Orr Simran Arora Christopher Ré LMTD AI4CE 176 213 0 20 May 2022
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks Torsten Hoefler Dan Alistarh Tal Ben-Nun Nikoli Dryden Alexandra Peste MQ 139 684 0 31 Jan 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism M. Shoeybi M. Patwary Raul Puri P. LeGresley Jared Casper Bryan Catanzaro MoE 243 1,817 0 17 Sep 2019