AI and Memory Wall
arXiv:2403.14123 · 21 March 2024
A. Gholami
Z. Yao
Sehoon Kim
Coleman Hooper
Michael W. Mahoney
Kurt Keutzer
Links: ArXiv · PDF · HTML
Papers citing "AI and Memory Wall" (46 of 46 papers shown)
GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
Jinuk Kim
Marwa El Halabi
W. Park
Clemens JS Schaefer
Deokjae Lee
Yeonhong Park
Jae W. Lee
Hyun Oh Song
MQ
29
0
0
11 May 2025
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
Arnab Sanyal
Prithwish Mukherjee
Gourav Datta
Sandeep P. Chinchali
MQ
97
0
0
05 May 2025
CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures
Yingjie Qi
Jianlei Yang
Yiou Wang
Yikun Wang
Dayu Wang
Ling Tang
Cenlin Duan
Xiaolin He
Weisheng Zhao
19
0
0
02 May 2025
KVCrush: Key value cache size-reduction using similarity in head-behaviour
Gopi Krishna Jha
Sameh Gobriel
Liubov Talamanova
Alexander Kozlov
Nilesh Jain
MQ
34
0
0
24 Feb 2025
A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
Kaiyuan Tian
Linbo Qiao
Baihui Liu
Gongqingjian Jiang
Dongsheng Li
31
0
0
21 Jan 2025
TinyLLM: A Framework for Training and Deploying Language Models at the Edge Computers
Savitha Viswanadh Kandala
Pramuka Medaranga
Ambuj Varshney
70
1
0
19 Dec 2024
Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads
Mukul Lokhande
Gopal Raut
Santosh Kumar Vishvakarma
70
1
0
16 Dec 2024
SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization
Runsheng Bai
Qiang Liu
B. Liu
MQ
59
1
0
05 Dec 2024
Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification
Bei Liu
Yanmin Qian
69
0
0
02 Dec 2024
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Marco Federici
Davide Belli
M. V. Baalen
Amir Jalalirad
Andrii Skliar
Bence Major
Markus Nagel
Paul N. Whatmough
76
0
0
02 Dec 2024
PIM-AI: A Novel Architecture for High-Efficiency LLM Inference
Cristobal Ortega
Yann Falevoz
Renaud Ayrignac
78
1
0
26 Nov 2024
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
Yuzong Chen
Ahmed F. AbouElhamayed
Xilai Dai
Yang Wang
Marta Andronic
G. Constantinides
Mohamed S. Abdelfattah
MQ
100
1
0
18 Nov 2024
ML²Tuner: Efficient Code Tuning via Multi-Level Machine Learning Models
JooHyoung Cha
Munyoung Lee
Jinse Kwon
Jubin Lee
Jemin Lee
Yongin Kwon
34
0
0
16 Nov 2024
Shrinking the Giant: Quasi-Weightless Transformers for Low Energy Inference
Shashank Nag
Alan T. L. Bacellar
Zachary Susskind
Anshul Jha
Logan Liberty
...
Krishnan Kailas
P. Lima
Neeraja J. Yadwadkar
F. M. G. França
L. John
33
0
0
04 Nov 2024
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models
Cong Guo
Feng Cheng
Zhixu Du
James Kiessling
Jonathan Ku
...
Qilin Zheng
Guanglei Zhou
Hai (Helen) Li
Yiran Chen
31
7
0
08 Oct 2024
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
Wei An
Xiao Bi
Guanting Chen
Shanhuang Chen
Chengqi Deng
...
Chenggang Zhao
Yao Zhao
Shangyan Zhou
Shunfeng Zhou
Yuheng Zou
34
6
0
26 Aug 2024
Robust Regression with Ensembles Communicating over Noisy Channels
Yuval Ben-Hur
Yuval Cassuto
15
0
0
20 Aug 2024
On Exact Bit-level Reversible Transformers Without Changing Architectures
Guoqiang Zhang
J. P. Lewis
W. Kleijn
MQ
AI4CE
32
0
0
12 Jul 2024
OPIMA: Optical Processing-In-Memory for Convolutional Neural Network Acceleration
Febin P. Sunny
Amin Shafiee
Abhishek Balasubramaniam
Mahdi Nikdast
S. Pasricha
47
1
0
11 Jul 2024
Recent and Upcoming Developments in Randomized Numerical Linear Algebra for Machine Learning
Michał Dereziński
Michael W. Mahoney
20
5
0
17 Jun 2024
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Haoran You
Yipin Guo
Yichao Fu
Wei Zhou
Huihong Shi
Xiaofan Zhang
Souvik Kundu
Amir Yazdanbakhsh
Y. Lin
KELM
44
7
0
10 Jun 2024
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho
Sangmin Bae
Taehyeon Kim
Hyunjik Jo
Yireun Kim
Tal Schuster
Adam Fisch
James Thorne
Se-Young Yun
45
8
0
04 Jun 2024
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
R. Prabhakar
R. Sivaramakrishnan
Darshan Gandhi
Yun Du
Mingran Wang
...
Urmish Thakker
Dawei Huang
Sumti Jairath
Kevin J. Brown
K. Olukotun
MoE
39
12
0
13 May 2024
Reducing the Barriers to Entry for Foundation Model Training
Paolo Faraboschi
Ellis Giles
Justin Hotard
Konstanty Owczarek
Andrew Wheeler
21
4
0
12 Apr 2024
Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
M. Ibrahim
Mahzabeen Islam
Shaizeen Aga
21
2
0
29 Mar 2024
MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing
Geraldo F. Oliveira
Ataberk Olgun
A. G. Yaglikçi
F. N. Bostanci
Juan Gómez Luna
Saugata Ghose
Onur Mutlu
26
6
0
29 Feb 2024
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Wei Huang
Yangdong Liu
Haotong Qin
Ying Li
Shiming Zhang
Xianglong Liu
Michele Magno
Xiaojuan Qi
MQ
77
68
0
06 Feb 2024
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper
Sehoon Kim
Hiva Mohammadzadeh
Michael W. Mahoney
Y. Shao
Kurt Keutzer
A. Gholami
MQ
17
173
0
31 Jan 2024
T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Suchita Pati
Shaizeen Aga
Mahzabeen Islam
Nuwan Jayasena
Matthew D. Sinclair
20
12
0
30 Jan 2024
SpiNNaker2: A Large-Scale Neuromorphic System for Event-Based and Asynchronous Machine Learning
Hector A. Gonzalez
Jiaxin Huang
Florian Kelber
Khaleelulla Khan Nazeer
Tim Langer
...
Bernhard Vogginger
Timo C. Wunderlich
Yexin Yan
Mahmoud Akl
Christian Mayr
24
16
0
09 Jan 2024
Coop: Memory is not a Commodity
Jianhao Zhang
Shihan Ma
Peihong Liu
Jinhui Yuan
30
4
0
01 Nov 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon
Zhuohan Li
Siyuan Zhuang
Ying Sheng
Lianmin Zheng
Cody Hao Yu
Joseph E. Gonzalez
Haotong Zhang
Ion Stoica
VLM
29
1,773
0
12 Sep 2023
Dynamic Encoding and Decoding of Information for Split Learning in Mobile-Edge Computing: Leveraging Information Bottleneck Theory
Omar Alhussein
Moshi Wei
A. Akhavain
10
3
0
06 Sep 2023
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Xiaoxia Wu
Z. Yao
Yuxiong He
MQ
27
43
0
19 Jul 2023
SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim
Coleman Hooper
A. Gholami
Zhen Dong
Xiuyu Li
Sheng Shen
Michael W. Mahoney
Kurt Keutzer
MQ
24
166
0
13 Jun 2023
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
Z. Yao
Xiaoxia Wu
Cheng-rong Li
Stephen Youn
Yuxiong He
MQ
63
57
0
15 Mar 2023
Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent
Xiaonan Nie
Yi Liu
Fangcheng Fu
J. Xue
Dian Jiao
Xupeng Miao
Yangyu Tao
Bin Cui
MoE
19
16
0
06 Mar 2023
MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation
Samuel Hsia
Udit Gupta
Bilge Acun
Newsha Ardalani
Pan Zhong
Gu-Yeon Wei
David Brooks
Carole-Jean Wu
25
17
0
21 Feb 2023
Reversible Vision Transformers
K. Mangalam
Haoqi Fan
Yanghao Li
Chaoxiong Wu
Bo Xiong
Christoph Feichtenhofer
Jitendra Malik
ViT
11
45
0
09 Feb 2023
FP8 Formats for Deep Learning
Paulius Micikevicius
Dusan Stosic
N. Burgess
Marius Cornea
Pradeep Dubey
...
Naveen Mellempudi
S. Oberman
M. Shoeybi
Michael Siu
Hao Wu
BDL
VLM
MQ
67
121
0
12 Sep 2022
ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization
Cong Guo
Chen Zhang
Jingwen Leng
Zihan Liu
Fan Yang
Yun-Bo Liu
Minyi Guo
Yuhao Zhu
MQ
14
54
0
30 Aug 2022
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Yu Tang
Chenyu Wang
Yufan Zhang
Yuliang Liu
Xingcheng Zhang
Linbo Qiao
Zhiquan Lai
Dongsheng Li
13
4
0
30 Mar 2022
Query Processing on Tensor Computation Runtimes
Dong He
Supun Nakandala
Dalitso Banda
Rathijit Sen
Karla Saur
Kwanghyun Park
Carlo Curino
Jesús Camacho-Rodríguez
Konstantinos Karanasos
Matteo Interlandi
11
35
0
03 Mar 2022
Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
Youjie Li
Amar Phanishayee
D. Murray
Jakub Tarnawski
N. Kim
4
19
0
02 Feb 2022
Compute and Energy Consumption Trends in Deep Learning Inference
Radosvet Desislavov
Fernando Martínez-Plumed
José Hernández-Orallo
8
112
0
12 Sep 2021
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler
Dan Alistarh
Tal Ben-Nun
Nikoli Dryden
Alexandra Peste
MQ
139
684
0
31 Jan 2021