A$^3$: Accelerating Attention Mechanisms in Neural Networks with Approximation

International Symposium on High-Performance Computer Architecture (HPCA), 2020
22 February 2020
Tae Jun Ham, Sungjun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, D. Jeong
arXiv: 2002.10941 (abs / PDF / HTML)
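For context on the cited work: A$^3$ targets the standard scaled dot-product attention, softmax(QK^T / sqrt(d)) V. The sketch below is a minimal NumPy rendering of that baseline computation only; the function name, shapes, and example data are illustrative assumptions, and the paper's actual approximation scheme is not reproduced on this page.

    import numpy as np

    def attention(Q, K, V):
        """Baseline scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # query-key similarity
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                               # weighted sum of values

    # Example: 4 queries attending over 8 keys/values of width 16.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, 16)) for n in (4, 8, 8))
    print(attention(Q, K, V).shape)                      # (4, 16)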

Papers citing "A$^3$: Accelerating Attention Mechanisms in Neural Networks with Approximation"

Showing 50 of 88 citing papers (page 1 of 2).
ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity
Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
02 Dec 2025

CAMformer: Associative Memory is All You Need
Tergel Molom-Ochir, Benjamin Morris, Mark Horton, Chiyue Wei, Cong Guo, ..., Peter Liu, Shan X. Wang, Deliang Fan, Hai Helen Li, Yiran Chen
24 Nov 2025

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan
10 Nov 2025

AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
IACR Cryptology ePrint Archive (IACR ePrint), 2025
Dinghong Song, Yuan Feng, Y. Wang, S. Chen, Cyril Guyot, F. Blagojevic, Hyeran Jeon, Pengfei Su, Dong Li
29 Oct 2025

SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Y. Liu
20 Oct 2025

Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), 2025
Ching-Lin Hsiung, Tian-Sheuan Chang
Topic: ViT
16 Oct 2025

Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation
Mukul Lokhande, Tanushree Dewangan, Mohd Sharik Mansoori, Tejas Chaudhari, Akarsh J., Damayanti Lokhande, Adam Teman, Santosh Kumar Vishvakarma
12 Oct 2025

Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
Vasileios Titopoulos, K. Alexandridis, G. Dimitrakopoulos
08 Oct 2025

Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, ..., Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang
Topic: MoE
06 Oct 2025

LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Yujun Lin, Zhekai Zhang, Song Han
15 Sep 2025

KLLM: Fast LLM Inference with K-Means Quantization
Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Helen Li
Topic: MQ
30 Jul 2025

Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators
IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2025
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

FLASH-D: FlashAttention with Hidden Softmax Division
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization
International Symposium on Computer Architecture (ISCA), 2025
Seunghee Han, S. Choi, Joo-Young Kim
09 May 2025

Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Chaojian Li, Sixu Li, Linrui Jiang, Jingqun Zhang, Yingyan Lin
31 Mar 2025

A Low-Power Streaming Speech Enhancement Accelerator for Edge Devices
IEEE Open Journal of Circuits and Systems (OJ-CAS), 2025
Ci-Hao Wu, Tian-Sheuan Chang
27 Mar 2025

QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge
Computer Vision and Pattern Recognition (CVPR), 2025
Xuan Shen, Weize Ma, Jing Liu, Changdi Yang, Rui Ding, ..., Wei Niu, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Topic: MQ
20 Mar 2025

Attention Condensation via Sparsity Induced Regularized Training
Eli Sason, Darya Frolova, Boris Nazarov, Felix Goldberd
03 Mar 2025

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu
21 Feb 2025

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
12 Feb 2025

EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim
Topic: DiffM
10 Jan 2025

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
Topic: AI4CE
18 Dec 2024

EXAQ: Exponent Aware Quantization for LLMs Acceleration
Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Y. Levy
Topic: MQ
04 Oct 2024

ProTEA: Programmable Transformer Encoder Acceleration on FPGA
Ehsan Kabir, Jason D. Bakos, David Andrews, Miaoqing Huang
21 Sep 2024

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs
International Conference on Field-Programmable Technology (ICFPT), 2024
Ehsan Kabir, Md. Arafat Kabir, Austin R. J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang
Topic: GNN
21 Sep 2024

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing
Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang
08 Sep 2024

Hardware Acceleration of LLMs: A Comprehensive Survey and Comparison
Nikoletta Koilia, C. Kachris
05 Sep 2024

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024
Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn
Topic: MoE
02 Sep 2024

LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
IEEE International Symposium on Workload Characterization (IISWC), 2024
Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park
10 Aug 2024

1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data
Calvin Tan, Jerome Wang
Topic: ALM
07 Aug 2024

KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer
IEEE International System-on-Chip Conference (SOCC), 2024
Aness Al-Qawlaq, Ajay Kumar, Deepu John
22 Jul 2024

Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
Ghadeer Jaradat, M. Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad
17 Jul 2024

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang
Topic: MQ
16 Jul 2024

Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, ..., Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
Topic: MQ
21 Jun 2024

LARS-VSA: A Vector Symbolic Architecture for Learning with Abstract Rules
Mohamed Mejri, C. Amarnath, Abhijit Chatterjee
23 May 2024

The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving
Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan
Topic: LLMAG
18 May 2024

From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2024
Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, ..., Zhenghua Chen, M. Aly, Jie Lin, Ruibing Jin, Xiaoli Li
09 May 2024

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Chengyue Wu, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han
07 May 2024

LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer
Jiing-Ping Wang, Ming-Guang Lin, An-Yeu Wu
11 Apr 2024

Lightweight Deep Learning for Resource-Constrained Environments: A Survey
Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng
08 Apr 2024

CHAI: Clustered Head Attention for Efficient LLM Inference
International Conference on Machine Learning (ICML), 2024
Saurabh Agarwal, Bilge Acun, Basil Homer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu
12 Mar 2024

Stochastic Spiking Attention: Accelerating Attention with Stochastic Computing in Spiking Networks
Zihang Song, Prabodh Katti, Osvaldo Simeone, Bipin Rajendran
14 Feb 2024

Compressing Deep Reinforcement Learning Networks with a Dynamic Structured Pruning Method for Autonomous Driving
Wensheng Su, Zhenni Li, Minrui Xu, Jiawen Kang, Dusit Niyato, Shengli Xie
07 Feb 2024

ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
Shiwei Liu, Guanchen Tao, Yifei Zou, Derek Chow, Zichen Fan, Kauna Lei, Bangfei Pan, Dennis Sylvester, Gregory Kielian, Mehdi Saligane
31 Jan 2024

BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge
International Symposium on Circuits and Systems (ISCAS), 2024
Yuhao Ji, Chao Fang, Zhongfeng Wang
22 Jan 2024

A Survey on Hardware Accelerators for Large Language Models
C. Kachris
18 Jan 2024

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs
Symposium on Field Programmable Gate Arrays (FPGA), 2024
Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, ..., Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang
08 Jan 2024

A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
Harsh Sharma, Pratyush Dhingra, J. Doppa, Ümit Y. Ogras, P. Pande
18 Dec 2023

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang
01 Dec 2023

Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads
IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023
Hongxiang Fan, Stylianos I. Venieris, Alexandros Kouris, Nicholas D. Lane
17 Oct 2023