Generating Long Sequences with Sparse Transformers
R. Child, Scott Gray, Alec Radford, Ilya Sutskever
arXiv:1904.10509 · 23 April 2019

Papers citing "Generating Long Sequences with Sparse Transformers"

Showing 50 of 1,283 citing papers (page 3 of 26)

Lag-Relative Sparse Attention In Long Context Training
Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li
RALM · 13 Jun 2025

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella
11 Jun 2025

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, ..., Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
VLM, LRM · 10 Jun 2025

AstroCompress: A benchmark dataset for multi-purpose compression of astronomical data
International Conference on Learning Representations (ICLR), 2025
Tuan Truong, Rithwik Sudharsan, Jianlong Wu, Peter Xiangyuan Ma, Ruihan Yang, Stephan Mandt, Joshua S. Bloom
10 Jun 2025

Spark Transformer: Reactivating Sparsity in FFN and Attention
Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, ..., Felix X. Yu, Prateek Jain, David Culler, Henry M. Levy, Sanjiv Kumar
07 Jun 2025

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
VLM · 06 Jun 2025

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng
06 Jun 2025

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025

Beyond Text Compression: Evaluating Tokenizers Across Scales
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
03 Jun 2025

COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning
Chamika Sudusinghe, Gerasimos Gerogiannis, Damitha Sandeepa Lenadora, Charles Block, Josep Torrellas, Charith Mendis
31 May 2025

SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Huang Leng
30 May 2025

INSIGHT: A Survey of In-Network Systems for Intelligent, High-Efficiency AI and Topology Optimization
Aleksandr Algazinov, Joydeep Chandra, Matt Laing
30 May 2025

Transformers Are Universally Consistent
Sagar Ghosh, Kushal Bose, Swagatam Das
30 May 2025

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, S. Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
MQ, VLM · 29 May 2025

AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity
Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang
29 May 2025

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara
VLM · 28 May 2025

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu
VGen · 28 May 2025

ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue
DiffM · 27 May 2025

Vision Transformers with Self-Distilled Registers
Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
27 May 2025

Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
Yukun Zhang, Xueqing Zhou
AI4TS · 27 May 2025

CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge
Gabriele Lagani, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
26 May 2025

MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
26 May 2025

How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
24 May 2025

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
24 May 2025

Why Do Some Inputs Break Low-Bit LLM Quantization?
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
MQ · 24 May 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
KELM, LRM · 23 May 2025

Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models
Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, C. Salvi, Terry Lyons
AI4TS · 23 May 2025

Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse
Josh Alman, Zhao Song
22 May 2025

Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization
Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee
HILM · 21 May 2025

Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators
IEEE Computer Society Annual Symposium on VLSI (VLSI), 2025
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

FLASH-D: FlashAttention with Hidden Softmax Division
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform
Josh Alman, Zhao Song
17 May 2025

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency
Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis
16 May 2025

MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection
Pouya Shaeri, Ariane Middel
16 May 2025

ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention
Jintian Shao, Hongyi Huang, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
15 May 2025

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Pencuo Zeren, Qiuming Luo, Rui Mao, Chang Kong
13 May 2025

Lost in Transmission: When and Why LLMs Fail to Reason Globally
Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville
LRM · 13 May 2025

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Hyowon Wi, Jeongwhan Choi, Noseong Park
13 May 2025

Fused3S: Fast Sparse Attention on Tensor Cores
International Conference on Supercomputing (ICS), 2025
Zitong Li, Aparna Chandramowlishwaran
GNN · 12 May 2025

A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
Lhuqita Fazry
VLM · 11 May 2025

Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition
Andrew Kiruluta, Eric Lundy, Priscilla Burity
09 May 2025

Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution
Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu
04 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE, VLM · 01 May 2025

Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces
Michael Pichat, William Pogrund, Paloma Pichat, Judicael Poumay, Armanouche Gasparian, Samuel Demarchi, Martin Corbet, Alois Georgeon, Michael Veillet-Guillem
MILM · 30 Apr 2025

From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models
Andrew Kiruluta
29 Apr 2025

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu
26 Apr 2025

The Rise of Small Language Models in Healthcare: A Comprehensive Survey
Muskan Garg, Shaina Raza, Shebuti Rayana, Xingyi Liu, Sunghwan Sohn
LM&MA, AILaw · 23 Apr 2025

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, ..., Bing Xu, Haicheng Wu, Wen-mei W. Hwu, Xuan Li, Humphrey Shi
23 Apr 2025

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
Mamba · 23 Apr 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yue Yang, Lili Qiu
22 Apr 2025