1904.10509
Generating Long Sequences with Sparse Transformers

23 April 2019

Papers citing "Generating Long Sequences with Sparse Transformers"

50 / 1,282 papers shown

PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling

157

0

0

27 Sep 2025

ECHO: Toward Contextual Seq2Seq Paradigms in Large EEG Models

91

0

0

26 Sep 2025

Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data

Zhi-Qin John Xu

169

0

0

22 Sep 2025

Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few

380

0

0

21 Sep 2025

Attention Schema-based Attention Control (ASAC): A Cognitive-Inspired Approach for Attention Management in Transformers

Federico Jurado Ruiz

192

0

0

19 Sep 2025

Local Mechanisms of Compositional Generalization in Conditional Diffusion

244

1

0

19 Sep 2025

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

175

0

0

18 Sep 2025

The Few-shot Dilemma: Over-prompting Large Language Models

Christian Koerner

232

4

0

16 Sep 2025

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

Xiaozhuan Liang

161

1

0

16 Sep 2025

A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator

Feeza Khan Khanzada

145

3

0

10 Sep 2025

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

Andres Potapczynski

Andrew Gordon Wilson

119

0

0

09 Sep 2025

Faster VGGT with Block-Sparse Global Attention

Chung-Shien Brian Wang

Christian Schmidt

Jens Piekenbrinck

116

8

0

08 Sep 2025

Rethinking the long-range dependency in Mamba/SSM and transformer models

Kayvan Najarian

150

1

0

04 Sep 2025

Differentiable Entropy Regularization: A Complexity-Aware Approach for Neural Optimization

Ibne Farabi Shihab

81

0

0

03 Sep 2025

DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

153

27

0

02 Sep 2025

REFRAG: Rethinking RAG based Decoding

Bryan Kian Hsiang Low

Anshumali Shrivastava

226

1

0

01 Sep 2025

DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers

Parsa Farinneya

Benyamin Jamialahmadi

Marzieh S. Tahaei

Mehdi Rezagholizadeh

86

1

0

31 Aug 2025

Spiking Decision Transformers: Local Plasticity, Phase-Coding, and Dendritic Routing for Low-Power Sequence Control

Debasmita Biswas

65

0

0

29 Aug 2025

ATM-GAD: Adaptive Temporal Motif Graph Anomaly Detection for Financial Transaction Networks

AI4TS MLAU AIFin

168

1

0

28 Aug 2025

Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

76

0

0

26 Aug 2025

Limitations of Normalization in Attention Mechanism

Timur Mudarisov

Mikhail Burtsev

Tatiana Petrova

95

2

0

25 Aug 2025

Exploring Scaling Laws of CTR Model for Online Performance Improvement

ACM Conference on Recommender Systems (RecSys), 2025

180

2

0

21 Aug 2025

Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

Guangcong Zheng

152

2

0

18 Aug 2025

Pre-trained Transformer-models using chronic invasive electrophysiology for symptom decoding without patient-individual training

Richard M. Koehler

...

Nicole R. Provenza

Reza Abbasi-Asl

Wolf-Julian Neumann

107

0

0

13 Aug 2025

P/D-Device: Disaggregated Large Language Model between Cloud and Devices

...

Zhengyong Zhang

217

1

0

12 Aug 2025

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal

...

131

268

0

08 Aug 2025

Generalizing Scaling Laws for Dense and Sparse Large Language Models

Md Arafat Hossain

183

0

0

08 Aug 2025

Deformable Attention Graph Representation Learning for Histopathology Whole Slide Image Analysis

85

0

0

07 Aug 2025

GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries

209

2

0

06 Aug 2025

Trainable Dynamic Mask Sparse Attention

351

3

0

04 Aug 2025

Pointer: Linear-Complexity Long-Range Modeling without Pre-training

103

0

0

04 Aug 2025

Hebbian Memory-Augmented Recurrent Networks: Engram Neurons in Deep Learning

Daniel Szelogowski

98

1

0

29 Jul 2025

MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse

239

1

0

29 Jul 2025

TriangleMix: Accelerating Prefilling via Decoding-time Contribution Sparsity

Chengruidong Zhang

170

0

0

29 Jul 2025

Onboard Hyperspectral Super-Resolution with Deep Pushbroom Neural Network

Remote Sensing (RS), 2025

Davide Piccinini

424

1

0

28 Jul 2025

EcoTransformer: Attention without Multiplication

Shirin Amiraslani

112

1

0

27 Jul 2025

SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion

Hei Shing Cheung

Jonathan H. Chan

195

0

0

26 Jul 2025

Modality Agnostic Efficient Long Range Encoder

158

0

0

25 Jul 2025

Efficient Attention Mechanisms for Large Language Models: A Survey

245

10

0

25 Jul 2025

Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

243

0

0

24 Jul 2025

Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Gjergji Kasneci

256

6

0

24 Jul 2025

Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers

Vasileios Titopoulos

K. Alexandridis

G. Dimitrakopoulos

94

0

0

22 Jul 2025

Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers

147

1

0

21 Jul 2025

SAS: Simulated Attention Score

Chuanyang Zheng

...

Anderson Schneider

243

2

0

10 Jul 2025

ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Kiarash Zahirnia

Zahra Golpayegani

260

0

0

08 Jul 2025

All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Mohamed El Amine Boudjoghra

Rao Muhammad Anwer

222

1

0

07 Jul 2025

BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

Patrik Okanovic

Sameer Deshmukh

Grzegorz Kwaśniewski

...

Kentaro Katayama

Yusuke Nagasaka

Torsten Hoefler

203

0

0

03 Jul 2025

A unified framework for establishing the universal approximation of transformer-type architectures

155

0

0

30 Jun 2025

RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

257

2

0

18 Jun 2025

StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

233

2

0

16 Jun 2025
