Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
11 April 2018
Noam M. Shazeer
Mitchell Stern
ODL
arXiv:1804.04235
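For context on the paper this listing covers, here is a minimal, hypothetical usage sketch of the Adafactor optimizer, assuming the JAX/Optax stack (`optax.adafactor`); the parameter shapes and loss below are illustrative only and are not taken from this page.

```python
# Hypothetical sketch: one training step with Adafactor via Optax
# (assumes the jax and optax packages are installed).
import jax
import jax.numpy as jnp
import optax

params = {"w": jnp.ones((512, 256))}          # toy weight matrix; Adafactor keeps factored second-moment stats
optimizer = optax.adafactor(learning_rate=1e-3)
opt_state = optimizer.init(params)

def loss_fn(p):
    return jnp.sum(p["w"] ** 2)               # placeholder loss for illustration

grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```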

Papers citing "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost"

Showing 50 of 799 citing papers
Generating abstractive summaries of Lithuanian news articles using a transformer model
International Conference on Information and Software Technologies (ICIST), 2021
Lukas Stankevicius
M. Lukoševičius
127
3
0
23 Apr 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
1.4K
4,984
0
18 Apr 2021
DEUX: An Attribute-Guided Framework for Sociable Recommendation Dialog Systems
Yu Li
Shirley Anugrah Hayati
Weiyan Shi
Zhou Yu
196
6
0
16 Apr 2021
Comparison of Grammatical Error Correction Using Back-Translation Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Aomi Koyama
Kengo Hotate
Masahiro Kaneko
Mamoru Komachi
112
12
0
16 Apr 2021
Planning with Learned Entity Prompts for Abstractive Summarization
Transactions of the Association for Computational Linguistics (TACL), 2021
Shashi Narayan
Yao-Min Zhao
Joshua Maynez
Gonçalo Simões
Vitaly Nikolaev
Ryan T. McDonald
LRM
278
130
0
15 Apr 2021
Pushing the Limits of Non-Autoregressive Speech Recognition
Interspeech (Interspeech), 2021
Edwin G. Ng
Chung-Cheng Chiu
Yu Zhang
William Chan
VLM
262
31
0
07 Apr 2021
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
William Chan
Daniel S. Park
Chris A. Lee
Yu Zhang
Quoc V. Le
Mohammad Norouzi
AI4TS
360
147
0
05 Apr 2021
Efficient Attentions for Long Document Summarization
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
L. Huang
Shuyang Cao
Nikolaus Nova Parulian
Heng Ji
Lu Wang
330
356
0
05 Apr 2021
Cryptonite: A Cryptic Crossword Benchmark for Extreme Ambiguity in Language
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Avia Efrat
Uri Shaham
D. Kilman
Omer Levy
ELM
262
19
0
01 Mar 2021
Do Transformer Modifications Transfer Across Implementations and Applications?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Sharan Narang
Hyung Won Chung
Yi Tay
W. Fedus
Thibault Févry
...
Wei Li
Nan Ding
Jake Marcus
Adam Roberts
Colin Raffel
215
134
0
23 Feb 2021
WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web
Róbert Ormándi
Mohammad Saleh
Erin Winter
Vinay Rao
103
11
0
18 Feb 2021
Civil Rephrases Of Toxic Texts With Self-Supervised Transformers
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Leo Laugier
John Pavlopoulos
Jeffrey Scott Sorensen
Lucas Dixon
265
52
0
01 Feb 2021
Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning
Dariush Bahrami
Sadegh Pouriyan Zadeh
ODL
96
5
0
22 Jan 2021
Analyzing Commonsense Emergence in Few-shot Knowledge Models
Conference on Automated Knowledge Base Construction (AKBC), 2021
Jeff Da
Ronan Le Bras
Ximing Lu
Yejin Choi
Antoine Bosselut
AI4MHKELM
470
42
0
01 Jan 2021
Studying Strategically: Learning to Mask for Closed-book QA
Qinyuan Ye
Belinda Z. Li
Sinong Wang
Benjamin Bolte
Hao Ma
Anuj Kumar
Xiang Ren
Madian Khabsa
OffRL
265
12
0
31 Dec 2020
Promoting Graph Awareness in Linearized Graph-to-Text Generation
Findings (Findings), 2020
Alexander Miserlis Hoyle
Ana Marasović
Noah A. Smith
AI4CE
169
32
0
31 Dec 2020
AraGPT2: Pre-Trained Transformer for Arabic Language Generation
Workshop on Arabic Natural Language Processing (WANLP), 2020
Wissam Antoun
Fady Baly
Hazem M. Hajj
VLM
283
131
0
31 Dec 2020
Few-Shot Text Generation with Pattern-Exploiting Training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Timo Schick
Hinrich Schütze
282
155
0
22 Dec 2020
Contrastive Learning with Adversarial Perturbations for Conditional Text Generation
International Conference on Learning Representations (ICLR), 2020
Seanie Lee
Dong Bok Lee
Sung Ju Hwang
545
117
0
14 Dec 2020
Collaborative Storytelling with Large-scale Neural Language Models
Motion in Games (MIG), 2020
Eric Nichols
Leo Gao
R. Gomez
177
54
0
20 Nov 2020
Whale: Efficient Giant Model Training over Heterogeneous GPUs
USENIX Annual Technical Conference (USENIX ATC), 2020
Chencan Wu
Le Jiang
Ang Wang
Wencong Xiao
Ziji Shi
...
Lan-yue Chen
Yong Li
Zhen Zheng
Xiaoyong Liu
Wei Lin
274
68
0
18 Nov 2020
Stochastic Optimization with Laggard Data Pipelines
Neural Information Processing Systems (NeurIPS), 2020
Naman Agarwal
Rohan Anil
Tomer Koren
Kunal Talwar
Cyril Zhang
86
12
0
26 Oct 2020
GO FIGURE: A Meta Evaluation of Factuality in Summarization
Findings (Findings), 2020
Saadia Gabriel
Asli Celikyilmaz
Rahul Jha
Yejin Choi
Jianfeng Gao
HILM
544
105
0
24 Oct 2020
Towards Zero-Shot Multilingual Synthetic Question and Answer Generation for Cross-Lingual Reading Comprehension
Siamak Shakeri
Noah Constant
Mihir Kale
Linting Xue
SyDa
357
29
0
22 Oct 2020
CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20
Ivana Kvapilíková
Tom Kocmi
Ondrej Bojar
95
5
0
22 Oct 2020
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Yu Zhang
James Qin
Daniel S. Park
Wei Han
Chung-Cheng Chiu
Ruoming Pang
Quoc V. Le
Yonghui Wu
VLMSSL
577
327
0
20 Oct 2020
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
William Merrill
Vivek Ramanujan
Yoav Goldberg
Roy Schwartz
Noah A. Smith
AI4CE
598
42
0
19 Oct 2020
Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties
Brett Daley
Chris Amato
ODL
138
4
0
03 Oct 2020
Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves
Luke Metz
Niru Maheswaranathan
C. Freeman
Ben Poole
Jascha Narain Sohl-Dickstein
295
69
0
23 Sep 2020
Seq2Edits: Sequence Transduction Using Span-level Edit Operations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Felix Stahlberg
Shankar Kumar
BDL
196
95
0
23 Sep 2020
PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
Diedre Carmo
Marcos Piau
Israel Campiotti
Rodrigo Nogueira
R. Lotufo
LM&MA
130
64
0
20 Aug 2020
Whitening and second order optimization both make information in the dataset unusable during training, and can reduce or prevent generalization
Neha S. Wadia
Daniel Duckworth
S. Schoenholz
Ethan Dyer
Jascha Narain Sohl-Dickstein
450
17
0
17 Aug 2020
Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
Sai Praneeth Karimireddy
Martin Jaggi
Satyen Kale
M. Mohri
Sashank J. Reddi
Sebastian U. Stich
A. Suresh
FedML
548
236
0
08 Aug 2020
Data Weighted Training Strategies for Grammatical Error Correction
Transactions of the Association for Computational Linguistics (TACL), 2020
Jared Lichtarge
Chris Alberti
Shankar Kumar
237
50
0
07 Aug 2020
A Comparison of Optimization Algorithms for Deep Learning
International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2020
Derya Soydaner
215
187
0
28 Jul 2020
Binary Search and First Order Gradient Based Method for Stochastic Optimization
V. Pandey
ODL
119
0
0
27 Jul 2020
Improving compute efficacy frontiers with SliceOut
Pascal Notin
Aidan Gomez
Joanna Yoo
Y. Gal
153
1
0
21 Jul 2020
HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections
Yi Tay
Zhe Zhao
Dara Bahri
Donald Metzler
Da-Cheng Juan
186
9
0
12 Jul 2020
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
Robin M. Schmidt
Frank Schneider
Philipp Hennig
ODL
804
186
0
03 Jul 2020
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin
HyoukJoong Lee
Yuanzhong Xu
Dehao Chen
Orhan Firat
Yanping Huang
M. Krikun
Noam M. Shazeer
Zhiwen Chen
MoE
393
1,635
0
30 Jun 2020
SEAL: Segment-wise Extractive-Abstractive Long-form Text Summarization
Yao-Min Zhao
Mohammad Saleh
Peter J. Liu
RALM
166
27
0
18 Jun 2020
Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs
Martin Schmitt
Leonardo F. R. Ribeiro
Philipp Dufter
Iryna Gurevych
Hinrich Schütze
GNN
231
8
0
16 Jun 2020
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
AAAI Conference on Artificial Intelligence (AAAI), 2020
Z. Yao
A. Gholami
Sheng Shen
Mustafa Mustafa
Kurt Keutzer
Michael W. Mahoney
ODL
452
333
0
01 Jun 2020
WT5?! Training Text-to-Text Models to Explain their Predictions
Sharan Narang
Colin Raffel
Katherine Lee
Adam Roberts
Noah Fiedel
Karishma Malkan
209
213
0
30 Apr 2020
Recipes for building an open-domain chatbot
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2020
Stephen Roller
Emily Dinan
Naman Goyal
Da Ju
Mary Williamson
...
Myle Ott
Kurt Shuster
Eric Michael Smith
Y-Lan Boureau
Jason Weston
ALM
530
1,085
0
28 Apr 2020
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
Yuanzhong Xu
HyoukJoong Lee
Dehao Chen
Hongjun Choi
Blake A. Hechtman
Shibo Wang
201
50
0
28 Apr 2020
AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
Wenjie Li
Zhaoyang Zhang
Xinjiang Wang
Ping Luo
ODL
209
29
0
21 Apr 2020
TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Rowan Zellers
Ari Holtzman
Elizabeth Clark
Lianhui Qin
Ali Farhadi
Yejin Choi
ELMLRM
233
13
0
07 Apr 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Transactions of the Association for Computational Linguistics (TACL), 2020
Aurko Roy
M. Saffar
Ashish Vaswani
David Grangier
MoE
968
686
0
12 Mar 2020
Talking-Heads Attention
Noam M. Shazeer
Zhenzhong Lan
Youlong Cheng
Nan Ding
L. Hou
268
92
0
05 Mar 2020