Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy J. Lin
28 March 2019 · arXiv:1903.12136

Papers citing "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks"

50 / 50 papers shown
Title
Mitigating Catastrophic Forgetting in the Incremental Learning of Medical Images
Mitigating Catastrophic Forgetting in the Incremental Learning of Medical Images
Sara Yavari
Jacob Furst
CLL
53
0
0
28 Apr 2025
Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
Mutian He
Philip N. Garner
80
0
0
09 Oct 2024
Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models
Can't Hide Behind the API: Stealing Black-Box Commercial Embedding Models
Manveer Singh Tamber
Jasper Xian
Jimmy Lin
MLAU
SILM
134
0
0
13 Jun 2024
Augmenting Offline RL with Unlabeled Data
Augmenting Offline RL with Unlabeled Data
Zhao Wang
Briti Gangopadhyay
Jia-Fong Yeh
Shingo Takamatsu
OffRL
26
0
0
11 Jun 2024
Integrating Domain Knowledge for handling Limited Data in Offline RL
Integrating Domain Knowledge for handling Limited Data in Offline RL
Briti Gangopadhyay
Zhao Wang
Jia-Fong Yeh
Shingo Takamatsu
OffRL
32
0
0
11 Jun 2024
Sentence-Level or Token-Level? A Comprehensive Study on Knowledge
  Distillation
Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation
Jingxuan Wei
Linzhuang Sun
Yichong Leng
Xu Tan
Bihui Yu
Ruifeng Guo
43
3
0
23 Apr 2024
Teaching MLP More Graph Information: A Three-stage Multitask Knowledge
  Distillation Framework
Teaching MLP More Graph Information: A Three-stage Multitask Knowledge Distillation Framework
Junxian Li
Bin Shi
Erfei Cui
Hua Wei
Qinghua Zheng
41
0
0
02 Mar 2024
Confidence Preservation Property in Knowledge Distillation Abstractions
Confidence Preservation Property in Knowledge Distillation Abstractions
Dmitry Vengertsev
Elena Sherman
24
0
0
21 Jan 2024
9. Mixed Distillation Helps Smaller Language Model Better Reasoning
   Chenglin Li, Qianglong Chen, Liangyue Li, Caiyu Wang, Yicheng Li, Yin Zhang · LRM · 30 / 11 / 0 · 17 Dec 2023
10. Teacher-Student Architecture for Knowledge Distillation: A Survey
    Chengming Hu, Xuan Li, Danyang Liu, Haolun Wu, Xi Chen, Ju Wang, Xue Liu · 21 / 16 / 0 · 08 Aug 2023

11. Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models
    Seungcheol Park, Ho-Jin Choi, U. Kang · VLM · 25 / 5 / 0 · 07 Aug 2023

12. f-Divergence Minimization for Sequence-Level Knowledge Distillation
    Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou · 30 / 53 / 0 · 27 Jul 2023

13. Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition
    Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu · VLM · 28 / 35 / 0 · 20 Jul 2023

14. HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers
    Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bin Yin, Tuo Zhao · VLM · 24 / 24 / 0 · 19 Feb 2023

15. Distillation of encoder-decoder transformers for sequence labelling
    M. Farina, D. Pappadopulo, Anant Gupta, Leslie Huang, Ozan Irsoy, Thamar Solorio · VLM · 100 / 3 / 0 · 10 Feb 2023

16. Gradient Knowledge Distillation for Pre-trained Language Models
    Lean Wang, Lei Li, Xu Sun · VLM · 23 / 5 / 0 · 02 Nov 2022

17. Teacher-Student Architecture for Knowledge Learning: A Survey
    Chengming Hu, Xuan Li, Dan Liu, Xi Chen, Ju Wang, Xue Liu · 20 / 35 / 0 · 28 Oct 2022

18. An Effective, Performant Named Entity Recognition System for Noisy Business Telephone Conversation Transcripts
    Xue-Yong Fu, Cheng Chen, Md Tahmid Rahman Laskar, TN ShashiBhushan, Simon Corston-Oliver · 30 / 6 / 0 · 27 Sep 2022

19. Chemical transformer compression for accelerating both training and inference of molecular modeling
    Yi Yu, K. Börjesson · 19 / 0 / 0 · 16 May 2022

20. Adaptable Adapters
    N. Moosavi, Quentin Delfosse, Kristian Kersting, Iryna Gurevych · 48 / 21 / 0 · 03 May 2022

21. Attention Mechanism with Energy-Friendly Operations
    Yu Wan, Baosong Yang, Dayiheng Liu, Rong Xiao, Derek F. Wong, Haibo Zhang, Boxing Chen, Lidia S. Chao · MU · 96 / 1 / 0 · 28 Apr 2022

22. Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention
    Zuzana Jelčicová, Marian Verhelst · 26 / 5 / 0 · 20 Mar 2022

23. Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models
    Qinyuan Ye, Madian Khabsa, M. Lewis, Sinong Wang, Xiang Ren, Aaron Jaech · 29 / 5 / 0 · 16 Oct 2021

24. Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network
    Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari · 18 / 3 / 0 · 22 Sep 2021

25. Student Surpasses Teacher: Imitation Attack for Black-Box NLP APIs
    Qiongkai Xu, Xuanli He, Lingjuan Lyu, Lizhen Qu, Gholamreza Haffari · MLAU · 30 / 21 / 0 · 29 Aug 2021

26. Knowledge Distillation for Quality Estimation
    Amit Gajbhiye, M. Fomicheva, Fernando Alva-Manchego, Frédéric Blain, A. Obamuyide, Nikolaos Aletras, Lucia Specia · 14 / 11 / 0 · 01 Jul 2021

27. XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation
    Subhabrata Mukherjee, Ahmed Hassan Awadallah, Jianfeng Gao · 17 / 22 / 0 · 08 Jun 2021

28. The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures
    Sushant Singh, A. Mahmood · AI4TS · 55 / 92 / 0 · 23 Mar 2021

29. Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation
    Lingyun Feng, Minghui Qiu, Yaliang Li, Haitao Zheng, Ying Shen · 38 / 10 / 0 · 20 Jan 2021

30. LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
    Hao Fu, Shaojun Zhou, Qihong Yang, Junjie Tang, Guiquan Liu, Kaikui Liu, Xiaolong Li · 29 / 57 / 0 · 14 Dec 2020

31. Reinforced Multi-Teacher Selection for Knowledge Distillation
    Fei Yuan, Linjun Shou, J. Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang · 8 / 121 / 0 · 11 Dec 2020

32. BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online E-Commerce Search
    Yunjiang Jiang, Yue Shang, Ziyang Liu, Hongwei Shen, Yun Xiao, Wei Xiong, Sulong Xu, Weipeng P. Yan, Di Jin · 29 / 17 / 0 · 20 Oct 2020

33. Pretrained Transformers for Text Ranking: BERT and Beyond
    Jimmy J. Lin, Rodrigo Nogueira, Andrew Yates · VLM · 219 / 608 / 0 · 13 Oct 2020

34. Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor
    Xinyu Wang, Yong-jia Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu · 26 / 10 / 0 · 10 Oct 2020

35. Efficient Transformers: A Survey
    Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler · VLM · 74 / 1,101 / 0 · 14 Sep 2020

36. DualDE: Dually Distilling Knowledge Graph Embedding for Faster and Cheaper Reasoning
    Yushan Zhu, Wen Zhang, Mingyang Chen, Hui Chen, Xu-Xin Cheng, Wei Zhang, Huajun Chen (Zhejiang University) · 8 / 27 / 0 · 13 Sep 2020

37. Students Need More Attention: BERT-based Attention Model for Small Data with Application to Automatic Patient Message Triage
    Shijing Si, Rui Wang, Jedrek Wosik, Hao Zhang, D. Dov, Guoyin Wang, Ricardo Henao, Lawrence Carin · 12 / 24 / 0 · 22 Jun 2020

38. Knowledge Distillation: A Survey
    Jianping Gou, B. Yu, Stephen J. Maybank, Dacheng Tao · VLM · 19 / 2,835 / 0 · 09 Jun 2020

39. Movement Pruning: Adaptive Sparsity by Fine-Tuning
    Victor Sanh, Thomas Wolf, Alexander M. Rush · 13 / 466 / 0 · 15 May 2020

40. Detecting Adverse Drug Reactions from Twitter through Domain-Specific Preprocessing and BERT Ensembling
    Amy Breden, L. Moore · 15 / 13 / 0 · 11 May 2020

41. GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference
    Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, Andreas Moshovos · MQ · 19 / 183 / 0 · 08 May 2020

42. The Right Tool for the Job: Matching Model and Instance Complexities
    Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, Noah A. Smith · 33 / 167 / 0 · 16 Apr 2020

43. Squeezed Deep 6DoF Object Detection Using Knowledge Distillation
    H. Felix, Walber M. Rodrigues, David Macêdo, Francisco Simões, Adriano Oliveira, Veronica Teichrieb, Cleber Zanchettin · 3DPC · 14 / 9 / 0 · 30 Mar 2020

44. Pre-trained Models for Natural Language Processing: A Survey
    Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang · LM&MA, VLM · 243 / 1,450 / 0 · 18 Mar 2020

45. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou · VLM · 45 / 1,198 / 0 · 25 Feb 2020

46. Pre-training Tasks for Embedding-based Large-scale Retrieval
    Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar · RALM · 11 / 301 / 0 · 10 Feb 2020

47. Reducing Transformer Depth on Demand with Structured Dropout
    Angela Fan, Edouard Grave, Armand Joulin · 19 / 584 / 0 · 25 Sep 2019

48. DocBERT: BERT for Document Classification
    Ashutosh Adhikari, Achyudh Ram, Raphael Tang, Jimmy J. Lin · LLMAG, VLM · 11 / 296 / 0 · 17 Apr 2019
49. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman · ELM · 297 / 6,950 / 0 · 20 Apr 2018
50. Convolutional Neural Networks for Sentence Classification
    Yoon Kim · AILaw, VLM · 250 / 13,364 / 0 · 25 Aug 2014