DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train, and we demonstrate its capabilities for on-device computation in a proof-of-concept experiment and a comparative on-device study.
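The triple loss described above can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: it combines a temperature-softened soft-target distillation term, a hard-label masked-language-modeling term, and a cosine-embedding term aligning student and teacher hidden states. The function name and the loss weights (`alpha_ce`, `alpha_mlm`, `alpha_cos`) are illustrative placeholders, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_triple_loss(student_logits, teacher_logits, labels,
                             student_hidden, teacher_hidden,
                             temperature=2.0,
                             alpha_ce=1.0, alpha_mlm=1.0, alpha_cos=1.0):
    """Illustrative triple loss: distillation + language modeling + cosine.

    Shapes (batch B, sequence length T, vocab V, hidden size H):
      student_logits, teacher_logits: (B, T, V)
      labels:                         (B, T) token ids
      student_hidden, teacher_hidden: (B, T, H)
    """
    # 1) Distillation loss: KL divergence between temperature-softened
    #    student and teacher output distributions (scaled by T^2, as is
    #    standard in knowledge distillation).
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Language modeling loss: ordinary cross-entropy of the student's
    #    predictions against the hard token labels.
    vocab = student_logits.size(-1)
    loss_mlm = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))

    # 3) Cosine-distance loss: push student hidden states to point in the
    #    same direction as the teacher's (target = +1 for every position).
    hidden = student_hidden.size(-1)
    target = torch.ones(student_hidden.numel() // hidden)
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, hidden),
        teacher_hidden.view(-1, hidden),
        target,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```

In practice the three terms would be weighted and the teacher's parameters frozen; this sketch only shows how the three components combine into a single scalar training objective.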

References in zbMATH (referenced in 11 articles)

Showing results 1 to 11 of 11.
Sorted by year (citations)

  1. Hettiarachchi, Hansi; Adedoyin-Olowe, Mariam; Bhogal, Jagdev; Gaber, Mohamed Medhat: Embed2detect: temporally clustered embedded words for event detection in social media (2022)
  2. Quamar, Abdul; Efthymiou, Vasilis; Lei, Chuan; Özcan, Fatma: Natural language interfaces to data (2022)
  3. Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoirin Kim: FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning (2022) arXiv
  4. Janizek, Joseph D.; Sturmfels, Pascal; Lee, Su-In: Explaining explanations: axiomatic feature interactions for deep networks (2021)
  5. Juan Manuel Pérez, Juan Carlos Giudici, Franco Luque: pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks (2021) arXiv
  6. Mehdi Bahrami, N.C. Shrikanth, Shade Ruangwan, Lei Liu, Yuji Mizobuchi, Masahiro Fukuyori, Wei-Peng Chen, Kazuki Munakata, Tim Menzies: PyTorrent: A Python Library Corpus for Large-scale Language Models (2021) arXiv
  7. Tao Gui, Xiao Wang, Qi Zhang, Qin Liu, Yicheng Zou, Xin Zhou, Rui Zheng, Chong Zhang, Qinzhuo Wu, Jiacheng Ye, Zexiong Pang, Yongxin Zhang, Zhengyan Li, Ruotian Ma, Zichu Fei, Ruijian Cai, Jun Zhao, Xinwu Hu, Zhiheng Yan, Yiding Tan, Yuan Hu, Qiyuan Bian, Zhihua Liu, Bolin Zhu, Shan Qin, Xiaoyu Xing, Jinlan Fu, Yue Zhang, Minlong Peng, Xiaoqing Zheng, Yaqian Zhou, Zhongyu Wei, Xipeng Qiu, Xuanjing Huang: TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing (2021) arXiv
  8. Tripathy, Jatin Karthik; Sethuraman, Sibi Chakkaravarthy; Cruz, Meenalosini Vimal; Namburu, Anupama; P., Mangalraj; R., Nandha Kumar; S., Sudhakar Ilango; Vijayakumar, Vaidehi: Comprehensive analysis of embeddings and pre-training in NLP (2021)
  9. Jaap Jumelet: diagNNose: A Library for Neural Activation Analysis (2020) arXiv
  10. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.: Language Models are Few-Shot Learners (2020) arXiv
  11. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Jamie Brew: HuggingFace’s Transformers: State-of-the-art Natural Language Processing (2019) arXiv