Abstract
Recent advances, including the Transformer architecture, have revolutionized the Natural Language Processing community by providing immense performance improvements across many tasks, including the development of Large Language Models (LLMs). LLMs show enormous promise as few-shot learners, common-sense knowledge repositories, conversational agents, writing assistants, and coding tools, and are gaining widespread traction in industry. However, LLMs are expensive and time-consuming to train, requiring many passes over terabytes of data for the largest models. In this paper, we present Superstilling, a method for reducing the sample complexity of language model training by distilling the knowledge from a previously trained model (the teacher) into a new, larger model (the student). This method does not require conformity between the architectures of the two models, and can be applied even when the weights and training data of the teacher model are unavailable, for example in federated learning scenarios. We apply Superstilling to train models of various sizes and show that this method can decrease sample complexity by more than 10% on models with over 160M parameters. We also show that in certain scenarios, Superstilling can be used to speed up training despite the need to run the teacher and student models simultaneously.
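The abstract describes Superstilling as distilling a previously trained teacher into a larger student using only the teacher's outputs, not its weights or training data. As a rough illustration only, the sketch below shows a generic logit-matching distillation step of the kind this description suggests; the function name, the alpha and temperature hyperparameters, and the assumption that teacher and student share a vocabulary are illustrative and not taken from the thesis.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, input_ids, labels,
                      alpha=0.5, temperature=2.0):
    """One training step mixing ground-truth and teacher-matching losses.

    Illustrative only: alpha and temperature are placeholder hyperparameters,
    and the thesis's actual Superstilling objective may differ.
    """
    # The teacher runs in inference mode only; its weights are never updated,
    # so only its output logits need to be available (e.g. across a federation).
    with torch.no_grad():
        teacher_logits = teacher(input_ids)      # (batch, seq, vocab)

    student_logits = student(input_ids)          # (batch, seq, vocab)

    # Standard next-token cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )

    # Soft-target loss: pull the student's distribution toward the teacher's.
    # Assumes the two models share a vocabulary, even if their architectures differ.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce_loss + (1 - alpha) * kd_loss

Note that because the teacher must be evaluated alongside the student, each step costs an extra forward pass; the abstract's claim is that the reduced sample complexity can still yield a net speedup in certain scenarios.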
Degree
MS
College and Department
Computational, Mathematical, and Physical Sciences; Computer Science
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Gundry, Chaz Allen, "The Student Becomes The Teacher: Training High-Performance Language Models More Sample-Efficiently From Small Models Via Superstilling" (2023). Theses and Dissertations. 10527.
https://scholarsarchive.byu.edu/etd/10527
Date Submitted
2023-08-14
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd13365
Keywords
large language models, knowledge distillation, sample efficiency, transformer, deep learning
Language
English