Transformer-based neural network architectures are the state of the art in modern natural language processing. One of the most widely used base models is BERT, which is pre-trained on a large corpus for natural language understanding and designed to be fine-tuned for domain-specific tasks. Efficient fine-tuning requires hyperparameter optimization, and one of the most important hyperparameters is the learning rate, which controls how strongly the model parameters are updated during training. This study fine-tunes a BERT model on the Web of Science dataset to observe the effect of the learning rate on the model's performance and training time. The experiments were run on the Komondor HPC system to utilize the necessary parallel computational resources. The key finding of this paper is that a poorly chosen learning rate can slow the convergence of the model or lead to catastrophic forgetting, while early stopping plays a key role in reducing training time. The research concludes that HPC resources can effectively scale up the hyperparameter search needed to determine the optimal learning-rate range.
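The interaction between the learning-rate choice and early stopping described above can be illustrated with a minimal, self-contained sketch. The validation-loss curves below are hypothetical (not from the paper's experiments), chosen only to mimic the three regimes the abstract mentions: a well-chosen rate that converges, a too-small rate that converges slowly, and a too-large rate that diverges; the function names and the `patience` threshold are illustrative assumptions, not the authors' actual setup.

```python
def early_stopping_search(lr_curves, patience=2, min_delta=1e-4):
    """For each candidate learning rate, walk through its epoch-by-epoch
    validation losses and stop once no improvement larger than `min_delta`
    has been seen for `patience` consecutive epochs.

    Returns {lr: (best_val_loss, epochs_actually_run)}.
    """
    results = {}
    for lr, losses in lr_curves.items():
        best = float("inf")
        bad_epochs = 0
        epochs_run = 0
        for loss in losses:
            epochs_run += 1
            if loss < best - min_delta:
                best = loss          # improvement: reset the patience counter
                bad_epochs = 0
            else:
                bad_epochs += 1      # no improvement this epoch
                if bad_epochs >= patience:
                    break            # early stopping triggers
        results[lr] = (best, epochs_run)
    return results

# Hypothetical validation-loss curves for three learning rates:
curves = {
    2e-5: [0.9, 0.6, 0.45, 0.44, 0.44, 0.45],   # converges, then plateaus
    5e-7: [0.9, 0.88, 0.86, 0.85, 0.84, 0.83],  # too small: slow convergence
    1e-3: [0.9, 1.4, 2.0, 2.5, 2.6, 2.7],       # too large: loss diverges
}
print(early_stopping_search(curves))
```

In this toy setup the well-chosen rate stops after 6 epochs at the lowest loss, the too-small rate never plateaus within the budget, and the diverging rate is cut off after only 3 epochs, which is exactly why early stopping dominates the wall-clock cost of a parallel learning-rate sweep.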
Massively parallel distributed fine-tuning of transformer-based language models