
Paper Reading: "Cramming: Training a Language Model on a Single GPU in One Day"

IBM NLP researcher Leshem Choshen commented on Twitter: "This paper summarizes all the large-model training tricks you can think of."

### Limited resources

To simulate the resource environment of ordinary practitioners and researchers, the study first defined a resource-constrained research setting:

  • A transformer-based language model of any size is trained entirely from scratch with masked language modeling;

  • The pipeline cannot contain existing pre-trained models;

  • Any raw text (excluding downstream data) may be included in training, which means acceleration can be achieved by choosing wisely how and when to sample the data, provided the sampling mechanism does not itself require a pre-trained model;

  • Downloading and pre-processing the raw data are not counted against the total budget. Pre-processing here includes CPU-based tokenizer construction, tokenization, and filtering, but not feature learning;

  • Training runs on a single GPU for 24 hours only;

  • Downstream performance is evaluated on GLUE. Fine-tuning on GLUE is limited to brief training (5 epochs or less) using only the training data of each downstream task, and must use a single global set of hyperparameters shared across all GLUE tasks (a minimal configuration sketch follows below). Downstream fine-tuning is not counted against the total budget.
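To make the last constraint concrete, here is a minimal sketch of what a single global hyperparameter set applied to every GLUE task could look like. This is my own illustration, not the paper's released configuration: the specific values and the `finetune_fn` callback are assumptions.

```python
# Illustrative sketch only: one global hyperparameter set reused for every
# GLUE task, with at most 5 epochs and only each task's own training split.
# All concrete values below are assumptions, not the paper's configuration.

GLOBAL_FINETUNE_CONFIG = {
    "epochs": 5,             # upper bound from the constraint above
    "batch_size": 32,        # assumed value
    "learning_rate": 2e-5,   # assumed value
    "weight_decay": 0.01,    # assumed value
    "max_seq_length": 128,   # assumed value
}

GLUE_TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]

def finetune_all(finetune_fn):
    """Apply the same config to every task; no per-task tuning is allowed."""
    return {task: finetune_fn(task, **GLOBAL_FINETUNE_CONFIG) for task in GLUE_TASKS}
```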

### Improvement methods

The researchers implemented and tested modification directions proposed in existing work, covering the general implementation and initial data setup, and then attempted changes to the architecture, the training procedure, and the dataset.

The experiments were conducted in PyTorch without idiosyncratic custom implementations, to keep comparisons as fair as possible. Everything was kept at the implementation level of the PyTorch framework, and only automatic operator fusion that could be applied to all components was allowed. In addition, the efficient attention kernel was only re-enabled after the final architectural variant had been selected.
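As a rough illustration of what "framework-level only" means, here is a minimal sketch (my assumption, not the paper's code) of an attention block whose only accelerations are PyTorch's built-in fused scaled-dot-product attention kernel and whole-module operator fusion via `torch.compile` (PyTorch 2.0+). The module name `TinyAttention` and all dimensions are made up for the example.

```python
import torch
import torch.nn.functional as F

class TinyAttention(torch.nn.Module):
    """Illustrative self-attention block; sizes are arbitrary."""
    def __init__(self, dim: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        # fused "efficient attention" kernel shipped with PyTorch >= 2.0
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

model = torch.compile(TinyAttention())   # automatic operator fusion over the whole module
x = torch.randn(2, 128, 768)
print(model(x).shape)                    # torch.Size([2, 128, 768])
```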

The first method that comes to mind for improving performance is modifying the model architecture: intuitively, smaller, lower-capacity models seem optimal for a one-GPU, one-day training run. However, after studying the relationship between model type and training efficiency, the researchers found that scaling laws pose a huge obstacle to scaling down: per-token training efficiency depends largely on model size rather than on the type of transformer.

In addition, smaller models learn less efficiently per token, which largely cancels out their gain in throughput. Fortunately, the fact that training efficiency stays almost constant across models of the same size means a suitable design can be sought among architectures with similar parameter counts, with the choice driven mainly by the computation time of a single gradient step.
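The trade-off can be made concrete with a back-of-the-envelope calculation (my own illustration, not numbers from the paper), using the common approximation of roughly 6 FLOPs per parameter per training token; the assumed 30 TFLOP/s sustained throughput is purely illustrative.

```python
# Back-of-the-envelope sketch (assumptions, not measurements from the paper):
# how many tokens a fixed 24-hour, single-GPU budget buys at different model
# sizes, using the common ~6 * N FLOPs-per-token training approximation.

def tokens_per_budget(n_params: float, flops_budget: float) -> float:
    flops_per_token = 6.0 * n_params        # forward + backward pass estimate
    return flops_budget / flops_per_token

SUSTAINED_FLOPS = 30e12                     # assumed 30 TFLOP/s, illustrative only
BUDGET = SUSTAINED_FLOPS * 24 * 3600        # 24 hours of compute

for n_params in (30e6, 110e6, 340e6):       # "tiny", BERT-base, BERT-large scale
    tokens = tokens_per_budget(n_params, BUDGET)
    print(f"{n_params / 1e6:.0f}M params -> about {tokens / 1e9:.1f}B tokens")
```

A smaller model sees far more tokens within the budget, but because each token teaches it less, the end result is roughly a wash; this is why the paper focuses on architectures of similar size and optimizes the cost of each gradient step instead.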

### Summary

Overall, with the methods in the paper, the training results come very close to the original BERT, while the total FLOPs used by the latter are 45-136 times those of the new method (BERT took four days on 16 TPUs). When the training budget was extended 16-fold (two days on 8 GPUs), the new method's performance actually improved significantly over the original BERT, reaching the level of RoBERTa.

In this work, the authors discussed how much performance a transformer-based language model can achieve under a very limited compute budget. Fortunately, several modification directions make it possible to achieve good downstream performance on GLUE. The researchers expressed the hope that this work can provide a baseline for further improvement and offer support for the many improvements and techniques proposed for the Transformer architecture in recent years.