TL;DR: Look below for the list of pre-trained Dutch and Dutch+English models.
A few months ago, I was given access to Google's TPU Research Cloud (TRC). My goal was to train several Dutch and Dutch+English T5 models, limited to model sizes that can run on a single GPU. T5 is a text-to-text transfer transformer, a neural network model with natural language text as input and output. It can be fine-tuned on a wide range of tasks.
Background on Google's TPU-VM and how to use the Huggingface transformers library to pre-train models can be found at the following pages:
This project is a continuation of the work I performed together with Dat Nguyen during the Flax/JAX Community Week to create a T5 model pre-trained from scratch on Dutch.
The multilingual C4 (mC4) dataset was created by the original T5 authors. It was prepared and released by AllenNLP on the Huggingface Dataset hub. Our team cleaned Dutch mC4 with code adapted from the C4 TensorFlow dataset, and used the resulting text files in the pre-training scripts. We also verified that Dutch C4 was deduplicated.
To make it easy to reuse this dataset for more pre-training sessions with Huggingface's scripts, a Huggingface dataset was created: mc4_nl_cleaned. For Dutch and English training, a couple of additional configs were added to the generation script. These configs produce interleaved Dutch and English texts with a 1:1 ratio. For instance, the micro_en_nl config mixes Dutch with English samples. The cleaned English C4 dataset is about 5 times larger (in compressed bytes) than the Dutch part, so 1:1 interleaving with Dutch discards about 80% of English C4. The full cleaned Dutch mC4 dataset is 151GB, and is still (June '22) the largest cleaned Dutch corpus available on the HF Hub.
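The effect of 1:1 interleaving can be pictured with a minimal sketch. The function `interleave_1_to_1` and the toy document lists are hypothetical and are not taken from the dataset generation script; the sketch also counts samples, whereas the factor of 5 above is measured in compressed bytes. It only illustrates why alternating samples until the smaller source runs out leaves most of a five-times-larger source unused.

```python
def interleave_1_to_1(dutch, english):
    """Alternate one Dutch and one English sample until either source is exhausted."""
    for nl_doc, en_doc in zip(dutch, english):
        yield nl_doc
        yield en_doc


# Toy corpora: the English list is 5 times longer than the Dutch one (assumed counts).
dutch_docs = [f"nl-{i}" for i in range(100)]
english_docs = [f"en-{i}" for i in range(500)]

mixed = list(interleave_1_to_1(dutch_docs, english_docs))
english_used = sum(1 for doc in mixed if doc.startswith("en-"))
discarded_fraction = 1 - english_used / len(english_docs)

print(mixed[:4])
print(f"English discarded: {discarded_fraction:.0%}")
```

With these toy sizes, 400 of the 500 English documents are never emitted, which matches the roughly 80% figure.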
The Dutch and Dutch+English T5 models are pre-trained with the masked language modeling (MLM) "span corruption" objective. During pre-training, 15% of the tokens are masked and each span of masked tokens is replaced by a sentinel token.
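As an illustration of span corruption, here is a minimal sketch that works on whitespace-separated words instead of subword tokens. The function `span_corrupt` is hypothetical and the masked spans are chosen by hand; in actual pre-training the spans are sampled at random so that about 15% of the tokens end up masked. The `<extra_id_N>` names follow the sentinel tokens of the Huggingface T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets)."""
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])   # keep the unmasked text
        inputs.append(sentinel)               # one sentinel replaces the whole span
        targets.append(sentinel)              # the target repeats the sentinel ...
        targets.extend(tokens[start:end])     # ... followed by the masked tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])

print(" ".join(inputs))
print(" ".join(targets))
```

The model receives the corrupted input and learns to generate the target, so both sides of the objective are plain text.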
When I was using an old version of the Flax T5 MLM pretraining script, I noticed that the per-batch training speed seemed slower at the beginning of epochs when a larger dataset config was used. Also, on large configs, batch shuffling would fail with a TPU out-of-memory error. For these reasons, I started experimenting with training for more epochs on smaller configs.
This should be ok. In the original T5 paper, downstream performance was compared between training on 2^35 tokens and training for multiple epochs on a smaller part. 64 repeats of 2^29 tokens did not result in degraded downstream performance. The model `yhavinga/t5-v1_1-base-dutch-english-cased` is trained on the `small` config for 10 epochs.
In the end, a change to the pre-training script to perform batch shuffling (permuting an array) on the CPU instead of the accelerator device solved all related issues, and larger configs could be used without any issues.
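A minimal sketch of the idea, assuming NumPy for the host-side permutation. The function name and its arguments are hypothetical and not copied from the pre-training script. The point is that the index permutation lives in host memory, so its size no longer competes with the model for accelerator memory; a call such as `jax.random.permutation` would instead create the array on the default JAX device.

```python
import numpy as np


def host_batch_indices(num_samples, batch_size, seed):
    """Shuffle sample indices on the CPU and split them into full batches."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)        # NumPy array, stays in host memory
    num_batches = num_samples // batch_size    # drop the incomplete last batch
    perm = perm[: num_batches * batch_size]
    return perm.reshape(num_batches, batch_size)


batches = host_batch_indices(num_samples=10, batch_size=4, seed=0)
print(batches.shape)
```

Only the examples selected by one row of `batches` need to be moved to the device per step.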
During the Flax/Jax Community week we quickly decided on using Adafactor with learning rate 5e-3. I was sure that with more time, a better setting could be found. After performing seven sweeps with Adafactor, AdamW and Distributed Shampoo (experimental PJIT version from Dall-E mini), I gave up trying to find better settings. The graph below shows the runs from all seven sweeps combined. Apologies for the legend: I cannot show the optimizer in it, because the initial version of the training script had the optimizer option `--adafactor` as a boolean, which I later changed to a string with the optimizer name. All runs in the graph below that get the loss below 4 use Adafactor. Peach-sweep-6 is dashed orange and has learning rate 5e-3.
While there probably is a setting that will allow Adam and Shampoo to also converge fast below loss 4.0, I was unable to find it. In a recent tweet Lucas Nestler had more success with Shampoo (https://twitter.com/_clashluke/status/1535994026876252160) so maybe I need to revisit the attempt with the latest upstream code bases.
I had some additional options in the pre-training script that I wanted to use. An exponential decay learning rate schedule would allow me to pre-train for as long as desired, instead of a fixed number of steps. I was also keen to pre-train with bfloat16, for the reduced memory footprint and speed. This failed. The graph below shows different attempts with the legend showing the optimizer, dtype, learning rate, total batch size and lr-schedule to train t5-small-24L-dutch-english.
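To make the appeal of an exponential decay schedule concrete, here is a minimal sketch in plain Python. The peak learning rate and warmup length are taken from the model table in this post; `decay_rate` and `decay_steps` are hypothetical values chosen for illustration, and the function is not the schedule code of the pre-training script. Unlike a linear decay to zero, the formula needs no total step count, so training can be extended without redefining the schedule.

```python
def lr_at(step, peak_lr=5e-3, warmup_steps=10_000, decay_rate=0.5, decay_steps=200_000):
    """Linear warmup to peak_lr, then multiply by decay_rate every decay_steps steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * decay_rate ** ((step - warmup_steps) / decay_steps)


for step in (0, 5_000, 10_000, 210_000, 410_000):
    print(step, lr_at(step))
```

With these assumed values, the learning rate halves every 200,000 steps after warmup and never reaches zero.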
In the end, all models released on the hub are trained with Flax in `float32`. For reference, I ran Stas Bekman's script for bf16, fp16 or fp32 model pretrain detection.
name | abs min | abs max
---------------------------------------------------|-----------|-----------
yhavinga/t5-base-dutch | 1.757e-09 | 6.792e+01
yhavinga/t5-v1.1-base-dutch-uncased | 1.218e-09 | 6.708e+02
yhavinga/t5-v1.1-base-dutch-cased | 3.009e-09 | 8.821e+02
yhavinga/t5-v1.1-large-dutch-cased | 0.000e+00 | 5.053e+03
yhavinga/t5-v1_1-base-dutch-english-cased | 5.140e-09 | 3.111e+03
yhavinga/t5-v1_1-base-dutch-english-cased-1024 | 9.359e-10 | 1.308e+02
yhavinga/t5-small-24L-dutch-english | 1.577e-09 | 1.276e+02
yhavinga/t5-xl-4L-dutch-english-cased | 3.234e-11 | 3.986e+01
yhavinga/t5-base-36L-dutch-english-cased | 2.409e-10 | 6.104e+01
yhavinga/t5-eff-xl-8l-dutch-english-cased | 5.530e-10 | 8.912e+02
yhavinga/t5-eff-large-8l-dutch-english-cased | 1.086e-10 | 5.128e+02
yhavinga/t5-base-36L-ccmatrix-multi | 1.715e-11 | 3.746e+01
yhavinga/t5-small-24L-ccmatrix-multi | 7.086e-10 | 1.053e+02
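One way to read the table is to compare the extremes with the range of float16. The sketch below is a rough range check written for illustration, not Stas Bekman's script; the helper names and the toy weight list are hypothetical. float16 cannot hold magnitudes above 65504 or nonzero magnitudes below 2^-24 (about 5.96e-08), so weights with an absolute minimum around 1e-09 fall outside its range, while float32 and bfloat16 cover it.

```python
FP16_MAX = 65504.0                 # largest finite float16 value
FP16_MIN_SUBNORMAL = 2.0 ** -24    # smallest positive float16 value, about 5.96e-08


def abs_range(weights):
    """Return the smallest and largest absolute value of a flat list of weights."""
    magnitudes = [abs(w) for w in weights]
    return min(magnitudes), max(magnitudes)


def within_float16_range(abs_min, abs_max):
    """True if neither extreme would overflow or be flushed to zero in float16."""
    no_overflow = abs_max <= FP16_MAX
    no_underflow = abs_min == 0.0 or abs_min >= FP16_MIN_SUBNORMAL
    return no_overflow and no_underflow


# Toy weights whose extremes mimic the yhavinga/t5-base-dutch row.
toy_weights = [1.757e-09, -0.03, 0.5, -67.92]
abs_min, abs_max = abs_range(toy_weights)
print(abs_min, abs_max, within_float16_range(abs_min, abs_max))
```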
The following image shows the loss curves of the sessions in which I was trying to find the right combination of total batch size (by adjusting gradient accumulation), learning rate and datatype. Unfortunately, I again could not find a good setting for bfloat16. The three green runs are the ones that ended up in `t5-base-36L-dutch-english`. Numbers shown are learning rate, dtype and total batch size.
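For readers unfamiliar with the term, the total batch size is the number of examples that contribute to one optimizer update. A minimal sketch of the arithmetic follows; the per-device batch size, device count and accumulation steps in the example are assumed values for illustration, not the settings of any run shown here.

```python
def total_batch_size(per_device_batch_size, num_devices, gradient_accumulation_steps):
    """Examples per optimizer update when gradients are accumulated over several steps."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# Hypothetical split: 8 examples per device, 8 devices, 8 accumulation steps.
print(total_batch_size(per_device_batch_size=8, num_devices=8, gradient_accumulation_steps=8))
```

Raising the accumulation steps increases the total batch size without increasing the memory needed per device.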
Finetuning summarization requires more memory than translation due to the longer sequence lengths involved. I wondered if I could use Adafactor instead of Adam and ran a sweep to test this. The sweep was configured with Hyperband, so not all training runs completed to the end.
The training losses are graphed below:
While the Adafactor run with learning rate 7e-4 came close to the Adam runs, the consistent stability of training with Adam made me stick with Adam as optimizer for evaluation runs on the several models. For translation the results were similar, though in the end I needed to configure a lower learning rate for all models to converge during fine-tuning.
The original T5 paper evaluated by fine-tuning on downstream tasks with a constant learning rate of 0.001. According to the sweep, 0.001 would work nicely with the Adam optimizer for summarization. A single model evaluation consisted of fine-tuning the model, followed by running predictions and metrics calculation on the test split. Fine-tuning for evaluation was done on a limited set of examples from the fine-tuning datasets.
| | Summarization | Translation |
---|---|---|
Dataset | CNN Dailymail NL | CCMatrix en -> nl |
#Samples | 50K | 50K |
Optimizer | Adam | Adam |
learning rate | 0.001 | 0.0005 |
source length | 1024 | 128 |
target length | 142 | 128 |
#eval samples | 1000 | 1000 |
The graph below shows the train loss curves for the summarization runs:
The graph below shows the train loss curves for the translation runs:
The figure below shows the evaluation scores, where the x-axis shows the translation Bleu score (higher is better) and the y-axis the summarization Rouge1 score (higher is better). Point size is proportional to the model size. Models with faster inference are plotted in green, models with slower inference in blue.
While it is clear that the model `t5-base-36L-dutch-english-cased` (with 729M parameters) has the best scores, it is also among the slowest models. The model `t5-eff-large-8l-dutch-english-cased` (with 335M parameters) has the second best training loss after 390 steps in both tasks, but with 4 times faster inference. Surprising is the difference between `t5-v1_1-base-dutch-english-cased` and `t5-v1_1-base-dutch-english-cased-1024`, most notable on the summarization task. This might be due to the difference in pre-training sequence length:
The models `t5-v1_1-base-dutch-english-cased` and `t5-v1_1-base-dutch-english-cased-1024` have the same model dimensions, but are pre-trained on different sequence lengths, 512 and 1024 respectively. The evaluation loss and accuracy of the models do not look too different. Since training of the 1024 sequence length model was very slow and did not converge, I stopped it early. The figure below shows the evaluation loss and accuracy.
The 512 sequence length model was trained for 10 epochs of the `small` nl+en config (186B tokens total) and the 1024 sequence length model for about 2 epochs of the `large` nl+en config (100B tokens total). While I expected both models to perform similarly on downstream tasks, the 1024 sequence length model has better scores for both summarization and translation.
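The token totals can be reproduced from the step counts, batch sizes and sequence lengths listed in the model table further down. This is a back-of-the-envelope sketch that assumes every sequence in every batch is completely filled; the function name is hypothetical, and for the 1024 model the 1520k steps at which training was stopped are used.

```python
def tokens_seen(steps, batch_size, seq_len):
    """Upper bound on tokens processed, assuming fully packed sequences."""
    return steps * batch_size * seq_len


tokens_512 = tokens_seen(steps=2_839_630, batch_size=128, seq_len=512)
tokens_1024 = tokens_seen(steps=1_520_000, batch_size=64, seq_len=1024)

print(f"512 model:  {tokens_512 / 1e9:.0f}B tokens")
print(f"1024 model: {tokens_1024 / 1e9:.0f}B tokens")
```

Both models process the same number of tokens per step (65,536), so the difference in totals comes from the number of steps alone.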
Some final notes:
- The `t5-small` model with 24 layers is not small.
- `bfloat16` is hard to get right. If suspicious of a result, switch back to `float32` first.
- `t5-eff-large-8l-dutch-english-cased` has good aptitude for the translation task and is fast, which makes it a good candidate for serious fine-tuning.
- `t5-xl-4l-dutch-english-cased` is both slow and exhibits bad fine-tuning performance.

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was instrumental in all parts of the training. Weights & Biases made it possible to keep track of many training sessions and orchestrate hyper-parameter sweeps with insightful visualizations.
Created by Yeb Havinga
Three types of T5 models have been trained. `t5-base-dutch` is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, have `gated-gelu` instead of `relu` as activation function, and are trained with a drop-out of `0.0` unless training would diverge (`t5-v1.1-large-dutch-cased`). The T5-eff models are models that differ in their number of layers. The table below lists the dimensions of these models. Not all t5-eff models are efficient, the best example being the inefficient `t5-xl-4L-dutch-english-cased`.
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
---|---|---|---|---|---|---|---|---|---|---|---|
type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
total steps | 527500 | 1014525 | 1210154 | 1120k/2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d 19h | 3d 23h |
optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
warmup | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 5000.0 | 20000.0 | 2500.0 | 1000.0 | 1500.0 | 1500.0 |
eval loss | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
eval acc | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
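The parameter counts in the table can be approximated from the listed dimensions. The sketch below is a rough estimate under several assumptions that are not stated in this post: a vocabulary of 32128 tokens, an equal number of encoder and decoder layers, input and output embeddings that are tied for the original T5 config and separate for t5-v1.1 style configs, and no accounting for layer norms or relative position biases. The function name is hypothetical.

```python
def estimate_t5_params(d_model, d_ff, num_heads, d_kv, num_layers,
                       gated_ff, tied_embeddings, vocab_size=32128):
    """Approximate T5 parameter count; ignores layer norms and position biases."""
    inner_dim = num_heads * d_kv
    attention = 4 * d_model * inner_dim                      # q, k, v and output projections
    feed_forward = (3 if gated_ff else 2) * d_model * d_ff   # a gated FF has an extra matrix
    encoder = num_layers * (attention + feed_forward)
    decoder = num_layers * (2 * attention + feed_forward)    # self- plus cross-attention
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return encoder + decoder + embeddings


configs = {
    "t5-base-dutch": dict(d_model=768, d_ff=3072, num_heads=12, d_kv=64,
                          num_layers=12, gated_ff=False, tied_embeddings=True),
    "t5-v1.1-base-dutch-cased": dict(d_model=768, d_ff=2048, num_heads=12, d_kv=64,
                                     num_layers=12, gated_ff=True, tied_embeddings=False),
    "t5-base-36L-dutch-english-cased": dict(d_model=768, d_ff=2560, num_heads=12, d_kv=64,
                                            num_layers=36, gated_ff=True, tied_embeddings=False),
}

estimates = {name: estimate_t5_params(**cfg) / 1e6 for name, cfg in configs.items()}
for name, millions in estimates.items():
    print(f"{name}: about {millions:.0f}M parameters")
```

Under these assumptions the three estimates land within about 1M of the table values, which also shows why a deep model with a modest `d_model` can end up larger than a wide but shallow one.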
The models `t5-small-24L-dutch-english` and `t5-base-36L-dutch-english` have been fine-tuned for both language directions on the first 25M samples from CCMatrix, giving a total of 50M training samples. Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books. The `_bp` columns list the brevity penalty. The `avg_bleu` score is the bleu score averaged over all three evaluation datasets. The best scores are displayed in bold for both translation directions.
| | t5-base-36L-ccmatrix-multi | t5-base-36L-ccmatrix-multi | t5-small-24L-ccmatrix-multi | t5-small-24L-ccmatrix-multi |
---|---|---|---|---|
source_lang | en | nl | en | nl |
target_lang | nl | en | nl | en |
source_prefix | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
ccmatrix_bleu | 56.8 | 62.8 | 57.4 | 63.1 |
tatoeba_bleu | 46.6 | 52.8 | 46.4 | 51.7 |
opus_books_bleu | 13.5 | 24.9 | 12.9 | 23.4 |
ccmatrix_bp | 0.95 | 0.96 | 0.95 | 0.96 |
tatoeba_bp | 0.97 | 0.94 | 0.98 | 0.94 |
opus_books_bp | 0.8 | 0.94 | 0.77 | 0.89 |
avg_bleu | 38.96 | 46.86 | 38.92 | 46.06 |
max_source_length | 128 | 128 | 128 | 128 |
max_target_length | 128 | 128 | 128 | 128 |
adam_beta1 | 0.9 | 0.9 | 0.9 | 0.9 |
adam_beta2 | 0.997 | 0.997 | 0.997 | 0.997 |
weight_decay | 0.05 | 0.05 | 0.002 | 0.002 |
lr | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
label_smoothing_factor | 0.15 | 0.15 | 0.1 | 0.1 |
train_batch_size | 128 | 128 | 128 | 128 |
warmup_steps | 2000 | 2000 | 2000 | 2000 |
total steps | 390625 | 390625 | 390625 | 390625 |
duration | 4d 5h | 4d 5h | 3d 2h | 3d 2h |
num parameters | 729M | 729M | 250M | 250M |
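To help interpret the `_bp` rows, here is a minimal sketch of the standard BLEU brevity penalty, which is 1 when the system output is at least as long as the reference and `exp(1 - r/c)` otherwise, with `c` the output length and `r` the reference length. The function names are hypothetical, and the evaluation itself may have used a library implementation that differs in details such as how the reference length is chosen.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty for a corpus with the given total lengths."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def length_ratio_for(bp):
    """Invert the penalty: output/reference length ratio that yields bp (for 0 < bp <= 1)."""
    return 1.0 / (1.0 - math.log(bp))


print(brevity_penalty(candidate_len=80, reference_len=100))
print(length_ratio_for(0.8))
```

Read this way, a brevity penalty of 0.8 corresponds to output that is roughly 18% shorter than the reference, while 0.95 corresponds to output about 5% shorter.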