Small text updates.
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-title: Pre-training Dutch T5 Models, evaluation and model lists
+title: Pre-training Dutch T5 and UL2 Models, evaluation and model lists
 emoji: π
 colorFrom: blue
 colorTo: pink
app.py
CHANGED
@@ -320,10 +320,11 @@ mT5 green and the other models black.
 * For the translation task from English to Dutch, the Dutch+English pre-trained models perform well. Also
   `UL2 Dutch` pre-trained Dutch models are consistently better than their `Flan`, `T5 Dutch` and
   `mT5` counterparts of comparable size.
-* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the fixed
+* Fine-tuning of `t5-v1.1-large-dutch-cased` failed with the hyperparameters that were fixed to the same value for the
+  evaluation of every model.
   Since the `UL2` models are better across the board, I've disabled this model on the hub.
 * The `long-t5` models show bad performance on both tasks.
-  I cannot explain this the translation task. With a sequence length of 128 input and output
+  I cannot explain this, especially for the translation task. With a sequence length of 128 input and output
   tokens, the sliding attention window with radius length 127 of the `long-t5` models should be able to handle this.
   I've retried the fine-tuning of these models with
   `float32` instead of `bfloat16`, but the results were the same. Maybe this is normal behaviour for these models
@@ -388,10 +389,11 @@ mT5 green and the other models black.
 """## Miscellaneous remarks

 * Use loss regularization when training with `bfloat16` for better results (more info below).
-* Be cautious of the dropout rate in the config.json file
+* Be cautious of the dropout rate in the config.json file, as besides the learning rate it is probably the most important
+  hyperparameter.
+  If you are evaluating different pre-trained models, be sure to fine-tune with dropout set equal.
   Check in a model's `config.json` what the dropout rate has been set to. Unless you
   intend to run many epochs on the same data, it's worth trying a training run without dropout.
-  If you want to compare losses, be sure to set the dropout rate equal.
   The smaller models can probably always be trained without.
 * Training with more layers is much slower than you'd expect from the increased model size.
   It is also more difficult to get batch size and learning rate right. Below is a section
@@ -628,7 +630,7 @@ I am grateful to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors f
 definitions, and to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for his support in getting me started with the T5X framework.
 Lastly, I want to express my gratitude to Google for their openness and generosity in releasing T5X and related repositories.

-Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
+Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/).
 Some of the sentences were reworded by ChatGPT.
 """
 )
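The dropout remark in the changed text above can be sketched with the Hugging Face `transformers` config API. This is a minimal illustration, not the author's training code; it only shows that T5-style configs store the rate in the `dropout_rate` field and how to pin it:

```python
from transformers import T5Config

# T5-style configs store dropout in the `dropout_rate` field (0.1 by default).
default_config = T5Config()
print(default_config.dropout_rate)  # 0.1

# For a fair loss comparison between pre-trained models, pin dropout to the
# same value before fine-tuning -- e.g. disable it for a short run on fresh data:
no_dropout = T5Config(dropout_rate=0.0)
print(no_dropout.dropout_rate)  # 0.0
```

For an existing checkpoint the same override can be passed when loading, e.g. `T5Config.from_pretrained(model_id, dropout_rate=0.0)`, instead of editing `config.json` by hand.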