Dataset Example

#1
by Sercan - opened

Hello, what kind of dataset did you use? Could you give some information about the training?

@Sercan We generated the dataset.
We capitalised everything, removed all punctuation, and then used the transformed data as the source and the original data as the target.
As simple as that :))
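For illustration, a minimal sketch of that transformation, assuming one original sentence per line in a plain-text file (the file names are hypothetical):

```python
import csv
import string

def make_source(text: str) -> str:
    """Uppercase the text and strip punctuation, as described above."""
    # Note: str.upper() is not locale-aware; for Turkish, 'i' maps to 'I'
    # rather than 'İ', so a dedicated casing step may be needed.
    return text.upper().translate(str.maketrans("", "", string.punctuation))

# Hypothetical file names: one original sentence per line in, CSV pairs out.
with open("sentences.txt", encoding="utf-8") as src, \
        open("train.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["input", "target"])
    for line in src:
        target = line.strip()
        if target:
            writer.writerow([make_source(target), target])
```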

But I have a dataset like this too, 40 thousand examples in total. However, I can't get very good results for Turkish. On long texts, for example a 300-character sentence, the model always writes the same words.

What pretrained model are you using? @Sercan

https://huggingface.co/google/mt5-small and https://huggingface.co/google/mt5-base, but I have problems with long texts on both models. The model corrects one part of the sentence and then starts writing the same words over and over.

In my humble experience, here are a few things that you might consider:

mT5 is a bit hard to train and still struggles a lot with grammar. I would recommend getting much more training data if possible. I prefer mt5-base over mt5-small.

This could be due to your decoding strategy; I don't know exactly what you use. Beam search? How many beams? Typical-p sampling? Temperature? I would recommend beam search here for correctness and less hallucination.
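As a sketch of what that looks like with transformers' generate() (the parameter values are illustrative, not settings from this thread), beam search plus an n-gram repetition block often curbs exactly the looping described above:

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

text = "BU UZUN CUMLE ICIN NOKTALAMA GEREKIYOR"  # illustrative input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Deterministic beam search; no_repeat_ngram_size directly targets the
# "same words over and over" failure mode on long outputs.
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    max_length=512,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```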

Also, how do you prompt? How did you design your prefix and special chars?
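For example, a common T5-style convention is a short task prefix on every input, used identically at training and inference time. The prefix below is hypothetical, not the one this model was trained with:

```python
# Hypothetical task prefix; pick one string and keep it consistent
# between training and inference.
PREFIX = "punctuate: "

def build_input(raw: str) -> str:
    return PREFIX + raw.strip()

print(build_input("MERHABA BU BIR TEST"))
# -> punctuate: MERHABA BU BIR TEST
```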

We did the training with happytransformer, but I think we made a mistake there.
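For reference, this is roughly the shape of a happytransformer text-to-text run, assuming its documented fine-tuning interface (a CSV with "input" and "target" columns, as generated in the sketch above); argument names can vary between versions, so check the docs for your install:

```python
from happytransformer import HappyTextToText, TTSettings, TTTrainArgs

# Known-good form from happytransformer's text-to-text examples; whether
# mT5 checkpoints load the same way depends on your library version.
happy_tt = HappyTextToText("T5", "t5-base")

# train.csv: "input","target" columns, as produced by the dataset sketch.
happy_tt.train("train.csv", args=TTTrainArgs(num_train_epochs=1))

# Beam-search decoding at inference time, per the advice above.
settings = TTSettings(num_beams=5, max_length=128)
result = happy_tt.generate_text("punctuate: MERHABA BU BIR TEST", args=settings)
print(result.text)
```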

flexudy changed discussion status to closed
