Dataset Example

#1
by Sercan - opened

Hello, what kind of dataset did you use? Could you give some information about the training?

@Sercan We generated the dataset.
We capitalised everything, removed all punctuation, and then used the transformed data as the source and the original data as the target.
As simple as that :))
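For illustration, a minimal sketch of that transformation, assuming one original sentence per line in a plain-text file (the file names are hypothetical):

```python
import csv
import string

def make_source(text: str) -> str:
    """Uppercase the text and strip punctuation, as described above."""
    # Note: str.upper() is not locale-aware; for Turkish, 'i' maps to 'I'
    # rather than 'İ', so a dedicated casing step may be needed.
    return text.upper().translate(str.maketrans("", "", string.punctuation))

# Hypothetical file names: one original sentence per line in, CSV pairs out.
with open("sentences.txt", encoding="utf-8") as src, \
        open("train.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["input", "target"])
    for line in src:
        target = line.strip()
        if target:
            writer.writerow([make_source(target), target])
```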

But I have a dataset like this too, 40 thousand examples in total. However, I can't get very good results for Turkish. On long texts, for example a 300-character sentence, the model always writes the same words.

What pretrained model are you using? @Sercan

https://huggingface.co/google/mt5-small and https://huggingface.co/google/mt5-base, but I have problems with long texts on both models. The model corrects one part of the sentence and then starts writing the same words over and over.

In my humble experience, here are a few things that you might consider:

mT5 is a bit hard to train and still struggles a lot with grammar. I would recommend getting much more training data if possible. I prefer mt5-base over mt5-small.

This could be due to your decoding strategy; I don't know exactly what you use. Beam search? How many beams? Typical-p sampling? Temperature? I would recommend beam search here for correctness and less hallucination.
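As a sketch of what that looks like with transformers' generate() (the parameter values are illustrative, not settings from this thread), beam search plus an n-gram repetition block often curbs exactly the looping described above:

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

text = "BU UZUN CUMLE ICIN NOKTALAMA GEREKIYOR"  # illustrative input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Deterministic beam search; no_repeat_ngram_size directly targets the
# "same words over and over" failure mode on long outputs.
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    max_length=512,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```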

Also, how do you prompt? How did you design your prefix and special chars?
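For example, a common T5-style convention is a short task prefix on every input, used identically at training and inference time. The prefix below is hypothetical, not the one this model was trained with:

```python
# Hypothetical task prefix; pick one string and keep it consistent
# between training and inference.
PREFIX = "punctuate: "

def build_input(raw: str) -> str:
    return PREFIX + raw.strip()

print(build_input("MERHABA BU BIR TEST"))
# -> punctuate: MERHABA BU BIR TEST
```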

We did the training with happytransformer, but I think we made a mistake there.
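For reference, this is roughly the shape of a happytransformer text-to-text run, assuming its documented fine-tuning interface (a CSV with "input" and "target" columns, as generated in the sketch above); argument names can vary between versions, so check the docs for your install:

```python
from happytransformer import HappyTextToText, TTSettings, TTTrainArgs

# Known-good form from happytransformer's text-to-text examples; whether
# mT5 checkpoints load the same way depends on your library version.
happy_tt = HappyTextToText("T5", "t5-base")

# train.csv: "input","target" columns, as produced by the dataset sketch.
happy_tt.train("train.csv", args=TTTrainArgs(num_train_epochs=1))

# Beam-search decoding at inference time, per the advice above.
settings = TTSettings(num_beams=5, max_length=128)
result = happy_tt.generate_text("punctuate: MERHABA BU BIR TEST", args=settings)
print(result.text)
```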

flexudy changed discussion status to closed
