Results validation with other benchmarks?

#1
by VlSav - opened

Hello!
Nice work! Very interesting results!
Did you try to validate with other benchmarks? I tried to check with MMLU (lm-eval-harness), and it looks like the MMLU results degrade a bit compared with the original suzume_multilingual. Wondering whether the MT-Bench score is the preferable metric here...
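
Roughly the kind of run I mean (a minimal sketch using the lm-evaluation-harness v0.4-style Python API; the model name, few-shot count and batch size are just illustrative, not necessarily the exact setup):

```python
# Sketch of an MMLU run with lm-evaluation-harness (v0.4-style API).
# Model name, few-shot count and batch size are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=lightblue/suzume-llama-3-8B-multilingual,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)

# Per-task and aggregated accuracies live under results["results"].
print(results["results"])
```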

Lightblue KK. org

Yeah, as I found in the paper, the Belebele scores drop when doing ORPO training while the MT-Bench scores increase. I think this is because they measure different things - while MT-Bench measures the chat ability of the generated output, Belebele and MMLU measure the logit scores of the "correct" answer. So I think this ORPO-trained model will be better at chatting, but worse at logit-based knowledge-testing tasks. We found in the paper that lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 did better at Belebele than the base model, so it might also be better at MMLU?
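
To make "logit-based" concrete, this is roughly how multiple-choice tasks like MMLU or Belebele get scored: rank the candidate answers by their log-likelihood under the model rather than generating free text. A simplified sketch (the model and question are just examples; real harnesses also handle few-shot prompts, answer normalisation, etc.):

```python
# Score each candidate answer by its log-likelihood and take the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lightblue/suzume-llama-3-8B-multilingual"  # example model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

question = "Question: What is the capital of France?\nAnswer:"
options = [" Paris", " London", " Berlin", " Madrid"]

def option_logprob(prompt: str, option: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    option_ids = tok(option, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the (shifted) logits predicts token i+1 of the input.
    logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    start = prompt_ids.shape[1] - 1
    return sum(
        logprobs[0, start + i, option_ids[0, i]].item()
        for i in range(option_ids.shape[1])
    )

scores = {opt: option_logprob(question, opt) for opt in options}
print(max(scores, key=scores.get), scores)
```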

Thanks. Yes, you are right, I've checked https://huggingface.co/lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 and the MMLU results are better for it, as well as other logit-based benchmarks. BTW, when you did the MT-Bench scoring, did you have any kind of length control? As mentioned in some papers (e.g. https://arxiv.org/html/2404.04475v1), OpenAI's GPT judges typically prefer lengthy answers, so maybe that is also the case with ORPO-trained models?

Yes, there is a preference for long answers. And in this version of the model, the answers are just huge. In fact, the training dataset should have long examples among both the positive and the negative answers, otherwise the model will learn that it should just write a long answer. Hence, you need to carefully validate the training dataset, both in terms of answer lengths and in terms of which examples are accepted and rejected.

Lightblue KK. org

Hey, yeah, I agree that that is something I need to work on for the next iteration of this model. If you just say "Hi" to the model, it lists this loooong answer about how it is here to help and how useful it will be. Ironically, not very helpful haha.

The idea of training using long negatives is a good one - I have not checked whether the positives are substantially longer than the negatives, but I would wager they are.
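
Something like this would be a quick way to check (a rough sketch - the dataset name and column layout are placeholders for whatever the actual ORPO training data uses):

```python
# Compare chosen vs. rejected answer lengths in a preference dataset.
# Dataset name and column layout are placeholders.
from datasets import load_dataset

ds = load_dataset("your-org/your-orpo-dataset", split="train")  # placeholder name

def text_of(field):
    # Handle both plain strings and chat-format message lists.
    if isinstance(field, list):
        return " ".join(m["content"] for m in field)
    return field

chosen_lens = [len(text_of(ex["chosen"])) for ex in ds]
rejected_lens = [len(text_of(ex["rejected"])) for ex in ds]

print(f"mean chosen length:   {sum(chosen_lens) / len(chosen_lens):.0f} chars")
print(f"mean rejected length: {sum(rejected_lens) / len(rejected_lens):.0f} chars")
print(f"ratio: {sum(chosen_lens) / sum(rejected_lens):.2f}")
```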

However, I think I will probably focus on training using a method like SimPO (https://arxiv.org/pdf/2405.14734), as it length-normalises the reward by construction, which would (I think) mean that I could use answers of any length for both positives and negatives.
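
For reference, the core of the SimPO objective from that paper looks roughly like this (a minimal sketch, not a drop-in training implementation; the hyperparameter values and tensor names are illustrative):

```python
# SimPO loss: rewards are length-normalised average log-probabilities, so a longer
# answer is not rewarded just for being longer.
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=1.0):
    """chosen_logps / rejected_logps: summed token log-probs per sequence, shape (batch,).
    chosen_lens / rejected_lens: number of response tokens per sequence, shape (batch,)."""
    # Length-normalised implicit rewards.
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Bradley-Terry style loss with a target reward margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Toy usage with dummy numbers.
loss = simpo_loss(
    chosen_logps=torch.tensor([-40.0, -55.0]),
    rejected_logps=torch.tensor([-80.0, -70.0]),
    chosen_lens=torch.tensor([20.0, 25.0]),
    rejected_lens=torch.tensor([40.0, 30.0]),
)
print(loss)
```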

On another benchmark I ran, the results show roughly a 1.5-fold increase in response length. Quality is compared with RLHFlow/LLaMA3-iterative-DPO-final and IlyaGusev/saiga_llama3_8b - this card describes the benchmark and training data.
[attached screenshot: photo_2024-06-03_12-01-46.jpg]

VlSav changed discussion status to closed
