arxiv:2409.15268

Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Published on Sep 23 · Submitted by penfever on Sep 24
Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
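
As a concrete illustration of the correlation question the abstract poses, the sketch below (not taken from the paper's codebase; all model names and scores are made-up placeholders) compares a set of models' LLM-judge win rates with their aggregate ground-truth scores using Spearman rank correlation:

```python
# Illustrative sketch (not from the paper's codebase): does a judge's model
# ranking track an aggregate ground-truth score? All names and numbers are
# made-up placeholders.
from scipy.stats import spearmanr

judge_win_rate = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.71, "model_d": 0.48}
ground_truth = {"model_a": 0.41, "model_b": 0.57, "model_c": 0.39, "model_d": 0.60}

models = sorted(judge_win_rate)
rho, p_value = spearmanr(
    [judge_win_rate[m] for m in models],
    [ground_truth[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near zero (or negative) would mean the judge's preferences do not
# translate into progress on the concrete metrics.
```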

Community

Paper author and submitter:

With new LLMs like OpenAI o1 and Qwen 2.5 releasing almost every week, robust benchmarks we can run locally are essential. LLM-judge benchmarks such as AlpacaEval, MT-Bench, and Arena-Hard-Auto are the most widely used, but they carry hidden biases. In principle, LLM judges are supposed to be impartial; in practice, they weight some judgment criteria much more heavily than others. In particular, they pay more attention to stylistic cues (such as a friendly tone) than to correctness and safety. We call this behavior stylistic reward hacking.
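
One simple way to probe for this kind of stylistic reward hacking, sketched below under assumptions of our own (the feature set and data format are illustrative, not the paper's protocol), is to check how well a judge's pairwise verdicts can be predicted from crude style features alone:

```python
# Hypothetical probe for stylistic reward hacking: fit a classifier that
# predicts the judge's pairwise verdict from style features alone. Large
# coefficients suggest style, not substance, drives the verdict.
import numpy as np
from sklearn.linear_model import LogisticRegression

FRIENDLY = ("happy to help", "great question", "certainly")

def style_features(resp_a: str, resp_b: str) -> list[float]:
    """Crude style differences between response A and response B."""
    len_diff = len(resp_a.split()) - len(resp_b.split())
    tone_diff = sum(p in resp_a.lower() for p in FRIENDLY) - sum(
        p in resp_b.lower() for p in FRIENDLY
    )
    return [float(len_diff), float(tone_diff)]

def fit_style_probe(pairs: list[tuple[str, str]], judge_picked_a: list[int]):
    """pairs: (response_a, response_b); judge_picked_a: 1 if the judge chose A."""
    X = np.array([style_features(a, b) for a, b in pairs])
    y = np.array(judge_picked_a)
    probe = LogisticRegression().fit(X, y)
    return dict(zip(["length_diff", "tone_diff"], probe.coef_[0]))
```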

To counteract it, we introduce SOS-Bench, a new meta-benchmark. It is two orders of magnitude larger than existing LLM-judge benchmarks and includes ground-truth measures of helpfulness, harmlessness, and honesty. Evaluating over 30 fine-tunes of Llama-3-8B and Mistral-7B on SOS-Bench reveals that more is more in alignment: data scaling in the SFT stage, rather than any particular collection method, is the best predictor of improved alignment.
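
The sketch below shows the general shape of such a meta-benchmark aggregation, averaging normalized per-benchmark scores within helpfulness/harmlessness/honesty axes; the benchmark names and grouping are placeholders, and the actual composition of SOS-Bench is defined in the linked repository:

```python
# Sketch of a meta-benchmark aggregation in the spirit of SOS-Bench. The
# benchmark names and grouping are placeholders -- the real composition is
# defined in https://github.com/penfever/sos-bench.
from statistics import mean

AXES = {
    "helpfulness": ["ifeval", "mmlu"],    # instruction following, world knowledge
    "harmlessness": ["safety_suite"],     # safety / refusal probes
    "honesty": ["truthfulqa"],            # factuality
}

def aggregate(per_benchmark: dict[str, float]) -> dict[str, float]:
    """per_benchmark: scores already normalized to [0, 1]."""
    scores = {axis: mean(per_benchmark[b] for b in benches) for axis, benches in AXES.items()}
    scores["overall"] = mean(scores.values())
    return scores

print(aggregate({"ifeval": 0.55, "mmlu": 0.63, "safety_suite": 0.71, "truthfulqa": 0.42}))
```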



Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 1