arxiv:2409.15268

Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Published on Sep 23 · Submitted by penfever on Sep 24
Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.
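
As a concrete illustration of the correlation question the abstract poses, the sketch below (not taken from the paper's codebase; all model names and scores are made-up placeholders) compares a set of models' LLM-judge win rates with their aggregate ground-truth scores using Spearman rank correlation:

```python
# Illustrative sketch (not from the paper's codebase): does a judge's model
# ranking track an aggregate ground-truth score? All names and numbers are
# made-up placeholders.
from scipy.stats import spearmanr

judge_win_rate = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.71, "model_d": 0.48}
ground_truth = {"model_a": 0.41, "model_b": 0.57, "model_c": 0.39, "model_d": 0.60}

models = sorted(judge_win_rate)
rho, p_value = spearmanr(
    [judge_win_rate[m] for m in models],
    [ground_truth[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near zero (or negative) would mean the judge's preferences do not
# translate into progress on the concrete metrics.
```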

Community

Paper author and submitter:

With new LLMs like OpenAI o1 and Qwen 2.5 releasing almost every week, robust benchmarks we can run locally are essential. LLM-judge benchmarks such as AlpacaEval, MT-Bench, and Arena-Hard-Auto are the most widely used, but they carry hidden biases. In principle, LLM judges are supposed to be impartial; in practice, they weight some judgment criteria much more heavily than others. In particular, they pay more attention to stylistic cues (such as a friendly tone) than to correctness and safety. We call this behavior stylistic reward hacking.
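
One simple way to probe for this kind of stylistic reward hacking, sketched below under assumptions of our own (the feature set and data format are illustrative, not the paper's protocol), is to check how well a judge's pairwise verdicts can be predicted from crude style features alone:

```python
# Hypothetical probe for stylistic reward hacking: fit a classifier that
# predicts the judge's pairwise verdict from style features alone. Large
# coefficients suggest style, not substance, drives the verdict.
import numpy as np
from sklearn.linear_model import LogisticRegression

FRIENDLY = ("happy to help", "great question", "certainly")

def style_features(resp_a: str, resp_b: str) -> list[float]:
    """Crude style differences between response A and response B."""
    len_diff = len(resp_a.split()) - len(resp_b.split())
    tone_diff = sum(p in resp_a.lower() for p in FRIENDLY) - sum(
        p in resp_b.lower() for p in FRIENDLY
    )
    return [float(len_diff), float(tone_diff)]

def fit_style_probe(pairs: list[tuple[str, str]], judge_picked_a: list[int]):
    """pairs: (response_a, response_b); judge_picked_a: 1 if the judge chose A."""
    X = np.array([style_features(a, b) for a, b in pairs])
    y = np.array(judge_picked_a)
    probe = LogisticRegression().fit(X, y)
    return dict(zip(["length_diff", "tone_diff"], probe.coef_[0]))
```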

To counteract it, we introduce SOS-Bench, a new meta-benchmark. It is two orders of magnitude larger than existing LLM-judge benchmarks and includes ground-truth measures of helpfulness, harmlessness, and honesty. Evaluating over 30 fine-tunes of Llama-3-8B and Mistral-7B on SOS-Bench reveals that more is more in alignment: data scaling in the SFT stage, rather than any particular collection method, is the best predictor of improved alignment.
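
The sketch below shows the general shape of such a meta-benchmark aggregation, averaging normalized per-benchmark scores within helpfulness/harmlessness/honesty axes; the benchmark names and grouping are placeholders, and the actual composition of SOS-Bench is defined in the linked repository:

```python
# Sketch of a meta-benchmark aggregation in the spirit of SOS-Bench. The
# benchmark names and grouping are placeholders -- the real composition is
# defined in https://github.com/penfever/sos-bench.
from statistics import mean

AXES = {
    "helpfulness": ["ifeval", "mmlu"],    # instruction following, world knowledge
    "harmlessness": ["safety_suite"],     # safety / refusal probes
    "honesty": ["truthfulqa"],            # factuality
}

def aggregate(per_benchmark: dict[str, float]) -> dict[str, float]:
    """per_benchmark: scores already normalized to [0, 1]."""
    scores = {axis: mean(per_benchmark[b] for b in benches) for axis, benches in AXES.items()}
    scores["overall"] = mean(scores.values())
    return scores

print(aggregate({"ifeval": 0.55, "mmlu": 0.63, "safety_suite": 0.71, "truthfulqa": 0.42}))
```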



Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 1