arxiv:2409.06820

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Published on Sep 10

· Submitted by

IlyaGusev on Sep 12

#1 Paper of the day

Authors:

Ilya Gusev

Abstract

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.

View arXiv page View PDF Add to collection

Community

Paper author Paper submitter 8 days ago

Hey!

Here is the benchmark page: https://ilyagusev.github.io/ping_pong_bench/en_v2
And the GitHub repo: https://github.com/IlyaGusev/ping_pong_bench/
I hope the benchmark will be helpful to both RP model developers and users.

7 days ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2409.06820 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2409.06820 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2409.06820 in a Space README.md to link it from this page.

Collections including this paper 4