HEAD_TEXT = """
This is the official leaderboard for 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. 

Please refer to 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper](https://arxiv.org/abs/2408.03281) for experimental analysis.

🚀 **_Latest News_** 
* [2024.8.6] We released the first version of the StructEval leaderboard, which includes 22 open-source language models; more datasets and models are coming soon🔥🔥🔥.

* [2024.7.31] We regenerated the StructEval benchmark from the latest [Wikipedia](https://www.wikipedia.org/) pages (20240601) using the [GPT-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) model, which helps minimize the impact of data contamination🔥🔥🔥.
"""

ABOUT_TEXT = """# What is StructEval?
Evaluation serves as the baton guiding the development of large language models.
Current evaluations typically employ *a single-item assessment paradigm* for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions.
To this end, we propose a novel evaluation framework referred to as ***StructEval***. 
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.
Experiments demonstrate that **StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases**, thereby providing more reliable and consistent conclusions regarding model capabilities. 
Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

# How to evaluate?
Our 🐱[repo](https://github.com/c-box/StructEval) provides easy-to-use scripts both for evaluating LLMs on existing StructEval benchmarks and for generating new benchmarks based on the StructEval framework.
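
As a rough illustration only (not the repository's actual scripts), 0-shot evaluation of a chat model on a StructEval-style multiple-choice item could look like the sketch below; the model name and the item fields (`question`, `options`) are hypothetical placeholders.

```python
# Illustrative sketch only -- use the repository's scripts for real runs.
# The model name and item fields ("question", "options") are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B-Instruct"  # any chat model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def predict_choice(item):
    # Build a 0-shot multiple-choice prompt from the item.
    letters = "ABCD"
    options = "\n".join(f"{l}. {o}" for l, o in zip(letters, item["options"]))
    prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    reply = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
    # The prediction is the first option letter that appears in the reply.
    return next((letter for letter in reply if letter in letters), None)
```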

# Contact
If you have any questions, feel free to reach out to us at [boxi2020@iscas.ac.cn](mailto:boxi2020@iscas.ac.cn).
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"

CITATION_BUTTON_TEXT = r"""
Coming soon.
"""

ACKNOWLEDGEMENT_TEXT = """
Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
"""


NOTES_TEXT = """
* Base benchmarks refer to the original datasets, while struct benchmarks refer to the benchmarks constructed by StructEval using these base benchmarks as seed data.
* For most models on base MMLU, we collected the results from their official technical reports. For models whose results have not been reported, we use [OpenCompass](https://opencompass.org.cn/home) for evaluation.
* For the other 2 base benchmarks and all 3 structured benchmarks: chat models are evaluated under the 0-shot setting, while completion models are evaluated under the 0-shot setting with perplexity (ppl) scoring, as sketched below. The prompt format is kept consistent across all benchmarks.
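
For completion (base) models, "0-shot with ppl" scoring means scoring each answer option by the model's mean negative log-likelihood over the option tokens and picking the most likely one. A minimal sketch, assuming an arbitrary example model and any causal LM from 🤗 Transformers (not the leaderboard's actual evaluation code):

```python
# Minimal sketch of ppl-based option scoring for completion (base) models.
# Not the leaderboard's evaluation code; the model name is an arbitrary example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B"  # any completion model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def pick_option(question, options):
    losses = []
    for option in options:
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        # Mask the question tokens so the loss covers only the option tokens.
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100
        loss = model(full_ids, labels=labels).loss  # mean NLL over option tokens
        losses.append(loss.item())
    # Lower mean NLL (lower perplexity) means the model prefers that option.
    return min(range(len(options)), key=losses.__getitem__)
```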
"""