StructEval_leaderboard / text_content.py
HEAD_TEXT = """
This is the official leaderboard for the 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, thereby offering a comprehensive, robust, and consistent evaluation of LLMs.
Please refer to 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper]() for experimental analysis.
🚀 **_Latest News_**
* [2024.8.2] We released the first version of the StructEval leaderboard, which includes 21 open-source language models; more datasets and models are coming soon🔥🔥🔥.
* [2024.7.31] We regenerated the StructEval benchmark from the latest [Wikipedia](https://www.wikipedia.org/) pages (2024-06-01 snapshot) using the [GPT-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) model, which helps minimize the impact of data contamination🔥🔥🔥.
"""
ABOUT_TEXT = """# What is StructEval?
Evaluation serves as the baton that directs the development of large language models.
Current evaluations typically employ *a single-item assessment paradigm* for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions.
To this end, we propose a novel evaluation framework referred to as ***StructEval***.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, thereby offering a comprehensive, robust, and consistent evaluation of LLMs.
Experiments demonstrate that **StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases**, thereby providing more reliable and consistent conclusions regarding model capabilities.
Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
# How to evaluate?
Our 🐱[repo](https://github.com/c-box/StructEval) provides easy-to-use scripts both for evaluating LLMs on the existing StructEval benchmarks and for generating new benchmarks with the StructEval framework.
# Contact
If you have any questions, feel free to reach out to us at [boxi2020@iscas.ac.cn](mailto:boxi2020@iscas.ac.cn).
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
Coming soon.
"""
ACKNOWLEDGEMENT_TEXT = """
Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
"""
NOTES_TEXT = """
* For most models, the base MMLU results are taken from their official technical reports. For models without reported results, we evaluate them with OpenCompass.
* For the other two base benchmarks and all three structured benchmarks: chat models are evaluated in the 0-shot setting, and completion models are evaluated in the 0-shot setting with perplexity (ppl) scoring; a minimal illustration of ppl-based scoring is sketched below.
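
The snippet below is a minimal, illustrative sketch of ppl-based 0-shot multiple-choice scoring, not the official StructEval evaluation script (see the 🐱[repo](https://github.com/c-box/StructEval) for the actual scripts); the model name, question, and options are placeholders.

```python
# Minimal sketch (not the official StructEval script): score each option by the
# average token loss (log-perplexity) of the option conditioned on the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the completion model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_loss(question: str, option: str) -> float:
    # Average cross-entropy over the option tokens only (lower = more likely).
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the question tokens
    with torch.no_grad():
        return model(input_ids=full_ids, labels=labels).loss.item()

question = "Which planet is known as the Red Planet?"  # placeholder item
options = ["Mars", "Venus", "Jupiter", "Saturn"]
prediction = min(options, key=lambda o: option_loss(question, o))
print(prediction)  # the option with the lowest loss is taken as the answer
```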
"""