StructEval_leaderboard / text_content.py
HEAD_TEXT = """
This is the official leaderboard for the 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, thereby offering a comprehensive, robust, and consistent evaluation of LLMs.
Please refer to 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper]() for experimental analysis.
🚀 **_Latest News_**
* [2024.8.2] We released the first version of the StructEval leaderboard, which includes 21 open-source language models; more datasets and models are coming soon🔥🔥🔥.
* [2024.7.31] We regenerated the StructEval benchmark from the latest [Wikipedia](https://www.wikipedia.org/) pages (2024-06-01 snapshot) using the [GPT-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) model, which helps minimize the impact of data contamination🔥🔥🔥.
"""
ABOUT_TEXT = """# What is StructEval?
Evaluation serves as the baton that directs the development of large language models.
Current evaluations typically employ *a single-item assessment paradigm* for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions.
To this end, we propose a novel evaluation framework referred to as ***StructEval***.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, thereby offering a comprehensive, robust, and consistent evaluation of LLMs.
Experiments demonstrate that **StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases**, thereby providing more reliable and consistent conclusions regarding model capabilities.
Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
# How to evaluate?
Our 🐱[repo](https://github.com/c-box/StructEval) provides easy-to-use scripts both for evaluating LLMs on the existing StructEval benchmarks and for generating new benchmarks with the StructEval framework.
# Contact
If you have any questions, feel free to reach out to us at [boxi2020@iscas.ac.cn](mailto:boxi2020@iscas.ac.cn).
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
Coming soon.
"""
ACKNOWLEDGEMENT_TEXT = """
Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
"""
NOTES_TEXT = """
* For most models, the base MMLU results are taken from their official technical reports. For models without reported results, we evaluate them with OpenCompass.
* For the other two base benchmarks and all three structured benchmarks: chat models are evaluated in the 0-shot setting, and completion models are evaluated in the 0-shot setting with perplexity (ppl) scoring; a minimal illustration of ppl-based scoring is sketched below.
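
The snippet below is a minimal, illustrative sketch of ppl-based 0-shot multiple-choice scoring, not the official StructEval evaluation script (see the 🐱[repo](https://github.com/c-box/StructEval) for the actual scripts); the model name, question, and options are placeholders.

```python
# Minimal sketch (not the official StructEval script): score each option by the
# average token loss (log-perplexity) of the option conditioned on the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the completion model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_loss(question: str, option: str) -> float:
    # Average cross-entropy over the option tokens only (lower = more likely).
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the question tokens
    with torch.no_grad():
        return model(input_ids=full_ids, labels=labels).loss.item()

question = "Which planet is known as the Red Planet?"  # placeholder item
options = ["Mars", "Venus", "Jupiter", "Saturn"]
prediction = min(options, key=lambda o: option_loss(question, o))
print(prediction)  # the option with the lowest loss is taken as the answer
```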
"""