
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

arXiv: http://arxiv.org/abs/2403.12316v1 (2024-03-18)

Abstract

The rapid development of Chinese large language models (LLMs) poses significant challenges for efficient LLM evaluation. While current initiatives have introduced new benchmarks or evaluation platforms for assessing Chinese LLMs, many of these focus primarily on capabilities, usually overlooking potential alignment and safety issues. To address this gap, we introduce OpenEval, an evaluation testbed that benchmarks Chinese LLMs across capability, alignment, and safety. For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs along 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning, and mathematical reasoning. For alignment assessment, OpenEval contains 7 datasets that examine bias, offensiveness, and illegality in the outputs of Chinese LLMs. To evaluate safety, especially anticipated risks (e.g., power-seeking, self-awareness) of advanced LLMs, we include 6 datasets. In addition to these benchmarks, we have implemented a phased public evaluation and benchmark update strategy to ensure that OpenEval keeps pace with the development of Chinese LLMs and can even provide cutting-edge benchmark datasets to guide that development. In our first public evaluation, we tested a range of Chinese LLMs spanning from 7B to 72B parameters, including both open-source and proprietary models. Evaluation results indicate that while Chinese LLMs have shown impressive performance on certain tasks, more attention should be directed towards broader aspects such as commonsense reasoning, alignment, and safety.
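
To make the testbed's structure concrete, the sketch below encodes the three evaluation dimensions, their sub-dimensions, and the dataset counts exactly as stated in the abstract. It is a minimal illustration in Python, not the official OpenEval implementation or API; the Dimension class, the OPENEVAL_DIMENSIONS constant, and summarize are hypothetical names introduced here for clarity.

```python
# Illustrative sketch only: NOT the official OpenEval code or API.
# Dimensions, sub-dimensions, and dataset counts come from the abstract;
# all identifiers below are assumptions made for this example.

from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str                        # evaluation dimension from the paper
    sub_dimensions: tuple[str, ...]  # sub-areas covered within this dimension
    num_datasets: int                # number of benchmark datasets (from the abstract)


OPENEVAL_DIMENSIONS = (
    Dimension(
        name="capability",
        sub_dimensions=("NLP tasks", "disciplinary knowledge",
                        "commonsense reasoning", "mathematical reasoning"),
        num_datasets=12,
    ),
    Dimension(
        name="alignment",
        sub_dimensions=("bias", "offensiveness", "illegality"),
        num_datasets=7,
    ),
    Dimension(
        name="safety",
        # anticipated risks of advanced LLMs, e.g. power-seeking, self-awareness
        sub_dimensions=("power-seeking", "self-awareness"),
        num_datasets=6,
    ),
)


def summarize(dimensions: tuple[Dimension, ...]) -> None:
    """Print an overview of the evaluation suite."""
    for d in dimensions:
        print(f"{d.name}: {d.num_datasets} datasets ({', '.join(d.sub_dimensions)})")
    print(f"total: {sum(d.num_datasets for d in dimensions)} datasets")


if __name__ == "__main__":
    summarize(OPENEVAL_DIMENSIONS)
```

Running the sketch simply prints the 12/7/6 split across capability, alignment, and safety, which is the taxonomy the paper's phased public evaluations are organized around.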

Socials

LinkedIn
🚀 Exciting News in the World of Chinese Large Language Models (LLMs) 🚀

The fast-paced advancement of Chinese Large Language Models (LLMs) has brought new challenges to their evaluation. While existing benchmarks primarily focus on capabilities, the critical aspects of alignment and safety are often overlooked. To bridge this gap, we are thrilled to introduce OpenEval, a comprehensive evaluation testbed designed to assess Chinese LLMs across capability, alignment, and safety.

🔍 Capability Assessment:
- 12 benchmark datasets covering NLP tasks, disciplinary knowledge, commonsense reasoning, and mathematical reasoning.

📏 Alignment Assessment:
- 7 datasets examining bias, offensiveness, and illegality in the outputs of Chinese LLMs.

🛡️ Safety Evaluation:
- 6 datasets focusing on anticipated risks like power-seeking and self-awareness in advanced LLMs.

Our phased public evaluation and benchmark update strategy ensures that OpenEval stays aligned with the evolving landscape of Chinese LLMs, offering cutting-edge datasets for their development.

In our inaugural public evaluation, we tested various Chinese LLMs ranging from 7B to 72B parameters, including open-source and proprietary models. While these LLMs displayed remarkable performance in specific tasks, there is a clear need to shift focus towards broader aspects such as commonsense reasoning, alignment, and safety.

For more details, check out the research paper at: http://arxiv.org/abs/2403.12316v1

#LLMs #ChineseLLMs #EvaluationTestbed #AIResearch #TechInnovation #OpenEval #NLP #SafetyEvaluation #AlignmentAssessment

X

🚀 Exciting news in the world of Chinese large language models (LLMs)! Introducing OpenEval, a comprehensive evaluation testbed for Chinese LLMs covering capability, alignment, and safety. Learn more about the benchmarks and evaluation results in the latest research paper: http://arxiv.org/abs/2403.12316v1 #AI #NLP #LLMs #OpenEval 🧠🇨🇳🔍
