
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

arXiv link (2024-01-30): http://arxiv.org/abs/2401.16788v1

Abstract

Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained by the coverage of existing benchmarks or requires extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist human annotators in discerning the most capable LLMs as evaluators, which significantly eases their workload in cases that used to require large-scale annotations during meta-evaluation. We release the code for our framework, which is publicly available at: https://github.com/GAIR-NLP/scaleeval
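To make the agent-debate idea more concrete, below is a minimal sketch of how a multi-round debate among LLM agents could be orchestrated. This is not the released ScaleEval implementation (see the GitHub repository above for that); the `ask_llm` callable, the agent names, the prompts, and the consensus rule are all illustrative assumptions.

```python
# Minimal sketch (not the official ScaleEval code) of a multi-round agent
# debate for meta-evaluation. `ask_llm`, the agent names, the prompts, and
# the consensus rule are illustrative assumptions.

from typing import Callable, Dict, List, Tuple


def agent_debate(
    ask_llm: Callable[[str, str], str],  # (agent_name, prompt) -> reply text
    agents: List[str],                   # e.g. ["agent_a", "agent_b", "agent_c"]
    task: str,                           # user-defined scenario / instruction
    response_1: str,                     # first candidate response to the task
    response_2: str,                     # second candidate response to the task
    max_rounds: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Debate for up to `max_rounds`; stop early once all agents agree."""
    transcript: List[str] = []
    verdicts: Dict[str, str] = {}
    for round_id in range(1, max_rounds + 1):
        for agent in agents:
            prompt = (
                f"Task: {task}\n"
                f"Response 1: {response_1}\n"
                f"Response 2: {response_2}\n"
                "Discussion so far:\n" + "\n".join(transcript) + "\n"
                "Which response is better? Begin your answer with '1', '2', "
                "or 'tie', then give a brief justification."
            )
            reply = ask_llm(agent, prompt)
            transcript.append(f"[round {round_id}] {agent}: {reply}")
            # Naive verdict parsing; a real system would extract this more carefully.
            head = reply.strip().lower()
            if head.startswith("1"):
                verdicts[agent] = "1"
            elif head.startswith("2"):
                verdicts[agent] = "2"
            else:
                verdicts[agent] = "tie"
        if len(set(verdicts.values())) == 1:
            break  # consensus reached among the debating agents
    # If verdicts still disagree here, the item would be escalated to a human.
    return verdicts, transcript
```

In this sketch, the debate verdict can be compared against the judgment of the LLM evaluator being meta-evaluated, and only items on which the agents fail to converge would be passed to human annotators, which is where the workload reduction described in the abstract would come from.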

Socials

LinkedIn

🚀 Exciting news in the world of Large Language Models (LLMs)!

Evaluating LLMs across diverse tasks and scenarios is crucial yet challenging. To address this, we introduce ScaleEval, a cutting-edge meta-evaluation framework that employs agent-debate assistance to assess the effectiveness of LLMs as evaluators efficiently and reliably. Our framework facilitates multi-round discussions, aiding human annotators in identifying the most adept LLM evaluators.

Curious to learn more about ScaleEval and how it strengthens the evaluation of LLMs? Dive into the paper at http://arxiv.org/abs/2401.16788v1 and access our framework's code at https://github.com/GAIR-NLP/scaleeval

#AI #NLP #LLMs #TechInnovation #ScaleEval

X

🚀 Exciting news in the world of Large Language Models (LLMs)! Check out ScaleEval, a meta-evaluation framework that uses debate among multiple communicative LLM agents to assess how reliably LLMs perform as evaluators across diverse tasks and scenarios. Details and code: http://arxiv.org/abs/2401.16788v1 #AI #NLP #LLMs #ScaleEval
