
Humans or LLMs as the Judge? A Study on Judgement Biases

arXiv: http://arxiv.org/abs/2402.10669v3 - 2024-04-17 09:56:26

Abstract

Adopting humans and large language models (LLMs) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLM judges, calling the reliability of the evaluation results into question. In this paper, we propose a novel framework that is free from referencing ground-truth annotations for investigating Fallacy Oversight Bias, Authority Bias, and Beauty Bias in LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of human and LLM evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even cutting-edge judges possess considerable biases. We further exploit these weaknesses and conduct attacks on LLM judges. We hope that our work can alert the community to the vulnerability of human- and LLM-as-a-judge to perturbations, as well as the urgency of developing robust evaluation systems.
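To make the evaluation setup concrete, here is a minimal, hypothetical sketch (not the authors' code) of how one might probe an LLM judge for Authority Bias: perturb the weaker of two answers with a fabricated citation and check whether the judge's pairwise verdict flips. The names `call_judge`, `add_fake_authority`, and `bias_probe` are illustrative placeholders, and `call_judge` stands in for whatever chat-LLM API serves as the judge.

```python
# Minimal sketch (illustrative only): probing an LLM judge for Authority Bias by
# perturbing one answer with a fabricated citation and checking whether the
# judge's pairwise verdict flips. `call_judge` is any text-in/text-out judge.
from typing import Callable

JUDGE_PROMPT = (
    "You are an impartial judge. Compare Answer A and Answer B to the question "
    "below and reply with exactly 'A' or 'B' for the better answer.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

def add_fake_authority(answer: str) -> str:
    """Authority-Bias perturbation: append a plausible-looking but fabricated citation."""
    return answer + " (see Smith et al., 2021, Journal of Cognitive Science)"

def judge_pair(call_judge: Callable[[str], str], question: str,
               answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer is better; normalize the reply to 'A' or 'B'."""
    verdict = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return "A" if verdict.strip().upper().startswith("A") else "B"

def bias_probe(call_judge: Callable[[str], str], question: str,
               good_answer: str, weak_answer: str) -> bool:
    """Return True if the fabricated citation flips the judge toward the weak answer."""
    before = judge_pair(call_judge, question, good_answer, weak_answer)
    after = judge_pair(call_judge, question, good_answer, add_fake_authority(weak_answer))
    return before == "A" and after == "B"

# Toy stand-in judge that is swayed by citations, just to demonstrate a flip.
def citation_swayed_judge(prompt: str) -> str:
    answer_a = prompt.split("Answer A:")[1].split("Answer B:")[0]
    answer_b = prompt.split("Answer B:")[1]
    return "B" if ("et al." in answer_b and "et al." not in answer_a) else "A"

print(bias_probe(citation_swayed_judge, "What causes tides?",
                 "Tides are caused mainly by the Moon's gravity.",
                 "Tides are caused by wind."))  # True: the verdict flipped
```

The same harness could, under the same assumptions, be reused for the other perturbations the paper studies (e.g. rewriting an answer more "beautifully" or inserting a fallacy) by swapping out the perturbation function.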

Socials

LinkedIn
🚀 Exciting new research alert! 🚀

Adopting humans and large language models (LLMs) as judges for evaluating LLM performance has been a hot topic lately. But what if I told you that this approach might introduce biases that could impact the reliability of evaluation results? 😱

In a recent paper, a novel framework was proposed to investigate Fallacy Oversight Bias, Authority Bias, and Beauty Bias in LLM and human judges without relying on ground-truth annotations. Thousands of human and LLM evaluations were conducted using a curated dataset based on the revised Bloom's Taxonomy.

The results were eye-opening! Both human and LLM judges showed vulnerabilities to perturbations, with even cutting-edge judges displaying significant biases. The study went a step further by conducting attacks on LLM judges to exploit these weaknesses.

Check out the full paper to learn more about the vulnerability of human- and LLM-as-a-judge to perturbations and the importance of developing robust evaluation systems. Knowledge is power! 💪

Read the full paper here: http://arxiv.org/abs/2402.10669v3

#AI #NLP #LLM #Research #Tech #EvaluationBias #Innovation #TechNews #ArtificialIntelligence

X

🤖🧠 New research alert! Investigating biases in humans and large language models (LLMs) as judges for evaluating LLM performance. Results show vulnerability to perturbations and considerable biases even in cutting-edge judges. Learn more at: http://arxiv.org/abs/2402.10669v3 #AI #NLP #LLM #Research #TechBias
