Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Arxiv Link - 2024-06-17 15:11:58

Abstract

Leveraging Large Language Models (LLMs) as judges for evaluating the performance of LLMs has recently garnered attention. Nonetheless, this type of approach concurrently introduces potential biases from LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced and Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Meanwhile, the experimental results further verify the effectiveness of the symbol bias mitigation approach.
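The abstract does not give the prompt format, so the following is only a minimal sketch of how a many-shot prompt for single answer grading might be assembled and scaled from zero to many examples. The `Shot` fields, the `build_many_shot_prompt` helper, the 1 to 10 rating scale, and the `with_judgments` switch are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shot:
    """One in-context example. All fields are illustrative assumptions."""
    question: str
    answer: str
    rationale: Optional[str] = None  # a judge's explanation, if available
    score: Optional[int] = None      # a judge's score, if available


def build_many_shot_prompt(shots: List[Shot], question: str, answer: str,
                           with_judgments: bool, num_shots: int) -> str:
    """Assemble a grading prompt containing up to `num_shots` examples.

    with_judgments=True  -> examples include a rationale and a score
    with_judgments=False -> examples show only the question and the answer
    num_shots=0          -> plain zero-shot prompt
    """
    # The rating scale is an assumption made for this sketch.
    parts = ["Rate the answer to the question on a scale of 1 to 10."]
    for i, shot in enumerate(shots[:num_shots], start=1):
        block = [f"Example {i}",
                 f"Question: {shot.question}",
                 f"Answer: {shot.answer}"]
        has_judgment = shot.rationale is not None and shot.score is not None
        if with_judgments and has_judgment:
            block.append(f"Rationale: {shot.rationale}")
            block.append(f"Score: {shot.score}")
        parts.append("\n".join(block))
    # The item to be graded always comes last.
    parts.append("\n".join(["Now grade this answer.",
                            f"Question: {question}",
                            f"Answer: {answer}",
                            "Rationale:"]))
    return "\n\n".join(parts)
```

Varying `num_shots` while holding the graded item fixed is one way to study how the number of in-context examples affects the judge's output.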

Socials

X

🚀 Exciting advancements in the world of Large Language Models (LLMs)! 🤖 Our latest study explores the use of many-shot in-context prompts, Reinforced and Unsupervised ICL, to enhance the evaluation process of LLMs like GPT-4o-as-a-Judge in single answer grading. Find out how scaling the number of in-context examples impacts evaluation quality and agreement. Discover the revealed symbol bias in GPT-4o-as-a-Judge and a novel approach to address it effectively. 📊🔍

Read more about our research and experimental results here: Link to the study

#AI #NLP #LLMs #Research #GPT4o #TechInnovation #ArtificialIntelligence #LanguageModels

🚀 Exciting new research on leveraging Large Language Models (LLMs) as judges for evaluating LLM performance! Learn about the proposed many-shot in-context prompts for GPT-4o-as-a-Judge in single-answer grading and how to mitigate potential biases. Check out the study at: http://arxiv.org/abs/2406.11629v1 #AI #NLP #LLMs #Research #GPT4o
