
MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

arXiv link - 2024-03-03 02:23:19

Abstract

Although Large Language Models (LLMs) have shown strong performance on Multi-hop Question Answering (MHQA) tasks, their true reasoning ability remains under-explored. Current LLM QA evaluation benchmarks have shown limitations, including 1) data contamination: the evaluation data are potentially exposed to LLMs during the pretraining stage; and 2) the neglect of reasoning chain evaluation. We therefore introduce an LLM MHQA evaluation benchmark, the first QA benchmark based on new, unprecedented knowledge obtained by editing the off-the-shelf HotpotQA dataset. In addition, we annotate and evaluate the reasoning chain in the form of sub-questions and intermediate answers corresponding to each multi-hop question. We observe that 1) LLMs show a performance gap between the original HotpotQA and our edited data, suggesting that current MHQA benchmarks carry a risk of data contamination that makes it hard to evaluate LLMs' performance objectively and scientifically; and 2) LLMs produce only a small percentage of correct reasoning chains, e.g., GPT-4 gets only 36.3% of reasoning chains right. We believe this new multi-hop QA evaluation benchmark and novel evaluation methods will facilitate the development of trustworthy LLM evaluation on the MHQA task.
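As a rough illustration of the evaluation idea described in the abstract, the sketch below shows one way an edited multi-hop example and its annotated reasoning chain (sub-questions with gold intermediate answers) could be represented and scored as whole-chain accuracy. This is a minimal sketch under stated assumptions: the class and function names (`EditedExample`, `reasoning_chain_accuracy`) and the exact-match scoring are hypothetical and are not taken from the paper's released code or data format.

```python
# Hypothetical representation of one knowledge-edited multi-hop example,
# with its annotated reasoning chain (sub-questions + intermediate answers).
from dataclasses import dataclass, field


@dataclass
class SubQuestion:
    question: str  # single-hop sub-question
    answer: str    # gold intermediate answer under the edited knowledge


@dataclass
class EditedExample:
    question: str       # multi-hop question
    edited_answer: str  # final answer after knowledge editing
    chain: list[SubQuestion] = field(default_factory=list)


def normalize(text: str) -> str:
    # Simple normalization for exact-match comparison (assumption, not the paper's metric).
    return " ".join(text.lower().strip().split())


def chain_is_correct(example: EditedExample, predicted_steps: list[str]) -> bool:
    """A chain counts as correct only if every intermediate answer matches."""
    if len(predicted_steps) != len(example.chain):
        return False
    return all(
        normalize(pred) == normalize(step.answer)
        for pred, step in zip(predicted_steps, example.chain)
    )


def reasoning_chain_accuracy(examples: list[EditedExample],
                             predictions: list[list[str]]) -> float:
    """Fraction of questions whose entire reasoning chain is reproduced correctly."""
    if not examples:
        return 0.0
    correct = sum(chain_is_correct(ex, preds) for ex, preds in zip(examples, predictions))
    return correct / len(examples)
```

Under this kind of whole-chain scoring, a model only gets credit when every intermediate answer is right, which is why the reported reasoning-chain numbers (e.g., 36.3% for GPT-4) can be much lower than final-answer accuracy.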

Socials

LinkedIn
🚀 Exciting news in the world of Large Language Models (LLMs) and Multi-hop Question Answering (MHQA) tasks! 🌟

A recent study has shed light on the limitations of current LLM QA evaluation benchmarks, highlighting issues such as data contamination and the lack of reasoning chain evaluation. To address these shortcomings, a new LLM MHQA evaluation benchmark has been introduced, built by editing the HotpotQA dataset with new, previously unseen knowledge. The benchmark also includes annotated reasoning chains in the form of sub-questions and intermediate answers, providing a more comprehensive evaluation of LLM performance.

Key findings from the study include a performance gap between the original HotpotQA dataset and the edited benchmark, as well as LLMs producing only a small percentage of correct reasoning chains; GPT-4, for example, gets the full reasoning chain right for only 36.3% of questions.

For more details on this groundbreaking research and its implications for the development of trustworthy LLM evaluation in MHQA tasks, check out the full paper here: http://arxiv.org/abs/2402.11924v2

#LLM #MHQA #AI #NLP #Research #Tech #Innovation

Let's continue pushing the boundaries of AI technology together! 🚀💡🔍

X

🚀 Exciting news in the world of Large Language Models (LLMs) and Multi-hop Question Answering (MHQA)! A new evaluation benchmark has been introduced to address current limitations and enhance the assessment of LLM reasoning abilities. Check out the details in the research paper here: http://arxiv.org/abs/2402.11924v2 #AI #NLP #LLMs #MHQA #TechResearch 🤖📚
