Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization

Authors

Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune

Abstract

Despite recent advances, evaluating how well large language models (LLMs) follow user instructions remains an open problem. While prompt-based approaches to evaluating language models have become increasingly popular, limited work has examined whether these evaluation methods are themselves correct. In this work, we perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of LLMs. Our investigation is performed on grounded, query-based summarization by collecting a new short-form, real-world dataset riSum, containing $300$ document-instruction pairs with $3$ answers each. All $900$ answers are rated by $3$ human annotators. Using riSum, we analyze agreement between evaluation methods and human judgment. Finally, we propose new LLM-based, reference-free evaluation methods that improve upon established baselines and perform on par with costly reference-based metrics that require high-quality reference summaries.
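To make the agreement analysis concrete, the following is a minimal illustrative sketch (not the paper's code) of correlating an automatic metric's scores with averaged human ratings over a pool of rated answers. The arrays `metric_scores` and `human_ratings` are hypothetical placeholders, and Kendall's tau is only one common choice of agreement statistic; the paper may use a different procedure.

```python
# Illustrative sketch, not the riSum evaluation pipeline.
# Assumption: one automatic-metric score per answer and three human ratings
# per answer, averaged before computing rank correlation.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Placeholder data standing in for 900 rated answers.
metric_scores = rng.random(900)                      # hypothetical metric outputs
human_ratings = rng.integers(1, 6, size=(900, 3))    # hypothetical 1-5 ratings, 3 annotators
mean_human = human_ratings.mean(axis=1)

# Rank correlation between the metric and mean human judgment.
tau, p_value = kendalltau(metric_scores, mean_human)
print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3g})")
```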
