Science Cast

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Ming LiFebruary 26, 2024 11:36am

Views (71)
Comments (0)

Export Citation

Voice is AI-generated

Connected to paperThis paper is a preprint and has not been certified by peer review

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

arXivPDF

Authors

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao

Abstract

In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements.

TwitterandLinkedIn

0 comments

Add comment

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

AI-powered Paper ChatBeta

AI-powered Paper ChatBeta

0 comments