Comprehensive Survey on Evaluation Methods for Large Language Models

Publication year: 1404 (Solar Hijri)
Document type: Conference paper
Language: English
Views: 55

The full text of this paper is available as a 6-page PDF.

National scientific document ID: ICTBC09_043

Indexing date: 26 Khordad 1405

Abstract:

Large Language Models (LLMs) have rapidly gained prominence in both research and industry due to their remarkable performance across diverse linguistic and cross-disciplinary tasks. This rapid progress highlights the need for systematic evaluation, not only in terms of technical performance but also regarding ethical, social, and safety concerns. This survey provides a comprehensive overview of evaluation approaches for LLMs along three dimensions: what to evaluate, where to evaluate, and how to conduct evaluation. It reviews tasks such as natural language processing, reasoning, medicine, education, social sciences, and agent-based applications. In addition, it summarizes common datasets, benchmarks, and metrics, while identifying both success cases and limitations of current models. Finally, the paper outlines future challenges, including the design of new evaluation protocols, handling bias, ensuring factual reliability, and improving trustworthiness, aiming to guide the development of more capable, safe, and reliable LLMs.

Authors

Pouya Faridfar

Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran.

Roshanak Ghasemian

Independent Researcher, Mashhad, Iran.

Reza Foroughnia

Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran.