Leveraging Large Language Models for NLG Evaluation: A Survey
Li, Z., Xu, X., Shen, T., Xu, C., Gu, J. C., & Tao, C. (2024). Leveraging Large Language Models for NLG Evaluation: A Survey. arXiv preprint arXiv:2401.07103.
1. Introduction
ㅇ imperative to establish robust evaluation methodologies that can reliably gauge the quality of the generated content
ㅇ Shortcomings of traditional NLG evaluation metrics
- BLEU, ROUGE, TER, etc. focus only on surface-level text differences (see the sketch after this list)
- fall short in assessing semantic aspects
- low alignment with human judgment; the scores lack interpretability
- Other methods are sometimes used to compensate for this, e.g., measuring semantic equivalence and fluency
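A minimal sketch of the surface-level problem (the example sentences are mine, not from the survey): a faithful paraphrase scores near zero under BLEU because only n-gram overlap is counted.

```python
# Toy illustration: BLEU rewards n-gram overlap, not meaning.
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "the cat sat on the mat".split()
paraphrase = "a feline was sitting on a rug".split()  # same meaning, little overlap
verbatim   = "the cat sat on the mat".split()

smooth = SmoothingFunction().method1  # avoid hard zeros for missing n-grams
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # ~0
print(sentence_bleu([reference], verbatim,   smoothing_function=smooth))  # 1.0
```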
ㅇ Recent LLM developments: Chain-of-Thought, zero-shot instruction following, better alignment, etc.
- LLM-based evaluation: LLMs can generate reasonable explanations to support the final score
2. Taxonomy: classifies works in NLG evaluation along three primary dimensions: evaluation task (T), evaluation references (r), evaluation function (f)
1) evaluation task: machine translation, text summarization, dialogue generation, story generation, image caption generation, data-to-text generation, general generation
> each task determines the target evaluation aspects and scenarios
ex) text summarization: focus on relevance to the source content / dialogue generation: coherence of the response
2) Evaluation References: divides evaluation into reference-based and reference-free scenarios (a code sketch follows this list)
(1) reference-based evaluation: the generated text h (hypothesis) is compared to a set of ground-truth references r.
- metrics: accuracy, relevance, coherence, degree of similarity to r, ...
(2) reference-free evaluation: does not rely on any external references; evaluates the generated text h based on its intrinsic qualities or its alignment with the provided source context s
- metrics: fluency, originality, relevance to the context, ...
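A small sketch of the two scenarios behind a single evaluation interface; the function names and the Jaccard stand-in are my own illustrative assumptions, not the survey's definitions.

```python
# Reference-based vs. reference-free expressed as one signature (illustrative).
from typing import Optional, Sequence

def jaccard(a: str, b: str) -> float:
    """Toy token-overlap similarity, standing in for any matching function."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def evaluate(h: str, s: str, r: Optional[Sequence[str]] = None) -> float:
    if r is not None:                     # reference-based: compare h with r
        return max(jaccard(h, ref) for ref in r)
    return jaccard(h, s)                  # reference-free: judge h against source s

print(evaluate("a short summary", "the source text", ["a gold summary"]))
```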
3) Evaluation Function: matching-based vs. generative-based
i) matching-based methods: measure the semantic equivalence between the reference and the hypothesis, or the degree of appropriateness between the source text and the hypothesis
- token-level matching functions: operate either in representation space or in discrete string space (sketched below)
- sequence-level matching functions
ii) generative-based methods: LLMs are employed to generate the evaluation results directly by designing instructions
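To make the matching-based branch concrete, a minimal sketch of token-level matching in representation space, in the spirit of BERTScore-style metrics; the 2-d embeddings are toy stand-ins for real contextual embeddings.

```python
# Token-level matching in representation space: each hypothesis token is
# greedily matched to its most similar reference token by cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_precision(hyp_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Mean best-match similarity over hypothesis tokens (precision side)."""
    return float(np.mean([max(cosine(h, r) for r in ref_emb) for h in hyp_emb]))

# Toy 2-d "embeddings"; rows are tokens.
ref = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
hyp = np.array([[0.9, 0.1], [0.1, 0.9]])
print(round(match_precision(hyp, ref), 3))
```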
3. Concrete taxonomy of the surveyed works: see the figure in the paper
ㅇ LLMs for NLG Evaluation
1. Taxonomy of Generative Evaluation
1) Prompt-based: evaluate directly via prompting
(1) score-based, (2) probability-based, (3) Likert-style, (4) pairwise, (5) ensemble (e.g., human + machine), (6) advanced (fine-grained evaluation schema); a Likert-style sketch follows
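A hedged sketch of the Likert-style variant; the prompt wording is mine, and call_llm is a hypothetical stand-in for any chat-completion client.

```python
# Prompt-based, Likert-style evaluation: ask the LLM for a 1-5 rating plus a
# short justification, then parse the rating out of the reply.
import re

LIKERT_PROMPT = """You are evaluating a text summarization system.
Source document:
{source}

Candidate summary:
{summary}

Rate the summary's relevance to the source on a 1-5 Likert scale
(1 = irrelevant, 5 = fully relevant). Reply with the number first,
then one sentence of justification."""

def call_llm(prompt: str) -> str:
    # Hypothetical: plug in your chat-completion API of choice here.
    raise NotImplementedError

def likert_relevance(source: str, summary: str) -> int:
    reply = call_llm(LIKERT_PROMPT.format(source=source, summary=summary))
    m = re.search(r"[1-5]", reply)  # take the first rating digit in the reply
    if m is None:
        raise ValueError(f"no 1-5 rating found in reply: {reply!r}")
    return int(m.group())
```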
2) Tuning-based: fine-tune a smaller LLM directly and use it as the evaluation model
(1) probability-based, (2) Likert-style, (3) pairwise, (4) advanced; a probability-based sketch follows
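For the probability-based variant, a sketch that scores a hypothesis by the average token log-likelihood a small LM assigns to it given the source (a BARTScore-like idea); the untuned GPT-2 here is only an illustrative stand-in for a fine-tuned evaluator.

```python
# Probability-based scoring: higher average log-likelihood of the hypothesis
# given the source = better score. Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_loglik(source: str, hypothesis: str) -> float:
    # Assumes the source tokenization is a prefix of the joint tokenization,
    # which holds for GPT-2 BPE with a space separator in typical cases.
    src_len = tok(source, return_tensors="pt").input_ids.shape[1]
    ids = tok(source + " " + hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = ids[0, 1:]
    idx = torch.arange(src_len - 1, targets.shape[0])     # hypothesis positions only
    return logprobs[idx, targets[idx]].mean().item()

print(avg_loglik("The cat sat on the mat.", "A cat is sitting on a mat."))
```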
2. Meta-evaluation benchmarks
1) machine translation 2) text summarization 3) dialogue generation 4) image caption 5) data-to-text 6) story generation 7) general generation
(The benchmark table in the paper is actually the key part.)