Leveraging Large Language Models for NLG Evaluation: A Survey
Li, Z., Xu, X., Shen, T., Xu, C., Gu, J. C., & Tao, C. (2024). Leveraging Large Language Models for NLG Evaluation: A Survey. arXiv preprint arXiv:2401.07103.
1. Introduction
ㅇ imperative to establish robust evaluation methodologies that can reliably gauge the quality of the generated content
ㅇ Shortcomings of traditional NLG evaluation metrics
- BLEU, ROUGE, TER, etc. focus only on surface-level text differences (see the sketch after this list)
- fall short in assessing semantic aspects
- low alignment with human judgment; the scores lack interpretability
- Other methods are sometimes used to compensate for this, e.g., measuring semantic equivalence and fluency
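A minimal sketch of the surface-level problem (the example sentences are mine, not from the survey): a faithful paraphrase scores near zero under BLEU because only n-gram overlap is counted.

```python
# Toy illustration: BLEU rewards n-gram overlap, not meaning.
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "the cat sat on the mat".split()
paraphrase = "a feline was sitting on a rug".split()  # same meaning, little overlap
verbatim   = "the cat sat on the mat".split()

smooth = SmoothingFunction().method1  # avoid hard zeros for missing n-grams
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # ~0
print(sentence_bleu([reference], verbatim,   smoothing_function=smooth))  # 1.0
```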
ㅇ Recent LLM developments: Chain-of-Thought, zero-shot instruction following, better alignment, etc.
- LLM-based evaluation: LLMs can generate reasonable explanations to support the final score
2. Taxonomy: classifies works in NLG evaluation along three primary dimensions: evaluation task (T), evaluation references (r), evaluation function (f)
1) evaluation task: machine translation, text summarization, dialogue generation, story generation, image caption generation, data-to-text generation, general generation
> each task determines the target evaluation aspects and scenarios
ex) text summarization: focus on relevance to the source content / dialogue generation: coherence of the response
2) Evaluation References: divides evaluation into reference-based and reference-free scenarios (a code sketch follows this list)
(1) reference-based evaluation: the generated text h (hypothesis) is compared to a set of ground-truth references r.
- metrics: accuracy, relevance, coherence, degree of similarity to r, ...
(2) reference-free evaluation: does not rely on any external references; evaluates the generated text h based on its intrinsic qualities or its alignment with the provided source context s
- metrics: fluency, originality, relevance to the context, ...
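A small sketch of the two scenarios behind a single evaluation interface; the function names and the Jaccard stand-in are my own illustrative assumptions, not the survey's definitions.

```python
# Reference-based vs. reference-free expressed as one signature (illustrative).
from typing import Optional, Sequence

def jaccard(a: str, b: str) -> float:
    """Toy token-overlap similarity, standing in for any matching function."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def evaluate(h: str, s: str, r: Optional[Sequence[str]] = None) -> float:
    if r is not None:                     # reference-based: compare h with r
        return max(jaccard(h, ref) for ref in r)
    return jaccard(h, s)                  # reference-free: judge h against source s

print(evaluate("a short summary", "the source text", ["a gold summary"]))
```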
3) Evaluation Function: matching-based vs. generative-based
i) matching-based methods: measure the semantic equivalence between the reference and the hypothesis, or the degree of appropriateness between the source text and the hypothesis
- token-level matching functions: operate either in representation space or in discrete string space (sketched below)
- sequence-level matching functions
ii) generative-based methods: LLMs are employed to generate the evaluation results directly by designing instructions
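To make the matching-based branch concrete, a minimal sketch of token-level matching in representation space, in the spirit of BERTScore-style metrics; the 2-d embeddings are toy stand-ins for real contextual embeddings.

```python
# Token-level matching in representation space: each hypothesis token is
# greedily matched to its most similar reference token by cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_precision(hyp_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Mean best-match similarity over hypothesis tokens (precision side)."""
    return float(np.mean([max(cosine(h, r) for r in ref_emb) for h in hyp_emb]))

# Toy 2-d "embeddings"; rows are tokens.
ref = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
hyp = np.array([[0.9, 0.1], [0.1, 0.9]])
print(round(match_precision(hyp, ref), 3))
```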
3. Concrete taxonomy of the surveyed works: see the figure in the paper
ㅇ LLMs for NLG Evaluation
1. Taxonomy of Generative Evaluation
1) Prompt-based: evaluate directly via prompting
(1) score-based, (2) probability-based, (3) Likert-style, (4) pairwise, (5) ensemble (e.g., human + machine), (6) advanced (fine-grained evaluation schema); a Likert-style sketch follows
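A hedged sketch of the Likert-style variant; the prompt wording is mine, and call_llm is a hypothetical stand-in for any chat-completion client.

```python
# Prompt-based, Likert-style evaluation: ask the LLM for a 1-5 rating plus a
# short justification, then parse the rating out of the reply.
import re

LIKERT_PROMPT = """You are evaluating a text summarization system.
Source document:
{source}

Candidate summary:
{summary}

Rate the summary's relevance to the source on a 1-5 Likert scale
(1 = irrelevant, 5 = fully relevant). Reply with the number first,
then one sentence of justification."""

def call_llm(prompt: str) -> str:
    # Hypothetical: plug in your chat-completion API of choice here.
    raise NotImplementedError

def likert_relevance(source: str, summary: str) -> int:
    reply = call_llm(LIKERT_PROMPT.format(source=source, summary=summary))
    m = re.search(r"[1-5]", reply)  # take the first rating digit in the reply
    if m is None:
        raise ValueError(f"no 1-5 rating found in reply: {reply!r}")
    return int(m.group())
```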
2) Tuning-based: fine-tune a smaller LLM directly and use it as the evaluation model
(1) probability-based, (2) Likert-style, (3) pairwise, (4) advanced; a probability-based sketch follows
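For the probability-based variant, a sketch that scores a hypothesis by the average token log-likelihood a small LM assigns to it given the source (a BARTScore-like idea); the untuned GPT-2 here is only an illustrative stand-in for a fine-tuned evaluator.

```python
# Probability-based scoring: higher average log-likelihood of the hypothesis
# given the source = better score. Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_loglik(source: str, hypothesis: str) -> float:
    # Assumes the source tokenization is a prefix of the joint tokenization,
    # which holds for GPT-2 BPE with a space separator in typical cases.
    src_len = tok(source, return_tensors="pt").input_ids.shape[1]
    ids = tok(source + " " + hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = ids[0, 1:]
    idx = torch.arange(src_len - 1, targets.shape[0])     # hypothesis positions only
    return logprobs[idx, targets[idx]].mean().item()

print(avg_loglik("The cat sat on the mat.", "A cat is sitting on a mat."))
```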
2. Meta-evaluation benchmarks
1) machine translation 2) text summarization 3) dialogue generation 4) image caption 5) data-to-text 6) story generation 7) general generation
(The benchmark table in the paper is actually the key part.)