  • Leveraging Large Language Models for NLG Evaluation: A Survey
    Generative AI/benchmarks 2024. 1. 22. 15:32

    Li, Z., Xu, X., Shen, T., Xu, C., Gu, J. C., & Tao, C. (2024). Leveraging Large Language Models for NLG Evaluation: A Survey. arXiv preprint arXiv:2401.07103.

     

    1. Introduction 
      ㅇ imperative to establish robust evaluation methodologies that can reliably gauge the quality of the generated content
      ㅇ shortcomings of traditional NLG evaluation metrics
          - BLEU, ROUGE, TER, etc. focus only on surface-level text differences (see the sketch after this list)
          - fall short in assessing semantic aspects
          - low alignment with human judgement; lack of interpretability for the score
          - other methods are sometimes used to compensate for this, e.g., measuring semantic equivalence and fluency
      ㅇ recent LLM trends: advances in Chain-of-Thought, zero-shot instruction following, better alignment, etc.
           - LLM evaluation: LLMs can generate reasonable explanations to support the final score
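
    A quick illustration of the surface-level limitation: BLEU rewards exact n-gram overlap, so a faithful paraphrase can score near zero. A minimal sketch using NLTK's sentence_bleu (the example sentences are mine):

        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

        reference  = "the quick brown fox jumps over the lazy dog".split()
        paraphrase = "a fast auburn fox leaps above the idle hound".split()  # same meaning, little n-gram overlap

        # smoothing avoids a hard zero when higher-order n-grams never match
        smooth = SmoothingFunction().method1
        score = sentence_bleu([reference], paraphrase, smoothing_function=smooth)
        print(f"BLEU = {score:.3f}")  # very low despite semantic equivalence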
     


    2. Taxonomy: classifies works in NLG evaluation along three primary dimensions: evaluation task (T), evaluation references (r), evaluation function (f)
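
    Put compactly (my notation, not necessarily the paper's exact symbols): an evaluator can be viewed as a function of the hypothesis, the source, and an optional reference set,

        score = f(h, s, r),   where r may be empty (the reference-free scenario)

    with h the generated hypothesis, s the source/context, and r the set of gold references.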


        1) evaluation task: machine translation, text summarization, dialogue generation, story generation, image caption generation, data-to-text generation, general generation
    > each task determines the target evaluation aspects and scenarios
         ex) text summarization: focus on relevance to the source content / dialogue generation: coherence of the response
        2) Evaluation References: divide the evaluation scenarios into reference-based and reference-free (toy contrast sketched below)
             (1) reference-based evaluation: the generated text h (hypothesis) is compared to a set of ground-truth references r
                   - metrics: accuracy, relevance, coherence, degree of similarity to r, ...
             (2) reference-free evaluation: does not rely on any external references; evaluates the generated text h based on its intrinsic qualities or its alignment with the provided source context s
                   - metrics: fluency, originality, relevance to the context, ...
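
             To make the contrast concrete, a toy sketch (the two proxy metrics below are my own illustrations, not ones the survey endorses): a reference-based scorer needs gold references r, while a reference-free scorer only sees the source context s.

                 def unigram_f1(h: str, r: str) -> float:
                     # reference-based toy metric: unigram overlap F1 between hypothesis and one reference
                     h_tok, r_tok = set(h.lower().split()), set(r.lower().split())
                     overlap = len(h_tok & r_tok)
                     if overlap == 0:
                         return 0.0
                     p, rec = overlap / len(h_tok), overlap / len(r_tok)
                     return 2 * p * rec / (p + rec)

                 def reference_based(h: str, references: list[str]) -> float:
                     # compare against every gold reference, keep the best match
                     return max(unigram_f1(h, r) for r in references)

                 def reference_free(h: str, source: str) -> float:
                     # crude relevance-to-context proxy: share of hypothesis words grounded in the source
                     src = set(source.lower().split())
                     h_tok = h.lower().split()
                     return sum(w in src for w in h_tok) / len(h_tok)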

        3) Evaluation Function: matching-based vs. generative-based

                  i) matching-based methods: measure the semantic equivalence between the reference and the hypothesis, or the degree of appropriateness between the source text and the hypothesis
                      - token-level matching functions: operate either in a representation (embedding) space or in the discrete string space (see the sketch after this list)
                      - sequence-level matching functions
                   ii) generative-based methods: LLMs are employed to generate the evaluation results directly via designed instructions
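
                  A minimal sketch of token-level matching in representation space, in the spirit of BERTScore (the model choice bert-base-uncased and the greedy max-similarity matching are illustrative, not the survey's prescription; special tokens are kept for simplicity):

                      import torch
                      from transformers import AutoModel, AutoTokenizer

                      tok = AutoTokenizer.from_pretrained("bert-base-uncased")
                      model = AutoModel.from_pretrained("bert-base-uncased")

                      def token_embeddings(text: str) -> torch.Tensor:
                          batch = tok(text, return_tensors="pt")
                          with torch.no_grad():
                              out = model(**batch).last_hidden_state[0]        # (seq_len, hidden)
                          return torch.nn.functional.normalize(out, dim=-1)    # unit-length rows

                      def bertscore_style_f1(hypothesis: str, reference: str) -> float:
                          h, r = token_embeddings(hypothesis), token_embeddings(reference)
                          sim = h @ r.T                              # cosine similarity matrix
                          precision = sim.max(dim=1).values.mean()   # each hyp token -> best ref token
                          recall    = sim.max(dim=0).values.mean()   # each ref token -> best hyp token
                          return (2 * precision * recall / (precision + recall)).item()

                      print(bertscore_style_f1("the cat sat on the mat", "a cat was sitting on the mat"))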


    3. Detailed taxonomy of existing studies: see the figure in the paper


    ㅇ LLMs for NLG Evaluation
       1. Taxonomy of generative evaluation
           1) Prompt-based: evaluate directly with prompts (a minimal score-based sketch follows after this list)
                (1) score-based, (2) probability-based, (3) Likert-style, (4) pairwise, (5) ensemble (e.g., human + machine), (6) advanced (fine-grained evaluation schema)
           2) Tuning-based: directly fine-tune a smaller LLM for use as the evaluation model
                (1) probability-based, (2) Likert-style, (3) pairwise, (4) advanced
       2. Meta-evaluation benchmarks
           1) machine translation  2) text summarization  3) dialogue generation  4) image caption  5) data-to-text  6) story generation  7) general generation
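
    To illustrate the score-based prompting branch: a minimal sketch assuming access to an OpenAI-compatible chat API (the model name, rubric wording, and 1-10 scale are placeholders, not the survey's prompt):

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        PROMPT = (
            "You are evaluating a summary.\n"
            "Source document:\n{source}\n\n"
            "Candidate summary:\n{summary}\n\n"
            "Rate the summary's relevance to the source on a 1-10 scale.\n"
            "Reply with the integer score only."
        )

        def score_summary(source: str, summary: str) -> int:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user",
                           "content": PROMPT.format(source=source, summary=summary)}],
            )
            return int(resp.choices[0].message.content.strip())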

    In fact, this taxonomy table is the core takeaway of the paper.

     
