NLG) BERTScore

김아다만티움 2023. 3. 7. 11:55

Hugging Face has a page that collects and introduces its evaluation metrics.

* https://huggingface.co/evaluate-metric

1. BERTScore overview

BERTScore is an automatic evaluation metric for text generation tasks.

It leverages BERT's pre-trained contextual embeddings to measure the cosine similarity between the tokens of a candidate sentence and the tokens of a reference sentence. It is also a metric that has been shown to correlate with human judgment at both the sentence level and the system level.

In particular, BERTScore computes precision, recall, and an F1 score, which are useful when evaluating language generation tasks.
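
To make the token matching concrete, here is a minimal sketch of how precision, recall, and F1 can be derived from token-level cosine similarities with greedy matching. It assumes no idf weighting and uses made-up 2-d vectors in place of real BERT embeddings; the function names are hypothetical and this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs, ref_vecs):
    """Precision, recall, F1 from greedy cosine matching (no idf weighting)."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs))) for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "embeddings" (assumed values for illustration, not real BERT outputs).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_bertscore(candidate, reference)
print(p, r, f)
```

Here every candidate token has an exact counterpart in the reference, so precision is maximal, while the extra reference token has only a partial match and lowers recall.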

2. BERTScore usage (@Huggingface)

1) BERTScore has three mandatory arguments.

'predictions': a list of strings containing the candidate sentences
'references': a list of strings, or a list of lists of strings, containing the reference sentences
'lang' or 'model_type'
- lang: a two-letter language code following ISO 639-1
- model_type: used to specify a particular BERT-family model
(BERTScore normally selects the model according to the target language, but you can also choose one explicitly with model_type)

Basic example #1: lang set to 'en'

from evaluate import load

bertscore = load("bertscore")
predictions = ["hello there", "general kenobi"]  # generated text goes in predictions
references = ["hello there", "general kenobi"]  # ground-truth text goes in references

results = bertscore.compute(predictions=predictions, references=references, lang="en")

2) Other arguments

num_layers (int)	The layer of representation to use. The default is the number of layers tuned on WMT16 correlation data, which depends on the model_type used.
verbose (bool)	Turn on intermediate status updates. The default value is False.
idf (bool or dict)	Use idf weighting; can also be a precomputed idf_dict.
device (str)	Device on which the contextual embedding model is allocated. If this argument is None, the model lives on cuda:0 if cuda is available.
nthreads (int)	Number of threads used for computation. The default value is 4.
rescale_with_baseline (bool)	Rescale BERTScore with the pre-computed baseline. The default value is False.
batch_size (int)	BERTScore processing batch size. (At least one of model_type or lang must be given; lang needs to be specified when rescale_with_baseline is True.)
baseline_path (str)	Customized baseline file.
use_fast_tokenizer (bool)	use_fast parameter passed to the HF tokenizer. The default value is False.
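
As a rough sketch of how the optional arguments combine with the mandatory ones, the snippet below collects keyword arguments and checks the two rules stated in the table (at least one of lang or model_type; lang is required when rescale_with_baseline is True). The helper names `build_bertscore_kwargs` and `compute_bertscore` are hypothetical and not part of the evaluate library; only `load("bertscore")` and `compute(...)` come from the examples in this post.

```python
def build_bertscore_kwargs(predictions, references, lang=None, model_type=None,
                           idf=False, rescale_with_baseline=False, batch_size=None):
    """Collect keyword arguments for bertscore.compute (hypothetical helper)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if lang is None and model_type is None:
        raise ValueError("specify at least one of lang or model_type")
    if rescale_with_baseline and lang is None:
        raise ValueError("lang is required when rescale_with_baseline is True")

    kwargs = {
        "predictions": predictions,
        "references": references,
        "idf": idf,
        "rescale_with_baseline": rescale_with_baseline,
    }
    # Only pass the optional settings that were actually provided.
    if lang is not None:
        kwargs["lang"] = lang
    if model_type is not None:
        kwargs["model_type"] = model_type
    if batch_size is not None:
        kwargs["batch_size"] = batch_size
    return kwargs


def compute_bertscore(**kwargs):
    """Run the metric; imported lazily because it needs evaluate and a model download."""
    from evaluate import load
    return load("bertscore").compute(**kwargs)


kwargs = build_bertscore_kwargs(
    predictions=["hello there", "general kenobi"],
    references=["hello there", "general kenobi"],
    lang="en",
    idf=True,
    rescale_with_baseline=True,
    batch_size=16,
)
# results = compute_bertscore(**kwargs)  # uncomment to run; downloads the model on first use
```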

3. BERTScore results

BERTScore returns a dictionary with the following entries.

'precision': values between 0.0 and 1.0; one precision value for each pair of 'predictions' and 'references'
'recall': values between 0.0 and 1.0; one recall value for each pair of 'predictions' and 'references'
'f1': values between 0.0 and 1.0; one F1 score for each pair of 'predictions' and 'references'
'hashcode': the hashcode of the library

Maximal values example #1: model_type = 'distilbert-base-uncased' (the model name shows up in the hashcode of the result)

from evaluate import load

bertscore = load("bertscore")
# identical predictions and references give the maximal score
predictions = ["hello world", "general kenobi"]
references = ["hello world", "general kenobi"]
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")

print(results)
# {'precision': [1.0, 1.0], 'recall': [1.0, 1.0], 'f1': [1.0, 1.0], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.10(hug_trans=4.10.3)'}

Partial match example #2: model_type = 'distilbert-base-uncased' (the model name shows up in the hashcode of the result)

from evaluate import load

bertscore = load("bertscore")
# predictions and references differ, so the scores fall below 1.0
predictions = ["hello world", "general kenobi"]
references = ["goodnight moon", "the sun is shining"]

results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")

print(results)
# {'precision': [0.7380737066268921, 0.5584042072296143], 'recall': [0.7380737066268921, 0.5889028906822205], 'f1': [0.7380737066268921, 0.5732481479644775],
#  'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.10(hug_trans=4.10.3)'}
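
Because the returned lists hold one score per prediction/reference pair, a single corpus-level number has to be computed separately. One simple option, sketched below, is to average each list; the helper name `average_scores` is hypothetical, and the input values are copied from the partial match output above (the hashcode entry is left out because it is not a score).

```python
from statistics import mean


def average_scores(results):
    """Average the per-pair precision, recall, and f1 lists (hypothetical helper)."""
    return {key: mean(results[key]) for key in ("precision", "recall", "f1")}


# Scores copied from the partial match example.
results = {
    "precision": [0.7380737066268921, 0.5584042072296143],
    "recall": [0.7380737066268921, 0.5889028906822205],
    "f1": [0.7380737066268921, 0.5732481479644775],
}

summary = average_scores(results)
print(summary)
```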

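4. Limitations and bias

1) The paper that proposed BERTScore notes a limitation: although the metric correlates with human judgment at the sentence level and the system level, the strength of that correlation varies with the model and the language pair.

2) BERTScore also cannot evaluate every language. See the list of supported languages for details.

3) In addition, computing BERTScore requires downloading a BERT model, which can be inconvenient.

For example, the 'roberta-large' model trained on English (en) takes up 1.4GB. Downloading a smaller model (e.g. 'distilbert-base-uncased' at 268MB) is also an option. See the list of compatible models for the available choices.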