
The Impact of Word Representations on Sequential Neural MWE Identification (Nicolas Zampieri, Carlos Ramisch, Geraldine Damnati, 2019)

김아다만티움 2021. 8. 7. 15:48

Nicolas Zampieri, Carlos Ramisch, Geraldine Damnati. The Impact of Word Representations on Sequential Neural MWE Identification. Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), Aug 2019, Florence, Italy. pp. 169-175. DOI: 10.18653/v1/W19-5121

 

<Prior Work>

1. Finding MWEs in running text (Constant et al., 2017)

2. PARSEME shared task 1.1 (Ramisch et al., 2018)

3. FastText (character n-grams; Bojanowski et al., 2017)

4. 'Character-based embeddings have been shown useful to predict MWE compositionality out of text (Hakimi Parizi and Cook, 2018)'

 

<Research Focus>

verbal MWE (VMWE) identification

    -lemmas vs. surface forms

    -traditional word embeddings vs. subword representations

Target languages: French, Polish, Basque (Basque being the morphologically rich one)

 

<Experimental Method>

1. Corpora

● PARSEME shared task 1.1 VMWE-annotated corpora

    -Basque: 117,000 tokens, morphologically rich (richness 2.32)

    -French: 420,000 tokens, high rate of discontinuous VMWEs (42.12%)

    -Polish: 220,000 tokens

 

  2. Experimental architecture

Veyn: sequence tagging using RNNs (see the sketch below)

    -concatenates the embeddings of each word's features (lemmas, POS, ...)

    -output: CRF layer

    -tagging in BIOG+category format

    -trained on the shared task training corpora

    -validation: uses the dev corpora
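
The Veyn tagger described above is a recurrent sequence labeller. The following is a minimal sketch of that kind of architecture, not the authors' code: the feature set, layer sizes, and the plain linear output (standing in for the CRF layer the paper uses) are illustrative assumptions.

```python
# Minimal sketch of a Veyn-style tagger (illustrative, not the authors' code):
# per-feature embeddings (e.g. lemma + POS) are concatenated, passed through a
# bidirectional GRU, and projected onto the BIOG+category tag set.
# The paper decodes with a CRF output layer; a plain linear layer stands in here
# to keep the sketch dependency-free.
import torch
import torch.nn as nn

class VeynLikeTagger(nn.Module):
    def __init__(self, lemma_vocab, pos_vocab, n_tags,
                 lemma_dim=250, pos_dim=20, hidden_dim=128):  # sizes are assumptions
        super().__init__()
        self.lemma_emb = nn.Embedding(lemma_vocab, lemma_dim)
        self.pos_emb = nn.Embedding(pos_vocab, pos_dim)
        self.rnn = nn.GRU(lemma_dim + pos_dim, hidden_dim,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)  # one score per tag, e.g. O, B-VID, I-VID, ...

    def forward(self, lemma_ids, pos_ids):
        # concatenate the embeddings of each word's features
        x = torch.cat([self.lemma_emb(lemma_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # (batch, seq_len, n_tags)

tagger = VeynLikeTagger(lemma_vocab=10_000, pos_vocab=20, n_tags=25)
scores = tagger(torch.randint(0, 10_000, (1, 6)), torch.randint(0, 20, (1, 6)))
print(scores.shape)  # torch.Size([1, 6, 25])
```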

Embeddings: both surface forms and lemmas are embedded

    -uses word2vec and FastText (see the sketch below)

No contextual embeddings: ELMo and BERT were left out, as they would have fallen under a different track
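
A toy sketch (hypothetical sentences and parameters, not the paper's actual training setup) of why FastText's character n-grams matter: word2vec has no vector for an unseen inflected form, while FastText composes one from subword n-grams, which is what makes subword representations attractive for morphologically rich languages.

```python
# Toy contrast between word2vec and FastText (gensim), on a made-up corpus.
from gensim.models import Word2Vec, FastText

sentences = [["prendre", "une", "décision"],
             ["prend", "des", "décisions"]]  # hypothetical toy data

w2v = Word2Vec(sentences, vector_size=50, min_count=1)
ft = FastText(sentences, vector_size=50, min_count=1)

print("prenait" in w2v.wv.key_to_index)  # False: unseen form is OOV for word2vec
vec = ft.wv["prenait"]                   # FastText builds it from character n-grams
print(vec.shape)                         # (50,)
```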

Evaluation metrics

    -MWE-based measure: F1 score for fully predicted VMWEs

    -token-based measure: F1 score for tokens belonging to a VMWE.
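
A rough illustration of the two measures (a simplification for intuition, not the official PARSEME evaluation script), assuming gold and predicted VMWEs are represented as tuples of token indices:

```python
# Simplified MWE-based vs. token-based F1 (illustrative only).
def f1(tp, n_pred, n_gold):
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def mwe_based_f1(gold, pred):
    # an expression counts only if all of its tokens are predicted
    return f1(len(gold & pred), len(pred), len(gold))

def token_based_f1(gold, pred):
    # partial credit for individual tokens belonging to some VMWE
    gold_tok = {t for mwe in gold for t in mwe}
    pred_tok = {t for mwe in pred for t in mwe}
    return f1(len(gold_tok & pred_tok), len(pred_tok), len(gold_tok))

gold = {(2, 3, 5)}   # a discontinuous VMWE over tokens 2, 3 and 5
pred = {(2, 3)}      # the prediction misses token 5
print(mwe_based_f1(gold, pred))    # 0.0 -- the full expression is not matched
print(token_based_f1(gold, pred))  # 0.8 -- two of its tokens are still credited
```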

 

<Conclusions>

word2vec: does not locate MWE boundaries well, but finds the parts of MWEs well; excels at finding single-token VMWEs

In terms of the metric scores, FastText gives better results and is better at finding whole expressions

The more morphologically rich the language, the more lemmas help; the best performance on morphologically rich languages comes from forms+lemmas

In conclusion, subword representations help MWE identification, and the more morphologically rich the language, the more lemmas+forms help