Verb+Noun Multiword Expressions: A linguistic analysis for identification and translation

Multiword Expressions (MWEs) are combinations of words which exhibit some kind of idiosyncrasy. Due to their idiosyncratic nature, they pose several problems for Natural Language Processing (NLP). In this PhD, two of the most challenging tasks concerning MWE processing are addressed: the automatic identification of MWE occurrences in corpora and their translation in Machine Translation (MT). On the one hand, to test whether the use of specific linguistic data was beneficial for MWE identification, an in-depth analysis of Spanish verb+noun MWEs was undertaken in which lexical and morphosyntactic data were carefully considered. These data were used to identify occurrences of the studied MWEs, improving on results reported by related work. On the other hand, the Basque translations of the studied MWEs were also analysed along lexical and morphosyntactic dimensions. This additional information was then added to a rule-based MT system, and an improvement in MT quality was observed, both in a manual evaluation and according to statistical measures. All the analysed linguistic data were collected in a publicly available database, which can be either queried online or downloaded in full for NLP-related purposes. Finally, to complete the analysis of Basque MWEs, verbal MWEs were annotated in a Basque corpus, which was then released along with annotated corpora in 19 other languages. Part of this multilingual corpus served as the basis for a subsequent study on literal occurrences of MWEs, carried out in five languages from different phylogenetic families, including Basque. Both the annotation and the study on literal occurrences are included in this PhD.

Authors (IXA members): 
Uxoa Iñurrieta
Public documents: 
Itziar Aduriz, Gorka Labaka

