Scaling language models for low-resource languages

Short description, required if the project has no logo (eu):

Use of computational resources in the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources

Large language models (LLMs) are at the core of the current Artificial Intelligence revolution and have laid the groundwork for enormous advances in Natural Language Processing. Building LLMs requires large resources, both in compute and in data. As a result, only a few private companies are currently able to train them. Consequently, LLMs are usually built for high-resource languages such as English, while many other languages, especially those with scarce resources, risk being left far behind. Several proposals have been made to adapt pre-trained LLMs to new languages, but the attempts so far have usually worked with small models. In this project, we propose to use the computational resources of the EuroHPC SuperComputer to scale up the experiments and build very large models for low-resource European languages. By varying the scale of compute and data, we will analyze whether the models are able to adapt easily to many tasks. The results of the project will help foster NLP applications in these languages and close the gap between minority languages and English.

Short description, required if the project has no logo (en):

Use of computational resources in the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources

Description (en):

Large language models (LLMs) are at the core of the current AI revolution and have laid the groundwork for tremendous advances in Natural Language Processing. Building LLMs requires huge resources, both in compute and in data, and only a handful of private companies can afford the extreme amount of computational power required to train them. As a result, LLMs shine in high-resource languages like English but lag behind in many others, especially those where training resources are scarce, including many regional languages in Europe. There have been several proposals in the literature to adapt pre-trained LLMs to new languages, but all past efforts focus on models of relatively small size. In this project, we propose to use the computational resources of the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources. By varying the compute and data scale, we will analyze whether the models exhibit emergent capabilities that allow them to be easily adapted to many tasks. The results of the project will help foster NLP applications in these languages and close the existing gap between minority languages and English.

Short description, required if the project has no logo (es):

Use of computational resources in the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources

Description (es):

Large language models (LLMs) are at the core of the current AI revolution and have laid the groundwork for tremendous advances in Natural Language Processing. Building LLMs requires huge resources, both in compute and in data, and only a handful of private companies can afford the extreme amount of computational power required to train them. As a result, LLMs shine in high-resource languages like English but lag behind in many others, especially those where training resources are scarce, including many regional languages in Europe. There have been several proposals in the literature to adapt pre-trained LLMs to new languages, but all past efforts focus on models of relatively small size. In this project, we propose to use the computational resources of the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources. By varying the compute and data scale, we will analyze whether the models exhibit emergent capabilities that allow them to be easily adapted to many tasks. The results of the project will help foster NLP applications in these languages and close the existing gap between minority languages and English.

Official code:

EHPC-EXT-2023E01-013

Principal investigator:

Aitor Soroa

Institution:

EuroHPC Joint Undertaking

Department:

Hitz Zentroa

Start date:

2023/10/10

End date:

2024/10/10