5-9 June 2023, Paris (France)
Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects
Niyati Bafna 1,*, Cristina España-Bonet 2,*, Josef Van Genabith 2,3,*, Benoît Sagot 1,*, Rachel Bawden 1,*
1 : Inria
L'Institut National de Recherche en Informatique et en Automatique (INRIA)
2 : DFKI GmbH, Saarland Informatics Campus
3 : Saarland University, Saarland Informatics Campus
* : Corresponding author

Neural language models play an increasingly central role in language processing, given their success on a range of NLP tasks. In this study, we compare canonical strategies for language modeling in low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and to the standard mid-resource dialect, Hindi. The strategies we evaluate include from-scratch pretraining and cross-lingual transfer, both between the dialects and from different kinds of off-the-shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms the other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
