Dr. Mélanie Jouitteau

Title: Working with and from the speaking communities

Abstract

I present three multidisciplinary participatory science projects addressing these challenges, deployed for Breton, a highly endangered Celtic language whose 110,000 speakers are bilingual with French in Western Brittany.

The ENTRELANGUES wiki site inventories the 96 languages of the French State, together with its immigration languages. For each language, the available documentation and resources are aggregated so that the members of the triumvirates can find each other and work together. The ARBRES wikigrammar supports a descriptive grammar of the language. By design, the illustrative examples of the grammar constitute a corpus of exceptionally high structural diversity, over-representing rare syntactic structures. Research has shown this corpus to be a high-quality product for the fine-tuning of translation AI models (Grobol & Jouitteau 2024).

The YAR ANR project addresses aligned sound/text corpora. It consists of a phone and web application that collects sound clips to be geotagged on a map, and a platform to transcribe them collectively. It transversally addresses several scientific and social needs. First, the mapping of the recordings increases the visibility of this highly minoritized language in public uses and supports specific cultural practices, including teaching. In doing so, it addresses variation at its source, because the collected Breton varieties will reflect those in actual contemporary use, providing raw data of direct interest for transcription practices (Jouitteau, Antoine, Grobol & Millour 2025).

Biography

Dr. Mélanie Jouitteau has been a linguist at the CNRS in France since 2007. She specializes in grassroots collaborative science projects in minoritized contexts, and in language revitalisation through digital development.

In the pressing age of AI, her long-term focus on Breton, a highly endangered Celtic language, led her to the conclusion that reducing the digital gap is now at the heart of the preservation of linguistic diversity, especially in contexts of bilingualism with a dominant variety, as is the case for the non-French languages of France. In practice, reducing the digital gap means that mostly minoritized and economically challenged communities must provide NLP developers with suitably licensed and accurately diverse linguistic data, enriched by community-made metadata labeling. For that to happen, Mélanie advocates for building triumvirates of NLP specialists, sociolinguists, and formal and descriptive linguists, and for having these triadic alliances together address three priorities:

(i) establishing conditions for community sovereignty over the quality of distributed data and metadata (for example, by providing tools that enhance community agency in resource building);
(ii) establishing conditions for community sovereignty over the evaluation of the outputs of tools (for example, tools for crowdsourcing evaluation sets with easy cross-validation); and
(iii) the community's transition to copyright practices compatible with AI training, allowing for sharing and modification (for example, by providing tool sets for the production of natively open data, with built-in gamified annotation features).

In accordance with the fragile nature of under-resourced ecosystems, this action plan has to be powered by sustainable practices of resource sharing (FAIR practices, open science), participatory science tools, and cross-community sharing and recycling for pedagogical uses. Often carried out far removed from the communities themselves, this overt interventionism carries the same risks that have been documented throughout the history of colonialism. Even in urgent situations, intervention must be approached with the same care as any medical intervention: "First, do no harm."

Dr. Nasredine Semmar

Title: Multilinguality in Large Language Models

Abstract

General-purpose Large Language Models (LLMs) have achieved impressive performance in a wide range of Natural Language Processing (NLP) tasks and applications. However, the best-performing LLMs today are those built for resource-rich languages, where annotated and non-annotated corpora are available. In addition, due to their training on linguistically and culturally diverse data, LLMs are particularly susceptible to generating biased or inaccurate outputs across varied linguistic contexts. To extend LLMs to new languages, several approaches have been proposed. Notable examples include initiatives that enhance multilingual performance without parameter adjustments through translation, representation alignment, and prompting; methods that focus on improving multilingual abilities for a single task via cross-lingual transfer; and techniques that aim to enhance multilingual proficiency by continual training in one language to obtain monolingual LLMs.

In this talk, we will discuss the main challenges faced in extending LLMs to new languages. This will cover the fundamental concepts behind LLMs and their architectures, the current state-of-the-art LLMs, and the different approaches to extending their abilities to handle multiple languages and, more specifically, low-resource languages. Concretely, we will first formally define monolingual and multilingual large language models. Then, we will introduce the widely used LLMs and summarize recent progress as well as emerging trends in the multilingual large language model (MLLM) literature. We will also present some resources, including open-source software and the diverse corpora involved in training these models, and address the current explainability and interpretability methods for MLLMs.

Biography

Dr. Nasredine Semmar is a Director of Research at CEA List – Université Paris-Saclay. Before joining CEA List in August 2002, he worked in industry from June 1996 to July 2002, first at Lionbridge Technologies–Bowne Global Solutions as an R&D engineer and then at SAP-Business Objects as an expert in software internationalization and localization. His current research interests include emerging methods and technologies in cutting-edge areas of Natural Language Processing (NLP) and Artificial Intelligence (AI), and his expertise focuses on the use of Generative AI (GenAI) for inducing multilingual resources and tools for low-resource languages. Dr. Nasredine Semmar has supervised five completed PhD theses and is currently supervising four PhD students. He has published more than 120 papers in refereed journals and conferences, serves on the editorial board of the "Natural Language Processing" Journal, and is a member of the Scientific Committees of major NLP conferences. Dr. Nasredine Semmar participated in the EVALDA-ARCADE II evaluation campaign in the field of sentence and word alignment from parallel corpora, and in the shared task on the discrimination and identification of Similar Languages, Varieties and Dialects at the VarDial workshops. He has coordinated and participated in more than 20 EU FP7, H2020, international, and national research projects. He served as a keynote speaker at the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT 2016) and at INFOL@NGUES 2019, and gave a tutorial at the 16th International Conference on Human System Interaction (HSI 2024). He has been co-chair of the NLP Track of the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) since 2020.