Dr. Mélanie Jouitteau

Title: Working with and from the speaking communities

Abstract

In the pressing age of AI, reducing the digital gap is a matter of survival for most of the world's languages. In practice, reducing this gap requires communities that are mostly minoritized and economically challenged to provide NLP developers with suitably licensed and accurately diverse linguistic data, enriched by community-made metadata labeling.
Dr. Jouitteau presents two citizen science projects that address these challenges and have been deployed for Breton, a highly endangered Celtic language whose 110,000 speakers are bilingual in French.
The ARBRES project supports a descriptive wikigrammar of the language. By design, the illustrative examples of the grammar constitute a corpus of exceptionally high structural diversity, which makes it a high-quality resource for fine-tuning AI translation models (Grobol & Jouitteau 2024).
The YAR project addresses aligned sound/text corpora. It consists of a phone application and a web application that collect sound clips to be geotagged on a map, and a platform for transcribing them collectively. It addresses several scientific and social needs at once. First, mapping the recordings makes this highly minoritized language visible in public use and supports various cultural practices, including teaching. In doing so, it also addresses variation at its source: the collected Breton varieties will reflect those in actual contemporary use, providing raw data of direct interest for transcription practices (Jouitteau, Antoine, Grobol & Millour 2025).

Biography

Dr. Mélanie Jouitteau has been a researcher on the Breton language at the CNRS in France since 2007. She specializes in grassroots collaborative science projects in minoritized contexts, in formal and descriptive linguistics, and in the multidisciplinary bridges between these fields and both sociolinguistics and NLP.

Dr. Nasredine Semmar

Title: Multilinguality in Large Language Models

Abstract

General-purpose Large Language Models (LLMs) have achieved impressive performance in a wide range of Natural Language Processing (NLP) tasks and applications. However, the best-performing LLMs today are those built for resource-rich languages, for which annotated and non-annotated corpora are available. In this talk, we will discuss the main challenges faced in extending LLMs to new languages. This will cover the fundamental concepts behind LLMs and their architectures, the current state-of-the-art LLMs, and the different approaches to extending their abilities to handle multiple languages, and more specifically low-resource languages.

Biography

Dr. Nasredine Semmar is a Director of Research at CEA List – Université Paris-Saclay. He obtained his PhD in computer science from University of Paris Sud (France) in 1995 and received an Accreditation to Supervise Research (HDR) from Paris-Saclay University in 2021. He worked in industry from 1996 to 2002, first at Lionbridge Technologies–Bowne Global Solutions as an R&D engineer and then at SAP-Business Objects as an expert in software internationalization and localization. Dr. Nasredine Semmar joined CEA List in 2002, and his current research interests include emerging methods and technologies in cutting-edge areas of Natural Language Processing (NLP) and Artificial Intelligence (AI). His expertise centers on the use of Generative AI (GenAI) for inducing multilingual resources and tools for low-resource languages. Dr. Nasredine Semmar has supervised five completed PhD theses and is currently supervising four PhD students. He has published more than 120 papers in refereed journals and conferences, is on the editorial board of the “Natural Language Processing” Journal, and is a member of the Scientific Committees of major NLP conferences (ACL, EMNLP, NAACL, IJCAI, COLING, LREC…). Dr. Nasredine Semmar participated in the EVALDA-ARCADE II evaluation campaign on sentence and word alignment from parallel corpora and in the shared task on the discrimination and identification of Similar Languages, Varieties and Dialects at VarDial workshops. He has coordinated and participated in more than 20 research projects, including EU FP7, H2020, and other international and national projects. He has been a keynote speaker at the 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT 2016) and at INFOL@NGUES 2019, and has given a tutorial at the 16th International Conference on Human System Interaction (HSI 2024).
He has been co-chair of the NLP track of the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) since 2020.