• OC-0002 - From Citizen Science to Digital Dictionary Database
The Client : ( OC-0002 )
Project type: Research projects ARRS
Project duration: 2025 - 2026
  • Description

Openly available language resources, such as digital dictionary databases, play a critical role in advancing natural language processing and the development of language technologies. Today, these databases are increasingly leveraged to enhance large language models with high-quality linguistic data, paving the way for improved generative AI tools and solutions, a goal also pursued by the LLM4DH (Large Language Models for Digital Humanities) project (ARIS GC-0002). Slovene, spoken by a small community of approximately 2 million people, requires proactive and innovative approaches to resource development. To address this need, we have introduced the responsive dictionary concept, which accelerates the creation of openly available dictionary data by combining traditional lexicography with machine-assisted methods and citizen participation. In the responsive Thesaurus of Modern Slovene, first published in 2018, citizens are encouraged to contribute their own suggestions of synonyms and antonyms to enhance the dictionary's quality and scope (https://viri.cjvt.si/sopomenke/eng/). Citizen participation has already proven invaluable, with over 1,300 contributors providing more than 75,000 suggestions. This engagement captures linguistic diversity often missed by traditional methods, including dialectal expressions, slang, and emerging terminology. By involving the public directly, the dictionary becomes a dynamic and democratic resource, reflecting real-world language use, empowering speakers to actively contribute to the preservation and documentation of their language, and bridging the gap between professional lexicography and community knowledge. Currently, the synonyms and antonyms contributed by citizens are displayed in the interface of the Thesaurus of Modern Slovene, but they have not yet been incorporated into openly accessible dictionary databases.
This project aims to lexicographically validate the contributions and integrate them into the Digital Dictionary Database for Slovene, enhancing its value for the LLM4DH project. As part of this work, the responsive concept and the Thesaurus interface will be upgraded, enabling us to share the project's insights with both the research community and citizen contributors. By showcasing the improved usefulness of the data they have collected, we aim to further motivate collaboration and deepen citizen engagement in the data collection process. The validated data (at least 45,000 citizen-collected synonyms and antonyms) will be published in the CLARIN.SI repository, ensuring open access and long-term availability for further use in research and development. The project builds on an already successful citizen participation initiative and brings it to completion, creating a concrete and holistic example of good practice that highlights the value and impact of such collaborative efforts in modern society.