The Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (teMeljnE raZiskave Za rAzvoj govorNih vIrov in tehNologij za slovEnščino, project ID: J7-4642) is a large basic research project financed by the Slovenian Research and Innovation Agency for the period from October 2022 to the end of September 2025. Within this interdisciplinary project, corpus linguists, dialectologists, phoneticians, lexicographers, sociolinguists, language technologists, and other researchers collaborate toward a common goal: the strategic and efficient development of open-access speech resources. These resources are indispensable for in-depth and broadly relevant studies of spoken language in numerous disciplines, such as phonetics, phonology, dialectology, grammar, lexicography, sociolinguistics, and speech technology.
The planned project results include technical recording guidelines, the standardization of dialectal phonemes, the annotation of speech corpora on various levels, and an approach to the treatment of spoken lexis in lexicographic resources. Moreover, we will prepare a pipeline for the automatic linguistic annotation of speech corpora, as well as diachronic and synchronic phonetic (dialectal) maps. Sloleks, the lexicon of Slovene words, will be expanded to include data on spoken lexis. Additionally, a training corpus with manually annotated prosodic units, disfluencies, and dialogue acts will be created. The basic research conducted within the project will be presented in original scientific publications. It covers spoken language resources; approaches to recording and transcribing speech; automatic speech recognition for Slovene; dialectology; prosodic units; speech disfluencies; part-of-speech annotation, lemmatization, and syntactic parsing; dialogue acts; and the canonical forms of non-standard spoken lexis and its lexicographic description. The results can be followed at https://mezzanine.um.si/en/results/.