EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media

Recently, deep neural networks have significantly increased the success of artificial intelligence approaches to natural language, from speech recognition to machine translation and text understanding tasks such as sentiment analysis. However, the success of deep learning relies on the availability of large annotated datasets in the required language and domain. Most modern machine learning models for language processing use word embeddings: representations of words not as symbols but as vectors of numerical values. These vectors encode important information about word meaning and can preserve semantic relations between words. This holds even across languages: word embedding spaces exhibit similar structures in different languages. By aligning embeddings produced independently from monolingual text resources, we obtain a common cross-lingual representation, which allows fast and effective integration of information in different languages. This cross-lingual mapping therefore offers great potential for less-resourced languages: machine learning tools can be developed using one language's resources but operate on another.
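To make the idea of aligning independently produced embeddings concrete, the following is a minimal toy sketch, not the project's actual method. It assumes two-dimensional embeddings and fits a single rotation from a small seed dictionary, which is the two-dimensional, rotation-only special case of orthogonal Procrustes alignment. The vocabularies, the vectors, the English-Slovenian language pair, and the helper names `fit_rotation`, `rotate`, and `nearest` are all hypothetical and chosen only for illustration.

```python
import math

def fit_rotation(source, target):
    """Angle of the rotation that best maps source vectors onto target vectors.

    Maximises sum(target_i . R(theta) source_i) over rotations; in 2-D the
    optimum has the closed form theta = atan2(sum of cross, sum of dot).
    """
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)

def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def nearest(vec, vocab):
    """Word in vocab whose vector is closest (Euclidean) to vec."""
    return min(vocab, key=lambda word: math.dist(vec, vocab[word]))

# Hypothetical toy "English" embeddings (made-up values).
en = {"dog": (1.0, 0.0), "water": (0.8, 0.6), "sea": (-0.6, 0.8)}

# Hypothetical toy "Slovenian" space: the same geometry, rotated by 40 degrees,
# standing in for a space trained independently on monolingual text.
angle = math.radians(40)
sl = {
    "pes": rotate(en["dog"], angle),
    "voda": rotate(en["water"], angle),
    "morje": rotate(en["sea"], angle),
}

# Seed dictionary: a few known translation pairs used to fit the mapping.
seed = [("dog", "pes"), ("water", "voda")]
theta = fit_rotation([en[e] for e, _ in seed], [sl[s] for _, s in seed])

# Map a word that was NOT in the seed dictionary and look up its neighbour.
mapped = rotate(en["sea"], theta)
translation = nearest(mapped, sl)
print(translation)
```

In this sketch the mapping is fitted on two word pairs only, yet it places the unseen word "sea" next to its counterpart in the other space. Real embedding spaces have hundreds of dimensions and are only approximately similar in structure, so practical alignment methods are considerably more involved.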

In Europe, advanced natural language research and resources exist for a few dominant languages (English, French, German), while many of the EU's smaller language communities, and the news media industry that serves them, lack appropriate tools. The EMBEDDIA project seeks to address these challenges by leveraging innovations in cross-lingual embeddings coupled with deep neural networks, allowing existing monolingual resources to be used across languages. Over three years, the project's six academic and four industry partners will develop novel solutions for under-represented languages and test them in real-world news and media production contexts.

Collaborators
