Upgrade of the Gigafida, Kres, ccGigafida and ccKres corpora
Gigafida is a reference corpus of the Slovene language. It contains Slovene texts from daily newspapers, magazines, books of all kinds, web pages, transcripts of parliamentary speeches and other sources, altogether around 1.2 billion words in 40,000 documents. It is the basis for the balanced corpus Kres and for the freely available corpora ccGigafida and ccKres. These corpora currently contain documents created up to 2012. More information about the corpora is available at http://www.slovenscina.eu/korpusi/
The project to upgrade these corpora has three goals: collecting new material, machine processing of new and existing documents, and making the upgraded corpora publicly available, including their distribution and dissemination. The collection of new material will focus on currently underrepresented text types (such as textbooks and other primary and secondary school materials), news portals and daily newspapers. The aim is to grow the Gigafida corpus to 1.5 billion words. Machine processing will automatically tag all documents in a uniform way and store them in a standardized format. The documents will also be deduplicated. The updated corpora will be publicly available through concordancers in the CLARIN infrastructure and will be presented to the general public and the professional community.
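
The description above does not say how deduplication will be carried out. As a rough illustration only, the sketch below removes exact duplicates by hashing each document after a simple normalization (lowercasing and collapsing whitespace). The function names and the normalization rule are assumptions made for this example, not the project's actual pipeline, and near-duplicate detection would require a different technique.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: dict[str, str]) -> dict[str, str]:
    """Keep the first document seen for each distinct normalized text.

    `documents` maps a (hypothetical) document ID to its plain text.
    """
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for doc_id, text in documents.items():
        # Hash the normalized text; identical hashes are treated as duplicates.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[doc_id] = text
    return unique


if __name__ == "__main__":
    docs = {"a": "Dober dan.", "b": "dober   dan.\n", "c": "Lep pozdrav."}
    print(list(deduplicate(docs)))
```

Hashing keeps memory use proportional to the number of distinct documents rather than to their total length, which matters at the scale of tens of thousands of documents.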