In this project we will investigate and propose lexico-syntactic constraints for open information extraction in Slovene and Serbian. Open information extraction (Open IE) was introduced in 2007 (Banko and Etzioni, 2007) to build unlexicalized, domain-independent extractors that scale to web corpora without relying on hand-labeled training examples or on domain-specific verbs and nouns. The goal of Open IE is thus to extract triples of the form (subject, predicate, object).
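To make the target representation concrete, the sketch below shows one way such a triple could be represented; the class and the example extraction are purely illustrative, not part of an existing system.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """An Open IE extraction: (subject, predicate, object)."""
    subject: str
    predicate: str
    object: str

# Illustrative extraction from the Slovene sentence
# "Ljubljana je glavno mesto Slovenije." (Ljubljana is the capital of Slovenia.)
example = Triple(subject="Ljubljana",
                 predicate="je glavno mesto",
                 object="Slovenije")
```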
Traditional information extraction (IE) (Sarawagi, 2008) consists of three main tasks: named entity recognition, relationship extraction and coreference resolution. Existing methods are mostly supervised and must therefore be trained on hand-labeled corpora, which can be very domain-specific. In our case, no datasets with labeled data for IE tasks exist for either Slovene or Serbian. We can only employ existing preprocessing tools, such as lemmatizers, part-of-speech taggers and shallow parsers. We will use these to syntactically label the input data and then investigate the texts further.
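As an illustration of the kind of syntactic labeling we rely on, the sketch below shows a possible enriched-token representation; the Token class, the example sentence and the simplified tags are our own illustration, not the output format of any particular tagger.

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One token of the input text, enriched by the preprocessing tools."""
    form: str    # surface form as it appears in the text
    lemma: str   # canonical form produced by a lemmatizer
    pos: str     # part-of-speech tag
    case: str    # grammatical case for nouns/adjectives, "" otherwise

# Hand-written example of a tagged Slovene sentence
# ("Ljubljana je glavno mesto Slovenije."); tags are simplified for illustration.
sentence = [
    Token("Ljubljana", "Ljubljana", "NOUN", "nom"),
    Token("je",        "biti",      "VERB", ""),
    Token("glavno",    "glaven",    "ADJ",  "nom"),
    Token("mesto",     "mesto",     "NOUN", "nom"),
    Token("Slovenije", "Slovenija", "NOUN", "gen"),
]
```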
Slavic languages are highly inflected, which can be very useful when defining constraints to extract domain-independent relationships (Przepiorkowski, 2007). Slovene and Serbian share a similar structure, which is why we are designing a single framework to be used for both languages. For each language, its own preprocessing techniques and constraints will be used to extract relationships from the given dataset.
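A minimal sketch of how a lexico-syntactic constraint could exploit such inflectional information, assuming the Triple and Token representations from the sketches above; the constraint format and the matching logic are simplified illustrations, not the final constraint language (real constraints would, for example, have to allow intervening modifiers).

```python
# One way to express a constraint: a sequence of requirements on POS tags and
# grammatical case. Case is exactly the inflectional clue that helps separate
# genuine relations from accidental noun-verb-noun sequences.
CONSTRAINT = [
    {"pos": "NOUN", "case": "nom"},   # subject in nominative
    {"pos": "VERB"},                  # predicate
    {"pos": "NOUN", "case": "acc"},   # object in accusative
]

def matches(token, requirement):
    return all(getattr(token, key) == value for key, value in requirement.items())

def extract(tokens, constraint):
    """Slide the constraint over the sentence and yield every matching triple."""
    n = len(constraint)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(matches(t, r) for t, r in zip(window, constraint)):
            yield Triple(window[0].form, window[1].lemma, window[2].form)

# "Peter bere knjigo." (Peter reads a book.) would yield ("Peter", "brati", "knjigo").
```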
We will use newswire corpora retrieved from the web and balanced text collections developed by linguists; these datasets were also used to develop the preprocessing tools available today. Along with text extracted from the web, we will use the following sources:
- For Slovene: JOS100k, Gigafida, Kres (http://www.slovenscina.eu/)
- For Serbian: SrpKor2013, SrpLemKor (http://korpus.matf.bg.ac.rs/prezentacija/korpusi.html)
As these corpora are not semantically labeled, we will evaluate the results using manual supervision and distant comparison. The distant comparison will be performed automatically by comparing the types of the extracted triples to the Slovene and Serbian parts of DBpedia (http://dbpedia.org/). The main goal of the evaluation is to assess how many of the extracted relationships are genuine relations rather than arbitrary noun-verb-noun tuples.
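A rough sketch of the distant-comparison step, assuming the public DBpedia SPARQL endpoint is reachable; the endpoint URL, the language tag and the query below are our own simplified illustration of type lookup (for the Slovene and Serbian chapters the endpoint and tag would change accordingly), not the final evaluation procedure.

```python
import requests

DBPEDIA_SPARQL = "http://dbpedia.org/sparql"   # assumed public SPARQL endpoint

def dbpedia_types(label, lang="en"):
    """Return the rdf:type values DBpedia lists for a resource with this label."""
    query = """
        PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?type WHERE {
          ?entity rdfs:label "%s"@%s .
          ?entity rdf:type ?type .
        }
    """ % (label, lang)
    response = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query, "format": "application/sparql-results+json"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return {b["type"]["value"] for b in bindings}

# The types found for a triple's subject and object would then be compared
# against the types observed for known DBpedia relations.
```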
The main goals of the collaboration will be:
a) Design and development of a framework for investigating text corpora and defining lexico-syntactic rules: The framework will support enrichment of the input dataset with preprocessing techniques, tuple extraction according to given constraints, graphical constraint definition, filtering of results for iterative refinement, and evaluation techniques (see the sketch after this list).
b) A ranked list of lexico-syntactic constraints: a list of the constraints that give the best results according to the evaluation measures. We will try to justify the use of each constraint and align it with knowledge from the field of linguistics.
c) A public web service for importing and searching over the corpora: the final methods will be offered as a publicly available web service, which will enable the linguistics community and the interested public to gain better insight into the languages and to discover new knowledge. The web service will be similar to the service available for English (http://openie.allenai.org/), which is based on a state-of-the-art Open IE system. In addition, we will offer the definition of further constraints and the uploading of custom datasets.
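The sketch below outlines the iterative refinement loop behind goal a); it reuses the extract function from the earlier sketch, and the evaluate callable stands in for the manual and distant evaluation described above. This is an architectural sketch only, not an existing implementation.

```python
def refinement_loop(corpus, constraints, evaluate, n_rounds=3):
    """Iteratively extract triples, score each constraint, and drop the weakest.

    corpus      -- enriched sentences (lists of Token objects)
    constraints -- candidate lexico-syntactic constraints
    evaluate    -- callable mapping a list of triples to a quality score
    """
    for _ in range(n_rounds):
        # triples extracted per constraint
        extractions = {i: [t for s in corpus for t in extract(s, c)]
                       for i, c in enumerate(constraints)}
        scores = {i: evaluate(triples) for i, triples in extractions.items()}
        # keep the better half of the constraints for the next round
        keep = sorted(scores, key=scores.get, reverse=True)[:max(1, len(constraints) // 2)]
        constraints = [constraints[i] for i in keep]
    return constraints
```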
We will jointly develop the framework and the web service, while the in-depth investigation of language-specific features of the datasets and the definition of constraints will be done separately, as we need to understand the investigated texts in order to define new constraints. After that we will jointly