Machine learning of imbalanced data

An important group of methods for solving these problems is based on constructing ensembles of base learners. Bagging, boosting, random forests, and their variants are the most popular examples of this methodology, and due to their classification strength and efficiency they are among the state-of-the-art methods in classification and regression. When constructing an efficient prediction model, these methods evaluate the predictive power of descriptive attributes and construct a recursive partitioning of the problem space. The aim of our cooperation is to develop and test a new class of distance-based and uniform-distribution-based attribute evaluation measures that are better suited to the problem of imbalanced class distribution. Additionally, we will investigate their multiclass extensions. To test the proposed methods, we will design complex artificial data sets that cover a wide range of configurations of regions with different class labels. The developed attribute evaluation measures will be used in a state-of-the-art learning system based on random forests.

The above scientific goals require fast and efficient data mining algorithms and good exploratory data analysis tools. In the data mining community, the open-source R statistical environment is receiving more and more attention for its wide applicability, availability, ease of use, and good visualization capabilities. Our aim in this project is to develop new machine learning algorithms in this environment and thereby make them readily available for further scientific and commercial exploitation.

We expect that this joint research proposal will produce valuable research and practical results of interest to the data mining and machine learning community and to practitioners solving complex problems involving imbalanced data sets, e.g., in medicine, engineering, finance, and the public sector.

Collaborators