Computer based modeling in bioinformatics for gene based cancer classification focused on reliability and machine learning
The Client:
Javna agencija za znanstvenoraziskovalno in inovacijsko dejavnost RS
Project type:
Bilateral projects
Project duration:
2014 - 2015
Description
Cancer research is currently one of the leading fields of clinical research. One major issue in this field is cancer classification for accurate diagnosis and treatment; another is making use of the growing amount of raw microarray data. Until now, the majority of cancer classification studies have been based on the patient's overall clinical picture, including histological findings at the tissue level, and have therefore offered very limited diagnostic precision. This is because many existing tumor classes are heterogeneous, molecularly distinct and follow different clinical courses. A differential diagnosis among a group of histologically similar cancers therefore poses a challenging problem in clinical medicine. At the same time, the extensive use of DNA microarray technology for characterizing cellular processes is producing an increasing amount of microarray data from cancer studies. Even though similar questions are addressed in different studies of the same cancer type, a comparative analysis of their results is hampered by the use of heterogeneous microarray platforms and analysis methods. Hence, a tremendous amount of array data is available today; however, much of it remains raw, and only a small percentage of its potential is being utilized. Cancer classification based on gene expression analysis of microarray data is a way to use this raw data to obtain the most accurate cancer diagnosis. Such classification promises higher specificity in cancer differentiation as well as an improved clinical picture that addresses molecular changes in the diseased tissue, leading to personalized therapy with fewer toxic side effects. Current approaches to gene-based cancer classification use data on different exon mutations from public databases, where an exact classification is possible if an exact match for a new patient is found in a database.
Due to the combinatorial nature of all possible mutations, a controlled sampling of combinations of mutations is necessary, one that takes into account the known parts of the distributions that can be derived from databases. Because many combinations of mutations are missing from the databases, new patients without an exact match can be classified only probabilistically. In addition, the reliability of each single classification produced by such a probabilistic approach is highly important, as is its explanation, which can give the end user (the physician) additional insight into the decision process. The task of the project is to prepare an experimental data set describing patients with a known cancer type and known exon mutations, to supplement this database with negative examples (healthy patients) and to enhance it further using controlled sampling. The enhanced database will be analysed with machine learning algorithms. In particular, feature evaluation techniques will be used to study the importance of particular parts of exons and the interaction of multiple mutations. Further, the database will be analysed with ILP (inductive logic programming) algorithms, which are capable of inducing logical relations between objects and can therefore discover interesting relationships among different mutations. During the project, the ILP algorithms will be adapted to the target problem by LKM (Laboratory for Cognitive Modeling). Besides ILP, state-of-the-art machine learning algorithms such as SVM, random forests, neural networks and their ensembles will be used in order to obtain as accurate a classification as possible. The developed models will be equipped with explanations of single predictions as well as reliability estimates for the provided classifications. The methodology for evaluating the reliability of single predictions, developed by LKM in recent years, will be adapted for this problem.
The state-of-the-art methodology for explaining single predictions, also developed by LKM, will likewise be adapted for this type of classification.
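To make the difference between exact-match and probabilistic classification concrete, the following is a minimal sketch in Python, not the project's actual method. It assumes a hypothetical toy database in which each patient is a set of mutation identifiers with a known label. The identifiers, the labels, the `classify` function and the use of Jaccard similarity over the `k` most similar records are all illustrative assumptions. An exact match returns the stored label directly. Otherwise the class distribution is estimated from similar records, and the spread of that distribution gives a crude sense of how far a single prediction can be trusted.

```python
from collections import Counter

# Hypothetical toy database (illustrative only): each record pairs a set of
# exon mutation identifiers with a known label. The empty set stands for a
# healthy patient, i.e. a negative example.
DATABASE = [
    (frozenset({"E1:m1", "E2:m4"}), "type_A"),
    (frozenset({"E1:m1", "E3:m2"}), "type_A"),
    (frozenset({"E2:m4", "E5:m7"}), "type_B"),
    (frozenset({"E5:m7", "E6:m1"}), "type_B"),
    (frozenset(), "healthy"),
]


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(profile, database=DATABASE, k=3):
    """Return (label, class_probabilities, exact_match_found).

    Assumed scheme: use the exact match when one exists, otherwise weight the
    k most similar records by their Jaccard similarity to the new profile.
    """
    profile = frozenset(profile)

    # Case 1: the exact mutation profile is already in the database.
    exact = Counter(label for muts, label in database if muts == profile)
    if exact:
        total = sum(exact.values())
        probs = {label: n / total for label, n in exact.items()}
        return max(probs, key=probs.get), probs, True

    # Case 2: no exact match, so classify probabilistically.
    scored = sorted(
        ((jaccard(profile, muts), label) for muts, label in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = Counter()
    for similarity, label in scored:
        weights[label] += similarity
    total = sum(weights.values())
    if total == 0:
        # Nothing in the database resembles this profile: abstain.
        return None, {}, False
    probs = {label: w / total for label, w in weights.items()}
    return max(probs, key=probs.get), probs, False


if __name__ == "__main__":
    print(classify({"E1:m1", "E2:m4"}))  # profile present in the database
    print(classify({"E1:m1", "E5:m7"}))  # no exact match, mixed evidence
    print(classify({"E9:m9"}))           # unseen mutation, classifier abstains
```

In this sketch a profile that shares mutations with both cancer types receives a split distribution rather than a single confident answer, which is the situation where reliability estimates and explanations of single predictions matter most to the physician.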