This research field concerns the analysis of potentially large biological datasets to discover relevant knowledge.

Research projects

Since March 2013, this research activity has been funded by the GenData2020 Italian project, which is in turn funded by MIUR.

Microarray data analysis for selecting genes relevant for tumor classification


Feature selection is a fundamental task in microarray data analysis. It aims at identifying the genes that are most strongly associated with a tissue category, disease state or clinical outcome, while discarding the insignificant, noisy and redundant ones. Effective feature selection allows biologists to investigate only a subset of genes instead of the entire dataset. Furthermore, if applied before a learning algorithm, it reduces computation costs and can increase classification accuracy.

Painter’s feature selection algorithm

We propose a method which measures the ability of each gene to distinguish among classes, based on an overlap score. It is designed to self-adapt to the distribution of each gene's expression values, with no user intervention for parameter estimation. Adaptability is provided by a density-based technique which reduces the impact of outliers. A multivariate technique is then exploited to select the most relevant subset of genes. As in other approaches, the number of selected genes can be set by the user. However, our algorithm may also automatically detect the minimum set of genes needed to correctly classify all the samples in the training set. The effectiveness of our approach has been evaluated by comparing its performance with other feature selection methods on multicategory databases with different numbers of classes. The experimental results show that the proposed approach leads to high classification accuracies compared with widely used feature selection techniques.
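
As a purely illustrative sketch of what an overlap score can look like (not the published algorithm), the Python snippet below counts, for one gene, how many samples fall inside the expression range of a class other than their own. The plain min/max range is an assumption made for brevity: the density-based adaptation that limits the impact of outliers is not reproduced here, and the names `overlap_score` and `rank_genes` are hypothetical.

```python
def overlap_score(values, labels):
    """Count samples whose value lies inside the [min, max] range of another class.

    Simplified stand-in: the density-based range estimation is NOT modelled here.
    """
    ranges = {}
    for v, c in zip(values, labels):
        lo, hi = ranges.get(c, (v, v))
        ranges[c] = (min(lo, v), max(hi, v))
    score = 0
    for v, c in zip(values, labels):
        # A sample is "overlapped" if some other class range contains its value.
        if any(lo <= v <= hi for other, (lo, hi) in ranges.items() if other != c):
            score += 1
    return score


def rank_genes(expression, labels):
    """Order genes from least to most overlapped (hypothetical helper).

    expression: dict mapping gene name -> list of values aligned with labels.
    """
    return sorted(expression, key=lambda g: overlap_score(expression[g], labels))


labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "g_noisy": [1, 5, 3, 4, 6, 8],     # class ranges [1, 5] and [4, 8] overlap
    "g_clean": [1, 2, 3, 10, 11, 12],  # class ranges are disjoint
}
ranking = rank_genes(expression, labels)
```

In this toy input, a lower score means the gene separates the classes more cleanly, so `g_clean` is ranked before `g_noisy`.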

Minimum number of genes for feature selection

A fundamental problem in microarray analysis is to identify relevant genes from large amounts of expression data. Feature selection aims at identifying a subset of features for building robust learning models. However, finding the optimal number of features is challenging, as it is a trade-off between losing information when pruning excessively and retaining noise when pruning too little. We propose a novel representation of genes as strings of bits and a method which automatically selects the minimum number of genes needed to reach a good classification accuracy on the training set. Our method first eliminates redundant features, which do not add further information for classification, and then exploits a set covering algorithm.
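
The two steps above (redundancy elimination, then set covering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: each gene is assumed to be encoded as an integer bitmask in which bit i is set when the gene alone classifies training sample i correctly, redundancy is assumed to mean "duplicated or strictly contained in another mask", and a standard greedy heuristic stands in for the set covering algorithm (greedy covers are not guaranteed to be minimal). The function name `greedy_gene_cover` is hypothetical.

```python
def greedy_gene_cover(masks, n_samples):
    """Pick genes whose bitmasks together cover as many training samples as possible.

    masks: dict gene -> int, bit i set if the gene classifies sample i correctly
           (assumed encoding, for illustration only).
    Returns (selected_genes, covered_bitmask).
    """
    target = (1 << n_samples) - 1

    # Step 1: drop redundant genes. Keep one gene per distinct mask, then
    # discard masks that are strict subsets of another mask.
    unique = {}
    for gene, mask in masks.items():
        unique.setdefault(mask & target, gene)
    candidates = {
        gene: mask
        for mask, gene in unique.items()
        if not any(mask != other and (mask | other) == other for other in unique)
    }

    # Step 2: greedy set cover. Repeatedly take the gene that covers the
    # largest number of still-uncovered samples.
    selected, covered = [], 0
    while covered != target and candidates:
        best = max(candidates, key=lambda g: bin(candidates[g] & ~covered).count("1"))
        gain = candidates[best] & ~covered
        if gain == 0:  # remaining samples cannot be covered by any gene
            break
        selected.append(best)
        covered |= gain
    return selected, covered


masks = {"g1": 0b0011, "g2": 0b1100, "g3": 0b0001, "g4": 0b0110}
selected, covered = greedy_gene_cover(masks, n_samples=4)
```

Here `g3` is discarded as redundant because every sample it classifies is also classified by `g1`, and the remaining samples are covered by two genes.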

Gene clustering


Grouping together sets of genes which show a similar behavior is a fundamental problem in many biological studies. Genes with similar expression patterns under various conditions or over a time course may be co-regulated or related in functional pathways. Furthermore, gene partitioning allows the analysis to focus on a reduced subset of genes instead of on thousands of genes. Gene partitioning may also be a pre-processing step before a feature selection or classification task, to focus on distinct but still highly informative genes. Clustering is a popular approach for analyzing large datasets and automatically dividing data into meaningful and useful groups. Many conventional clustering algorithms have been applied or adapted to gene expression data, and new algorithms which specifically address gene expression data have recently been proposed.

New classification distance for gene clustering

In general, microarray data are clustered based on the continuous expression values of genes. However, when additional information is available (e.g., tumor classification), it may be beneficial to exploit it to improve cluster quality. Hence, in this work, we define the concept of classification power of a gene, which measures how many samples are correctly classified by a gene, by considering the expression values assumed by the samples of each class. Instead of discovering genes with similar expression profiles, we aim at detecting genes which play an equivalent role in the classification task (i.e., genes that make a similar contribution to patient or tumor classification). Hence, two genes are considered equivalent if they classify all the considered samples in the same way. We defined a new similarity measure between genes, named classification distance, based on the similarity of their gene masks, and we extended it to an inter-cluster distance. This measure has been exploited in a hierarchical clustering algorithm, which iteratively groups genes or gene clusters through a bottom-up strategy. However, the proposed similarity measures are general and may be straightforwardly adopted by other clustering algorithms as well.
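
To make the idea of comparing gene masks concrete, the sketch below uses a normalized Hamming distance between two masks written as strings of '0' and '1' (one character per sample). This is an assumption chosen for illustration: the exact definition of the classification distance and of its inter-cluster extension is not given here, so the average-linkage helper and the names `mask_distance` and `cluster_distance` are hypothetical.

```python
def mask_distance(mask_a, mask_b):
    """Fraction of samples on which two gene masks disagree (normalized Hamming)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("gene masks must cover the same samples")
    if not mask_a:
        return 0.0
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


def cluster_distance(cluster_a, cluster_b):
    """Average pairwise mask distance between two clusters (assumed linkage)."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(mask_distance(a, b) for a, b in pairs) / len(pairs)


# Two genes that classify every sample identically are at distance 0.
same = mask_distance("1100", "1100")
# Genes that disagree on one of four samples are at distance 0.25.
close = mask_distance("1100", "1000")
# Distance between a two-gene cluster and a single-gene cluster.
between = cluster_distance(["1100", "1000"], ["0011"])
```

With such a distance, a bottom-up hierarchical procedure would repeatedly merge the two closest genes or clusters.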

More information can be found on the Classification Distance page.

Gene regulatory networks


A great challenge in bioinformatics is to discover relationships and interactions among genes. In particular, a gene regulatory network aims at representing the relationships that govern the rates at which genes in the network are transcribed into mRNA. By considering single-time-point expression data, it is possible to discover sets of co-regulated genes, which show a similar behavior under different conditions. This analysis can be limiting, because it does not consider interactions among genes which happen with a time delay. In fact, there can be a significant delay between the expression of a regulator gene and its effects (i.e., the activation or inhibition of another gene).

Temporal association rules for gene regulatory networks

Data mining techniques, such as association rule mining, have been successfully used to discover relationships among genes, but previous work has considered only simultaneous relationships or relationships delayed by a single time point. We developed a method to learn relationships among genes by means of temporal association rule mining. It takes time-series gene expression data as input, discretizes them, and maps each value to an expression level by using a suitable function. Different techniques can be exploited for this step. The discretized data are organized in a time-delay matrix and analyzed by the Apriori algorithm, which extracts the association rules. Finally, the rules are reduced and evaluated by means of an appropriate quality index. The selected rules are the building blocks in the definition of a gene network.
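
The pipeline above can be illustrated with a small Python sketch. It is written under several assumptions: fixed thresholds stand in for the discretization function, each row of the time-delay matrix is represented as a set of (gene, level, delay) items, and support and confidence of a single candidate rule are computed directly instead of running a full Apriori implementation or the quality index. All function names and thresholds are hypothetical.

```python
def discretize(series, low, high):
    """Map each expression value to 'low', 'mid' or 'high' (assumed thresholds)."""
    return ["low" if v < low else "high" if v > high else "mid" for v in series]


def time_delay_transactions(levels, max_delay):
    """Build one transaction per start time t.

    levels: dict gene -> list of discretized levels, all of the same length.
    Each transaction holds items (gene, level, delay) for delay 0..max_delay,
    where the level is the one observed at time t + delay.
    """
    if not levels:
        return []
    n_points = len(next(iter(levels.values())))
    transactions = []
    for t in range(n_points - max_delay):
        items = set()
        for gene, seq in levels.items():
            for d in range(max_delay + 1):
                items.add((gene, seq[t + d], d))
        transactions.append(items)
    return transactions


def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the single-item rule antecedent -> consequent."""
    n_ante = sum(antecedent in tr for tr in transactions)
    n_both = sum(antecedent in tr and consequent in tr for tr in transactions)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


expr = {
    "A": [0.1, 0.9, 0.1, 0.9, 0.1],
    "B": [0.5, 0.1, 0.9, 0.1, 0.9],
}
levels = {g: discretize(s, low=0.3, high=0.7) for g, s in expr.items()}
transactions = time_delay_transactions(levels, max_delay=1)
# "A is high now" -> "B is high one time point later"
support, confidence = rule_stats(transactions, ("A", "high", 0), ("B", "high", 1))
```

In this toy series, gene B is high one time point after every high value of gene A, which is the kind of delayed relationship that a single-time-point analysis would miss.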

Microarray data integration


When analyzing the relationships between genes under different scenarios, the integration of different microarray experiments becomes a relevant task. We present a framework to address some intrinsic problems of integration, due for instance to scaling issues, error bias, or differences in experimental conditions, technologies and protocols.

Gene-markers representation

Our approach projects the original microarray data into a common transformed space to create a common representation of different microarray datasets. This allows us to integrate data from various microarray platforms or from microarrays obtained under different experimental conditions. We validated our framework with experiments on real microarray datasets. The results suggest that our approach can be profitably exploited for microarray data integration and further gene expression analysis applications.

Biological validation

The BioSumm project tackles the problem of managing and exploiting the huge mass of information contained in ever-growing text repositories such as PubMed Central. BioSumm is a flexible and modular framework which analyzes large collections of unclassified biomedical texts and produces ad hoc summaries oriented to biological information. The summary generation is driven by a novel grading function, which biases sentence selection by means of an appropriate domain dictionary. In the current version of BioSumm, in order to focus on a biological target, the dictionary contains gene and protein names and aliases. BioSumm is neither a traditional summarizer nor an extractor of dictionary terms. It is designed to be a summarizer oriented to the biological domain. Thus, its summaries have both the expressive power of traditional summaries and the domain specificity of documents produced by a dictionary entry extractor. The final goal of the project is to become a powerful automatic instrument to support both knowledge inference from scientific papers and biological validation of gene/protein interactions obtained in different ways (e.g., with other data mining techniques).
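
The following Python sketch illustrates how a dictionary can bias sentence selection. It is not the BioSumm grading function, which is not specified on this page: here a sentence's base score is assumed to be the average corpus frequency of its words, multiplied by a boost for each dictionary term it contains. The names `grade_sentences` and `summarize`, the `boost` parameter and the simple tokenizer (which would split hyphenated gene names) are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (simplification: hyphenated names are split)."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def grade_sentences(sentences, dictionary, boost=1.0):
    """Return one score per sentence: word-frequency base times a dictionary boost."""
    terms = {t.lower() for t in dictionary}
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        if not words:
            scores.append(0.0)
            continue
        base = sum(freq[w] for w in words) / len(words)
        hits = sum(w in terms for w in words)
        scores.append(base * (1 + boost * hits))
    return scores


def summarize(sentences, dictionary, k=1, boost=1.0):
    """Keep the k best-graded sentences, in their original order."""
    scores = grade_sentences(sentences, dictionary, boost)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "TP53 regulates apoptosis in tumor cells.",
    "The weather was pleasant during the conference.",
    "Samples were collected in tumor tissue.",
]
summary = summarize(sentences, dictionary={"TP53"}, k=1)
```

In this example the first and third sentences have the same frequency-based score, and the dictionary hit on TP53 is what makes the first one win.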

More information can be found on the BioSumm page.