Algorithmic Analyses in Functional Genomics
Gustavo Stolovitzky
Watson Research Center, IBM, New York
Elmer Fernández
Universidad Católica de Córdoba
Mariano Alvarez
Instituto Leloir, Buenos Aires

The availability of vast amounts of high-throughput data generated by recent advances in biotechnology has motivated a parallel interest in extracting as much knowledge as possible from such data. This mini-course will introduce some of the main thrusts of today's research in functional genomics, focusing on algorithms that aim to extract valuable biological information from transcription data. The course will be divided into four units as follows:
1) Gene selection in gene expression arrays: in this lecture, we will review univariate and multivariate techniques to identify genes that show differential expression between two classes of tissue, a problem known as gene or feature selection. We will also discuss methods for validating the selected genes, concentrating on validation by classification, validation by statistical significance (touching on the problem of multiple testing), and validation by consistency.
2) Classification, clustering and visualization of gene expression data: the field of functional genomics incorporates ideas from disciplines such as Artificial Intelligence and Statistics to develop the analytical tools required for new diagnostics and inference methods based on gene expression data. These tools can be categorized into supervised (Neural Networks, Support Vector Machines and K-Nearest Neighbours) and unsupervised methods (Self-Organizing Maps, K-means and hierarchical clustering). In this lecture we will discuss the basic concepts of these methods and their applications to classification, prediction and visualization of gene expression data.
3) Characterization of variability in gene expression measurements: to overcome some of the problems arising from gene expression data and to estimate additional sources of experimental noise, it is important to understand both the random and the systematic variability present even in replicate microarray measurements. We will focus on the identification of systematic errors and how to correct for them through data normalization. Methods such as global lowess, print-tip lowess, and spatial normalization will be discussed and illustrated in the context of two-colour microarray technology. The MAS and RMA algorithms will be discussed in the context of one-colour microarray platforms. Noise characterization will be illustrated using one-colour (single channel) chips. Finally, the performance of three commonly used inferential analysis methods (t-test, SAM and limma) will be evaluated using artificially generated noise over real two-colour experimental data.
4) Functional profiling of gene expression data: different types of cancer may be associated with differences in the behaviour of cellular processes, whose differential regulation could be uncovered using transcriptional profiling. This can be done using "Gene-Centric" approaches, in which we identify those cellular processes enriched in genes that are differentially expressed between two types of cancer, or "Function-Centric" approaches, in which we explore a catalogue of biological processes to look for those processes whose genes contain enough information to discriminate between the given cancer types. In this lecture we will discuss different implementations of functional profiling methods and apply them to two sub-phenotypes in Chronic Lymphocytic Leukaemia.
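
As a small illustration of the univariate gene selection and multiple-testing ideas in unit 1, the sketch below computes a Welch t-statistic for one gene and applies the Benjamini-Hochberg adjustment to a list of p-values. The abstract does not say which statistics or corrections the lecture uses, so both choices, the function names `welch_t` and `benjamini_hochberg`, and the toy numbers are assumptions made for illustration.

```python
import math
import statistics


def welch_t(class_a, class_b):
    """Welch t-statistic for one gene measured in two classes of tissue."""
    mean_a, mean_b = statistics.mean(class_a), statistics.mean(class_b)
    var_a, var_b = statistics.variance(class_a), statistics.variance(class_b)
    standard_error = math.sqrt(var_a / len(class_a) + var_b / len(class_b))
    return (mean_a - mean_b) / standard_error


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical expression values for one gene in two classes of tissue.
t_stat = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# Hypothetical per-gene p-values (one per gene tested).
adjusted_p = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In a real analysis the p-values would come from a test applied to every gene on the array, and genes whose adjusted p-value falls below a chosen false discovery rate would be retained.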
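
Among the supervised methods named in unit 2, K-Nearest Neighbours is perhaps the simplest to write down. The following is a minimal sketch, assuming Euclidean distance and a plain majority vote; the function name `knn_predict`, the two-gene profiles and the class labels are hypothetical and not taken from the course material.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles with known class labels.
training = [
    ([1.0, 1.0], "A"),
    ([1.2, 0.9], "A"),
    ([5.0, 5.1], "B"),
    ([4.8, 5.3], "B"),
]
predicted = knn_predict(training, [1.1, 1.0], k=3)
```

With real expression data each profile would have one coordinate per selected gene, which is one reason gene selection usually precedes classification.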
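
For the two-colour normalization methods in unit 3, it may help to see the quantities they operate on. Lowess-based methods fit an intensity-dependent curve to the log-ratio M as a function of the mean log-intensity A and subtract it. The sketch below is deliberately simpler: it computes M and A and then subtracts a single global median from M, which is a constant-shift stand-in for lowess, not lowess itself. The function names and the intensities are assumptions made for illustration.

```python
import math
import statistics


def ma_values(red, green):
    """Return (M, A) lists: log-ratio and mean log-intensity per spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def global_median_normalize(m):
    """Centre the log-ratios so that their median is zero."""
    centre = statistics.median(m)
    return [value - centre for value in m]


# Hypothetical background-corrected intensities for three spots.
red = [2.0, 4.0, 8.0]
green = [1.0, 1.0, 1.0]
m, a = ma_values(red, green)
m_normalized = global_median_normalize(m)
```

Replacing the constant median with a smooth curve fitted to M against A would give a global lowess-style correction; fitting one curve per print tip would give the print-tip variant.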
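
The "Gene-Centric" idea in unit 4 is often implemented as an over-representation test. The sketch below assumes a hypergeometric upper-tail test, which is one common choice and not necessarily the one used in the lecture; the function name `enrichment_p_value`, its parameter names and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, in_category, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population:  number of genes measured
    in_category: genes annotated to the cellular process
    selected:    differentially expressed genes
    overlap:     selected genes that are also in the category
    """
    upper = min(in_category, selected)
    total = comb(population, selected)
    tail = sum(
        comb(in_category, x) * comb(population - in_category, selected - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Toy counts: 10 genes, 4 in the process, 3 selected, 2 of those in the process.
p_value = enrichment_p_value(population=10, in_category=4, selected=3, overlap=2)
```

Because one such test is run per cellular process, the resulting p-values are themselves subject to the multiple-testing problem mentioned in unit 1.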