Pattern Recognition on Random Trees

In this talk we address the problem of identifying differences between populations of trees. An example of such populations are estimations of the context tree of a Variable Length Markov Chain, an important modeling tool that have been used recently for protein classification without sequence alignment. We introduce two different approaches, a generative one, based on a hypothesis test proposed recently by Balding et al (2004) (BFFS--test), and a discriminative one, adapting three popular techniques of supervised and unsupervised classification to work in our compact metric space of trees: K-means, k-nearest neighbors and Ward linkage. We applied both methods to different problems in genomics, the outline of functionality families of proteins, classification of a new protein and evolutionary study of a functionality family. We consider a protein functionality family as a random variable
assuming values in the space of context trees and conjecture that different families induce different random variables. The context tree of a protein is then seen as a realization of the random variable corresponding to its family. We test (simultaneously) differences among 10 families of proteins, transforming their amino acid chains into trees via the Probabilistic Suffix Tree algorithm from Bejerano et al (2004), we perform a 10 fold classification  procedure of these families using our adaptation of 3-nearest neighbors, we observe how the families emerge via K-means clustering and we study the phylogeny of the FGF family via Ward linkage. We compare the results obtained with classical approaches based on alignment methods.

Contacto: ebiomat@famaf.unc.edu.ar