Exploratory data analysis with applications to bioinformatics

Kanti V. Mardia, John T. Kent, Zhengzheng Zhang and Charles C. Taylor

There are many statistically challenging problems in protein and RNA bioinformatics. We can regard the backbone of a typical protein as an articulated object in three dimensions with fixed bond lengths between successive amino acids. Hence it can be viewed as a long time series (with hundreds or thousands of amino acids), where all the information lies in the angles between successive bonds. There are two types of angles: bond (or planar) angles, analogous to colatitude, which are nearly constant here; and dihedral angles, analogous to longitude, which carry essentially all the information. Thus the basic protein description is reduced to a circular time series. Further angular information comes from the side chains.
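As an illustration of how a backbone reduces to a circular time series, the following minimal sketch computes one dihedral angle from four consecutive points and then slides along a chain of points. It is not the authors' code: the function names `dihedral` and `dihedral_series` are hypothetical, the points are assumed to be plain (x, y, z) tuples, and the sign convention assumed is the common one in which a cis arrangement gives 0 and a trans arrangement gives ±π.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3-vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in radians, in [-pi, pi], defined by four points.

    It is the signed angle between the plane through (p0, p1, p2)
    and the plane through (p1, p2, p3), measured about the p1-p2 axis.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the first plane
    n2 = _cross(b1, b2)  # normal to the second plane
    norm_b1 = math.sqrt(_dot(b1, b1))
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / norm_b1
    return math.atan2(y, x)


def dihedral_series(points):
    """Dihedral angles for every run of four consecutive points."""
    return [
        dihedral(points[i], points[i + 1], points[i + 2], points[i + 3])
        for i in range(len(points) - 3)
    ]
```

A chain of n points yields n - 3 dihedral angles, each a point on the circle, so the output is a circular time series in the sense described above.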

The aim is to find patterns in such data (see, for example, Boomsma et al., 2008). This report will give new methods and visualisation tools, including time series analysis, circular principal component analysis and clustering (cf. Mu et al., 2005; Altis et al., 2007), together with examples.
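To make the circular setting concrete, here is a small sketch of three building blocks that such analyses commonly rely on: a wrapped angular difference, a distance between points on a torus (for example, pairs of dihedral angles), and a cosine/sine embedding of the kind often used before applying ordinary principal component analysis to angles. These are generic textbook constructions under assumed definitions, not necessarily the methods presented in this report, and all function names are hypothetical.

```python
import math


def wrap(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def torus_distance(a, b):
    """Euclidean-style distance between two points on a torus.

    a and b are equal-length sequences of angles in radians; each
    coordinate difference is wrapped before squaring, so angles
    on either side of the +/-pi cut are treated as close.
    """
    return math.sqrt(sum(wrap(x - y) ** 2 for x, y in zip(a, b)))


def circular_mean(angles):
    """Mean direction of a sample of angles, in radians."""
    s = sum(math.sin(t) for t in angles)
    c = sum(math.cos(t) for t in angles)
    return math.atan2(s, c)


def embed(angles):
    """Map each angle t to the pair (cos t, sin t), flattened.

    The embedded vector lives in ordinary Euclidean space, so
    linear methods can be applied to it without a branch cut.
    """
    out = []
    for t in angles:
        out.extend((math.cos(t), math.sin(t)))
    return out
```

For example, the angles 3.0 and -3.0 radians differ by 6.0 as raw numbers but are only about 0.28 radians apart on the circle, which is what `torus_distance` reports; a pairwise matrix of such distances could then be passed to any distance-based clustering routine.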

Keywords: circular time series analysis, circular principal component analysis, torus distance and clustering analysis.