Introduction

Below we describe some projects that could form the basis of research work for a postgraduate degree at the Department of Statistics at Leeds. This list of projects is by no means exhaustive - many other topics are available and can be discussed with potential supervisors.

All enquiries regarding postgraduate study should be addressed to: Dr Leonid Bogachev, Postgraduate Research Tutor, Department of Statistics, University of Leeds, Leeds, LS2 9JT, UK.

Project 3:  Spatial Modelling for Electrical Tomography  (Dr. R.G. Aykroyd)

Electrical tomography techniques have extensive application potential in industrial-process monitoring. They provide a fast, versatile and non-invasive method for on-line observation of 3D processes in real time. Currently, linear approximations are used to produce process visualizations, which suffer from artifacts including blurring, masking, shadowing and distortions. This project will consider statistical methods for nonlinear reconstruction that minimize artifacts to improve visualization and allow direct estimation of control parameters. The work will be in collaboration with the Leeds University School of Mining and Mineral Engineering.

Project 4:  Assessment of Facial Growth  (Prof. J.T. Kent and Prof. K.V. Mardia)

Statistical shape analysis based on a set of known landmark positions is now fairly well understood. However, when studying the shapes of surfaces, such as human faces, there are still many unresolved issues, including the identification of landmarks and the description of other aspects of the surface such as curvature. This project will address these issues in the context of human faces, with applications to growth and to the design of surgical implants for facial deformities, working with collaborators at St. James' Hospital, Leeds.

Project 6:  Statistical Modelling Of Swimming Micro-organisms  (Prof. C.C. Taylor)

Swimming micro-organisms interact in such a way that, over time, areas of high concentration can form patterns. The characterization of such time-dependent patterns is a complex challenge which should lead to a greater understanding of the way micro-organisms behave. If a small set of parameters can be used to model such patterns, then these can be related to various characteristics of the organisms, their concentration and the domain in which they live. Hu et al. (1995) have employed standard applied mathematical techniques to characterize such patterns by their wave numbers, wave number distributions, average roll curvature, and average roll orientation adjacent to the sidewall.

It is our goal to use statistical methods from image analysis and stochastic geometry to model such bioconvection patterns. Initial work will focus on simulation of various models to determine a class of models which can be adapted to this problem. Given a specific image, a set of parameters can then be estimated by Markov Chain Monte Carlo (MCMC) methods, for example. These parameters then need to be interpreted in biological terms, which will require a large number of experiments to gain an understanding. The project involves close collaboration with Dr. Nick Hill, Department of Applied Mathematics, University of Leeds.

Project 7:  Data Mining  (Prof. C.C. Taylor)

Recent advances in technology - in particular falling costs of high capacity storage media - have rapidly increased the acquisition and archiving of large amounts of data. Consequently, Knowledge Discovery in Databases (KDD) or Data Mining is a rapidly growing theme amongst researchers, and is a potential source of useful information for both industry and commerce. Various classes of patterns that can be detected in data have recently been studied in the KDD community. The ability to "learn" (or discover) knowledge from data in a more directed framework has a much longer history both in Machine Learning (ML) and in Statistics. Here, the data are generally of the form (variables, class), and the objective is simply to learn a "rule" whereby a new observation can be classified into one of the classes using the information in the attributes.

In both cases, however, difficulties arise when the data are not static, so that the validity of patterns and rules (discovered "knowledge") depends, in some unspecified way, on time. In this case, learning and classification (or prediction) procedures need robust mechanisms for detecting and adapting to changes.

In the machine learning literature, the generic term for such time-dependent changes is concept drift. Concept drift may be caused by continuous changes of the world and the environment - for example, in economic problems, the presence of inflation causes a continual drift in the real value of money which means that rules based on absolute monetary values will quickly become out of date - or it may occur when the attribute values or the concepts depend on a certain (possibly unknown) context. For example, the relative proportions of vehicles made by each manufacturer will depend on the region (or country) of observation. Thus, the location can be seen as the context in which the data are collected - knowledge of the context (or a change of context) will aid the knowledge discovery process.

There are various ways in which concept drift can manifest itself: the distribution of the attributes in a class can change; new attributes become available or existing attributes can no longer be measured (for example due to changes in the law); or a subset of the variables can change in their importance or ability to predict the class.

Project 8:  Data Fusion In Archaeological Geophysics  (Dr. R.G. Aykroyd)

Current archaeological thinking is to investigate a site without excavation. Surface magnetometry produces an image of the features not visible on the surface, but gives very little information about the depth and extent of buried features. Possible sources of this missing information include: magnetic stratigraphy where soil cores are used to study the distinct layers at the site; additional magnetic sensors to give multiple recordings; and use of other detector types, such as resistivity. This project will develop models and algorithms which can take data from the different sources to produce full three dimensional reconstructions of an archaeological site. Data and technical assistance will be supplied by staff in the Department of Archaeological Science, University of Bradford.

Project 10:  Robustness in Regression Analysis  (Prof. J.T. Kent)

The overall goal of Statistics can be viewed as trying to extract a signal from noisy data. A simple example is the estimation of a regression line from a set of bivariate data. Specific assumptions about the errors (e.g., normality) enable the signal to be estimated as efficiently as possible. However, if the errors are less structured and/or outliers are present, then stronger measures are needed to reliably find the signal.

Two common approaches to robustness in regression analysis are M-estimation and "Least Median of Squares". A recent innovation which combines features of both these methods, CM-estimation, has been developed at Leeds. The project will involve further developments of these methods and adaptation to other statistical models. Applications: computer vision and statistical methodology.
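To illustrate why robust fitting matters, here is a minimal sketch of a "Least Median of Squares" line fit, compared with ordinary least squares on toy data. It is not the CM-estimator mentioned above. The function names, the exhaustive search over point pairs (a simple approximation to the exact estimator, practical only for small data sets) and the data are all assumptions made for this example.

```python
import itertools
import statistics


def lms_line(points):
    """Approximate Least Median of Squares fit of y = intercept + slope * x.

    Each pair of points with distinct x values defines a candidate line;
    the candidate with the smallest median squared residual is returned.
    """
    best = None
    for (x1, y1), (x2, y2) in itertools.combinations(points, 2):
        if x1 == x2:
            continue  # vertical candidate, slope undefined
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        med = statistics.median(
            [(y - (intercept + slope * x)) ** 2 for x, y in points]
        )
        if best is None or med < best[0]:
            best = (med, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


def ols_line(points):
    """Ordinary least squares fit, for comparison."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data (an assumption): y = 1 + 2x exactly, then three gross outliers.
data = [(float(x), 1.0 + 2.0 * x) for x in range(10)]
data[7] = (7.0, 60.0)
data[8] = (8.0, -40.0)
data[9] = (9.0, 80.0)

print("LMS (intercept, slope):", lms_line(data))
print("OLS (intercept, slope):", ols_line(data))
```

With seven of the ten points lying exactly on the line, the median squared residual ignores the three outliers, whereas the least squares slope is pulled away from 2.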

Project 11:  Efficiency of MCMC  (Prof. J.T. Kent)

The Bayesian approach to inference is to study the posterior distribution of a parameter vector given a prior distribution and data available through a likelihood. Of particular interest is the mode of this posterior distribution. The maximum likelihood estimate arises as a special case.

In all but the simplest modelling situations, it is impossible to write down properties of the posterior density in closed form. Two important methodologies have been developed over the past 20 years. The EM algorithm is a tool for calculating the posterior mode for problems with incomplete data. It turns a messy problem into a simpler problem, at the price of requiring an iterative algorithm. Another, more recent, approach is MCMC (Markov chain Monte Carlo). This approach involves the construction of a Markov chain which can be easily simulated, and whose equilibrium distribution is the desired posterior distribution. Thus, properties of the posterior can be studied through simulation.
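As an illustration of the MCMC idea described above, the following is a minimal sketch of a random-walk Metropolis sampler, one of many possible ways to build such a Markov chain. The toy model (normal data with unit variance and a N(0, 100) prior on the mean), the data values, the step size and the function names are all assumptions chosen so that the exact posterior mean is known and can be compared with the simulation.

```python
import math
import random


def metropolis(log_density, start, step, n_samples, seed=0):
    """Random-walk Metropolis: a Markov chain whose equilibrium
    distribution has density proportional to exp(log_density)."""
    rng = random.Random(seed)
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p_new / p), computed on the log scale.
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


# Toy data (an assumption): y_i ~ N(theta, 1), prior theta ~ N(0, 100).
observations = [1.2, 0.8, 1.9, 1.4, 0.7]


def log_posterior(theta):
    """Log posterior density of theta, up to an additive constant."""
    log_prior = -theta ** 2 / (2.0 * 100.0)
    log_lik = -sum((y - theta) ** 2 for y in observations) / 2.0
    return log_prior + log_lik


samples = metropolis(log_posterior, start=0.0, step=1.0, n_samples=20000, seed=1)
kept = samples[2000:]  # discard an initial burn-in period
estimate = sum(kept) / len(kept)

# For this conjugate model the posterior mean is known in closed form.
exact = sum(observations) / (len(observations) + 1.0 / 100.0)
print("simulated posterior mean:", estimate)
print("exact posterior mean:    ", exact)
```

The step size controls how quickly the chain approaches equilibrium, which is one of the tuning choices that affects speed of convergence.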

For both of these algorithms there is the question of speed of convergence. For EM, it is sometimes possible to tune the representation of the incompleteness in the data to speed the convergence to the mode. In the MCMC algorithm, there are various choices in the setting up of the Markov chain which affect the speed of convergence to the equilibrium distribution. Rather surprisingly, it turns out that the two methodologies are closely linked. That is, tuning an EM algorithm to have good convergence properties can lead to an MCMC algorithm which also has good convergence properties. This project will study these questions in more detail. Applications: a wide range of statistical problems.

Project 12:  Inhomogeneous Priors For Image Reconstruction  (Dr. R.G. Aykroyd)

In a Bayesian approach to image reconstruction, knowledge of the expected smoothness can be incorporated in terms of a prior distribution. Most low-level priors are defined by homogeneous Markov random field models. These constrain the smoothing to be constant over the image. Inhomogeneous models, however, allow model parameters to vary across the site, hence reducing biases which can lead to features being masked.

A common approach to estimation involves Markov Chain Monte Carlo (MCMC) methods. This allows all prior model parameters to be estimated in addition to the image itself. Unfortunately, large-scale calibration experiments are needed to estimate the otherwise intractable normalization constant of the homogeneous prior distribution. When moving to inhomogeneous models, the problem again becomes unmanageable.

This project will look at the use of different types of prior model, both homogeneous and inhomogeneous. Methods for estimation of the normalization constants will be studied, in particular approximate estimation for the inhomogeneous case.

Project 13:  Extinction and Propagation in Catalytic Media  (Dr. L. Bogachev)

Consider a random walker in d dimensions, and assume that at certain points the particle can undergo branching, in that it dies at some rate, giving birth to (a random number of) daughter particles, which thereafter evolve in a similar manner. Such Markov stochastic processes are called branching random walks in a catalytic branching environment and are motivated, for example, by the model of strong reaction centres in chemical kinetics.

The goal of the project will be to study the asymptotic properties of the particle population, in particular related to (local and global) extinction and front propagation. One can expect that results will involve the interplay between the recurrence/transience properties of the underlying random walk and the intensities of the catalysts. Some computer simulations may prove useful to guess and verify the answers.
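The kind of simulation mentioned above might look like the following sketch. It is a deliberately simplified, discrete-time, one-dimensional toy version with a single catalyst at the origin, not the continuous-time process described in the project. The function name, its parameters and the population cap are assumptions made for illustration.

```python
import random


def simulate_catalytic_brw(n_steps, branch_prob, offspring_probs,
                           seed=0, max_particles=100_000):
    """Toy branching random walk on the integers with one catalyst at 0.

    At each step, a particle sitting at the catalyst branches with
    probability branch_prob: it dies and is replaced by k daughters at
    the same site, where k is drawn with weights offspring_probs
    (offspring_probs[k] is the probability of k daughters).
    Otherwise the particle moves one step left or right.
    Returns the list of population sizes, starting with the initial size.
    """
    rng = random.Random(seed)
    particles = [0]  # a single particle starting at the catalyst
    sizes = [1]
    counts = list(range(len(offspring_probs)))
    for _ in range(n_steps):
        next_generation = []
        for position in particles:
            if position == 0 and rng.random() < branch_prob:
                k = rng.choices(counts, weights=offspring_probs)[0]
                next_generation.extend([0] * k)
            else:
                next_generation.append(position + rng.choice((-1, 1)))
        particles = next_generation
        sizes.append(len(particles))
        # Stop on extinction, or when the population exceeds the cap.
        if not particles or len(particles) > max_particles:
            break
    return sizes


# Example run (assumed parameters): 0 or 2 daughters with equal probability.
sizes = simulate_catalytic_brw(200, branch_prob=0.5,
                               offspring_probs=[0.5, 0.0, 0.5], seed=1)
print("steps simulated:", len(sizes) - 1, "final population:", sizes[-1])
```

Repeating such runs over many seeds gives a rough empirical estimate of the extinction probability, which can then be compared with conjectured answers.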

Project 14:  Estimation of Parameters of Diffusion Processes  (Prof. A.Yu. Veretennikov)

In many situations an estimate of an unknown density is required. Amongst the most popular estimators are kernel estimators. Once the kernel itself is chosen, the next problem is to choose its parameter, the "width". Now suppose the observations are dependent. Can one use standard estimators constructed for the i.i.d. case in this new situation? Should the width be adjusted? Are the estimators optimal in some appropriate sense?
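For concreteness, here is a minimal sketch of a kernel density estimator with a Gaussian kernel, applied to a dependent sample. The choice of kernel, the first-order autoregressive series used as the dependent data, the width value and the function names are all assumptions for this example; the sketch does not address how the width should be adjusted under dependence.

```python
import math
import random


def gaussian_kde(data, width):
    """Return the kernel density estimate
    f(x) = (1 / (n * width)) * sum_i K((x - x_i) / width),
    where K is the standard normal density."""
    n = len(data)
    norm = 1.0 / (n * width * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / width) ** 2)
                          for xi in data)

    return density


def ar1_sample(n, phi, seed=0):
    """Dependent observations from x_t = phi * x_{t-1} + e_t, e_t ~ N(0, 1)."""
    rng = random.Random(seed)
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


sample = ar1_sample(500, phi=0.8, seed=0)
f_hat = gaussian_kde(sample, width=0.5)
print("estimated density at 0:", f_hat(0.0))
```

Rerunning with different values of `phi` and `width` gives a quick empirical feel for how dependence and the choice of width interact.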

Likewise, if one has a parametric model, the question arises whether it is possible to use "standard" estimators of an unknown parameter in the dependent case, whether they retain any optimality properties, and under what conditions. Diffusion processes are the most general class of Markov processes with continuous trajectories. This class of processes is attractive to study because a process may be defined using only two characteristics (the drift and diffusion coefficients) at any point of the state space. However, similar settings are available for Markov chains.

Both theories, parametric and nonparametric estimation for processes, are under development; they rely upon recent achievements in the theory of stochastic processes. An important part of this concerns the recurrence and ergodic properties of processes, which are to be incorporated into the estimation techniques. Optimality properties of the estimators are to be investigated.

Project 15:  Data processing with complex-valued wavelets   (Dr S. Barber)

Surprisingly, complex-valued wavelets are more effective at analysing certain types of real-valued data than real-valued wavelets. Barber & Nason (2003, 2004) have shown that denoising real-valued data with complex wavelets is both simple and effective. Further work is needed to extend these methods to denoising images and complex-valued data (such as tomographic measurements of flow in a pipe, magnetic resonance imaging scans, and radar data). In some of these applications, generalisations of wavelets such as wedgelets, ridgelets, and curvelets have been proposed; complex-valued versions of these basis functions would be worth studying. It would also be interesting to extend the use of complex-valued wavelets to other areas of statistics where wavelets have proven effective, such as density estimation, time series analysis, and changepoint detection.

Project 16:  Optimality of wavelet predictors in regression   (Dr S. Barber)

Wavelets (and extensions such as lifting or wavelet packets) provide a multiscale representation of data. This means we can take account of both localised and large-scale behaviour at once. This can be used in modelling how an outcome depends on both local and global behaviour of predictors (Hunt & Nason, 2001; Goodwin, Barber, & Aykroyd, 2004; Goodwin, Aykroyd, & Barber, 2005). Alternative approaches have been used, such as a kernel-based approach known as geographically weighted smoothing (Brunsdon, Fotheringham, & Charlton, 1996), but these methods do not allow for predictors that have locally irregular behaviour (for example, mountain ranges impose "shadows" in climate variables, or human habitation imposes sharp changes in environmental conditions). This project will involve studying the various competing methods and assessing their relative merits by considering bounds on error measures for the predictions generated by these methods.

Project 17:  A wavelet-lifting approach to spatial-temporal prediction applied to crop monitoring   (Dr R.G. Aykroyd & Dr S. Barber)

Defra have funded the monitoring of pests and diseases across a wide variety of important crops over the last three decades. In particular, the Central Science Laboratory have access to a unique database containing records of pest numbers, incidence and severity of disease (taken at various time intervals) and effectiveness of crop management practice. The aim is to analyse the data to predict risk to the crops, seasonal variation in this risk and the effectiveness of control strategies, to alert farmers to emerging threats and advise on appropriate actions.

Statistically, the problem we wish to address is one of describing the densities of key environmental quantities over a large and geographically diverse region given measurements on an irregular grid. In some cases, several different quantities are measured (often on different grids), resulting in several densities or response surfaces. Further, for some data sets, these measurements are made at irregular time points, creating a complex three-dimensional data set. Once these response surfaces are estimated, they will be used to predict the incidence of pest explosions or other undesirable events. Both logistic regression and Bayesian modelling will be considered as methods of making these predictions.

Project 19:  Structural constraints in protein folding pathways   (Prof W. R. Gilks)

Proteins are long chains of residues, each residue being one of 20 types of amino acid. The amino acid sequence of a protein determines its three-dimensional, folded shape. Nature knows the mapping from sequence to structure but science does not, despite decades of research. Currently, the best methods for predicting a protein's structure rely on homology, i.e. sequence similarity with another protein whose structure has been determined experimentally. In this project, we will explore a theory that folding pathways are constrained in local sequence-structure space, and aim to discover these constraints through Markov modelling of known protein structures.

Project 20:   Modelling highly conserved non-coding DNA sequences   (Prof W. R. Gilks)

DNA contains genes which code for proteins, which perform diverse cellular functions. The DNA sequence of fundamental genes is often highly conserved across widely differing species, reflecting ancient common origins in the universal tree of life. However, genes account for only a small fraction of the total DNA in higher organisms. The remaining intergenic DNA used to be thought of mostly as poorly conserved 'junk'.

Surprisingly, recent research has revealed thousands of short intergenic DNA sequences (CNEs) which are even more highly conserved than genes, suggesting their supreme importance to the organism. Yet, the function of these CNEs is poorly understood, and their mode of action entirely unknown. This project aims to develop statistical models of CNE sequence, to help find them, compare them, and uncover their evolutionary origins.

Project 21:  Reconstructing ancient genomes   (Prof W. R. Gilks)

Currently, genome sequencing projects have succeeded in revealing the DNA sequence of about a dozen vertebrate species, ranging from human, through chimpanzee and mouse, to frog and fish, each sequence being of the order of billions of base-pairs. More genome sequencing projects are under way. Soon there will be enough information to reconstruct the genome sequence of the long-extinct common ancestor of all mammals. The aim of this project will be to develop statistical models of genome evolution, to assist in genomic reconstruction of ancient species.

Project 23:  Analysing evolutionary trees  (Dr S. Barber & Prof W.R. Gilks)

Imagine having information on how "different" a collection of species of plants or animals are. From this information, we might like to draw "family tree" type diagrams to explore when different species diverged. Such trees are called phylogenies and can be constructed in various ways once we have information on how different the various species are. For a more detailed discussion of phylogenies, see the book "Inferring Phylogenies" by Felsenstein (2004).

Unfortunately, the information we have is often imperfect, so our trees are subject to some doubt. Statistically, we might say that our difference information has been corrupted by noise. Wavelet shrinkage (Donoho & Johnstone, 1994, Biometrika pp. 425-455) is a powerful tool for removing noise when we have regularly spaced data in one or two dimensions. The idea of lifting was developed by Sweldens (1997, SIAM J. Math. Anal. pp. 511-546) as a means of generalising the wavelet shrinkage method to more irregular point sets, and has since been used to apply the method to more complex data structures. Moreover, the lifting algorithm is strikingly similar to the algorithms used to construct phylogenetic trees. This project will explore the use of lifting to construct phylogenies when our information is corrupted by noise.
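To make the idea of wavelet shrinkage concrete, the sketch below denoises regularly spaced one-dimensional data using the Haar wavelet and soft thresholding. It illustrates only the classical regular-grid setting, not lifting or tree construction. The choice of the Haar wavelet, the use of the universal threshold sigma * sqrt(2 log n) with a known noise level sigma, and the function names are assumptions made for this example.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns (approx, details), where details[0] is the finest level."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx[0], details


def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level upwards."""
    current = [approx]
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(current, level):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        current = rebuilt
    return current


def soft_threshold(x, t):
    """Shrink x towards zero by t, setting it to zero if |x| <= t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def wavelet_shrink(signal, sigma):
    """Denoise by soft-thresholding all Haar detail coefficients."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    threshold = sigma * math.sqrt(2.0 * math.log(n))
    approx, details = haar_forward(signal)
    shrunk = [[soft_threshold(d, threshold) for d in level]
              for level in details]
    return haar_inverse(approx, shrunk)


# Toy example (an assumption): a constant level of 4 with small wiggles.
noisy = [4.1, 3.9, 4.1, 3.9]
print(wavelet_shrink(noisy, sigma=0.1))
```

Small detail coefficients, which mostly reflect noise, are set to zero, while large ones, which carry real structure such as jumps, are kept in shrunken form.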

Project 24:  Statistical Models on Cylinders with Applications to Bioinformatics   (Prof K.V. Mardia & Prof J.T. Kent)

There are various situations where there is a combination of angular and 'distance' type variables. The simplest example is that of wind direction and wind speed. More complicated problems are found in Bioinformatics, for example in hydrogen bonding. Hydrogen bonding is vital for determining the 3-D structures found in proteins, such as the alpha-helix, the beta-strand and so on. This project will construct suitable models motivated by the geometry of the variables, and test these on real data. One of the goals of the project is the application to protein structure prediction, by incorporating the model into the PHAISTOS software package (http://www.phaistos.org).

Project 25: Statistics of Surface Representation and Protein Bioinformatics   (Prof K.V. Mardia)

Many novel and challenging statistical problems arise in the Life Sciences. This project will develop work that will have an impact on Drug Discovery.

Proteins provide much of the molecular machinery of life. There is a huge variety of them in each cell of every species, each type performing a specialized function. They work by interacting with other molecules through 'active' sites on their surface, but the location of these active sites is unknown for most proteins. Identifying a protein's active sites may help in predicting which other molecules it interacts with and ultimately its biological function.

Many attempts have been made by bioinformaticians to predict protein active sites based on their three-dimensional shape and electrical charge profile, but with limited success. Essentially, this is a statistical problem, as there are experimental data to guide us, but these data are often noisy and complex. In this project we will develop statistical models of a protein's surface to help unravel this complexity, and improve predictions. Non-statistical, deterministic work already exists (for example, Pickering et al. (2001) and Davies et al. (2002)), but it needs to be extended to allow for statistical variation.

No prior biological knowledge will be assumed: training will be provided as required.

Project 26: Protein Structure and Distributions on Rotations   (Prof K.V. Mardia) - in collaboration with Dr. T. Hamelryck, University of Copenhagen.

An organism's DNA, including its genes, holds almost all the information required for its development and function. Human understanding of this information is at an early stage, but is accumulating rapidly due to new high throughput forms of experimentation. This has led to large and rapidly expanding databases of DNA sequence, and related databases of the structure and function of biomolecules such as proteins. Bioinformatics is concerned with the development of these databases, and tools for deciphering and exploiting the information they contain.

Bioinformatics has various challenging problems related to protein structure. Protein structure can be described through a set of angles known as conformational angles. There is a need for suitable directional models to understand and investigate properties of the structure. Mardia and Jupp's research monograph (Directional Statistics, 2000, Wiley) provides a starting point. However, various tools need to be developed as the conformational angles are correlated. In particular, this project will look more deeply into how proteins fold through the orientations of their neighbouring atoms; this requires extending the distributions on rotation matrices known as Fisher Matrix Distributions. The research will be driven by protein data, of which we have a good representative sample from Copenhagen University as well as from our recent research papers.

No knowledge of protein structure or directional statistics will be assumed since full training will be given as necessary.

Project 27: Bayesian Hierarchical Alignment Models and Flexible Proteins   (Prof K.V. Mardia)

Various new, challenging and interesting problems in shape analysis have been arising in different scientific areas, including bioinformatics and image analysis.

In shape analysis, one assumes that the points in two or more configurations are labelled and that these configurations are to be matched after filtering out some transformation, such as a rigid transformation. Green and Mardia (2006, Biometrika, 93) have developed a new method to align unlabelled configurations under rigid transformation. This project will extend the methodology to allow for various different types of transformation, mostly motivated by problems in Protein Bioinformatics.

It is well known that proteins are the work-horses of all living systems. A protein is a sequence of amino acids, of which there are twenty types. The sequence folds into a three-dimensional structure. This three-dimensional shape of a protein plays a key role in determining its function, so proteins in which particular atoms have very similar configurations in space often have similar functions. There is therefore a need for efficient methodology to align proteins after allowing for appropriate geometrical transformations. Real problems in Protein Bioinformatics can be very complex, for example because of the flexibility of proteins. This project will develop methods to allow for this flexibility. Various data sets are already available to test the adequacy of the tools developed under this project.