
Introduction
Below we describe some projects that could form the basis of research work
for a postgraduate degree at the Department of Statistics at Leeds.
This list of projects is by no means exhaustive; many other topics are
available and can be discussed with potential supervisors.
All enquiries regarding postgraduate study
should be addressed to:
Dr Leonid Bogachev,
Postgraduate Research Tutor,
Department of Statistics, University of Leeds, Leeds, LS2 9JT, UK.
Project 3: Spatial Modelling for Electrical Tomography
(Dr. R.G. Aykroyd)

Electrical tomography techniques have extensive application
potential in industrial process monitoring. They provide a fast,
versatile and non-invasive method for on-line observation of 3D
processes in real time. Currently, linear approximations are used to
produce process visualizations which have artifacts including
blurring, masking, shadowing and distortions. This project will
consider statistical methods for nonlinear reconstruction that
minimize artifacts to improve visualization and allow direct
estimation of control parameters. The work will be in collaboration
with the Leeds University School of Mining and Mineral Engineering.
Project 4: Assessment of Facial Growth
(Prof. J.T. Kent and Prof. K.V. Mardia)

Statistical shape analysis based on a set of known landmark
positions is now fairly well understood. However, when studying the
shapes of surfaces, such as human faces, there are still many
unresolved issues, including the identification of landmarks and the
description of other aspects of the surface such as curvature. This
project will address these issues in the context of human faces, with
applications to growth and to the design of surgical implants for
facial deformities with collaborators at St. James' Hospital, Leeds.
Project 6: Statistical Modelling Of Swimming Microorganisms
(Prof. C.C. Taylor)

Swimming microorganisms interact in such a way that, over time, areas
of high concentration can form patterns. The characterization of such
time-dependent patterns is a complex challenge which should lead to a
greater understanding of the way microorganisms behave. If a small
set of parameters can be used to model such patterns, then these can be
related to various characteristics of the organisms, their
concentration and the domain in which they live. Hu et al. (1995) have
employed standard applied mathematical techniques to characterize such
patterns by their wave numbers, wave number distributions, average roll
curvature, and average roll orientation adjacent to the sidewall.
It is our goal to use statistical methods used in image analysis and
stochastic geometry to model such bioconvection patterns. Initial work
will focus on simulation of various models to determine a class of
models which can be adapted to this problem. Given a specific image,
a set of parameters can then be estimated by Markov Chain Monte Carlo
(MCMC) methods, for example. These parameters then need to be
interpreted in biological terms, which will require a large number of
experiments to gain an understanding. The project involves the close
collaboration of Dr. Nick Hill, Department of Applied Mathematics,
University of Leeds.
Project 7: Data Mining
(Prof. C.C. Taylor)

Recent advances in technology, in particular the falling cost of
high-capacity storage media, have rapidly increased the acquisition and
archiving of large amounts of data. Consequently, Knowledge Discovery
in Databases (KDD) or Data Mining is a rapidly growing theme amongst
researchers, and is a potential source of useful information for both
industry and commerce. Various classes of patterns that can be
detected in data have recently been studied in the KDD community. The
ability to "learn" (or discover) knowledge from data in a more
directed framework has had a much longer practice both in Machine
Learning (ML) and in Statistics. Here, the data are generally of the
form (variables, class), and the objective is simply to learn a
"rule" whereby a new observation can be classified into one of the
classes using the information in the attributes.
In both cases, however, difficulties arise when the data are not
static, and so the validity of patterns and rules (discovered
`knowledge') depends, in some unspecified way, on time, and it changes
with time. In this case, learning and classification (or prediction)
procedures need robust mechanisms for detecting and adapting to
changes.
In the machine learning literature, the generic term for such
time-dependent changes is concept drift. Concept drift may be caused by
continuous changes of the world and the environment (for example, in
economic problems, the presence of inflation causes a continual drift
in the real value of money, which means that rules based on absolute
monetary values will quickly become out of date), or it may occur
when the attribute values or the concepts depend on a certain (possibly
unknown) context. For example, the relative proportions of
vehicles made by each manufacturer will depend on the region (or
country) of observation. Thus, the location can be seen as the context
in which the data are collected; knowledge of the context (or a
change of context) will aid the knowledge discovery process.
There are various ways in which concept drift can manifest itself: the
distribution of the attributes in a class can change; new attributes
become available or existing attributes can no longer be measured (for
example due to changes in the law); or a subset of the variables can
change in their importance or ability to predict the class.
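A common way of coping with concept drift is to monitor the error rate of the current rule and flag a change when it deteriorates. As a toy illustration of this idea (the function name, window length and threshold below are invented for the sketch, not taken from any published method), a sliding-window comparison against the long-run error rate might look like this:

```python
import numpy as np

def detect_drift(errors, window=50, threshold=0.15):
    """Return the first time at which the error rate over the last
    `window` observations exceeds the earlier long-run error rate by
    more than `threshold`, or None if no such drift is seen."""
    errors = np.asarray(errors, dtype=float)
    for t in range(2 * window, len(errors)):
        baseline = errors[:t - window].mean()   # long-run error rate
        recent = errors[t - window:t].mean()    # error rate in the window
        if recent - baseline > threshold:
            return t
    return None

# Toy 0/1 error stream: a rule that is perfect until time 200, after
# which the concept drifts and every prediction is wrong.
errors = [0.0] * 200 + [1.0] * 200
change_point = detect_drift(errors)
```

In practice the detection rule would itself need tuning, and more robust schemes track variance as well as the mean; the sketch only shows the monitoring idea.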
Project 8: Data Fusion In Archaeological Geophysics
(Dr. R.G. Aykroyd)

Current archaeological thinking is to investigate a site without
excavation. Surface magnetometry produces an image of the features not
visible on the surface, but gives very little information about the
depth and extent of buried features. Possible sources of this missing
information include: magnetic stratigraphy, where soil cores are used
to study the distinct layers at the site; additional magnetic sensors
to give multiple recordings; and use of other detector types, such as
resistivity.
This project will develop models and algorithms which can take data
from the different sources to produce full three-dimensional
reconstructions of an archaeological site.
Data and technical assistance will be supplied by staff in the
Department of Archaeological Science, University of Bradford.
Project 10: Robustness in Regression Analysis
(Prof. J.T. Kent)

The overall goal of Statistics can be viewed as trying to extract a
signal from noisy data. A simple example is the estimation of a
regression line from a set of bivariate data. Specific assumptions
about the errors (e.g., normality) enable the signal to be estimated as
efficiently as possible. However, if the errors are less structured
and/or outliers are present, then stronger measures are needed to
reliably find the signal.
Two common approaches to robustness in regression analysis include
M-estimation and "Least Median of Squares". A recent innovation which
combines features of both these methods, CM-estimation, has been
developed at Leeds. The project will involve further developments of
these methods and adaptation to other statistical models.
Applications: computer vision and statistical methodology.
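The CM-estimation method developed at Leeds is not reproduced here, but the M-estimation ingredient can be sketched. The following is a minimal illustration (function name, tuning constant and data all invented) of fitting a straight line by iteratively reweighted least squares with Huber's weight function, which bounds the influence of outliers:

```python
import numpy as np

def huber_line(x, y, k=1.345, n_iter=50):
    """M-estimation of a line y = a + b*x with Huber's weight
    function, computed by iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-8)
        u = np.abs(r) / scale                            # standardised residuals
        w = np.minimum(1.0, k / np.maximum(u, 1e-12))    # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

# Data on the line y = 1 + 2x, with one gross outlier.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[9] = 100.0
a_robust, b_robust = huber_line(x, y)
```

Ordinary least squares would be pulled badly off the true line by the single outlier; the reweighted fit recovers intercept and slope close to 1 and 2.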
Project 11: Efficiency of MCMC
(Prof. J.T. Kent)

The Bayesian approach to inference is to study the posterior
distribution of a parameter vector given a prior distribution and data
available through a likelihood. Of particular interest is the mode of
this posterior distribution. The maximum likelihood estimate arises as
a special case.
In all but the simplest modelling situations, it is impossible to write
down properties of the posterior density in closed form. Two important
methodologies have been developed over the past 20 years. The EM
algorithm is a tool for calculating the posterior mode for problems
with incomplete data. It turns a messy problem into a simpler problem,
at the price of requiring an iterative algorithm. Another, more recent
approach, is MCMC (Markov chain Monte Carlo). This approach involves
the construction of a Markov chain which can be easily simulated, and
whose equilibrium distribution is the desired posterior distribution.
Thus, properties of the posterior can be studied through simulation.
For both of these algorithms there is the question of speed of
convergence. For EM, it is sometimes possible to tune the
representation of the incompleteness in the data to speed the
convergence to the mode. In the MCMC algorithm, there are various
choices in the setting up of the Markov chain which affect the speed of
convergence to the equilibrium distribution. Rather surprisingly, it
turns out that the two methodologies are closely linked. That is,
tuning an EM algorithm to have good convergence properties can lead to
an MCMC algorithm which also has good convergence properties. This
project will study these questions in more detail.
Applications: a wide range of statistical problems.
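To make the tuning choices concrete, the following sketch implements a random-walk Metropolis sampler for a toy one-dimensional posterior; the proposal step size is exactly the kind of choice that governs the acceptance rate and the speed of convergence to equilibrium (all names and values here are purely illustrative):

```python
import numpy as np

def metropolis(log_post, x0, n_samples, step, rng):
    """Random-walk Metropolis: simulate a Markov chain whose
    equilibrium distribution is the target posterior."""
    x, accepted = x0, 0
    chain = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_post(proposal) - log_post(x):
            x, accepted = proposal, accepted + 1
        chain[i] = x
    return chain, accepted / n_samples

# Toy target: a standard normal posterior, started well away from the
# mode so that the chain must first converge to equilibrium.
rng = np.random.default_rng(1)
chain, acc_rate = metropolis(lambda v: -0.5 * v * v, x0=5.0,
                             n_samples=20000, step=2.4, rng=rng)
```

Running the same chain with a much smaller or much larger `step` gives acceptance rates near 1 or near 0 respectively, and in both cases far slower exploration of the posterior.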
Project 12: Inhomogeneous Priors For Image Reconstruction
(Dr. R.G. Aykroyd)

In a Bayesian approach to image reconstruction, knowledge of the
expected smoothness can be incorporated in terms of a prior
distribution. Most low-level priors are defined by homogeneous Markov
random field models. These constrain the smoothing to be constant over
the image. Inhomogeneous models, however, allow model parameters to
vary from site to site, hence reducing biases which can lead to features
being masked.
A common approach to estimation involves Markov Chain Monte Carlo
(MCMC) methods. This allows all prior model parameters to be estimated
in addition to the image itself. Unfortunately, large-scale calibration
experiments are needed to estimate the otherwise intractable
normalization constant of the homogeneous prior distribution. When
moving to inhomogeneous models the problem again becomes unmanageable.
This project will look at the use of different types of prior model,
both homogeneous and inhomogeneous. Methods for estimation of the
normalization constants will be studied, in particular approximate
estimation for the inhomogeneous case.
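To make the distinction concrete, the sketch below evaluates the unnormalised log-density of a simple pairwise Gaussian Markov random field prior; a scalar smoothing parameter gives a homogeneous model, while an array of site-wise parameters gives an inhomogeneous one. The normalisation constant, the difficult quantity discussed above, is deliberately omitted, and all names are illustrative:

```python
import numpy as np

def log_prior(image, beta):
    """Unnormalised log-density of a pairwise Gaussian MRF prior.
    `beta` may be a scalar (homogeneous smoothing) or an array the
    same shape as `image` (inhomogeneous, site-wise smoothing)."""
    beta = np.broadcast_to(beta, image.shape)
    dh = image[:, 1:] - image[:, :-1]        # horizontal neighbour differences
    dv = image[1:, :] - image[:-1, :]        # vertical neighbour differences
    bh = 0.5 * (beta[:, 1:] + beta[:, :-1])  # edge weights from site parameters
    bv = 0.5 * (beta[1:, :] + beta[:-1, :])
    return -np.sum(bh * dh ** 2) - np.sum(bv * dv ** 2)

smooth = np.zeros((8, 8))                    # perfectly flat image
rough = np.arange(64.0).reshape(8, 8) % 5    # image with many jumps
```

As expected, the prior assigns higher (log-)density to the smooth image than to the rough one; lowering `beta` in a region relaxes the smoothing there, which is the mechanism the inhomogeneous models exploit.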
Project 13: Extinction and Propagation in Catalytic Media
(Dr. L. Bogachev)

Consider a random walk in d dimensions, and assume
that at certain points (catalysts) the particle can undergo branching, in that it
dies at some rate giving birth to (a random number of) daughter
particles, which thereafter evolve in a similar manner. Such Markov
stochastic processes are called branching random walks in a catalytic
branching environment; they are motivated, for example, by models of
strong reaction centres in chemical kinetics.
The goal of the project will be to study the asymptotic properties of
the particle population, in particular related to (local and global)
extinction and front propagation. One can expect that results will
involve the interplay between the recurrence/transience properties of
the underlying random walk and the intensities of the catalysts. Some
computer simulations may prove useful to guess and verify the answers.
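The kind of simulation that might be used is sketched below: a branching random walk on the one-dimensional integer lattice with a single catalyst at the origin. The branching probability and offspring law are arbitrary illustrative choices, not quantities from the project:

```python
import numpy as np

def simulate_brw(n_steps, catalysts, death_p=0.5, mean_offspring=2, rng=None):
    """Branching random walk on Z with catalytic branching: at a
    catalyst site a particle branches with probability `death_p`,
    dying and leaving a Poisson number of daughters in its place."""
    rng = rng or np.random.default_rng(0)
    particles = [0]                              # initial particle at the origin
    for _ in range(n_steps):
        new = []
        for x in particles:
            if x in catalysts and rng.random() < death_p:
                k = rng.poisson(mean_offspring)  # particle dies, k daughters born
                new.extend([x] * k)
            else:
                new.append(x + rng.choice([-1, 1]))  # simple random walk step
        particles = new
        if not particles:                        # extinction
            break
    return particles

pop = simulate_brw(30, catalysts={0}, rng=np.random.default_rng(7))
```

Repeating such runs over many seeds gives Monte Carlo estimates of extinction probabilities and of how far the population front has spread, which can then be compared with the conjectured asymptotics.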
Project 14: Estimation of Parameters of Diffusion Processes
(Prof. A.Yu. Veretennikov)

In many situations an estimate of the unknown density is required.
Amongst the most popular are kernel estimators. Once a kernel itself
is chosen, the next problem is to choose its parameter, the "width" (bandwidth). Now
suppose the observations are dependent. Can one use standard
estimators constructed for the i.i.d. case in this new situation? Should
the width be adjusted? Are the estimators optimal in some appropriate
sense?
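A minimal version of the estimator in question, with a common rule-of-thumb bandwidth derived under i.i.d. normal assumptions, might look as follows (the bandwidth formula is Silverman's rule, used here purely for illustration; the project asks precisely whether such choices remain appropriate under dependence):

```python
import numpy as np

def kde(grid, data, h):
    """Gaussian kernel density estimate evaluated on `grid`,
    with bandwidth ("width") h."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
data = rng.standard_normal(2000)                 # i.i.d. sample, for illustration
h = 1.06 * data.std() * len(data) ** (-1 / 5)    # Silverman's rule of thumb
grid = np.linspace(-3.0, 3.0, 61)
density = kde(grid, data, h)
```

For dependent data, for example observations from a diffusion sampled at fine intervals, the effective sample size is smaller than `len(data)`, so a rule derived for the i.i.d. case may undersmooth.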
Likewise, if one has a parametric model, questions arise as to whether it is
possible to use "standard" estimators of an unknown parameter in the
dependent case, whether they retain any optimal properties, and under what
conditions. Diffusion processes are the most general class of Markov
processes with continuous trajectories. This class of processes is
attractive to study because a process may be defined using only two
characteristics at any point of the state space. However, similar
settings are available for Markov chains.
Both theories, parametric and nonparametric estimation for such
processes, are under development; they rely upon recent
achievements in the theory of stochastic processes. An important part
of this concerns recurrent and ergodic properties of processes, which are
to be "incorporated" into the estimation techniques. Optimal
properties of the estimators are to be investigated.
Project 15: Data processing with complex-valued wavelets
(Dr S. Barber)

Surprisingly, complex-valued wavelets are more effective at analysing
certain types of real-valued data than real-valued wavelets. Barber &
Nason (2003, 2004) have shown that denoising real-valued data with complex
wavelets is both simple and effective. Further work is needed to extend
these methods to denoising images and complex-valued data (such as
tomographic measurements of flow in a pipe, magnetic resonance imaging
scans, and radar data). In some of these applications, generalisations of
wavelets such as wedgelets, ridgelets, and curvelets have been proposed;
complex-valued versions of these basis functions would be worth studying.
It would also be interesting to extend the use of complex-valued wavelets
to other areas of statistics where wavelets have proven effective, such as
density estimation, time series analysis, and change-point detection.
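Complex-valued wavelet transforms need specialist code, but the shrinkage idea the project builds on can be sketched with a real-valued Haar transform and the Donoho-Johnstone universal threshold. The signal and noise level below are invented for illustration; this is the baseline that the complex-valued methods improve upon:

```python
import numpy as np

def haar(x):
    """Full Haar wavelet decomposition of a signal of length 2^J."""
    approx, details = np.asarray(x, dtype=float), []
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    return approx, details

def inverse_haar(approx, details):
    for d in reversed(details):              # coarsest level first
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

def denoise(y, sigma):
    """Soft-threshold the detail coefficients at the universal
    threshold sigma * sqrt(2 log n), then invert the transform."""
    lam = sigma * np.sqrt(2.0 * np.log(len(y)))
    approx, details = haar(y)
    details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0) for d in details]
    return inverse_haar(approx, details)

rng = np.random.default_rng(3)
truth = np.repeat([0.0, 4.0, -2.0, 1.0], 32)     # piecewise-constant signal
noisy = truth + 0.5 * rng.standard_normal(truth.size)
clean = denoise(noisy, sigma=0.5)
```

For this piecewise-constant signal the thresholded reconstruction has much smaller mean squared error than the raw noisy data, which is the effect the complex-valued versions sharpen further.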
Project 16: Optimality of wavelet predictors in regression
(Dr S. Barber)

Wavelets (and extensions such as lifting or wavelet packets) provide a
multiscale representation of data. This means we can take account of both
localised and large-scale behaviour at once. This can be used in
modelling how an outcome depends on both local and global behaviour of
predictors (Hunt & Nason, 2001; Goodwin, Barber, & Aykroyd, 2004;
Goodwin, Aykroyd, & Barber, 2005). Alternative approaches have been used
such as a kernel-based approach known as geographically weighted smoothing
(Brunsdon, Fotheringham, & Charlton, 1996), but these methods do not allow
for predictors that have locally irregular behaviour (for example, mountain
ranges impose "shadows" in climate variables, or human habitation imposes
sharp changes in environmental conditions). This project will involve
studying the various competing methods and assessing their relative merits
through considering bounds on error measures for the predictions
generated by these methods.
Project 17: A wavelet-lifting approach to spatial-temporal
prediction applied to crop monitoring
(Dr R.G. Aykroyd & Dr S. Barber)

Defra have funded the monitoring of pests and diseases across a wide
variety of important crops over the last three decades. In particular,
the Central Science Laboratory have access to a unique database
containing records of pest numbers, incidence and severity of disease
(taken at various time intervals) and effectiveness of crop management
practice. The aim is to analyse the data to predict risk to the crops,
seasonal variation in this risk and the effectiveness of control
strategies, to alert farmers of emerging threats and advise on
appropriate actions.
Statistically, the problem we wish to address is one of describing the
densities of key environmental quantities over a large and
geographically diverse region given measurements on an irregular
grid. In some cases, several different quantities are measured (often
on different grids), resulting in several densities or response
surfaces. Further, for some data sets, these measurements are made at
irregular time points, creating a complex three-dimensional data
set. Once these response surfaces are estimated, they will be used to
predict the incidence of pest explosions or other undesirable
events. Both logistic regression and Bayesian modelling will be
considered as methods of making these predictions.
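Of the two, logistic regression is easy to sketch. The covariate, outcome and all numbers below are invented purely to illustrate fitting an outbreak probability by gradient ascent on the likelihood; the project's real data are far richer:

```python
import numpy as np

def fit_logistic(x, y, lr=0.5, n_iter=3000):
    """Logistic regression for a binary outcome, fitted by gradient
    ascent on the log-likelihood."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += lr * X.T @ (y - p) / len(y)
    return beta

def outbreak_prob(x, beta):
    return 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))

# Invented example: outbreak risk rises with temperature.
rng = np.random.default_rng(4)
temp = rng.uniform(5.0, 25.0, 300)
x = (temp - temp.mean()) / temp.std()            # standardise the covariate
p_true = 1.0 / (1.0 + np.exp(-(temp - 15.0)))
outbreak = (rng.random(300) < p_true).astype(float)
beta = fit_logistic(x, outbreak)
```

The fitted slope is positive and the predicted outbreak probability increases sharply with the standardised covariate, as built into the simulation.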
Project 19: Structural constraints in protein folding pathways
(Prof W. R. Gilks)

Proteins are long chains of residues, each residue being one of 20
types of amino acid. The amino acid sequence of a protein determines
its threedimensional, folded, shape. Nature knows the mapping from
sequence to structure but science does not, despite decades of
research. Currently, the best methods for predicting a protein's
structure rely on homology, i.e. sequence similarity with another
protein whose structure has been determined experimentally. In this
project, we will explore a theory that folding pathways are
constrained in local sequence-structure space, and aim to discover
these constraints through Markov modelling of known protein
structures.
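The Markov-modelling ingredient can be illustrated very simply: given sequences over a small alphabet of local-structure labels (the three-letter alphabet and the training sequences below are invented for the sketch), estimate the first-order transition probabilities between states:

```python
import numpy as np

STATES = "HEC"   # invented labels: helix, strand, coil

def transition_matrix(sequences):
    """Maximum-likelihood estimate of first-order Markov transition
    probabilities between local-structure states."""
    index = {s: i for i, s in enumerate(STATES)}
    counts = np.zeros((len(STATES), len(STATES)))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):       # count observed transitions
            counts[index[a], index[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Invented training sequences of structural states.
P = transition_matrix(["HHHHCCEEEC", "HHCCCEEEEH", "CHHHHCEEEC"])
```

Real models of folding pathways would use a much richer state space and higher-order dependence, but the estimation principle is the same.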
Project 20: Modelling highly conserved non-coding DNA
sequences
(Prof W. R. Gilks)

DNA contains genes which code for proteins, which perform diverse
cellular functions. The DNA sequence of fundamental genes is often
highly conserved across widely differing species, reflecting ancient
common origins in the universal tree of life. However, genes account
for only a small fraction of the total DNA in higher organisms. The
remaining intergenic DNA used to be thought of mostly as
poorly conserved 'junk'.
Surprisingly, recent research has revealed thousands of short
intergenic DNA sequences, conserved non-coding elements (CNEs), which are even more highly conserved
than genes, suggesting their supreme importance to the organism. Yet,
the function of these CNEs is poorly understood, and their mode of
action entirely unknown. This project aims to develop statistical
models of CNE sequence, to help find them, compare them, and uncover
their evolutionary origins.
Project 21: Reconstructing ancient genomes
(Prof W. R. Gilks)

Currently, genome sequencing projects have succeeded in revealing the
DNA sequence of about a dozen vertebrate species, ranging from human,
through chimpanzee and mouse, to frog and fish, each sequence being of
the order of billions of base pairs. More genome sequencing projects
are underway. Soon there will be enough information to reconstruct the
genome sequence of the long-extinct common ancestor of all
mammals. The aim of this project will be to develop statistical models
of genome evolution, to assist in genomic reconstruction of ancient
species.
Project 23: Analysing evolutionary trees
(Dr S. Barber & Prof W.R. Gilks)

Imagine having information on how "different" a collection of species
of plants or animals are. From this information, we might like to
draw "family tree" type diagrams to explore when different species
diverged. Such trees are called phylogenies and can be constructed in
various ways once we have information on how different the various
species are. For a more detailed discussion of phylogenies, see the
book "Inferring Phylogenies" by Felsenstein (2004).
Unfortunately, the information we have is often imperfect, so our
trees are subject to some doubt. Statistically, we might say that our
difference information has been corrupted by noise. Wavelet shrinkage
(Donoho & Johnstone, 1994, Biometrika, pp. 425-455) is a powerful tool
for removing noise when we have regularly spaced data in one or two
dimensions. The idea of lifting was developed by Sweldens (1997, SIAM
J. Math. Anal., pp. 511-546) as a means of generalising the wavelet
shrinkage method to more irregular point sets, and has since been used to
apply the method to more complex data structures. Moreover, the
lifting algorithm is strikingly similar to the algorithms used to
construct phylogenetic trees. This project will explore the use of
lifting to construct phylogenies when our information is corrupted by
noise.
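A single predict/update lifting step, stripped to its simplest form (equal weights, no account of the spacing between points, even-length input; all names are illustrative), can be sketched as follows. Real lifting schemes adapt the prediction to the irregular geometry, which is what makes them a candidate for tree-structured data:

```python
import numpy as np

def lifting_step(y):
    """One lifting step: 'split' the data into evens and odds,
    'predict' each odd value from its left neighbour (the detail is
    the prediction error), then 'update' the evens so that the
    coarse signal preserves the overall mean."""
    evens, odds = y[0::2].astype(float), y[1::2].astype(float)
    detail = odds - evens                 # predict
    coarse = evens + detail / 2.0         # update
    return coarse, detail

def inverse_lifting_step(coarse, detail):
    evens = coarse - detail / 2.0         # undo update
    odds = evens + detail                 # undo predict
    y = np.empty(coarse.size + detail.size)
    y[0::2], y[1::2] = evens, odds
    return y

y = np.array([1.0, 1.2, 3.0, 3.1, 0.0, 0.2])
coarse, detail = lifting_step(y)
```

The step is exactly invertible, and where the signal is locally smooth the details are small, so thresholding them removes noise; iterating the step on the coarse signal gives the multiscale transform.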
Project 24: Statistical Models on Cylinders with Applications to Bioinformatics
(Prof K.V. Mardia & Prof J.T. Kent)

There are various situations where there is a combination of angular
and 'distance' type variables. The simplest example is wind
direction and wind speed. More complicated problems are found in
Bioinformatics, for example in hydrogen bonding. Hydrogen bonding is
vital for determining the 3D structures found in proteins, such as
the alpha-helix, the beta-strand and so on. This project will make
suitable models motivated by the geometry of variables, and test these
on real data. One of the goals of the project is the application to
protein structure prediction, by incorporating the model into the
PHAISTOS software package (http://www.phaistos.org).
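One simple way of building a cylindrical model, sketched below with entirely invented parameter values, is to take a von Mises angle and let the log of the linear variable depend on the cosine of the angle; the simulation then exhibits exactly the angular-linear dependence such models are meant to capture:

```python
import numpy as np

rng = np.random.default_rng(5)

# Angle (e.g. wind direction): von Mises with mean mu, concentration kappa.
mu, kappa = 0.0, 2.0
theta = rng.vonmises(mu, kappa, size=5000)

# Linear part (e.g. wind speed): log-linear in cos(theta - mu), plus noise.
speed = np.exp(1.0 + 0.8 * np.cos(theta - mu) + 0.2 * rng.standard_normal(5000))

# Circular-linear association: correlate the speed with cos and sin of
# the angle (the sin component should vanish by symmetry here).
r_cos = np.corrcoef(np.cos(theta), speed)[0, 1]
r_sin = np.corrcoef(np.sin(theta), speed)[0, 1]
```

Fitting such a model to real data would estimate the angular parameters and the regression of the linear part on the angle jointly; the project will consider models of this kind tailored to the geometry of the bioinformatics variables.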
Project 25: Statistics of Surface Representation and Protein Bioinformatics
(Prof K.V. Mardia)

Many novel and challenging statistical problems are raised in the
Life Sciences. This project will develop work which will have an
impact on Drug Discovery.
Proteins provide much of the molecular machinery of life. There is a
huge variety of them in each cell of every species, each type
performing a specialized function. They work by interacting with other
molecules through 'active' sites on their surface, but the location of
these active sites is unknown for most proteins. Identifying a
protein's active sites may help in predicting which other molecules it
interacts with and ultimately its biological function.
Many attempts have been made by bioinformaticians to predict protein
active sites based on their threedimensional shape and electrical
charge profile, but with limited success. Essentially, this is a
statistical problem, as there are experimental data to guide us, but
these data are often noisy and complex. In this project we will
develop statistical models of a protein's surface to help unravel this
complexity, and improve predictions. Non-statistical and deterministic
work already exists, for example Pickering et al. (2001) and Davies et
al. (2002), but it needs to be extended to allow for statistical
variation.
No prior biological knowledge will be assumed: training will be
provided as required.
Project 26: Protein Structure and Distributions on Rotations
(Prof K.V. Mardia), in collaboration with Dr. T. Hamelryck, University of Copenhagen.

An organism's DNA, including its genes, holds almost all the
information required for its development and function. Human
understanding of this information is at an early stage, but is
accumulating rapidly due to new high-throughput forms of
experimentation. This has led to large and rapidly expanding
databases of DNA sequence, and related databases of the structure and
function of biomolecules such as proteins. Bioinformatics is concerned
with the development of these databases, and tools for deciphering and
exploiting the information they contain.
Bioinformatics has various challenging problems related to protein
structure. Protein structure can be described through some angles
known as conformational angles. There is a need for suitable
directional models to understand and investigate properties of the
structure. Mardia and Jupp's research monograph (Directional
Statistics, 2000, Wiley) provides a starting point. However, various
tools need to be developed as the conformational angles are
correlated. In particular, this project will look more deeply at how
proteins fold through the orientations of their neighbouring atoms; this
requires extending distributions on rotation matrices, such as the matrix
Fisher distribution. The research will be driven by protein data, of which
we have good representative sets from Copenhagen University as well as
from our recent research papers.
No knowledge of protein structure or directional statistics will be
assumed since full training will be given as necessary.
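For orientation, the matrix Fisher density on rotations is proportional to exp(trace(F^T R)), where F is a matrix concentration parameter. The sketch below evaluates this unnormalised log-density and draws a random rotation for comparison; all parameter values are illustrative, and the QR construction is only an approximate route to a uniform rotation:

```python
import numpy as np

def random_rotation(rng):
    """A random 3x3 rotation matrix, via QR decomposition of a
    Gaussian matrix (approximately uniform on SO(3))."""
    Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
    Q = Q @ np.diag(np.sign(np.diag(R)))     # fix column signs
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                   # force det = +1
    return Q

def matrix_fisher_logdensity(R, F):
    """Unnormalised log-density of the matrix Fisher distribution on
    rotations, proportional to exp(trace(F.T @ R))."""
    return np.trace(F.T @ R)

F = 5.0 * np.eye(3)                          # concentration about the identity
rng = np.random.default_rng(6)
R = random_rotation(rng)
```

With F a positive multiple of the identity the density peaks at the identity rotation; estimating F from observed neighbouring-atom orientations is the kind of inference the project will develop.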
Project 27: Bayesian Hierarchical Alignment Models and Flexible Proteins
(Prof K.V. Mardia)

Various new challenging and interesting problems in shape analysis
have been appearing from different scientific areas including
bioinformatics and image analysis.
In shape analysis, one assumes that the points in two or more
configurations are labeled and these configurations are to be matched
after filtering out some transformation, such as a rigid transformation.
Green, P.J. and Mardia, K.V. (2006, Biometrika, 93) have developed a
new method to align unlabelled configurations under rigid
transformation. This project will extend the methodology to allow for
various different types of transformation, mostly motivated by
problems in Protein Bioinformatics.
It is well known that proteins are the workhorses of all living
systems. A protein is a sequence of amino acids, of which there are
twenty types. The sequence folds into a three-dimensional structure. This
three-dimensional shape of a protein plays a key role in determining
its function, so proteins in which particular atoms have very similar
configurations in space often have similar functions. There is
therefore a need for efficient methodology to align proteins after
allowing for appropriate geometrical transformations. Real problems in
Protein Bioinformatics can be very complex; for example, proteins can
be flexible. This project will develop methods to allow for this
flexibility. Various data sets are already available to test the
adequacy of the tools developed under this project.
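The labelled, rigid-transformation case has a classical least-squares solution via the singular value decomposition (ordinary Procrustes analysis); the flexible and unlabelled settings of this project generalise it. The sketch below recovers a known rotation and translation from two invented landmark configurations:

```python
import numpy as np

def align(A, B):
    """Ordinary Procrustes analysis: the rotation R and translation t
    minimising ||A - (B @ R.T + t)||, found via the SVD."""
    a_mean, b_mean = A.mean(axis=0), B.mean(axis=0)
    H = (B - b_mean).T @ (A - a_mean)
    U, s, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = a_mean - b_mean @ R.T
    return R, t

# Invented landmark configurations: B is A rotated about the z-axis
# and shifted, so the alignment should recover A exactly.
rng = np.random.default_rng(8)
A = rng.standard_normal((10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
B = A @ Rz.T + np.array([1.0, -2.0, 0.5])
R, t = align(A, B)
```

For flexible proteins a single rigid transformation no longer suffices, which is exactly where the extensions developed in this project come in.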
