---
title: "MATH5835M Practical Solutions"
output:
pdf_document: default
html_notebook: default
html_document:
df_print: paged
---
This text gives an example solution for the MATH5835M practical.
* Each question is worth 20 marks, and there are also 20 marks for
presentation, for a total of $4 \times 20 + 20 = 100$ marks.
Overall marking criteria:
* Are all methods used and results explained?
* Do the explanations make sense?
* Is the presentation concise and well-structured?
* Is all relevant R code included in the report? Is all R code in
the report relevant?
* Is R used competently?
Marking criteria for individual questions are listed at the end
of each question.
# Solutions
## Task 1
The first task is to import the data and to give a short description of
the data. After downloading the file `test-pre3.csv.gz` from
https://www.seehuhn.de/maths/letters/
we can use the following code (from the web page) to import the data:
```{r}
test <- read.csv("test-pre3.csv.gz")
im <- as.matrix(test[,-1])
lab <- as.factor(LETTERS[test[,1]+1])
rm(test)
```
We can see that the data set contains 5835 samples of size 1728 each:
according to the web page, each of the samples corresponds to a 54×32
image. The column names `r1c1`, `r1c2` etc.\ indicate that the first
few values correspond to row 1 of the picture (`r` stands for "row",
`c` stands for "column"):
```{r}
str(im)
```
The provided labels indicate which letter each image corresponds to:
```{r}
str(lab)
```
As a test we consider the first letter. Inspecting `lab[1]`, we see that
this letter is meant to be an `r lab[1]`. Using code similar to that
from the web page, we can show the corresponding image. At the same time we
verify that the value 0 corresponds to black and that the first value
in each row corresponds to the top-left pixel, by setting the corresponding
value in the image data to 0.
```{r}
pic <- matrix(im[1,], 32, 54)
pic[1, 1] <- 0
image(pic[,54:1], asp=1, col = gray((0:255)/255), xlab="", ylab="")
```
This does indeed seem to represent an H, so everything is consistent. As expected,
the top-left pixel is shown black, because we set `pic[1, 1]` to 0.
Criteria for marking:
* Does the report convince the reader that the data was imported correctly?
* Is there an explanation of how the numbers in R correspond to images?
* Is there an explanation of the role of the labels provided as part of
the data?
* Is the R code shown, understandable and correct?
* Is there any evidence that the R code from the web page has been
understood, rather than just copied?
## Task 2
In this task we have to compute the standard deviation of the
pixels in each image. This gives a measure of how much variation
there is within each image:
```{r}
s <- apply(im, 1, sd)
str(s)
s[1]
```
The result shows that the vector of standard deviations has the
correct length 5835, and that the first standard deviation equals
the value given in the hint.
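As a small aside, the call `apply(im, 1, sd)` applies `sd` along `MARGIN = 1`, i.e. row by row, so we obtain one standard deviation per image. A tiny toy matrix illustrates this behaviour:
```{r}
# toy example: apply() with MARGIN = 1 gives one value per row
m.demo <- matrix(c(1, 2, 3,
                   4, 6, 8), nrow = 2, byrow = TRUE)
apply(m.demo, 1, sd)  # sd(c(1, 2, 3)) = 1, sd(c(4, 6, 8)) = 2
```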
```{r fig.height=7}
par(mfrow=c(3, 1))
b <- seq(10, 90, length.out = 41)
hist(s, breaks=b, main=NULL, col="grey",
xlab="all standard deviations")
hist(s[which(lab == "B")], breaks=b, main=NULL, col="grey",
xlab="standard deviations for the letter B")
hist(s[which(lab == "C")], breaks=b, main=NULL, col="grey",
xlab="standard deviations for the letter C")
```
We can see that the standard deviations for the letter "B" lie towards
the higher end of the range, whereas the values for "C" lie towards
the lower end. Maybe this is a consequence of "B" needing more ink
to write than "C"?
Marking criteria:
* Does the report convince the reader that the standard deviations
have been computed correctly?
* Has a reasonable variable name been chosen for the standard
  deviations (e.g. `s` to match the $s_i$ on the question sheet)?
* Has care been taken to make the histograms look nice?
* Are the samples for letters "B" and "C" selected correctly?
* Is there some discussion of the fact that "B" has higher
standard deviations than "C"?
## Task 3
Here we have to find the posterior distribution of the mean $\mu_L$ of
the standard deviations for instances of the letter $L$, using the given Bayesian model.
The prior distribution for $\mu_L$ is the uniform distribution
$\mathcal{U}[38, 55]$, with density
$$
p(\mu)
= \frac{1}{55-38} 1_{[38, 55]}(\mu)
= \frac{1}{17} 1_{[38, 55]}(\mu)
$$
If we denote the standard deviations corresponding to the letter $L$
by $s = (s_1, \ldots, s_{n_L})$, the likelihood is
$$
\begin{split}
p\bigl( s \bigm| \mu\bigr)
&= \prod_{i=1}^{n_L} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{(s_i - \mu)^2}{2\sigma^2} \Bigr) \\
&= (2\pi\sigma^2)^{-n_L/2} \exp\Bigl(
-\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2
\Bigr)
\end{split}
$$
where $\sigma^2 = 13^2$ is the given variance. Using Bayes' formula,
we can find the posterior distribution, up to constants, as
$$
\begin{split}
p\bigl( \mu \bigm| s \bigr)
&= \frac{p\bigl( s \bigm| \mu\bigr) p(\mu)}{p(s)} \\
&\propto \exp\Bigl(
-\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2
\Bigr) 1_{[38, 55]}(\mu)
\end{split}
$$
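As a quick sanity check (on synthetic data, not the letters data), we can verify that the closed-form log-likelihood above agrees with summing `dnorm(..., log = TRUE)`:
```{r}
set.seed(1)
sigma0 <- 13
mu0 <- 45
s.sim <- rnorm(10, mean = mu0, sd = sigma0)  # synthetic values, for checking only
n <- length(s.sim)
# closed-form log-likelihood from the formula above
ll.formula <- -n / 2 * log(2 * pi * sigma0^2) -
  sum((s.sim - mu0)^2) / (2 * sigma0^2)
ll.dnorm <- sum(dnorm(s.sim, mean = mu0, sd = sigma0, log = TRUE))
all.equal(ll.formula, ll.dnorm)
```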
Our aim is to find a proposal density which can be used to generate
samples from the posterior using rejection sampling. To find an
appropriate proposal density, we rewrite the exponential term
to be proportional to a normal density:
$$
\begin{split}
p\bigl( \mu \bigm| s \bigr)
&\propto \exp\Bigl(
-\frac{1}{2\sigma^2} \bigl(n_L \mu^2 - 2 \mu \sum_{i=1}^{n_L} s_i + \sum_{i=1}^{n_L} s_i^2\bigr)
\Bigr) 1_{[38, 55]}(\mu) \\
&\propto \exp\Bigl(
-\frac{1}{2\sigma^2/n_L} \bigl(\mu^2 - 2 \mu \frac{1}{n_L} \sum_{i=1}^{n_L} s_i\bigr)
\Bigr) 1_{[38, 55]}(\mu) \\
&\propto \frac{1}{\sqrt{2\pi \sigma^2/n_L}}
\exp\Bigl(
-\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L}
\Bigr) 1_{[38, 55]}(\mu) \\
&=: \varphi(\mu).
\end{split}
$$
By writing the (unnormalised) target density $\varphi$ in this form, we see that
a good proposal density will be the density of a $\mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)$-distribution, *i.e.*
$$
\psi(\mu)
= \frac{1}{\sqrt{2\pi \sigma^2/n_L}}
\exp\Bigl(
-\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L}
\Bigr).
$$
For this choice of $\varphi$ and $\psi$ we have $\varphi(\mu) \leq c
\psi(\mu)$ with $c = 1$, since $\varphi(\mu) = \psi(\mu) \, 1_{[38, 55]}(\mu)
\leq \psi(\mu)$ for all $\mu$. Thus, we get the following algorithm:
* Generate $\mu \sim \mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)$
* Generate $U \sim \mathcal{U}[0,1]$
* Accept $\mu$ iff $c \psi(\mu) U \leq \varphi(\mu)$, *i.e.* iff
  $\mu \in [38, 55]$.
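Since $c = 1$, the acceptance probability of this sampler is simply the probability that the proposal lands in $[38, 55]$, which we can compute with `pnorm()`. The values for $n_L$ and the sample mean below are hypothetical examples, just to illustrate that acceptance is near-certain when the sample mean lies well inside the interval:
```{r}
# hypothetical example values (not taken from the data)
nL.ex <- 200
mean.ex <- 44
sigma.ex <- 13
# P(proposal in [38, 55]) = acceptance probability when c = 1
pnorm(55, mean.ex, sigma.ex / sqrt(nL.ex)) -
  pnorm(38, mean.ex, sigma.ex / sqrt(nL.ex))
```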
Marking criteria:
* Has the target density been found correctly?
* Has a feasible proposal density been found?
* Has the constant $c$ been found, and does the report convince the
reader that $\varphi(\mu) \leq c \psi(\mu)$ holds for all $\mu$?
* Does the report explain the resulting rejection sampling algorithm?
## Task 4
Here we have to implement the rejection sampling algorithm
from task 3. We start by gathering the required values $n_L$ and
$\frac{1}{n_L} \sum_{i=1}^{n_L} s_i$ for all letters:
```{r}
nL <- integer(26)
sL.mean <- numeric(26)
for (i in 1:26) {
rows <- which(lab == LETTERS[i])
nL[i] <- length(rows)
sL.mean[i] <- mean(s[rows])
}
posterior <- data.frame(n=nL, s.mean=sL.mean, row.names = LETTERS)
posterior
```
Using these values, we can now generate the samples. To be able to
experiment with different letters more easily, we write a function
to generate the sample:
```{r}
gen.post <- function(N, letter) {
  nL <- posterior[letter, ]$n
  sL.mean <- posterior[letter, ]$s.mean
  res <- numeric(N)
  done <- 0
  while (done < N) {
    # proposal from N(mean of the s_i, sigma^2 / n_L), with sigma = 13
    m <- rnorm(1, sL.mean, 13 / sqrt(nL))
    # accept iff the proposal lies in the prior interval [38, 55]
    if (m >= 38 && m <= 55) {
      done <- done + 1
      res[done] <- m
    }
  }
  res
}
```
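As an aside, the accept/reject loop can also be vectorised by proposing a whole batch of candidates at once and keeping the accepted ones. The sketch below takes `nL` and `sL.mean` as arguments, shown here with hypothetical example values instead of the lookup from `posterior`:
```{r}
# vectorised sketch of the same rejection sampler
gen.post.vec <- function(N, nL, sL.mean, sigma = 13) {
  res <- numeric(0)
  while (length(res) < N) {
    m <- rnorm(N, sL.mean, sigma / sqrt(nL))
    res <- c(res, m[m >= 38 & m <= 55])  # keep proposals inside [38, 55]
  }
  res[1:N]
}
mu.check <- gen.post.vec(1000, nL = 200, sL.mean = 44)  # hypothetical values
range(mu.check)  # all samples lie inside the prior interval
```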
To test the function, we produce histograms of posterior samples
for "I", "K" and "B" (top to bottom). These letters are chosen
because "I" and "B" have the lowest and highest mean standard deviation,
respectively, while "K" falls in the middle of the range.
```{r fig.height=7}
par(mfrow=c(3, 1))
hist(gen.post(10000, "I"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "K"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "B"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
```
Using `gen.post()`, we can now generate samples for Monte Carlo estimates.
In your report you only need to consider a few letters; here I tabulate
the results for all letters, to help with marking:
```{r}
N <- 1e5
post.mean <- numeric(26)
post.mean.RMSE <- numeric(26)
post.var <- numeric(26)
for (i in 1:26) {
mu <- gen.post(N, LETTERS[i])
post.mean[i] <- mean(mu)
post.mean.RMSE[i] <- sd(mu) / sqrt(N)
post.var[i] <- var(mu)
}
results <- data.frame(post.mean = post.mean, post.var = post.var,
post.mean.RMSE = post.mean.RMSE,
sample.mean = posterior$s.mean,
row.names = LETTERS)
results
```
The column `post.mean.RMSE` in the table shows that the sample size $N$
is, by far, large enough for the posterior mean to be estimated accurately.
A more thorough analysis could also take the accuracy of the posterior
variance estimate into account. Producing the table takes a few seconds
on my laptop, so there is scope to increase $N$ if needed.
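Since the posterior is a normal distribution truncated to $[38, 55]$, its mean is also available in closed form, which gives an independent check on the Monte Carlo estimates. A sketch using the standard truncated-normal mean formula, with hypothetical values for $n_L$ and the sample mean:
```{r}
# exact mean of N(m, s^2) truncated to [a, b]:
#   m + s * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha)),
# where alpha = (a - m)/s and beta = (b - m)/s
trunc.mean <- function(m, s, a = 38, b = 55) {
  alpha <- (a - m) / s
  beta <- (b - m) / s
  m + s * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))
}
trunc.mean(44, 13 / sqrt(200))  # well inside the interval: essentially 44
trunc.mean(37, 13 / sqrt(200))  # near the edge: pulled up above 38
```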
The table and the plot below show that the Bayesian estimate for
the mean differs clearly from the classical estimate (*i.e.* from the
sample mean) only for the letters "C", "I" and "L". There is also a small
deviation for the letter "B":
```{r}
plot(results$post.mean - results$sample.mean, xaxt="n",
xlab="letter", ylab="post. mean - sample mean")
axis(1, at=1:26, labels=LETTERS)
```
The plot confirms that the differences are largest for letters whose mean
standard deviation lies closest to the edges of the prior interval $[38, 55]$.
Marking criteria:
* Are the Monte Carlo samples generated correctly?
* Are the posterior mean and variance for at least two different letters estimated?
* Is at least one of the letters considered close enough to the
edge of the prior interval, that the Bayesian estimate differs from the
classical estimate?
* Is the Monte Carlo sample size chosen wisely?