This text gives an example solution for the MATH5835M practical.

• Each question is worth 20 marks, and there are also 20 marks for presentation, for a total of $$4 \times 20 + 20 = 100$$ marks.

Overall marking criteria:

• Are all methods used and results explained?

• Do the explanations make sense?

• Is the presentation concise and well-structured?

• Is all relevant R code included in the report? Is all R code in the report relevant?

• Is R used competently?

Marking criteria for individual questions are listed at the end of each question.

# Solutions

The first task is to import the data and to give a short description of the data. After downloading the file test-pre3.csv.gz from from

https://www.seehuhn.de/maths/letters/

we can use the following code (from the web page) to import the data:

test <- read.csv("test-pre3.csv.gz")
im <- as.matrix(test[,-1])
lab <- as.factor(LETTERS[test[,1]+1])
rm(test)

We can see that the data set contains 5835 samples of size 1728 each: according to the web page, each of the samples corresponds to a 54×32 image. The column names r1c1, r1c2 etc. indicate that the first few values corrspond to row 1 of the picture (r stands for “row”, c stands for “column”):

str(im)
##  int [1:5825, 1:1728] 255 255 255 255 255 255 255 255 255 255 ...
##  - attr(*, "dimnames")=List of 2
##   ..$: NULL ## ..$ : chr [1:1728] "r1c1" "r1c2" "r1c3" "r1c4" ...

The provided labels indicate which letter each image corresponds to:

str(lab)
##  Factor w/ 26 levels "A","B","C","D",..: 8 25 26 17 25 26 1 17 20 26 ...

As a test we consider the first letter. Inspecting lab[1], we see that this letter is mean to be an H. Using the code similar to that from the web page, we can show the corresponding image. We simultaneously verify that the value 0 corresponds to black, and the the first value in each row corresponds to the top-left pixel, but setting the corresponding value in the image data to 0.

pic <- matrix(im[1,], 32, 54)
pic[1, 1] <- 0
image(pic[,54:1], asp=1, col = gray((0:255)/255), xlab="", ylab="")

This seems indeed to represent an H, so everything is consistent. As expected the top-left pixel is shown black, because we set im[1, 1] to 0.

Criteria for marking:

• Does the report convince the reader that the data was imported correctly?

• Is there an explanation of how the numbers in R correspond to images?

• Is there an explanation of the role of the labels provided as part of the data?

• Is the R code shown, understandable and correct?

• Is there any evidence that the R code from the web page has been understood, rather than just copied?

In this task we have to compute the standard deviation of the pixels in each image. This gives a measure of how much variation there is within each image:

s <- apply(im, 1, sd)
str(s)
##  num [1:5825] 19.2 49.1 46 61.3 68.7 ...
s[1]
## [1] 19.16695

The result shows that the vector of standard deviations has the correct length 5825, and that the first standard deviation equals the value given in the hint.

par(mfrow=c(3, 1))
b = seq(10, 90, length.out = 41)
hist(s, breaks=b, main=NULL, col="grey",
xlab="all standard deviations")
hist(s[which(lab == "B")], breaks=b, main=NULL, col="grey",
xlab="standard deviations for the letter B")
hist(s[which(lab == "C")], breaks=b, main=NULL, col="grey",
xlab="standard deviations for the letter C")

We can see that standard deviations for the letter “B” are towards the higher end of standard deviations, whereas the values for “C” are towards the lower end of the range. Maybe this is a consequence of “B” needing more ink to write than “C”?

Marking criteria:

• Does the report convince the reader that the standard deviations have been computed correctly?

• Has a reasonable variable name been chosen for the standard deviations (e.g. s to match the $$s_i$$ on the question sheet).

• Has care been taken to make the histograms look nice?

• Are the samples for letters “B” and “C” selected correctly?

• Is there some discussion of the fact that “B” has higher standard deviations than “C”?

Here we have to find the posterior distribution of the mean $$\mu_L$$ of the standard deviation for instances of the letter $$L$$, using some Bayesian model.

The prior distribution for $$\mu_L$$ is the uniform distribution $$\mathcal{U}[38, 55]$$, with density $p(\mu) = \frac{1}{55-38} 1_{[38, 55]}(\mu) = \frac{1}{17} 1_{[38, 55]}(\mu)$ If we denote the standard deviations corresponding to the letter $$L$$ by $$s = (s_1, \ldots, s_{n_L})$$, the likelihood is $\begin{split} p\bigl( s \bigm| \mu\bigr) &= \prod_{i=1}^{n_L} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{(s_i - \mu)^2}{2\sigma^2} \Bigr) \\ &= (2\pi\sigma^2)^{-n_L/2} \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2 \Bigr) \end{split}$ where $$\sigma^2 = 13^2$$ is the given variance. Using Bayes’ formula, we can find the posterior distribution, up to constants, as $\begin{split} p\bigl( \mu \bigm| s \bigr) &= \frac{p\bigl( s \bigm| \mu\bigr) p(\mu)}{p(s)} \\ &\propto \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2 \Bigr) 1_{[38, 55]}(\mu) \end{split}$ Or aim is to find a proposal density, which can be used to generate samples from the posterior, using rejection sampling. To find an appropriate proposal density, we rewrite the exponential term to be proportional to a normal distribution: $\begin{split} p\bigl( \mu \bigm| s \bigr) &\propto \exp\Bigl( -\frac{1}{2\sigma^2} \bigl(n_L \mu^2 - 2 \mu \sum_{i=1}^{n_L} s_i + \sum_{i=1}^{n_L} s_i^2\bigr) \Bigr) 1_{[38, 55]}(\mu) \\ &\propto \exp\Bigl( -\frac{1}{2\sigma^2/n_L} \bigl(\mu^2 - 2 \mu \frac{1}{n_L} \sum_{i=1}^{n_L} s_i\bigr) \Bigr) 1_{[38, 55]}(\mu) \\ &\propto \frac{1}{\sqrt{2\pi \sigma^2/n_L}} \exp\Bigl( -\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L} \Bigr) 1_{[38, 55]}(\mu) \\ &=: \varphi(\mu). \end{split}$ By writing the target density $$\varphi$$ in this form, we see that a good proposal density will be the density of a $$\mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)$$-distribution, i.e. $\psi(x) = \frac{1}{\sqrt{2\pi \sigma^2/n_L}} \exp\Bigl( -\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L} \Bigr).$ For this choice of $$\varphi$$ and $$\psi$$ we have $$\varphi(\mu) \leq c \psi(\mu)$$ where $$c = 1$$. Thus, we get the following algorithm:

• Generate $$\mu \sim \mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)$$

• Generate $$U \sim \mathcal{U}[0,1]$$

• Accept $$\mu$$, iff $$c \psi(\mu) U \leq \varphi(\mu)$$, i.e. if $$\mu \in [38, 55]$$.

Marking criteria:

• Has the target density been found correctly?

• Has a feasible proposal density been found?

• Has the constant $$c$$ been found, and does the report convince the reader that $$\varphi(\mu) \leq c \psi(\mu)$$ holds for all $$\mu$$?

• Does the report explain the resulting rejection sampling algorithm?

Here we have to implement the rejection sampling algorithm from task 3. We start, by gathering the required values $$n_L$$ and $$\frac{1}{n_L} \sum_{i=1}^{n_L} s_i$$ for all letters:

nL <- integer(26)
sL.mean <- numeric(26)
for (i in 1:26) {
rows <- which(lab == LETTERS[i])
nL[i] <- length(rows)
sL.mean[i] <- mean(s[rows])
}
posterior <- data.frame(n=nL, s.mean=sL.mean, row.names = LETTERS)
posterior

Using these values, we can now generate the samples. To be able to experiment with different letters more easily, we write a function to generate the sample:

gen.post <- function(N, letter) {
nL <- posterior[letter,]$n sL.mean <- posterior[letter,]$s.mean

res <- numeric(N)
done <- 0
while (done < N) {
m <- rnorm(1, sL.mean, 13 / sqrt(nL))
if (m >= 38 && m <= 55) {
done <- done + 1
res[done] <- m
}
}
res
}

To test the function, we produce histograms of posterior samples for “I”, “K” and “B” (top to bottom). These letters are chosen, because “I” and “B” have the lowest and highest mean standard deviation, respectively, while “K” falls into the middle of the range.

par(mfrow=c(3, 1))
hist(gen.post(10000, "I"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "K"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "B"), breaks=seq(37, 56, l=39), main=NULL, col="grey")