This text gives an example solution for the MATH5835M practical.

Overall marking criteria:

Marking criteria for the individual tasks are listed at the end of each task.

Solutions

Task 1

The first task is to import the data and to give a short description of the data. After downloading the file test-pre3.csv.gz from

https://www.seehuhn.de/maths/letters/

we can use the following code (from the web page) to import the data:

test <- read.csv("test-pre3.csv.gz")   # read.csv decompresses .gz files automatically
im <- as.matrix(test[,-1])             # pixel values, one image per row
lab <- as.factor(LETTERS[test[,1]+1])  # first column: letter index 0-25, mapped to A-Z
rm(test)                               # free the memory used by the data frame

We can see that the data set contains 5825 samples of size 1728 each: according to the web page, each sample corresponds to a 54×32 image. The column names r1c1, r1c2 etc. indicate that the first few values correspond to row 1 of the picture (r stands for “row”, c stands for “column”):

str(im)
##  int [1:5825, 1:1728] 255 255 255 255 255 255 255 255 255 255 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:1728] "r1c1" "r1c2" "r1c3" "r1c4" ...
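
As an additional check, we can look at the last few column names; assuming the naming runs row by row as described, we expect them to end at row 54, column 32, matching the stated 54×32 image size:

tail(colnames(im), 3)   # expected: "r54c30" "r54c31" "r54c32"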

The provided labels indicate which letter each image corresponds to:

str(lab)
##  Factor w/ 26 levels "A","B","C","D",..: 8 25 26 17 25 26 1 17 20 26 ...
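
As a quick plausibility check (an optional addition), we can tabulate the labels to see how many samples there are for each letter:

table(lab)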

As a test we consider the first letter. Inspecting lab[1], we see that this letter is meant to be an H. Using code similar to that from the web page, we can show the corresponding image. At the same time we verify that the value 0 corresponds to black, and that the first value in each row corresponds to the top-left pixel, by setting the corresponding value in the image data to 0.

pic <- matrix(im[1,], 32, 54)   # reshape: each column of 'pic' holds one image row
pic[1, 1] <- 0                  # set the first value (pixel r1c1) to black
image(pic[,54:1], asp=1, col = gray((0:255)/255), xlab="", ylab="")

This does indeed seem to represent an H, so everything is consistent. As expected, the top-left pixel is shown in black, because we set pic[1, 1] to 0.
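
To make it easier to inspect further letters, we can wrap this code in a small helper function. (This is a convenience added here; the name show.letter is our own choice.)

show.letter <- function(i) {
    pic <- matrix(im[i,], 32, 54)   # columns of 'pic' are image rows
    image(pic[,54:1], asp=1, col = gray((0:255)/255),
          xlab="", ylab="", main=as.character(lab[i]))
}
show.letter(1)   # should display the same H as above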

Marking criteria:

  • Does the report convince the reader that the data was imported correctly?

  • Is there an explanation of how the numbers in R correspond to images?

  • Is there an explanation of the role of the labels provided as part of the data?

  • Is the R code shown, understandable, and correct?

  • Is there any evidence that the R code from the web page has been understood, rather than just copied?

Task 2

In this task we have to compute the standard deviation of the pixels in each image. This gives a measure of how much variation there is within each image:

s <- apply(im, 1, sd)   # standard deviation of the pixel values within each image
str(s)
##  num [1:5825] 19.2 49.1 46 61.3 68.7 ...
s[1]
## [1] 19.16695

The result shows that the vector of standard deviations has the correct length 5825, and that the first standard deviation equals the value given in the hint.
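
As an additional sanity check (optional), we can recompute the first standard deviation directly from the image data and compare:

stopifnot(isTRUE(all.equal(s[1], sd(im[1,]))))   # passes silently if the values agree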

par(mfrow=c(3, 1))                 # three stacked histograms
b <- seq(10, 90, length.out = 41)  # common breaks, so the panels are comparable
hist(s, breaks=b, main=NULL, col="grey",
     xlab="all standard deviations")
hist(s[which(lab == "B")], breaks=b, main=NULL, col="grey",
     xlab="standard deviations for the letter B")
hist(s[which(lab == "C")], breaks=b, main=NULL, col="grey",
     xlab="standard deviations for the letter C")

We can see that the standard deviations for the letter “B” are towards the higher end of the range, whereas the values for “C” are towards the lower end. Maybe this is a consequence of “B” needing more ink to write than “C”: in a mostly white image, more dark pixels lead to more variation between pixel values, and thus to a larger standard deviation.
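
We can back this impression up with a quick numerical comparison (an optional check):

mean(s[which(lab == "B")])   # should be noticeably larger than ...
mean(s[which(lab == "C")])   # ... the corresponding value for "C"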

Marking criteria:

  • Does the report convince the reader that the standard deviations have been computed correctly?

  • Has a reasonable variable name been chosen for the standard deviations (e.g. s to match the \(s_i\) on the question sheet)?

  • Has care been taken to make the histograms look nice?

  • Are the samples for letters “B” and “C” selected correctly?

  • Is there some discussion of the fact that “B” has higher standard deviations than “C”?

Task 3

Here we have to find the posterior distribution of the mean \(\mu_L\) of the standard deviations for instances of the letter \(L\), using the Bayesian model specified on the question sheet.

The prior distribution for \(\mu_L\) is the uniform distribution \(\mathcal{U}[38, 55]\), with density
\[
p(\mu) = \frac{1}{55-38} 1_{[38, 55]}(\mu) = \frac{1}{17} 1_{[38, 55]}(\mu).
\]
If we denote the standard deviations corresponding to the letter \(L\) by \(s = (s_1, \ldots, s_{n_L})\), the likelihood is
\[
\begin{split}
p\bigl( s \bigm| \mu\bigr)
&= \prod_{i=1}^{n_L} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{(s_i - \mu)^2}{2\sigma^2} \Bigr) \\
&= (2\pi\sigma^2)^{-n_L/2} \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2 \Bigr),
\end{split}
\]
where \(\sigma^2 = 13^2\) is the given variance. Using Bayes’ formula, we can find the posterior distribution, up to constants, as
\[
\begin{split}
p\bigl( \mu \bigm| s \bigr)
&= \frac{p\bigl( s \bigm| \mu\bigr) p(\mu)}{p(s)} \\
&\propto \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2 \Bigr) 1_{[38, 55]}(\mu).
\end{split}
\]
Our aim is to find a proposal density which can be used to generate samples from the posterior using rejection sampling. To find an appropriate proposal density, we rewrite the exponential term to be proportional to a normal density, dropping all factors which do not depend on \(\mu\):
\[
\begin{split}
p\bigl( \mu \bigm| s \bigr)
&\propto \exp\Bigl( -\frac{1}{2\sigma^2} \bigl(n_L \mu^2 - 2 \mu \sum_{i=1}^{n_L} s_i + \sum_{i=1}^{n_L} s_i^2\bigr) \Bigr) 1_{[38, 55]}(\mu) \\
&\propto \exp\Bigl( -\frac{1}{2\sigma^2/n_L} \bigl(\mu^2 - 2 \mu \frac{1}{n_L} \sum_{i=1}^{n_L} s_i\bigr) \Bigr) 1_{[38, 55]}(\mu) \\
&\propto \frac{1}{\sqrt{2\pi \sigma^2/n_L}} \exp\Bigl( -\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L} \Bigr) 1_{[38, 55]}(\mu) \\
&=: \varphi(\mu).
\end{split}
\]
By writing the target density \(\varphi\) in this form, we see that a good proposal density is the density of a \(\mathcal{N}\bigl(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2\bigr)\)-distribution, i.e.
\[
\psi(\mu) = \frac{1}{\sqrt{2\pi \sigma^2/n_L}} \exp\Bigl( -\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L} \Bigr).
\]
For this choice of \(\varphi\) and \(\psi\) we have \(\varphi(\mu) \leq c \psi(\mu)\) with \(c = 1\): since \(\varphi(\mu) = \psi(\mu) 1_{[38, 55]}(\mu)\), the bound holds with equality on \([38, 55]\) and trivially outside. Thus, we get the following algorithm:

  • Generate \(\mu \sim \mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)\)

  • Generate \(U \sim \mathcal{U}[0,1]\)

  • Accept \(\mu\) iff \(c \psi(\mu) U \leq \varphi(\mu)\), i.e. iff \(\mu \in [38, 55]\): since \(U \leq 1\) and \(c = 1\), the condition always holds inside the interval, while \(\varphi(\mu) = 0\) outside.

Marking criteria:

  • Has the target density been found correctly?

  • Has a feasible proposal density been found?

  • Has the constant \(c\) been found, and does the report convince the reader that \(\varphi(\mu) \leq c \psi(\mu)\) holds for all \(\mu\)?

  • Does the report explain the resulting rejection sampling algorithm?

Task 4

Here we have to implement the rejection sampling algorithm from Task 3. We start by gathering the required values \(n_L\) and \(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i\) for all letters:

nL <- integer(26)        # number of samples n_L for each letter
sL.mean <- numeric(26)   # mean standard deviation for each letter
for (i in 1:26) {
    rows <- which(lab == LETTERS[i])
    nL[i] <- length(rows)
    sL.mean[i] <- mean(s[rows])
}
posterior <- data.frame(n=nL, s.mean=sL.mean, row.names = LETTERS)
posterior
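
Before generating samples, we can verify (as an optional aside) that the acceptance probability of the planned rejection sampler is high for every letter: since \(c = 1\), it equals the probability that a proposal falls inside \([38, 55]\), which we can compute using pnorm():

# acceptance probability for each letter
pnorm(55, posterior$s.mean, 13/sqrt(posterior$n)) -
    pnorm(38, posterior$s.mean, 13/sqrt(posterior$n))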

Using these values, we can now generate the samples. To be able to experiment with different letters more easily, we write a function to generate the sample:

gen.post <- function(N, letter) {
    nL <- posterior[letter,]$n             # n_L for the chosen letter
    sL.mean <- posterior[letter,]$s.mean   # mean standard deviation for the letter

    res <- numeric(N)
    done <- 0
    while (done < N) {
        m <- rnorm(1, sL.mean, 13 / sqrt(nL))   # proposal from N(mean, sigma^2/n_L)
        if (m >= 38 && m <= 55) {               # accept iff the proposal is in [38, 55]
            done <- done + 1
            res[done] <- m
        }
    }
    res
}
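
As a quick check (an optional addition), we can verify that all generated values lie inside the support \([38, 55]\) of the posterior:

range(gen.post(1000, "A"))   # both values should lie in [38, 55]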

To test the function, we produce histograms of posterior samples for “I”, “K” and “B” (top to bottom). These letters are chosen because “I” and “B” have the lowest and highest mean standard deviation, respectively, while “K” falls into the middle of the range.

par(mfrow=c(3, 1))
hist(gen.post(10000, "I"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "K"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "B"), breaks=seq(37, 56, l=39), main=NULL, col="grey")