---
title: "MATH5835M Practical Solutions"
output:
  pdf_document: default
  html_notebook: default
  html_document:
    df_print: paged
---

This text gives an example solution for the MATH5835M practical.

* Each question is worth 20 marks, and there are also 20 marks for presentation, for a total of $4 \times 20 + 20 = 100$ marks.

Overall marking criteria:

* Are all methods used and results explained?
* Do the explanations make sense?
* Is the presentation concise and well-structured?
* Is all relevant R code included in the report? Is all R code in the report relevant?
* Is R used competently?

Marking criteria for individual questions are listed at the end of each question.

# Solutions

## Task 1

The first task is to import the data and to give a short description of the data. After downloading the file `test-pre3.csv.gz` from https://www.seehuhn.de/maths/letters/ we can use the following code (from the web page) to import the data:

```{r}
test <- read.csv("test-pre3.csv.gz")
im <- as.matrix(test[,-1])
lab <- as.factor(LETTERS[test[,1]+1])
rm(test)
```

We can see that the data set contains 5835 samples of size 1728 each; according to the web page, each of the samples corresponds to a 54×32 image. The column names `r1c1`, `r1c2` etc.\ indicate that the first few values correspond to row 1 of the picture (`r` stands for "row", `c` stands for "column"):

```{r}
str(im)
```

The provided labels indicate which letter each image corresponds to:

```{r}
str(lab)
```

As a test we consider the first letter. Inspecting `lab[1]`, we see that this letter is meant to be an `r lab[1]`. Using code similar to that from the web page, we can show the corresponding image. We simultaneously verify that the value 0 corresponds to black, and that the first value in each row corresponds to the top-left pixel, by setting the corresponding value in the image data to 0.
```{r}
pic <- matrix(im[1,], 32, 54)
pic[1, 1] <- 0
image(pic[,54:1], asp=1, col = gray((0:255)/255), xlab="", ylab="")
```

This seems indeed to represent an H, so everything is consistent. As expected, the top-left pixel is shown black, because we set `pic[1, 1]` to 0.

Criteria for marking:

* Does the report convince the reader that the data was imported correctly?
* Is there an explanation of how the numbers in R correspond to images?
* Is there an explanation of the role of the labels provided as part of the data?
* Is the R code shown, understandable and correct?
* Is there any evidence that the R code from the web page has been understood, rather than just copied?

## Task 2

In this task we have to compute the standard deviation of the pixels in each image. This gives a measure of how much variation there is within each image:

```{r}
s <- apply(im, 1, sd)
str(s)
s[1]
```

The result shows that the vector of standard deviations has the correct length 5835, and that the first standard deviation equals the value given in the hint.

```{r fig.height=7}
par(mfrow=c(3, 1))
b = seq(10, 90, length.out = 41)
hist(s, breaks=b, main=NULL, col="grey", xlab="all standard deviations")
hist(s[which(lab == "B")], breaks=b, main=NULL, col="grey", xlab="standard deviations for the letter B")
hist(s[which(lab == "C")], breaks=b, main=NULL, col="grey", xlab="standard deviations for the letter C")
```

We can see that the standard deviations for the letter "B" are towards the higher end of the range, whereas the values for "C" are towards the lower end. Maybe this is a consequence of "B" needing more ink to write than "C"?

Marking criteria:

* Does the report convince the reader that the standard deviations have been computed correctly?
* Has a reasonable variable name been chosen for the standard deviations (e.g. `s` to match the $s_i$ on the question sheet)?
* Has care been taken to make the histograms look nice?
* Are the samples for letters "B" and "C" selected correctly?
* Is there some discussion of the fact that "B" has higher standard deviations than "C"?

## Task 3

Here we have to find the posterior distribution of the mean $\mu_L$ of the standard deviation for instances of the letter $L$, using the given Bayesian model. The prior distribution for $\mu_L$ is the uniform distribution $\mathcal{U}[38, 55]$, with density
$$p(\mu) = \frac{1}{55-38} 1_{[38, 55]}(\mu) = \frac{1}{17} 1_{[38, 55]}(\mu).$$
If we denote the standard deviations corresponding to the letter $L$ by $s = (s_1, \ldots, s_{n_L})$, the likelihood is
$$\begin{split}
  p\bigl( s \bigm| \mu\bigr)
  &= \prod_{i=1}^{n_L} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{(s_i - \mu)^2}{2\sigma^2} \Bigr) \\
  &= (2\pi\sigma^2)^{-n_L/2} \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2 \Bigr),
\end{split}$$
where $\sigma^2 = 13^2$ is the given variance. Using Bayes' formula, we can find the posterior distribution, up to constants, as
$$\begin{split}
  p\bigl( \mu \bigm| s \bigr)
  &= \frac{p\bigl( s \bigm| \mu\bigr) p(\mu)}{p(s)} \\
  &\propto \exp\Bigl( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_L} (s_i - \mu)^2 \Bigr) 1_{[38, 55]}(\mu).
\end{split}$$
Our aim is to find a proposal density which can be used to generate samples from the posterior using rejection sampling. To find an appropriate proposal density, we rewrite the exponential term to be proportional to a normal density:
$$\begin{split}
  p\bigl( \mu \bigm| s \bigr)
  &\propto \exp\Bigl( -\frac{1}{2\sigma^2} \bigl(n_L \mu^2 - 2 \mu \sum_{i=1}^{n_L} s_i + \sum_{i=1}^{n_L} s_i^2\bigr) \Bigr) 1_{[38, 55]}(\mu) \\
  &\propto \exp\Bigl( -\frac{1}{2\sigma^2/n_L} \bigl(\mu^2 - 2 \mu \frac{1}{n_L} \sum_{i=1}^{n_L} s_i\bigr) \Bigr) 1_{[38, 55]}(\mu) \\
  &\propto \frac{1}{\sqrt{2\pi \sigma^2/n_L}} \exp\Bigl( -\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L} \Bigr) 1_{[38, 55]}(\mu) \\
  &=: \varphi(\mu).
\end{split}$$
By writing the target density $\varphi$ in this form, we see that a good proposal density is the density of the $\mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)$-distribution, *i.e.*
$$\psi(\mu) = \frac{1}{\sqrt{2\pi \sigma^2/n_L}} \exp\Bigl( -\frac{\bigl( \mu - \frac{1}{n_L} \sum_{i=1}^{n_L} s_i \bigr)^2}{2\sigma^2/n_L} \Bigr).$$
For this choice of $\varphi$ and $\psi$ we have $\varphi(\mu) \leq c \psi(\mu)$ with $c = 1$, since $\varphi$ and $\psi$ differ only by the factor $1_{[38, 55]}(\mu) \leq 1$. Thus, we get the following algorithm:

* Generate $\mu \sim \mathcal{N}(\frac{1}{n_L} \sum_{i=1}^{n_L} s_i, \frac{1}{n_L}\sigma^2)$.
* Generate $U \sim \mathcal{U}[0,1]$.
* Accept $\mu$ if and only if $c \psi(\mu) U \leq \varphi(\mu)$, i.e. if $\mu \in [38, 55]$.

Marking criteria:

* Has the target density been found correctly?
* Has a feasible proposal density been found?
* Has the constant $c$ been found, and does the report convince the reader that $\varphi(\mu) \leq c \psi(\mu)$ holds for all $\mu$?
* Does the report explain the resulting rejection sampling algorithm?

## Task 4

Here we have to implement the rejection sampling algorithm from task 3. We start by gathering the required values $n_L$ and $\frac{1}{n_L} \sum_{i=1}^{n_L} s_i$ for all letters:

```{r}
nL <- integer(26)
sL.mean <- numeric(26)
for (i in 1:26) {
  rows <- which(lab == LETTERS[i])
  nL[i] <- length(rows)
  sL.mean[i] <- mean(s[rows])
}
posterior <- data.frame(n=nL, s.mean=sL.mean, row.names = LETTERS)
posterior
```

Using these values, we can now generate the samples. To be able to experiment with different letters more easily, we write a function to generate the sample:

```{r}
gen.post <- function(N, letter) {
  nL <- posterior[letter,]$n
  sL.mean <- posterior[letter,]$s.mean
  res <- numeric(N)
  done <- 0
  while (done < N) {
    m <- rnorm(1, sL.mean, 13 / sqrt(nL))
    if (m >= 38 && m <= 55) {
      done <- done + 1
      res[done] <- m
    }
  }
  res
}
```

To test the function, we produce histograms of posterior samples for "I", "K" and "B" (top to bottom).
These letters are chosen because "I" and "B" have the lowest and highest mean standard deviation, respectively, while "K" falls into the middle of the range.

```{r fig.height=7}
par(mfrow=c(3, 1))
hist(gen.post(10000, "I"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "K"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
hist(gen.post(10000, "B"), breaks=seq(37, 56, l=39), main=NULL, col="grey")
```

Using `gen.post()`, we can now generate samples for Monte Carlo estimates. In your report, you only need to consider a few letters. Here I make a table of all results, to help with marking:

```{r}
N <- 1e5
post.mean <- numeric(26)
post.mean.RMSE <- numeric(26)
post.var <- numeric(26)
for (i in 1:26) {
  mu <- gen.post(N, LETTERS[i])
  post.mean[i] <- mean(mu)
  post.mean.RMSE[i] <- sd(mu) / sqrt(N)
  post.var[i] <- var(mu)
}
results <- data.frame(post.mean = post.mean,
                      post.var = post.var,
                      post.mean.RMSE = post.mean.RMSE,
                      sample.mean = posterior$s.mean,
                      row.names = LETTERS)
results
```

The table shows all the results. The column `post.mean.RMSE` shows that the sample size $N$ is, by far, large enough for the posterior mean to be estimated accurately. A more thorough analysis could also take the accuracy of the posterior variance estimate into account. Producing the table takes a few seconds on my laptop, so there is scope to increase $N$, if needed.

The table, and the plot below, show that the Bayesian estimate for the mean differs clearly from the classical estimate (*i.e.* from the sample mean) only for the letters "C", "I" and "L". There is also a small deviation for the letter "B":

```{r}
plot(results$post.mean - results$sample.mean, xaxt="n", xlab="letter", ylab="post. mean - sample mean")
axis(1, at=1:26, labels=LETTERS)
```

Marking criteria:

* Are the Monte Carlo samples generated correctly?
* Are the posterior mean and variance for at least two different letters estimated?
* Is at least one of the letters considered close enough to the edge of the prior interval that the Bayesian estimate differs from the classical estimate?
* Is the Monte Carlo sample size chosen wisely?
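
As an optional cross-check on the Monte Carlo estimates (not required by the practical), note that the posterior found in task 3 is a normal distribution truncated to $[38, 55]$, so its mean is available in closed form. Writing $m = \frac{1}{n_L} \sum_{i=1}^{n_L} s_i$ and $\sigma_n^2 = \sigma^2 / n_L$ (symbols introduced here for brevity), the standard formula for the mean of a $\mathcal{N}(m, \sigma_n^2)$-distribution truncated to $[a, b] = [38, 55]$ is
$$\mathbb{E}\bigl[ \mu \bigm| s \bigr]
  = m + \sigma_n \, \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)},
  \qquad
  \alpha = \frac{a - m}{\sigma_n}, \quad
  \beta = \frac{b - m}{\sigma_n},$$
where $\phi$ and $\Phi$ denote the standard normal density and distribution function (not the target density $\varphi$ from task 3). This can be evaluated in R with `dnorm()` and `pnorm()`, and the estimated posterior means in the table above should agree with these exact values to within the reported `post.mean.RMSE`.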