Leverage and Influence

In this tutorial we consider the concepts of leverage and influence of an observation.

Leverage

An example for \(p=1\)

We start by simulating data which has a clear x-space outlier:

n <- 100
x <- runif(n, 0, 10)
# make the first point an outlier
x[1] <- 15
y <- 1 + 2 * x + rnorm(n, 0.1)

Since we didn’t change the \(y\)-value for the outlier, the point still lies close to the regression line:

m <- lm(y ~x)
plot(x, y)
abline(m)

We expect this simulated outlier to have high leverage \(h_{ii}\). Computing leverage manually, we get the following:

X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)
hat <- diag(H)
plot(x, hat, xlab=expression(x[i]), ylab=expression('leverage ' * h[ii]))

The plot shows that the outlier has indeed the highest leverage of all points.

Instead of computing the hat-matrix \(H\) and the leverage manually, we can use the R function influence. (Despite it’s name, we use the function to compute the “leverage” \(h_{ii}\).) Comparing the values to the values from the diagonal of the hat matrix shows that we have recovered the same values here.

infl <- influence(m)
infl$hat[1:5] # only print five values to use less space
         1          2          3          4          5 
0.11846881 0.01026841 0.01981502 0.01344492 0.02542774 

An Example for p=2

In higher dimensions, it becomes more difficult to spot \(x\)-space outliers manually, but it is still possible to find \(x\)-space outliers by looking out for samples with high leverage. We illustrate this here using simulated data with \(p=2\) inputs:

n <- 100
x1 <- runif(n, 0, 10)
x2 <- x1 + rnorm(n, 0.05)
# make the first point an outlier
x1[1] <- 9
x2[1] <- 1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n, 0.1)

Since \(p=2\), we can still plot the input values to verify that there is an outlier:

plot(x1, x2, asp=1)

To check whether the outlier is detected, we plot the leverages:

m <- lm(y ~ x1 + x2)
infl <- influence(m)
plot(infl$hat)

As expected, the first sample has a leverage which is much higher than for the other samples.

Cook’s D values

Cook’s D-value \(D_i^2\) is used to quantify the influence of sample \(i\) in the estimated regression parameters. To illustrate this, we simulate data with an outlier which has a strong effect on the regression line.

n <- 100
x <- runif(n, 0, 10)
x[1] <- 25
y <- 2 + 1 * x + rnorm(n, 0.2)
y[1] <- 2
m <- lm(y ~ x)
plot(x, y)
abline(m)
m2 <- lm(y[-1] ~ x[-1])
abline(m2, col="red")

In the plot, the black line is the fitted regression line for the full data; the red line is the regression line we get when sample 1 (the outlier) is omitted from the data. Clearly, these two lines are quite different, so we expect the outlier to have large value of \(D_i^2\).

The following plot shows that this is indeed the case:

plot(cooks.distance(m))

LS0tCnRpdGxlOiAiVHV0b3JpYWwgNSIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKIyBMZXZlcmFnZSBhbmQgSW5mbHVlbmNlCgpJbiB0aGlzIHR1dG9yaWFsIHdlIGNvbnNpZGVyIHRoZSBjb25jZXB0cyBvZiBsZXZlcmFnZSBhbmQgaW5mbHVlbmNlCm9mIGFuIG9ic2VydmF0aW9uLgoKIyMgTGV2ZXJhZ2UKCiMjIyBBbiBleGFtcGxlIGZvciAkcD0xJAoKV2Ugc3RhcnQgYnkgc2ltdWxhdGluZyBkYXRhIHdoaWNoIGhhcyBhIGNsZWFyIHgtc3BhY2Ugb3V0bGllcjoKYGBge3J9Cm4gPC0gMTAwCnggPC0gcnVuaWYobiwgMCwgMTApCiMgbWFrZSB0aGUgZmlyc3QgcG9pbnQgYW4gb3V0bGllcgp4WzFdIDwtIDE1Cgp5IDwtIDEgKyAyICogeCArIHJub3JtKG4sIDAuMSkKYGBgCgpTaW5jZSB3ZSBkaWRuJ3QgY2hhbmdlIHRoZSAkeSQtdmFsdWUgZm9yIHRoZSBvdXRsaWVyLCB0aGUgcG9pbnQgc3RpbGwgbGllcwpjbG9zZSB0byB0aGUgcmVncmVzc2lvbiBsaW5lOgpgYGB7cn0KbSA8LSBsbSh5IH54KQpwbG90KHgsIHkpCmFibGluZShtKQpgYGAKCldlIGV4cGVjdCB0aGlzIHNpbXVsYXRlZCBvdXRsaWVyIHRvIGhhdmUgaGlnaCBsZXZlcmFnZSAkaF97aWl9JC4gIENvbXB1dGluZwpsZXZlcmFnZSBtYW51YWxseSwgd2UgZ2V0IHRoZSBmb2xsb3dpbmc6CmBgYHtyfQpYIDwtIG1vZGVsLm1hdHJpeChtKQpIIDwtIFggJSolIHNvbHZlKHQoWCkgJSolIFgpICUqJSB0KFgpCmhhdCA8LSBkaWFnKEgpCnBsb3QoeCwgaGF0LCB4bGFiPWV4cHJlc3Npb24oeFtpXSksIHlsYWI9ZXhwcmVzc2lvbignbGV2ZXJhZ2UgJyAqIGhbaWldKSkKYGBgCgpUaGUgcGxvdCBzaG93cyB0aGF0IHRoZSBvdXRsaWVyIGhhcyBpbmRlZWQgdGhlIGhpZ2hlc3QgbGV2ZXJhZ2Ugb2YgYWxsIHBvaW50cy4KCkluc3RlYWQgb2YgY29tcHV0aW5nIHRoZSBoYXQtbWF0cml4ICRIJCBhbmQgdGhlIGxldmVyYWdlIG1hbnVhbGx5LAp3ZSBjYW4gdXNlIHRoZSBSIGZ1bmN0aW9uIGBpbmZsdWVuY2VgLiAgKERlc3BpdGUgaXQncyBuYW1lLCB3ZSB1c2UgdGhlCmZ1bmN0aW9uIHRvIGNvbXB1dGUgdGhlICJsZXZlcmFnZSIgJGhfe2lpfSQuKSAgQ29tcGFyaW5nIHRoZSB2YWx1ZXMgdG8gdGhlCnZhbHVlcyBmcm9tIHRoZSBkaWFnb25hbCBvZiB0aGUgaGF0IG1hdHJpeCBzaG93cyB0aGF0IHdlIGhhdmUgcmVjb3ZlcmVkCnRoZSBzYW1lIHZhbHVlcyBoZXJlLgoKYGBge3J9CmluZmwgPC0gaW5mbHVlbmNlKG0pCmluZmwkaGF0WzE6NV0gIyBvbmx5IHByaW50IGZpdmUgdmFsdWVzIHRvIHVzZSBsZXNzIHNwYWNlCmBgYAoKIyMjIEFuIEV4YW1wbGUgZm9yIHA9MgoKSW4gaGlnaGVyIGRpbWVuc2lvbnMsIGl0IGJlY29tZXMgbW9yZSBkaWZmaWN1bHQgdG8gc3BvdCAkeCQtc3BhY2Ugb3V0bGllcnMKbWFudWFsbHksIGJ1dCBpdCBpcyBzdGlsbCBwb3NzaWJsZSB0byBmaW5kICR4JC1zcGFjZSBvdXRsaWVycyBieSBsb29raW5nIG91dApmb3Igc2FtcGxlcyB3aXRoIGhpZ2ggbGV2ZXJhZ2UuICBXZSBpbGx1c3RyYXRlIHRoaXMgaGVyZSB1c2luZyBzaW11bGF0ZWQKZGF0YSB3aXRoICRwPTIkIGlucHV0czoKYGBge3J9Cm4gPC0gMTAwCngxIDwtIHJ1bmlmKG4sIDAsIDEwKQp4MiA8LSB4MSArIHJub3JtKG4sIDAuMDUpCgojIG1ha2UgdGhlIGZpcnN0IHBvaW50IGFuIG91dGxpZXIKeDFbMV0gPC0gOQp4MlsxXSA8LSAxCgp5IDwtIDEgKyAyICogeDEgKyAzICogeDIgKyBybm9ybShuLCAwLjEpCmBgYAoKU2luY2UgJHA9MiQsIHdlIGNhbiBzdGlsbCBwbG90IHRoZSBpbnB1dCB2YWx1ZXMgdG8gdmVyaWZ5IHRoYXQgdGhlcmUgaXMgYW4Kb3V0bGllcjoKYGBge3J9CnBsb3QoeDEsIHgyLCBhc3A9MSkKYGBgCgpUbyBjaGVjayB3aGV0aGVyIHRoZSBvdXRsaWVyIGlzIGRldGVjdGVkLCB3ZSBwbG90IHRoZSBsZXZlcmFnZXM6CmBgYHtyfQptIDwtIGxtKHkgfiB4MSArIHgyKQppbmZsIDwtIGluZmx1ZW5jZShtKQpwbG90KGluZmwkaGF0KQpgYGAKCkFzIGV4cGVjdGVkLCB0aGUgZmlyc3Qgc2FtcGxlIGhhcyBhIGxldmVyYWdlIHdoaWNoIGlzIG11Y2ggaGlnaGVyIHRoYW4gZm9yCnRoZSBvdGhlciBzYW1wbGVzLgoKCiMjIENvb2sncyBEIHZhbHVlcwoKQ29vaydzIEQtdmFsdWUgJERfaV4yJCBpcyB1c2VkIHRvIHF1YW50aWZ5IHRoZSBpbmZsdWVuY2Ugb2Ygc2FtcGxlICRpJAppbiB0aGUgZXN0aW1hdGVkIHJlZ3Jlc3Npb24gcGFyYW1ldGVycy4gIFRvIGlsbHVzdHJhdGUgdGhpcywgd2Ugc2ltdWxhdGUKZGF0YSB3aXRoIGFuIG91dGxpZXIgd2hpY2ggaGFzIGEgc3Ryb25nIGVmZmVjdCBvbiB0aGUgcmVncmVzc2lvbiBsaW5lLgoKYGBge3J9Cm4gPC0gMTAwCnggPC0gcnVuaWYobiwgMCwgMTApCnhbMV0gPC0gMjUKeSA8LSAyICsgMSAqIHggKyBybm9ybShuLCAwLjIpCnlbMV0gPC0gMgoKbSA8LSBsbSh5IH4geCkKcGxvdCh4LCB5KQphYmxpbmUobSkKCm0yIDwtIGxtKHlbLTFdIH4geFstMV0pCmFibGluZShtMiwgY29sPSJyZWQiKQpgYGAKCkluIHRoZSBwbG90LCB0aGUgYmxhY2sgbGluZSBpcyB0aGUgZml0dGVkIHJlZ3Jlc3Npb24gbGluZSBmb3IgdGhlIGZ1bGwgZGF0YTsKdGhlIHJlZCBsaW5lIGlzIHRoZSByZWdyZXNzaW9uIGxpbmUgd2UgZ2V0IHdoZW4gc2FtcGxlIDEgKHRoZSBvdXRsaWVyKSBpcwpvbWl0dGVkIGZyb20gdGhlIGRhdGEuICBDbGVhcmx5LCB0aGVzZSB0d28gbGluZXMgYXJlIHF1aXRlIGRpZmZlcmVudCwgc28gd2UKZXhwZWN0IHRoZSBvdXRsaWVyIHRvIGhhdmUgbGFyZ2UgdmFsdWUgb2YgICREX2leMiQuCgpUaGUgZm9sbG93aW5nIHBsb3Qgc2hvd3MgdGhhdCB0aGlzIGlzIGluZGVlZCB0aGUgY2FzZToKCmBgYHtyfQpwbG90KGNvb2tzLmRpc3RhbmNlKG0pKQpgYGAKCg==