# Leverage and Influence

In this tutorial we consider the concepts of leverage and influence of an observation.

## Leverage

### An example for $$p=1$$

We start by simulating data which contains a clear $$x$$-space outlier:

n <- 100
x <- runif(n, 0, 10)
# make the first point an x-space outlier
x[1] <- 15
# noise with standard deviation 0.1; rnorm()'s second positional
# argument is the mean, so we name sd explicitly
y <- 1 + 2 * x + rnorm(n, sd = 0.1)

Since we didn’t change the $$y$$-value for the outlier, the point still lies close to the regression line:

m <- lm(y ~ x)
plot(x, y)
abline(m)

We expect this simulated outlier to have high leverage $$h_{ii}$$. The leverages are the diagonal entries of the hat matrix $$H = X (X^\top X)^{-1} X^\top$$, so we can compute them manually as follows:

X <- model.matrix(m)                  # design matrix, including the intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X) # the hat matrix H
hat <- diag(H)                        # the leverages are the diagonal entries h_ii
plot(x, hat, xlab=expression(x[i]), ylab=expression('leverage ' * h[ii]))

The plot shows that the outlier indeed has the highest leverage of all points.
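As a quick sanity check (not needed for the analysis, but useful to catch mistakes): the leverages always sum to the number of regression coefficients, because $$H$$ is the orthogonal projection onto the column space of $$X$$ and therefore $$\mathrm{tr}(H) = \mathrm{rank}(X) = p+1$$:

sum(hat) # equals p + 1 = 2, up to rounding error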

Instead of computing the hat matrix $$H$$ and the leverages manually, we can use the R function influence. (Despite its name, we use this function here to compute the leverages $$h_{ii}$$ rather than the influence values discussed below.) Comparing its output to the diagonal of the hat matrix confirms that we recover the same values:

infl <- influence(m)
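# quick check (assuming `hat` from the manual computation above is still
# in the workspace): influence() returns the same leverages as diag(H)
all.equal(unname(infl$hat), unname(hat))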
infl$hat[1:5] # only print five values to use less space

         1          2          3          4          5
0.11846881 0.01026841 0.01981502 0.01344492 0.02542774

### An example for $$p=2$$

In higher dimensions it becomes more difficult to spot $$x$$-space outliers by eye, but they can still be found by looking for samples with high leverage. We illustrate this using simulated data with $$p=2$$ inputs:

n <- 100
x1 <- runif(n, 0, 10)
x2 <- x1 + rnorm(n, sd = 0.05)
# make the first point an outlier
x1[1] <- 9
x2[1] <- 1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.1)

Since $$p=2$$, we can still plot the input values to verify that there is an outlier:

plot(x1, x2, asp=1)

To check whether the outlier is detected, we plot the leverages:

m <- lm(y ~ x1 + x2)
infl <- influence(m)
plot(infl$hat)

As expected, the first sample has a much higher leverage than all the other samples.
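A common rule of thumb (not used above, but a standard heuristic) flags sample $$i$$ as a high-leverage point when $$h_{ii} > 2(p+1)/n$$. We can apply it here as an extra check:

p <- 2
which(infl$hat > 2 * (p + 1) / n) # the simulated outlier should be among these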

## Cook’s D values

Cook’s D-value $$D_i^2$$ is used to quantify the influence of sample $$i$$ on the estimated regression parameters.
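One standard way to make this precise (notation not introduced above: $$\hat\beta$$ denotes the least-squares estimate from the full data, $$\hat\beta_{(i)}$$ the estimate with sample $$i$$ removed, and $$\hat\sigma^2$$ the usual estimate of the error variance) is

$$D_i^2 = \frac{\bigl(\hat\beta_{(i)} - \hat\beta\bigr)^\top X^\top X \bigl(\hat\beta_{(i)} - \hat\beta\bigr)}{(p+1)\,\hat\sigma^2},$$

which is large exactly when deleting sample $$i$$ moves the estimated coefficients a lot. To illustrate this, we simulate data with an outlier which has a strong effect on the regression line: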

n <- 100
x <- runif(n, 0, 10)
# make the first point an x-space outlier ...
x[1] <- 25
y <- 2 + 1 * x + rnorm(n, sd = 0.2)
# ... and move its y-value far away from the regression line
y[1] <- 2
m <- lm(y ~ x)
plot(x, y)
abline(m)
m2 <- lm(y[-1] ~ x[-1])
abline(m2, col="red")

In the plot, the black line is the regression line fitted to the full data; the red line is the regression line we get when sample 1 (the outlier) is omitted from the data. Clearly, these two lines are quite different, so we expect the outlier to have a large value of $$D_i^2$$.
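To quantify how different the two lines are, we can also compare the estimated coefficients directly (a small addition to the example; coef() extracts the fitted coefficients):

coef(m)  # intercept and slope using all samples
coef(m2) # intercept and slope without the outlier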

The following plot of the D-values confirms that the outlier indeed has a large value of $$D_i^2$$:

plot(cooks.distance(m))
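The values computed by cooks.distance() are closely related to the leverages from the first part of this tutorial: starting from the definition above, a short calculation gives $$D_i^2 = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1-h_{ii}}$$, where $$r_i$$ is the standardized residual of sample $$i$$. As a quick numerical check of this identity (a sketch; hatvalues(), rstandard() and cooks.distance() are all standard R functions):

h <- hatvalues(m)  # leverages h_ii, the same values as influence(m)$hat
r <- rstandard(m)  # standardized residuals r_i
# this model has p + 1 = 2 coefficients
all.equal(cooks.distance(m), r^2 / 2 * h / (1 - h))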