Multiple Linear Regression in R

This is a slightly extended version of the file I showed during the MATH3714 example class on 11 October 2017.

First Experiment: Multiple Linear Regression with Simulated data

Basic Usage:

The lm() command can be used for multiple linear regression, by separating the input using + signs:

n <- 100
x1 <- runif(n)
x2 <- runif(n)
eps <- rnorm(n, sd=0.1)
y <- 1 + 2*x1 + 3*x2 + eps

Using this data, we can perform linear regression.

m <- lm(y ~ x1 + x2)
m

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2  
      0.984        2.020        3.042  

Note that R has automatically added an intercept to the model.
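As a quick check (not part of the class demonstration), the implicit intercept can also be written explicitly as `1 + ...` on the right-hand side; both formulas describe the same model:

```r
# y ~ x1 + x2 and y ~ 1 + x1 + x2 fit the same model: R adds the
# intercept automatically unless we suppress it explicitly.
set.seed(1)                      # for reproducibility
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2*x1 + 3*x2 + rnorm(n, sd=0.1)

m1 <- lm(y ~ x1 + x2)            # intercept added automatically
m2 <- lm(y ~ 1 + x1 + x2)        # intercept written explicitly
all.equal(coef(m1), coef(m2))    # TRUE: the two fits coincide
```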

Recognising Unused Variables

As an experiment, we remove x2 from the computation of the response y:

n <- 100
x1 <- runif(n)
x2 <- runif(n)
eps <- rnorm(n, sd=0.1)
y <- 1 + 2*x1 + eps

Now the response no longer depends on x2; the corresponding coefficient is effectively 0. Can we recognise that y does not depend on x2?

m <- lm(y ~ x1 + x2)
summary(m)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26718 -0.06039  0.01150  0.05450  0.21596 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.03213    0.02595  39.778   <2e-16 ***
x1           1.97760    0.03411  57.978   <2e-16 ***
x2          -0.02009    0.03504  -0.573    0.568    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09539 on 97 degrees of freedom
Multiple R-squared:  0.972, Adjusted R-squared:  0.9714 
F-statistic:  1681 on 2 and 97 DF,  p-value: < 2.2e-16

Looking at the output we see that the estimated coefficient for x2 is very small, but due to the noise it is not exactly equal to zero. The absence of star symbols at the end of the x2 line indicates that beta2 is not significantly different from 0.
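A complementary way to make the same check (an addition to the original notes) is to compute confidence intervals for the coefficients with confint(); the interval for x2 should typically contain 0, while the intervals for the intercept and x1 should not:

```r
# Simulate data where y does not depend on x2, then inspect
# 95% confidence intervals for the fitted coefficients.
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2*x1 + rnorm(n, sd=0.1)   # x2 is not used here

m <- lm(y ~ x1 + x2)
ci <- confint(m, level=0.95)
ci                                  # the x2 row will typically straddle 0
```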

Switching Off the Intercept

n <- 100
x1 <- runif(n)
x2 <- runif(n)
eps <- rnorm(n, sd=0.1)
y <- 2*x1 + 3*x2 + eps

If we fitted a model as before, we would get a very small value as the estimated intercept. If we know there is no intercept and want to fit a model which does not include one at all, we can write 0 + at the start of the right-hand side of the formula to prevent R from adding an intercept:

m <- lm(y ~ 0 + x1 + x2)
summary(m)

Call:
lm(formula = y ~ 0 + x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18952 -0.06841  0.01049  0.08078  0.24010 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
x1  1.97720    0.02465    80.2   <2e-16 ***
x2  3.00659    0.02616   114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09481 on 98 degrees of freedom
Multiple R-squared:  0.9987,    Adjusted R-squared:  0.9987 
F-statistic: 3.844e+04 on 2 and 98 DF,  p-value: < 2.2e-16
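An equivalent notation, not shown in the class file, is to subtract 1 from the formula: y ~ x1 + x2 - 1 fits the same intercept-free model as y ~ 0 + x1 + x2:

```r
# Two equivalent ways of suppressing the intercept in a formula.
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 2*x1 + 3*x2 + rnorm(n, sd=0.1)

m0 <- lm(y ~ 0 + x1 + x2)        # intercept suppressed via 0 +
m1 <- lm(y ~ x1 + x2 - 1)        # intercept suppressed via - 1
all.equal(coef(m0), coef(m1))    # TRUE: both fits omit the intercept
```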

Second Experiment: A Small, Real Data Set

Here we will learn some shortcuts which can be used to simplify calls to lm(). We consider the stackloss data set built into R:

stackloss

In this model, the column stack.loss is the output, and we try to predict this value using the other columns as input. First, we try a naive approach:

y <- stackloss$stack.loss
x1 <- stackloss$Air.Flow
x2 <- stackloss$Water.Temp
x3 <- stackloss$Acid.Conc.
m <- lm(y ~ x1 + x2 + x3)
summary(m)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
x1            0.7156     0.1349   5.307  5.8e-05 ***
x2            1.2953     0.3680   3.520  0.00263 ** 
x3           -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

If the data is stored in a data frame, we can use the data=... argument to lm() to save some typing:

m <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
summary(m)

Call:
lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
    data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09
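A further advantage of fitting with data=... (an aside beyond the original notes) is that predict() can then be applied directly to new inputs given as a data frame with matching column names; the values below are a hypothetical new observation chosen for illustration:

```r
# Predict stack.loss for a new observation. The input values are
# made up, but lie within the range of the stackloss data.
m <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
new.obs <- data.frame(Air.Flow=60, Water.Temp=21, Acid.Conc.=87)
predict(m, newdata=new.obs)
```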

As a shortcut, you can write . instead of listing the inputs, to indicate that the inputs are all columns of the data frame except for the response:

m <- lm(stack.loss ~ ., data=stackloss)
summary(m)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2377 -1.7117 -0.4551  2.3614  5.6978 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09
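The dot shortcut can also be combined with subtraction in the formula; as a final aside (not in the class file), this fits the model using all columns except Acid.Conc. as inputs:

```r
# Use every column of stackloss except the response and Acid.Conc.
m <- lm(stack.loss ~ . - Acid.Conc., data=stackloss)
coef(m)   # intercept, Air.Flow and Water.Temp only
```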