Linear Regression in R

This is a slightly extended version of the file I showed during the MATH3714 example class on 27 September 2017.

First Experiment: Simple Linear Regression with Simulated Data

We start by simulating a very simple data set. We fix the seed of the random number generator to 1, so that you will get the same random numbers when you re-run the R code yourself.

set.seed(1)
n <- 100
x <- runif(n, 1, 7)
y <- 2 + 1*x + rnorm(n, 0, 0.2)

This should give 100 points \((x_i, y_i)\) which are all close to a straight line with slope 1 and intercept 2. To verify that this is the case, we plot the data:

plot(x, y, xlim=c(0,10), ylim=c(0,10), asp=1)
abline(2, 1)
abline(h=2)
abline(v=0)

We can use linear regression to try to recover the values for intercept and slope. Linear regression in R is done using the lm() command:

m <- lm(y ~ x)
summary(m)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36996 -0.11244 -0.01741  0.10485  0.50332 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.95372    0.05192   37.63   <2e-16 ***
x            1.01041    0.01178   85.76   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1882 on 98 degrees of freedom
Multiple R-squared:  0.9869,    Adjusted R-squared:  0.9867 
F-statistic:  7355 on 1 and 98 DF,  p-value: < 2.2e-16

From the output we see that the intercept is estimated as 1.95372, which is reasonably close to the exact value of 2. Similarly, the slope is estimated as 1.01041, which is reasonably close to 1. Finally, the “residual standard error” is reported as 0.1882 (third line from the bottom), which is the estimate of the standard deviation of the error term (we used 0.2 when we simulated the data).
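
The same numbers can also be extracted programmatically instead of being read off the printout. A minimal sketch using standard accessor functions from base R:

coef(m)     # estimated intercept and slope
confint(m)  # 95% confidence intervals for the coefficients
sigma(m)    # residual standard error, i.e. the estimate of the error standard deviation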

We can get the design matrix using the command model.matrix():

model.matrix(m)
    (Intercept)        x
1             1 2.593052
2             1 3.232743
3             1 4.437120
4             1 6.449247
5             1 2.210092
6             1 6.390338
7             1 6.668052
8             1 4.964787
9             1 4.774684
10            1 1.370718
11            1 2.235847
12            1 2.059341
13            1 5.122137
14            1 3.304622
15            1 5.619049
16            1 3.986195
17            1 5.305711
18            1 6.951437
19            1 3.280211
20            1 5.664671
21            1 6.608231
22            1 2.272855
23            1 4.910043
24            1 1.753331
25            1 2.603324
26            1 3.316685
27            1 1.080342
28            1 3.294328
29            1 6.218145
30            1 3.042094
31            1 3.892481
32            1 4.597395
33            1 3.961248
34            1 2.117306
35            1 5.964240
36            1 5.010800
37            1 5.765439
38            1 1.647662
39            1 5.342266
40            1 3.467647
41            1 5.925678
42            1 4.882361
43            1 5.697597
44            1 4.318218
45            1 4.178317
46            1 5.736137
47            1 1.139987
48            1 3.863380
49            1 5.393882
50            1 5.156389
51            1 3.865718
52            1 6.167257
53            1 3.628583
54            1 2.468784
55            1 1.424074
56            1 1.596797
57            1 2.897630
58            1 4.111806
59            1 4.972030
60            1 3.440981
61            1 6.477256
62            1 2.761620
63            1 3.754394
64            1 2.994368
65            1 4.905223
66            1 2.548101
67            1 3.871271
68            1 5.597864
69            1 1.505481
70            1 6.251928
71            1 3.034438
72            1 6.036642
73            1 3.080101
74            1 3.002650
75            1 3.858107
76            1 6.353190
77            1 6.186037
78            1 3.339937
79            1 5.663924
80            1 6.763708
81            1 3.607957
82            1 5.275088
83            1 3.399966
84            1 2.952113
85            1 5.542523
86            1 2.216154
87            1 5.266727
88            1 1.730152
89            1 2.472931
90            1 1.859826
91            1 2.437776
92            1 1.353606
93            1 4.853730
94            1 6.257615
95            1 5.673488
96            1 5.783853
97            1 3.731647
98            1 3.460504
99            1 5.865221
100           1 4.629600
attr(,"assign")
[1] 0 1
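
As a cross-check, we can reproduce the least-squares estimates by hand from this design matrix via the normal equations \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A short sketch; the result should match coef(m):

X <- model.matrix(m)
solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations for beta-hat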

Experiment 2: Simple Linear Regression with Real Data

We consider the cars data set, built into R:

plot(cars)

To fit a linear model, we use the lm() command again, this time with the columns speed and dist from the data frame cars. To tell lm() which data frame contains the data, we pass the argument data=cars.

m <- lm(dist ~ speed, data=cars)
summary(m)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
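
Once the model is fitted, predict() gives fitted values and intervals for new speeds. As a sketch, a 95% prediction interval for the stopping distance at a speed of 21 mph (an arbitrary value chosen for illustration):

predict(m, newdata=data.frame(speed=21), interval="prediction")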

Using the abline() command, it is easy to add the fitted regression line to a plot:

plot(cars)
abline(m)
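
R can also produce the standard diagnostic plots (residuals against fitted values, a QQ-plot, and so on) directly from the fitted model object; a brief sketch:

par(mfrow=c(2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(m)
par(mfrow=c(1,1))  # restore the default plot layout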
