Chapter 3: Topics in Regression

Lawrence M. Leemis

Chapter 3 Topics in Regression

The previous two chapters have provided a detailed introduction to the basic principles underlying simple linear regression. This chapter will cover some additional topics in regression, but not with the same detail as in the previous two chapters. Sometimes just a single example will illustrate a regression topic that deserves an entire chapter in a full-semester regression course. The topics considered in this chapter are forcing a regression line through the origin, diagnostics, remedial procedures, the matrix approach to simple linear regression, multiple linear regression, weighted least squares estimators, regression models with nonlinear terms, and logistic regression.

3.1 Regression Through the Origin

Applications occasionally arise in which it is of benefit to force a regression line to pass through the origin. To illustrate such applications, return to Examples 1.1 and 1.3 in which Bob and Cheryl each had the number of sales per week as an independent variable X. In both of the examples, [latex]X = 0[/latex] sales per week corresponds to [latex]Y = 0[/latex] commissions (for Bob) and [latex]Y = 0[/latex] revenue per week (from Cheryl’s sales). In these settings it is sensible to force the regression line to pass through the origin; estimating a population intercept does not make sense. The resulting regression model does not contain the β₀ parameter. The simple linear regression model forced through the origin, sometimes abbreviated RTO for regression through the origin, is defined next.

The regression parameter β₁ can be estimated using least squares from the data pairs [latex](X_i, \, Y_i)[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex].

The next example conducts a hypothesis test to determine whether it is appropriate to drop the intercept term from the simple linear regression model based on the data pairs, and then proceeds to fit the reduced model.

Example 3.1 The R built-in data set Formaldehyde consists of the [latex]n = 6[/latex] data pairs given in Table 3.1. The independent variable carb is the carbohydrate level (ml) and the dependent variable optden is the optical density in a chemical experiment. Fit a simple linear regression to the model using the ordinary least squares estimates. If there is no statistically significant difference between the estimated intercept and zero, then fit a simple linear regression model forcing the regression line to pass through the origin to the data pairs.

Table 3.1: Formaldehyde data set from R.
carb	optden
0.1	0.086
0.3	0.269
0.5	0.446
0.6	0.538
0.7	0.626
0.9	0.782

The scatterplot given in Figure 3.1 shows a strong linear relationship between carbohydrates (measured in ml) and optical density (measured by the reading of the resulting purple color on a spectrophotometer) for the [latex]n = 6[/latex] data pairs. The nearly-perfect linear relationship provides overwhelming visual evidence that a simple linear regression model is appropriate for approximating the relationship between X and Y.

Figure 3.1: A scatterplot of the Formaldehyde data pairs.

Long Description for Figure 3.1

The horizontal axis X ranges from 0 to 1 in increments of 0.2 units. The vertical axis Y ranges from 0 to 0.8 in increments of 0.2 units. 6 data points plotted in the quadrant follow a linear, increasing trend, from left to right. The points are as follows. (0.1, 0.1); (0.3, 0.3); (0.5, 0.4); (0.6, 0.5); (0.7, 0.6); (0.9, 0.8). All data are approximate.

The R commands below fit the standard simple linear regression model (including an intercept) to the six data pairs.

The point estimates for the intercept and slope of the regression line are

$\begin{array}{l} {\hat{β}}_{0} = 0.00509 and {\hat{β}}_{1} = 0.876 . \end{array}$

The call to summary(fit) indicates that there is no statistically significant difference between the point estimate for the intercept and 0. The p-value associated with the hypothesis test

$\begin{array}{l} H_{0} : β_{0} = 0 \end{array}$

versus

$\begin{array}{l} H_{0} : β_{0} \neq 0 \end{array}$

is 0.55, which is statistical evidence that the intercept does not differ significantly from [latex]\beta_0 = 0[/latex]. This p-value, perhaps along with some information about the chemical experiment itself, might cause the experimenter to consider the reduced model which is forced through the origin. This hypothesis test requires normally distributed error terms. The usual analysis of residuals to determine whether a simple linear regression model with normal error terms is appropriate in this setting will be abandoned here because of the small sample size. Histograms and statistical tests have diminished meaning with only [latex]n = 6[/latex] data pairs. The best we can do to assess the normality of the error terms is to use a graphical display such as a QQ plot.

Using Theorem 3.1, the least squares estimate for the slope of the regression line forced through the origin is

$\begin{array}{l} {\hat{β}}_{1} = \frac{\sum_{i = 1}^{n} X_{i} Y_{i}}{\sum_{i = 1}^{n} X_{i}^{2}} = 0.884, \end{array}$

which can be calculated with the R statements given below.

Not surprisingly, the slope of the regression line forced through the origin is very close to the slope of the regression line with the model that includes an intercept. The optical density increases by 0.884 for every unit increase in the carbohydrates. Figure 3.2 contains a scatterplot of the data pairs and the associated regression line forced through the origin. The model clearly provides an adequate approximation to the relationship between the independent variable X and the dependent variable Y over the scope of the model shown in Figure 3.2.

Figure 3.2: A scatterplot of the Formaldehyde data pairs with the fitted regression line.

Long Description for Figure 3.2

The horizontal axis X ranges from 0 to 1 in increments of 0.2 units. The vertical axis Y ranges from 0 to 0.8 in increments of 0.2 units. 6 data points plotted on the X Y plane follow a linear, increasing trend. The points are as follows. (0.1, 0.1); (0.3, 0.3); (0.5, 0.4); (0.6, 0.5); (0.7, 0.6); (0.9, 0.8). The line of regression, from the origin slopes up toward the top right of the X Y plane and all the data points fall on the line of regression. All data are approximate.

These calculations can be performed in R by adding -1 or +0 to the formula argument in the lm function, which forces the regression line to pass through the origin.

These R statements calculate the estimated slope of the regression line as [latex]\hat \beta_1 = 0.884[/latex].

Analogous theorems to those that were applied to simple linear regression with a population intercept parameter β₀ and a population slope parameter β₁ from Chapter 1 can also be derived associated with the simple linear regression model forced through the origin. In addition, the assumption of normal error terms from Chapter 2 can be added to the simple linear regression model forced through the origin, which allows for statistical inference (that is, constructing confidence intervals and performing hypothesis tests) concerning the population slope of the regression line β₁. For example, the additional R command

gives a very narrow 95% confidence interval for β₁ as

$\begin{array}{l} 0.869 < β_{1} < 0.899 . \end{array}$

The narrowness of the confidence interval is a reflection of how close the points fall to the regression line in Figure 3.2.

The next example revisits the regression modeling of the stopping distance as a function of the speed of a car in the built-in cars data frame.

Example 3.2 Recall from Example 2.8 that X, the speed of a car in miles per hour, was used as an independent variable, and Y, the stopping distance in feet, was used as a dependent variable in a simple linear regression model. There are [latex]n = 50[/latex] data pairs in the cars data frame that is built into R. One critique of the simple linear regression model that was constructed for the data pairs in the built-in cars data frame from Example 2.8 was that the regression function did not pass through the origin (stationary cars require no stopping distance). Write R code to estimate the slope of the regression line through the origin and comment on the acceptability of this model.

The physics of the experiment indicates that stationary cars require no distance to stop, so forcing a regression line through the origin is appropriate in this setting. The R code below estimates the slope of the regression line that is forced to pass through the origin.

Figure 3.3 is a scatterplot of the data pairs (not jittered for ties) with the regression line superimposed. A car requires an additional distance of [latex]\hat \beta_1 = 2.91[/latex] feet to stop for every additional mile per hour in speed.

Figure 3.3: Fitted model [latex]Y = \hat \beta_1 X[/latex] of speed X and stopping distance Y for the cars data.

Long Description for Figure 3.3

The horizontal axis X ranges from 0 to 25 in increments of 5. The vertical axis Y ranges from 0 to 120 in increments of 20. The line of regression emerging from the origin has a positive slope. 30 data points are below the line of regression. 14 points are above the line of regression. 5 points are on the line of regression. All data are approximate.

The additional R statements

reveal that 32 data pairs fall below the regression line and only 18 data pairs fall above the regression line. A plot of the standardized residuals can be generated with the R statements

and is given in Figure 3.4. The sum of squares increases from [latex]SSE = 11,354[/latex] as calculated in Example 2.8 for the full simple linear regression model to [latex]SSE = 12,954[/latex] by forcing the regression line through the origin. It is universally the case that SSE stays the same or increases by forcing the regression line to pass through the origin. Using the model as a subscript, this can be written symbolically as

$\begin{array}{l} S S E_{Y = β_{0} + β_{1} X + ϵ} \leq S S E_{Y = β_{1} X + ϵ} . \end{array}$

Figure 3.4: Standardized residuals for the cars data.

Long Description for Figure 3.4

The horizontal axis X i ranges from 0 to 25 in increments of 5. The vertical axis e i over square root of M S E ranges from negative 3 to 3 in increments of 1. Multiple data points are plotted on the X Y plane. A dotted horizontal line, the line of regression, originating from 0 on the vertical axis, extends parallel to the X axis. It has 0 slope. Two dotted horizontal lines, originating from negative 2 and 2 on the vertical axis, extend across the quadrant, parallel to the X axis. 14 points line between the line of regression and the horizontal line originating from 2 on the vertical axis. 32 points line between the line of regression and the horizontal line originating from negative 2 on the vertical axis. 2 points fall on the line of regression. Two points fall outside the horizontal line originating from 2 on the vertical axis.

The nonsymmetry of the residuals in Figure 3.4 suggests that the fitted linear regression function might not be adequate. Perhaps a regression model with higher-order terms or a nonlinear model is worth investigating.

This ends the discussion of forcing the regression line through the origin. Occasions arise in regression modeling in which it is more appropriate to fit a statistical model with fewer parameters. Some of the results from the full simple linear regression model generalize to simple linear regression forced through the origin. The point estimate for β₁, for example, is unbiased. Three examples of results that do not generalize are (a) the residuals do not necessarily sum to zero, (b) the regression line does not necessarily pass through the point [latex]\big( \bar X, \, \bar Y \big)[/latex], and (c) it is possible that SSE can exceed the total sum of squares SST, which can result in a negative value of R².

3.2 Diagnostics

Diagnostic procedures are applied to fitted regression models to assess their conformity to the assumptions (for example, constant variance of the error terms) implicit in the simple linear regression model. We have already considered one such diagnostic procedure from the previous chapter, which is the examination of the residuals to assess their independence, constant variance, and normality. Two other diagnostic procedures will be examined here, which are the identification of data pairs known as leverage points and the identification of data pairs known as influential points. The subsequent section considers remedial procedures, which can be applied to a regression model that fails to satisfy one or more of the assumptions implicit in a regression model.

3.2.1 Leverage

Data pairs that have the ability to exert more influence on the regression line than other data pairs due to their independent variable values are known as leverage points. These data pairs should be given more scrutiny than the others because of the potential tug that they have on the regression line. More specifically, when the value of the independent variable is unusually far from [latex]\bar {X}[/latex] (either low or high), the data pair has the potential to exert more pull on the regression line than other points.

We begin developing the notion of leverage by expressing the predicted value of Y_i, denoted by [latex]\hat{Y}_i[/latex], as a function of Y_i. Using Theorems 1.1 and 1.3, the predicted value of Y_i is

$\begin{array}{l} {\hat{Y}}_{i} & = {\hat{β}}_{0} + {\hat{β}}_{1} X_{i} \\ = \bar{Y} - {\hat{β}}_{1} \bar{X} + {\hat{β}}_{1} X_{i} \\ = \bar{Y} + {\hat{β}}_{1} (X_{i} - \bar{X}) \\ = \frac{1}{n} \sum_{j = 1}^{n} Y_{j} + \sum_{j = 1}^{n} a_{j} Y_{j} (X_{i} - \bar{X}) \\ = \frac{1}{n} \sum_{j = 1}^{n} Y_{j} + \sum_{j = 1}^{n} \frac{X_{j} - \bar{X}}{S_{X X}} Y_{j} (X_{i} - \bar{X}) \\ = \sum_{j = 1}^{n} [\frac{1}{n} + \frac{(X_{i} - \bar{X}) (X_{j} - \bar{X})}{S_{X X}}] Y_{j} \\ = \sum_{j = 1}^{n} h_{i j} Y_{j} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. The [latex]h_{ij}[/latex] values form the elements of an [latex]n \times n[/latex] matrix [latex]{\bf H}[/latex], which is often referred to as the hat matrix or the projection matrix. The reason that this matrix is known as the projection matrix is that it provides a linear transformation from the observed values of the dependent variable to the associated fitted values. The diagonal elements of the hat matrix are known as the leverages of the data pairs, which are defined next.

The leverage is a measure of a data pair’s potential to influence the regression line. Notice that the leverage is a function of the values of the independent variable [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] only; the heights of the data pairs do not play a role. Since the two denominators in the expression from Definition 3.2 are constants for a particular data set, only the numerator [latex]\left( X_i - \bar X \right) ^ 2[/latex] changes for each value of X_i. It reflects the distance between a particular X_i value and its associated sample mean. The leverage increases as the distance between X_i and [latex]\bar X[/latex] increases. There are several results concerning the leverages; one that concerns the average of the leverages is presented next.

To summarize what we know about the n leverages,

the leverages are the diagonal elements of the hat matrix H,
all leverages are positive, with a minimum of [latex]1 / n[/latex] (for [latex]X_i = \bar X[/latex]) and a maximum of 1, and
the sum of the leverages is 2, so the average of the leverages is [latex]2 / n[/latex].

If all of the leverages are equal (this is always the case, for example, for [latex]n = 2[/latex] data pairs), then each leverage is [latex]2 / n[/latex], which is the average from Theorem 3.2. We would like to establish a threshold at which a data pair has the ability to exert a significant influence over the regression line so that such points might be examined with additional scrutiny. Such data pairs are known as leverage points. Although not used universally, a common way to identify a leverage point is if the leverage [latex]h_{ii}[/latex] is more than twice the average of the leverages. Symbolically, a point is designated a leverage point if

$\begin{array}{l} h_{i i} > \frac{4}{n} . \end{array}$

This threshold will be illustrated in the next example.

Example 3.3 To illustrate the identification of leverage points, we consider the first data set in Anscombe’s quartet. For notational convenience, the [latex]n = 11[/latex] data pairs have been ordered by their independent variable values in Table 3.2. We will investigate the leverages associated with this data set and two other data sets with an extra data pair appended.

Table 3.2: Data set I (sorted by X_i) in Anscombe’s quartet.
X_i	Y_i
4.0	4.26
5.0	5.68
6.0	7.24
7.0	4.82
8.0	6.95
9.0	8.81
10.0	8.04
11.0	8.33
12.0	10.84
13.0	7.58
14.0	9.96

The R code below calculates the [latex]n = 11[/latex] leverages using the formula from Definition 3.2.

Notice that the values of [latex]Y_1, \, Y_2, \, \ldots, \, Y_{11}[/latex] are not needed to compute the leverages. The leverages are displayed in Table 3.3. Not surprisingly, the leverages are symmetric about [latex]\bar X = 9[/latex] because the values of the independent variable are equally spaced. The leverage for [latex]X_6 = 9[/latex] is just [latex]1 / n = 1 / 11 \cong 0.09[/latex], which is the first term in [latex]h_{ii}[/latex] in Definition 3.2. None of the leverages exceeds the threshold value [latex]4 / n = 4 / 11 \cong 0.36[/latex], so this data set does not contain any leverage points.

Table 3.3: Leverages for data set I in Anscombe’s quartet.
i	1	2	3	4	5	6	7	8	9	10	11
X_i	4	5	6	7	8	9	10	11	12	13	14
h_ii	0.32	0.24	0.17	0.13	0.10	0.09	0.10	0.13	0.17	0.24	0.32

Calculating leverages is so common in regression analysis that R has two built-in functions that calculate leverages. The hat function calculates the leverages for Anscombe’s first data set with the single statement

Alternatively, the hatvalues function with the fitted model as an argument can be used to calculate the leverages.

The top graph in Figure 3.5 is a scatterplot of the data pairs and the associated regression line. From a cursory visual assessment, using a simple linear regression model to describe the relationship between X and Y seems reasonable for these data pairs. The leverages for the first three data pairs are identified on the graph. All three graphs in Figure 3.5 have the same horizontal and vertical scales for easier comparison.

Figure 3.5: Fitted regression models and leverage points.

Long Description for Figure 3.5

The horizontal axis X ranges from 4 to 9 and the vertical axis Y ranges from 3 to 13. The top and middle graphs have 11 data pairs plotted with a line of regression having a positive slope, originating from (0,4). Three points fall below the line of regression, 4 are on the line, and three are above the line of regression. In the top graph, the data point (4, 4) is indicted as h 11 equal 0.32. Similarly, the data points (5,6) and (7, 8) are indicated as h 22 equals 0.24 and h 33 equals 0.17 respectively. In the middle graph, the twelfth data pair (19, 12.5) that fall on the line of regression is circled. In the bottom graph, the line of regression, with a positive slope originates at (0, 6) and 12 data pairs are plotted. Four data pairs fall below the line of regression. Two of them fall on the line of regression, and four are above the line of regression. The data pair (19,4) is circled.

The middle graph in Figure 3.5 includes all of the data values from the Anscombe’s first data set, but adds the additional data pair [latex](19, \, 12.5)[/latex] which was gleaned from Anscombe’s fourth data set. The leverages are given in Table 3.4, with the leverage for the data pair [latex](19, \, 12.5)[/latex] set in boldface because it has a leverage that exceeds [latex]4 / n = 4 / 12 \cong 0.33[/latex]. This data pair is a leverage point that warrants particular scrutiny. Although the data pair has the ability to exert unusual effect on the regression line, it is clear that the data point does not alter the regression line from where it was in the top graph. So although the new data pair is a leverage point (and is therefore circled in the middle graph), it does not contradict the existing trend from the other 11 points. In this sense, the leverage point provides some (scant) evidence that the scope of the model can be extended from [latex]4 \le X \le 14[/latex] to [latex]4 \le X \le 19[/latex].

Table 3.4: Leverages for data set I in Anscombe’s quartet with appended [latex]X_{12} = 19[/latex].
i	1	2	3	4	5	6	7	8	9	10	11	12
X_i	4	5	6	7	8	9	10	11	12	13	14	19
$h_{i i}$	0.25	0.20	0.16	0.12	0.10	0.09	0.08	0.09	0.11	0.13	0.17	0.50

The bottom graph in Figure 3.5 includes all of the data values from the Anscombe’s first data set, but adds the additional data pair [latex](19, \, 4)[/latex]. Since the values of the independent variable have not changed, the leverages match those from Table 3.4. The leverage point [latex](19, \, 4)[/latex] is circled on the graph. This leverage point exerts a significant downward tug on the right side of the regression line relative to the pattern established by the first 11 data pairs. A simple linear regression model is not appropriate in this case. There are several potential explanations for the deleterious effects of this leverage point.

The leverage point might have been incorrectly recorded.
The leverage point might be fundamentally different than the others and does not belong in the data set.
The leverage point might indicate that a nonlinear regression model is appropriate.
The leverage point might signal that the scope of the model should be restricted to [latex]4 \le X \le 14[/latex], where a simple linear regression appears to be appropriate.
The leverage point is legitimate and not fundamentally different than the others. It might just happen to be an extreme value. The linear model still might be appropriate, but more data pairs need to be collected to show that this is the case.

The previous example has indicated a fitted simple linear regression model is likely to pass close to a leverage point. Leverage points exert more tug on the regression line than those points whose independent variable value is closer to [latex]\bar X[/latex]. The next illustration of identifying leverage points revisits the heights of couples from Example 2.7.

Example 3.4 Identify the leverage points for the [latex]n = 96[/latex] pairs of couples heights from Example 2.7.

The following R statements load the PBImisc package, set x to the heights of the wives, set y to the associated heights of the husbands, calculate the leverages using the hat function, store the indexes of those points whose leverage exceeds [latex]4 / n[/latex] in the vector i, plot the data pairs using the plot function, plot the regression line using the abline function, and circle the leverage points using the symbols function.

The resulting graph is displayed in Figure 3.6. There are a total of ten leverage points—seven on the left end of the scope of the model and three on the right end of the scope of the model. Examining each of the ten leverage points carefully, nine of the ten do not seem out of step with the rest of the data values. The leverage point [latex](147, \, 178)[/latex], however, which corresponds to an unusually short wife marrying and fairly tall husband, is clearly a point that exerts a significant upward tug on the left side of the regression line. Assuming that the X and Y values were recorded correctly, there is no reason to remove this point from the data set. The impact of this point on the slope of the regression line is minimized by the large sample size.

Figure 3.6: Fitted regression model and leverage points for the [latex]n = 96[/latex] data pairs.

Long Description for Figure 3.6

The horizontal axis X ranges from 140 to 180 in increments of 10 units. The vertical axis Y ranges from negative 150 to 190 in increments of 10 units. Multiple data points are plotted in the quadrant. The line of regression, originating from the vertical axis at (0,153), slopes up toward the top right of the quadrant. It has a positive slope. The data points are crowded between values 150 and 180 on the X axis and 155 and 190 on the Y axis. 10 data points are circled, of which 7 are on the left of the quadrant, and 3 are plotted on the right. The circled points are as follows. (140, 153); (143, 155); (148, 151); (146, 160); (148, 165); (149, 166); (148, 178); (180, 181); (180, 185); (179, 190). All data are approximate.

Identifying leverage points is helpful for knowing which points to more carefully scrutinize. It is not appropriate to simply delete a leverage point because it falls far from the regression line. Leverage points can be helpful in highlighting an aspect of the model that was not originally considered relevant. The next subsection considers how to determine if a leverage point (or any other point) does produce a significant impact on [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex].

3.2.2 Influential Points

Leverage points have the potential to produce large changes in the values of [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] when they are deleted. How can we determine whether a leverage point (or any other point) does have significant impact on the regression line? American statistician R. Dennis Cook suggested a quantity that measures the influence of each data pair on the regression line.

Definition 3.3 For a simple linear regression model, Cook’s distances [latex]D_1, \, D_2, \, \ldots, \, D_n[/latex] associated with the n data pairs have the following three equivalent definitions.

$D_{i} = \frac{\sum_{j = 1}^{n} ({\hat{Y}}_{j} - {\hat{Y}}_{j (i)})^{2}}{2 \cdot M S E},$
$D_{i} = \frac{n ({\hat{β}}_{0 (i)} - {\hat{β}}_{0})^{2} + 2 ({\hat{β}}_{0 (i)} - {\hat{β}}_{0}) ({\hat{β}}_{1 (i)} - {\hat{β}}_{1}) \sum_{i = 1}^{n} X_{i} + ({\hat{β}}_{1 (i)} - {\hat{β}}_{1})^{2} \sum_{i = 1}^{n} X_{i}^{2}}{2 \cdot M S E},$
$D_{i} = \frac{e_{i}^{2} h_{i i}}{2 \cdot M S E {(1 - h_{i i})}^{2}},$

where MSE is the mean square error (see Theorem 1.8), [latex]\hat{Y}_{j(i)}[/latex] is the fitted value of data pair j with data pair i removed, [latex]\beta_{0(i)}[/latex] is the estimated intercept of the regression line for the simple linear regression model with data pair i removed, [latex]\beta_{1(i)}[/latex] is the estimated slope of the regression line for the simple linear regression model with data pair i removed, and [latex]h_{ii}[/latex] is the leverage of data pair i (see Definition 3.2), for [latex]i = 1, \, 2, \, \ldots, \, n[/latex].

The equivalence between the three very diverse formulas in Definition 3.3 is left as an exercise. The data pairs must not be collinear because MSE appears in the denominator of each formula. Each of the three formulas is helpful in developing intuition about Cook’s distance, so each is illustrated in the following three examples.

Example 3.5 Use the first formula from Definition 3.3 to calculate the Cook’s distances for the [latex]n = 11[/latex] data pairs in the Anscombe’s first data set (sorted by the values of the independent variable), appended with the point [latex](X_{12}, \, Y_{12}) = (19, \, 4)[/latex]. This was the last data set encountered in Example 3.3.

The bottom graph in Figure 3.5 shows that the first 11 data pairs are consistent with an underlying linear model, but the 12th data pair is not consistent with this model. The first formula from Definition 3.3 is

$\begin{array}{l} D_{i} = \frac{\sum_{j = 1}^{n} ({\hat{Y}}_{j} - {\hat{Y}}_{j (i)})^{2}}{2 \cdot M S E} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. Since the term [latex]\hat{Y}_j - \hat{Y}_{j(i)}[/latex] is a measure of the effect of dropping data pair i from the data set on the fitted value, larger values for D_i indicate that data pair i is more influential. Squaring [latex]\hat{Y}_j - \hat{Y}_{j(i)}[/latex] assures that the direction of the fitted value when data pair i is dropped makes a positive contribution to D_i. The R code below loops through the data points, excluding the data pairs one-by-one. Hence there will in general be a total of [latex]n + 1[/latex] simple linear regression models fitted when using the first formula for computing Cook’s distance—one regression model for all data pairs included and n other regression models for dropping each data pair once.

Several of the Cook’s distances are given in Table 3.5. Consistent with the bottom graph in Figure 3.5, the 12th Cook’s distance [latex]D_{12} = 3.621[/latex] is substantially larger than the second-largest Cook’s distance [latex]D_1 = 0.236[/latex]. So the 12th data pair, [latex](X_{12}, \, Y_{12})[/latex], is the most influential point. The first data pair, [latex](X_{1}, \, Y_{1})[/latex], is the second most influential point. Notice that these are the two points with the highest leverage (see Table 3.4).

Table 3.5: Cook’s distances for Anscombe’s data set I with [latex](X_{12}, \, Y_{12}) = (19, \, 4)[/latex] appended.
i	1	2	3	4	⋯	11	12
D_i	0.236	0.029	0.005	0.069	⋯	0.128	3.621

To show some of the geometry associated with the calculation of [latex]D_1, \, D_2, \, \ldots, \, D_n[/latex], Figure 3.7 shows the regression line

$\begin{array}{l} Y = {\hat{β}}_{0} + {\hat{β}}_{1} X = 6.09 + 0.114 X \end{array}$

fitted to all [latex]n = 12[/latex] data pairs, which are indicated by solid points ([latex]\bullet[/latex]). This regression line corresponds to the fitted value at [latex]X_{12} = 19[/latex] of

$\begin{array}{l} {\hat{Y}}_{12} = 6.09 + (0.114) (19) = 8.25 . \end{array}$

The other regression line,

$\begin{array}{l} Y = {\hat{β}}_{0 (12)} + {\hat{β}}_{1 (12)} X = 3.00 + 0.500 X, \end{array}$

is the regression that excludes the influential 12th data pair [latex](X_{12}, \, Y_{12}) = (19, \, 4)[/latex]. This regression line corresponds to the fitted value at [latex]X_{12} = 19[/latex] of

$\begin{array}{l} {\hat{Y}}_{12 (12)} = 3.00 + (0.500) (19) = 12.50 . \end{array}$

The two fitted values are indicated by open points ([latex]\circ[/latex]). So when calculating D₁₂ using the first formula in Definition 3.3, one of the terms in the numerator is

$\begin{array}{l} ({\hat{Y}}_{12} - {\hat{Y}}_{12 (12)})^{2} = (8.25 - 12.50)^{2} = (- 4.25)^{2} = 1 8.07, \end{array}$

which makes a huge contribution to the numerator of D₁₂.

Figure 3.7: Calculating Cook’s distances using fitted values.

Long Description for Figure 3.7

The horizontal axis X ranges from 4 to 19. The vertical axis Y ranges from 3 to 13. 12 points are plotted on the X Y plane. A diagonal line, with a steep slope originating from the vertical axis at (0,4), slopes up toward the top right of the quadrant. 5 data points fall on this line. An open point at the end on this line is indicated as Y cap 12 of 12. The equation on the line reads, Y equals beta 0 of 12 plus beta cap 1 of 12 times X. Another diagonal line, with a less steep slope originating from the vertical axis at (0,5), slopes up toward the right of the quadrant. Two data points fall on this line. An open point at the end of this line is indicated as Y cap 12. The equation on the line reads, Y equals beta cap 0 plus beta cap 1 times X. Three data points fall below these diagonal lines, and three are above these lines.

The previous example has indicated that Cook’s distance is a measure of the influence of each data pair based on the effect of removing each data pair sequentially, and measuring the associated impact on the fitted values. If the fitted values are not substantially altered by removing data pair i, then D_i will be small; if the fitted values are substantially altered by removing data pair i, then D_i will be large. This, however, does not explain why the denominator [latex]2 \cdot MSE[/latex] is in all four formulas in Definition 3.3. That will be addressed in the next example.

Example 3.6 Use the second formula from Definition 3.3 to calculate the Cook’s distances for the [latex]n = 11[/latex] data pairs in the Anscombe’s first data set (sorted by the values of the independent variable), appended with the point [latex](X_{12}, \, Y_{12}) = (19, \, 4)[/latex].

The second formula for computing Cook’s distance for data pair i from Definition 3.3 is

$\begin{array}{l} D_{i} = \frac{n ({\hat{β}}_{0 (i)} - {\hat{β}}_{0})^{2} + 2 ({\hat{β}}_{0 (i)} - {\hat{β}}_{0}) ({\hat{β}}_{1 (i)} - {\hat{β}}_{1}) \sum_{i = 1}^{n} X_{i} + ({\hat{β}}_{1 (i)} - {\hat{β}}_{1})^{2} \sum_{i = 1}^{n} X_{i}^{2}}{2 \cdot M S E} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. This formula emphasizes the change in the regression coefficients when data pair i is dropped. Figure 3.8 shows (a) the estimators [latex]\big( \hat \beta_0, \, \hat \beta_1 \big)[/latex] for all [latex]n = 12[/latex] data pairs as a +, (b) the associated confidence regions for β₀ and β₁ at levels 0.25, 0.5, and 0.75, and (c) twelve points indicated by solid circles ([latex]\bullet[/latex]) giving the values of the slope and intercept when data pair i is dropped, for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. Not surprisingly, the estimated slope and intercept when the 12th data point, [latex](X_{12}, \, Y_{12} ) = (19, \, 4)[/latex], is dropped, strays the furthest from [latex]\big( \hat \beta_0, \, \hat \beta_1 \big)[/latex]. The other 11 estimated slope and intercept pairs all fall within the 0.25 confidence region.

Figure 3.8: Calculating Cook’s distances using the parameter estimates.

Long Description for Figure 3.8

The horizontal axis labeled beta 0 ranges between 3 to 9 in increments of 1. The vertical axis labeled beta 1 ranges from negative 0.2 to 0.5 in increments of 0.1. 12 data pairs are plotted in the quadrant. A point at (6, 0.1) is indicated by a cross mark and three concentric elliptical contours are indicated around it. The outermost ellipse extends between 3 and 9 of the horizontal axis and negative 0.2 and 0.5 of the vertical axis. 11 data points fall within the innermost ellipse extending between 5 and 7.5 of the horizontal axis and 0 and 0.3 of the vertical axis. The midpoint of the contours is indicated by a cross mark falling on (6, 0.1). The midpoint reads, (beta cap 0, beta cap 1 equals (6.09, 0.11). A data pair at the rim of inner ellipse is indicated as (beta cap 0 of 1, beta cap 1 of 1) equals (7.22, 0.03), D 1 equals 0.24. A data pair at (3, 0.5) falls outside the elliptical contours and is indicated as (beta cap 0 of 12, beta cap 1 of 12) equals (3.00, 0.50), D equals 3.62.

The connection with the confidence region for β₀ and β₁ in this case illuminates why the [latex]2 \cdot MSE[/latex] appears in the denominator of all of the formulas for D_i in Definition 3.3.

Compare the right-hand side of the second formula in Definition 3.3 with the expression in Theorem 2.16. They are identical except that β₀ is replaced by [latex]\beta_{0(i)}[/latex] and β₁ is replaced by [latex]\beta_{1(i)}[/latex]. So under the assumption that the data pairs are drawn from a simple linear regression model, one would expect that D_i is approximately [latex]F(2, \, n - 2)[/latex]. Some suggest using the median of a [latex]F(2, \, n - 2)[/latex] distribution as a threshold for classifying a data pair as an influential point. Another approach is to observe that the population mean and variance of an [latex]F(2, \, n - 2)[/latex] random variable are

$\begin{array}{l} E [D_{i}] = \frac{n - 2}{n - 4} (for n > 4) and V [D_{i}] = \frac{(n - 2)^{3}}{(n - 4)^{2} (n - 6)} (for n > 6) . \end{array}$

So in the limit as the number of data pairs increases,

$\begin{array}{l} lim_{n \to \infty} E [D_{i}] = 1 and lim_{n \to \infty} V [D_{i}] = 1. \end{array}$

It is for this reason that a threshold of 1 is used as a simple threshold for classifying a data point as influential based on Cook’s distance. Regardless of whether the median of an [latex]F(2, 10)[/latex] random variable (which is 0.743) or 1 is used as a threshold, the first 11 points are not deemed to be influential points, and the 12th point, [latex](19, \, 4)[/latex], is deemed to be an influential point.

One weakness associated with the first two formulas for computing the Cook’s distances in Definition 3.3 involves computation time. There are [latex]n + 1[/latex] regression lines to estimate (one for all of the data pairs and then another n associated with dropping each of the data pairs). For large values of n, this can require significant computation time. The third formula is much faster, as illustrated next.

Example 3.7 Use the third formula from Definition 3.3 to calculate the Cook’s distances for the [latex]n = 96[/latex] data pairs in the data set of heights of wives and husbands from Example 2.7.

The third formula for computing Cook’s distance for data pair i from Definition 3.3 is

$\begin{array}{l} D_{i} = \frac{e_{i}^{2} h_{i i}}{2 \cdot M S E {(1 - h_{i i})}^{2}} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. The advantage to using this formula over the other two formulas is that it only requires one regression line to be calculated, rather than [latex]n + 1[/latex] regression lines in the other two formulas. This is a substantial time savings for large values of n. The R code below calculates Cook’s distances for the heights data.

The [latex]n = 96[/latex] Cook’s distances are plotted in Figure 3.9. The 12th data pair, which is [latex](X_{12}, \, Y_{12}) = (147, \, 178)[/latex], has a spectacular Cook’s distance of [latex]D_{12} = 0.192[/latex]. Since this does not exceed the first threshold (which is the median of an F random variable with 2 and 94 degrees of freedom: 0.698) or the second threshold (which is 1 using the asymptotic result), we conclude that there are no influential points. Cook’s distances are calculated so frequently in regression analysis that R includes a function named cooks.distance that calculates the Cook’s distances, as illustrated below.

Figure 3.9: Cook’s distances for the heights data.

Cook’s distances are effective for identifying influential points. Once an influential point in a simple linear regression model has been identified, there are several possible next steps.

The influential point might have been recorded or coded improperly; a typographical error has occurred. In most situations, this is easily remedied.
The influential point has some unusual characteristic that is not present with the other data points that might account for it being deemed influential. Depending on the setting, the influential point can be removed and the regression model can be refitted without the influential point.
The influential point might provide some evidence that an alternative regression model is appropriate. This might be a nonlinear regression model or a linear regression model with additional independent variables.
The influential point might be at one of the extremes of the scope of the model. This might indicate that the scope of the model is too wide; narrowing the scope should be considered. It is often the case that a simple linear regression model is valid only over a rather limited scope. This might result in eliminating all data points outside of the narrowed scope and refitting the simple linear regression model.
The high-leverage point is indeed within the scope of the model and was recorded correctly, but its extreme influence on the regression line is resulting in poor diagnostic measures. One approach here is to collect more data values, particularly at the extreme values of the independent variable within the scope of the model in order to mitigate the effect of the influential point.

3.3 Remedial Procedures

The diagnostic procedures presented in the previous section are designed to identify assumptions associated with the simple linear regression model that are not satisfied for a particular set of n data pairs. But these diagnostic procedures do not suggest remedies when model assumptions are not satisfied. This section considers remedial procedures.

Reasons that simple linear regression model with normal error terms can fail to satisfy the assumptions given in Definition 2.1 include

the regression function is not linear,
the regression model has not included an important independent variable,
the error terms have a variance that varies with X,
the error terms are not independent,
the error terms are not normally distributed,
the scope of the regression model is too wide,
the scope of the regression model is too narrow, and
an influential point has an unusually strong effect on the regression line.

Two common approaches to handling a regression model which violates one or more of the assumptions are (a) formulate and fit a regression model with nonlinear terms, and (b) transform the X-values or the Y-values (or both) in a fashion so that the simple linear regression assumptions are satisfied. Regression models with nonlinear terms will be considered in a subsequent section in this chapter; transformations will be considered here. Transformations will be illustrated in a single (long) example.

Example 3.8 A simple linear regression model with normal error terms for the speed of a car X (in miles per hour) versus the stopping distance Y (in feet) for the built-in R cars data set was abandoned in Example 2.8 for several reasons. A scatterplot (without jittering) with the associated regression line is displayed in Figure 3.10. The purpose of this example is to see whether a transformation can overcome the problems associated with

the relationship between X and Y appears to be slightly nonlinear,
the variance of the error terms appears to be increasing in X, and
the residuals do not appear to be normally distributed.

Figure 3.10: Scatterplot and regression line of speed X and stopping distance Y.

Long Description for Figure 3.10

The horizontal axis X measuring speed ranges from 0 to 25 in increments of 5 units. The vertical axis Y measuring stopping distance ranges from 0 to 120 in increments of 40 units. Multiple data points are plotted in the quadrant. Most of them lie between a speed of 4 and 25 and a stopping distance of 0 and 80. The line of regression, originating from (4,0) on the horizontal axis, slopes up toward the top right of the quadrant. It has a positive slope. 25 data points fall below the line of regression. 20 points fall above the line of regression. 5 points fall on the line of regression. All data are approximate.

Rather than providing a complete inventory of all possible patterns and associated potential helpful transformations, four transformations will be illustrated here. This trial-and-error approach is not what is typically relied on in practice. There are some patterns associated with data pairs that tend to give clues as to which transformations will be effective.

The first transformation is [latex]X ^ \prime = X ^ 2[/latex]. The R code below implements the transformation, generates a scatterplot of the transformed data pairs, and plots the associated regression line.

This scatterplot appears in the upper-left graph in Figure 3.11. Tick mark labels have been suppressed on these graphs because the interest is in gazing at the data pairs in order to determine whether the transformed data pairs conform to the simple linear regression model with normal error terms. For the transformation [latex]X ^ \prime = X ^ 2[/latex], little progress is made on the constant variance issue. The first 19 data pairs, which are associated with speeds from 4 to 13 miles per hour, seem to have a smaller variance in their stopping distances than the faster speeds. This transformation is deemed ineffective.

Figure 3.11: Scatterplots and estimated regression lines for transformed cars data.

Long Description for Figure 3.11

“Top left graph: The horizontal axis is labeled X squared and the vertical axis Y. The line of regression is the diagonal line that originates from the Y axis, close to the origin and slope up toward top right of the quadrant. 28 points fall below the line of regression, 5 points fall on the line of regression, 17 fall above the line of regression. Top right graph: The horizontal axis is labeled X and the vertical axis is labeled 1 n Y. The line of regression is the diagonal line that originates from the Y axis, close to the midpoint and slopes up toward the top right of the quadrant. 20 points fall below the line of regression. 7 points fall on the line of regression and 19 points fall above the line of regression. Bottom left graph: The horizontal axis is labeled X squared, and the vertical axis is labeled the square root of Y. The line of regression is the diagonal line that originates from the Y axis, close to the origin and slopes up toward the top right of the quadrant. 21 points fall below the line of regression. 7 points fall on the line of regression and 15 points fall above the line of regression. Bottom right graph: The horizontal axis is labeled X squared and the vertical axis is labeled Y to the power of 0.43. The line of regression is the diagonal line that originates from the Y axis, close to the origin and slope up toward top right of the quadrant. 21 points are below the line of regression. 10 points are on the line of regression, and 16 points are above the line of regression.”

The second transformation is [latex]Y ^ \prime = \ln \, Y[/latex]. The R code below implements the transformation, generates a scatterplot of the transformed data pairs, and plots the associated regression line.

This scatterplot appears in the upper-right graph in Figure 3.11. The transformation [latex]Y ^ \prime = \ln \, Y[/latex] also results in a nonconstant variance in the error terms; this time the variance in the stopping distances is greater for the slower speeds. So this transformation is also abandoned for lack of constant variance of the error terms.

The third transformation is [latex]Y ^ \prime = \sqrt{Y}[/latex]. The R code below implements the transformation, generates a scatterplot of the transformed data pairs, and plots the associated regression line.

This scatterplot appears in the lower-left graph in Figure 3.11. The transformation [latex]Y ^ \prime = \sqrt{Y}[/latex] is the first to show some promise for the use of the simple linear regression model with normal error terms. The variance of the error terms appears to be constant over the scope of the model. There is nothing magical, however, about the [latex]1 / 2[/latex] power in the transformation [latex]Y ^ \prime = \sqrt{Y} = Y ^ {1 / 2}[/latex]. Might the cube root be a superior transformation to the square root? This prompts a fourth transformation, which is [latex]Y ^ \prime = Y ^ \lambda[/latex], and is known as the Box–Cox transformation, named after British statisticians George Box and David Cox. They suggested a similar transformation in 1964, which is

$\begin{array}{l} Y^{'} = \frac{Y^{λ} - 1}{λ}, \end{array}$

and the fitting of the λ parameter by maximum likelihood estimation can be performed by the boxcox function in the MASS package in R.

So the fourth transformation is [latex]Y ^ \prime = (Y ^ \lambda - 1) / \lambda[/latex]. The R code below calculates the maximum likelihood estimator of λ, implements the transformation, generates a scatterplot of the transformed data pairs, and plots the associated regression line. The boxcox function generates the log likelihood function for estimating λ, and the which.max function extracts the maximum likelihood estimator.

The log likelihood function and an associated 95% confidence interval for λ is generated by setting the plotit argument to FALSE in the call to boxcox. This confidence interval includes [latex]\lambda = 1 / 2[/latex]. The maximum likelihood estimator [latex]\hat \lambda = 0.43[/latex] falls between a square root and cube root transformation. This scatterplot appears in the lower-right graph in Figure 3.11, and is very similar to the square root transformation; either would work fine for this data set. Since the last two scatterplots and associated regression lines are nearly identical, we move forward with the transformation [latex]Y ^ \prime = \sqrt{Y}[/latex]. So the tentative fitted model is

$\begin{array}{l} E [\sqrt{Y}] = 1.28 + 0.322 X \end{array}$

where the regression coefficients [latex]\beta_0^\prime = 1.28[/latex] and [latex]\beta_1^\prime = 0.322[/latex] are calculated with the R statement

The next step is to assess the aptness of the model by examining the residuals. The four graphs (read row-wise) in Figure 3.12 are (a) the residuals associated with the transformed model [latex]\sqrt{Y} = 1.28 + 0.322 X[/latex] plotted against their index, (b) the standardized residuals [latex]e_i / \sqrt{MSE}[/latex] associated with the transformed model plotted against the value of the independent variable X_i, (c) a histogram of the standardized residuals [latex]e_i / \sqrt{MSE}[/latex] for the transformed model, and (d) a QQ plot of the standardized residuals [latex]e_i / \sqrt{MSE}[/latex] for the transformed model with theoretical quantiles on the horizontal axis and sample quantiles on the vertical axis. Although there is some nonsymmetry in the histogram of the residuals (which might be due to the binning of the 50 data pairs), the residual plots and the QQ plot make the simple linear regression model with normal error terms for the transformed data pairs seem plausible. A roughly mound-shaped histogram is typically adequate for the normality assumption. Moving from the visual assessment to statistical tests, the R code

gives a p-value for the Shapiro–Wilk test of [latex]p = 0.314[/latex]. This is a big improvement over the p-value obtained in Example 2.8, which rejected normality with [latex]p = 0.0215[/latex]. The transformation is effective. The largest Cook’s distance is 0.134, which occurs at the 49th observation [latex](X_{49}, \, Y_{49}) = (24, \, 120)[/latex]. Returning to the 49th observation in Figure 3.12, we see that it achieves the largest Cook’s distance because of its leverage, but does not appear to be inconsistent with the transformed model.

Figure 3.12: Visual assessment of the residuals of the transformed model.

Long Description for Figure 3.12

Top left graph: The horizontal axis labeled i ranges between 1 and 50. The vertical axis labeled e subscript i ranges between negative 3 and 3. A horizontal line originating from point 0 on the vertical axis extends parallel to the horizontal axis. 25 points are below the horizontal line, 5 are on the line and 20 are above the horizontal line. Top right graph: The horizontal axis labeled X i ranges between 0 and 25. The vertical axis labeled e subscript i over the square root of M S E ranges between negative 3 and 3. A horizontal line originating from point 0 on the vertical axis extends parallel to the horizontal axis. Two dotted horizontal lines, each originating at negative 2 and positive 2 run parallel to the horizontal axis. 26 points line between the horizontal lines at points 0 and negative 2. 18 points line between the horizontal lines at points 0 and 2. 5 points line on the solid horizontal line at 0. One point lies above the horizontal line at two. Bottom left graph: A bell-shaped histogram with 5 bars or bins from negative 3 to 3. The first two bars are of increasing heights, and following three bars are of decreasing heights. The bar from negative 1 to 0 is the highest and the one from 2 to 3 is the lowest. Bottom right graph: The horizontal axis ranges from negative 3 to 3 and the vertical axis ranges from negative 3 to 3. A diagonal line originating from (0, negative 3) on the Y axis, slopes up toward the top right of the X Y plane. Most of the points fall on the line of regression or close to it. The point (2.5, 3) lies above the line of regression.

So the visual assessment and statistical tests lead us to believe that a simple linear regression model with normal error terms for the transformed data is appropriate. The fitted regression model is

$\begin{array}{l} E [\sqrt{Y}] = 1.28 + 0.322 X . \end{array}$

All of the statistical inference techniques can now be applied to the transformed data. For example, confidence intervals for the [latex]\beta_0^\prime[/latex] and [latex]\beta_1^\prime[/latex] (the intercept and slope of the regression line for the transformed data) can be calculated with the R statements

which give the 95% confidence intervals

$\begin{array}{l} 0.303 < β_{0}^{'} < 2.25 and 0.263 < β_{1}^{'} < 0.382 . \end{array}$

Figure 3.13 displays all of the exact two-sided 95% confidence intervals for [latex]E[ \sqrt{Y_h} \, ][/latex] and all of the exact two-sided 95% prediction intervals for [latex]\sqrt{Y_h^*}[/latex] for all values of X_h in the scope of the regression model. For [latex]X_h = 21[/latex] miles per hour, for example, an exact two-sided 95% prediction interval for [latex]\sqrt{Y_h^*}[/latex] is

$\begin{array}{l} 5.78 < \sqrt{Y_{h}^{*}} < 10.3, \end{array}$

Figure 3.13: Transformed cars model 95% confidence and prediction intervals.

Long Description for Figure 3.13

The horizontal axis labeled X, ranges from 0 to 25 in increments of 5. The vertical axis labeled square root of Y, ranges from 0 to 12 in increments of 12. A diagonal line with an increasing trend, originating from (0,1)at the vertical axis passes through points (5,2); (15,6); (25, 9). Prediction interval is the region shaded between 0 and 5 on the vertical axis and confidence interval is the region shaded between 1 and 3 on the vertical axis, right along the line of regression. 25 points lie below the line of regression, 4 points fall on the line of regression and 19 lie above the line of regression. 14 points lie in the 95 percent confidence interval region and one point at 15, 9 lines outside the prediction interval region, while the remaining lie within the prediction interval region.

which can be calculated with the R commands

So to translate this back to the original units, for a 51st car going [latex]X_h = 21[/latex] miles per hour, the expected stopping distance using the transformed model is

$\begin{array}{l} {\hat{Y}}_{h} = (1.28 + 0.322 \cdot 21)^{2} = 64.8 \end{array}$

feet, and an exact two-sided 95% prediction interval for the associated stopping distance is

$\begin{array}{l} 33.5 < Y_{h}^{*} < 106. \end{array}$

The previous example took a trial-and-error approach to determining an appropriate transformation to apply to the raw data pairs in order to satisfy the assumptions implicit in a simple linear regression model with normal error terms. There are templates that can give a more systematic approach to determining these transformations.

There is a nice synergy between matrix algebra and regression, which will be presented in the next section.

3.4 Matrix Approach to Simple Linear Regression

So far, a purely algebraic approach has been taken to simple linear regression modeling. This section considers a matrix-based approach. There are (at least) four reasons to take this approach. First, the mathematical expressions are in many cases much more compact; summations from the algebraic approach are often equivalent to matrix multiplications. Second, matrix algebra can easily be implemented on a computer. Third, the matrix approach generalizes very easily to the multiple regression case in which there are several independent variables. Fourth, the matrix approach generalizes very easily to weighted least squares, which will be introduced in the next section.

We begin the matrix approach by defining certain critical matrices, which will be set in boldface. Let [latex]{\bf X}[/latex] be an [latex]n \times 2[/latex] matrix whose first column is all ones and whose second column contains the observed values of the independent variable, [latex]{\bf Y}[/latex] be an [latex]n \times 1[/latex] vector which holds the observed values of the dependent variable, [latex]\pmb{\beta}[/latex] be a [latex]2 \times 1[/latex] vector which holds the population intercept and slope, and [latex]\pmb{\epsilon}[/latex] be an [latex]n \times 1[/latex] vector which holds the error terms:

$\begin{array}{l} X = [\begin{array}{cc} 1 & X_{1} \\ 1 & X_{2} \\ ⋮ & ⋮ \\ 1 & X_{n} \end{array}], Y = [\begin{array}{c} Y_{1} \\ Y_{2} \\ ⋮ \\ Y_{n} \end{array}], β = [\begin{array}{c} β_{0} \\ β_{1} \end{array}], and ϵ = [\begin{array}{c} ϵ_{1} \\ ϵ_{2} \\ ⋮ \\ ϵ_{n} \end{array}] . \end{array}$

The [latex]\bf X[/latex] matrix is known as the design matrix.

As before, the values of the independent variable (the second column of [latex]\bf X[/latex]) are assumed to be fixed constants observed without error with at least two distinct values, the values of the dependent variable contained in [latex]\bf Y[/latex] are assumed to be continuous random responses, and the elements of the vector [latex]\pmb\epsilon[/latex] are assumed to be mutually independent random variables, each with population mean 0 and finite positive population variance σ². Stated another way, the expected value of [latex]\pmb\epsilon[/latex] is the zero vector and the variance–covariance matrix of [latex]\pmb\epsilon[/latex] is

$\begin{array}{l} [\begin{array}{cccc} σ^{2} & 0 & \dots & 0 \\ 0 & σ^{2} & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & σ^{2} \end{array}] . \end{array}$

The simple linear regression model

$\begin{array}{l} Y_{i} = β_{0} + β_{1} X_{i} + ϵ_{i} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], can be written more explicitly in terms of each observed data pair as

$\begin{array}{l} Y_{1} & = β_{0} + β_{1} X_{1} + ϵ_{1} \\ Y_{2} & = β_{0} + β_{1} X_{2} + ϵ_{2} \\ ⋮ \\ Y_{n} & = β_{0} + β_{1} X_{n} + ϵ_{n} \end{array}$

which, in matrix form, is

$\begin{array}{l} [\begin{array}{c} Y_{1} \\ Y_{2} \\ ⋮ \\ Y_{n} \end{array}] = [\begin{array}{cc} 1 & X_{1} \\ 1 & X_{2} \\ ⋮ & ⋮ \\ 1 & X_{n} \end{array}] \cdot [\begin{array}{c} β_{0} \\ β_{1} \end{array}] + [\begin{array}{c} ϵ_{1} \\ ϵ_{2} \\ ⋮ \\ ϵ_{n} \end{array}] \end{array}$

or simply

$\begin{array}{l} Y = X β + ϵ . \end{array}$

This explains why the artificial column of ones appears as the first column of the [latex]{\bf X}[/latex] matrix; it is to account for the intercept term. To force a regression line through the origin, simply omit the column of ones in the [latex]{\bf X}[/latex] matrix. Taking the expected value of both sides of this equation results in

$\begin{array}{l} E [Y] = X β \end{array}$

because [latex]E[ \epsilon_i ] = 0[/latex], for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], (that is, [latex]E[ \pmb{\epsilon} ] = {\bf 0}[/latex]). The left-hand side of this equation, [latex]E[ {\bf Y} ][/latex], is an n-element column vector with elements [latex]E[Y_1], \, E[Y_2], \ldots , \, E[Y_n][/latex]. The sum of squares which is to be minimized to find the least squares estimators is

$\begin{array}{l} S = {(Y - X β)}^{'} (Y - X β) . \end{array}$

With this notation established, the algebraic results concerning the simple linear regression model can be restated more compactly in terms of these matrices. The results have already been proved, so there is no need to prove them again when stated in matrix form. The [latex]\, ^ \prime[/latex] superscript denotes transpose. It is a good exercise to perform the algebra necessary to see that the algebraic and matrix versions of these definitions and theorems match. The dimensions of the matrices should be checked for conformity.

Definition 1.1. The simple linear regression model is
$\begin{array}{l} Y = X β + ϵ, \end{array}$

where [latex]E[ \pmb{\epsilon} ] = {\bf 0}[/latex], [latex]V[ \pmb{\epsilon} ] = \sigma ^ {\, 2} {\bf I}[/latex], and [latex]{\bf I}[/latex] is the [latex]n \times n[/latex] identity matrix.
Theorem 1.1. The least squares estimators of [latex]\pmb{\beta}[/latex], denoted by [latex]\hat{\pmb{\beta}} = \big( \hat \beta_0 , \, \hat \beta_1 \big) ^ \prime[/latex], solve the normal equations
$\begin{array}{l} X^{'} X \hat{β} = X^{'} Y . \end{array}$

The [latex]{\bf X}[/latex] matrix has rank 2 because there are at least two distinct X_i values. So [latex]{\bf X} ^ \prime {\bf X}[/latex] is invertible and the normal equations have the unique solution

$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y, \end{array}$

by premultiplying both sides of the normal equations by [latex]\big( {\bf X} ^ \prime {\bf X} \big) ^ {-1}[/latex].
Theorem 1.2. The least squares estimator of [latex]\pmb{\beta}[/latex] in a simple linear regression model is an unbiased estimator of [latex]\pmb{\beta}[/latex] because
$\begin{array}{l} E [\hat{β}] = β . \end{array}$
Theorem 1.3. The least squares estimators of [latex]\pmb{\beta}[/latex] in the simple linear regression model can be written as linear combinations of the dependent variables:
$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y, \end{array}$

where the coefficients in the linear combinations are given by [latex]\big( {\bf X} ^ \prime {\bf X} \big) ^ {-1} {\bf X} ^ \prime[/latex].
Theorem 1.4. The variance–covariance matrix of the least squares estimators of [latex]\pmb{\beta}[/latex] is
$\begin{array}{l} σ^{2} (X^{'} X)^{- 1} . \end{array}$
Theorem 1.5 (Gauss–Markov theorem). The least squares estimators of [latex]\pmb{\beta}[/latex] in a simple linear regression model, [latex]\hat{\pmb{\beta}} = ( {\bf X} ^ \prime {\bf X} ) ^ {-1} {\bf X} ^ \prime {\bf Y}[/latex], have the smallest population variance amongst all linear unbiased estimators of [latex]\pmb{\beta}[/latex].
Definition 1.2. The vector of fitted values in a simple linear regression model is the [latex]n \times 1[/latex] column vector
$\begin{array}{l} \hat{Y} = X \hat{β} = X (X^{'} X)^{- 1} X^{'} Y, \end{array}$

which is a linear combination of the dependent variables. The vector of residuals is the [latex]n \times 1[/latex] column vector

$\begin{array}{l} e & = Y - \hat{Y} \\ = Y - X \hat{β} \\ = Y - X (X^{'} X)^{- 1} X^{'} Y \\ = (I - X (X^{'} X)^{- 1} X^{'}) Y, \end{array}$

which is also a linear combination of the dependent variables. The matrix [latex]{\bf I}[/latex] is the [latex]n \times n[/latex] identity matrix.
Theorem 1.6. For the simple linear regression model with fitted values [latex]\bf\hat{Y}[/latex] and residuals [latex]{\bf e}[/latex],
- [latex]{\bf e} ^ \prime {\bf 1} = 0[/latex],
- [latex]{\bf Y} ^ \prime {\bf 1} = \bf\hat{Y}^ {\kern 0.19em \prime} {\bf 1}[/latex]
- [latex]\bf\hat{Y}^ {\kern 0.19em \prime} {\bf e} = 0[/latex],
where [latex]{\bf 1}[/latex] is an n-element column vector of ones.
Theorem 1.7. An unbiased estimator of σ² in a simple linear regression model is
$\begin{array}{l} {\hat{σ}}^{2} = M S E = \frac{e^{'} e}{n - 2} . \end{array}$
Theorem 1.8. The sums of squares can be partitioned in a simple linear regression model as [latex]SST = SSR + SSE[/latex] or
$\begin{array}{l} (Y - \bar{Y})^{'} (Y - \bar{Y}) = (\hat{Y} - \bar{Y})^{'} (\hat{Y} - \bar{Y}) + (Y - \hat{Y})^{'} (Y - \hat{Y}), \end{array}$

where [latex]\bar {\bf Y}[/latex] is an n-element column vector with identical elements which are each the sample mean of the values of the dependent variable.
Definition 1.3. The coefficient of determination in a simple linear regression model is
$\begin{array}{l} R^{2} = \frac{S S R}{S S T} = \frac{(\hat{Y} - \bar{Y})^{'} (\hat{Y} - \bar{Y})}{(Y - \bar{Y})^{'} (Y - \bar{Y})}, \end{array}$

when [latex]\big( {\bf Y} - \bar {\bf Y} \big) ^ \prime \big( {\bf Y} - \bar {\bf Y} \big) \ne 0[/latex]. The coefficient of correlation is

$\begin{array}{l} r = \pm \sqrt{R^{2}}, \end{array}$

where the sign associated with r is positive when [latex]\hat \beta_1 \ge 0[/latex] and negative when [latex]\hat \beta_1 < 0[/latex].
Definition 2.1. The simple linear regression model with normal error terms is
$\begin{array}{l} Y = X β + ϵ, \end{array}$

where [latex]\pmb{\epsilon} \sim N\left( {\bf 0}, \, \sigma ^ {\, 2} {\bf I} \right)[/latex].
Theorem 2.1. For the simple linear regression model with normal error terms, the maximum likelihood estimators of [latex]\pmb{\beta}[/latex] are
$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y \end{array}$

and the maximum likelihood estimator of σ² is

$\begin{array}{l} {\hat{σ}}^{2} = \frac{1}{n} (Y - X \hat{β})^{'} (Y - X \hat{β}) . \end{array}$

Since the vector of error terms [latex]\pmb{\epsilon}[/latex] consists of independent and identically distributed normal random variables, [latex]{\bf Y} = {\bf X} \pmb{\beta} + \pmb{\epsilon}[/latex] is a vector of independent and identically distributed normal random variables, and the linear transformation [latex]\hat{\pmb{\beta}} = ( {\bf X} ^ \prime {\bf X} ) ^ {-1} {\bf X} ^ \prime {\bf Y}[/latex] has normally distributed elements.
Theorem 2.2. For the simple linear regression model with normal error terms,
$\begin{array}{l} \frac{e^{'} e}{σ^{2}} \sim χ^{2} (n - 2), \end{array}$

and is independent of [latex]\hat{\pmb{\beta}}[/latex].
Theorem 2.3. For the simple linear regression model with normal error terms, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for σ² is
$\begin{array}{l} \frac{e^{'} e}{χ_{n - 2, α / 2}^{2}} < σ^{2} < \frac{e^{'} e}{χ_{n - 2, 1 - α / 2}^{2}} . \end{array}$
Theorems 2.4 and 2.7. For the simple linear regression model with normal error terms,
$\begin{array}{l} \hat{β} \sim N (β, σ^{2} (X^{'} X)^{- 1}) . \end{array}$
Theorem 2.12. For the simple linear regression model with normal error terms, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for [latex]E[ Y_h ][/latex] for a given value of the independent variable X_h is
$\begin{array}{l} X_{h}^{'} \hat{β} - t_{n - 2, α / 2} \sqrt{{\hat{σ}}^{2} X_{h}^{'} (X^{'} X)^{- 1} X_{h}} < E [Y_{h}] < X_{h}^{'} \hat{β} + t_{n - 2, α / 2} \sqrt{{\hat{σ}}^{2} X_{h}^{'} (X^{'} X)^{- 1} X_{h}}, \end{array}$

where [latex]{\bf X}_h = (1, \, X_h) ^ \prime[/latex] and [latex]\hat{\sigma} ^ 2 = MSE[/latex].
Theorem 2.15. For the simple linear regression model with normal error terms, an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] prediction interval for [latex]Y_h^\star[/latex] for a given value of the independent variable X_h is
$\begin{array}{l} X_{h}^{'} \hat{β} - t_{n - 2, α / 2} \sqrt{{\hat{σ}}^{2} (1 + X_{h}^{'} (X^{'} X)^{- 1} X_{h})} < Y_{h}^{⋆} < X_{h}^{'} \hat{β} + t_{n - 2, α / 2} \sqrt{{\hat{σ}}^{2} (1 + X_{h}^{'} (X^{'} X)^{- 1} X_{h})}, \end{array}$

where [latex]{\bf X}_h = (1, \, X_h) ^ \prime[/latex] and [latex]\hat{\sigma} ^ 2 = MSE[/latex].
Theorem 2.16. Under the simple linear regression model with normal error terms and parameters estimated from the data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex],
$\begin{array}{l} \frac{(\hat{β} - β)^{'} X^{'} X (\hat{β} - β)}{2 \cdot M S E} \sim F (2, n - 2) . \end{array}$
Theorem 2.17. Under the simple linear regression model with normal error terms and parameters estimated from the data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex], the values of β₀ and β₁ satisfying
$\begin{array}{l} \frac{(\hat{β} - β)^{'} X^{'} X (\hat{β} - β)}{2 \cdot M S E} \leq F_{2, n - 2, α} \end{array}$

form an exact joint [latex]{100(1 - \alpha)}\%[/latex] confidence region for β₀ and β₁.
Definition 3.2. Under the simple linear regression model, the hat matrix is
$\begin{array}{l} H = X {(X^{'} X)}^{- 1} X^{'} . \end{array}$

The diagonal elements of the hat matrix are the leverages. The matrix equation

$\begin{array}{l} \hat{Y} = H Y \end{array}$

indicates that [latex]{\bf H}[/latex] transforms [latex]{\bf Y}[/latex] to [latex]\bf\hat{Y}[/latex]. The hat matrix is symmetric (that is, [latex]{\bf H} = {\bf H} ^ \prime[/latex]) and idempotent (that is, [latex]{\bf H} {\bf H} = {\bf H}[/latex]).

The matrix approach applied to a simple linear regression model is illustrated for a small sample size next.

Example 3.9 Consider again the sales data set from Example 1.3. Let the independent variable X be the number of sales per week that Cheryl completes. Each sale results in a random amount of revenue to the company that can be attributed to Cheryl. Let the dependent random variable Y be the associated total revenue to the company from the sales attributed to Cheryl for that week, in thousands of dollars. The data pairs for the past [latex]n = 3[/latex] weeks are

$\begin{array}{l} (X_{1}, Y_{1}) = (6, 2), (X_{2}, Y_{2}) = (8, 9), and (X_{3}, Y_{3}) = (2, 2) . \end{array}$

Use the matrix approach to simple linear regression to define the matrices [latex]{\bf X}[/latex], [latex]{\bf Y}[/latex], [latex]\pmb{\beta}[/latex], and [latex]\pmb{\epsilon}[/latex]. Calculate the least squares estimates of the population intercept β₀ and population slope β₁, the fitted values, the hat matrix, the residuals, the unbiased estimate of the variance of the error terms, SST, SSR, SSE, R², r, an exact 95% confidence interval for [latex]E[Y_h][/latex] when [latex]X_h = 5[/latex] weekly sales, and an exact 95% prediction interval for [latex]Y_h^\star[/latex] when [latex]X_h = 5[/latex] weekly sales using the matrix approach to simple linear regression.

The [latex]{\bf X}[/latex], [latex]{\bf Y}[/latex], [latex]\pmb{\beta}[/latex], and [latex]\pmb{\epsilon}[/latex] matrices associated with the [latex]n = 3[/latex] data pairs are

$\begin{array}{l} X = [\begin{array}{cc} 1 & 6 \\ 1 & 8 \\ 1 & 2 \end{array}], Y = [\begin{array}{c} 2 \\ 9 \\ 2 \end{array}], β = [\begin{array}{c} β_{0} \\ β_{1} \end{array}], and ϵ = [\begin{array}{c} ϵ_{1} \\ ϵ_{2} \\ ϵ_{3} \end{array}] . \end{array}$

The R code below uses the matrix approach to simple linear regression to calculate the estimate of the intercept [latex]\hat \beta_0[/latex], the estimate of the slope [latex]\hat \beta_1[/latex], the fitted values [latex]\bf\hat{Y}[/latex], the hat matrix [latex]{\bf H}[/latex], the residuals [latex]{\bf e}[/latex], and the estimate of the population variance of the error terms [latex]\hat{\sigma} ^ {\, 2}[/latex]. SST, SSR, SSE, R², r, an exact 95% confidence interval for [latex]E[Y_h][/latex] when [latex]X_h = 5[/latex], and an exact 95% prediction interval for [latex]Y_h^\star[/latex] when [latex]X_h = 5[/latex] using the matrix approach to simple linear regression. The t function computes a matrix transpose, the diag function creates an identity matrix, and the solve function computes the inverse of [latex]X ^ \prime X[/latex]. The matrix multiplication operator is %*%

The output of this code is given in the equations that follow. The least squares estimators of the intercept and slope of the regression line are

$\begin{array}{l} \hat{β} = {(X^{'} X)}^{- 1} X^{'} Y = {[\begin{array}{cc} 3 & 16 \\ 16 & 104 \end{array}]}^{- 1} [\begin{array}{ccc} 1 & 1 & 1 \\ 6 & 8 & 2 \end{array}] [\begin{array}{c} 2 \\ 9 \\ 2 \end{array}] = [\begin{array}{c} - 1 \\ 1 \end{array}] . \end{array}$

The fitted values are

$\begin{array}{l} \hat{Y} = X \hat{β} = [\begin{array}{cc} 1 & 6 \\ 1 & 8 \\ 1 & 2 \end{array}] [\begin{array}{c} - 1 \\ 1 \end{array}] = [\begin{array}{c} 5 \\ 7 \\ 1 \end{array}] . \end{array}$

The [latex]3 \times 3[/latex] hat matrix [latex]{\bf H}[/latex] is

$\begin{array}{l} H = X {(X^{'} X)}^{- 1} X^{'} = [\begin{array}{cc} 1 & 6 \\ 1 & 8 \\ 1 & 2 \end{array}] {[\begin{array}{cc} 3 & 16 \\ 16 & 104 \end{array}]}^{- 1} [\begin{array}{ccc} 1 & 1 & 1 \\ 6 & 8 & 2 \end{array}] = [\begin{array}{ccc} 5 / 14 & 3 / 7 & 3 / 14 \\ 3 / 7 & 5 / 7 & - 1 / 7 \\ 3 / 14 & - 1 / 7 & 13 / 14 \end{array}] . \end{array}$

The diagonal elements of the hat matrix are the leverages [latex]h_{11}, \, h_{22}, \, h_{33}[/latex]. The vector of residuals is

$\begin{array}{l} e = (I - H) Y = [\begin{array}{ccc} 9 / 14 & - 3 / 7 & - 3 / 14 \\ - 3 / 7 & 2 / 7 & 1 / 7 \\ - 3 / 14 & 1 / 7 & 1 / 14 \end{array}] [\begin{array}{c} 2 \\ 9 \\ 2 \end{array}] = [\begin{array}{c} - 3 \\ 2 \\ 1 \end{array}] . \end{array}$

The fitted values and the residuals computed here are consistent with the geometry shown in Figure 1.15 from Example 1.8. The unbiased estimate of the population variance of the error terms is

$\begin{array}{l} {\hat{σ}}^{2} = M S E = \frac{e^{'} e}{n - 2} = \frac{1}{3 - 2} [\begin{array}{ccc} - 3 & 2 & 1 \end{array}] [\begin{array}{c} - 3 \\ 2 \\ 1 \end{array}] = 14. \end{array}$

The sums of squares can be partitioned as [latex]SST = SSR + SSE[/latex] using

$\begin{array}{l} (Y - \bar{Y})^{'} (Y - \bar{Y}) = (\hat{Y} - \bar{Y})^{'} (\hat{Y} - \bar{Y}) + (Y - \hat{Y})^{'} (Y - \hat{Y}), \end{array}$

where [latex]\bar {\bf Y}[/latex] is an n-element column vector with identical elements which are each the sample mean of the values of the dependent variable. For the [latex]n = 3[/latex] data pairs, this becomes

$\begin{array}{l} {(- \frac{7}{3})}^{2} + {(\frac{14}{3})}^{2} + {(- \frac{7}{3})}^{2} = {(\frac{2}{3})}^{2} + {(\frac{8}{3})}^{2} + {(- \frac{10}{3})}^{2} + (- 3)^{2} + 2^{2} + 1^{2} \end{array}$

or

$\begin{array}{l} \frac{98}{3} = \frac{56}{3} + 14. \end{array}$

Figure 3.14 show the geometry associated with [latex]SST = SSR + SSE[/latex] for the three data pairs. The sum of the areas of the three squares in the top graph is SST; the sum of the areas of the three squares in the middle graph is SSR; the sum of the areas of the three squares in the bottom graph is SSE.

Figure 3.14: Geometry associated with [latex]SST = SSR + SSE[/latex] for the sales data.

Long Description for Figure 3.14

“In all of the graphs, the horizontal axis X, ranges from 0 to 8 in increments of 1. The vertical axis Y, ranges from negative 1 to 9 in increments of 1. The data points (X 1, Y 1) is plotted at (6, 2). The data points (X 2, Y 2) is plotted at (8, 9). The data points (X 3, Y 3) is plotted at (2, 2). Graph 1: A horizontal line at 4.2 on the vertical axis runs parallel to the horizontal axis. A square of approximate length 4.5 units labeled 196 over 9 is drawn above the horizontal line such that the top right vertex is the data point (X 2, Y 2). Its base shares with the horizontal line. Similarly, two squares of approximate length 2 units labeled 4 over 9 are drawn below the horizontal line such that the bottom left vertex of the squares are the data point (X 3, Y 3) and (X 1, Y 1) respectively. The squares share one of their sides with the horizontal line. The equation beside the graph reads Y equals Y bar equals 13 over 3. S S T equals summation of i equals 1 to 3, open parenthesis, Y I minus Y bar, close parenthesis, squared equals 98 over 3. Graph 2: The data points X 1, Y 1 is plotted at (6, 2). The data points X 2, Y 2 is plotted at (8, 9). The data points X 3, Y 3 is plotted at (2, 2). A horizontal line at 4 on the vertical axis runs parallel to the horizontal axis. A diagonal line extending from origin passes through the points (2,1); (5, 5); (8, 7). A square of approximate length 2 units labeled 64 over 9 is drawn above the horizontal line shares base with the horizontal line. A dotted line extends from its top left vertex to data point (X2, Y2). A small square of approximate length 1 unit labeled 4 over 9 extends above the horizontal line, sharing base with it such that its top left vertex touches the diagonal line. Another square of approximate length 4 units labeled 100 over 9 is drawn above the horizontal line sharing base with it such that the data point (X 3, Y 3) falls along the right side of the square and the bottom right vertex touches the diagonal line. The equation read, Y equals Y bar equals 13 over 3. Y equals beta cap 0 plus beta cap 1 ties X. S S R equals summation of i equals 1 to 3, open parenthesis, Y cap i minus Y bar, close parenthesis, squared equals 56 over 3. Graph 3: A diagonal line passes from origin passes through the points (2,1), (6,5) and (8, 7). A square labeled 4 is drawn such that it top right vertex is the data point (X 2, Y2) and its right bottom vertex falls on the diagonal line. A square labeled 9 is drawn such that its bottom left vertex is the data point (X 1, Y 1) and its top left vertex falls on the diagonal line. A square labeled 1 is drawn such that its top right vertex is data point (X3, Y 3) and its bottom right vertex fall on the diagonal line The equation reads Y equals beta cap 0 plus beta 1 cap X. S S E equals summation of 1 equals 1 to 3, open parenthesis Y I minus Y cap I, close parenthesis, squared equals 14.”

The coefficient of determination and the correlation coefficient in a simple linear regression model are

$\begin{array}{l} R^{2} = \frac{S S R}{S S T} = \frac{(\hat{Y} - \bar{Y})^{'} (\hat{Y} - \bar{Y})}{(Y - \bar{Y})^{'} (Y - \bar{Y})} = \frac{56 / 3}{98 / 3} = \frac{4}{7} = 0.57 and r = 0.76 . \end{array}$

The three intervals are

$\begin{array}{l} 2.8 < σ^{2} < 14000, \end{array}$

$\begin{array}{l} - 24 < E [Y_{h}] < 32, \end{array}$

and

$\begin{array}{l} - 51 < Y_{h}^{⋆} < 59. \end{array}$

The intervals are unusually wide because there are only [latex]n = 3[/latex] data pairs which have significant deviation from the regression line. Notice that these results match those obtained earlier by algebraic methods and by using the lm (linear model) function as given in Examples 1.3, 1.7, 1.8, and 1.10.

Theorem 2.2 stated that under the simple linear regression model with normal errors,

$\begin{array}{l} \frac{S S E}{σ^{2}} \sim χ^{2} (n - 2) . \end{array}$

An outline of the proof of Theorem 2.2 was given in Chapter 2 in purely algebraic terms. An outline of the proof to the result using the matrix approach to simple linear regression is given here to contrast the difference between the two approaches.

Proof (Outline only; matrix approach) As given in the matrix version of Definition 1.2, the vector of fitted values in a simple linear regression model is the [latex]n \times 1[/latex] column vector

$\begin{array}{l} \hat{Y} = X (X^{'} X)^{- 1} X^{'} Y . \end{array}$

The sum of squares for error in matrix form is

$\begin{array}{l} S S E & = (Y - \hat{Y})^{'} (Y - \hat{Y}) \\ = {[Y - X (X^{'} X)^{- 1} X^{'} Y]}^{'} [Y - X (X^{'} X)^{- 1} X^{'} Y] \\ = [Y^{'} - Y^{'} X^{''} ((X^{'} X)^{'})^{- 1} X^{'}] [Y - X (X^{'} X)^{- 1} X^{'} Y] \\ = Y^{'} [I - X (X^{'} X)^{- 1} X^{'}] [I - X (X^{'} X)^{- 1} X^{'}] Y . \end{array}$

Let [latex]{\bf R} = {\bf I} - {\bf X} \big( {\bf X} ^ \prime {\bf X} \big) ^ {-1} {\bf X} ^ \prime[/latex], where I is the [latex]n \times n[/latex] identity matrix. This matrix plays a critical role in the proof. The matrix [latex]{\bf R}[/latex] is symmetric because

$\begin{array}{l} R^{'} = {[I - X (X^{'} X)^{- 1} X^{'}]}^{'} = I^{'} - X^{''} ((X^{'} X)^{'})^{- 1} X^{'} = I - X (X^{'} X)^{- 1} = R . \end{array}$

The matrix [latex]{\bf R}[/latex] is idempotent because

$\begin{array}{l} R^{2} & = [I - X (X^{'} X)^{- 1} X^{'}] [I - X (X^{'} X)^{- 1} X^{'}] \\ = I^{2} - 2 X (X^{'} X)^{- 1} X^{'} + X (X^{'} X)^{- 1} X^{'} X (X^{'} X)^{- 1} X^{'} \\ = I - X (X^{'} X)^{- 1} X^{'} \\ = R . \end{array}$

Since [latex]{\bf R}[/latex] is a symmetric idempotent matrix, it is a projection matrix. This has two implications. First, the rank of [latex]{\bf R}[/latex] equals the trace of [latex]{\bf R}[/latex], which in this case is [latex]n - 2[/latex]. Second, all eigenvalues of [latex]{\bf R}[/latex] are either zero or one, and in this setting, there are [latex]n - 2[/latex] ones and 2 zeros. The rest of the proof proceeds as follows. Since [latex]{\bf R}[/latex] is symmetric matrix it can be orthogonally diagonalized as [latex]{\bf R} = {\bf U}{\bf D}{\bf U} ^ \prime[/latex], where [latex]{\bf U}[/latex] is an orthogonal matrix and [latex]{\bf D}[/latex] is a diagonal matrix with [latex]n - 2[/latex] ones and 2 zeros on the diagonal. The assumed normality of the error terms in the model results in normally distributed residuals, which can be simplified to yield [latex]SSE / \sigma ^ {\, 2} \sim \chi ^ 2 (n - 2)[/latex]. [latex]\Box[/latex]

The matrix approach gives an alternative way of computing measures of interest in a simple linear regression. Using matrices also allows the following two helpful extensions to simple linear regression.

Removing the first column of the [latex]{\bf X}[/latex] matrix that consists entirely of ones corresponds to forcing a regression line through the origin.
Adding additional columns to the [latex]{\bf X}[/latex] matrix corresponds to including additional independent variables to the regression model, which is known as multiple linear regression. This is the topic of the next section.

3.5 Multiple Linear Regression

Multiple linear regression can often be applied when there are several independent variables (or predictors) [latex]X_1, \, X_2, \, \ldots, \, X_p[/latex] which can be used to explain a continuous dependent (or response) variable Y. Three examples are listed below.

The asking price of a home Y is a function of
- the number of square feet in the home,
- the number of bedrooms, and
- acreage of the land associated with the home.
The annual amount of money a person donates to charity Y is a function of
- the nationality of the person,
- the annual income of the person,
- the net worth of the person,
- the religious affiliation of the person,
- the age of the person, and
- the gender of the person.
The stopping distance of a car Y is a function of
- the speed of the car,
- the weight of the car, and
- the type of brakes installed on the car.

One way to formulate a multiple linear regression model is to treat the left-hand side of the model as an expected value:

$\begin{array}{l} E [Y] = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p} . \end{array}$

Since [latex]E[Y][/latex] denotes a conditional expectation of Y given the values of the p independent variables [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex], a more careful way to write this model is

$\begin{array}{l} E [Y | X_{1}, X_{2}, \dots, X_{n}] = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p} . \end{array}$

So far, there has been no consideration of the probability distribution of the error terms, and that is addressed in the formal definition of a multiple linear regression model given next.

To estimate the parameters in a multiple linear regression model, we collect n observations which each consist of the p independent variables and the associated dependent variable. In most applications, [latex]p > n[/latex]. Occasions arise (often in biostatistical applications) in which [latex]p > n[/latex]. The formulation of the simple linear regression model with notation included for the n observations is

$\begin{array}{l} Y_{i} = β_{0} + β_{1} X_{i 1} + β_{2} X_{i 2} + \dots + β_{p} X_{i p} + ϵ_{i} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. So [latex]X_{ij}[/latex] denotes the value of the jth independent variable collected on the ith observational unit. In the real estate example given at the beginning of this section, X₈₃ is the value of the third independent variable (acreage) collected on the 8th home collected by the analyst. The associated asking price of the 8th home is Y₈.

Figure 3.15 shows a portion of the population regression plane [latex]E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2[/latex] for a multiple linear regression model with [latex]p = 2[/latex] independent variables X₁ and X₂. The plane extends outward from the portion shown in Figure 3.15. The regression parameters β₀, β₁, and β₂ are fixed constants. The intercept β₀ is positive in Figure 3.15 because the plane strikes the Y-axis above the origin. Based on the inclination of the population regression plane relative to the X₁– and X₂-axes it is clear that [latex]\beta_1 < 0[/latex] and [latex]\beta_2 > 0[/latex]. To avoid clutter and highlight the geometry and notation, only the ith data triple [latex](X_{i1}, \, X_{i2}, \, Y_i)[/latex] and the associated error term ϵ_i are shown in the figure.

A graph of the population regression plane and a sample point. — Figure 3.15: Population regression plane and a sample point.

Long Description for Figure 3.15

A rectangular plane is plotted on the three-dimensional coordinate plane, labeled X 1, X2 and Y. The top left vertex of the rectangular plane falls on the Y axis with points 0, 0, beta subscript 0. Three collinear points,( X i 1, X i 2 and Y i); (X I 1, X I 2, E of Y I) in the X 2 Y plane and (X I 1, X I 2, 0) in the X 2, X 1 plane are connected by a dotted line. The point ( X I 1, X I 2, E of Y i) falls on the right end of the rectangular plane. The distance from (X i 1, X i 2, Y i) to( X i 1, X i 2, E of Y i )is indicated as E i. An equation within the rectangular plane reads, E of Y equals beta 0 plus beta 1 X 1 plus beta 2 X 2.

Figure 3.16 shows a portion of the estimated regression plane [latex]Y = \hat \beta_0 + \hat \beta_1 X_1 + \hat \beta_2 X_2[/latex] for a multiple linear regression model with [latex]p = 2[/latex] independent variables X₁ and X₂. The estimated regression parameters [latex]\hat \beta_0[/latex], [latex]\hat \beta_1[/latex], and [latex]\hat \beta_2[/latex] are random variables which are estimated from n data triples [latex]\left( X_{11}, \, X_{12} , \, Y_1 \right), \, \left( X_{21}, \, X_{22}, \, Y_2 \right), \, \ldots , \, \left( X_{n1}, \, X_{n2}, \, Y_n \right)[/latex]. The estimated regression parameters are random variables because the dependent variable values [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] are random variables. The estimated intercept [latex]\hat \beta_0[/latex] is positive in Figure 3.16 because the plane strikes the Y-axis above the origin. Based on the inclination of the estimated regression plane relative to the X₁– and X₂-axes it is clear that [latex]\hat \beta_1 < 0[/latex] and [latex]\hat \beta_2 > 0[/latex]. To avoid clutter and highlight the geometry and notation, just the ith data triple [latex](X_{i1}, \, X_{i2}, \, Y_i)[/latex], the associated fitted value [latex](X_{i1}, \, X_{i2}, \, \hat{Y}_i)[/latex], and the associated residual e_i are shown in the figure.

A graph with an estimated regression plane and a sample point. — Figure 3.16: Estimated regression plane and a sample point.

Long Description for Figure 3.16

A rectangular plane is plotted on the three-dimensional coordinate plane, labeled X 1, X 2 and Y. The top left vertex of the rectangular plane falls on the Y axis with points 0, 0, beta cap 0. Three collinear points, X i 1, X i 2 and Y i; X i 1, X i 2, Y cap i in the X 2 Y plane and X i 1, X i 2, 0 in the X 2, X 1 plane are connected by a dotted line. The point X i 1, X i 2, Y cap i falls on the right end of the rectangular plane. The distance from X i 1, X i 2, Y i to X i 1, X I 2, Y cap i is indicated as e i. An equation within the rectangular plane reads, Y equals beta cap 0 plus beta 1 X 1 plus bet 2 X 2.

When there are [latex]p > 2[/latex] independent variables, the estimated regression model is a hyperplane in [latex]{\cal R} ^ {p + 1}[/latex]. Residual i is the distance [latex]e_i = Y_i - \hat Y_i[/latex], for [latex]i = 1, \, 2, \, \ldots, \, n[/latex].

When the error terms are assumed to be normally distributed, this model is known as the multiple linear regression model with normal error terms. This additional assumption allows for statistical inference concerning parameters and predicted values in a similar manner to that described in Chapter 2.

The multiple linear regression model can also be expressed in terms of matrices. Relative to the simple linear regression model, additional columns are appended to the [latex]{\bf X}[/latex] matrix, and the [latex]\pmb{\beta}[/latex] vector is expanded to include the parameters associated with the additional parameters:

$\begin{array}{l} X = [\begin{array}{ccccc} 1 & X_{11} & X_{12} & \dots & X_{1 p} \\ 1 & X_{21} & X_{22} & \dots & X_{2 p} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 1 & X_{n 1} & X_{n 2} & \dots & X_{n p} \end{array}], Y = [\begin{array}{c} Y_{1} \\ Y_{2} \\ ⋮ \\ Y_{n} \end{array}], β = [\begin{array}{c} β_{0} \\ β_{1} \\ ⋮ \\ β_{p} \end{array}], and ϵ = [\begin{array}{c} ϵ_{1} \\ ϵ_{2} \\ ⋮ \\ ϵ_{n} \end{array}] . \end{array}$

The vectors [latex]{\bf Y}[/latex] and [latex]\pmb{\epsilon}[/latex] remain unchanged from the simple linear regression formulation. The first row of [latex]{\bf X}[/latex] corresponds to the values of the independent variables collected on the first observational unit, the second row of [latex]{\bf X}[/latex] corresponds to the values of the independent variables collected on the second observational unit, etc. As was the case in simple linear regression, [latex]{\bf X}[/latex] is known as the design matrix.

The good news about the matrix approach to multiple linear regression is that the definitions and results from simple linear regression only require some minor tweaking in order to generalize to multiple regression. Several of these definitions and results are given below. In many cases, it is just a matter of replacing the word “simple” with the word “multiple” or updating the degrees of freedom to account for the p independent variables. It is assumed that the [latex]{\bf X}[/latex] matrix has rank [latex]p + 1[/latex] (that is, a full rank matrix), which means that the columns of [latex]{\bf X}[/latex] are linearly independent.

The multiple linear regression model is
$\begin{array}{l} Y = X β + ϵ, \end{array}$

where [latex]E[ \pmb{\epsilon} ] = {\bf 0}[/latex], [latex]V[ \pmb{\epsilon} ] = \sigma ^ {\, 2} {\bf I}[/latex], and [latex]{\bf I}[/latex] is the [latex]n \times n[/latex] identity matrix.
The least squares estimators of [latex]\pmb{\beta}[/latex], denoted by [latex]\hat{\pmb{\beta}} = \big( \hat \beta_0 , \, \hat \beta_1 , \, \ldots , \, \hat \beta_p \big) ^ \prime[/latex], solve the normal equations
$\begin{array}{l} X^{'} X \hat{β} = X^{'} Y . \end{array}$

Since [latex]{\bf X}[/latex] has full rank, [latex]{\bf X} ^ \prime {\bf X}[/latex] is invertible and the normal equations have the unique solution

$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y, \end{array}$

by premultiplying both sides of the normal equations by [latex]\big( {\bf X} ^ \prime {\bf X} \big) ^ {-1}[/latex].
The least squares estimator of [latex]\pmb{\beta}[/latex] in a multiple linear regression model is an unbiased estimator of [latex]\pmb{\beta}[/latex] because
$\begin{array}{l} E [\hat{β}] = β . \end{array}$
The least squares estimators of [latex]\pmb{\beta}[/latex] in the multiple linear regression model can be written as linear combinations of the dependent variables:
$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y, \end{array}$

where the coefficients in the linear combinations are given by [latex]\big( {\bf X} ^ \prime {\bf X} \big) ^ {-1} {\bf X} ^ \prime[/latex].
The variance–covariance matrix of the least squares estimators of [latex]\pmb{\beta}[/latex] is
$\begin{array}{l} σ^{2} (X^{'} X)^{- 1} . \end{array}$
(Gauss–Markov theorem) The least squares estimators of [latex]\pmb{\beta}[/latex] in a multiple linear regression model, [latex]\hat{\pmb{\beta}} = ( {\bf X} ^ \prime {\bf X} ) ^ {-1}{\bf X} ^ \prime {\bf Y}[/latex], have the smallest population variance amongst all linear unbiased estimators of [latex]\pmb{\beta}[/latex].
The vector of fitted values in a multiple linear regression model is the [latex]n \times 1[/latex] column vector
$\begin{array}{l} \hat{Y} = X \hat{β} = X (X^{'} X)^{- 1} X^{'} Y, \end{array}$

which is a linear combination of the dependent variables. The vector of residuals is the [latex]n \times 1[/latex] column vector

$\begin{array}{l} e & = Y - \hat{Y} \\ = Y - X \hat{β} \\ = Y - X (X^{'} X)^{- 1} X^{'} Y \\ = (I - X (X^{'} X)^{- 1} X^{'}) Y, \end{array}$

which is also a linear combination of the dependent variables. The matrix [latex]{\bf I}[/latex] is the [latex]n \times n[/latex] identity matrix.
The multiple linear regression model with normal error terms is
$\begin{array}{l} Y = X β + ϵ, \end{array}$

where [latex]\pmb{\epsilon} \sim N\left( {\bf 0}, \, \sigma ^ {\, 2} {\bf I} \right)[/latex].
For the multiple linear regression model with normal error terms, the maximum likelihood estimators of [latex]\pmb{\beta}[/latex] are
$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y \end{array}$

and the maximum likelihood estimator of σ² is

$\begin{array}{l} {\hat{σ}}^{2} = \frac{1}{n} (Y - X \hat{β})^{'} (Y - X \hat{β}) . \end{array}$

Since the vector of error terms [latex]\pmb{\epsilon}[/latex] consists of independent and identically distributed normal random variables, [latex]{\bf Y} = {\bf X} \pmb{\beta} + \pmb{\epsilon}[/latex] is a vector of independent and identically distributed normal random variables. Since [latex]\hat{\pmb{\beta}}[/latex] is a linear transformation of Y, [latex]\hat{\pmb{\beta}} \sim N \left( {\pmb{\beta}}, \, \sigma ^ {\, 2} \big( {\bf X} ^ \prime {\bf X} \big) ^ {-1} \right)[/latex].
Under the multiple linear regression model, the [latex]n \times n[/latex] hat matrix is
$\begin{array}{l} H = X {(X^{'} X)}^{- 1} X^{'} . \end{array}$

The diagonal elements of the hat matrix are the leverages. The matrix equation

$\begin{array}{l} \hat{Y} = H Y \end{array}$

indicates that [latex]{\bf H}[/latex] transforms [latex]{\bf Y}[/latex] to [latex]\bf\hat{Y}[/latex]. The hat matrix is symmetric (that is, [latex]{\bf H} = {\bf H} ^ \prime[/latex]) and idempotent (that is, [latex]{\bf H} {\bf H} = {\bf H}[/latex]). The trace of the hat matrix is [latex]\sum_{i\,=\,1}^n h_{ii} = p + 1[/latex].

The example of multiple linear regression that follows considers [latex]p = 2[/latex] predictors of the sales price of a home.

Example 3.10 In Example 2.9, the sales price, Y, of homes sold in Ames, Iowa between 2006 and 2010 with between 2500 and 3500 square feet were fitted to a simple linear regression model with the square footage as an independent variable X. There were[latex]n = 120[/latex] homes in the data frame that fit this criteria. In that analysis, the value of the land was estimated to be $21,233 (although this was outside of the scope of the simple linear regression model), and the price of the home increased by an average of $112 with each additional square foot of indoor space. Fit a multiple linear regression model with normal error terms to the same data set using two independent variables, X₁, the square footage of indoor space, and X₂, the square footage of the lot. The dependent variable is again the sales price Y.

The multiple regression model in this setting is

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + ϵ, \end{array}$

where [latex]\epsilon \sim N\left( \mu, \, \sigma _ Z ^ {\, 2} \right)[/latex]. The R code below estimates the regression parameters β₀, β₁, and β₂. The regression function

$\begin{array}{l} E [Y] = β_{0} + β_{1} X_{1} + β_{2} X_{2}, \end{array}$

is a plane in [latex]{\cal R} ^ 3[/latex]. The values of β₁ and β₂ control the tilt of the regression plane, and the value of β₀ is the intercept of the regression plane with the [latex]E[Y][/latex] axis. The regression plane will be fitted in two fashions in R: the matrix approach to multiple linear regression and the built-in lm function. The R code below defines the [latex]{\bf X}[/latex] and [latex]{\bf Y}[/latex] matrices, and then uses the formula

$\begin{array}{l} \hat{β} = (X^{'} X)^{- 1} X^{'} Y \end{array}$

to calculate the estimates of the regression coefficients.

These R statements return the least squares regression parameter estimates [latex]\hat \beta_0 = \text{26,515}[/latex], [latex]{\hat \beta_1 = 96.88}[/latex], and [latex]\hat \beta_2 = 2.65[/latex]. The intercept is not meaningful in this setting because it is associated with a home with 0 square feet and no land. This situation does not make sense nor does it fall in the scope of the model. The naive interpretation of the other regression coefficients in the fitted model are (a) the sales price of a home increases by an average of $96.88 for each additional square foot in the home, and (b) the sales price of the home increases by $2.65 for each additional square foot in the lot size. The interpretation of the estimated regression coefficients is more nuanced in the case of multiple independent variables because those independent variables are often correlated. So reporting that “the value of [latex]{\hat \beta_1 = 96.88}[/latex] means that the sales price of the house increases by an average of $96.88 for each additional square foot of interior space with the lot size fixed” is not quite accurate because the interior space and lot size might be correlated. Larger homes might be built on larger lots, for example. Regression analysts acknowledge possible correlations between the independent variables by just stating “the sales price increases by an average of $96.88 for each additional square foot of interior space, adjusted for lot size” when interpreting [latex]\hat \beta_1[/latex]. Likewise, “the sales price increases by an average of $2.65 for each additional square foot of lot size, adjusted for interior square footage” when interpreting [latex]\hat \beta_2[/latex].

A second way to calculate the estimated regression coefficients is to use R’s built-in lm function.

The call to the summary function prints the following output concerning the fitted multiple linear regression model.

The estimated regression coefficients match those that were calculated using the matrix approach to multiple linear regression. The right-hand column of p-values tells us that the size of a home is a statistically significant predictor of the sales price of a home, but the lot size is not a statistically significant predictor of the sales price of a home.

A multiple linear regression model can easily be adapted to include nonlinear terms. A multiple regression model with two independent variables X₁ and X₂, for example, with a linear relationship between X₁ and Y and a quadratic relationship between X₂ and Y which includes an intercept term is

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{2}^{2} + ϵ . \end{array}$

Using the R lm function to estimate the coefficients will be illustrated in Section 3.7.

Multiple linear regression has many more modeling issues that arise than simple linear regression. The subsections that follow consider the following topics within multiple regression: (a) handling categorical independent variables which fall in categories rather than quantitative values, (b) handling the case in which independent variables have interactive effects, (c) extending the ANOVA table to multiple independent variables, (d) calculation of the coefficient of determination for multiple linear regression, and an adjustment that can be made to reduce its bias, (e) the effect of multicollinearity among the independent variables, and (f) algorithms for model selection.

3.5.1 Categorical Independent Variables

Some regression models include independent variables which are not naturally quantitative, but are rather categorical. These categorical independent variables require some special treatment in order to be included in a multiple linear regression model. The cases in which a categorical independent variable falls in one of two categories will be considered separately from the case in which a categorical independent variable falls in one of more than two categories.

Categorical independent variable which falls in one of two categories. Consider a multiple linear regression model with [latex]p = 2[/latex] independent variables, X₁, which is age, and X₂, which is gender. The dependent variable is the annual salary Y. So the multiple linear regression model is

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + ϵ . \end{array}$

Regression models assume that the independent variables are quantitative rather than categorical like gender. One solution to this problem is to code the gender as 0 for female and 1 for male. The independent variable X₂ in this case is known as a dummy variable or an indicator variable. As a particular instance, consider [latex]n = 6[/latex] data points consisting of three women (ages 26, 71, and 34) and three men (ages 44, 65, and 21). In this case the design matrix is

$\begin{array}{l} X = [\begin{array}{ccc} 1 & 26 & 0 \\ 1 & 71 & 0 \\ 1 & 34 & 0 \\ 1 & 44 & 1 \\ 1 & 65 & 1 \\ 1 & 21 & 1 \end{array}] . \end{array}$

The elements of the six-element column vector Y are the associated salaries. The value of [latex]\hat \beta_0[/latex] is not meaningful here. Not only is it outside of the scope of the model, its interpretation as the annual salary of a newborn baby girl doesn’t fit with societal norms. Newborn baby girls seldom earn annual salaries. The value of [latex]\hat \beta_1[/latex] indicates the increase in annual salary for each additional year in age, adjusted for gender. Since salaries tend to rise over time, we anticipate that [latex]\hat \beta_1[/latex] will be positive. The value of [latex]\hat \beta_2[/latex] indicates the change in salary associated being male rather than female, adjusted for age. If [latex]\hat \beta_2[/latex] is significantly greater than zero, then men’s salaries are significantly higher than women’s salaries, adjusted for age; if [latex]\hat \beta_2[/latex] is significantly less than zero, then women’s salaries are significantly higher than men’s salaries, adjusted for age. The choice of using an indicator of 0 for women and 1 for men was arbitrary. See if you can predict what would happen if instead we used 0 for men and 1 for women.

Categorical independent variable which falls in one of more than two categories. Let’s extend the regression model to predict the annual salary to include another categorical variable: political affiliation. This categorical variable will have three levels: Republican, Democrat, and Independent. The third category includes anyone who is not affiliated with the two main political parties in the United States. Although it might be tempting to just let [latex]X_3 = 1[/latex] denote a Republican, [latex]X_3 = 2[/latex] denote a Democrat, and [latex]X_3 = 3[/latex] denote an Independent, this will likely produce erroneous results for two reasons. First, using the ordering [latex]X_3 = 1[/latex], [latex]X_3 = 2[/latex], and [latex]X_3 = 3[/latex] implies an ordering of the salaries associated with individuals from the three different political affiliations for [latex]{\beta_3 > 0}[/latex], or the opposite ordering of the salaries associated with individuals from the three political affiliations for [latex]\beta_3 < 0[/latex]. This ordering might not be the correct ordering. Second, leaving a gap of 1 between each of the values of X₃ indicates that there is a known and equal salary gap between individuals from the ordered different political affiliations. The usual way to account for a categorical independent variable which can take on c values is to define [latex]c - 1[/latex] independent indicator variables. In the case of political affiliation, the independent variables X₃ and X₄ can be defined as

$\begin{array}{l} X_{3} = {\begin{cases} 0 & not a Republican \\ 1 & Republican \end{cases} \end{array}$

and

$\begin{array}{l} X_{4} = {\begin{cases} 0 & not a Democrat \\ 1 & Democrat . \end{cases} \end{array}$

So now the multiple linear regression model with [latex]p = 4[/latex] independent variables is

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{4} + ϵ . \end{array}$

In this fashion, the expected value of an Independent’s salary is given by

$\begin{array}{l} E [Y] = β_{0} + β_{1} X_{1} + β_{2} X_{2}, \end{array}$

the expected value of an Republican’s salary is given by

$\begin{array}{l} E [Y] = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3}, \end{array}$

and the expected value of a Democrat’s salary is given by

$\begin{array}{l} E [Y] = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{4} X_{4} . \end{array}$

With this arrangement of the levels of the categorical variable representing the political affiliation, there is no predicted ordering of salaries by the three political affiliations nor are the gaps between the affiliations necessarily equal.

As a particular instance, consider [latex]n = 6[/latex] data points with three women (a 26-year-old Independent, a 71-year-old Democrat, and a 34-year-old Republican) and three men (a 44-year-old Independent, a 65-year-old Democrat, and a 21-year-old Republican) in the study. The appropriate design matrix is

$\begin{array}{l} X = [\begin{array}{ccccc} 1 & 26 & 0 & 0 & 0 \\ 1 & 71 & 0 & 0 & 1 \\ 1 & 34 & 0 & 1 & 0 \\ 1 & 44 & 1 & 0 & 0 \\ 1 & 65 & 1 & 0 & 1 \\ 1 & 21 & 1 & 1 & 0 \end{array}] . \end{array}$

The value of [latex]\hat \beta_3[/latex] is the estimated difference between the mean annual salary of an Independent and a Republican, adjusted for age and gender. The value of [latex]\hat \beta_3[/latex] is the estimated difference between the mean annual salary of an Independent and a Democrat, adjusted for age and gender. This example has been for illustrative purposes only. Estimating five parameters [latex]\beta_0, \, \beta_1, \, \ldots , \beta_4[/latex] from just six data values will almost certainly not provide strong statistical evidence concerning the effect of age, gender, and political affiliation on salary. Furthermore, many other important factors, such as years of education, years on the job, and type of work, have not been included in this regression model.

3.5.2 Interaction Terms

The multiple linear regression model

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p} + ϵ \end{array}$

assumes a linear relationship between each independent variable and Y and the slope associated with an independent variable is identical at all values of the other independent variables within the scope of the multiple linear regression model. This relationship is illustrated for some selected data points of smaller homes from the Ames, Iowa housing data set from Examples 2.9 and 3.10. In this case, X₁ is the interior square footage, X₂ is an indicator variable reflecting the lot size,

$\begin{array}{l} X_{2} = {\begin{cases} 0 & lot size is less than or equal to 10,000 square feet \\ 1 & lot size is greater than 10,000 square feet, \end{cases} \end{array}$

and Y is the sales price. The multiple linear regression model with the [latex]p = 2[/latex] independent variables is

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + ϵ . \end{array}$

Figure 3.17 shows a scatterplot of the interior square footage and sales price of homes on smaller lots ([latex]X_2 = 0[/latex] as open points) and larger lots ([latex]X_2 = 1[/latex] as solid points). The values of [latex]\hat \beta_0[/latex], [latex]\hat \beta_1[/latex], and [latex]\hat \beta_2[/latex] are indicated on the graph. The estimated intercept [latex]\hat \beta_0 = \text{21,473}[/latex], although slightly outside of the scope of the model, gives the estimated sales price of a small lot containing no dwelling as $21,473. The estimated regression coefficient [latex]\hat \beta_1 = 31.33[/latex] indicates that the sales price of a home increases by an estimated $31.33 for each additional interior square foot, adjusted for lot size. The estimated regression coefficient [latex]\hat \beta_2 = \text{35,693}[/latex] indicates that homes on larger lots cost $35,693 more, on average, than homes on smaller lots, adjusted for interior square feet. Notice that this formulation of the multiple linear regression model forces the slopes of the two lines in Figure 3.17 to be identical, regardless of the value of X₂.

A scatter plot graph with multiple linear regression lines for data pairs indicating sales price for smaller lots and larger lots. The values of beta cap 0, beta cap 1 and beta cap 2 are indicated. — Figure 3.17: Fitted multiple linear regression model [latex]Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon[/latex].

Long Description for Figure 3.17

The horizontal axis labeled X 1 and ranges between 0 to 2500 in increments of 500. The vertical axis labeled Y ranges from 0 to 120,000 in increments of 40,000. The data pairs representing sales price of smaller lots, X equals 0, are indicated in open circle sand those for larger lots, X 2 equals 1 are represented as solid points. Two parallel regression lines, with positive slopes originate from points (0, 20,000) and (0, 50,000). The data pair for smaller lots are clustered around the regression line originating at (0, 20,000) and those for larger lots are clustered around the regression line originating at (0, 50,000). The slope of the regression lines are indicated as beta cap 1. The distance between horizontal axis and regression line originating at (0, 20,000) is indicated as beta cap 0. The distance between the two regression lines is indicated as beta cap 2.

But is the assumption of equal slopes of the two lines in Figure 3.17 justified? Separate simple linear regression models are fitted to the homes built on smaller and larger lots, and the results are plotted in Figure 3.18. The lines do not appear to be parallel in this case, indicating that a more complex regression model is warranted. There appears, in this case, to be an interaction effect between X₁ and X₂. This means that the effect of one independent variable (X₁, for example, the interior size) on Y is altered based on the value of another independent variable (X₂, the lot size indicator).

A scatter plot graph with two separate simple linear regression lines for data pairs, indicating for smaller lots and larger lots. — Figure 3.18: Fitted simple linear regression models [latex]Y = \beta_0 + \beta_1 X_1 + \epsilon[/latex].

Long Description for Figure 3.18

The horizontal axis labeled X 1 ranges between 0 to 2500 in increments of 500. The vertical axis labeled Y ranges from 0 to 120,000 in increments of 40,000. The data pairs representing the sales price of smaller lots, X equals 0, are indicated in open circle sand those for larger lots, X 2 equals 1 are represented as solid points. Two regression lines, with positive slopes originate from points (0, 30,000) and (0, 40,000). Most of the data pair fall between 500 and 1500 of horizontal axis and 30,000 and 60,000 of vatical axis values. The data pair for smaller lots are clustered around the less steeper regression line originating at (0, 30,000). The data pairs for larger lots are clustered around the regression line, which is a steeper line, originating at (0, 40,000). Most of the data pairs lie between 1000 and 2000 on the horizontal axis and between 70,000 and 120,000 on the vertical axis.

Regression analysts account for this interaction by including cross-product terms in the regression model. In this Ames housing data set example, the regression model with an interaction term is

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{1} X_{2} + ϵ . \end{array}$

If the regression parameter [latex]\hat \beta_3[/latex] differs statistically from 0, then the inclusion of the interaction term is warranted. Notice that when [latex]X_2 = 0[/latex] (smaller lots), the model reduces to

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + ϵ, \end{array}$

which is a simple linear regression model with intercept parameter β₀ and slope parameter β₁. On the other hand, when [latex]X_2 = 1[/latex] (larger lots), the model reduces to

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} + β_{3} X_{1} + ϵ \end{array}$

or

$\begin{array}{l} Y = β_{0} + β_{2} + (β_{1} + β_{3}) X_{1} + ϵ \end{array}$

which is a simple linear regression model with intercept parameter [latex]\beta_0 + \beta_2[/latex] and slope parameter [latex]\beta_1 + \beta_3[/latex]. It is in this fashion that the two non-parallel lines depicted in Figure 3.18 can be estimated in a single regression model. Not surprisingly, it requires four parameters, β₀, β₁, β₂, and β₃, to do so. The multiple linear regression model with an interaction term can be fitted using the lm function in R by simply replacing the usual + in the formula with *. All four parameters are statistically significant at the 0.05 level in this case, so the inclusion of an interaction term is warranted.

3.5.3 The ANOVA Table

The degrees of freedom for the sums of squares in multiple linear regression are modified because of the additional parameters estimated relative to those given in the ANOVA table from Table 2.2 for simple linear regression. The ANOVA table for a multiple linear regression model with p independent variables and normal error terms is given in Table 3.6.

Table 3.6: Basic ANOVA table for multiple linear regression.
Source	SS	df	MS	F
Regression	SSR	p	MSR	[latex]MSR / MSE[/latex]
Error	SSE	[latex]n - p - 1[/latex]	MSE
Total	SST	[latex]n - 1[/latex]

Formulas for the sums of squares using the matrix formulation for multiple linear regression are [latex]SST = SSR + SSE[/latex], which is

$\begin{array}{l} (Y - \bar{Y})^{'} (Y - \bar{Y}) = (\hat{Y} - \bar{Y})^{'} (\hat{Y} - \bar{Y}) + (Y - \hat{Y})^{'} (Y - \hat{Y}), \end{array}$

where [latex]\bar {\bf Y}[/latex] is an n-element column vector with identical elements which are each the sample mean of the values of the dependent variable. Equivalently,

$\begin{array}{l} S S T = Y^{'} Y - Y^{'} J Y / n, S S R = {\hat{β}}^{'} X^{'} Y - Y^{'} J Y / n, S S E = Y^{'} Y - {\hat{β}}^{'} X^{'} Y, \end{array}$

where [latex]\bf{J}[/latex] is an [latex]n \times n[/latex] matrix with all elements being equal to 1. The mean square error for regression is [latex]MSR = SSR / p[/latex], the mean square error is [latex]MSE = SSE / (n - p - 1)[/latex], and the test statistic [latex]F = MSR / MSE[/latex] can be used for testing

$\begin{array}{l} H_{0} : β_{1} = β_{2} = \dots = β_{p} = 0 \end{array}$

versus

$\begin{array}{l} H_{1} : not all β_{1}, β_{2}, \dots, β_{p} equal 0 \end{array}$

where F has an [latex]F(p, \, n - p - 1)[/latex] distribution under H₀. The anova function in R can be used to generate an ANOVA table associated with a multiple linear regression model fitted by the lm function. For the Ames, Iowa housing data from Example 3.10 which used [latex]p = 2[/latex] independent variables (interior square footage and lot size), the R summary function returns the test statistic [latex]F = 5.322[/latex], which is associated with a p-value of [latex]p = 0.006[/latex] based on the F distribution with [latex]p = 2[/latex] and [latex]n - p - 1 = 120 - 2 - 1 =117[/latex] degrees of freedom. There is strong statistical evidence that one or both of the coefficients [latex]\hat \beta_1[/latex] and [latex]\hat \beta_2[/latex] is statistically different from zero. One or both of the independent variables is effective in predicting the sales price.

3.5.4 Adjusted Coefficient of Determination

The coefficient of determination for a multiple linear regression model is defined as

$\begin{array}{l} R^{2} = \frac{S S R}{S S T} = \frac{S S T - S S E}{S S T} = 1 - \frac{S S E}{S S T}, \end{array}$

and it measures the fraction of variation in [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] about [latex]\bar Y[/latex] that is accounted for by the linear relationship between the independent variables [latex]X_1, \, X_2, \, \ldots, \, X_p[/latex] and Y. As before [latex]0 \le R ^ 2 \le 1[/latex], and the extreme cases are associated with [latex]\hat \beta_1 = \hat \beta_2 = \cdots = \hat \beta_p = 0[/latex] (for [latex]R ^ 2 = 0[/latex]) and all Y-values falling in the estimated regression hyperplane (for [latex]R ^ 2 = 1[/latex]).

Now consider a multiple linear regression model with p independent variables [latex]X_1, \, X_2, \, \ldots, \, X_p[/latex]. What is the effect on SST and SSE of adding another independent variable, [latex]X_{p + 1}[/latex], to the model? Adding another independent variable does not affect SST because it depends only on [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex]. The value of SSE cannot increase with the addition of the new independent variable because either (a) SSE will remain the same if [latex]\hat \beta_{p + 1} = 0[/latex], or (b) SSE will decrease if [latex]\hat \beta_{p + 1} \ne 0[/latex]. The impact on R² is that it must stay the same or increase for every additional independent variable that is added to the model.

It is for this reason that R² tends to be a biased estimator of the fraction of variation in [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] accounted for by the independent variables. Some regression software (including R) calculate an adjusted coefficient of variation by dividing the sums of squares by their associated degrees of freedom

$\begin{array}{l} R_{adj}^{2} = 1 - \frac{S S E / (n - p - 1)}{S S T / (n - 1)} . \end{array}$

Both values are reported in the call to the summary function with the Ames, Iowa housing data in Example 3.10 as

$\begin{array}{l} R^{2} = 0.08339 and R_{adj}^{2} = 0.06772 . \end{array}$

3.5.5 Multicollinearity

In many settings, the values of the independent variables are correlated. In the housing data set from Example 3.10, for example, the independent variables X₁ (interior square footage) and X₂ (lot size) are probably positively correlated. Intuition suggests that larger homes are built on larger lots, on average. In the extreme case, what if homes in Ames were required by some bizarre municipal code to all be single story homes with the square footage of the lot always exactly four times the square footage of the interior of the home? In this case, [latex]X_2 = 4 X_1[/latex], so knowing the value of either X₁ or X₂ allows you to know the value of the other. Intuitively, one of the two independent variables is superfluous. When this is the case, the design matrix [latex]{\bf X}[/latex] has two columns which are multiples of one another, so these columns are linearly dependent and the matrix does not have full rank. This implies that the matrix [latex]{\bf X} ^ \prime {\bf X}[/latex] (which is used in computing the estimates of the regression coefficients) is singular, so it does not have an inverse. In this case, the usual formula for the regression coefficients,

$\begin{array}{l} \hat{β} = {(X^{'} X)}^{- 1} X^{'} Y, \end{array}$

is undefined because the matrix [latex]{\bf X} ^ \prime {\bf X}[/latex] does not have an inverse. In the case in which [latex]X_2 = 4 X_1[/latex], all pairs of the independent variables fall on a line, so it is impossible to know the proper tilt of the fitted regression plane in [latex]{\cal R} ^ 3[/latex]. There are many planes that minimize the sum of squared errors.

Multicollinearity is the condition associated with independent variables that are highly correlated among themselves in a multiple regression model. More specifically, multicollinearity occurs when two or more of the independent variables have a high correlation. This can appear as an approximately linear relationship between two of the independent variables. Multicollinearity is a condition associated with the design matrix [latex]{\bf X}[/latex] rather than the values of the dependent variable [latex]{\bf Y}[/latex] or the model [latex]{\bf Y} = {\bf X} {\pmb{\beta}} + {\pmb{\epsilon}}[/latex]. In cases in which multicollinearity exists, the matrix [latex]{\bf X} ^ \prime {\bf X}[/latex] has an inverse, but it is ill-conditioned and subject to slight variations in the data or is unstable because of large differences in the magnitudes of the various values of the independent variables. One of the key practical issues when multicollinearity is present is that an estimated regression coefficient for a particular independent variable depends on whether the other independent variables are included or left out of the model.

So multicollinearity has been loosely defined as high correlation among the independent variables. There is redundancy to the information contained in the independent variables. The next paragraphs describe how to detect multicollinearity, its consequences, and some remedies.

Although the hypothetical perfect correlation between the interior space and the lot size of a home from Ames, Iowa described previously occurs seldom in practice, highly correlated independent variables can result in some unusual behavior of regression coefficients as a regression model is constructed. Some signs that multicollinearity might be present in a multiple linear regression model include the following.

Large values of the estimated standard deviations of the regression coefficients.
Including or not including an independent variable in the model results in large changes to the estimated regression coefficients.
An estimated regression coefficient that is statistically significant when the associated independent variable is considered alone, but becomes insignificant when one or more other independent variables are added to the model.
An estimated regression coefficient with a sign that is inconsistent with expected sign or inconsistent with previous similar data sets.
The pairwise sample correlation among the independent variables is high. The cor function in R can be used to assess the correlation among independent variables. The R statement

for example, calculates the correlation matrix for the columns of the built-in data frame named swiss. The off-diagonal elements of this matrix range from [latex]-0.69[/latex] to 0.70, indicating that multicollinearity is present.

All of the criteria listed above are informal. A more formal way to determine whether multicollinearity is present is to introduce a statistic which reflects multicollinearity. The estimate of the variance of [latex]\hat \beta _ j[/latex] can be written as

$\begin{array}{l} \hat{V} [{\hat{β}}_{j}] = \frac{1}{1 - R_{j}^{2}} [\frac{M S E}{\sum_{i = 1}^{n} (X_{i j} - {\bar{X}}_{j})^{2}}], \end{array}$

where [latex]\bar X_j = \sum_{i\,=\,1}^n X_{ij}[/latex], [latex]MSE = SSE / (n - p - 1)[/latex] for the full multiple regression model, and [latex]R_j^2[/latex] is the coefficient of determination obtained by conducting a multiple linear regression with X_j as the dependent variable and the other [latex]p - 1[/latex] X-values as the independent variables, for [latex]j = 1, \, 2, \, \ldots, \, p[/latex]. The coefficient on the right-hand side of this equation,

$\begin{array}{l} V I F_{j} = \frac{1}{1 - R_{j}^{2}}, \end{array}$

is known as a variance inflation factor for independent variable j, for [latex]j = 1, \, 2, \, \ldots, \, p[/latex]. In the extreme case when [latex]R_j^2 = 0[/latex], the associated variance inflation factor is [latex]VIF_j = 1[/latex]. This corresponds to the case in which X_j is not linearly related to the other independent variables. As [latex]R_j^2[/latex] increases, [latex]VIF_j[/latex] also increases, corresponding to increased correlation between the independent variables. When the largest of the [latex]VIF_j[/latex] values exceeds the threshold value of 10, one can conclude that the multicollinearity is present among the independent variables.

The R code below calculates the variance inflation factors for the data values in the swiss data frame, where the independent variables

X₁, the percentage of males involved in agriculture as an occupation,
X₂, the percentage of draftees receiving the highest make on an army examination,
X₃, the percentage of draftees with education beyond the primary school,
X₄, the percentage of Catholics, and
X₅, the percentage of live births who live less than one year,

are used to predict Y, a common standardized fertility measure, from the [latex]n = 47[/latex] French-speaking provinces of Switzerland in about the year 1888. The R code below computes the variance inflation factors for the [latex]p = 5[/latex] independent variables.

The variance inflation factors for the [latex]p = 5[/latex] independent variables are

$\begin{array}{l} V I F_{1} = 2.28, V I F_{2} = 3.68, V I F_{3} = 2.77, V I F_{4} = 1.94, V I F_{5} = 1.11 . \end{array}$

Since none of these five values exceeds 10, we can conclude that the multicollinearity that exists in the independent variables is not strong enough to cause concern. (Some regression analysts use 5 as a threshold rather than 10.) Some keystrokes can be saved by using the vif function from the car package on a multiple linear regression model fitted by the lm function.

One popular remedy for multicollinearity is known as ridge regression, which is a parameter estimation technique that abandons the requirement of unbiased parameter estimates. The approach taken with ridge regression is to choose estimates for the regression parameters that are biased, but have a smaller variance than the ordinary least squares estimates. The goal is to generate parameter estimates with tolerable bias but smaller variance. The typical approach used in statistics to overcome this bias/variability trade-off is to use the estimates that minimize the mean square errors. Assuming that the X and Y values have been centered, we can dispense with the need for an intercept term in the multiple regression model. Rather than minimizing the usual sum of squared errors

$\begin{array}{l} S = \sum_{i = 1}^{n} {(Y_{i} - β_{1} X_{i 1} - β_{2} X_{i 2} - \dots - β_{p} X_{i p})}^{2}, \end{array}$

ridge regression minimizes

$\begin{array}{l} S_{R} = \sum_{i = 1}^{n} {(Y_{i} - β_{1} X_{i 1} - β_{2} X_{i 2} - \dots - β_{p} X_{i p})}^{2} + λ \sum_{j = 1}^{p} β_{j}^{2} . \end{array}$

There are now two terms in the modified sum of squares. The second term in S_R is known as the penalty term. The new parameter λ is known as the penalty parameter. When [latex]\lambda = 0[/latex], [latex]S_R[/latex], reduces to the ordinary least squares case and achieves a value SSE at the ordinary least squares estimators. As λ increases, the estimators converge to [latex]\hat \beta_1 = \hat \beta_2 = \cdots = \beta_p = 0[/latex]. We desire a λ value that introduces some bias into the parameter estimates, but also have a reduced variance.

The geometry associated with ridge regression for [latex]p = 2[/latex] independent variables X₁ and X₂ in a multiple linear regression model is illustrated in Figure 3.19. The ellipses are level surfaces of the first term in [latex]S_R[/latex]. The center of the ellipses is the ordinary least squares estimators of [latex](\beta_1, \, \beta_2) = \big( \hat \beta_1, \, \hat \beta_2 \big)[/latex], which are the values that minimize the first term of [latex]S_R[/latex]. The circles centered at the origin are level surfaces of the second term in [latex]S_R[/latex]. The ridge regression estimators for β₁ and β₂ will occur at the intersection of one of elliptical and circular contours. In Figure 3.19 the two outermost level surfaces intersect at a point, which is a value of the ridge regression estimates of β₁ and β₂ which correspond to one particular value of the penalty parameter λ. The point at which this intersection occurs is a function of the penalty parameter λ. In higher dimensions, the circles become spheres and the ellipses become ellipsoids.

A graph of the ridge regression geometry for two independent variables. — Figure 3.19: Ridge regression geometry for [latex]p = 2[/latex] independent variables.

Long Description for Figure 3.19

The horizontal axis labeled beta 0 contains the points 0 and beta cap 0. The vertical axis labeled beta 1 contains the points 0 and beta cap 1. A point plotted at beta cap 0, beta cap 1 is surrounded by two concentric elliptical contours. Another point plotted at 0, 0 is surrounded by two concentric circles. The outermost levels of the elliptical and circle contours intersect at a point between 0 and beta cap 0 of the horizontal axis and 0 and beta cap 1 of the vertical axis.

Determining the value of the penalty parameter is critical in ridge regression, but its choice depends on the regression model and associated data set. A common technique for determining an optimal value for λ is known as k-fold cross-validation. There are several functions in R which can perform ridge regression: the lm.ridge function from the MASS package, the linearRidge function from the ridge package, and the glmnet function from the glmnet package. Ridge regression is related to the lasso (least absolute shrinkage and selection operator) estimator and elastic net regularization, two other popular parameter estimation techniques that are often applied for large values of p.

Is there a way to completely avoid multicollinearity? In some settings, the answer is yes. When the values of the independent variables are chosen so that they are uncorrelated, the regression coefficients associated with a simple linear regression model of each independent variable separately match the regression coefficients of any model involving more independent variables. This fact provides a strong argument for a designed experiment which can result in uncorrelated independent variables whenever the setting of the regression problem make this possible.

3.5.6 Model Selection

It is common in regression modeling to have a large number of potential independent variables that might adequately predict the dependent variable Y that need to be sifted through in order to decide whether each should be included or excluded from the regression model. If there are p potential independent variables in the multiple linear regression model

$\begin{array}{l} Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p} + ϵ \end{array}$

then there are 2^p possible regression models (always including an intercept term and not considering interaction terms or nonlinear terms) because each independent variable will either be included or not included in the regression model. Since the number of regression models to fit can be daunting, even for moderate values of p, we desire an algorithm for selecting the appropriate independent variables to include in the model. Forward stepwise regression is one such automatic search procedure used to select the independent variables to include in a multiple linear regression model. The procedure begins with the null model [latex]Y = \beta_0 + \epsilon[/latex] and progressively adds independent variables to the model that are deemed to be statistically significant. In the initial step, p simple linear regression models are fit for each potential independent variable. The independent variable with the smallest p-value falling below a prescribed threshold (commonly, [latex]\alpha = 0.05[/latex]) associated with the t-test described in Section 2.3.2 is added to the model. In the second step, [latex]p - 1[/latex] multiple linear regression models with two independent variables are fitted using the previously selected independent variable and each of the other potential independent variables. The independent variable with the smallest p-value is added to the model. This process continues until no more independent variables meet the criteria. This is the multiple linear regression model selected by forward stepwise regression. Several other variants of forward stepwise regression and other model selection algorithms are outlined below.

Foreward stepwise regression often includes a test to determine whether independent variables that have previously been added to the model have p-values that exceed the threshold and should consequently be removed from the model.
Backward stepwise regression starts by including all p independent variables in the regression model and eliminates the independent variable with the largest p-value on each step. Unfortunately, there is no guarantee that forward stepwise regression and backward stepwise regression will result in the same final regression model.
Once this statistically significant independent variables have been identified, a similar stepwise procedure can be executed to test for statistically significant interaction terms.
A similar stepwise procedure can be executed to test for the significance of nonlinear terms in the regression model.
With increased computer speeds and a moderate value of p, the number of independent variables, it is possible to fit all 2^p possible regression models and compare them to determine an appropriate final regression model.
Comparing potential regression models using p-values is not universal. The Akaike Information Criterion (AIC) is a measure which extracts a penalty for each additional parameter in a model in an effort to avoid overfitting.In summary, selecting a multiple linear regression model is not easy. The skills required to select a model include the ability to (a) detect and remedy multicollinearity, (b) assess evidence of interaction effects between independent variables and include them in the model when appropriate, (c) assess evidence of nonlinear relationships between some or all of the independent variables and the dependent variable and include appropriate terms in the model, (d) execute the appropriate multidimensional diagnostic procedures (outlined in the simple linear regression case in Section 3.2) and execute the appropriate remedial procedures (outlined in the simple linear regression case in Section 3.3) when model assumptions are violated, and (e) assess the normality of the residuals.

3.6 Weighted Least Squares

The three approaches to estimating the parameters in a simple linear regression model that we have encountered thus far,

the algebraic approach,
the matrix approach,
using the R lm (linear model) function,

all have the same assumptions regarding the independent variable, the dependent variable, and the model [latex]Y = \beta_0 + \beta_1 X + \epsilon[/latex]. In all three approaches, the error terms are assumed to be mutually independent random variables, each with population mean 0 and population variance–covariance matrix [latex]V[\epsilon] = \sigma _ Z ^ {\, 2} I[/latex], where I is the [latex]n \times n[/latex] identity matrix. This means that [latex]V[ \epsilon_i ] = \sigma _ Z ^ {\, 2}[/latex], for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. There is also an implicit assumption that each of the data pairs [latex](X_i, \, Y_i)[/latex] are each given equal weight in the regression.

Settings occasionally arise in which some data values should be given different weights. There might be evidence that some of the Y_i values have more precision than others. Weights can be placed on each of the data pairs to account for this difference in precision. This leads to a weighted least squares approach to estimating the coefficients in a regression model.

In the standard simple linear regression model, the assumption

$\begin{array}{l} V [ϵ_{i}] = σ_{Z}^{2}, \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], means that the variance of the dependent variable from the regression line is equal for all of the n data pairs, regardless of the value of the independent variable. In weighted least squares modeling, the positive weights [latex]w_1, \, w_2, \, \ldots , \, w_n[/latex] are determined so that

$\begin{array}{l} V [ϵ_{i}] = σ_{Z}^{2} / w_{i} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], which means that certain data pairs have more precision than other data pairs. The weights are fixed constants. There is no requirement that the weights sum to one. Data pairs with larger weights are assumed to have a lower variability to their error terms. This allows for a population variance that changes from one data pair to another.

As an illustration, the values of the dependent variable Y might be sample means at the various values of the independent variable X. Furthermore, if the sample sizes associated with the sample means are known and unequal, then we would like to assign higher weights to the data pairs associated with larger sample sizes. If n_i is the sample size for data pair i, for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], then the appropriate weight for data pair i is [latex]w_i = n_i[/latex] so that

$\begin{array}{l} V [ϵ_{i}] = σ_{Z}^{2} / n_{i} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex].

So rather than minimizing the sum of squares

$\begin{array}{l} S = \sum_{i = 1}^{n} (Y_{i} - β_{0} - β_{1} X_{i})^{2} \end{array}$

as was the case in the standard simple linear regression model, weighted least squares minimizes the weighted sum of squares

$\begin{array}{l} S = \sum_{i = 1}^{n} w_{i} (Y_{i} - β_{0} - β_{1} X_{i})^{2} . \end{array}$

Notice that this reduces to the ordinary sum of squares when [latex]w_1 = w_2 = \cdots = w_n = 1[/latex]. As before, calculus can be used to minimize S with respect to β₀ and β₁ to arrive at the least squares estimators [latex]\beta_0[/latex] and [latex]\beta_1[/latex]. The partial derivatives of S with respect to β₀ and β₁ are

$\begin{array}{l} \frac{\partial S}{\partial β_{0}} = - 2 \sum_{i = 1}^{n} w_{i} (Y_{i} - β_{0} - β_{1} X_{i}) = 0 \end{array}$

and

$\begin{array}{l} \frac{\partial S}{\partial β_{1}} = - 2 \sum_{i = 1}^{n} w_{i} X_{i} (Y_{i} - β_{0} - β_{1} X_{i}) = 0. \end{array}$

These can be simplified to give the normal equations

$\begin{array}{l} β_{0} \sum_{i = 1}^{n} w_{i} + β_{1} \sum_{i = 1}^{n} w_{i} X_{i} = \sum_{i = 1}^{n} w_{i} Y_{i} \end{array}$

and

$\begin{array}{l} β_{0} \sum_{i = 1}^{n} w_{i} X_{i} + β_{1} w_{i} X_{i}^{2} = \sum_{i = 1}^{n} w_{i} X_{i} Y_{i} . \end{array}$

The normal equations are a system of two linear equations in the two unknowns β₀ and β₁, given the data pairs [latex](X_1, \, Y_1), \, (X_2, \, Y_2), \, \ldots , \, (X_n, \, Y_n)[/latex] and the weights [latex]w_1, \, w_2, \, \ldots, \, w_n[/latex]. The normal equations can be solved to yield the weighted least squares estimators. This derivation constitutes a proof of the following theorem.

The matrix approach can also be applied to weighted least squares. Define the [latex]{\bf X}[/latex], [latex]{\bf Y}[/latex], [latex]\pmb{\beta}[/latex] and [latex]\pmb{\epsilon}[/latex] matrices as in Section 3.4:

$\begin{array}{l} X = [\begin{array}{cc} 1 & X_{1} \\ 1 & X_{2} \\ ⋮ & ⋮ \\ 1 & X_{n} \end{array}], Y = [\begin{array}{c} Y_{1} \\ Y_{2} \\ ⋮ \\ Y_{n} \end{array}], β = [\begin{array}{c} β_{0} \\ β_{1} \end{array}], and ϵ = [\begin{array}{c} ϵ_{1} \\ ϵ_{2} \\ ⋮ \\ ϵ_{n} \end{array}] . \end{array}$

In addition, assume that the matrix [latex]{\bf W}[/latex] is a diagonal matrix with the weights [latex]w_1, \, w_2, \, \ldots, \, w_n[/latex] on the diagonal:

$\begin{array}{l} W = [\begin{array}{cccc} w_{1} & 0 & \dots & 0 \\ 0 & w_{2} & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & w_{n} \end{array}] . \end{array}$

In this case, the normal equations can be written in matrix form as

$\begin{array}{l} X^{'} W X β = X^{'} W Y . \end{array}$

Pre-multiplying both sides of this equation by [latex]\left( {\bf X} ^ \prime {\bf W} {\bf X} \right) ^ {-1}[/latex] gives the least squares estimators for the regression parameters in matrix form as

$\begin{array}{l} \hat{β} = {(X^{'} W X)}^{- 1} X^{'} W Y . \end{array}$

As before, the fitted values can also be written in matrix form as

$\begin{array}{l} \hat{Y} = X \hat{β} \end{array}$

or

$\begin{array}{l} \hat{Y} = X {(X^{'} W X)}^{- 1} X^{'} W Y . \end{array}$

The residuals [latex]e_i = Y_i - \hat{Y}_i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], can also be written in matrix form as

$\begin{array}{l} e & = Y - \hat{Y} \\ = Y - X \hat{β} \\ = Y - X {(X^{'} W X)}^{- 1} X^{'} W Y \\ = (I - X {(X^{'} W X)}^{- 1} X^{'} W) Y, \end{array}$

where [latex]{\bf e}[/latex] is the column vector of residuals [latex]{\bf e} = (e_1, \, e_2, \, \ldots , \, e_n) ^ \prime[/latex]. These matrix results are summarized in the following theorem.

The algebraic approach, matrix approach, and R approach to weighted least squares problem will be illustrated in the next example. Establishing the weights [latex]w_1, \, w_2, \, \ldots, \, w_n[/latex] can be a nontrivial problem, and differs depending on the setting in which the weighted regression model is employed.

Example 3.11 In reliability, current status data is generated by testing a randomly selected group of items with varying ages from a population at a particular fixed time in order to determine whether or not each item has failed or is operating at its particular age. Items were selected at ages 100, 200, 300, and 400 hours to see if they are operating. In this case, the independent variable X is the age, measured in hours, at which an item is tested. Each item tested is deemed to be either operating or failed. Table 3.7 contains the results of the test. Notice that 100 items were tested at ages [latex]X_1 = 100[/latex] and [latex]X_2 = 200[/latex], but only [latex]10[/latex] items were tested at ages [latex]X_3 = 300[/latex] and [latex]X_4 = 400[/latex]. The dependent variable in this setting is the fraction of items that survive to a particular age. The sample size at each testing age is denoted by n_i, [latex]i = 1, \, 2, \, 3, \, 4[/latex]. So a total of [latex]n_1 + n_2 + n_3 + n_4 = 220[/latex] items were tested. The number of items that are operating at each testing age is denoted by S_i, [latex]i = 1, \, 2, \, 3, \, 4[/latex]. The fraction of items that are operating at each testing age, which is the dependent variable in the regression, is denoted by Y_i, [latex]i = 1, \, 2, \, 3, \, 4[/latex]. Notice that the fraction surviving is not necessarily decreasing from one time to the next because of random sampling variability. The small sample sizes at times [latex]X_3 = 300[/latex] and [latex]X_4 = 400[/latex] magnify this problem with the data set. The goal here is to establish a regression function that will adequately smooth the data values in order to estimate the survivor function for the items at any time.

Table 3.7: Current status data test results.
Time (hours)	$X_{1} = 100$	$X_{2} = 200$	$X_{3} = 300$	$X_{4} = 400$
Sample size	$n_{1} = 100$	$n_{2} = 100$	$n_{3} = 10$	$n_{4} = 10$
Number surviving	$S_{1} = 50$	$S_{2} = 25$	$S_{3} = 4$	$S_{4} = 3$
Fraction surviving	$Y_{1} = 0.5$	$Y_{2} = 0.25$	$Y_{3} = 0.4$	$Y_{4} = 0.3$

Assume for now that the standard (non-weighted) least squares approach using the [latex]n = 4[/latex] data pairs

$\begin{array}{l} (100, 0.5), (200, 0.25), (300, 0.4), and (400, 0.3) \end{array}$

is taken to this problem. The R code below fits the simple linear regression model to the data.

The regression line in this case has intercept [latex]\hat \beta_0 = 0.475[/latex] and slope [latex]\hat \beta_1 = -0.00045[/latex]. The survival probability of a brand-new item is estimated to be 0.475, and the survival probability decreases by 0.00045 for every hour that passes. The unimpressive survival probability of 0.475 for a new item is outside of the scope of the simple linear regression model, so its interpretation is not meaningful.

But using the standard simple linear regression approach is not appropriate here. The first two data pairs, both of which involved testing 100 items, should be weighted more heavily that the last two data pairs, which only involved testing 10 items. Determining the appropriate weights, however, is nontrivial.

Assume that the test results for each item are mutually independent Bernoulli trials. The number of items that survive a test at one particular time (that is, S_i using the notation from Table 3.7) is a binomial random variable with parameters n_i and p_i, where p_i is the population probability that item i is operating at time X_i. The population variance of the dependent variable [latex]Y_i = S_i / n_i[/latex] is

$\begin{array}{l} V [{\hat{p}}_{i}] = V [Y_{i}] = V [\frac{S_{i}}{n_{i}}] = \frac{1}{n_{i}^{2}} V [S_{i}] = \frac{n_{i} p_{i} (1 - p_{i})}{n_{i}^{2}} = \frac{p_{i} (1 - p_{i})}{n_{i}}, \end{array}$

for [latex]i = 1, \, 2, \, 3, \, 4[/latex]. Using the point estimate for p_i on the right-hand side of this expression results in the following estimated variances for the four dependent variables:

$\begin{array}{l} \hat{V} [Y_{1}] = \frac{\frac{50}{100} (1 - \frac{50}{100})}{100} = \frac{1}{400}, \hat{V} [Y_{2}] = \frac{\frac{25}{100} (1 - \frac{25}{100})}{100} = \frac{3}{1600}, \end{array}$

$\begin{array}{l} \hat{V} [Y_{3}] = \frac{\frac{4}{10} (1 - \frac{4}{10})}{10} = \frac{24}{1000}, \hat{V} [Y_{4}] = \frac{\frac{3}{10} (1 - \frac{3}{10})}{10} = \frac{21}{1000} . \end{array}$

Not surprisingly, the first two variance estimates are about an order of magnitude smaller than the second two variance estimates because of the differences in the sample sizes. This approach will have problems if one of the testing times has all successes ([latex]S_i = n_i[/latex]) or all failures ([latex]S_i = 0[/latex]).

Since the weights w_i appear in the denominator of the expression [latex]V \left[ \epsilon_i \right] = \sigma _ Z ^ {\, 2} / w_i[/latex], the reciprocals of these variance estimates will be used as the weights in the weighted least squares regression:

$\begin{array}{l} w_{1} = \frac{400}{1}, w_{2} = \frac{1600}{3}, w_{3} = \frac{1000}{24}, w_{4} = \frac{1000}{21} . \end{array}$

The regression coefficients will be calculated in three ways, all of which yield identical results: the algebraic approach, the matrix approach, and using the lm function.

First, the algebraic approach for calculating the slope and intercept of the regression line using weighted least squares uses the following R statements. These are an implementation of Theorem 3.3.

The weighted mean of the X values is

$\begin{array}{l} {\bar{X}}_{w} = \frac{\sum_{i = 1}^{n} w_{i} X_{i}}{\sum_{i = 1}^{n} w_{i}} = 174.2 725. \end{array}$

Notice that this is slightly lower than the unweighted mean of the x values, which is [latex](100 + 200 + 300 + 400) / 4 = 250[/latex] hours. This is due to the larger sample sizes at testing times 100 and 200, resulting in larger weights for these values. The weighted mean of the Y values is

$\begin{array}{l} {\bar{Y}}_{w} = \frac{\sum_{i = 1}^{n} w_{i} Y_{i}}{\sum_{i = 1}^{n} w_{i}} = 0.3562 . \end{array}$

The estimates for the slope and intercept of the regression line for weighted least squares is

$\begin{array}{l} {\hat{β}}_{1} = \frac{\sum_{i = 1}^{n} w_{i} (X_{i} - {\bar{X}}_{w}) (Y_{i} - {\bar{Y}}_{w})}{\sum_{i = 1}^{n} w_{i} (X_{i} - {\bar{X}}_{w})^{2}} = - 0.001081 \end{array}$

and

$\begin{array}{l} {\hat{β}}_{0} = {\bar{Y}}_{w} - {\hat{β}}_{1} {\bar{X}}_{w} = 0.5447 . \end{array}$

The interpretation of these estimates is that the estimated probability of survival at time 0 is 0.5447 and the probability of survival decreases by 0.001081 with every hour that passes.

Second, using the matrix approach, the [latex]{\bf X}[/latex], [latex]{\bf Y}[/latex], and [latex]{\bf W}[/latex] matrices associated with this data set are

$\begin{array}{l} X = [\begin{array}{cc} 1 & 100 \\ 1 & 200 \\ 1 & 300 \\ 1 & 400 \end{array}], Y = [\begin{array}{c} 0.50 \\ 0.25 \\ 0.40 \\ 0.30 \end{array}], and W = [\begin{array}{cccc} \frac{400}{1} & 0 & 0 & 0 \\ 0 & \frac{1600}{3} & 0 & 0 \\ 0 & 0 & \frac{1000}{24} & 0 \\ 0 & 0 & 0 & \frac{1000}{21} \end{array}] . \end{array}$

The R code below uses the matrix approach to simple linear regression with weights to calculate the estimated slope [latex]\hat \beta_0[/latex] and intercept [latex]\hat \beta_1[/latex], the fitted values [latex]\bf \hat{Y}[/latex], and the residuals [latex]{\bf e}[/latex] for the current status data set using Theorem 3.4. The R solve function is used to compute the inverse of [latex]{\bf X} ^ \prime {\bf X}[/latex].

The results of these calculations are given below. The point estimators of the slope and intercept are

$\begin{array}{l} \hat{β} = {(X^{'} W X)}^{- 1} X^{'} W Y = [\begin{array}{c} 0.5447 \\ - 0.001081 \end{array}] . \end{array}$

The fitted values are

$\begin{array}{l} \hat{Y} = X \hat{β} = [\begin{array}{c} 0.43 65 \\ 0.3284 \\ 0.2203 \\ 0.1121 \end{array}] . \end{array}$

The residuals are

$\begin{array}{l} e = Y - \hat{Y} = [\begin{array}{c} 0.0635 \\ - 0.0784 \\ 0.1797 \\ 0.1879 \end{array}] . \end{array}$

Third, the built-in function lm can be used for weighted least squares by using the weights argument. The R code below calculates the estimates of the regression coefficients, the fitted values, and the residuals.

The three approaches all yield the same results. The regression line associated with ordinary least squares and weighted least squares can be compared graphically. The R code below plots the four data pairs and the associated ordinary least squares and weighted least squares regression lines.

Figure 3.20 contains the resulting plot, which shows the ordinary least squares line with equal weighting to the four data values and the weighted least squares line with much more weight to the first two data pairs and much less weight to the last two data pairs. Extra circles have been added to the two data pairs associated with the larger sample sizes with larger weights in Figure 3.20. The effect of the larger weights on the first two data pairs is apparent in the weighted least squares regression line. The rightmost two data pairs exert significantly less tug on the weighted least squares regression line because of their smaller weights.

Figure 3.20: Current status data ordinary and weighted least squares fits.

Long Description for Figure 3.20

The horizontal axis X ranges between 0 and 400 in increments of 100. The vertical axis Y ranges between 0 and 1.0 in increments of 0.2. Data points are plotted at (100, 0.5); (200, 0.3); (300, 0.5); (400, 0.4). A diagonal line originating at (0, 0.5) passes through the point 400, 0.4 and has a negative slope. It is labeled ordinary least square. Another diagonal line originating at( 0, 0.58) at slope down steeply and is labeled weighted least squares. The ordinary least squares line and weighted least squares line intersect at (100, 0.48). All data are approximate.

Using simple linear regression in the previous example, either weighted or unweighted, might not be the best approach. The dependent variable Y is the probability that an item of age X is functioning. This dependent variable must lie between 0 and 1, but the regression line could potentially fall outside of that range within the scope of the model. Two potential remedies are given in the next two sections: using a regression model with nonlinear terms such as X² or X³, or a survivor function of a lifetime model rather than a line, or a nonlinear model known as a logistic regression model, whose dependent variable necessarily lies between 0 and 1.

3.7 Regression Models with Nonlinear Terms

Regression models with nonlinear terms arise frequently in regression modeling. One simple example is polynomial regression. A quadratic regression model, for example, is

$\begin{array}{l} Y = β_{0} + β_{1} X + β_{2} X^{2} + ϵ, \end{array}$

where β₀, β₁, and β₂ are the regression coefficients, and ϵ is a white noise term. This model is still linear in β₀, β₁, and β₂. One way to think about this model is to consider X and X² to be the [latex]p = 2[/latex] independent variables in a multiple regression model. The next example fits a quadratic model to the data pairs in which the independent variable X is the speed of an automobile and the dependent variable Y is its stopping distance.

Example 3.12 Consider the [latex]n = 50[/latex] data pairs from Example 2.8 which give the speed (in miles per hour) as X and the stopping distance (in feet) as Y. These data pairs are built into the base R language in the data frame named cars, where the speed column contains the values of X and the dist column contains the values of Y. Fit a quadratic regression model forced through the origin to the data pairs.

Since the quadratic regression model is being forced through the origin in order to account for the fact that stationary cars ([latex]X = 0[/latex]) do not require any distance ([latex]Y = 0[/latex]) to stop, the quadratic regression model is

$\begin{array}{l} Y = β_{1} X + β_{2} X^{2} + ϵ, \end{array}$

where [latex]\epsilon \sim WN \left( 0, \, \sigma _ Z ^ {\, 2} \right)[/latex]. R is capable of fitting nonlinear models to data. The I (inhibit interpretation) function allows the modeling of some function of a particular independent variable. For the data pairs in the cars data frame, a quadratic regression model that is forced through the origin can be fit with lm function.

The -1 part of the formula forces the regression function to pass through the origin. The output generated by the summary(fit) statement is given below.

The fitted quadratic regression model that is forced to pass through the origin is

$\begin{array}{l} Y = 1.24 X + 0.0901 X^{2}, \end{array}$

where X is speed and Y is stopping distance. Notice that [latex]\hat \beta_2 = 0.0901 > 0[/latex], which means that a graph of the fitted regression function—a parabola that passes through the origin—is concave up. Since the p-value associated with the linear term is [latex]p = 0.032[/latex] and the p-value associated with the quadratic term is [latex]p = 0.0036[/latex] both of the regression coefficients are statistically significant. The R commands

plot the fitted model over the scatterplot. This graph appears in Figure 3.21.

Figure 3.21: Scatterplot and quadratic fit of speed X and stopping distance Y.

Long Description for Figure 3.21

The horizontal axis X measuring speed ranges from 0 to 25 in increments of 5 units. The vertical axis Y measuring stopping distance ranges from 0 to 120 in increments of 40 units. Multiple data points are plotted on the X Y plane. A quadratic regression line starting from origin follows an increasing trend and passes through points (10, 20); (20, 60); (25, 80). 27 data points fall below the regression line. 19 data points fall above the regression line. 4 data points fall on the regression line.

How do we compare the simple linear regression model to the quadratic regression model forced through the origin? Both have two parameters, but which one of the models is a better approximation to the data pairs? One way to compare the two models is with the sum of squared residuals for each of the models, which are computed with the R commands

The simple linear regression model has a sum of squared residuals of [latex]S = \text{11,353.52}[/latex], and the quadratic regression model forced through the origin has a sum of squared residuals of [latex]S = \text{10,831.12}[/latex]. Using the quadratic regression model forced through the origin reduces the sum of squared residuals by 522.4. Higher-order polynomials can be fit using the lm function in a similar manner. As was the case in multiple regression, adding more terms generally results in a reduction in the sum of squared residuals.

Nonlinear regression modeling is not limited to just polynomial regression models. The next two examples fit the same data set concerning the national debt in the United States between 1970 and 2020 to a nonlinear regression model using two fundamentally different approaches. The first approach is to transform the nonlinear regression model to a linear regression model and then apply the standard techniques for parameter estimation to the transformed model. The second approach is to use numerical methods to minimize the sum of squares in the usual least squares fashion described previously.

Example 3.13 The national debt of the United States, in trillions of dollars, between 1970 and 2020 is given in Table 3.8. These values are not adjusted for inflation. Fit an exponential regression model to the national debt of the United States, where X is the year and Y is the debt, by transforming an exponential regression model to a linear model.

Table 3.8: United States national debt, 1970–2020.
Year	Debt
1970	0.37
1975	0.53
1980	0.91
1985	1.82
1990	3.23
1995	4.97
2000	5.67
2005	7.93
2010	13.56
2015	18.15
2020	27.75

Figure 3.22: Scatterplot of the year X and the national debt Y.

Long Description for Figure 3.22

The horizontal axis X lists the years from 1970 to 2020 in increments of 10 units. The vertical axis Y measuring debt ranges from 0 to 30 in increments of 5 units. The data points plotted are as follows. (1970,0); (1975, 0); (1980, 1); (1990, 3); (1995, 5); (2000, 5); (2005, 7); (2010, 15); (2015, 20); (2020, 27).

The scatterplot in Figure 3.22 shows that a simple linear regression model is not appropriate for these data pairs. A regression model that reflects the exponential growth rate in the debt is warranted. Both savings and debt tend to grow exponentially, so an exponential regression model is a reasonable initial model to investigate. Consider fitting the regression model

$\begin{array}{l} Y = e^{β_{0} + β_{1} X + ϵ} \end{array}$

to the data set, where X is the year, Y is the debt, ϵ is an error term, and β₀ and β₁ are unknown regression parameters to be estimated from the data pairs. This model can be transformed to a linear model by taking the natural logarithm of both sides of the model:

$\begin{array}{l} \ln Y = β_{0} + β_{1} X + ϵ . \end{array}$

This model is now in the form of a simple linear regression with independent variable X and dependent variable [latex]\ln Y[/latex]. The intercept of the fitted model is β₀ and the slope of the fitted model is β₁. So a graph that contains X on the horizontal axis and [latex]\ln Y[/latex] on the vertical axis should be approximately linear if this transformation approach is appropriate. Such a graph is given in Figure 3.23, which is much closer to linear than the raw data points. It is apparent that some work on debt reduction occurred in the late 1990s, resulting in a slight bit of nonlinearity. We will proceed with fitting the transformed model. The R code below follows a similar pattern to the earlier examples, but this time the formula used in the call to the lm function is log(debt) year. The curve function is used to add the fitted regression function to the scatterplot.

The fitted model is displayed in Figure 3.24. The values of the estimated parameters are [latex]\hat \beta_0 = -170.4[/latex] and [latex]\hat \beta_1 = 0.08606[/latex].

Figure 3.23: Scatterplot of the year X and the logarithm of the national debt Y.

Long Description for Figure 3.23

The horizontal axis X lists the years from 1970 to 2020 in increments of 10 units. The vertical axis Y measuring logarithm of debt ranges from negative 1 to 4 in increments of 1 unit. The data points plotted are as follows. (1970, negative 1);( 1980, 0); (1990, 1); (2000, 1.5);( 2010, 2.5); (2015, 3); (2020, 3.5).

Figure 3.24: Scatterplot and exponential fit of year X and debt Y.

Long Description for Figure 3.24

The horizontal axis X lists the years from 1970 to 2020 in increments of 10 units. The vertical axis Y measuring debt ranges from 0 to 30 in increments of 5 units. The data points plotted are as follows. (1970, 0); (1975, 0); (1980, 1); (1990, 3); (1995, 4); (2000, 5); (2005, 7); (2010, 15); (2015, 17); (2020, 27).

There is a second approach to fitting an exponential regression model to the national debt data pairs that follows the standard approach to least squares estimation, which is given next.

Example 3.14 Fit an additive exponential regression model to the United States national debt data pairs from Example 3.13.

The second approach to fitting an exponential regression model to the debt data pairs is to use the additive model

$\begin{array}{l} Y = e^{β_{0} + β_{1} X} + ϵ . \end{array}$

Using the traditional least squares approach, the sum of squares

$\begin{array}{l} S = \sum_{i = 1}^{n} {(Y_{i} - e^{β_{0} + β_{1} X_{i}})}^{2} \end{array}$

is minimized with respect to β₀ and β₁, yielding the associated least squares estimators [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex]. The estimators cannot be expressed in closed form, so numerical methods must be used to estimate β₀ and β₁. A nonlinear least squares R function named nls can be used to estimate the parameters. Here is a first attempt at fitting the model.

This code returns an error message indicating that the nls function was unable to estimate the parameters. What went wrong? The way that the model has been formulated, the parameter [latex]e ^ {\kern 0.04em \beta_0}[/latex] represents the United States national debt in the year 0. This is why we had the parameter estimate [latex]e ^ {\kern 0.04em \hat \beta_0} = e ^ {-170.4} = 10 ^ {-74}[/latex] from the transformation approach in Example 3.13. The nls function attempts to do a search over all values of β₀ and β₁ to minimize the sum of squares. Finding the value of [latex]\hat \beta_0[/latex] is like finding a needle in a haystack. We need to give the nls function some help. We will give nls some starting values in a list named start to make the internal search performed by the nls function easier. The initial values for [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] will be the estimates for β₀ and β₁ from the transformation approach from the previous example.

The estimated parameters are [latex]\hat \beta_0 = -151.7[/latex] and [latex]\hat \beta_1 = 0.07676[/latex]. Thus, the fitted nonlinear regression model is

$\begin{array}{l} E [Y] = e^{{\hat{β}}_{0} + {\hat{β}}_{1} X} . \end{array}$

The fitted exponential regression model is displayed in Figure 3.25. The two different exponential regression models can be compared by computing the sums of squares for the two models, which can be computed by the additional R command

Figure 3.25: Scatterplot and exponential fit of year X and debt Y.

Long Description for Figure 3.25

The horizontal axis X lists the years from 1970 to 2020 in increments of 10 units. The vertical axis Y measuring debt ranges from 0 to 30 in increments of 5 units. The data points plotted are as follows. (1970,0); (1975, 0); (1980, 1); (1990, 3); (1995, 5); (2000, 5); (2005, 7); (2010, 15); (2015, 17); (2020, 27). The quadratic regression line originating from points (1970, 0) has a positive quadratic trend. 10 data points fall on the regression line except point (1995, 5), which is slightly above the regression line.

The sum of squares for fitting the exponential regression model using the transformation technique is 22.7 and the sum of squares for the nonlinear least squares is 3.1. So consistent with Figures 3.24 and 3.25, the nonlinear least squares approach provides a better fit to the data pairs.

One drawback that emerged from the survival function estimation example from the previous section (involving current status data) is that fitting a regression line results in a survival probability that can be negative or greater than one when extrapolated outside of the range of the independent variable in the data pairs. In addition, the estimated probability of survival at time zero for both the ordinary simple linear regression model and the weighted simple linear regression model seemed low. Typically, a brand-new item is not defective. A nonlinear regression function is an attractive alternative model in this particular setting. The next example combines a nonlinear regression model and weighted least squares estimators to provide an improved regression model.

Example 3.15 Consider again the estimate of the probability of survival from the current status data given in Example 3.11. A simple nonlinear model that might be appropriate for the current status data set is to assume that the lifetime of the item under consideration follows the exponential(λ) distribution. The survivor function for an exponential random variable T with positive failure rate λ is

$\begin{array}{l} S (t) = P (T \geq t) = e^{- λ t} t > 0, \end{array}$

where t is the failure time in hours.

Fit this nonlinear regression model using ordinary least squares.
Fit this nonlinear regression model using weighted least squares.
Compare the two fitted regression models.

There are two ways to proceed with this regression problem. The first is to minimize the squared deviations
$\begin{array}{l} S = \sum_{i = 1}^{n} {(Y_{i} - e^{- λ X_{i}})}^{2} \end{array}$

with respect to λ to arrive at an appropriate regression parameter estimator. Equivalently, the least squares estimator of λ is

$\begin{array}{l} \hat{λ} = \underset{λ}{argmin} \sum_{i = 1}^{n} {(Y_{i} - e^{- λ X_{i}})}^{2} . \end{array}$

This is the usual least squares approach. The second is to perform algebraic manipulations to the model in order to “linearize” the model so that the theory associated with the simple linear regression model can be implemented. The second approach is considered here. Treating this as a regression problem with X as time and Y as the survival probability results in the multiplicative nonlinear regression model

$\begin{array}{l} Y = e^{- λ X} ϵ . \end{array}$

Taking the natural logarithm of both sides of this model results in

$\begin{array}{l} \ln Y = - λ X + ϵ \end{array}$

or

$\begin{array}{l} - \ln Y = λ X + ϵ . \end{array}$

(Notice that when the error distribution is symmetric, which is often the case, the last step is justified.) This can be thought of as a linear regression problem with X as the independent variable and [latex]-\ln Y[/latex] as the dependent variable. There is no intercept in this model, so it can be treated as forcing the regression line through the origin and the single regression parameter λ corresponds to the slope of the regression line.

The R code below uses unweighted least squares to estimate the slope λ using the algebraic approach that forces the regression line through the origin using the techniques from Section 3.1.

The R code using the matrix approach is identical to the algebraic approach in this case. Likewise, the regression parameter λ can be estimated using the lm function with the – 1 parameter to force the regression through the origin via the code below.

Using any of these approaches to estimating λ, the estimate for the failure rate is

$\begin{array}{l} \hat{λ} = 0.003677 \end{array}$

failures per hour.
For the current status data set, it is sensible to incorporate the weights that are based on the various sample sizes into the regression model. The algebraic and the matrix approach to the nonlinear weighted least squares model, which will be a regression model forced through the origin, have identical R code, which is given below.

The R code using the lm function to estimate the parameter λ is given below.

Regardless of which approach is taken, the least squares estimate for the failure rate is

$\begin{array}{l} \hat{λ} = 0.005721 \end{array}$

failures per hour, which is slightly higher than the estimated failure rate in the ordinary least squares approach.
The two approaches (ordinary least squares and weighted least squares) for the nonlinear regression model can be compared graphically by plotting the two estimated survivor functions associated with the two fitted models. The R code below generates that plot. The estimated failure rate in the case of ordinary nonlinear least squares is stored in lambda.ols. The estimated failure rate in the case of weighted nonlinear least squares is stored in lambda.wls.

Figure 3.26 contains the graph. The ordinary least squares fit with [latex]\hat \lambda = 0.003677[/latex] gives equal weight to the four data pairs; the weighted least squares fit with [latex]{\hat \lambda = 0.005721}[/latex] gives significantly more weight to the first two data pairs. The two data pairs with the larger sample sizes are again circled in the figure. The weighted least squares model indicates that there is a higher estimated failure rate when increased weight is placed on the first two data pairs.

Figure 3.26: Current status data ordinary and weighted least squares fits for the exponential model.

Long Description for Figure 3.26

The horizontal axis X ranges between 0 and 400 in increments of 100. The vertical axis Y ranges between 0 and 1.0 in increments of 0.2. Data points are plotted at (100, 0.5); (200, 0.3); (300, 0.5); (400, 0.4). A line originating at 1.0 at vertical axis curves downward, passes through the point (400, 0.3) and has a negative slope. It is the ordinary least square. Another line originating at 1.0 at the vertical axis, curves down, passes through (300,2) and (400, 0.1) and is labeled weighted least squares. The points (100, 0.5) and (200, 0.3) are circled, and lie to the left of the weighted least squares line. Points (300, 0.5) and (400, 0.4) fall to the right of the ordinary least squares line. All data are approximate.

3.8 Logistic Regression

Logistic regression is appropriate when the dependent variable Y can assume one of two values: zero and one. This is sometimes known as a binary or dichotomous response variable. For now, to keep the mathematics and interpretations simple, assume that there is a single predictor X. This is known as a simple logistic regression model, and is a special type of nonlinear regression model. Including multiple independent variables in a logistic regression model is a straightforward extension. For dichotomous data, instead of predicting 0 or 1, we predict the probability of getting a 1 [that is, [latex]P(Y = 1)[/latex]]. So we need a regression model that predicts values of the interval [latex][0, \, 1][/latex].

The following example will be used throughout this section to motivate the need for a special model to accommodate a binary dependent variable, and to illustrate the techniques for the estimation of the model parameters.

Example 3.16 As an example to motivate the application of simple logistic regression, consider the [latex]{n = 948}[/latex] field goal attempts in the National Football League during the 2003 season. Let the independent variable X be the length of the field goal attempt (in yards) and the dependent variable Y be the outcome (0 for failure and 1 for success). The scatterplot (without jittering for ties) of the data values is shown in Figure 3.27, along with the associated least squares regression line with estimated intercept [latex]\hat \beta_0 = 1.35[/latex] and slope [latex]\hat \beta_1 = -0.015[/latex]. The regression line is heading in the correct direction because longer field goals are less likely to be successful. Simple linear regression is clearly not an appropriate statistical model in this setting because it predicts probabilities outside of the interval [latex][0, \, 1][/latex]. Even if predictions greater than 1 are set to 1 and negative predictions are set to zero, the model predicts that all 20-yard field goal attempts will be successful, and, at the other extreme, it predicts that the probability of kicking an 85-yard field goal is 0.06. This is inconsistent with the fact that the longest field goal ever in the NFL was a 66-yard field goal by Justin Tucker of the Baltimore Ravens on September 26, 2021. Obviously we can build a better regression model.

Figure 3.27: Scatterplot of field goal outcomes vs. yards with regression line.

Long Description for Figure 3.27

The horizontal axis, X ranges from 15 to 65 in increments of 5. The vertical axis, Y ranges from 0 to 1. 35 data points are plotted between 20 and 62 along the X axis with a common Y value 0. 41 data points are plotted between 18 and 60 along the X axis with a common Y value 1. A linear line of regression originating at (21, 1) slopes toward the bottom right of the X Y plane, and has a negative slope.

One of the initial considerations in developing a statistical model for the outcome of a field goal as a function of the length of the field goal attempt is to find a function that will only assume values between 0 and 1. A diagram that gives some guidance with regard to this function is to batch the data into 5-year increments. So the bins are all field goals that fall in the ranges [latex]20 \pm 2, \, 25 \pm 2, \, \ldots , \, 60 \pm 2[/latex]. This window is long enough so that the random sampling variability associated with nearby attempts is damped considerably, and yet short enough so that outcome patterns as a function of yardage are still apparent. The R code below batches the independent variable into the 5-yard increments and plots the estimated probability of success for attempts in each batch at its midpoint. This estimated probability is just the fraction of successful field goals within a particular range. Furthermore, the area of each point plotted is proportional to the number of attempts in that particular bin. For example, there were 79 attempts in the first bin (18–22 yards) and only 4 attempts in the last bin (58–62 yards). The R code below reads a data set off of the web that contains the results of [latex]n = 948[/latex] NFL field goal attempts during 2003. The data consists of columns that give the length of the field goal attempt and the outcome, failure ([latex]Y = 0[/latex]) or success ([latex]Y = 1[/latex]). The R code rounds each length to the nearest 5 yards, and plots the midpoint of the rounded field goal lengths versus the estimated probability of success.

While the performance of NFL field goal kickers varies from one kicker to the next, these points give us an idea of what we would like for a smooth regression function in this setting.

The results are shown in Figure 3.28. It is clear that the estimated probability of making a field goal decreases as the length of the field goal attempt increases, as one would expect. There is a strong relationship between the length of the field goal attempt and the probability of success. Our goal is to fit a nonlinear regression function to the raw data values that smooths the random sampling variability and can be used for the purpose of prediction.

A scatter plot graph with data points for field goal outcome versus yards following a declining trend as the number of yards increases. — Figure 3.28: Field goal outcomes vs. yards in 5-yard increments.

Long Description for Figure 3.28

The horizontal axis ranges from 15 to 65 in increments of 5. The vertical axis ranges from 0 to 1.0 in increments of 0.2. The data points are plotted at (20,1); (25, 1); (30, 0.9); (35, 0.8); (40, 0.8); (45, 0.7); (50, 0.6), (55, 0.5). As the length of field increases, the success of goal decreases.

When the dependent variable only takes on the values zero and one, the usual mean response function for the simple linear regression model

$\begin{array}{l} Y = β_{0} + β_{1} X + ϵ \end{array}$

is

$\begin{array}{l} E [Y] = β_{0} + β_{1} X, \end{array}$

where [latex]E[Y][/latex] denotes the conditional expected value of Y given a particular setting of the independent variable X. This mean response function does not limit the values of Y to just zero and one. With normally distributed error terms, this model would allow for Y values which could be less than 0 or greater than 1.

In logistic regression, this type of curve, regardless of whether it begins near one and ends near zero or it begins near zero and ends near one, is known as a sigmoidal response function. A natural choice for the sigmoidal response function is a cumulative distribution function associated with a random variable, or its complement (the survivor function). Three popular probability distributions whose cumulative distribution functions are used in logistic regression are the standard logistic distribution (also commonly called the logit model), the standard normal distribution (also commonly called the probit model), and the standard extreme value distribution (also commonly called the complimentary log-log model). These are described in the next paragraph.

The standard logistic distribution has probability density function

$\begin{array}{l} f (x) = \frac{e^{x}}{{(1 + e^{x})}^{2}} - \infty < x < \infty \end{array}$

and cumulative distribution function

$\begin{array}{l} F (x) = \frac{e^{x}}{1 + e^{x}} - \infty < x < \infty . \end{array}$

The probability density function is symmetric about the population mean [latex]E[X] = 0[/latex] and has population variance [latex]V[X] = \pi ^ 2 / 3[/latex]. The standard normal distribution has probability density function

$\begin{array}{l} f (x) = \frac{1}{\sqrt{2 π}} e^{- x^{2} / 2} - \infty < x < \infty \end{array}$

and cumulative distribution function

$\begin{array}{l} F (x) = \int_{- \infty}^{x} f (w) d w - \infty < x < \infty . \end{array}$

The probability density function is also symmetric about the population mean [latex]E[X] = 0[/latex] and has population variance [latex]V[X] = 1[/latex]. The probability density function for the standard logistic distribution is similar in shape (that is, bell-shape) to that for the standard normal distribution, but has heavier tails. The symmetry of the probability density functions for the standard logistic distribution and the standard normal distribution limits the shape of the associated cumulative distribution function. A nonsymmetric distribution often provides a better fit. This leads to a search for a probability distribution with a nonsymmetric probability density function. One such probability distribution is the extreme value distribution. The standard extreme value distribution has probability density function

$\begin{array}{l} f (x) = e^{x - e^{x}} - \infty < x < \infty \end{array}$

and cumulative distribution function

$\begin{array}{l} F (x) = 1 - e^{- e^{x}} - \infty < x < \infty . \end{array}$

The population mean and the population variance are not mathematically tractable, but the numeric values, to ten digits, are

$\begin{array}{l} E [X] = - 0.5772156649 and V [X] = 1.644934067 . \end{array}$

The probability density function is not symmetric about the mean.

The R code below plots these three probability density functions on the same set of axes. The standard normal probability density function is taken directly from the formulas in the previous paragraph. The probability density functions for the standard logistic distribution and the standard extreme value distribution have been standardized (by subtracting their population mean and dividing by the population standard deviation) so that all three probability density functions can be viewed on an equal footing. The plot emphasizes the shape of the various probability density functions.

The results are displayed in Figure 3.29. All three probability distributions have support on the entire real number line [latex]-\infty < x < \infty[/latex], although the graph only includes the values within three standard deviation units from the population mean. As expected, the probability density functions for the standard normal distribution and the standardized version of the standard logistic distribution are symmetric and bell-shaped. The probability density function of the standardized version of the standard extreme value distribution is nonsymmetric. The R code below plots the cumulative distribution function associated with the standardized version of the standard logistic distribution.

A graph of the logistic and normal distribution and extreme value probability distribution. — Figure 3.29: Standardized logistic, normal, and extreme value probability density functions.

Long Description for Figure 3.29

The horizontal axis ranges from negative 3 to 3 in increments of 1. The vertical axis ranges from 0 to 0.5 in increments of 0.1. The normal distribution and logistics distributions are represented by bell shaped curves with mean 0 and same standard deviation. The curves originate at negative (3, 0), and fall back to( 3, 0). Normal distribution peaks at (0, 0.38) but logistic distribution peaks at (0, 0.45). The extreme value probability distribution is left skewed, with the mean at 0.5. It originates at negative (3, 0), peaks at (1, .46) and falls back to (3, 0).

The cumulative distribution function [latex]F(x) = P(X \le x)[/latex] is graphed in Figure 3.30. This cumulative distribution function is monotone increasing and satisfies [latex]\lim_{x \, \rightarrow \, - \infty} F(x) = 0[/latex] and [latex]\lim_{x \, \rightarrow \, \infty} F(x) = 1[/latex]. Notice that a plot of [latex]F(-x)[/latex] gives the complement of the cumulative distribution function. In other words, [latex]S(x) = 1 - F(x) = P(X \ge x)[/latex]. This function is monotone decreasing and satisfies [latex]\lim_{x \, \rightarrow \, - \infty} S(x) = 1[/latex] and [latex]\lim_{x \, \rightarrow \, \infty} S(x) = 0[/latex]. This function is known in survival analysis as the survivor function.

A standardized version of the standard logistic cumulative distribution function shows a S shaped curve. The horizontal axis ranges from negative 3 to 3 in increments of 1. The vertical axis ranges from 0 to 1.0 in increments of 0.2. A S-shaped curve passes through the points (negative 3, 0); (negative 2, 0.1); (negative 1, 0.1); (0, 4); (1, 0.8); (2, 1); (3,1). — Figure 3.30: Standardized version of the standard logistic cumulative distribution function.

Now that cumulative distribution functions and their complements have been identified as a reasonable way to estimate the probability of success for the field goal data, we would like to establish a mechanism for incorporating the value of the predictor X into the probability model. The emphasis here will be on using the cumulative distribution function for the logistic distribution, since that seems to be the most commonly used in logistic regression.

The usual form of the mean response function for simple linear regression is

$\begin{array}{l} E [Y] = β_{0} + β_{1} X . \end{array}$

But in the case of a binary outcome, the constraint

$\begin{array}{l} 0 \leq E [Y] \leq 1 \end{array}$

must be imposed. This is done naturally using the cumulative distribution functions and their complements for the various probability distributions described earlier. Let [latex]\pi (X)[/latex] be the mean response function for a regression model with a binary response. Using the cumulative distribution function for the logistic distribution, the mean response function is

$\begin{array}{l} π (X) = E [Y] = \frac{e^{β_{0} + β_{1} X}}{1 + e^{β_{0} + β_{1} X}} . \end{array}$

Since the random variable Y can only assume the values 0 and 1 for a particular value of X, it is a Bernoulli random variable with probability of success [latex]\pi(X)[/latex]. Since the expected value and the probability that a Bernoulli random variable assumes the value 1 are equal, the mean response function can also be expressed as

$\begin{array}{l} π (X) = P (Y = 1) = \frac{e^{β_{0} + β_{1} X}}{1 + e^{β_{0} + β_{1} X}}, \end{array}$

where [latex]P(Y = 1)[/latex] is the probability that the dependent variable Y equals 1 for a particular fixed setting of the independent variable X. The parameters β₀ and β₁ assume the following roles.

The sign of β₁ controls whether the mean response function is monotone increasing or decreasing. Table 3.9 shows the direction of the relationship associated with the sign of β₁. The statistical significance of the point estimator of β₁ depends on its magnitude.
The magnitude of β₁ controls the steepness of the mean response function, with larger magnitudes corresponding to steeper mean response functions.
The value of β₀ controls the location of the mean response function on the X-axis.

Table 3.9: Direction of monotonicity of [latex]\pi(X)[/latex].
Condition	[latex]\displaystyle{\lim_{X \, \rightarrow \, - \infty} \pi(X)}[/latex]	[latex]\displaystyle{\lim_{X \, \rightarrow \, \infty} \pi(X)}[/latex]
[latex]\beta_1 < 0[/latex]	1	0
[latex]\beta_1 > 0[/latex]	0	1
[latex]\beta_1 = 0[/latex]	[latex]e ^ {\beta_0} / \left( 1 + e ^ {\beta_0} \right)[/latex]	[latex]e ^ {\beta_0} / \left( 1 + e ^ {\beta_0} \right)[/latex]

A graph that illustrates the effect of varying values of β₁ for the fixed value of [latex]\beta_0 = 0[/latex] on the mean response function [latex]\pi(X)[/latex] is given in Figure 3.31. As expected, the mean response function [latex]\pi(X)[/latex] is monotone decreasing for [latex]\beta_1 < 1[/latex] and monotone increasing for [latex]\beta_1 > 1[/latex]. The mean response function is steeper as the magnitude of β₁ increases.

A graph depicts the effects of varying beta subscript 1 and fixed beta subscript 0. — Figure 3.31: Mean response functions for [latex]\beta_0 = 0[/latex] and various β₁ values.

Long Description for Figure 3.31

The horizontal axis ranges from negative 3 to 3 in increments of 1. The vertical axis ranges from 0 to 1.0 in increments of 0.2. Beta subscript 2 is a S-shaped curve that passes through the points (negative 3, 0); (negative 1, 0.1); (0, .5); (1, 0.8); (2, 1); (3,1). Beta subscript 1 equals 1 passes through points (negative 3, 0.02); (negative 1, 0.2); (0, 0.4); (1, 0.7); (3, 0.9). Beta subscript negative 1 is an inverse S-shaped curve that passes through the points (negative 3,1); (negative 1, 0.8); (0, 0.5); (1, 0.3); (3, 0).

A graph that illustrates the effect of varying values of β₀ for the fixed value of [latex]\beta_1 = 1[/latex] on the mean response function [latex]\pi(X)[/latex] is given in Figure 3.32. As expected, the mean response function [latex]\pi(X)[/latex] is monotone increasing in all cases because [latex]\beta_1 > 1[/latex]. The effect of varying β₀ is to shift the mean response functions horizontally. The rationale behind the horizontal shift can be seen by writing the mean response function with [latex]\beta_1 = 1[/latex] as

$\begin{array}{l} π (X) = \frac{e^{β_{0} + X}}{1 + e^{β_{0} + X}} . \end{array}$

A graph depicts the effects of varying beta subscript 0 values with a fixed beta subscript 1 values at 1. — Figure 3.32: Mean response functions for [latex]\beta_1 = 1[/latex] and various β₀ values.

Long Description for Figure 3.32

The horizontal axis ranges from negative 3 to 3 in increments of 1. The vertical axis ranges from 0 to 1.0 in increments of 0.2. When Beta subscript 0 equals 1, the curve passes through the points (negative 3, 0.17); (negative 1, 0.4); (0, .7); (1, 0.8); (3,1). When beta subscript 0 equals 0, the curve passes through the points (negative 3, 0.12); (negative 1, 0.2); (0, 0.4); (1, 0.6); (3, 0.9). When beta subscript 0 equals negative 1, the curve passes through the points (negative 3, 0); (negative 1, 0.1); (0, 0.2); (1, 0.4); (3, 0.8). All data are approximate.

So the effect of increasing β₀ in this case is to shift the mean response function to the right (for [latex]\beta_1 < 1[/latex]) or to the left (for [latex]\beta_1 > 1[/latex]) relative to the [latex]\pi(X)[/latex] curve associated with [latex]\beta_0 = 0[/latex].

To summarize, the sign of β₁ controls the direction of the monotonicity of [latex]\pi(X)[/latex], the magnitude of β₁ controls the steepness of [latex]\pi(X)[/latex], and β₀ controls the location of [latex]\pi(X)[/latex] along the X-axis.

We now consider the estimation of the parameters β₀ and β₁ from a data set consisting of the n data pairs [latex](X_1, \, Y_1)[/latex], [latex](X_2, \, Y_2)[/latex], …, [latex](X_n, \, Y_n)[/latex]. The first components [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] are real numbers and the second components [latex]Y_1, \, Y_2, \, \ldots, \, Y_n[/latex] assume only the values 0 and 1. Since

$\begin{array}{l} P (Y = 1) = π (X) and P (Y = 0) = 1 - π (X) \end{array}$

the contribution to the likelihood function of the data pair [latex](X_i, \, Y_i)[/latex] is

$\begin{array}{l} π (X_{i})^{Y_{i}} {[1 - π (X_{i})]}^{1 - Y_{i}} \end{array}$

for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. When [latex]Y_i = 0[/latex], the contribution to the likelihood function is [latex]1 - \pi(X_i)[/latex], which is [latex]P(Y_i = 0)[/latex], where [latex]P(Y_i = 0)[/latex] is the probability that [latex]Y_i = 0[/latex] for the particular setting of the independent variable at X_i. When [latex]Y_i = 1[/latex], the contribution to the likelihood function is [latex]\pi(X_i)[/latex], which is [latex]P(Y_i = 1)[/latex]. Since X_i is assumed to be observed without error, Y_i is a random binary response, and the responses are assumed to be mutually independent random variables, the likelihood function is

$\begin{array}{l} L (β_{0}, β_{1}) = \prod_{i = 1}^{n} π (X_{i})^{Y_{i}} {[1 - π (X_{i})]}^{1 - Y_{i}} . \end{array}$

The log likelihood function is

$\begin{array}{l} \ln L (β_{0}, β_{1}) = \sum_{i = 1}^{n} Y_{i} \ln [π (X_{i})] + (1 - Y_{i}) \ln [1 - π (X_{i})] . \end{array}$

This can be written in terms of β₀ and β₁ as

$\begin{array}{l} \ln L (β_{0}, β_{1}) = \sum_{i = 1}^{n} Y_{i} [β_{0} + β_{1} X_{i} - \ln (1 + e^{β_{0} + β_{1} X_{i}})] - (1 - Y_{i}) \ln (1 + e^{β_{0} + β_{1} X_{i}}) \end{array}$

or

$\begin{array}{l} \ln L (β_{0}, β_{1}) = \sum_{i = 1}^{n} Y_{i} (β_{0} + β_{1} X_{i}) - \ln (1 + e^{β_{0} + β_{1} X_{i}}) . \end{array}$

The likelihood function and the log likelihood function are maximized at the same values of β₀ and β₁ because the natural logarithm is a monotonic transformation. The score vector is comprised of the partial derivatives of the log likelihood function with respect to β₀ and β₁:

$\begin{array}{l} \frac{\partial \ln L (β_{0}, β_{1})}{\partial β_{0}} = \sum_{i = 1}^{n} (Y_{i} - \frac{e^{β_{0} + β_{1} X_{i}}}{1 + e^{β_{0} + β_{1} X_{i}}}) \end{array}$

and

$\begin{array}{l} \frac{\partial \ln L (β_{0}, β_{1})}{\partial β_{1}} = \sum_{i = 1}^{n} (X_{i} Y_{i} - \frac{X_{i} e^{β_{0} + β_{1} X_{i}}}{1 + e^{β_{0} + β_{1} X_{i}}}) . \end{array}$

When these two equations are equated to zero, there is no closed form solution for [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex], so numerical methods must be relied on to calculate these point estimates. The second derivatives of the log likelihood function after simplification are

$\begin{array}{l} \frac{\partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{0}^{2}} = - \sum_{i = 1}^{n} \frac{e^{β_{0} + β_{1} X_{i}}}{{(1 + e^{β_{0} + β_{1} X_{i}})}^{2}}, \end{array}$

$\begin{array}{l} \frac{\partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{0} \partial β_{1}} = - \sum_{i = 1}^{n} \frac{X_{i} e^{β_{0} + β_{1} X_{i}}}{{(1 + e^{β_{0} + β_{1} X_{i}})}^{2}}, \end{array}$

and

$\begin{array}{l} \frac{\partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{1}^{2}} = - \sum_{i = 1}^{n} \frac{X_{i}^{2} e^{β_{0} + β_{1} X_{i}}}{{(1 + e^{β_{0} + β_{1} X_{i}})}^{2}} . \end{array}$

The Fisher information matrix is the matrix of expected values of these partial derivatives:

$\begin{array}{l} I (β_{0}, β_{1}) = (\begin{array}{c} E [\frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{0}^{2}}] & E [\frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{0} β_{1}}] \\ E [\frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{1} β_{0}}] & E [\frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{1}^{2}}] \end{array}) . \end{array}$

The expected values in this matrix can be determined because they do not contain any random variables. Their values cannot be calculated, however, because the values of the parameters β₀ and β₁ are unknown. The observed information matrix

$\begin{array}{l} O ({\hat{β}}_{0}, {\hat{β}}_{1}) = {(\begin{array}{c} \frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{0}^{2}} & \frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{0} β_{1}} \\ \frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{1} β_{0}} & \frac{- \partial^{2} \ln L (β_{0}, β_{1})}{\partial β_{1}^{2}} \end{array})}_{β_{0} = {\hat{β}}_{0}, β_{1} = {\hat{β}}_{1}} \end{array}$

can be estimated from data values once the maximum likelihood estimators are computed. This matrix is the variance–covariance matrix of the score vector and its inverse is the asymptotic variance–covariance matrix of the maximum likelihood estimators. The square roots of the diagonal elements of this inverse matrix provide estimates of the standard errors of the maximum likelihood estimates.

The NFL field goal data set has a large sample size ([latex]n = 948[/latex]) and a strong statistical relationship between the length of the field goal attempt and the probability of success. The R code below again uses the optim function to calculate the parameter estimates. The first argument to optim are initial parameter estimates. The second argument to optim is the function to be minimized, so the negative of the log likelihood function is given as the second argument. Once the maximum likelihood estimates are calculated, the observed information matrix, standard errors, z-statistics, and associated p-values are calculated.

The results of the code are summarized in Table 3.10. The values of [latex]\hat \beta_0[/latex] and [latex]\hat \beta_1[/latex] are both statistically significant with p-values near zero. The observed information matrix for the NFL field goal data set

Table 3.10: Summary statistics for NFL field goal data.
i	${\hat{β}}_{i}$	${\hat{σ}}_{{\hat{β}}_{i}}$	z	p
0	5.69	0.451	12.6	0.00
1	$- 0.110$	0.0106	$- 10.4$	0.00

is

$\begin{array}{l} O ({\hat{β}}_{0}, {\hat{β}}_{1}) = (\begin{array}{c} 130.83 & 54 70.26 \\ 5470.26 & 237, 653.57 \end{array}) . \end{array}$

These values can be compared to the values obtained using the glm (generalized linear model) function:

The results match those given in Table 3.10. When the link parameter within the binomial family is set to logit, the cumulative distribution function (or its complement) for the standard logistic distribution is employed. When the link parameter is set to probit, the cumulative distribution function (or its complement) for the standard normal distribution is employed. The logit and probit choices force the sigmoidal function to be symmetric, so that it approaches 0 and 1 at the same rate. When the link parameter is set to cloglog, the cumulative distribution function (or its complement) for the standard extreme value distribution is employed. It approaches 0 and 1 at the different rates.

When the following R statements are added to the code that generated Figure 3.28, the fitted mean response function [latex]\hat \pi(X)[/latex] is added to the graph.

The graph is shown in Figure 3.33. The estimated mean response function is monotone decreasing because [latex]\hat \beta_1 < 0[/latex] Furthermore, the mean response curve does an adequate job of modeling the probability of success as the points lie very close to the estimated mean response function. The estimated mean response function can be used for prediction. The estimated probability that a 38-yard field goal attempt is successful is

$\begin{array}{l} \hat{π} (38) = \frac{e^{5.6942693 - 0.1098488 (38)}}{1 + e^{5.6942693 - 0.1098488 (38)}} = 0.82 . \end{array}$

A graph of the field goal outcome versus yards, and the estimated mean response of a monotone decreasing function. — Figure 3.33: Field goal outcomes and estimated mean response function.

Long Description for Figure 3.33

The horizontal axis ranges from 15 to 65 in increments of 5. The vertical axis ranges from 0 to 1.0 in increments of 0.2. The datapoints plotted are as follows. (20, 1); (25,1); (30, 0.9); (35, 0.8);( 40, 0.8); (45, 0.7); (50, 0.6); (55, 0.4). A curve passing through all of the plotted points expect (35, 0.8), follows a decreasing trend.

This value can be generated with the predict function in R with the additional statements

Some keystrokes can be saved by using the type = “response” argument in the call to predict.

The limitations of a symmetric mean response function also become apparent in this case. The estimated probability that a 71-yard field goal attempt is successful is

$\begin{array}{l} \hat{π} (71) = \frac{e^{5.6942693 - 0.1098488 (71)}}{1 + e^{5.6942693 - 0.1098488 (71)}} = 0.11, \end{array}$

even though the NFL field goal record from 2021 is 66 yards. This is clearly a case of extrapolating beyond the range of the data, which is discouraged. The meaningful range of [latex]\hat \pi (X)[/latex] is over the scope of the model [latex]18 \le X \le 62[/latex], whose endpoints are the shortest and longest field goal attempt during the 2003 season. The symmetric nature of the logistic distribution makes the [latex]\hat \pi (X)[/latex] values associated with X-values greater than 62 yards higher than are meaningful.

Confidence intervals for the parameters in a logistic regression model can be calculated with the confint and confint.default functions. These confidence intervals give a measure of the precision of the point estimates. The R code below calculates the 95% confidence intervals for the parameters using the confint and confint.default functions for the NFL field goal data.

The first set of confidence intervals that are returned via confint use the profiled log likelihood function to return the confidence intervals given in the output below. The default is a 95% confidence interval.

To three significant digits, these 95% confidence intervals are

$\begin{array}{l} 4.84 < β_{0} < 6.61 and - 0.131 < β_{1} < - 0.0897 . \end{array}$

The second set of confidence intervals that are returned via confint.default are based on the asymptotic normality of the maximum likelihood estimators. The call to confint.default returns the confidence intervals given in the output below.

To three significant digits, these 95% confidence intervals are

$\begin{array}{l} 4.82 < β_{0} < 6.58 and - 0.131 < β_{1} < - 0.0892 . \end{array}$

Alternatively, the 95% confidence interval for β₁ can be calculated by using the qnorm function to calculate the appropriate quantile from the standard normal distribution.

The 95% confidence interval for β₁ that is returned matches that returned by confint.default. The confidence intervals based on the asymptotic normality of the maximum likelihood estimator from confint.default will be symmetric about the maximum likelihood estimators, but the confidence interval based on the profiled log likelihood function from confint will not be symmetric about the maximum likelihood estimators. The confidence intervals given here are somewhat narrow because of the large sample size of [latex]n = 948[/latex] for the NFL field goal data.

The last topic is the interpretation of the point estimators for the coefficients. This interpretation is much more difficult than the interpretation of the coefficients in a standard simple linear regression model. The next paragraph defines the odds and the log odds. The subsequent paragraph relates the log odds to the logistic regression model.

Consider an event which occurs with probability 0.9. The probability that the event will not occur is 0.1. The odds are defined as the ratio of the probability that the event will occur to the probability that the event will not occur. In this case that ratio is 9, so the odds are often referred to as 9 to 1. Table 3.11 gives several probability values and associated odds for several probability values.

Table 3.11: Probability and odds.
Probability	Odds
0.2	0.25
0.5	1
0.6	1.5
0.75	3
0.8	4
0.9	9
0.99	99

The R code below generates a graph of the odds on the vertical axis versus the probability on the horizontal axis.

Figure 3.34 shows the transformation from probability to odds, which reveals a monotone increasing function. Probabilities fall on the interval [latex][0, \, 1][/latex]; odds fall on the interval [latex][0, \, \infty )[/latex]. The natural logarithm of the odds is the function, known as the log odds, which is a transformation of the probability p in the following fashion:

A graph of the relationship between probability and odds; depicted as a monotone, increasing function. — Figure 3.34: Odds versus probability.

Long Description for Figure 3.34

The horizontal axis measuring probability, ranges from 0 to 1 in increments of 0.2. The vertical axis measuring odds ranges from 0 to 9 in increments of 1. A curve following an increasing trend passes through the points (0.0, 0); (0.2, 0.5); (0.4, 1); (0.6, 1.5); (0.8, 2.5); and (0.9, 9).

$\begin{array}{l} \ln (\frac{p}{1 - p}) . \end{array}$

This is a transformation from [latex][0, \, 1][/latex] to [latex]( - \infty, \, \infty )[/latex]. Table 3.12 extends the previous table by including a column for the log odds. Notice that a probability of [latex]1 / 2[/latex] corresponds to a log odds of 0 and the symmetry of the log odds associated with the probabilities 0.2 and 0.8. The R code below graphs the log odds versus the probability.

Table 3.12: Probability, odds, and log odds.
Probability	Odds	Log Odds
0.2	0.25	$- 1.3863$
0.5	1	0
0.6	1.5	0.4055
0.75	3	1.0986
0.8	4	1.3863
0.9	9	2.1972
0.99	99	4.5951

The associated graph is shown in Figure 3.35. The shape of the log odds is a transformed version of the mean response functions seen earlier. The purpose of defining the log odds is to convert from probability, which has a restricted range between 0 and 1, and the log odds, which has an unrestricted range.

A graph of the relationship between probability and log odds. — Figure 3.35: Log odds versus probability.

Long Description for Figure 3.35

The horizontal axis measuring probability, ranges from 0 to 1 in increments of 0.2. The vertical axis measuring log odds ranges from negative 3 to 3 in increments of 1. A curve passes through the points as follows. (0.0, negative 3); (0.2, negative 1.5); (0.4, negative 0.5); (0.6, .5); (0.8, 1.5); (1.0, 3). The curve increases until 0.4, inflects at 0.5, and increases steadily to the right of the quadrant.

Now back to logistic regression and the interpretation of the estimated coefficients. Recall that for a simple logistic regression problem, the mean response function is

$\begin{array}{l} π (x) = E [Y | X = x] = P (Y = 1 | X = x) = \frac{e^{β_{0} + β_{1} x}}{1 + e^{β_{0} + β_{1} x}}, \end{array}$

where x is the independent variable and Y is the response variable. The logit transformation of [latex]\pi(x)[/latex] is

$\begin{array}{l} \ln [\frac{π (x)}{1 - π (x)}] = \ln [e^{β_{0} + β_{1} x}] = β_{0} + β_{1} x . \end{array}$

Since [latex]\pi(x)[/latex] is a probability, the expression on the left-hand side of this equation is a log odds.

Now consider the NFL data. From the earlier work, the estimated intercept provided by the R glm function is [latex]\hat \beta_0 = 5.6979[/latex] and the estimated coefficient associated with the length of the field goal attempt in yards is [latex]\hat \beta_1 = -0.1099[/latex]. The estimated intercept is the log odds of a kicker making a field goal from a (theoretical) zero yards, which has no meaningful interpretation in this setting. The value of [latex]\hat \beta_1 = -0.1099[/latex] is the change in the log odds for a one-yard change in the length of the field goal attempt. Additionally, the quantity

$\begin{array}{l} e^{{\hat{β}}_{1}} = e^{- 0.1099} = 0.8959 \end{array}$

is the multiplier that gives the change in the odds for a one-unit change in the independent variable. We expect to see a 10.4% decrease in the odds associated with the probability of success for a field goal attempt for every additional yard added to the field goal attempt. This value and an associated 95% confidence interval can be generated with the additional R statement

The analysis of the NFL data given here is a composite of all kickers in the NFL during 2003. Individual kickers within the NFL will have their own logistic regression curve.

With this background concerning simple logistic regression in place, it is straightforward to extend this to more complicated modeling situations. Additional topics in logistic regression include constructing a confidence interval for a predicted value, the calculation of deviance residuals, including multiple independent variables in a logistic regression model, model assessment, and interpreting estimated coefficients for interaction terms.

3.9 Exercises

- 3.1 Write a paragraph that describes why the sum of squares for error associated with the simple linear regression model [latex]Y = \beta_0 + \beta_1 X + \epsilon[/latex] will always be less than or equal to the sum of squares for error associated with the simple linear regression model forced through the origin [latex]Y = \beta_1 X + \epsilon[/latex] for the same data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex].
- 3.2 Under what condition(s) does the regression line forced through the origin have the same sum of squares for error as the simple linear regression for the full model [latex]Y = \beta_0 + \beta_1 X + \epsilon[/latex] for the same data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex].
- 3.3 Consider the simple linear regression model forced through the origin
  
  $\begin{array}{l} Y = β_{1} X + ϵ . \end{array}$
  
  Show that the least squares estimator [latex]\hat \beta_1[/latex] is an unbiased estimator of β₁.
- 3.4 Consider the simple linear regression model forced through the origin
  
  $\begin{array}{l} Y = β_{1} X + ϵ . \end{array}$
  
  Find [latex]V \big[ \hat \beta_1 \big][/latex].
- 3.5 Consider the simple linear regression model forced through the origin with normal error terms,
  
  $\begin{array}{l} Y = β_{1} X + ϵ, \end{array}$
  
  where [latex]\epsilon \sim N\left( 0, \, \sigma ^ {\, 2} \right)[/latex].
  1. Find the maximum likelihood estimators of β₁ and σ².
  2. Show that the maximum likelihood estimators maximize the log likelihood function.
- 3.6 Give an example of [latex]n = 2[/latex] data pairs corresponding to the case in which a simple linear regression line forced through the origin contains the point [latex]\left( \bar X , \, \bar Y \right)[/latex].
- 3.7 Give an example of [latex]n = 2[/latex] data pairs corresponding to the case in which a simple linear regression line forced through the origin does not contain the point [latex]\left( \bar X , \, \bar Y \right)[/latex].
- 3.8 Consider the simple linear regression model forced through the origin with normal error terms
  
  $\begin{array}{l} Y = β_{1} X + ϵ, \end{array}$
  
  with known parameters β₁ and σ². Find an exact two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval for β₁ from n data pairs [latex]\left( X_1, \, Y_1 \right), \, \left( X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex].
- 3.9 Consider the simple linear regression model forced through the origin with normal error terms,
  
  $\begin{array}{l} Y = β_{1} X + ϵ, \end{array}$
  
  with unknown parameters β₁ and σ². Show that the R statement
  
  uses the formula
  
  $\begin{array}{l} {\hat{β}}_{1} - t_{n - 1, α / 2} \sqrt{\frac{S S E}{(n - 1) \sum_{i = 1}^{n} X_{i}^{2}}} < β_{1} < {\hat{β}}_{1} + t_{n - 1, α / 2} \sqrt{\frac{S S E}{(n - 1) \sum_{i = 1}^{n} X_{i}^{2}}} \end{array}$
  
  to calculate the 95% two-sided confidence interval for β₁ for the data pairs in the built-in R data frame Formaldehyde. Notice that the degrees of freedom are one more than the associated degrees of freedom for the full simple linear regression model.

3.10 Consider the simple linear regression model forced through the origin with normal error terms,

$\begin{array}{l} Y = β_{1} X + ϵ, \end{array}$

with unknown parameters β₁ and σ². Conduct a Monte Carlo simulation experiment to provide convincing numerical evidence that the two-sided [latex]{100(1 - \alpha)}\%[/latex] confidence interval

$\begin{array}{l} {\hat{β}}_{1} - t_{n - 1, α / 2} \sqrt{\frac{S S E}{(n - 1) \sum_{i = 1}^{n} X_{i}^{2}}} < β_{1} < {\hat{β}}_{1} + t_{n - 1, α / 2} \sqrt{\frac{S S E}{(n - 1) \sum_{i = 1}^{n} X_{i}^{2}}} \end{array}$

is an exact confidence interval for β₁ for the following parameter settings: [latex]n = 3[/latex], [latex]\alpha =0.05[/latex], [latex]\beta_1 = 2[/latex], [latex]X_1 = 1[/latex], [latex]X_2 = 2[/latex], [latex]X_3 = 3[/latex], and [latex]\sigma ^ {\, 2} = 1[/latex].
3.11 The Brown–Forsythe test can be used to determine whether the error terms have constant variance. In particular, it tests for equality of the variances of the error terms in two subsets of the data values. The test is analogous to a t-test. The test is robust with respect to departures from normality of the error terms. The data pairs are partitioned by a threshold value of X which is not one of the [latex]X_1, \, X_2, \, \ldots, \, X_n[/latex] values. Let n₁ be the number of data pairs with X-values less than the threshold value and n₂ be the number of data pairs with X-values greater than the threshold value so that [latex]n = n_1 + n_2[/latex]. In addition, let
- [latex]e_{i1}[/latex] be residual i for group 1,
- [latex]e_{i2}[/latex] be residual i for group 2,
- [latex]\tilde e_1[/latex] be the sample median of the group 1 residuals,
- [latex]\tilde e_2[/latex] be the sample median of the group 2 residuals,
- [latex]d_{i1} = |e_{i1} - \tilde e_1|[/latex],
- [latex]d_{i2} = |e_{i2} - \tilde e_2|[/latex],
- [latex]\bar d_{1} = (1 / n_1) \sum_{i\,=\,1}^{n_1} d_{i1}[/latex], and
- [latex]\bar d_{2} = (1 / n_2) \sum_{i\,=\,1}^{n_2} d_{i2}[/latex].
The test statistic for the Brown–Forsythe test is

$\begin{array}{l} t = \frac{{\bar{d}}_{1} - {\bar{d}}_{2}}{s \sqrt{1 / n_{1} + 1 / n_{2}}}, \end{array}$

where s² is the pooled sample variance

$\begin{array}{l} s^{2} = \frac{\sum_{i = 1}^{n_{1}} (d_{i 1} - {\bar{d}}_{1})^{2} + \sum_{i = 1}^{n_{2}} (d_{i 2} - {\bar{d}}_{2})^{2}}{n - 2} . \end{array}$

The test statistic is approximately [latex]t(n - 2)[/latex] when the population variances of the error terms in the two groups are equal n₁ and n₂ are large enough so that the dependency between the residuals is not too large. Write R code to compute the p-value for the Brown–Forsythe test for the cars data set using speed as the independent variable and dist as the dependent variable with a threshold value of 13.5 miles per hour.
3.12 Find the leverages for [latex]n = 2[/latex] data pairs in a simple linear regression model.
3.13 For a simple linear regression model with [latex]X_i = i[/latex], for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], derive a formula for the leverage of the ith data pair.
3.14 Write R functions named cooks.distance1, cooks.distance2, and cooks.distance3, which calculate the Cook's distances for each of the n data pairs associated with the simple linear regression model

$\begin{array}{l} Y = β_{0} + β_{1} X + ϵ \end{array}$

using the three formulas from Definition 3.3. The arguments for these three functions are the vector x, which contains the n values of the independent variable, and the vector y, which contains the n values of the dependent variable. Test your functions on the Formaldehyde data set which is built into R, with carb as the independent variable and optden as the dependent variable.
3.15 Make a scatterplot (with associated regression line) of the [latex]n = 11[/latex] data pairs in the third data set in Anscombe's quartet with the R commands

Without doing any calculations,
1. circle the point(s) with the largest leverage, and
2. circle the point(s) with the largest Cook's distance.
3.16 What is the smallest and largest possible leverage?
3.17 Show that leverage is scale invariant. In other words, show that the leverages remain unchanged when the scale of the independent variable changes (for example, from centimeters to meters).
3.18 Use Monte Carlo simulation to estimate the probability that all of the Cook's distances are less than 1 for a simple linear regression model with normal error terms and the following parameter settings: [latex]\beta_0 = 1[/latex], [latex]\beta_1 = 1 / 2[/latex], [latex]\sigma = 1[/latex], [latex]n = 10[/latex], and [latex]X_i = i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex]. Is this probability affected by changes is σ or n?
3.19 Use Monte Carlo simulation to draw empirical cumulative distribution functions of Cook's distances D₁, D₂, D₃, D₄, and D₅ for a simple linear regression model with the following parameter settings: [latex]\beta_0 = 1[/latex], [latex]\beta_1 = 1 / 2[/latex], [latex]\sigma = 1[/latex], [latex]n = 10[/latex], and [latex]X_i = i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex].
3.20 Consider a simple linear regression model with the independent variable X and the dependent variable Y having the same units (for example, centimeters). If the same linear transformation is applied to both X and Y so as to change their units (for example, from centimeters to meters), show that the Cook's distances remain unchanged.
3.21 Show that the row sums of the hat matrix are all equal to 1 for data pairs [latex]\left( X_1, \, Y_1 \right)[/latex], [latex]\left( X_2, \, Y_2 \right)[/latex], [latex]\ldots \,[/latex], [latex]\left( X_n, \, Y_n \right)[/latex] in a simple linear regression model.
3.22 Perform a Monte Carlo simulation to provide convincing numerical evidence that

$\begin{array}{l} \frac{(\hat{β} - β)^{'} X^{'} X (\hat{β} - β)}{2 \cdot MSE} \sim F (2, n - 2) \end{array}$

for a simple linear regression model with normal error terms of your choice. This result is used to establish a [latex]{100(1 - \alpha)}\%[/latex] confidence region for β₀ and β₁.
3.23 Show that the residuals [latex]e_i = Y_i - \hat{Y}_i[/latex] for [latex]i = 1, \, 2, \, \ldots, \, n[/latex], can be written in terms of the hat matrix [latex]{\bf H}[/latex] as

$\begin{array}{l} e = (I - H) Y . \end{array}$
3.24 For the simple linear regression model with normal error terms, the variance–covariance matrix of [latex]\hat{\pmb{\beta}}[/latex] is

$\begin{array}{l} σ^{2} {(X^{'} X)}^{- 1} . \end{array}$

For data pairs [latex]\left( X_1, \, Y_1 \right), \, \left(X_2, \, Y_2 \right), \, \ldots , \, \left( X_n, \, Y_n \right)[/latex], give an estimator for this matrix.
3.25 For the simple linear regression model, show that

$\begin{array}{l} X_{h}^{'} {(X^{'} X)}^{- 1} X_{h} = \frac{1}{n} + \frac{{(X_{h} - \bar{X})}^{2}}{S_{X X}} . \end{array}$
3.26 For a simple linear regression model, show that the matrix equation

$\begin{array}{l} X^{'} X \hat{β} = X^{'} Y, \end{array}$

where

$\begin{array}{l} X = [\begin{array}{cc} 1 & X_{1} \\ 1 & X_{2} \\ ⋮ & ⋮ \\ 1 & X_{n} \end{array}], Y = [\begin{array}{c} Y_{1} \\ Y_{2} \\ ⋮ \\ Y_{n} \end{array}], and \hat{β} = [\begin{array}{c} {\hat{β}}_{0} \\ {\hat{β}}_{1} \end{array}], \end{array}$

corresponds to the normal equations given in Theorem 1.1 as
[latex]\begin{align*} n \hat \beta_0 + \hat \beta_1 \sum_{i\,=\,1}^n X_i & = \sum_{i\,=\,1}^n Y_i \\ \hat \beta_0 \sum_{i\,=\,1}^n X_i + \hat \beta_1 \sum_{i\,=\,1}^n X_i ^ 2 & = \sum_{i\,=\,1}^n X_i Y_i. \end{align*}[/latex]
3.27 A multiple linear regression model is used to determine the relationship between the sales price of a home Y as a function of the two predictor variables: X₁, the number of square feet in the home, and X₂, the distance from downtown in miles. The fitted model is

$\begin{array}{l} Y = 170, 024 + 133 X_{1} - 14, 123 X_{2} . \end{array}$

One home sells for $314,159. Find the predicted sales price for a second home, which is the same size as the first but is ten miles further away from downtown that the first home.
3.28 The R built-in data frame named swiss contains a standardized fertility measure and five socio-economic indicators for 47 French-speaking provinces in Switzerland from about 1888.
1. Using a forward stepwise regression with threshold [latex]\alpha = 0.05[/latex], determine a multiple linear regression model with a dependent variable Y, the standardized fertility measure, and the five associated potential independent variables.
2. Using a backward stepwise regression with threshold [latex]\alpha = 0.05[/latex], determine a multiple linear regression model with a dependent variable Y, the standardized fertility measure, and the five associated potential independent variables.
3. For one of the two final multiple linear regression models determined in parts (a) and (b), test the statistical significance of all possible interaction terms.
3.29 Show that when the independent variables X₁ and X₂ in a multiple linear regression model are uncorrelated, the estimator for [latex]\hat \beta_1[/latex] is the same for both the simple linear regression model involving just X₁ and Y and the multiple linear regression model involving X₁, X₂, and Y.
3.30 Consider a simple linear regression model that uses the weighted least squares estimation. When all of the weights [latex]w_1, \, w_2, \, \ldots, \, w_n[/latex] are equal, show that the weighted least squares normal equations reduce to the associated unweighted least squares normal equations.

3.31 “I first believed I was dreaming …but it is absolutely certain and exact that the ratio which exists between the period times of any two planets is precisely the ratio of the 3/2th power of the mean distance” was the reaction of Johannes Kepler upon discovering the relationship

$\begin{array}{l} y = β x^{3 / 2} \end{array}$

as translated from Harmonies of the World by Kepler in 1619, where x is the distance between a planet and the sun and y is the period. Using the data from the Wikipedia webpage titled Kepler's Laws of Planetary Motion, the data values for the [latex]n = 8[/latex] planets are given below.

Planet	Semi-major axis (AU) x	Period (days) y
Mercury	0.38710	87.9693
Venus	0.72333	224.7008
Earth	1	365.2564
Mars	1.52366	686.9796
Jupiter	5.20336	4332.8201
Saturn	9.53707	10,775.599
Uranus	19.1913	30,687.153
Neptune	30.0690	60,190.03

The semi-major axes values are measured in Astronomical Units (AU).

Make an appropriate scatterplot to visually assess whether a regression model is appropriate.
Find the least squares point estimate for β.
Perhaps fit a least squares model in another fashion.
Interpret the value for [latex]\hat \beta[/latex].
Find a 95% confidence interval for β.

3.32 Fit the quadratic regression function forced through the origin

$\begin{array}{l} Y = β_{1} X^{2} + ϵ, \end{array}$

to the data pairs in the cars data frame in R, where X is the speed of the car in miles per hour and Y is the stopping distance in feet.
3.33 Using an extreme value distribution as a link function, fit a regression function to the 2003 NFL field goal data from Section 3.8 and use the fitted model to predict that probability of success on a 38-yard field goal attempt.

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Statistical Modeling: Regression, Survival Analysis, and Time Series Analysis Copyright © 2023 by Lawrence M. Leemis is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.

Chapter 3 Topics in Regression

3.1 Regression Through the Origin

3.2 Diagnostics

3.2.1 Leverage

3.2.2 Influential Points

3.3 Remedial Procedures

3.4 Matrix Approach to Simple Linear Regression

3.5 Multiple Linear Regression

3.5.1 Categorical Independent Variables

3.5.2 Interaction Terms

3.5.3 The ANOVA Table

3.5.4 Adjusted Coefficient of Determination

3.5.5 Multicollinearity

3.5.6 Model Selection

3.6 Weighted Least Squares

3.7 Regression Models with Nonlinear Terms

3.8 Logistic Regression

3.9 Exercises

License

Share This Book