Multiple Regression

Steps in Multiple Regression

Elements of a Multiple Regression Equation

Problems with Multiple Regression

Standardized Regression

Reality in the public sector is complex. Often there may be several possible causes associated with a problem; likewise, there may be several factors necessary for a solution. Complex statistical applications are needed that can:

- deal with interval and ratio level variables
- assess causal linkages
- forecast future outcomes

Ordinary least squares linear
regression is the most widely used type of regression for predicting the
value of one dependent variable from the value of one independent variable.
It is also widely used for predicting the value of one dependent variable
from the values of two or more independent variables. When there are two
or more independent variables, it is called multiple regression.

The steps in multiple regression are basically the same as in simple regression:

1. State the research hypothesis.
2. State the null hypothesis.
3. Gather the data.
4. Assess each variable separately first (obtain measures of central tendency and dispersion; frequency distributions; graphs); is the variable normally distributed?
5. Assess the relationship of each independent variable, one at a time, with the dependent variable (calculate the correlation coefficient; obtain a scatter plot); are the two variables linearly related?
6. Assess the relationships between all of the independent variables with each other (obtain a correlation coefficient matrix for all the independent variables); are the independent variables too highly correlated with one another?
7. Calculate the regression equation from the data.
8. Calculate and examine appropriate measures of association and tests of statistical significance for each coefficient and for the equation as a whole.
9. Accept or reject the null hypothesis.
10. Reject or accept the research hypothesis.
11. Explain the practical implications of the findings.
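Steps 4 through 8 can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not a prescribed procedure: the data are synthetic and invented for the example, and the variable names (`x1`, `x2`, `y`) are hypothetical.

```python
import numpy as np

# Synthetic data, invented purely for illustration (not from these notes).
rng = np.random.default_rng(42)
n = 60
x1 = rng.normal(50, 10, n)
x2 = rng.normal(20, 5, n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 4, n)
X = np.column_stack([x1, x2])

# Step 4: describe each variable separately (central tendency, dispersion).
for name, values in [("x1", x1), ("x2", x2), ("y", y)]:
    print(f"{name}: mean={values.mean():.2f}, sd={values.std(ddof=1):.2f}")

# Step 5: correlation of each independent variable with the dependent variable.
r_with_y = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

# Step 6: correlation matrix among the independent variables.
r_between_x = np.corrcoef(X, rowvar=False)

# Step 7: least squares fit; the column of ones estimates the constant a.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b = coef[0], coef[1:]

# Step 8 (in part): R-squared, the share of the variance in y explained.
residuals = y - design @ coef
r_squared = 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

print("a =", round(float(a), 3), "b =", np.round(b, 3))
print("R^2 =", round(float(r_squared), 3))
```

In practice a statistics package would also report the standard errors, t-ratios, and F-test needed to complete step 8.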

**Y** is the value of the dependent variable (Y), what is being predicted or explained.

**a** (alpha) is the constant or intercept.

**b_{1}** is the slope (beta coefficient) for X_{1}.

**X_{1}** is the first independent variable that is explaining the variance in Y.

**b_{2}** is the slope (beta coefficient) for X_{2}.

**X_{2}** is the second independent variable that is explaining the variance in Y.

**b_{3}** is the slope (beta coefficient) for X_{3}.

**X_{3}** is the third independent variable that is explaining the variance in Y.

**s.e.b_{1}** is the standard error of coefficient b_{1}.

**s.e.b_{2}** is the standard error of coefficient b_{2}.

**s.e.b_{3}** is the standard error of coefficient b_{3}.

**R^{2}** is the proportion of the variance in the values of the dependent variable (Y) explained by all the independent variables (Xs) in the equation together; sometimes this is reported as adjusted R^{2}.

**F** indicates whether the equation as a whole is statistically significant in explaining Y.
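Put together, the elements above form the regression equation for three independent variables. The second line gives the usual definition of adjusted R^{2}; the symbols n (number of observations) and k (number of independent variables) are introduced here for illustration and are not part of the notation used in these notes.

```latex
Y = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}

R^{2}_{\text{adj}} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - k - 1}
```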

Example: The Department of Highway Safety wants to understand the influence of various factors on the number of annual highway fatalities.

Hypothesis: Number of annual fatalities is affected by total population, days of snow, and average MPH on highways.

Null hypothesis: Number of annual fatalities is not affected by total population, days of snow, or average MPH on highways.

Dependent variable: Y is the number of traffic fatalities in a state in a given year

Independent variables: X_{1} is the state's total population; X_{2} is the number of days it snowed; X_{3} is the average speed at which drivers were driving that year.

Equation: Y = 1.4 + .00029 X_{1}
+ 2.4 X_{2} + 10.3 X_{3}

Predicted value of Y: If X_{1}=3,000,000, X_{2}=2, and X_{3}=65, then

Y = 1.4 + .00029 (3,000,000) + 2.4 (2) + 10.3 (65) = 1545.7
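The same arithmetic can be checked with a short Python function. The coefficients are the ones from the example equation; the function and parameter names are hypothetical and chosen only for readability.

```python
def predict_fatalities(population, snow_days, avg_mph):
    """Predicted annual fatalities from the example equation."""
    a, b1, b2, b3 = 1.4, 0.00029, 2.4, 10.3  # constant and slopes from the example
    return a + b1 * population + b2 * snow_days + b3 * avg_mph


predicted = predict_fatalities(population=3_000_000, snow_days=2, avg_mph=65)
print(round(predicted, 1))  # 870 + 4.8 + 669.5 + 1.4 = 1545.7
```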

a=1.4

This is the number of traffic fatalities that would be expected if all three independent variables were equal to zero (no population, no days snowed, and zero average speed).

**b _{1}**=.00029

If X_{2} and X_{3} remain
the same, this indicates that for each extra person in the population,
the number of yearly traffic fatalities increases by .00029.

**b _{2}**=2.4

If X_{1} and X_{3} remain
the same, this indicates that for each extra day of snow, Y increases by
2.4 additional traffic fatalities.

**b_{3}**=10.3

If X_{1} and X_{2} remain
the same, this indicates that for each mph increase in average speed, Y
increases by 10.3 traffic fatalities.

**s.e.b_{1}**=.00003

Dividing b_{1} by s.e.b_{1} gives us a t-score of 9.66; p<.01. The t-score indicates that the b_{1} coefficient (the slope for X_{1}) is significantly different from zero, so the variable should be in the equation.

**s.e.b_{2}**=.62

Dividing b_{2} by s.e.b_{2} gives us a t-score of 3.87; p<.01. The t-score indicates that the b_{2} coefficient (the slope for X_{2}) is significantly different from zero, so the variable should be in the equation.

**s.e.b_{3}**=1.1

Dividing b_{3} by s.e.b_{3} gives us a t-score of 9.36; p<.01. The t-score indicates that the b_{3} coefficient (the slope for X_{3}) is significantly different from zero, so the variable should be in the equation.
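The t-scores are simply each coefficient divided by its standard error, which can be sketched as below. Two assumptions are made here: the example does not state the degrees of freedom, so the critical value of 2.58 (the large-sample, two-tailed cutoff at the .01 level) is used as an approximation; and the inputs are the rounded figures shown above, so the first ratio comes out near 9.67 rather than the 9.66 quoted, presumably because the original was computed from unrounded values.

```python
# (coefficient, standard error) pairs taken from the highway fatalities example.
coefficients = {
    "b1": (0.00029, 0.00003),
    "b2": (2.4, 0.62),
    "b3": (10.3, 1.1),
}

# Assumption: large sample, two-tailed test at the .01 level.
CRITICAL_T = 2.58

t_ratios = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_ratios.items():
    significant = abs(t) > CRITICAL_T
    print(f"{name}: t = {t:.2f}, significant at .01: {significant}")
```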

**R ^{2} = .78**

We can explain 78% of the variance in the number of annual fatalities among states if we know the states' populations, days of snow, and average highway speeds.

**F** is statistically significant.

The equation as a whole helps us to understand the dependent variable (Y).
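The example reports only that F is significant. One common way to obtain F is from R^{2}, the number of independent variables (k), and the number of observations (n): F = (R^{2}/k) / ((1 - R^{2})/(n - k - 1)). The sketch below applies that formula; note that the example does not give n, so the value of 50 used here is purely an assumption made so that the code runs, and the resulting F is not a figure from the original analysis.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F for a regression, computed from R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


# R^2 = .78 and three predictors come from the example;
# n_obs = 50 is a hypothetical sample size, NOT stated in the example.
f_value = f_statistic(r_squared=0.78, n_obs=50, n_predictors=3)
print(round(f_value, 2))
```

The larger F is relative to its critical value (which depends on k and n - k - 1), the stronger the evidence that the equation as a whole explains Y.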

__Conclusion__: Reject
the null hypothesis and accept the research hypothesis. Make recommendations
for management implications and further research.

Just as with simple regression, multiple regression will not be good at explaining the relationship of the independent variables to the dependent variable if those relationships are not linear.

Ordinary least squares linear multiple regression is used to predict dependent variables measured at the interval or ratio level. If the dependent variable is not measured at this level, then other, more specialized regression techniques must be used.

Ordinary least squares linear multiple regression assumes that the independent (X) variables are measured at the interval or ratio level. If they are not, multiple regression will produce more errors of prediction. When nominal level variables are used, they are called "dummy" variables. They take the value of 1 to represent the presence of some quality, and the value of zero to indicate the absence of that quality (for example, smoker=1, non-smoker=0). Ordinal variables may be coded to indicate ranks (for example, staff=1, supervisor=2, manager=3). The interpretation of the coefficients is more problematic with independent variables measured at the nominal or ordinal level.
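The two coding schemes mentioned above can be illustrated in Python. The sample records are invented for the illustration; only the codes themselves (smoker=1, non-smoker=0; staff=1, supervisor=2, manager=3) come from the text.

```python
# Hypothetical records, invented for illustration.
smoking_status = ["smoker", "non-smoker", "smoker", "non-smoker"]
positions = ["manager", "staff", "supervisor", "staff"]

# Dummy variable: 1 marks the presence of the quality, 0 its absence.
smoker_dummy = [1 if status == "smoker" else 0 for status in smoking_status]

# Ordinal variable: integer codes that preserve the rank order.
rank_codes = {"staff": 1, "supervisor": 2, "manager": 3}
position_rank = [rank_codes[p] for p in positions]

print(smoker_dummy)   # [1, 0, 1, 0]
print(position_rank)  # [3, 1, 2, 1]
```

With the dummy coding, the coefficient on `smoker_dummy` would be read as the expected difference in Y between smokers and non-smokers, holding the other variables constant. The rank coding treats the step from staff to supervisor as equal in size to the step from supervisor to manager, which is one reason the interpretation is more problematic.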

Regression with only one dependent and one independent variable normally requires a minimum of 30 observations. A good rule of thumb is to add at least an additional 10 observations for each additional independent variable added to the equation.
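Read literally, this rule of thumb gives a minimum of 30 observations for one independent variable plus 10 for each one beyond the first. A small helper (the function name is hypothetical) makes the arithmetic explicit:

```python
def minimum_observations(n_independent_vars):
    """Rule-of-thumb minimum sample size: 30, plus 10 per extra predictor."""
    if n_independent_vars < 1:
        raise ValueError("need at least one independent variable")
    return 30 + 10 * (n_independent_vars - 1)


for k in (1, 3, 5):
    print(k, "independent variable(s):", minimum_observations(k), "observations")
```

For the three-variable highway fatalities example, this reading of the rule suggests at least 50 observations.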

The number of independent
variables in the equation should be limited by two factors. First, the
independent variables should be included in the equation only if they are
based on the researcher's theory about what factors influence the dependent
variable. Second, variables that do not contribute very much to explaining
the variance in the dependent variable (i.e., to the total R^{2}),
should be eliminated.

Many difficulties tend to arise when there are more than five independent variables in a multiple regression equation. One of the most frequent is that two or more of the independent variables are highly correlated with one another. This is called multicollinearity. If a correlation coefficient matrix with all the independent variables indicates correlations of .75 or higher, then there may be a problem with multicollinearity.

When two variables
are highly correlated, they are basically measuring the same phenomenon.
When one enters into the regression equation, it tends to explain most
of the variance in the dependent variable that is related to that phenomenon.
This leaves little variance to be explained by the second independent variable.

Signs of multicollinearity include:

1. none of the t-ratios of the coefficients are statistically significant, but the F-test for the equation as a whole is significant;
2. adding an additional independent variable to the equation radically changes either the size or the sign (plus or minus) of the coefficients associated with the other independent variables.

If multicollinearity is discovered, the
researcher may drop one of the two variables that are highly correlated,
or simply leave them in and note that multicollinearity is present.
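The correlation-matrix check described in this section can be sketched with NumPy. The data are synthetic and invented for the illustration (x2 is built to be nearly a copy of x1), and the .75 cutoff is the one suggested above; taking the absolute value of each correlation, so that strong negative correlations are also flagged, is an assumption added here.

```python
import numpy as np

# Synthetic data invented for illustration: x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # measures almost the same thing as x1
x3 = rng.normal(size=n)             # unrelated to x1 and x2
names = ["x1", "x2", "x3"]
X = np.column_stack([x1, x2, x3])

THRESHOLD = 0.75  # cutoff suggested in the text

# Correlation matrix of the independent variables (columns of X).
corr = np.corrcoef(X, rowvar=False)

# Collect each pair of independent variables at or above the cutoff.
flagged = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) >= THRESHOLD
]
print(flagged)
```

Only the x1 and x2 pair should be flagged here, which would prompt the researcher either to drop one of the two or to note that multicollinearity is present.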

In multiple regression, the relative size of the non-standardized coefficients does not tell us which independent variable matters most. For example, say that we want to predict the graduate grade point averages of students who are newly admitted to the MPA Program. We use their undergraduate GPA, their GRE scores, and the number of years they have been out of college as independent variables. We obtain the following regression equation:

Y=1.437 + (.367) (UG-GPA) + (.00099) (GRE score) + (-.014) (years out of college)

We cannot compare the size of the various coefficients because the three independent variables are measured on different scales. Undergraduate GPA is measured on a scale from 0.0 to 4.0. GRE score is measured on a scale from 0 to 1600. Years out of college is measured on a scale from 0 to 20. We cannot directly tell which independent variable has the most effect on Y (graduate level GPA).

However, it is possible to transform the coefficients into standardized regression coefficients, which are written as the plain English letter b. The standardized regression coefficients in any one regression equation are measured on the same scale, as if every variable had first been rescaled to a mean of zero and a standard deviation of 1. They are then directly comparable to one another, with the largest coefficient (in absolute value) indicating which independent variable has the greatest influence on the dependent variable.

| Variable Name | Non-Standardized Coefficient (beta) | Standardized Coefficient (b) |
|---|---|---|
| Undergraduate GPA | .367 | +.291 |
| GRE score | .00099 | +.175 |
| Years out of college | -.014 | -.122 |
| Intercept or Constant (a) | 1.437 | n/a |

The convention to use β to denote non-standardized regression coefficients and to use b to denote standardized coefficients is not always respected. The one difference between non-standardized and standardized regression is that standardized regression does not have an α term (a constant). If there is no α term (no constant), then the regression coefficients have been standardized. If there is an α term, then the regression coefficients have not been standardized.
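One common way to obtain a standardized coefficient is to multiply the non-standardized coefficient by the standard deviation of its X variable and divide by the standard deviation of Y. The sketch below checks that this gives the same result as refitting the regression on z-scored variables, and that the constant in the z-scored regression is essentially zero. The data are synthetic and invented for the illustration (the notes do not report the standard deviations behind the table above), and the names `gpa`, `gre`, and `fit` are hypothetical.

```python
import numpy as np

# Synthetic data invented for illustration; not the MPA data from the table.
rng = np.random.default_rng(7)
n = 200
gpa = rng.uniform(2.0, 4.0, n)
gre = rng.uniform(800, 1600, n)
y = 1.0 + 0.4 * gpa + 0.001 * gre + rng.normal(0, 0.3, n)
X = np.column_stack([gpa, gre])


def fit(X, y):
    """Least squares fit; returns (constant, slopes)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


# Non-standardized fit, then rescale each slope by sd(X_j) / sd(Y).
constant, raw = fit(X, y)
standardized = raw * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Equivalent route: z-score every variable first, then fit.
zX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
z_constant, z_slopes = fit(zX, zy)

print("non-standardized:", np.round(raw, 5))
print("standardized:    ", np.round(standardized, 3))
print("constant after z-scoring:", round(float(z_constant), 10))
```

Because the two raw slopes are on very different scales, only the standardized values can be compared directly.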