Reality in the public sector is complex. Often there may be several possible causes associated with a problem, and likewise there may be several factors necessary for a solution. Complex statistical applications are needed that can take several factors into account at the same time.
Ordinary least squares linear regression is the most widely used type of regression for predicting the value of one dependent variable from the value of one independent variable. It is also widely used for predicting the value of one dependent variable from the values of two or more independent variables. When there are two or more independent variables, it is called multiple regression.
The general form of the equation is: Y = a + b1 X1 + b2 X2 + b3 X3
Y is the value of the dependent variable (Y), what is being predicted or explained
a (alpha) is the constant or intercept
b1 is the slope (beta coefficient) for X1
X1 is the first independent variable that is explaining the variance in Y
b2 is the slope (beta coefficient) for X2
X2 is the second independent variable that is explaining the variance in Y
b3 is the slope (beta coefficient) for X3
X3 is the third independent variable that is explaining the variance in Y
s.e.b1 is the standard error of coefficient b1
s.e.b2 is the standard error of coefficient b2
s.e.b3 is the standard error of coefficient b3
R2 is the proportion of the variance in the values of the dependent variable (Y) explained by all the independent variables (Xs) in the equation together; this is sometimes reported as adjusted R2, when a correction has been made to reflect the number of variables in the equation.
F indicates whether the equation as a whole is statistically significant in explaining Y.
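The elements above can be illustrated with a small sketch that fits a multiple regression equation by ordinary least squares. The data below are made up purely for illustration (the variable names and "true" coefficients are assumptions, not from the text); NumPy's least-squares solver recovers a, b1, b2, and b3 from the data.

```python
import numpy as np

# Hypothetical data generated for illustration only.
rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 10, n)
X3 = rng.uniform(0, 10, n)
# The "true" relationship used to generate Y, plus a little random noise.
Y = 1.5 + 2.0 * X1 + 0.5 * X2 - 1.0 * X3 + rng.normal(0, 0.1, n)

# Design matrix: a leading column of ones estimates the constant (a).
X = np.column_stack([np.ones(n), X1, X2, X3])
coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)
a, b1, b2, b3 = coeffs
```

With this much data and so little noise, the estimated coefficients land very close to the values used to generate Y.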
Example: The Department of Highway Safety wants to understand the influence of various factors on the number of annual highway fatalities.
Hypothesis: Number of annual fatalities is affected by total population, days of snow, and average MPH on highways.
Null hypothesis: Number of annual fatalities is not affected by total population, days of snow, or average MPH on highways.
Dependent variable: Y is the number of traffic fatalities in a state in a given year
Independent variables: X1 is the state's total population; X2 is the number of days it snowed; X3 is the average speed drivers drove that year.
Equation: Y = 1.4 + .00029 X1 + 2.4 X2 + 10.3 X3
Predicted value of Y: If X1 = 3,000,000, X2 = 2, and X3 = 65, then
Y = 1.4 + .00029 (3,000,000) + 2.4 (2) + 10.3 (65) = 1545.7
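The prediction can be sketched directly from the equation in the text; the function name below is just an illustrative label.

```python
# Coefficients from the fatalities equation: Y = a + b1*X1 + b2*X2 + b3*X3
a, b1, b2, b3 = 1.4, 0.00029, 2.4, 10.3

def predict_fatalities(population, snow_days, avg_mph):
    """Predicted annual traffic fatalities for one state-year."""
    return a + b1 * population + b2 * snow_days + b3 * avg_mph

y = predict_fatalities(3_000_000, 2, 65)
print(round(y, 1))  # 1545.7
```

Note that adding one day of snow while holding the other variables constant raises the prediction by exactly b2 = 2.4, which is the interpretation of the slope coefficients given below.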
The constant, 1.4, is the number of traffic fatalities that would be expected if all three independent variables were equal to zero (no population, no days snowed, and zero average speed).
If X2 and X3 remain the same, this indicates that for each extra person in the population, the number of yearly traffic fatalities increases by .00029.
If X1 and X3 remain the same, this indicates that for each extra day of snow, Y increases by 2.4 additional traffic fatalities.
If X1 and X2 remain the same, this indicates that for each mph increase in average speed, Y increases by 10.3 traffic fatalities.
Dividing b1 by s.e.b1 gives us a t-score of 9.66; p<.01. The t-score indicates that the slope of the b coefficient is significantly different from zero, so the variable should be in the equation.
Dividing b2 by s.e.b2 gives us a t-score of 3.87; p<.01. The t-score indicates that the slope of the b coefficient is significantly different from zero so the variable should be in the equation.
Dividing b3 by s.e.b3 gives us a t-score of 9.36; p<.01. The t-score indicates that the slope of the b coefficient is significantly different from zero so the variable should be in the equation.
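The t-score computation is just the coefficient divided by its standard error. The standard errors below are hypothetical values chosen so the resulting t-scores match the ones reported above; the text does not give the actual standard errors.

```python
# Coefficients from the text; standard errors are assumed for illustration.
b = {"b1": 0.00029, "b2": 2.4, "b3": 10.3}
se = {"b1": 0.00003, "b2": 0.62, "b3": 1.1}

for name in b:
    t = b[name] / se[name]
    # A |t| of roughly 2 or more (p < .05) suggests the slope
    # differs significantly from zero.
    print(name, round(t, 2), "keep" if abs(t) >= 2 else "reconsider")
```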
R2 = .78
We can explain 78% of the difference in annual fatality rates among states if we know the states' populations, days of snow, and average highway speeds.
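R2 itself is computed as one minus the ratio of unexplained to total variation. The actual and predicted values below are made-up numbers used only to show the arithmetic.

```python
# R-squared from scratch: 1 - SS_residual / SS_total.
# These data are hypothetical, for illustration only.
y_actual    = [10.0, 12.0, 14.0, 16.0, 18.0]
y_predicted = [10.5, 11.5, 14.0, 16.5, 17.5]

mean_y = sum(y_actual) / len(y_actual)
ss_tot = sum((y - mean_y) ** 2 for y in y_actual)
ss_res = sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_predicted))
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.975
```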
F is statistically significant.
The equation as a whole helps us to understand the dependent variable (Y).
We therefore reject the null hypothesis and accept the research hypothesis. Make recommendations for management implications and further research.
Just as with simple regression, multiple regression will not be good at explaining the relationship of the independent variables to the dependent variable if those relationships are not linear.
Ordinary least squares linear multiple regression is used to predict dependent variables measured at the interval or ratio level. If the dependent variable is not measured at this level, then other, more specialized regression techniques must be used.
Ordinary least squares linear multiple regression assumes that the independent (X) variables are measured at the interval or ratio level. If the variables are not, then multiple regression will result in more errors of prediction. When nominal level variables are used, they are called "dummy" variables. They take the value of 1 to represent the presence of some quality, and the value of zero to indicate the absence of that quality (for example, smoker=1, non-smoker=0). Ordinal variables may indicate ranks (for example, staff=1, supervisor=2, manager=3). The interpretation of the coefficients is more problematic with independent variables measured at the nominal or ordinal level.
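The coding schemes just described can be sketched in a few lines, using the smoker and staff-rank examples from the text:

```python
# Dummy coding a nominal variable: 1 marks the presence of the quality,
# 0 its absence (smoker = 1, non-smoker = 0).
people = ["smoker", "non-smoker", "smoker", "non-smoker"]
smoker_dummy = [1 if p == "smoker" else 0 for p in people]
print(smoker_dummy)  # [1, 0, 1, 0]

# Ordinal coding assigns ranks (staff = 1, supervisor = 2, manager = 3).
rank = {"staff": 1, "supervisor": 2, "manager": 3}
roles = ["manager", "staff", "supervisor"]
rank_codes = [rank[r] for r in roles]
print(rank_codes)  # [3, 1, 2]
```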
Regression with only one dependent and one independent variable normally requires a minimum of 30 observations. A good rule of thumb is to add at least an additional 10 observations for each additional independent variable added to the equation.
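This rule of thumb reduces to simple arithmetic; the function name here is illustrative, not standard.

```python
def minimum_observations(num_independent_vars):
    """Rule of thumb from the text: at least 30 observations for one
    independent variable, plus at least 10 more for each additional one."""
    return 30 + 10 * (num_independent_vars - 1)

# The fatalities equation has three independent variables.
print(minimum_observations(3))  # 50
```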
The number of independent variables in the equation should be limited by two factors. First, the independent variables should be included in the equation only if they are based on the researcher's theory about what factors influence the dependent variable. Second, variables that do not contribute very much to explaining the variance in the dependent variable (i.e., to the total R2), should be eliminated.
Many difficulties tend to arise when there are more than five independent variables in a multiple regression equation. One of the most frequent is the problem that two or more of the independent variables are highly correlated to one another. This is called multicollinearity. If a correlation coefficient matrix with all the independent variables indicates correlations of .75 or higher, then there may be a problem with multicollinearity.
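A multicollinearity screen is just a pairwise correlation check against the .75 threshold mentioned above. The two predictor series below are hypothetical, constructed so that one is nearly a rescaled copy of the other.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical predictors: x2 is almost exactly twice x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
r = pearson_r(x1, x2)
print(round(r, 2), "possible multicollinearity" if abs(r) >= 0.75 else "ok")
```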
When two variables are highly correlated, they are basically measuring the same phenomenon. When one enters the regression equation, it tends to explain most of the variance in the dependent variable that is related to that phenomenon. This leaves little variance to be explained by the second independent variable.
If multicollinearity is discovered, the researcher may drop one of the two variables that are highly correlated, or simply leave them in and note that multicollinearity is present.
In multiple regression, the relative size of the coefficients does not, by itself, tell us which independent variable has the most influence. For example, say that we want to predict the graduate grade point averages of students who are newly admitted to the MPA Program. We use their undergraduate GPA, their GRE scores, and the number of years they have been out of college as independent variables. We obtain the following regression equation:
Y=1.437 + (.367) (UG-GPA) + (.00099) (GRE score) + (-.014) (years out of college)
We cannot compare the size of the various coefficients because the three independent variables are measured on different scales. Undergraduate GPA is measured on a scale from 0.0 to 4.0. GRE score is measured on a scale from 0 to 1600. Years out of college is measured on a scale from 0 to 20. We cannot directly tell which independent variable has the most effect on Y (graduate level GPA).
However, it is possible to transform the coefficients into standardized regression coefficients, which are written as the plain English letter b. The standardized regression coefficients in any one regression equation are measured on the same scale, with a mean of zero and a standard deviation of 1. They are then directly comparable to one another, with the largest coefficient indicating which independent variable has the greatest influence on the dependent variable.
| Variable | Coefficient | Standardized coefficient |
| Years out of college | -.014 | -.122 |
| Intercept or Constant (a) | 1.437 | n/a |
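A coefficient is standardized by multiplying it by the ratio of the standard deviation of its X variable to the standard deviation of Y. The standard deviations below are hypothetical values, chosen only so that the result matches the "years out of college" row in the table above.

```python
# Standardizing a regression coefficient: beta = b * (sd_x / sd_y).
b_years = -0.014   # non-standardized coefficient from the equation above
sd_years = 4.36    # hypothetical SD of years out of college
sd_gpa = 0.5       # hypothetical SD of graduate GPA

beta_years = b_years * (sd_years / sd_gpa)
print(round(beta_years, 3))  # -0.122
```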
The convention of using the Greek letter β (beta) to denote non-standardized regression coefficients and the plain English letter b to denote standardized coefficients is not always respected. The one difference between non-standardized and standardized regression is that a standardized regression equation does not have an a term (a constant). If there is no a term (no constant), then the regression coefficients have been standardized. If there is an a term, then the regression coefficients have not been standardized.