Multiple Regression

PPA 696 RESEARCH METHODS

Multiple Regression
Steps in Multiple Regression
Elements of a Multiple Regression Equation
Problems with Multiple Regression
Standardized Regression

MULTIPLE REGRESSION

Reality in the public sector is complex. Often several possible causes are associated with a problem, and likewise several factors may be necessary for a solution. Complex statistical applications are needed that can:

-deal with interval and ratio level variables

-assess causal linkages

-forecast future outcomes

Ordinary least squares linear regression is the most widely used type of regression for predicting the value of one dependent variable from the value of one independent variable. It is also widely used for predicting the value of one dependent variable from the values of two or more independent variables. When there are two or more independent variables, it is called multiple regression.

STEPS IN MULTIPLE REGRESSION

The steps in multiple regression are basically the same as in simple regression:

1. State the research hypothesis.

2. State the null hypothesis.

3. Gather the data.

4. Assess each variable separately first (obtain measures of central tendency and dispersion, frequency distributions, and graphs). Is the variable normally distributed?

5. Assess the relationship of each independent variable, one at a time, with the dependent variable (calculate the correlation coefficient; obtain a scatter plot). Are the two variables linearly related?

6. Assess the relationships of all the independent variables with each other (obtain a correlation coefficient matrix for all the independent variables). Are the independent variables too highly correlated with one another?

7. Calculate the regression equation from the data.

8. Calculate and examine appropriate measures of association and tests of statistical significance for each coefficient and for the equation as a whole.

9. Accept or reject the null hypothesis.

10. Reject or accept the research hypothesis.

11. Explain the practical implications of the findings.

ELEMENTS OF A MULTIPLE REGRESSION EQUATION

Y=a + b₁X₁ + b₂X₂ + b₃X₃

Y is the value of the Dependent variable (Y), what is being predicted or explained

a (Alpha) is the Constant or intercept

b₁ is the Slope (Beta coefficient) for X₁

X₁ First independent variable that is explaining the variance in Y

b₂ is the Slope (Beta coefficient) for X₂

X₂ Second independent variable that is explaining the variance in Y

b₃ is the Slope (Beta coefficient) for X₃

X₃ Third independent variable that is explaining the variance in Y

s.e.b₁ standard error of coefficient b₁

s.e.b₂ standard error of coefficient b₂

s.e.b₃ standard error of coefficient b₃

R² The proportion of the variance in the values of the dependent variable (Y) explained by all the independent variables (Xs) in the equation together; sometimes this is reported as adjusted R², when a correction has been made to reflect the number of variables in the equation.

F Whether the equation as a whole is statistically significant in explaining Y
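As a rough illustration of where these elements come from, the sketch below fits an equation with three independent variables by ordinary least squares using NumPy. The data are synthetic and generated only for this illustration; the sample size, the "true" coefficients, and all variable names are assumptions, not values from any real study.

```python
import numpy as np

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
n, k = 50, 3                      # observations, independent variables
X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the constant (a), then X1..X3.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # [a, b1, b2, b3]

# Standard errors of the coefficients.
resid = y - A @ coef
df_resid = n - k - 1
sigma2 = resid @ resid / df_resid              # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

# R-squared, adjusted R-squared, and the F statistic for the whole equation.
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print("a, b1, b2, b3:", np.round(coef, 3))
print("standard errors:", np.round(se, 3))
print("R2:", round(r2, 3), "adjusted R2:", round(adj_r2, 3), "F:", round(f_stat, 1))
```

In practice a statistical package reports all of these values directly; the point of the sketch is only to show how they relate to one another.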

Example: The Department of Highway Safety wants to understand the influence of various factors on the number of annual highway fatalities.

Hypothesis: Number of annual fatalities is affected by total population, days of snow, and average MPH on highways.

Null hypothesis: Number of annual fatalities is not affected by total population, days of snow, or average MPH on highways.

Dependent variable: Y is the number of traffic fatalities in a state in a given year

Independent variables: X₁ is the state's total population; X₂ is the number of days it snowed; X₃ is the average speed at which drivers traveled on highways in that year.

Equation: Y = 1.4 + .00029 X₁ + 2.4 X₂ + 10.3 X₃

Predicted value of Y: If X₁=3,000,000, X₂=2, and X₃=65, then
Y = 1.4 + .00029 (3,000,000) + 2.4 (2) + 10.3 (65) = 1545.7
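The same arithmetic can be checked with a few lines of Python. This is a minimal sketch that uses the coefficients from the example equation; the function name `predict_fatalities` is hypothetical.

```python
# Coefficients taken from the example equation above.
a = 1.4
b1, b2, b3 = 0.00029, 2.4, 10.3

def predict_fatalities(population, snow_days, avg_mph):
    """Predicted annual fatalities: Y = a + b1*X1 + b2*X2 + b3*X3."""
    return a + b1 * population + b2 * snow_days + b3 * avg_mph

y_hat = predict_fatalities(population=3_000_000, snow_days=2, avg_mph=65)
print(round(y_hat, 1))  # 1545.7
```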

a=1.4

This is the number of traffic fatalities that would be expected if all three independent variables were equal to zero (no population, no days snowed, and zero average speed).

b₁=.00029

If X₂ and X₃ remain the same, this indicates that for each extra person in the population, the number of yearly traffic fatalities increases by .00029.

b₂=2.4

If X₁ and X₃ remain the same, this indicates that for each extra day of snow, Y increases by 2.4 additional traffic fatalities.

b₃= 10.3

If X₁ and X₂ remain the same, this indicates that for each mph increase in average speed, Y increases by 10.3 traffic fatalities.

s.e.b₁=.00003

Dividing b₁ by s.e.b₁ gives us a t-score of 9.66; p<.01. The t-score indicates that the coefficient b₁ is significantly different from zero, so the variable should be in the equation.

s.e.b₂=.62

Dividing b₂ by s.e.b₂ gives us a t-score of 3.87; p<.01. The t-score indicates that the coefficient b₂ is significantly different from zero, so the variable should be in the equation.

s.e.b₃=1.1

Dividing b₃ by s.e.b₃ gives us a t-score of 9.36; p<.01. The t-score indicates that the coefficient b₃ is significantly different from zero, so the variable should be in the equation.
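The three t-scores are simply each coefficient divided by its standard error, which the short sketch below reproduces from the rounded values given in the example. Note that dividing the rounded b₁ by its rounded standard error gives 9.67 rather than the reported 9.66; a small difference like this is what one would expect if the reported figure was calculated from unrounded values, although that is an assumption. The p-values are not recomputed here because they depend on the degrees of freedom, which the example does not state.

```python
# Rounded coefficients and standard errors from the example above.
coefficients = {"b1": 0.00029, "b2": 2.4, "b3": 10.3}
standard_errors = {"b1": 0.00003, "b2": 0.62, "b3": 1.1}

# t-score = coefficient / standard error of that coefficient
t_scores = {name: coefficients[name] / standard_errors[name]
            for name in coefficients}

for name, t in t_scores.items():
    print(f"{name}: t = {t:.2f}")
```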

R² = .78

We can explain 78% of the variance in the number of annual fatalities among states if we know the states' populations, days of snow, and average highway speeds.

F is statistically significant.

The equation as a whole helps us to understand the dependent variable (Y).

Conclusion: Reject the null hypothesis and accept the research hypothesis. Make recommendations for management implications and further research.

PROBLEMS WITH MULTIPLE REGRESSION

Just as with simple regression, multiple regression will not be good at explaining the relationship of the independent variables to the dependent variable if those relationships are not linear.

Ordinary least squares linear multiple regression is used to predict dependent variables measured at the interval or ratio level. If the dependent variable is not measured at this level, then other, more specialized regression techniques must be used.

Ordinary least squares linear multiple regression assumes that the independent (X) variables are measured at the interval or ratio level. If the variables are not, then multiple regression will result in more errors of prediction. When nominal level variables are used, they are called "dummy" variables. They take the value of 1 to represent the presence of some quality, and the value of zero to indicate the absence of that quality (for example, smoker=1, non-smoker=0). Ordinal variables may indicate ranks (for example, staff=1, supervisor=2, manager=3). The interpretation of the coefficients is more problematic with independent variables measured at the nominal or ordinal level.
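The coding schemes just described can be sketched as follows. The records are invented for illustration, and the variable names are hypothetical.

```python
# Hypothetical records, invented for illustration only.
smoking_status = ["smoker", "non-smoker", "smoker", "non-smoker"]
job_level = ["staff", "manager", "supervisor", "staff"]

# Nominal variable -> dummy variable (1 = quality present, 0 = absent).
smoker_dummy = [1 if status == "smoker" else 0 for status in smoking_status]

# Ordinal variable -> rank codes.
rank_codes = {"staff": 1, "supervisor": 2, "manager": 3}
job_rank = [rank_codes[level] for level in job_level]

print(smoker_dummy)  # [1, 0, 1, 0]
print(job_rank)      # [1, 3, 2, 1]
```

The rank codes treat the step from staff to supervisor as equal in size to the step from supervisor to manager, which is one reason coefficients on ordinal variables are harder to interpret.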

Regression with only one dependent and one independent variable normally requires a minimum of 30 observations. A good rule of thumb is to add at least an additional 10 observations for each additional independent variable added to the equation.
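This rule of thumb reduces to simple arithmetic, sketched below. The function name `minimum_observations` is hypothetical, and the result is only a guideline for the minimum, not a formal power calculation.

```python
def minimum_observations(n_independent: int) -> int:
    """Rule of thumb: 30 observations for one independent variable,
    plus at least 10 more for each additional independent variable."""
    if n_independent < 1:
        raise ValueError("need at least one independent variable")
    return 30 + 10 * (n_independent - 1)

for k in (1, 3, 5):
    print(k, "independent variable(s):", minimum_observations(k), "observations")
```

For the highway fatality example, with three independent variables, the rule suggests at least 50 observations.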

The number of independent variables in the equation should be limited by two factors. First, the independent variables should be included in the equation only if they are based on the researcher's theory about what factors influence the dependent variable. Second, variables that do not contribute very much to explaining the variance in the dependent variable (i.e., to the total R²) should be eliminated.

Many difficulties tend to arise when there are more than five independent variables in a multiple regression equation. One of the most frequent is that two or more of the independent variables are highly correlated with one another. This is called multicollinearity. If a correlation coefficient matrix with all the independent variables indicates correlations of .75 or higher, then there may be a problem with multicollinearity.
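A check of this kind might be sketched as follows, using NumPy to build the correlation matrix and flag any pair of independent variables at or above the .75 guideline. The data are synthetic, and X2 is deliberately constructed to be highly correlated with X1; the variable names and the threshold constant are assumptions made for the illustration.

```python
import numpy as np

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # built to track x1 closely
x3 = rng.normal(size=n)                   # unrelated to x1 and x2

names = ["X1", "X2", "X3"]
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the independent variables (columns are variables).
corr = np.corrcoef(X, rowvar=False)

THRESHOLD = 0.75
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) >= THRESHOLD
]

print(np.round(corr, 2))
for first, second, r in flagged:
    print(f"Possible multicollinearity: {first} and {second} (r = {r:.2f})")
```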

When two variables are highly correlated, they are basically measuring the same phenomenon. When one enters into the regression equation, it tends to explain most of the variance in the dependent variable that is related to that phenomenon. This leaves little variance to be explained by the second independent variable.

Signs of multicollinearity include:

1) none of the t-ratios of the coefficients are statistically significant, but the F-test for the equation as a whole is significant;

2) adding an additional independent variable to the equation radically changes either the size or the sign (plus or minus) of the coefficients associated with the other independent variables.

If multicollinearity is discovered, the researcher may drop one of the two variables that are highly correlated, or simply leave them in and note that multicollinearity is present.

STANDARDIZED REGRESSION

In multiple regression, the relative size of the non-standardized coefficients does not tell us which independent variable matters most. For example, say that we want to predict the graduate grade point averages of students who are newly admitted to the MPA Program. We use their undergraduate GPA, their GRE scores, and the number of years they have been out of college as independent variables. We obtain the following regression equation:

Y=1.437 + (.367) (UG-GPA) + (.00099) (GRE score) + (-.014) (years out of college)

We cannot compare the size of the various coefficients because the three independent variables are measured on different scales. Undergraduate GPA is measured on a scale from 0.0 to 4.0. GRE score is measured on a scale from 0 to 1600. Years out of college is measured on a scale from 0 to 20. We cannot directly tell which independent variable has the most effect on Y (graduate level GPA).

However, it is possible to transform the coefficients into standardized regression coefficients, which are written as the plain English letter b. Standardizing puts all the variables in a regression equation on the same scale, with a mean of zero and a standard deviation of 1. The standardized coefficients are then directly comparable to one another, with the coefficient that is largest in absolute value indicating which independent variable has the greatest influence on the dependent variable.

Variable Name	Non-Standardized Coefficient (beta)	Standardized Coefficient (b)
Undergraduate GPA	.367	+.291
GRE score	.00099	+.175
Years out of college	-.014	-.122
Intercept or Constant (a)	1.437	n/a

The convention of using beta to denote non-standardized regression coefficients and b to denote standardized coefficients is not always respected. The one reliable difference between non-standardized and standardized regression is that standardized regression does not have a constant term (a). If there is no constant term, then the regression coefficients have been standardized. If there is a constant term, then the regression coefficients have not been standardized.
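The relationship between the two kinds of coefficients can be sketched in Python with NumPy. Each standardized coefficient equals the non-standardized coefficient multiplied by the standard deviation of its independent variable and divided by the standard deviation of the dependent variable, and the constant drops out once every variable has a mean of zero. The data below are synthetic and are not the MPA admissions data, so the numbers will not reproduce the table above; the helper name `fit_ols` and all variable names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return (constant, slopes) from an ordinary least squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic data on three very different scales, invented for illustration.
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([
    rng.uniform(2.0, 4.0, n),     # a GPA-like scale
    rng.uniform(800, 1600, n),    # a test-score-like scale
    rng.uniform(0, 20, n),        # a years-like scale
])
y = 1.0 + X @ np.array([0.4, 0.001, -0.02]) + rng.normal(scale=0.2, size=n)

# Non-standardized regression on the raw variables.
raw_constant, raw_slopes = fit_ols(X, y)

# Standardized regression: convert every variable to z-scores first.
sd_x = X.std(axis=0, ddof=1)
sd_y = y.std(ddof=1)
Xz = (X - X.mean(axis=0)) / sd_x
yz = (y - y.mean()) / sd_y
std_constant, std_slopes = fit_ols(Xz, yz)

print("non-standardized:", np.round(raw_slopes, 5), "constant:", round(raw_constant, 3))
print("standardized:    ", np.round(std_slopes, 3), "constant:", round(abs(std_constant), 6))
print("raw * sd_x / sd_y:", np.round(raw_slopes * sd_x / sd_y, 3))
```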
