Simple Regression

PPA 696 RESEARCH METHODS

SIMPLE REGRESSION

Regression
Elements of a Regression Equation
Assessing the Regression Equation
Steps in Linear Regression
Assumptions of Linear Regression
Time Series Regression

REGRESSION

The most commonly used form of regression is linear regression, and the most common type of linear regression is called ordinary least squares regression.

Linear regression uses an existing data set of paired measurements of two variables, X and Y, to develop a model that predicts the value of the dependent variable, Y, for given values of X.

ELEMENTS OF A REGRESSION EQUATION

The regression equation is written as Y = a + bX + e

Y is the value of the Dependent variable (Y), what is being predicted or explained

a or Alpha, a constant; equals the value of Y when the value of X=0

b or Beta, the coefficient of X; the slope of the regression line; how much Y changes for each one-unit change in X.

X is the value of the Independent variable (X), what is predicting or explaining the value of Y

e is the error term; the error in predicting the value of Y, given the value of X (it is not displayed in most regression equations).

For example, say we know what the average speed is of cars on the freeway when we have 2 highway patrols deployed (average speed=75 mph) or 10 highway patrols deployed (average speed=35 mph). But what will be the average speed of cars on the freeway when we deploy 5 highway patrols?

Average Speed on Freeway (Y)	Number of Patrol Cars Deployed (X)
75	2
35	10

From our known data, we can use the regression formula (calculations not shown) to compute the values of a and b and obtain the following equation: Y = 85 + (-5) X, where

Y is the average speed of cars on the freeway

a=85, or the average speed when X=0

b=(-5), the impact on Y of each additional patrol car deployed

X is the number of patrol cars deployed

That is, the average speed of cars on the freeway when there are no highway patrols working (X=0) will be 85 mph. For each additional highway patrol car working, the average speed will drop by 5 mph. For five patrols (X=5), Y = 85 + (-5) (5) = 85 - 25 = 60 mph
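The calculations that the text leaves out can be sketched in a few lines of Python. This is only an illustration: the function name `line_through` and the helper `predict` are made up for this example. With exactly two observations, the least squares line is simply the straight line that passes through both points.

```python
def line_through(point_1, point_2):
    """Return (a, b) for the line Y = a + bX passing through two (X, Y) points."""
    (x1, y1), (x2, y2) = point_1, point_2
    b = (y2 - y1) / (x2 - x1)   # slope: change in Y per one-unit change in X
    a = y1 - b * x1             # constant: value of Y when X = 0
    return a, b


# Known data from the example: (patrol cars deployed, average speed in mph)
a, b = line_through((2, 75), (10, 35))


def predict(patrol_cars):
    """Predicted average speed for a given number of patrol cars."""
    return a + b * patrol_cars


print(a, b, predict(5))  # 85.0 -5.0 60.0
```

The printed values match the equation above: a constant of 85, a slope of -5, and a predicted speed of 60 mph for five patrols.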

There may be some variations on how regression equations are written in the literature. For example, you may sometimes see the dependent variable term (Y) written with a little "hat" ( ^ ) on it, or called Y-hat. This refers to the predicted value of Y. The plain Y refers to observed values of Y in the data set used to calculate the regression equation.

You may see the symbols for alpha (a) and beta (b) written in Greek letters, or you may see them written in English letters. The coefficient of the independent variable may have a subscript, as may the term for X, for example, b₁X₁ (this is common in multiple regression).

ASSESSING THE REGRESSION EQUATION

We now have a regression equation. But how good is the equation at predicting values of Y, for given values of X? For that assessment, we turn to measures of association and measures of statistical significance that are used with regression equations.

r²
r² is a measure of association; it represents the percent of the variance in the values of Y that can be explained by knowing the value of X. r² varies from a low of 0.0 (none of the variance is explained), to a high of +1.0 (all of the variance is explained).

s.e.b
s.e.b is the standard error of the computed value of b. A t-test for statistical significance of the coefficient is conducted by dividing the value of b by its standard error. By rule of thumb, a t-value greater than 2.0 in absolute value (that is, above +2.0 or below -2.0) is usually statistically significant, but you must consult a t-table to be sure. If the t-value indicates that the b coefficient is statistically significant, this means that the independent variable or X (number of patrol cars deployed) should be kept in the regression equation, since it has a statistically significant relationship with the dependent variable or Y (average speed in mph). If the relationship were not statistically significant, the value of the b coefficient would be (statistically speaking) indistinguishable from zero.

F
F is a test for statistical significance of the regression equation as a whole. It is obtained by dividing the explained variance by the unexplained variance. By rule of thumb, an F-value greater than 4.0 is usually statistically significant, but you must consult an F-table to be sure. If F is significant, then the regression equation helps us to understand the relationship between X and Y.

For our example above, say we obtained the following values:

r² = .9
Knowing the value of X (the number of patrol cars deployed), we can explain 90% of the variance in Y (the average speed of motorists on the freeway).

s.e.b = 1.5
Dividing b by s.e.b, we obtain a value for t = -5/1.5 = -3.3. Consulting a t-table, we find that the coefficient is statistically significant. This means that the independent variable X (number of patrol cars deployed) should be kept in the regression equation, since it has a statistically significant relationship with the dependent variable Y (average speed in mph).

F= 8.4
From the F-table, we see that the regression equation as a whole is statistically significant. This means that the regression equation is helping us to understand the relationship between X and Y.
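For readers who want to see where r², s.e.b, t, and F come from, here is a minimal sketch in plain Python. The function name `assess_simple_regression` and the five data points are hypothetical, invented only for illustration; they are not the data behind the values quoted above, so the results will differ from those values.

```python
import math


def assess_simple_regression(xs, ys):
    """Fit Y = a + bX by least squares and return the usual assessment measures."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # constant

    total_ss = sum((y - mean_y) ** 2 for y in ys)                        # total variation in Y
    error_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))       # unexplained variation
    explained_ss = total_ss - error_ss                                   # explained variation

    r2 = explained_ss / total_ss
    error_ms = error_ss / (n - 2)          # unexplained variance (n - 2 degrees of freedom)
    se_b = math.sqrt(error_ms / sxx)       # standard error of b
    t = b / se_b                           # t-test for the coefficient
    f = (explained_ss / 1) / error_ms      # explained variance / unexplained variance
    return {"a": a, "b": b, "r2": r2, "se_b": se_b, "t": t, "F": f}


# Hypothetical observations (NOT from the handout): patrol cars and average speed
patrols = [2, 4, 6, 8, 10]
speeds = [75, 68, 52, 47, 35]

stats = assess_simple_regression(patrols, speeds)
for name, value in stats.items():
    print(f"{name} = {value:.3f}")
```

The computed t and F still have to be compared with a t-table and an F-table (or with the rules of thumb of 2.0 and 4.0) to judge statistical significance.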

STEPS IN LINEAR REGRESSION

1. State the hypothesis.
2. State the null hypothesis.
3. Gather the data.
4. Compute the regression equation.
5. Examine tests of statistical significance and measures of association.
6. Relate statistical findings to the hypothesis. Accept or reject the null hypothesis.
7. Reject, accept or revise the original hypothesis. Make suggestions for research design and management aspects of the problem.

Example: The motor pool wants to know if it costs more to maintain cars that are driven more often.

Hypothesis: maintenance costs are affected by car mileage
Null hypothesis: there is no relationship between mileage and maintenance costs

Dependent variable: Y is the cost in dollars of yearly maintenance on a motor vehicle
Independent variable: X is the yearly mileage on the same motor vehicle

Data are gathered on each car in the motor pool, regarding number of miles driven in a given year, and maintenance costs for that year. Here is a sample of the data collected.

Car Number	Miles Driven (X)	Repair Costs (Y)
1	80,000	$1,200
2	29,000	$150
3	53,000	$650
4	13,000	$200
5	45,000	$325

The regression equation is computed as (computations not shown): Y = 50 + .03 X

For example, if X=50,000 then Y = 50 + .03 (50,000) = $1,550

a=50, or the cost of maintenance when X=0; if there is no mileage on the car, then the yearly cost of maintenance=$50

b=.03, the amount by which Y increases for each one-unit increase in X; for each extra mile driven (X), the cost of yearly maintenance increases by $.03

s.e.b = .0005; the value of b divided by s.e.b=60.0; the t-table indicates that the b coefficient of X is statistically significant (it is related to Y)

r²=.90; we can explain 90% of the variance in repair costs for different vehicles if we know the vehicle mileage for each car

Conclusion: Reject the null hypothesis of no relationship and accept the research hypothesis, that mileage affects repair costs.
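The arithmetic in this example can be checked with a short Python sketch. The coefficient values are the ones reported above; the names `predicted_cost` and `t_value` are hypothetical, and the 2.0 cutoff is only the rule of thumb mentioned earlier, not a substitute for the t-table.

```python
# Values reported for the motor pool regression: Y = 50 + .03 X
A = 50.0        # constant (dollars per year when mileage is zero)
B = 0.03        # coefficient (dollars per additional mile)
SE_B = 0.0005   # standard error of the coefficient


def predicted_cost(miles):
    """Predicted yearly maintenance cost in dollars for a given yearly mileage."""
    return A + B * miles


t_value = B / SE_B  # t-test: coefficient divided by its standard error

print(round(predicted_cost(50_000), 2))   # 1550.0
print(round(t_value, 1))                  # 60.0
print(abs(t_value) > 2.0)                 # True: passes the rule of thumb
```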

ASSUMPTIONS OF LINEAR REGRESSION

In theory, there are several important assumptions that must be satisfied if linear regression is to be used. These are:

1. Both the independent (X) and the dependent (Y) variables are measured at the interval or ratio level.

2. The relationship between the independent (X) and the dependent (Y) variables is linear.

3. Errors in prediction of the value of Y are distributed in a way that approaches the normal curve.

4. Errors in prediction of the value of Y are all independent of one another.

5. The spread (variance) of the errors in prediction of the value of Y is constant regardless of the value of X.

There are a number of advanced statistical tests that can be used to examine whether or not these assumptions are true for any given regression equation. However, these are beyond the scope of this discussion.

TIME SERIES REGRESSION

Linear regression is useful for exploring the relationship of an independent variable that marks the passage of time to a dependent variable when the relationship is linear; that is, when there is an obvious downward, or upward, trend in the data over time.

However, if the trend of the dependent variable over time is not linear, then linear regression will not capture the relationship. Linear regression fails to capture seasonal, cyclical, and counter-cyclical trends in time series data. Neither does linear regression capture the effects of changes in direction of time series data, nor changes in the rate of change over time. For time series regression, it is important to obtain a plot of the data over time and inspect it for possible non-linear trends.

There is also a problem if the values at one point in the time series are determined or strongly influenced by values at a previous time. This is called auto-correlation. This occurs when the values of the dependent variable over time are not randomly distributed.
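One rough way to look for auto-correlation is to measure how strongly each value in a series is related to the value just before it (the lag-1 autocorrelation). The sketch below is an illustration only: the function name `lag1_autocorrelation` and both example series are hypothetical, and formal tests are not covered here. It is usually applied to the errors in prediction from the regression; values near zero suggest little auto-correlation, while values near +1 or -1 suggest a strong dependence on the previous period.

```python
def lag1_autocorrelation(values):
    """Correlation between each value in a series and the value one period earlier."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    numerator = sum(deviations[t] * deviations[t - 1] for t in range(1, n))
    denominator = sum(d ** 2 for d in deviations)
    return numerator / denominator


# Hypothetical series of prediction errors, ordered in time
drifting_errors = [-3, -2, -1, 0, 1, 2, 3]     # each error resembles the one before it
alternating_errors = [1, -1, 1, -1, 1, -1]     # each error is the opposite of the one before it

print(round(lag1_autocorrelation(drifting_errors), 3))      # 0.571
print(round(lag1_autocorrelation(alternating_errors), 3))   # -0.833
```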

Linear regression can be used with interrupted time series research designs. For example, say a policy is implemented to reduce the number of accidents among teenage drivers.

1. Data are gathered for at least 20 or 30 time periods (months or quarters) before the policy is implemented, and then for another 20 or 30 time periods after the policy is implemented.

2. One linear regression is performed for the accident rate data on the pre-policy time periods.

3. Another linear regression is performed for the accident rate data on the post-policy time periods.

4. There should be differences in the values of the constant, b coefficient, s.e.b, and r² for the two equations.

If there is a clear difference between the two equations, this suggests that the policy has had an effect. If all the data points (both pre- and post-) had been included in a single regression equation, the amount of variance explained (r²) would be quite low. This is because, if there is a change after the policy is introduced, the trend is no longer linear. Instead, there are two different linear trends, one before the policy was introduced, and another, different one after it was introduced.
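The interrupted time series procedure can be sketched as follows. Everything in this example is hypothetical: the function name `fit_line`, the 20 pre-policy and 20 post-policy months, and the accident counts, which are generated to rise before the policy and fall after it. It is meant only to show the mechanics of fitting two separate regressions and comparing them with a single pooled regression.

```python
def fit_line(xs, ys):
    """Return (a, b, r2) for the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2


def wiggle(t):
    """Small made-up fluctuation so the hypothetical data are not perfectly linear."""
    return 1 if t % 2 else -1


# Hypothetical monthly accident counts: 20 months before and 20 months after the policy
periods = list(range(1, 41))
accidents = ([100 + 2 * t + wiggle(t) for t in range(1, 21)]      # rising before the policy
             + [200 - 3 * t + wiggle(t) for t in range(21, 41)])  # falling after the policy

pre = fit_line(periods[:20], accidents[:20])
post = fit_line(periods[20:], accidents[20:])
pooled = fit_line(periods, accidents)

for label, (a, b, r2) in [("pre-policy", pre), ("post-policy", post), ("pooled", pooled)]:
    print(f"{label}: a = {a:.1f}, b = {b:.2f}, r2 = {r2:.2f}")
```

In this made-up series the two separate equations have slopes of opposite sign and high r², while the pooled equation explains little of the variance, which is the pattern described above.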

In setting up the data for time series regression, the researcher must remember to number the years (or other time periods) consecutively from 1 to n. These are the values for the independent (X) variable. The value of the dependent variable is the accident rate. For example,

Independent Variable (X) - Year	Dependent Variable (Y) - Accident Rate
1	50,000
2	51,000
3	52,000
4	53,000
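As a sketch of this set-up, the Python below numbers the periods consecutively from 1 to n and fits a trend line to the four values in the table. The variable and function names are hypothetical, and the forecast for year 5 assumes the linear trend simply continues.

```python
# Dependent variable (Y): accident rate for each consecutive year, in time order
accident_rates = [50_000, 51_000, 52_000, 53_000]

# Independent variable (X): periods numbered consecutively from 1 to n
years = list(range(1, len(accident_rates) + 1))

n = len(years)
mean_x = sum(years) / n
mean_y = sum(accident_rates) / n

# Least squares slope and constant for Y = a + bX
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, accident_rates))
     / sum((x - mean_x) ** 2 for x in years))
a = mean_y - b * mean_x


def forecast(year_number):
    """Predicted accident rate for a given period number, assuming the trend continues."""
    return a + b * year_number


print(a, b)          # 49000.0 1000.0
print(forecast(5))   # 54000.0
```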
