Regression
Elements of a Regression Equation
Assessing the Regression Equation
Steps in Linear Regression
Assumptions of Linear Regression
Time Series Regression
The most commonly used form of regression is linear regression, and the most common type of linear regression is called ordinary least squares regression.
Linear regression uses the values from an existing data set, consisting of measurements of two variables, X and Y, to develop a model for predicting the value of the dependent variable, Y, for given values of X.
The regression equation takes the form Y = a + bX + e, where
Y is the value of the dependent variable, what is being predicted or explained
a or alpha, a constant; equals the value of Y when the value of X=0
b or beta, the coefficient of X; the slope of the regression line; how much Y changes for each one-unit change in X
X is the value of the independent variable, what is predicting or explaining the value of Y
e is the error term; the error in predicting the value of Y, given the value of X (it is not displayed in most regression equations)
For example, say we know what the average
speed is of cars on the freeway when we have 2 highway patrols deployed
(average speed=75 mph) or 10 highway patrols deployed (average speed=35
mph). But what will be the average speed of cars on the freeway when we
deploy 5 highway patrols?
Average Speed on Freeway (Y) | Number of Patrol Cars Deployed (X) |
75 | 2 |
35 | 10 |
From our known data, we can use the regression formula (calculations not shown) to compute the values of a and b and obtain the following equation: Y = 85 + (-5) X, where
Y is the average speed of cars on the freeway
a=85, or the average speed when X=0
b=(-5), the impact on Y of each additional patrol car deployed
X is the number of patrol cars deployed
That is, the average speed of cars on the freeway when there are no highway patrols working (X=0) will be 85 mph. For each additional highway patrol car working, the average speed will drop by 5 mph. For five patrols (X=5), Y = 85 + (-5)(5) = 85 - 25 = 60 mph.
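The arithmetic above can be checked with a short sketch. The function name `fit_line` is an illustrative choice, not something from the text; it applies the standard least squares formulas for the slope and intercept to the two known data points and then predicts the speed for five patrols.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation in X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


patrols = [2, 10]    # X: number of patrol cars deployed
speeds = [75, 35]    # Y: average speed in mph

a, b = fit_line(patrols, speeds)
predicted = a + b * 5    # predicted average speed with 5 patrols

print(a, b, predicted)   # 85.0 -5.0 60.0
```

With only two data points the line passes through both of them exactly, so this reproduces the equation but says nothing yet about how well it would predict new observations.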
There may be some variations on how regression equations are written in the literature. For example, you may sometimes see the dependent variable term (Y) written with a little "hat" ( ^ ) on it, or called Y-hat. This refers to the predicted value of Y. The plain Y refers to observed values of Y in the data set used to calculate the regression equation.
You may see the symbols for alpha (a) and beta (b) written in Greek letters, or you may see them written in Latin letters. The coefficient of the independent variable may have a subscript, as may the term for X, for example, b₁X₁ (this is common in multiple regression).
We now have a regression equation. But how good is the equation at predicting values of Y, for given values of X? For that assessment, we turn to measures of association and measures of statistical significance that are used with regression equations.
r²
r² is a measure of association; it represents the proportion of the variance in the values of Y that can be explained by knowing the value of X (it is often reported as a percentage). r² varies from a low of 0.0 (none of the variance is explained) to a high of 1.0 (all of the variance is explained).
s.e.b
s.e.b is the standard error of the computed value of b. A t-test for statistical significance of the coefficient is conducted by dividing the value of b by its standard error. By rule of thumb, a t-value greater than 2.0 in absolute value (the sign only reflects the direction of the relationship) is usually statistically significant, but you must consult a t-table to be sure. If the t-value indicates that the b coefficient is statistically significant, this means that the independent variable or X (number of patrol cars deployed) should be kept in the regression equation, since it has a statistically significant relationship with the dependent variable or Y (average speed in mph). If the relationship were not statistically significant, the value of the b coefficient would be (statistically speaking) indistinguishable from zero.
F
F is a test for statistical significance of the regression equation as a whole. It is obtained by dividing the explained variance by the unexplained variance, each adjusted for its degrees of freedom. By rule of thumb, an F-value greater than 4.0 is usually statistically significant, but you must consult an F-table to be sure. If F is significant, then the regression equation helps us to understand the relationship between X and Y.
For our example above, say we obtained the following values:
r² = .9
Knowing the value of X (the number of patrol
cars deployed), we can explain 90% of the variance in Y (the average speed
of motorists on the freeway).
s.e.b = 1.5
Dividing b by s.e.b, we obtain a value for t
= -5/1.5 = -3.3. Consulting a t-table, we find that the coefficient is
statistically significant. This means that the independent variable X (number
of patrol cars deployed) should be kept in the regression equation, since
it has a statistically significant relationship with the dependent variable
Y (average speed in mph).
F= 8.4
From the F-table, we see that the regression
equation as a whole is statistically significant.
This means that the regression equation is helping us to understand the
relationship between X and Y.
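As a small sketch of the t calculation, the snippet below divides the slope from the patrol-car example by the standard error stated above and applies the rule of thumb. The variable names are illustrative, and the rule-of-thumb check is only a first screen; it does not replace looking up the t-table.

```python
b = -5.0      # slope from the patrol-car example
se_b = 1.5    # standard error of b, as stated above
r2 = 0.9      # r-squared, as stated above

# t is the coefficient divided by its standard error.
t = b / se_b

# The sign of t only reflects the direction of the relationship,
# so the rule of thumb is applied to its absolute value.
passes_rule_of_thumb = abs(t) > 2.0

print(f"t = {t:.1f}")
print(f"passes rule of thumb: {passes_rule_of_thumb}")
print(f"share of variance explained: {r2:.0%}")
```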
Example: The motor pool wants to know if it costs more to maintain cars that are driven more often.
Hypothesis: maintenance costs are affected by
car mileage
Null hypothesis: there is no relationship between
mileage and maintenance costs
Dependent variable: Y is the cost in dollars of
yearly maintenance on a motor vehicle
Independent variable: X is the yearly mileage
on the same motor vehicle
Data are gathered on each car in the motor pool,
regarding number of miles driven in a given year, and maintenance costs
for that year. Here is a sample of the data collected.
Car Number | Miles Driven (X) | Repair Costs (Y) |
1 | 80,000 | $1,200 |
2 | 29,000 | $150 |
3 | 53,000 | $650 |
4 | 13,000 | $200 |
5 | 45,000 | $325 |
The regression equation is computed as (computations not shown): Y = 50 + .03 X
For example, if X=50,000 then Y = 50 + .03 (50,000) = $1,550
a=50 or the cost of maintenance when X=0; if there is no mileage on the car, then the yearly cost of maintenance=$50
b=.03, the amount by which Y increases for each one-unit increase in X; for each extra mile driven (X), the cost of yearly maintenance increases by $.03
s.e.b = .0005; the value of b divided by s.e.b = 60.0; the t-table indicates that the b coefficient of X is statistically significant (it is related to Y)
r² = .90; we can explain 90% of the variance in repair costs for different vehicles if we know the vehicle mileage for each car
Conclusion: Reject the null hypothesis of no relationship
and accept the research hypothesis, that mileage affects repair costs.
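The statistics used in this example can be sketched in a few lines of Python. The function `simple_regression` is a hypothetical helper written for this illustration, using the standard formulas for a one-variable regression. It is run here only on the five sample cars listed in the table, so its results will differ from the equation above, which was computed from the data on the whole motor pool.

```python
import math


def simple_regression(xs, ys):
    """Fit y = a + b*x by least squares and return the usual summary statistics."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)       # total variation in Y

    b = sxy / sxx
    a = mean_y - b * mean_x

    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ssr = syy - sse                                            # explained
    mse = sse / (n - 2)         # unexplained variation per degree of freedom

    r2 = ssr / syy
    se_b = math.sqrt(mse / sxx)
    t = b / se_b
    f = ssr / mse               # explained (1 degree of freedom) over unexplained
    return {"a": a, "b": b, "r2": r2, "se_b": se_b, "t": t, "F": f}


# The five sample cars from the table above.
miles = [80_000, 29_000, 53_000, 13_000, 45_000]
costs = [1_200, 150, 650, 200, 325]

result = simple_regression(miles, costs)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With a single independent variable, F equals the square of t, so the two tests lead to the same conclusion.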
There are a number of advanced statistical
tests that can be used to examine whether or not these assumptions are
true for any given regression equation. However, these are beyond the scope
of this discussion.
However, if the trend of the dependent variable over time is not linear, then linear regression will not capture the relationship. Linear regression fails to capture seasonal, cyclical, and counter-cyclical trends in time series data. Neither does linear regression capture the effects of changes in direction of time series data, nor changes in the rate of change over time. For time series regression, it is important to obtain a plot of the data over time and inspect it for possible non-linear trends.
There is also a problem if the values at one point in the time series are determined or strongly influenced by values at a previous time. This is called autocorrelation. It occurs when the values of the dependent variable over time are not independent of one another, so that they are not randomly distributed around the trend.
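One common diagnostic for this problem, not named in the text above, is the Durbin-Watson statistic, computed from the residuals (errors) of a fitted regression. The sketch below is a minimal version under that assumption; the residual values are invented purely for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals, invented for illustration only.
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # errors carry over between periods
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # errors flip sign every period

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)

print(round(dw_persistent, 2))    # 0.67
print(round(dw_alternating, 2))   # 3.33
```

The statistic ranges from 0 to 4. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.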
If there is a difference between the two equations, then the policy has had an effect. If all the data points (both pre- and post-) had been included in the regression equation, the amount of variance explained (r²) would be quite low. This is because, if there is a change after the policy is introduced, the trend is no longer linear. Instead, there are two different linear trends, one before the policy was introduced, and another, different one after it was introduced.
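The pre- and post-policy comparison described above can be sketched as follows. All of the numbers are hypothetical, invented only to show the mechanics: a series that rises for four periods before a policy and falls for four periods after it. The helper `fit_with_r2` is likewise an illustrative name.

```python
def fit_with_r2(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    r2 = (sxy ** 2) / (sxx * syy)   # share of variance explained
    return a, b, r2


# Hypothetical data: periods 1-4 are pre-policy, periods 5-8 are post-policy.
periods = [1, 2, 3, 4, 5, 6, 7, 8]
values = [50, 52, 54, 56, 55, 54, 53, 52]

pre_a, pre_b, pre_r2 = fit_with_r2(periods[:4], values[:4])
post_a, post_b, post_r2 = fit_with_r2(periods[4:], values[4:])
all_a, all_b, all_r2 = fit_with_r2(periods, values)

print(f"pre-policy slope:  {pre_b:.2f}, r2 = {pre_r2:.2f}")
print(f"post-policy slope: {post_b:.2f}, r2 = {post_r2:.2f}")
print(f"pooled slope:      {all_b:.2f}, r2 = {all_r2:.2f}")
```

In this made-up series each separate equation fits its own segment well, while the single pooled equation explains very little of the variance, which is the pattern the paragraph above describes.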
In setting up the data
for time series regression, the researcher must remember to number the
years (or other time periods) consecutively from 1 to n. These are the
values for the independent (X) variable. The value of the dependent variable
is the accident rate. For example,
Independent Variable (X) - Year | Dependent Variable (Y) - Accident Rate |
1 | 50,000 |
2 | 51,000 |
3 | 52,000 |
4 | 53,000 |
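As a sketch of this setup, the snippet below fits a trend line to the four rows of the table, with the consecutively numbered years as X and the accident rate as Y. The function name `fit_line` and the projection to year 5 are illustrative additions; the projection assumes the linear trend simply continues.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


years = [1, 2, 3, 4]                       # X: time periods numbered from 1 to n
accidents = [50_000, 51_000, 52_000, 53_000]   # Y: accident rate

a, b = fit_line(years, accidents)
year_5 = a + b * 5    # projection, assuming the trend continues unchanged

print(a, b, year_5)   # 49000.0 1000.0 54000.0
```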