Purpose of the Lab:
This lab introduces you to simple linear correlation and regression as a method of discerning and testing relationships between data pairs measured at the interval or ratio levels. The last lab introduced you to the Chi-squared technique, which instead handles data at the nominal level of measurement.
Type I and II Errors, Alpha, and Beta, Yet Again
By now, you're familiar with the concepts of Type I and Type II errors and how a researcher selects an alpha level (or, conversely, a confidence level) in such a way as to minimize the type of error that has the graver consequences. Since in locational analysis, the more serious error is usually the Type II, we set alpha rather generously to avoid the danger of missing an exploitable relationship. In this lab, we will utilize the 90 percent confidence level, thus setting alpha at 10 percent or 0.1. Executive decision.
Null and Alternate Hypotheses
Again, the default hypothesis will be the null: that there is no significant relationship between a variable, X, and another variable, Y. You will retain that default unless the results are so extreme that no more than 1 in 10 random samples could have produced results that extreme by chance alone.
Procedures for Correlation
Below is a data set e-mailed to me by one of my former students, who is now working as a location analyst for a marketing company in the Southwest. In exchange for my consultancy, he approved my use of these data in my classes. What we have are data purporting to be updates of the Census for 1997. The data subset shown below consists of demographic attributes for 25 Metropolitan Statistical Areas (MSAs) in California, plus data on average sales in "widget" stores (the product is disguised for confidentiality) per square foot of retail sales area per year for each of the 25 market areas. What you're going to do is run correlations and regressions for a few X variables and the Y variable of sales per square foot. Once more, you'll use Excel or another spreadsheet to help you see the structure of the formulae.
A helpful first step (but not required for this exercise) is to create X-Y graphs for each pair of variables. Sales per square foot, ranging from about $100 to about $175, will be on all the vertical, or Y, axes. For each community, find its X value (e.g., median household income, average household size, or median age) on the horizontal (X) axis and place a dot directly above that value at the height of its sales per square foot on the Y axis. The result will be a cloud of dots, a scattergram. "Eyeballing" the diagram will tell you quite a bit about each relationship right away:
The computational formula for the correlation co-efficient, r, is:
r = [N ΣXY - (ΣX)(ΣY)] / √( [N ΣX² - (ΣX)²] × [N ΣY² - (ΣY)²] )
So, for each X-Y pair of variables, you need five columns: X, Y, XY, X squared, and Y squared. At the bottom of each column, you need to sum the 25 cells in the column.
Now, at the bottom of that spreadsheet of yours, you need to multiply the sum of the XY column by 25 (N). Ditto with the sums of the X squares and of the Y squares. You also need to take the sum of the Xs and square it, and the sum of the Ys and square that, too. Also, you have to multiply the sum of the Xs by the sum of the Ys. That should give you every term you need for the formula above. Don't forget to take the square root of the denominator before dividing the numerator by it.
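If you want to check your spreadsheet arithmetic, the same column sums can be sketched in a few lines of Python. The five income and sales figures here are just the first five MSAs from the table at the end of the lab, standing in for the full 25; the exercise itself uses all 25.

```python
# Sketch of the computational formula for r, using the first five MSAs'
# median household income (X) and sales per square foot (Y) as
# illustrative data; the full exercise uses all 25 MSAs.
from math import sqrt

x = [37121, 36271, 40709, 51294, 51098]       # MedHHinc
y = [133.96, 126.29, 138.30, 142.93, 136.66]  # SalSqFt/Yr

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum of the XY column
sum_x2 = sum(xi ** 2 for xi in x)               # sum of the X-squared column
sum_y2 = sum(yi ** 2 for yi in y)               # sum of the Y-squared column

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(r, 4))
```

Note that the square root of the denominator is taken before dividing, exactly as the formula requires.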
When you do the calculations, if you've done them properly, you will wind up with a number between -1 and +1. A perfect direct relationship produces a correlation co-efficient of +1.00; a perfect inverse relationship produces an r of -1.00. A co-efficient of 0 means there is no relationship at all between the two variables. To test the significance of the correlation co-efficient, you can do a t test. In class, t was defined as r divided by the standard error of r. A computational formula is:
t = r √(n - 2) / √(1 - r²)

You then compare your calculated t with the critical t you get from the t table (handed out in class), reading under the column for the selected alpha level (0.10) and across from your degrees of freedom (the number of cases [25] minus 2 -- you lose one degree of freedom for each of the two variables you're correlating -- giving 23). If the calculated t is greater than the critical t for your chosen alpha and degrees of freedom, you reject the null hypothesis.
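As a sketch, suppose you had gotten an r of 0.65 from the 25 MSAs (that value is illustrative, not a result from the actual data). The critical t of 1.714 below is the standard two-tailed table value for alpha = 0.10 at 23 degrees of freedom:

```python
# Illustrative t test for the significance of r.
# r = 0.65 is a made-up example value, NOT a result from the lab data.
from math import sqrt

r = 0.65
n = 25                # number of MSAs
df = n - 2            # 23 degrees of freedom
t = (r * sqrt(n - 2)) / sqrt(1 - r ** 2)
critical_t = 1.714    # table value: alpha = 0.10 (two-tailed), df = 23
print(round(t, 3), t > critical_t)
```

Here the calculated t (about 4.10) exceeds the critical t, so with these made-up numbers you would reject the null hypothesis.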
So, all that done, state in ordinary English what you found about the relationship of each of the X variables with the chosen Y.
Procedures for Regression
You should remember that correlation itself implies nothing about the direction of causation: It just descriptively states the degree of relationship between two variables. Regression, on the other hand, allows analysis of a particular direction of causality. That is, based on logical, theoretical grounds, you state that a given variable (conventionally called Y) is dependent on the actions of other variables. The variable that acts to create variation in Y is deemed an independent variable (conventionally termed X). Regression analyzes the relationship between X and Y specifically for the degree of variation in Y produced by a given change in X. X changes, Y responds.
A first step in regression analysis is to determine the degree of variation in Y that can be attributed to variation in X. I'll spare you the details of dividing the variation in Y into a portion that can be assigned to X and another, unexplained portion. Essentially, squaring the correlation co-efficient gives you the proportion of the variation in Y explained by variation in X. So, r gives you the degree of association between X and Y, while r² tells you how much explanatory power X has over Y. r² is also known as the co-efficient of determination.
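For example, a hypothetical correlation co-efficient of 0.65 squares to about 0.42, meaning variation in X would account for roughly 42 percent of the variation in Y:

```python
# Illustrative only: squaring a hypothetical r to get the
# co-efficient of determination, r-squared.
r = 0.65
r_squared = r ** 2
print(f"r^2 = {r_squared:.4f}; X explains about {r_squared:.0%} of the variation in Y")
```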
Next, you need to model the relationship between X and Y. That model takes the form:
Y = a + bX

where:
  Y is the predicted value of Y given a particular value of X
  a is the Y intercept (where the regression line crosses the Y axis)
  b is the slope (negative or positive) of the line
So, you need to figure out b and then you can use it to figure out a.
b = [N ΣXY - (ΣX)(ΣY)] / [N ΣX² - (ΣX)²]

a = Ȳ - bX̄   (the mean of Y minus b times the mean of X)
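Notice that b uses the very same column sums you already computed for r, so there is no new bookkeeping. A sketch with the same illustrative five-MSA subset (the real exercise uses all 25):

```python
# Sketch of the least-squares slope (b) and intercept (a), reusing the
# column sums from the correlation step. The data are the first five
# MSAs' income and sales figures, standing in for the full data set.
x = [37121, 36271, 40709, 51294, 51098]       # MedHHinc
y = [133.96, 126.29, 138.30, 142.93, 136.66]  # SalSqFt/Yr

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y / n) - b * (sum_x / n)   # a = mean(Y) - b * mean(X)
print(f"Y = {a:.2f} + {b:.6f}X")
```

Once you have a and b, plugging any X into Y = a + bX gives the predicted sales per square foot for that community.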
Piece of cake. You can then, optionally, graph the regression line on your optional scattergram for each X and Y. This least-squares regression line is the best linear description of the trend between X and Y and, in more advanced analyses, can help you start accounting for the residuals (the vertical distances from each data point to the regression line). The residuals represent the variation in Y that is caused by something other than X. Sometimes there are patterns in the residuals that identify another X: You can eventually build elaborate multiple regression models linking more than one X variable in such a way as better to account for the variations in Y. Not now: We all have headaches.
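A quick sketch of computing residuals from such a fitted line (same illustrative five-MSA subset as above). A handy arithmetic check: the residuals from a least-squares line with an intercept always sum to essentially zero.

```python
# Residuals: observed Y minus the Y predicted by the regression line,
# fitted on the illustrative five-MSA subset used earlier.
x = [37121, 36271, 40709, 51294, 51098]       # MedHHinc
y = [133.96, 126.29, 138.30, 142.93, 136.66]  # SalSqFt/Yr

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y / n) - b * (sum_x / n)

# Vertical distance from each point to the line:
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print([round(e, 2) for e in residuals])
print(round(sum(residuals), 6))   # ~0 for any least-squares fit
```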
Format for Your Lab Report
For the lab, what I want you to do is:
You can show all of these (r, r2, b, and a) in a common table. You will only need a paragraph or so to state your conclusions. Your whole report, then, will fit on one page, though I'd like to have your spreadsheets, too, so that I can give partial credit for any messed up answers if I can find a minor error easily.
MSA                             MedHHinc  MedAge  AveHHsiz  SalSqFt/Yr
Los Angeles-Long Beach, CA PMS   $37,121    32.9      2.96     $133.96
Riverside-San Bernardino, CA P   $36,271    32.3      2.96     $126.29
San Diego, CA                    $40,709    33.2      2.71     $138.30
Orange County, CA                $51,294    33.8      2.88     $142.93
Oakland, CA                      $51,098    35.6      2.61     $136.66
San Francisco, CA                $51,754    38.3      2.44     $153.27
San Jose, CA                     $61,529    34.3      2.82     $165.04
Sacramento, CA                   $41,831    34.9      2.60     $141.92
Fresno, CA                       $30,646    31.2      3.01     $124.36
Ventura, CA                      $52,492    33.9      3.01     $148.47
Bakersfield, CA                  $31,598    31.5      2.95     $135.62
Stockton-Lodi, CA                $36,464    32.6      3.00     $129.61
Vallejo-Fairfield-Napa, CA PMS   $48,828    34.1      2.80     $150.70
Santa Rosa, CA                   $45,799    36.7      2.53     $145.00
Modesto, CA                      $34,022    32.6      2.95     $137.48
Santa Barbara-Santa Maria-Lomp   $42,130    34.0      2.75     $134.50
Visalia-Tulare-Porterville, CA   $29,730    30.4      3.16     $120.97
Salinas, CA                      $43,354    31.8      2.99     $130.60
Santa Cruz-Watsonville, CA PMS   $49,672    35.3      2.69     $142.31
San Luis Obispo-Atascadero-Pas   $38,120    35.1      2.54     $124.74
Chico-Paradise, CA               $28,889    35.3      2.48     $121.48
Merced, CA                       $27,177    29.6      3.22     $120.49
Redding, CA                      $30,700    36.6      2.53     $119.95
Yolo, CA                         $36,666    31.5      2.64     $117.83
Yuba City, CA                    $28,416    32.9      2.79     $127.95