GEOG 216-01

Locational Analysis

Lab 5: Simple Linear Correlation and Regression Analysis

====================

Purpose of the Lab:

This lab introduces you to simple linear correlation and regression as a method of discerning and testing relationships between data pairs measured at the interval or ratio levels. The last lab introduced you to the Chi-squared technique, which instead handles data at the nominal level of measurement.

====================

Type I and II Errors, Alpha, and Beta, Yet Again

By now, you're familiar with the concepts of Type I and Type II errors and how a researcher selects an alpha level (or, conversely, a confidence level) in such a way as to minimize the type of error that has the graver consequences. Since in locational analysis, the more serious error is usually the Type II, we set alpha rather generously to avoid the danger of missing an exploitable relationship. In this lab, we will utilize the 90 percent confidence level, thus setting alpha at 10 percent or 0.1. Executive decision.

====================

Null and Alternate Hypotheses

Again, the default hypotheses will be null, that is, that there is no significant relationship between any variable, X, and another variable, Y. You will reject that default unless the results are so extreme that you are confident that in only 1 in 10 possible samples could you have gotten results that extreme from random sorting.

====================

Procedures for Correlation

Below is a data set e-mailed to me by one of my former students who is now working as a location analyst for a marketing company in the Southwest. In exchange for my consultancy, he approved my use of these data in my classes. What we have are data purporting to be updates of the Census for 1997. The data subset I've shown below is demographic attributes for 25 Metropolitan Statistical Areas (MSAs) in California, plus data on average sales in "widget" stores (for confidentiality) per square foot of retail sales area per year for each of the 25 market areas. What you're going to do is run correlations and regressions for a few X variables and the Y variable of sales per square foot. Once more, you'll use Excel or another spreadsheet to help you see the structure of the formulae.

A helpful first step (but not required for this exercise) is to create X-Y graphs for each pair of variables. Sales per square foot, ranging from about $100 to about $175, will be on all the vertical, or Y, axes. For each community, find its X value (e.g., median household income, average household size, or median age) on the horizontal (X) axis and place a dot directly above that value at the height of its sales per square foot on the Y axis. The result will be a cloud of dots, a scattergram. "Eyeballing" the diagram will tell you quite a bit about each relationship right away:

The computational formula for the correlation co-efficient, r, is:

                              __       __    __
                              \        \     \ 
                            N /_ XY - (/_ X)(/_ Y)
     r =          __________________________________________
                 ___________________________________________
                /   __       __            __       __
               /    \   2    \    2        \   2    \    2
             \/ [ N /_ X  - (/_ X) ] x [ N /_ Y  - (/_ Y) ]

So, for each X-Y pair of variables, you need five columns: X, Y, XY, X squared, and Y squared. At the bottom of each column, you need to sum the 25 cells in the column.

Now, somewhere on the bottom that spreadsheet of yours, you need to multiply the sum of the XY column by 25 (N). Ditto with the sum of the X squares and of the Y squares. You also need to take the sum of the Xs and square it and the sum of the Ys and square that, too. Also, you have to multiply the sum of the Xs by the sum of the Ys. That should get you every term you need to do the formula above. Don't forget to take the square root of the denominator before dividing the numerator by it.

When you do the calculations, IF you've done them properly, you will wind up with a number ranging from 0 to positive or negative 1. A perfect direct relationship produces a correlation co-efficient of +1.00; a perfect inverse relationship produces an r of -1.00. A co-efficient of 0 means there is no relationship at all between the two variables. To test the significance of the correlation co-efficient, you can do a t test. In class, t was defined as r divided by the standard error of the r. A computational formula is:

                ____   
               /       
           r \/n - 2   
      t =  _________
               _____
              /    2
            \/1 - r 

You then compare your calculated t with the critical t you get from the t table (handed out in class), reading under the column for the selected alpha level (0.10) and across from your degrees of freedom (which is number of cases [25] minus degrees of freedom -- you lose one degree of freedom for each of the two variables you're correlating -- 23). If the calculated t is greater than the critical t for your chosen alpha and degrees of freedom, you reject the null hypothesis.

So, all that done, state in ordinary English what you found about the relationship of each of the X variables with the chosen Y.

====================

Procedures for Regression

You should remember that correlation itself implies nothing about the direction of causation: It just descriptively states the degree of relationship between two variables. Regression, on the other hand, allows analysis of a particular direction of causality. That is, based on logical, theoretical grounds, you state that a given variable (conventionally called Y) is dependent on the actions of other variables. The variable that acts to create variation in Y is deemed an independent variable (conventionally termed X). Regression analyzes the relationship between X and Y specifically for the degree of variation in Y produced by a given change in X. X changes, Y responds.

A first step in regression analysis is to determine the degree of variation in Y that can be attributed to variation in X. I'll spare you the details of dividing the variation in Y into a portion that can be assigned to X and another, unexplained portion. Essentially, squaring the correlation co-efficient gives you the amount of variation in Y explained by variation in X. So, r gives you the degree of association between X and Y, while r2 tells you how much explanatory power X has over Y. R2 is also known as the co-efficient of determination.

Next, you need to model the relationship between X and Y. That model takes the form:


     Y = a + bX

     where: 
           Y is the predicted value of Y given a particular value of X
           a is the Y intercept (where the regression line crosses the Y axis)
           b is the slope (negative or positive) of the line

So, you need to figure out b and then you can use it to figure out a.


             __        __    __ 
             \         \     \  
           N /_ XY  - (/_ X)(/_ Y) 
           _______________________
     b =       __         __ 
               \   2      \    2
             N /_ X  -  ( /_ X)


           _    _
     a =   Y - bX


Piece of cake. You can then, optionally, graph the regression line on your optional scattergram for each X and Y. This least-squares regression line is the best linear description of the trend between X and Y and, in more advanced analyses, can help you start accounting for the residuals (or vertical distance from each data point to the regression line). The residuals represent the variation in Y that is affected by something else than X. Sometimes there are patterns in the residuals that identify another X: You can eventually build elaborate multiple regression models linking more than one X variable in such a way as better to account for the variations in Y. Not now: We all have headaches.

====================

Format for Your Lab Report

For the lab, what I want for you to do is:

  1. Correlate median household income, mean household size, and median age with sales per square foot of widget retail space. Do your calculations at the full decimal capacity of your spreadsheet, but round the final answers to two decimal places of accuracy (e.g., 0.00).
  2. Test each correlation for significance at the 90 percent confidence level
  3. For those relationships deemed significant, perform a simple linear regression, calculating the co-efficient of determination, b, and a. Again, utilize the full power of your spreadsheet in calculations, but show your answers to two decimal places of accuracy.
  4. State in regular English what your analysis implies about the three pairs of relationships in terms of direction, strength, and significance.

You can show all of these (r, r2, b, and a) in a common table. You will only need a paragraph or so to state your conclusions. Your whole report, then, will fit on one page, though I'd like to have your spreadsheets, too, so that I can give partial credit for any messed up answers if I can find a minor error easily.

====================

Data Set

MSA                                MedHHinc    MedAge   AveHHsiz   SalSqFt/Yr

Los Angeles-Long Beach, CA PMS      $37,121      32.9       2.96      $133.96    
Riverside-San Bernardino, CA P      $36,271      32.3       2.96      $126.29    
San Diego, CA                       $40,709      33.2       2.71      $138.30    
Orange County, CA                   $51,294      33.8       2.88      $142.93    
Oakland, CA                         $51,098      35.6       2.61      $136.66    
San Francisco, CA                   $51,754      38.3       2.44      $153.27    
San Jose, CA                        $61,529      34.3       2.82      $165.04    
Sacramento, CA                      $41,831      34.9       2.60      $141.92    
Fresno, CA                          $30,646      31.2       3.01      $124.36    
Ventura, CA                         $52,492      33.9       3.01      $148.47    
Bakersfield, CA                     $31,598      31.5       2.95      $135.62    
Stockton-Lodi, CA                   $36,464      32.6       3.00      $129.61    
Vallejo-Fairfield-Napa, CA PMS      $48,828      34.1       2.80      $150.70    
Santa Rosa, CA                      $45,799      36.7       2.53      $145.00    
Modesto, CA                         $34,022      32.6       2.95      $137.48    
Santa Barbara-Santa Maria-Lomp      $42,130      34.0       2.75      $134.50    
Visalia-Tulare-Porterville, CA      $29,730      30.4       3.16      $120.97    
Salinas, CA                         $43,354      31.8       2.99      $130.60    
Santa Cruz-Watsonville, CA PMS      $49,672      35.3       2.69      $142.31    
San Luis Obispo-Atascadero-Pas      $38,120      35.1       2.54      $124.74    
Chico-Paradise, CA                  $28,889      35.3       2.48      $121.48    
Merced, CA                          $27,177      29.6       3.22      $120.49    
Redding, CA                         $30,700      36.6       2.53      $119.95    
Yolo, CA                            $36,666      31.5       2.64      $117.83    
Yuba City, CA                       $28,416      32.9       2.79      $127.95    
                                                    

====================

first placed on the web: 11/11/98
last revised: 11/11/98
© Dr. Christine M. Rodrigue

====================