Geography 215: QUANTITATIVE METHODS

Dr. Rodrigue

Graded Lab 9: Simple Linear Correlation and Regression


For all questions, please do your calculations at the full capacity of your spreadsheet or calculator, but round your answers to three decimal places of accuracy (i.e., 0.000).


LAB EXERCISE A: Correlation Analysis (Pearson's r)

Like Chi-squared, correlation analysis determines whether two variables are associated with one another, whether positively or negatively. Unlike Chi-squared, simple linear correlation (Pearson's r) relates two variables measured at the interval or ratio level. Also, not only can Pearson's r tell you IF a relationship exists (like Chi-squared) but how STRONG that association is (Chi-squared requires doing a Yule's Q to get this information, which you can only use on a 2 x 2 table). The Pearson's r co-efficient ranges from -1 to +1, where a perfect, all-lined-up, direct relationship shows a co-efficient of +1 and a perfect inverse relationship shows a co-efficient of -1. The closer to 0 the co-efficient is, the more random the association.

Ideally, you can get started by simply graphing your two variables against one another on a common X-Y scattergram. For your first pair of data, you'd look up the X variable on the X axis and read straight up until you come to the height of the Y variable on the Y axis. At that intersection, you mark a point. You keep on going for as many X-Y pairs as you have. Examination of the scattergram can tell you a lot about the relationship: its direction, its strength, and its linearity. If the cloud of dots trends to the upper right of the graph, you have a direct relationship; if it trends down to the lower right, you have an inverse relationship. If the cloud is very narrow, like a cigar, it's a strong relationship; it's weaker if it looks more like a football; there is basically no discernible relationship if the cloud looks more like a soccer ball. If the relationship does appear to exist on the basis of this cloud's shape, examine it more closely to see if its longest axis falls roughly along a line. Pearson's r is for linearly related data: If the cloud has a bent shape, like a J or an L, it must be transformed mathematically to force it into a straight line so that Pearson's r will work with it (very commonly, you can linearize such data with logarithmic transformations). If the cloud is U shaped, though, you're out of luck. With the 10 cases we have in this exercise (professorial kindness), we are unlikely to detect these sorts of problems (non-linearity, heteroscedasticity), so we'll plow forward to illustrate the process of performing a simple correlation and regression analysis. Lucky you.

Below is a spreadsheet summarizing student performance and a variety of class and instructor attributes for ten randomly chosen sample sections of a GIS seminar given at the various Nationwide University campuses in California's industrial and office complexes. Each class section had ten students enrolled. How convenient the math!

The first column is the location of the classes. The second is average student performance for each of the ten class sections. These scores have been converted into a 1-5 point system (grades on a 1-5 scale, with 5 being best, an A, and 1 being worst, an F). The third column is the average number of hours spent by students in outside study and project preparation each week in each of the sections (based, of course, on student self-reporting, which may be a bit self-serving!). The fourth column is student perception of the ten instructors' receptiveness to students and willingness to provide individualized attention and support. The fifth column consists of student ratings of the courses in terms of the organization and clarity of the instructors and materials provided. The last column is the average number of years that students have spent working for their present employers (Nationwide University targets its classes to working professionals, managers, and technical people in industry).

     -------------------------------------------------------------------
     COURSE          STUDENT    OUTSIDE     INSTR-    ORGANI-   WORK EX-
     LOCATION         GRADES      STUDY    SUPPORT     ZATION   PERIENCE
                           Y         X1         X2         X3        X4     
                                                                        
     Bakersfield       3.638      2.154      4.154      3.369      4.338
     Fresno            4.018      2.682      3.288      3.676      3.776
     Irvine            4.241      2.193      4.367      3.671      4.133
     Los Angeles       4.278      2.719      3.931      4.147      4.378
     Riverside         3.615      2.366      3.949      3.163      3.956
     Sacramento        4.200      2.500      3.900      3.700      4.200
     San Diego         4.184      1.805      3.992      4.119      4.168
     San Francisco     4.040      3.071      3.494      2.849      3.877
     San Jose          4.011      2.310      4.182      3.372      3.787
     Van Nuys          3.932      2.511      4.063      2.995      3.842
     -------------------------------------------------------------------

Now, just because I'm a semi-compassionate person (I'd score pretty highly on X2?), I have also made this spreadsheet available to you on the web. Make sure you're on a PC with both Netscape (or similar graphical browser) and Excel. Click on this link, http://www.csuchico.edu/~lapaloma/geog215/215lab9.xls, and either Netscape will prompt you to save the file somewhere or will directly evoke Excel and plop you right in the spreadsheet. If this doesn't work, you can manually type it into your spreadsheet.

It's about time you became acquainted with the statistical functions of your spreadsheet (Excel for most, if not all, of you). Up until now, I've been having you calculate everything in terms of columns and rows, to mitigate the math phobia so many students have. Now, it's time to find out some other nifty things spreadsheets can do these days (and this should speed things up for you a lot). For each variable, calculate the indicated statistics, using these statistical functions.

You can do this a couple of different ways. First, you can type in the function, e.g., =stdev(b4:b13), in the cell where you want the answer to land. Alternatively, you can highlight that cell and click on the "fx" button on top ("function 'wizard'") and then choose the "statistical" function on the left menu and then pick whatever function you want to perform on the right menu that comes up (e.g., stdev for samples or stdevp for populations). A dialogue box will come up for you to input the range, on which you want the math performed (e.g., b4:b13). Hit "finish," and there you have it.

  1. Basic statistical functions on a spreadsheet.
         ------------------------------------------------------------------
         see below      STUDENT    OUTSIDE     INSTR-    ORGANI-   WORK EX-
         for Excel       GRADES      STUDY    SUPPORT     ZATION   PERIENCE
         definitions          Y         X1         X2         X3         X4     
                        
         minima         _______    _______    _______    _______    _______
    
         maxima         _______    _______    _______    _______    _______
    
         means          _______    _______    _______    _______    _______
    
         medians        _______    _______    _______    _______    _______
    
         std dev        _______    _______    _______    _______    _______
    
         variances      _______    _______    _______    _______    _______
    
         ------------------------------------------------------------------
    
         Excel formulas for the cells in which you'll put your results:
         
              minima       =min(b4:b13) or whatever your range is          
                        
              maxima       =max(b4:b13)  
                        
              means        =average(b4:b13)  
                        
              medians      =median(b4:b13)  
                        
              std dev      =stdev(b4:b13) 
                        
              variances    =var(b4:b13)  
    
         Punctuation is very important!
    
         And don't forget you can block copy one formula across the row!
    
         ------------------------------------------------------------------
    
    
  2. Now, you can perform the following correlation functions and t-tests. I'll spare you from doing all ten possible cross pairs: You need only do the following three, just to get the hang of it. Remember, since correlation measures association without causation, it doesn't really make any difference which variable you deem X and which Y. Since this is just an exploratory study, let's set alpha at 0.10 and do two- tailed tests. When you hit the t table, you'll lose one degree of freedom for each variable being correlated (n-2). So, please calculate the correlation co-efficients (and the t statistics) for:
         ---------------------------------------------------------------------------------
    
                                                       r              t            prob
    
         student grades and outside study           _______        _______        _______
    
         work experience and outside study          _______        _______        _______
    
         instructor support and work experience     _______        _______        _______
    
         --------------------------------------------------------------------------------     
    
         Excel formulas for the cells in which you're going to put your results:
     
              r  =correl(c4:c13,b4:b13)
    
              t  =(b16*(sqrt(10-2)))/sqrt(1-b16^2)
                 b16 is used here to mean whichever cell you put r in, and you may
                 well have designed your spreadsheet to put it in some other cell
    
         Again, punctuation is critical and you can block copy to save typing.             
    
    
  3. Now, use the two-tailed t table in M & M, Table D, to determine the critical value for t at the pre-selected alpha and the appropriate degrees of freedom. Which, if any, of the three associations is significant at the chosen alpha level? That is, which of the tcalc is larger than the tcrit taken from the table?
    
         _________________________________________________________________________
    
    
  4. To calculate the probability of getting results as extreme as these through random sampling, go to M & M Table C (Student's t Distribution), which gives you the area under the t distribution from the mean to your t score (just like the normal table given in Table A). Round your t score to one decimal place of accuracy (e.g., 0.0), because that's all the detail there is on that table. Look up your rounded t score on the left axis across to 8 degrees of freedom. Subtract the number from 5,000 and then divide it by 10,000 to get the prob-value expressed as a probability. Enter the prob-value in the table above.

  5. In light of these findings, please state in English what you've learned about these three potential associations.
    
         _________________________________________________________________________ 
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    


LAB EXERCISE B: Simple Linear Regression Analysis

Unlike correlation, regression tests for causal importance and models the degree of influence that one independent variable, X, has over another dependent variable, Y. That is, it measures the degree to which variation in X explains variation in Y. You can come up with such statements as "Variable 1 explains 47 percent of the variation in Variable 2."

This means that you have to set up hypotheses expressing your sense of the direction of causality. These hunches are grounded in theory or in plain common sense where theory is not yet well-developed.

Let's use student performance as the dependent, Y, variable. Let us then try out each of the other variables as independent variables, X1, X2, X3, and X4, proposed as explanations for student performance. Let us further assume that NU is doing an exploratory study and would be more uncomfortable with a Type II error. In other words, let us choose a somewhat generous alpha, 0.10 for our testing standard.

Since, regression (unlike correlation) entails specifying an expected direction, we will be using a one-tailed test. A common test statistic for the co-efficient of determination (regression co-efficient) is F, which is designed for one-tailed tests. To enter an F table, you need to know WHICH table to enter (0.10) and two different figures for degrees of freedom (remember that bit from the ANOVA lab?). For a test of r2, DF1 is equal to k or the number of independent variables in any one test (in this case, 1 for each of the 4 tests) and DF2 is equal to the number of cases (10) minus k (1) minus 1, or 8 in all.

In plain English, please state your expectations of the relationships between each X and the single Y. State them formally as alternate or working hypotheses and as null hypotheses. For each alternate hypothesis, also state why you expect what you do.

  1. Outside study (X1) and student grades (Y):
         Ha:
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         Ho:
    
         _________________________________________________________________________
    
    
  2. Instructor support ((X2) and student grades (Y):
         Ha:
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         Ho:
    
         _________________________________________________________________________
    
    
  3. Organization of instructor and materials (X3) and student grades (Y):
         Ha:
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         Ho:
    
         _________________________________________________________________________
    
    
  4. Years of work experience (X4) and student grades (Y):
         Ha:
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         Ho:
    
         _________________________________________________________________________
    
    
  5. That done, calculate for each Xi-Y pair, r, r2, b, a, and F. Put an asterisk next to any Fcalc that is larger than the Fcrit taken from the F table I handed out in class for alpha=0.10, with DF1=1 and DF2=8. I will spare you the prob-value "good manners" step, because you neither have access to F tables expressed as prob-values nor access to my favorite full- blown statistical program with a probability calculator built in (Statistica®).

         -------------------------------------------------------------------
    
                     r          r2         b          a          F        *?
    
         X1-->Y   _______    _______    _______    _______    _______    ___ 
    
         X2-->Y   _______    _______    _______    _______    _______    ___ 
    
         X3-->Y   _______    _______    _______    _______    _______    ___ 
    
         X4-->Y   _______    _______    _______    _______    _______    ___ 
     
         -------------------------------------------------------------------
    
         Excel formulas for regression elements
      
              r    =correl(b4:b13,c4:c13)
    
              r2   =b16^2
                   b16 is used here to symbolize wherever you stuck your r calculation
    
              b    =slope(b4:b13,d4:d13)     
                   It is very important that your Y variable be the first array listed
                   Follow the Y array by a comma and then the X array
    
              a    =intercept (b4:b13,e4:e13)
                   Again, the Y array has to come first
    
              F    =(b20*(10-2))/(1-b20)
                   b20 is used to symbolize wherever you put your r2
    
         Punctuation is really important here, folks  
    
    
  6. Interpret these results in terms of your original pairs of hypotheses. Were you able to reject any null hypotheses? If so, which one(s)?
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    
  7. On the basis of the relationship(s) that proved significant and those that did not, formulate recommendations for course developers at NU in terms of the factor(s) they should emphasize in new course development. That is, on the basis of your analysis, to which of the following factors should they pay close attention and which can they pretty much overlook without affecting the program?

         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    

  8. For any X-Y pair(s) that turned out significant at the chosen alpha, fill in your regression model in the form:
              Y = a + bX     
    
              Y = _____ + _____X  (don't forget the sign for b)
    
    
    
    
    
              There, you've built what may be your first regression model(s)! 
    
    
    At this point, many students become concerned, saying they have calculated a and b but they don't know how to calculate X and Y. Calm down: You've created a model of the relationship between X and Y. What can you do with this? You can select any value of X and use the model to predict what the expected value of Y should be, if no other factors were muddling up this X-Y interaction. You'll get to fiddle with this predictive function of simple linear regression a little farther down.

  9. Also draw an X-Y graph, showing the regression line you fitted to those data. Only the line is necessary (the scatterplot is optional). Start the line at "a" (the Y intercept). Then, draw a point on the line elsewhere on the graph (the intersection of the mean X and the mean Y will do nicely). Now, connect the two dots and extend the line to the rightmost margin of your graph. Piece of cake. Oh, and here's a template for any graph(s) you need to draw.
              Y
                   
                   5  +---------+---------+---------+---------+---------+  
                      |         |         |         |         |         | 
                      |         |         |         |         |         |    
                      |         |         |         |         |         | 
                      |         |         |         |         |         | 
                   4  +---------+---------+---------+---------+---------+  
              s       |         |         |         |         |         | 
              t       |         |         |         |         |         | 
              u       |         |         |         |         |         | 
              d       |         |         |         |         |         | 
              e    3  +---------+---------+---------+---------+---------+    
              n       |         |         |         |         |         | 
              t       |         |         |         |         |         | 
                      |         |         |         |         |         | 
              g       |         |         |         |         |         | 
              r    2  +---------+---------+---------+---------+---------+    
              a       |         |         |         |         |         | 
              d       |         |         |         |         |         | 
              e       |         |         |         |         |         | 
              s       |         |         |         |         |         | 
                   1  +---------+---------+---------+---------+---------+    
                      |         |         |         |         |         | 
                      |         |         |         |         |         | 
                      |         |         |         |         |         | 
                      |         |         |         |         |         | 
                   0  +---------+---------+---------+---------+---------+    
                      0         1         2         3         4         5
    
                                                                             X
    
         Name of Xi Variable:  __________________________________________
    
  10. Eyeball your graph(s). For a quick look at the concept of expected values and residuals, select a graph and tell me the name of the X variable.
         _________________________________________________________________________
    
    
  11. Locate the value 3.7 on the X axis. Read up from that value until you hit the regression line. Then, read across to the Y axis and tell me what the expected average numerical student grade would be as precisely as you can.
         _______
    
    
  12. Of course, other factors might cause a given section of students with an average X score of 3.7 to get a different average grade than would be expected from your regression model. Is there any campus with an average score of 3.7? If so, what was the actual average grade for that group (in numbers, not letter grades)?
         _______
    
    
  13. Subtract the expected score from the actual score. This vertical difference between the scores is called the residual. Analysis of the residuals can often suggest other variables that can be linked with a significant X to build a multiple-regression model with even greater ability to account for the variations in Y. That, however, will have to wait for another semester!
         _______
    
    


first placed on the web: 11/27/98
last revised: 11/29/98
© Dr. Christine M. Rodrigue