Dr. Rodrigue, Quant. Methods, Lab 9

Geography 215: QUANTITATIVE METHODS

Dr. Rodrigue

Graded Lab 9: Simple Linear Correlation and Regression

For all questions, please do your calculations at the full capacity of your spreadsheet or calculator, but round your answers to three decimal places of accuracy (i.e., 0.000).

LAB EXERCISE A: Correlation Analysis (Pearson's r)

Like Chi-squared, correlation analysis determines whether two variables are associated with one another, whether positively or negatively. Unlike Chi-squared, simple linear correlation (Pearson's r) relates two variables measured at the interval or ratio level. Also, not only can Pearson's r tell you IF a relationship exists (like Chi-squared) but how STRONG that association is (Chi-squared requires doing a Yule's Q to get this information, which you can only use on a 2 x 2 table). The Pearson's r co-efficient ranges from -1 to +1, where a perfect, all-lined-up, direct relationship shows a co-efficient of +1 and a perfect inverse relationship shows a co-efficient of -1. The closer to 0 the co-efficient is, the more random the association.

Ideally, you can get started by simply graphing your two variables against one another on a common X-Y scattergram. For your first pair of data, you'd look up the X variable on the X axis and read straight up until you come to the height of the Y variable on the Y axis. At that intersection, you mark a point. You keep on going for as many X-Y pairs as you have. Examination of the scattergram can tell you a lot about the relationship: its direction, its strength, and its linearity. If the cloud of dots trends to the upper right of the graph, you have a direct relationship; if it trends down to the lower right, you have an inverse relationship. If the cloud is very narrow, like a cigar, it's a strong relationship; it's weaker if it looks more like a football; there is basically no discernible relationship if the cloud looks more like a soccer ball. If the relationship does appear to exist on the basis of this cloud's shape, examine it more closely to see if its longest axis falls roughly along a line. Pearson's r is for linearly related data: If the cloud has a bent shape, like a J or an L, it must be transformed mathematically to force it into a straight line so that Pearson's r will work with it (very commonly, you can linearize such data with logarithmic transformations). If the cloud is U shaped, though, you're out of luck. With the 10 cases we have in this exercise (professorial kindness), we are unlikely to detect these sorts of problems (non-linearity, heteroscedasticity), so we'll plow forward to illustrate the process of performing a simple correlation and regression analysis. Lucky you.

Below is a spreadsheet summarizing student performance and a variety of class and instructor attributes for ten randomly chosen sample sections of a GIS seminar given at the various Nationwide University campuses in California's industrial and office complexes. Each class section had ten students enrolled. How convenient the math!

The first column is the location of the classes. The second is average student performance for each of the ten class sections. These scores have been converted into a 1-5 point system (grades on a 1-5 scale, with 5 being best, an A, and 1 being worst, an F). The third column is the average number of hours spent by students in outside study and project preparation each week in each of the sections (based, of course, on student self-reporting, which may be a bit self-serving!). The fourth column is student perception of the ten instructors' receptiveness to students and willingness to provide individualized attention and support. The fifth column consists of student ratings of the courses in terms of the organization and clarity of the instructors and materials provided. The last column is the average number of years that students have spent working for their present employers (Nationwide University targets its classes to working professionals, managers, and technical people in industry).

     -------------------------------------------------------------------
     COURSE          STUDENT    OUTSIDE     INSTR-    ORGANI-   WORK EX-
     LOCATION         GRADES      STUDY    SUPPORT     ZATION   PERIENCE
                           Y         X1         X2         X3        X4     
                                                                        
     Bakersfield       3.638      2.154      4.154      3.369      4.338
     Fresno            4.018      2.682      3.288      3.676      3.776
     Irvine            4.241      2.193      4.367      3.671      4.133
     Los Angeles       4.278      2.719      3.931      4.147      4.378
     Riverside         3.615      2.366      3.949      3.163      3.956
     Sacramento        4.200      2.500      3.900      3.700      4.200
     San Diego         4.184      1.805      3.992      4.119      4.168
     San Francisco     4.040      3.071      3.494      2.849      3.877
     San Jose          4.011      2.310      4.182      3.372      3.787
     Van Nuys          3.932      2.511      4.063      2.995      3.842
     -------------------------------------------------------------------

Now, just because I'm a semi-compassionate person (I'd score pretty highly on X2?), I have also made this spreadsheet available to you on the web. Make sure you're on a PC with both Netscape (or similar graphical browser) and Excel. Click on this link, http://www.csuchico.edu/~lapaloma/geog215/215lab9.xls, and either Netscape will prompt you to save the file somewhere or will directly evoke Excel and plop you right in the spreadsheet. If this doesn't work, you can manually type it into your spreadsheet.

It's about time you became acquainted with the statistical functions of your spreadsheet (Excel for most, if not all, of you). Up until now, I've been having you calculate everything in terms of columns and rows, to mitigate the math phobia so many students have. Now, it's time to find out some other nifty things spreadsheets can do these days (and this should speed things up for you a lot). For each variable, calculate the indicated statistics, using these statistical functions.

You can do this a couple of different ways. First, you can type in the function, e.g., =stdev(b4:b13), in the cell where you want the answer to land. Alternatively, you can highlight that cell and click on the "f_x" button on top ("function 'wizard'") and then choose the "statistical" function on the left menu and then pick whatever function you want to perform on the right menu that comes up (e.g., stdev for samples or stdevp for populations). A dialogue box will come up for you to input the range, on which you want the math performed (e.g., b4:b13). Hit "finish," and there you have it.

Basic statistical functions on a spreadsheet.

     ------------------------------------------------------------------
     see below      STUDENT    OUTSIDE     INSTR-    ORGANI-   WORK EX-
     for Excel       GRADES      STUDY    SUPPORT     ZATION   PERIENCE
     definitions          Y         X1         X2         X3         X4     
                    
     minima         _______    _______    _______    _______    _______

     maxima         _______    _______    _______    _______    _______

     means          _______    _______    _______    _______    _______

     medians        _______    _______    _______    _______    _______

     std dev        _______    _______    _______    _______    _______

     variances      _______    _______    _______    _______    _______

     ------------------------------------------------------------------

     Excel formulas for the cells in which you'll put your results:
     
          minima       =min(b4:b13) or whatever your range is          
                    
          maxima       =max(b4:b13)  
                    
          means        =average(b4:b13)  
                    
          medians      =median(b4:b13)  
                    
          std dev      =stdev(b4:b13) 
                    
          variances    =var(b4:b13)  

     Punctuation is very important!

     And don't forget you can block copy one formula across the row!

     ------------------------------------------------------------------

Now, you can perform the following correlation functions and t-tests. I'll spare you from doing all ten possible cross pairs: You need only do the following three, just to get the hang of it. Remember, since correlation measures association without causation, it doesn't really make any difference which variable you deem X and which Y. Since this is just an exploratory study, let's set alpha at 0.10 and do two- tailed tests. When you hit the t table, you'll lose one degree of freedom for each variable being correlated (n-2). So, please calculate the correlation co-efficients (and the t statistics) for:

     ---------------------------------------------------------------------------------

                                                   r              t            prob

     student grades and outside study           _______        _______        _______

     work experience and outside study          _______        _______        _______

     instructor support and work experience     _______        _______        _______

     --------------------------------------------------------------------------------     

     Excel formulas for the cells in which you're going to put your results:
 
          r  =correl(c4:c13,b4:b13)

          t  =(b16*(sqrt(10-2)))/sqrt(1-b16^2)
             b16 is used here to mean whichever cell you put r in, and you may
             well have designed your spreadsheet to put it in some other cell

     Again, punctuation is critical and you can block copy to save typing.

Now, use the two-tailed t table in M & M, Table D, to determine the critical value for t at the pre-selected alpha and the appropriate degrees of freedom. Which, if any, of the three associations is significant at the chosen alpha level? That is, which of the t_calc is larger than the t_crit taken from the table?
```
     _________________________________________________________________________
```
To calculate the probability of getting results as extreme as these through random sampling, go to M & M Table C (Student's t Distribution), which gives you the area under the t distribution from the mean to your t score (just like the normal table given in Table A). Round your t score to one decimal place of accuracy (e.g., 0.0), because that's all the detail there is on that table. Look up your rounded t score on the left axis across to 8 degrees of freedom. Subtract the number from 5,000 and then divide it by 10,000 to get the prob-value expressed as a probability. Enter the prob-value in the table above.

In light of these findings, please state in English what you've learned about these three potential associations.

     _________________________________________________________________________ 

     _________________________________________________________________________

     _________________________________________________________________________

LAB EXERCISE B: Simple Linear Regression Analysis

Unlike correlation, regression tests for causal importance and models the degree of influence that one independent variable, X, has over another dependent variable, Y. That is, it measures the degree to which variation in X explains variation in Y. You can come up with such statements as "Variable 1 explains 47 percent of the variation in Variable 2."

This means that you have to set up hypotheses expressing your sense of the direction of causality. These hunches are grounded in theory or in plain common sense where theory is not yet well-developed.

Let's use student performance as the dependent, Y, variable. Let us then try out each of the other variables as independent variables, X₁, X₂, X₃, and X₄, proposed as explanations for student performance. Let us further assume that NU is doing an exploratory study and would be more uncomfortable with a Type II error. In other words, let us choose a somewhat generous alpha, 0.10 for our testing standard.

Since, regression (unlike correlation) entails specifying an expected direction, we will be using a one-tailed test. A common test statistic for the co-efficient of determination (regression co-efficient) is F, which is designed for one-tailed tests. To enter an F table, you need to know WHICH table to enter (0.10) and two different figures for degrees of freedom (remember that bit from the ANOVA lab?). For a test of r², DF₁ is equal to k or the number of independent variables in any one test (in this case, 1 for each of the 4 tests) and DF₂ is equal to the number of cases (10) minus k (1) minus 1, or 8 in all.

In plain English, please state your expectations of the relationships between each X and the single Y. State them formally as alternate or working hypotheses and as null hypotheses. For each alternate hypothesis, also state why you expect what you do.

Outside study (X₁) and student grades (Y):

     H_a:

     _________________________________________________________________________

     _________________________________________________________________________

     H_o:

     _________________________________________________________________________

Instructor support ((X₂) and student grades (Y):

     H_a:

     _________________________________________________________________________

     _________________________________________________________________________

     H_o:

     _________________________________________________________________________

Organization of instructor and materials (X₃) and student grades (Y):

     H_a:

     _________________________________________________________________________

     _________________________________________________________________________

     H_o:

     _________________________________________________________________________

Years of work experience (X₄) and student grades (Y):

     H_a:

     _________________________________________________________________________

     _________________________________________________________________________

     H_o:

     _________________________________________________________________________

That done, calculate for each X_i-Y pair, r, r², b, a, and F. Put an asterisk next to any F_calc that is larger than the F_crit taken from the F table I handed out in class for alpha=0.10, with DF₁=1 and DF₂=8. I will spare you the prob-value "good manners" step, because you neither have access to F tables expressed as prob-values nor access to my favorite full- blown statistical program with a probability calculator built in (Statistica^®).

     -------------------------------------------------------------------

                 r          r²         b          a          F        *?

     X₁-->Y   _______    _______    _______    _______    _______    ___ 

     X₂-->Y   _______    _______    _______    _______    _______    ___ 

     X₃-->Y   _______    _______    _______    _______    _______    ___ 

     X₄-->Y   _______    _______    _______    _______    _______    ___ 
 
     -------------------------------------------------------------------

     Excel formulas for regression elements
  
          r    =correl(b4:b13,c4:c13)

          r2   =b16^2
               b16 is used here to symbolize wherever you stuck your r calculation

          b    =slope(b4:b13,d4:d13)     
               It is very important that your Y variable be the first array listed
               Follow the Y array by a comma and then the X array

          a    =intercept (b4:b13,e4:e13)
               Again, the Y array has to come first

          F    =(b20*(10-2))/(1-b20)
               b20 is used to symbolize wherever you put your r²

     Punctuation is really important here, folks

Interpret these results in terms of your original pairs of hypotheses. Were you able to reject any null hypotheses? If so, which one(s)?

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

On the basis of the relationship(s) that proved significant and those that did not, formulate recommendations for course developers at NU in terms of the factor(s) they should emphasize in new course development. That is, on the basis of your analysis, to which of the following factors should they pay close attention and which can they pretty much overlook without affecting the program?

Student self-reported outside study?
Instructors who are particularly supportive?
Organization of course materials and hiring organized individuals as instructors?
Students with more years in industry?

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

     _________________________________________________________________________

For any X-Y pair(s) that turned out significant at the chosen alpha, fill in your regression model in the form:
```
          Y = a + bX     

          Y = _____ + _____X  (don't forget the sign for b)





          There, you've built what may be your first regression model(s)! 
```
At this point, many students become concerned, saying they have calculated a and b but they don't know how to calculate X and Y. Calm down: You've created a model of the relationship between X and Y. What can you do with this? You can select any value of X and use the model to predict what the expected value of Y should be, if no other factors were muddling up this X-Y interaction. You'll get to fiddle with this predictive function of simple linear regression a little farther down.

Also draw an X-Y graph, showing the regression line you fitted to those data. Only the line is necessary (the scatterplot is optional). Start the line at "a" (the Y intercept). Then, draw a point on the line elsewhere on the graph (the intersection of the mean X and the mean Y will do nicely). Now, connect the two dots and extend the line to the rightmost margin of your graph. Piece of cake. Oh, and here's a template for any graph(s) you need to draw.

          Y
               
               5  +---------+---------+---------+---------+---------+  
                  |         |         |         |         |         | 
                  |         |         |         |         |         |    
                  |         |         |         |         |         | 
                  |         |         |         |         |         | 
               4  +---------+---------+---------+---------+---------+  
          s       |         |         |         |         |         | 
          t       |         |         |         |         |         | 
          u       |         |         |         |         |         | 
          d       |         |         |         |         |         | 
          e    3  +---------+---------+---------+---------+---------+    
          n       |         |         |         |         |         | 
          t       |         |         |         |         |         | 
                  |         |         |         |         |         | 
          g       |         |         |         |         |         | 
          r    2  +---------+---------+---------+---------+---------+    
          a       |         |         |         |         |         | 
          d       |         |         |         |         |         | 
          e       |         |         |         |         |         | 
          s       |         |         |         |         |         | 
               1  +---------+---------+---------+---------+---------+    
                  |         |         |         |         |         | 
                  |         |         |         |         |         | 
                  |         |         |         |         |         | 
                  |         |         |         |         |         | 
               0  +---------+---------+---------+---------+---------+    
                  0         1         2         3         4         5

                                                                         X

     Name of X_i Variable:  __________________________________________

Eyeball your graph(s). For a quick look at the concept of expected values and residuals, select a graph and tell me the name of the X variable.
```
     _________________________________________________________________________
```
Locate the value 3.7 on the X axis. Read up from that value until you hit the regression line. Then, read across to the Y axis and tell me what the expected average numerical student grade would be as precisely as you can.
```
     _______
```
Of course, other factors might cause a given section of students with an average X score of 3.7 to get a different average grade than would be expected from your regression model. Is there any campus with an average score of 3.7? If so, what was the actual average grade for that group (in numbers, not letter grades)?
```
     _______
```
Subtract the expected score from the actual score. This vertical difference between the scores is called the residual. Analysis of the residuals can often suggest other variables that can be linked with a significant X to build a multiple-regression model with even greater ability to account for the variations in Y. That, however, will have to wait for another semester!
```
     _______
```