Geography 200: INTRODUCTION TO RESEARCH METHODS FOR GEOGRAPHERS

Dr. Rodrigue

Graded Lab 9: Nearest-Neighbor Analysis and Bi-Variate Hypothesis Testing with Nominal Data

==========

This lab covers two techniques applicable to hypothesis testing with spatial data: nearest-neighbor analysis and Chi-squared quadrat analysis. For all questions, please do your calculations at the full precision of your spreadsheet or calculator, but round fractional answers to three decimal places of accuracy (i.e., 0.000) just before you write them down. Rounding, however, is not necessary for integer counts, just for calculated fractional answers.

==========

LAB EXERCISE A: Nearest-Neighbor Analysis

Examine the distribution of adult chaparral whitethorn shrubs, Ceanothus leucodermis, in Figure 1. The area mapped is a hillside in the Sulphur Mountain area of Ventura County (and, ooooo, is it stinky!). Employ nearest-neighbor analysis to ascertain whether, at this scale, this species has a clumped, uniform, or random distribution.

This isn't always as obvious as it seems, because human beings are "hard-wired" to see patterns even when there aren't any, to make Type I errors ("seeing" predators and running like crazy kept a lot of your distant ancestors alive, as opposed to thinking an actual predator was just a "random" pattern nearby: "Runs Like a Fool" left a lineage leading to us, while "Aw, You're Just Seeing Things" did not, having made a Type II error with deadly consequences, leaving us with a propensity to think we see patterns that don't really exist). Avoiding this all-too-human tendency is why alpha is set so low in hypothesis testing, e.g., 0.05 or 0.01.

Okay, let's set this up in formal hypothesis-testing language. After having looked at that map, Figure 1, would you say it is more uniform or clumped in distribution? State this as a directional working (or alternate) hypothesis:

______________________________________________________________________________  

What, then, would be the null hypothesis?

______________________________________________________________________________  

What do you think would be a healthy alpha for this hypothesis, a nice balance between the "seeing things that ain't there" Type I error and the "missing out on something that is there" Type II error?

______________________________________________________________________________  

Calculating the nearest-neighbor coefficient (R) entails the tedious process of measuring the distance between each point in a given space and the point that is its nearest neighbor. It should be noted that point a may well have point b as its nearest neighbor, but point b may have another point entirely, say, c, as its nearest neighbor. Anyhow, having measured all those nearest-neighbor distances, you figure out the mean nearest-neighbor distance and then create an expected mean nearest-neighbor distance from the density of the points in your study area. You then create R as the ratio of the mean observed nearest-neighbor distance to this expected mean nearest-neighbor distance.

R can vary from 0 to 2.149 (I've always liked that extra .149 bit!). A score of 0 means perfect clustering: All points are found at the exact same point in space (which is, of course, a physical impossibility if your study entails point data collected at one time). A score of 2.149 means perfect uniformity along a hexagonal lattice. A score of 1 represents perfectly random distribution of points in space. So, you can use this statistic to characterize a distribution as more clustered or more uniform or just random.

  1. What is n (i.e., how many plants are there in the area shown in Figure 1)?
         n = ________
    
    
  2. What is A (the size of the study area, measured in square meters)?
         A = ________
    
    
    
    
    
    
  3. What is P (the Perimeter of the study area, measured all along the edge of the study area in meters -- multiply the width of the study area in meters by 4, since the study area happens to be perfectly square)?
         P = ________
    
    
  4. What is D (the Density of C. leucodermis in numbers per unit of area, or n/A, or, in this case, number of plants divided by the size of the study area in square meters)?
         D = ________
    
    
  5. The tedious part: Measure and record the distance from each plant to its nearest-neighbor (remember, one plant can wind up the nearest-neighbor for more than one other plant). Actually, here are those measurements in meters already done for you (trying to make life a little easier on you) in the worksheet below. "#" stands for the individual plant identification number shown on the map, "nn" is the number of that plant's nearest neighbor, while "r" is the distance between each plant and its nearest neighbor.
    
          #          nn        r(m)    
         ==========================
          1           2        2.02
          2           4        1.77
          3           1        2.14
          4           2        1.77
          5           4        2.50
          6           3        2.55
          7           9        1.82
          8           9        0.90
          9           8        0.90
         10          11        3.81
         11          12        1.12
         12          11        1.12
         13          18        8.68
         14          15        2.02
         15          14        2.02
         16          15        2.80
         17          10        5.77
         18          19        3.02
         19          29        2.83
         20          22        3.54
         21          22        4.19
         22          21        4.19
         23          24        0.79
         24          23        0.79
         25          24        2.93
         26          28        1.90
         27          28        3.48
         28          29        1.90
         29          28        1.90
         30          28        6.27                                                                          
    
              
    
  6. Okay, in a rare fit of compassion, these data can be downloaded as an Open/LibreOffice spreadsheet by clicking this link: https://cla.csulb.edu/departments/geography/labs/data/nearnabedata.ods. Your browser will ask you which software to use if you just decide to open it. You can also choose to save it and THEN open it by firing up Open/LibreOffice. If you are working in the student lab, you should probably save it on your own flash drive and then find Open/LibreOffice on the Desktop or Start menu and have it open the file (assuming you remember where you saved it <G>).

  7. What is the ro (the observed mean nearest-neighbor distance)? This entails adding up all of those observed nearest-neighbor distances and then dividing them by n, or the number of plants. If you decide to use the OpenOffice spreadsheet, you can calculate this very easily by moving your cursor to cell C33 and typing the following: =average(c2:c31) and hitting Enter.
         _
         ro =  ________
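If you would rather script this step than use the spreadsheet, here is a minimal Python sketch (Python being my choice; the lab itself only mentions Open/LibreOffice) that averages the r(m) column from the step-5 worksheet:

```python
# The r(m) column from the step-5 worksheet, in plant-number order.
r = [2.02, 1.77, 2.14, 1.77, 2.50, 2.55, 1.82, 0.90, 0.90, 3.81,
     1.12, 1.12, 8.68, 2.02, 2.02, 2.80, 5.77, 3.02, 2.83, 3.54,
     4.19, 4.19, 0.79, 0.79, 2.93, 1.90, 3.48, 1.90, 1.90, 6.27]

n = len(r)            # number of plants
ro_bar = sum(r) / n   # observed mean nearest-neighbor distance, in meters
```

Rounding ro_bar to three decimal places gives the value to enter in the blank.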
    
    
    
    
    

  8. What is the re (the expected mean nearest-neighbor distance)? In English, you divide n by A (the study area), which gets you D (which you calculated above). Then, you take the square root of D. Multiply that square root by 2. Now, divide 1 BY that answer. If you're using OpenOffice, put your cursor in cell C34 and type the following: =1/(2*sqrt(D)), but don't type in D literally. Instead, substitute the answer you gave for D in question 4 above.
                     1
          _      _________
          re  =     ___
                 2 √ D
    
         _
         re  =  ________
                           
    
  9. Plot complication: Nearest-neighbor analysis in small study areas can be distorted by edge effects. The smaller the study area, the larger the number of points near the perimeter with true nearest-neighbors outside the perimeter, which won't be counted in your analysis. There is a fairly easy correction to the expected mean nearest-neighbor distance that compensates for edge effects. Remembering that P means the perimeter and n means the number of plants, add the expected mean nearest-neighbor distance to:
                              _
         P/n (0.0514 + 0.041/√n)
    
    
    Make sure to remember the algebraic order of operations. If you're doing this in OpenOffice, put your cursor in cell C35 and type =P/n*(0.0514+0.041/SQRT(n)), remembering not to type in P and n but the answers you got for P and n.

    So, the whole formula for the edge-corrected expected mean nearest-neighbor distance (rE) is:

                                          _
         rE = re + [ P/n (0.0514 + 0.041/√n ) ] 
    
    =  ________ 
    
    
    In OpenOffice, that would be =C34+C35
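As a cross-check on steps 8 and 9, here is a hedged Python sketch. The values A = 625, P = 100, and n = 30 are taken from the spreadsheet formula given in step 12; confirm they match your own answers to questions 1-3 before trusting the result.

```python
import math

A = 625.0   # study area in square meters (from the step-12 spreadsheet formula)
P = 100.0   # perimeter in meters (ditto)
n = 30      # number of plants (ditto)

D = n / A                         # density, plants per square meter
re_bar = 1 / (2 * math.sqrt(D))   # expected mean nearest-neighbor distance

# The step-9 edge correction, added to the uncorrected expectation:
edge = (P / n) * (0.0514 + 0.041 / math.sqrt(n))
rE_bar = re_bar + edge            # edge-corrected expected mean distance
```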

  10. What is R (the ratio of the observed to the edge-corrected expected mean nearest-neighbor distances)?
                 _            
                 ro
    
         R  =  _____  
                 _
                 rE  
    
    
            =  ________
    
    
    In OpenOffice, that would be =C33/C36

  11. Now, eyeballing that answer and re-reading the material at the beginning of the lab for interpreting R, how would you characterize the local distribution of Ceanothus leucodermis?
    ______________________________________________________________________________  
    
    ______________________________________________________________________________  
    
    ______________________________________________________________________________  
    
    

  12. In order to interpret your R value, you can perform a Z test. This is a statistical test that will tell you the probability that you could have gotten a value of R as far from 1.0 as yours under pure random sampling processes. In other words, this Z test will tell you how likely it is that your pattern is just random noise. To calculate Z, you need to calculate one more thing: The standard error, corrected for edge effects. Standard error is based on a measure of the internal variability in your data set. It is a common denominator in many different statistical tests, as you may have begun to notice. The one for nearest-neighbor analysis is:
          SEc = √[ 0.0703 (A/n^2) + 0.037 P √(A/n^5) ]

     where A is the study area, n is the number of points, and P is the perimeter.

    To do this in OpenOffice, type the following into cell C38 (here, 625 is A, 30 is n, and 100 is P): =SQRT(0.0703*(625/30^2)+0.037*100*SQRT(625/30^5))

    You can do this manually by working your way through the formula and paying close attention to the algebraic order of operations.

    SEc = __________

  13. To calculate Z, subtract the edge-corrected expected mean nearest-neighbor distance from the observed mean distance and then divide that answer by the standard error:
             _    _
             ro - rE
    
         Z = ________
               SEc
    
    
    If you're using OpenOffice, that would be =(C33-C36)/C38
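Steps 12 and 13 can be packaged as one small function. This is a sketch under my own naming (nn_z_score is not from the lab); the ro and rE arguments in the example call are placeholders, NOT the lab's answers, so substitute your own results from steps 7 and 9. A, n, and P are the values embedded in the step-12 spreadsheet formula.

```python
import math

def nn_z_score(ro, rE, n, A, P):
    """Edge-corrected standard error and Z score for nearest-neighbor analysis."""
    se_c = math.sqrt(0.0703 * (A / n**2) + 0.037 * P * math.sqrt(A / n**5))
    return (ro - rE) / se_c, se_c

# Placeholder ro and rE -- plug in your own answers here.
z, se_c = nn_z_score(ro=2.5, rE=2.3, n=30, A=625.0, P=100.0)
```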

  14. Now that you have Z, what do you do with it? You go to a Z table, such as this one shown in StatSoft: http://www.statsoft.com/textbook/sttable.html#z. What it does is tell you what the probability is that you could have gotten a Z value as extreme as yours if there were nothing but random chance going on.

    So, look up your Z score on that table and state the prob-value (don't forget to subtract the cell value from 0.5000 and then multiply by 2): __________
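The table lookup in step 14 can also be done numerically. The sketch below (my addition, using Python's standard math.erf) computes the two-tailed prob-value, which is exactly what "subtract the cell value from 0.5000 and then multiply by 2" accomplishes:

```python
import math

def two_tailed_p(z):
    """Probability of a standard-normal value at least as extreme as z (both tails)."""
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))   # P(Z <= |z|)
    return 2 * (1 - cdf)
```

For example, two_tailed_p(1.96) comes out at about 0.05, matching the familiar cutoff.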

  15. Now, looking at that prob-value, is your pattern significantly different than randomness (i.e., is your calculated prob-value smaller than the alpha you picked back up at the top of this lab section)?

    __________

  16. Having done this statistical test using Z, would you modify the conclusions you wrote up in #11 above? If so, restate your conclusions (the second chance to get it right plan!):
    ______________________________________________________________________________  
    
    ______________________________________________________________________________  
    
    ______________________________________________________________________________  
    
    

==========

LAB EXERCISE B: Chi-Squared Quadrat Analysis

You may recall that quadrat-based techniques of spatial analysis involve the division of an area into equal-sized plots, usually through a grid of squares. This permits the use of statistical techniques to analyze quantitative data with no more measurement sophistication than mere frequencies by category (nominal data). Nearest-neighbor analysis, by contrast, required the collection of data at the ratio level of measurement, which is the highest level.

For your reference pleasure, the definitional formula for Chi-squared is:

           r    k
          ___  ___
          \    \     (Oij - Eij)2
     X2 = /__  /__   ____________
          i=1  j=1        Eij

You'll be comforted to know I'll walk you through a much easier computational process.

  1. The first order of business, as usual, is to set up null and alternative hypotheses and an appropriate standard to judge whether the null hypothesis can be rejected for your purposes. So, eyeballing the map in Figure 2, formulate your hunch about the relationship between the distributions of the two plant species described below. Now, please state the null version of that hypothesis:
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    
  2. Now, this is a classic physical geography kind of study, so it will be driven mainly by some sort of scientific theoretical concern, testing the validity of some argument about niche-sharing or perhaps allelopathy. So, is the consequence of a Type I error likely to be life-threatening?
         _____ yes          _____ no
    
    
    
    
    
    Given that, would you set alpha at the high end (larger prob-value) or low end (smaller prob-value) of the scientific continuum of common alphas?
         _____ larger alpha          _____ smaller alpha
    
    
    On the other side, this is clearly not a marketing type of study, where a Type II error (missing a significant and exploitable relationship, if it exists) would have the more serious consequences. So, there is not much pressure to increase alpha: It is more important for theory that you not delude yourself into seeing significant relationships where there might be none. So, you want to select an alpha that is on the smaller side compared to, say, the marketing continuum and yet on the larger end of the scientific continuum. Balancing these concerns, which of the commonly used alpha levels is best able to suit your purposes?
    
         _____ 0.10          _____ 0.05          _____ 0.01
    
    
  3. Figure 2 shows the distribution of two plant species, Salvia apiana (white sage) and Avena barbata (slender oat). Characterize each of the larger quadrats (the ones labeled A1 or F9 or J5, for example) as belonging to one of the four quadrat types listed below. Be sure that you've accounted for 100 quadrats (10 x 10), and that no quadrat fits in more than one category. This can be a little tedious to keep track of, so be really careful from the "git-go" in creating your observed counts.
    Late breaking news: Here is a spreadsheet with the counts done, just because I'm a nice person sometimes: https://home.csulb.edu/~rodrigue/geog200/SalviaAvenaClassified.ods. Count the "a", "b", "c", and "d" in column C and place those counts in the appropriate cells below ("a", "b", "c", and "d").

    • (a) containing both Avena and Salvia;

    • (b) containing Avena but no Salvia;

    • (c) containing Salvia but no Avena; OR

    • (d) containing neither Avena nor Salvia.

  4. Now, with all 100 quadrats accounted for, each in no more than one category, fill in the following spreadsheet with quadrat counts in each of the four categories. Put your count numbers by category in the upper part of the appropriate cell. These are your observed or real-world frequencies.
    
                          |                  SALVIA                 |
                          |                    |                    |
                          |      present       |       absent       |   row totals
         _________________________________________________________________________
                          |(a)                 |(b)                 |-e-
               present    |                    |                    |
                          |                    |                    |
         AVENA   _________________________________________________________________
                          |(c)                 |(d)                 |-f-
               absent     |                    |                    |
                          |                    |                    |
         _________________________________________________________________________
                          |-g-                 |-h-                 |-i-
         column totals    |                    |                    |   
                          |                    |                    | n = 
    
    
    
  5. Compute the marginal totals. That is, sum the observed frequencies in each row and put those sums in the appropriate row total (e or f). Do the same for the frequencies in each column and put those sums in the appropriate column total (g or h). The sum of row totals should equal the sum of column totals. If so, put the total number or n (which had better equal 100) in cell i.

  6. Create the expected frequencies for each data cell (a through d). This is the distribution of cell counts you would expect from your data if there were no association between the two plant species (i.e., random processes were allocating them among the cells). To do this for each data cell, a through d, multiply the row total to its right by the column total below it and then divide the answer by n. Put the answer, rounded to three decimal places of accuracy, in its cell below the actual observed frequency.

    Still lost? Okay, okay. In other words, multiply cells e and g and divide the answer by cell i. Put the answer, properly rounded, in the lower part of cell a. Similarly, multiply cells e and h and divide by i, and put that answer in cell b. Multiply cells f and g and divide by i, and plop that answer in cell c. Lastly, multiply cell f by cell h, divide by i again, and put the result in cell d.

    That done, examine the expected frequencies. Chi-square should not be used if any expected frequencies are below 2 (or, irrelevantly in this case, if more than 20 percent of the data cells have fewer than 5 actual cases). You will note that there are no such problems with your contingency table, so you can safely proceed through Chi-square.
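The marginal-total and expected-frequency arithmetic in steps 5 and 6 is easy to script. This Python sketch uses hypothetical counts (NOT the Figure 2 data), so treat it as a template for checking your own table:

```python
# Hypothetical observed counts for cells a-d -- substitute your own.
a, b, c, d = 10, 40, 30, 20

e, f = a + b, c + d        # row totals (Avena present / absent)
g, h = a + c, b + d        # column totals (Salvia present / absent)
n = e + f                  # grand total; should equal g + h (and 100 here)

# Expected frequency for each cell: (row total * column total) / n
expected = {
    "a": e * g / n,
    "b": e * h / n,
    "c": f * g / n,
    "d": f * h / n,
}
```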

  7. Now, move on to the worksheet below for calculating Chi-squared. In the first column, enter the observed frequencies for each data cell (the number in the upper part of cells a through d).

  8. In the second column, square those frequencies.

  9. In the third column, enter the expected frequencies for each data cell (the number in the lower part of cells a through d).

  10. In the fourth column, divide each squared frequency by the corresponding expected frequency in the bottom of the appropriate data cell (a through d).

  11. Now, sum the fourth column and put the answer near the bottom of the spreadsheet (sum(O2/E)). Show your work here to three decimal places of accuracy.

  12. Finally, subtract n (from cell i) from that sum. This answer is your calculated Chi-squared (X2). Put it at the bottom of the whole spreadsheet, also rounded to three decimal places of accuracy.
         ________________________________________________________________________
    
         DATA CELL |   O    |    O2      |    E   |               O2/E
    
         ________________________________________________________________________
            (a)    |        |            |        |
         ________________________________________________________________________     
            (b)    |        |            |        |
         ________________________________________________________________________
            (c)    |        |            |        |
         ________________________________________________________________________
            (d)    |        |            |        |
         ________________________________________________________________________
                                                  |sum(O2/E) = 
         ________________________________________________________________________
                                                  |sum(O2/E) - n = X2 =
         ________________________________________________________________________
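The computational shortcut in steps 7 through 12 works because the expected frequencies sum to n. The sketch below (hypothetical counts again, not the Figure 2 data) checks the shortcut against the definitional formula given at the start of this exercise:

```python
O = [10, 40, 30, 20]            # observed frequencies, cells a-d (hypothetical)
E = [20.0, 30.0, 20.0, 30.0]    # matching expected frequencies
n = sum(O)

chi2_shortcut = sum(o * o / e for o, e in zip(O, E)) - n         # sum(O2/E) - n
chi2_definitional = sum((o - e) ** 2 / e for o, e in zip(O, E))  # sum((O-E)2/E)
```

The two agree whenever sum(E) equals n, which the marginal-total construction in step 6 guarantees.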
    
    
    
    
    
  13. Now, to interpret this hard-gained number, your X2calc, you need to compare it with a critical X2. To do this, you will need the Chi-squared table I distributed in class, the one suited to the classical approach to hypothesis testing. You can also use the Chi-squared table here. You need your pre-selected alpha level to pick the right column and the degrees of freedom for your 2 x 2 contingency table to choose the right row to enter the table. Degrees of freedom in Chi-squared can be defined as:
    
         DF = (r - 1)(k - 1)
         where r = number of rows and k = number of columns
    
    
    
    So, you will enter the table at the intersection of:
         the column headed ________ 
    
         and the row corresponding to ________ degrees of freedom.
    
    What, then, is your critical Chi-squared value?
    
         X2crit =  ________
    
    
    
  14. Is your X2calc ________ greater than or ________ less than the X2crit?

  15. If your actual, calculated Chi-square value is greater than the critical Chi-square, you may safely conclude that your pattern is not just a random one. In other words, there is a statistically significant probability that there is a real association of some sort between your variables (in this case, between the two plant species). If the calculated Chi-square value is less than the critical test value, the relationship probably is random. Can the null hypothesis of random association between these two plant species in this study area be rejected in this case?
    
         _____ reject Ho          _____ do not reject Ho
    
    
  16. It's always good etiquette, whenever possible, to calculate the prob-value of a Type I error, to express your faith in the null hypothesis, on the off chance that a reader may have compelling reasons to use a different standard of alpha than you chose. Unfortunately, M&M somehow messed up and did not put the oh-so-important 1 DF column in their prob-value table for Chi-squared. In a rare attack of generosity, I have decided to provide you the missing column by spending a lot of time with the probability calculator within Statistica, a very nice full-featured statistics package. Consult Figure 3 to get the missing column and tell me the probability that you could have gotten results as extreme as yours if there is but a random association between the two plant species. Even better, I've just created a Chi-squared prob-values table in OpenOffice for you (which even has a box you can modify to figure out prob-values that don't fall on the table): https://home.csulb.edu/~rodrigue/geog200/chisquareprobvalues.ods.
         ________ prob-value of Ho
    
    
  17. Plot complication. Chi-squared is notoriously sensitive to sample size. That is, the same percentages in each cell can appear significant in a big sample (large n) or insignificant in a small sample. It might help to assess the strength of a significant relationship, should the Chi-squared test find one. For that, you can use Yule's Q. Yule's Q, however, can only be calculated for contingency tables with no more than two rows and two columns (bigger tables can sometimes be collapsed into a 2 x 2 format, by combining rows and columns in some sort of logical way). Conveniently, this lab just happens to feature a 2 x 2 table.

    To calculate Yule's Q, multiply cells a and d and also cells b and c. Then, enter these multiplications into the following formula:

              ad - bc
         Q =  _______
              ad + bc
    
    
    So, what is the Q value for this lab? ________
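For step 17, Yule's Q is a one-liner. Here is a sketch (the function name is mine) run on hypothetical counts rather than the lab's actual cells:

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2 x 2 contingency table with data cells a, b, c, d."""
    return (a * d - b * c) / (a * d + b * c)

q = yules_q(10, 40, 30, 20)   # hypothetical counts, not the Figure 2 data
```

A perfect positive association, such as yules_q(5, 0, 0, 5), returns 1.0; swap the diagonal and it returns -1.0.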

  18. Now, what does it all MEAN? Basically, Yule's Q can vary from -1 to +1. The closer it is to 0, the weaker the relationship is. The closer it is to -1 or +1, the stronger the relationship is, whether inverse (negative) or direct (positive).

    Please interpret the results of Lab B, taking into consideration both Chi-squared and Yule's Q. What sort of ecological relationship, if any, exists between Salvia apiana and Avena barbata at this scale of analysis? Is it significant? How strong is the effect? What is the direction of the association?

         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    
    And that's a wrap for another lab, folks!

    ==========

    Figure 1 -- Map of Ceanothus leucodermis Plants

    [ map of ceanothus plants ]

    ==========

    Figure 2 -- Map of Oats and Sage (you might want to recopy these figures at 120 percent or so)

    [ map of oats and sage ]

    ==========

    Figure 3: p-Values for X2 (or, better, use the table here)

    
          X2     1 DF         X2     1 DF         X2     1 DF         X2     1 DF
    
         3.2    .0736        4.4    .0359        5.6    .0180        6.8    .0091
         3.3    .0692        4.5    .0339        5.7    .0170        6.9    .0086
         3.4    .0652        4.6    .0320        5.8    .0160        7.0    .0082
         3.5    .0614        4.7    .0302        5.9    .0151        7.1    .0077
         3.6    .0578        4.8    .0285        6.0    .0143        7.2    .0073
          3.7    .0544        4.9    .0268        6.1    .0135        7.3    .0069
         3.8    .0513        5.0    .0254        6.2    .0128        7.4    .0065
         3.9    .0483        5.1    .0239        6.3    .0121        7.5    .0062
         4.0    .0455        5.2    .0226        6.4    .0114        7.6    .0058
         4.1    .0429        5.3    .0213        6.5    .0108        7.7    .0055
         4.2    .0404        5.4    .0201        6.6    .0102        7.8    .0052
         4.3    .0381        5.5    .0190        6.7    .0096       >7.8   <.0050
    
    
    
    
    
    data collected by Dr. Rodrigue from the Probability Calculator within Statistica® at ALST, Inc., of Northridge, CA, 11/98.

    ==========

    first placed on the web: 11/26/98
    last revised: 04/24/17
    © Dr. Christine M. Rodrigue

    ==========