Geography 215: QUANTITATIVE METHODS

Dr. Rodrigue

Graded Lab 8: Nearest-Neighbor Analysis and Bi-Variate Hypothesis Testing with Nominal Data


This lab covers two techniques applicable to hypothesis testing with spatial data. One is nearest-neighbor analysis and the other is Chi-squared in quadrat analysis. For all questions, please do your calculations at the full capacity of your spreadsheet or calculator, but round your answers to three decimal places of accuracy (i.e., 0.000).


LAB EXERCISE A: Nearest-Neighbor Analysis

Examine the distribution of adult fetid lilac shrubs, Ceanothus foetidus (which I made up, by the way, so don't go running to the Munz flora in the excitement of learning about a new chaparral plant!) in Figure 1. The area mapped is a hillside in the Sulphur Mountain area of Ventura County (which does exist and, ooooo, is it stinky!). Employ nearest-neighbor analysis to ascertain whether, at this scale, this "species" has a clumped, uniform, or random distribution.

Calculating the nearest-neighbor co-efficient (R) entails the tedious process of measuring the distance between each point in a given space and the point that is its nearest neighbor. It should be noted that point a may well have point b as its nearest neighbor, but point b may have another point entirely, say, c, as its nearest neighbor. Anyhow, having measured all those nearest neighbor distances, you figure out the mean nearest-neighbor distance and then create an expected mean nearest-neighbor distance from the density of the points in your study area. You then create R as the ratio of the mean observed nearest-neighbor distance to this expected mean nearest-neighbor distance.

R can vary from 0 to 2.149 (I've always liked that extra .149 bit!). A score of 0 means perfect clustering: All points are found at the exact same point in space (which is, of course, a physical impossibility if your study entails point data collected at one time). A score of 2.149 means perfect uniformity. A score of 1 represents perfectly random distribution of points in space. So, you can use this descriptive statistic to characterize a distribution as more clustered or more uniform or just random. Unfortunately, I know of no significance test for the R ratio, so there's no way to test whether a given distribution departs significantly from randomness.
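For those who would rather let a computer handle the tedium, the arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name nearest_neighbor_ratio, the plot size, and the four plant coordinates are all made up for the example and have nothing to do with Figure 1, and the brute-force search ignores any edge effects at the boundary of the study area.

```python
import math

def nearest_neighbor_ratio(points, area):
    """Return (n, density, observed mean, expected mean, R) for (x, y) points."""
    n = len(points)
    # Distance from each point to its nearest neighbor (brute-force search).
    nearest = []
    for i, (xi, yi) in enumerate(points):
        distances = [math.hypot(xi - xj, yi - yj)
                     for j, (xj, yj) in enumerate(points) if j != i]
        nearest.append(min(distances))
    density = n / area                     # p = n / A
    r_obs = sum(nearest) / n               # observed mean nearest-neighbor distance
    r_exp = 1 / (2 * math.sqrt(density))   # expected mean under a random pattern
    return n, density, r_obs, r_exp, r_obs / r_exp

# Hypothetical data (NOT Figure 1): four plants on the corners of a
# 10 m square, inside a 400 square meter plot.
plants = [(5, 5), (15, 5), (5, 15), (15, 15)]
n, p, r_obs, r_exp, R = nearest_neighbor_ratio(plants, area=400.0)
print(n, round(p, 3), round(r_obs, 3), round(r_exp, 3), round(R, 3))
```

With these made-up plants, the observed mean distance is 10 m, the expected mean distance is 5 m, and R comes out at 2.0, which sits near the uniform end of the scale, as a regular grid should.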

  1. What is n (i.e., how many plants are there in the area shown in Figure 1)?
         n = ________
    
    
  2. What is A (the size of the study area, measured in square meters)?
         A = ________
    
    
  3. What is p (the density of C. foetidus in numbers per unit of area, or n/A, or, in this case, number of plants divided by the size of the study area in square meters)?
         p = ________
    
    
  4. The tedious part: Measure and record the distance from each plant to its nearest-neighbor (remember, one plant can wind up the nearest-neighbor for more than one other plant). Record your measurements in the worksheet below. "#" stands for the individual plant identification number shown on the map, while "r" is the distance between each plant and its nearest neighbor.
    
                #      r                 #      r                 #      r    
                                                                              
                1)  ________            11)  ________            21)  ________
                                                                              
                2)  ________            12)  ________            22)  ________
                                                                              
                3)  ________            13)  ________            23)  ________
                                                                              
                4)  ________            14)  ________            24)  ________
                                                                              
                5)  ________            15)  ________            25)  ________
                                                                              
                6)  ________            16)  ________            26)  ________
                                                                              
                7)  ________            17)  ________            27)  ________
                                                                              
                8)  ________            18)  ________            28)  ________
                                                                              
                9)  ________            19)  ________            29)  ________
                                                                              
               10)  ________            20)  ________            30)  ________
    
              
    
                 _
  5. What is the ro (the observed mean nearest-neighbor distance)?
         _
         ro =  ________
    
                 _
  6. What is the re (the expected mean nearest-neighbor distance)?

         \bar{r}_e = \frac{1}{2\sqrt{n/A}} = \frac{1}{2\sqrt{p}}
    
         _
         re =  ________
    
    
  7. What is R (the ratio of observed to expected mean nearest-neighbor distances)?
                 _            
                 ro
         R  =  _____  
                 _
                 re  
    
    
            =  ________
    
    
  8. On the basis of the R you calculated and the guidelines for interpreting it given at the top of Lab A, how would you characterize the local distribution of C. foetidus, at least at this particular scale of analysis?
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    


LAB EXERCISE B: Chi-Squared Quadrat Analysis

You may recall that quadrat-based techniques of spatial analysis involve the division of an area into equal-sized plots, usually through a grid of squares. This permits the use of statistical techniques to analyze quantitative data with no more measurement sophistication than mere frequencies by category (nominal data). Nearest-neighbor analysis, by contrast, requires the collection of data at the ratio level of measurement, which is the highest level.

For your reference pleasure, the definitional formula for Chi-squared is:

     \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

You'll be comforted to know I'll walk you through a much easier computational process.
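Before moving on to the easier route, it may help to see that the definitional formula translates almost word for word into Python. The sketch below is an illustration only: the function name chi_squared is my own label, and the observed and expected counts are hypothetical numbers chosen to keep the arithmetic tidy, not the counts from Figure 2.

```python
def chi_squared(observed, expected):
    """Definitional formula: sum over every cell of (O - E)^2 / E."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total

# Hypothetical 2 x 2 table (NOT the Figure 2 counts), n = 100.
observed = [[30, 20],
            [10, 40]]
# Expected counts for that table: row total * column total / n.
expected = [[20.0, 30.0],
            [20.0, 30.0]]

x2 = chi_squared(observed, expected)
print(round(x2, 3))  # 16.667
```

The nested loops are the double summation over rows (i) and columns (j), so the same function would work for a table of any size.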

  1. The first order of business, as usual, is to set up null and alternative hypotheses and an appropriate standard to judge whether the null hypothesis can be rejected for your purposes. So, eyeballing the map in Figure 2, formulate your hunch about the relationship between the distributions of the two plant species described below. Now, please state the null version of that hypothesis:
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    
  2. Now, this is a classic physical geography kind of study, so it will be driven mainly by some sort of scientific theoretical concern, testing the validity of some argument about niche-sharing or perhaps allelopathy. So, is the consequence of a Type I error likely to be life-threatening?
         _____ yes          _____ no
    
    
    Given that, would you set alpha at the high end (larger prob-value) or low end (smaller prob-value) of the scientific continuum of common alphas?
         _____ larger alpha          _____ smaller alpha
    
    
    On the other side, this is clearly not a marketing type of study, where a Type II error (missing a significant and exploitable relationship, if it exists) would have the more serious consequences. So, there is not much pressure to increase alpha: It is more important for theory that you not delude yourself into seeing significant relationships where there might be none. So, you want to select an alpha that is on the smaller side compared to, say, the marketing continuum and yet on the larger end of the scientific continuum. Balancing these concerns, which of the commonly used alpha levels is best able to suit your purposes?
         _____ 0.10          _____ 0.05          _____ 0.01
    
    
  3. Figure 2 shows the distribution of two (real) plant species, Salvia apiana (white sage) and Avena barbata (slender oat). Characterize each of the larger quadrats (the ones labeled A1 or F9 or J5, for example) as belonging to one of the four quadrat types that make up the table in step 4: both species present (a), Avena present but Salvia absent (b), Salvia present but Avena absent (c), or neither species present (d). Be sure that you've accounted for 100 quadrats (10 x 10), and that no quadrat fits in more than one category. This can be a little tedious to keep track of, so be really careful from the "git-go" in creating your observed counts.

  4. Now, with all 100 quadrats accounted for, each in no more than one category, fill in the following spreadsheet with quadrat counts in each of the four categories. Put your count numbers by category in the upper part of the appropriate cell. These are your observed or real-world frequencies.
    
                          |                  SALVIA                 |
                          |                    |                    |
                          |      present       |       absent       |   row totals
         _________________________________________________________________________
                          |(a)                 |(b)                 |-e-
               present    |                    |                    |
                          |                    |                    |
         AVENA   _________________________________________________________________
                          |(c)                 |(d)                 |-f-
               absent     |                    |                    |
                          |                    |                    |
         _________________________________________________________________________
                          |-g-                 |-h-                 |-i-
         column totals    |                    |                    |   
                          |                    |                    | n = 
    
    
    
  5. Compute the marginal totals. That is, sum the observed frequencies in each row and put those sums in the appropriate row total (e or f). Do the same for the frequencies in each column and put those sums in the appropriate column total (g or h). The sum of row totals should equal the sum of column totals. If so, put the total number or n (which had better equal 100) in cell i.

  6. Create the expected frequencies for each data cell (a through d). This is the distribution of cell counts you would expect from your data if there were no association between the two plant species (i.e., random processes were allocating them among the cells). To do this for each data cell, a through d, multiply the row total to its right by the column total below it and then divide the answer by n. Put the answer, rounded to three decimal places of accuracy, in its cell below the actual observed frequency.

    Still lost? Okay, okay. In other words, multiply cells e and g and divide the answer by cell i. Put the answer, properly rounded, in the lower part of cell a. Similarly, multiply cells e and h and divide by i, and put that answer in cell b. Multiply cells f and g and divide by i, and plop that answer in cell c. Lastly, multiply cell f by cell h, divide by i again, and put the result in cell d.

    That done, examine the expected frequencies. Chi-square should not be used if any expected frequencies are below 2 (or, irrelevantly in this case, if more than 20 percent of the data cells have expected frequencies below 5). You will note that there are no such problems with your contingency table, so you can safely proceed through Chi-square.

  7. Now, move on to the worksheet below for calculating Chi-squared. In the first column, enter the observed frequencies for each data cell (the number in the upper part of cells a through d).

  8. In the second column, square those frequencies.

  9. In the third column, divide each squared frequency by the corresponding expected frequency in the bottom of the appropriate data cell (a through d).

  10. Now, sum the third column and put the answer near the bottom of the spreadsheet (sum(O2/E)). Show your work here to three decimal places of accuracy.

  11. Finally, subtract n (from cell i) from that sum. This answer is your calculated Chi-squared (X2). Put it at the bottom of the whole spreadsheet, also rounded to three decimal places of accuracy.
         ________________________________________________________________________
    
         DATA CELL |     O     |       O2       |               O2/E
         ________________________________________________________________________
            (a)    |           |                |
         ________________________________________________________________________     
            (b)    |           |                |
         ________________________________________________________________________
            (c)    |           |                |
         ________________________________________________________________________
            (d)    |           |                |
         ________________________________________________________________________
                                                | sum(O2/E) = 
         ________________________________________________________________________
                               |         sum(O2/E) - n = X2 =
         ________________________________________________________________________
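The computational route in steps 5 through 11 can be sketched in Python as a way to check your hand arithmetic. Everything specific here is an assumption made for illustration: the function name chi_squared_shortcut is invented, and the counts 30, 20, 10, and 40 are hypothetical, not the answer to this lab. The letters a through i follow the cell labels in the contingency table.

```python
def chi_squared_shortcut(a, b, c, d):
    """Return (expected counts, sum(O^2/E), X^2) for a 2 x 2 table."""
    e, f = a + b, c + d              # row totals
    g, h = a + c, b + d              # column totals
    i = e + f                        # n, the grand total
    assert i == g + h, "row totals and column totals must sum to the same n"
    observed = {"a": a, "b": b, "c": c, "d": d}
    # Expected count = row total * column total / n (step 6).
    expected = {"a": e * g / i, "b": e * h / i,
                "c": f * g / i, "d": f * h / i}
    # Steps 7 to 10: square each observed count, divide by its expected count, sum.
    sum_o2_over_e = sum(observed[k] ** 2 / expected[k] for k in observed)
    # Step 11: subtract n to get the calculated Chi-squared.
    return expected, sum_o2_over_e, sum_o2_over_e - i

# Hypothetical counts (NOT the Figure 2 answer).
expected, total, x2 = chi_squared_shortcut(30, 20, 10, 40)
print(expected)
print(round(total, 3), round(x2, 3))  # 116.667 16.667
```

For these made-up counts the shortcut gives 16.667, the same value the definitional formula yields for the same table, which is the whole point of the shortcut.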
    
    
  12. Now, to interpret this hard-gained number, your X2calc, you need to compare it with a critical X2. To do this, you will need the Chi-squared table I distributed in class, the one suited to the classical approach to hypothesis testing. You need your pre-selected alpha level to pick the right column and the degrees of freedom for your 2 x 2 contingency table to choose the right row to enter the table. Degrees of freedom in Chi-squared can be defined as:
         DF = (r - 1)(k - 1)
         where r = number of rows and k = number of columns
    
    
    So, you will enter the table at the intersection of:
         the column headed ________ 
    
         and the row corresponding to ________ degrees of freedom.
    
    What, then, is your critical Chi-squared value?
         X2crit =  ________
    
    
  13. Is your X2calc ________ greater than or ________ less than the X2crit?

  14. If your actual, calculated Chi-square value is greater than the critical Chi-square, you may conclude, at your chosen alpha, that your pattern is not just a random one. In other words, the association between your variables (in this case, between the two plant species) is statistically significant. If the calculated Chi-square value is less than the critical test value, you cannot rule out that the relationship is just random. Can the null hypothesis of random association between these two plant species in this study area be rejected in this case?
         _____ reject Ho          _____ do not reject Ho
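The decision rule in steps 12 through 14 can be sketched as follows. This is a hedged illustration, not a substitute for the table distributed in class: the critical values below are the ones commonly tabulated for 1 degree of freedom, the function names are invented for the example, and the calculated Chi-squared of 4.2 is a hypothetical number.

```python
# Commonly tabulated critical Chi-squared values for 1 degree of freedom.
# Assumption: your class table agrees; use the class table for graded answers.
CRITICAL_1DF = {0.10: 2.706, 0.05: 3.841, 0.01: 6.635}

def degrees_of_freedom(rows, cols):
    """DF = (r - 1)(k - 1)."""
    return (rows - 1) * (cols - 1)

def decide(x2_calc, alpha, rows=2, cols=2):
    """Return (critical value, True if Ho is rejected) for a 1 DF table."""
    if degrees_of_freedom(rows, cols) != 1:
        raise ValueError("this sketch only carries the 1 DF critical values")
    x2_crit = CRITICAL_1DF[alpha]
    return x2_crit, x2_calc > x2_crit

# Hypothetical calculated Chi-squared of 4.2 (NOT this lab's answer).
print(decide(4.2, 0.05))  # (3.841, True)
print(decide(4.2, 0.01))  # (6.635, False)
```

Note how the same calculated value rejects the null hypothesis at one alpha and fails to at another, which is why alpha has to be chosen before looking at the result.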
    
    
  15. It's always good etiquette, whenever possible, to calculate the prob-value of a Type I error, to express your faith in the null hypothesis, on the off chance that a reader may have compelling reasons to use a different standard of alpha than you chose. Unfortunately, M&M somehow messed up and did not put the oh-so-important 1 DF column in their prob-value table for Chi-squared. In a rare attack of generosity, I have decided to provide you the missing column by spending a lot of time with the probability calculator within Statistica, a very nice full-featured statistics package. Consult Figure 3 to get the missing column and tell me the probability that you could have gotten results as extreme as yours if there is but a random association between the two plant species.
         ________ prob-value of Ho
    
    
  16. Plot complication. Chi-squared is notoriously sensitive to sample size. That is, the same percentages in each cell can appear significant in a big sample (large n) or insignificant in a small sample. It might help to assess the strength of a significant relationship, should the Chi-squared test find one. For that, you can use Yule's Q. Yule's Q, however, can only be calculated for contingency tables with no more than two rows and two columns (bigger tables can sometimes be collapsed into a 2 x 2 format, by combining rows and columns in some sort of logical way). Conveniently, this lab just happens to feature a 2 x 2 table.

    To calculate Yule's Q, multiply cells a and d and also cells b and c. Then, enter these multiplications into the following formula:

              ad - bc
         Q =  _______
              ad + bc
    
    
    So, what is the Q value for this lab? ________
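The Yule's Q formula above is short enough to sketch directly in Python. As before, the function name yules_q and the counts are hypothetical choices for illustration, not the counts from Figure 2. The second call multiplies every cell by 10 to show that Q, unlike Chi-squared, does not change when the same percentages come from a bigger sample.

```python
def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc) for a 2 x 2 table."""
    ad, bc = a * d, b * c
    if ad + bc == 0:
        raise ValueError("Q is undefined when ad + bc is zero")
    return (ad - bc) / (ad + bc)

# Hypothetical counts (NOT the Figure 2 answer).
q_small = yules_q(30, 20, 10, 40)
# Same percentages in each cell, ten times the sample size.
q_big = yules_q(300, 200, 100, 400)
print(round(q_small, 3), round(q_big, 3))  # 0.714 0.714
```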

  17. Now, what does it all MEAN? Basically, Yule's Q can vary from -1 to +1. The closer it is to 0, the weaker the relationship is. The closer it is to -1 or +1, the stronger the relationship is, whether inverse (negative) or direct (positive).

    Please interpret the results of Lab B, taking into consideration both Chi-squared and Yule's Q. What sort of ecological relationship, if any, exists between Salvia apiana and Avena barbata at this scale of analysis?

         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
         _________________________________________________________________________
    
    
    And that's that for another lab, folks!


    Figure 1 -- Map of Ceanothus foetidus Plants (or why they don't let me teach cartography!)

    [ map of ceanothus plants ]


    Figure 2 -- Map of Oats and Sage (you might want to recopy these figures at 120 percent or so)

    [ map of oats and sage ]


    Figure 3: p-Values for X2

         X2      1 DF         X2     1 DF         X2     1 DF          X2     1 DF
    
         3.2    .0736        4.4    .0359        5.6    .0180        6.8    .0091
         3.3    .0692        4.5    .0339        5.7    .0170        6.9    .0086
         3.4    .0652        4.6    .0320        5.8    .0160        7.0    .0082
         3.5    .0614        4.7    .0302        5.9    .0151        7.1    .0077
         3.6    .0578        4.8    .0285        6.0    .0143        7.2    .0073
         3.7    .0544        4.9    .0268        6.1    .0135        7.3    .0069
         3.8    .0513        5.0    .0254        6.2    .0128        7.4    .0065
         3.9    .0483        5.1    .0239        6.3    .0121        7.5    .0062
         4.0    .0455        5.2    .0226        6.4    .0114        7.6    .0058
         4.1    .0429        5.3    .0213        6.5    .0108        7.7    .0055
         4.2    .0404        5.4    .0201        6.6    .0102        7.8    .0052
         4.3    .0381        5.5    .0190        6.7    .0096       >7.8   <.0050
    
    
    data collected by Dr. Rodrigue from the Probability Calculator within Statistica® at ALST, Inc., of Northridge, CA, 11/98.
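For readers without access to a statistics package, the 1 DF column can be approximated with nothing but the Python standard library. This is a sketch resting on one stated fact: with 1 degree of freedom, Chi-squared is the square of a standard normal variable, so the upper-tail probability can be written with the complementary error function. The function name p_value_1df is my own, and the sketch does not extend to tables with more degrees of freedom.

```python
import math

def p_value_1df(x2):
    """Upper-tail probability of Chi-squared with 1 degree of freedom."""
    # P(Z^2 > x2) = P(|Z| > sqrt(x2)) = erfc(sqrt(x2 / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x2 / 2))

# Figure 3 lists .0455 for a Chi-squared of 4.0.
print(round(p_value_1df(4.0), 4))  # 0.0455
```

Values between the rows of Figure 3 can be read off the same way, which saves interpolating by eye.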


    first placed on the web: 11/26/98
    last revised: 11/30/98
    © Dr. Christine M. Rodrigue