CALIFORNIA STATE UNIVERSITY, LONG BEACH

GEOG 400
Geographical Analysis

Project: Multiple Regression Using a Spreadsheet

==========

The purpose of this lab is:

  • to introduce you to multiple regression, in which you build a model that explains variation in a dependent variable by reference to the weighted influence of two or more independent variables.
  • to familiarize you with OpenOffice Calc's multiple regression capacities (you may not have access to expensive statistical software in your professional future, so it's good to know that office software can perform some multivariate analysis, including the ubiquitous multiple regression)
  • to alert you to shortcomings in even the most official of datasets: caveat emptor
  • to give you practice in judging among several good models

This project has the following deliverables:

  • the answer sheet to this lab, printed, filled out, and autographed
  • six scatterplots with linear trendlines added in for all 46 records, showing the association between Y and each of the Xi variables
  • six scatterplots with linear trendlines added in for the database missing the outlier (n=45)

Your data set this time pertains to an important social issue that varies in context and impact geographically at a variety of scales, international, intranational, and intrastate: murder involving gun use. The data come from a variety of sources, including U.S. Census estimates of the population in 1997, the FBI Uniform Crime Report for 1997, the L.A. Times' report on the 2000 presidential election results, and the National Rifle Association reports on the status of gun control laws in 1999. The scale of analysis is intranational: variations from state to state.

Unfortunately, these data, the best available at the time I put the database together back in 2001, are buggy, a real shame given the importance of the issue and the demand for risk management policy. For example, I could not find NRA data on gun control status for years earlier than 1999; ideally, gun control data should be older than the dependent variable. We can proceed with this as a pædagogical exercise, hoping that there have not been significant changes in the stance various states have taken towards concealed-carry laws from 1996 or 1997 to 1999.

Another widely discussed problem is the flawed nature of the Uniform Crime Report. The FBI must depend on local police agencies to report crime broken down the way the FBI defines various crimes. Political agendas vary among police departments, which may affect their tendency to underreport or overreport crime rates in their jurisdictions. The 1997 data also include the District of Columbia but are missing several states, so n=46, instead of 51.

Yet another problem lies in the Census data. They are not hard counts in the sense of the 1990 and 2000 censuses (even the nose-counts are themselves subject to underrepresenting the population in large cities, which affects their share of Federal grants). The 1997 data are U.S. Census estimates of the population, based on the agency's mathematical projections. The per capita income and urbanization data, however, are for 2000 and 1999, respectively.

The L.A. Times' data on the 2000 presidential election represent the percentage of voters choosing Bush out of those voting for Bush, Gore, or Nader (Buchanan, Hagelin, other third party candidates, or write-in candidates were not included).

The NRA data are in an ordinal format. The degree of permissiveness toward concealed carry is not shown as a scalar continuum, which is what regression is based on, but as an ordinal ranking of difficulty in getting the right to carry a gun concealed on your person when you venture out into public space.

  • 0 stands for it's completely legal and no permit is needed (Vermont is the only state that does that, but it's not in this data set)
  • 1 for "shall issue" states (meaning you ask for a permit and you get it, but the police know you have it, as in Connecticut and Utah)
  • 2 means the state offers concealed carry permits but county sheriffs have the final say and there's a lot of variation within state (e.g., California)
  • 3 means that restrictions are extremely tight, as in Illinois and Missouri, and there is little to no chance of ever earning a permit (the state may not allow concealed carry at all).

Regression requires scalar (interval or ratio) data for Y variables. It assumes that X variables are scalar, too, but it is able to handle binary (yes/no, 0/1) X variables. Ordinal variables are not supposed to be used as X variables. A lot of data are ordinal, however, and people want to use them in regression modelling. Some people just go on ahead and use them anyway, especially if there are a lot of levels to the scale, but that's not really cricket. The proper way to handle them is to break ordinal X variables into "dummy variables." Yes, they're actually called "dummies."

This means that you convert each level of the ordinal variable into a separate binary variable (here, a state score of 3 would be considered a 1 and a state score of 0, 1, or 2 would be considered non-3, that is, 0). Regression is known to be robust against binary X variables, so you create a passel of them, one fewer than the levels of your scale. An ordinal scale with 4 levels (as here) would become 3 new binary variables: 3 (yes or no), 2 (yes or no), 1 (yes or no), but not specifying 0. This creates some "plot complications" in spreadsheet and statistical software output, though, and I'm going to have you go the easy, if dodgy, route and pretend that the NRA scale is a single "scalar" variable rather than converting it into 3 dummy variables.
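To make the conversion concrete, here is a minimal Python sketch of the dummy-variable idea (the state scores below are made up purely for illustration, not drawn from the lab's database):

```python
# Convert a 4-level ordinal scale (0-3) into 3 binary dummy variables,
# one per level, omitting level 0 as the reference category.
# The scores below are hypothetical, purely for illustration.
scores = [1, 3, 2, 1, 0]
levels = [1, 2, 3]  # one dummy per non-reference level

# Each record becomes a row of three 0/1 indicators:
# a score of 3 becomes [0, 0, 1]; the reference score 0 becomes [0, 0, 0].
dummies = [[1 if s == lvl else 0 for lvl in levels] for s in scores]
```

Each row of 0/1 indicators can then enter a regression as three ordinary binary X variables.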

So, with these caveats of mismatches among scales and timing and levels of measurement, let's work through an "everything but the kitchen sink" example anyhow just to see how you can use Calc to do multiple regression analysis.

==============================

Getting the Data

==========

The data are again in a spreadsheet formatted with variable names across the top. There are several columns: "A" is "State" (the record name); "B" is your Y variable: "Gunmrd97%" (percentage of the 1997 population murdered with guns). Your X1 through X6 variables are as follows: "C" is "%cen cty 99" (percentage of a state's 1999 population that resides in a central city); "D" is "%gbn Bush" (Y2K voters for Gore/Bush/Nader who voted for Bush, to operationalize the conservative vote); "E" is "2K per cap" (per capita income in 2000); "F" is "%>24:BA" (percentage of the 1997 population 25 or older which has completed at least a baccalaureate degree); "G" is "NRA carry99" (NRA data on the 1999 status of concealed-carry laws); and "H" is "% yngmen" (percentage of the 1997 population consisting of young men aged 15-24 years, who are disproportionately represented among arrestees for violent crimes in general).

Again, click https://home.csulb.edu/~rodrigue/geog400/gunmurder.ods to download the file. Whatever you do, make sure to pick the "Save as" option at some point and save the file to your flash drive (e.g., E:\gunmurder.ods) and/or e-mail the file to yourself. Remember, the student storage area gets purged regularly.

==============================

Hypothesis Formulation

==========

State what you think the relationship is going to be between each of the X variables and the Y variable: direct or inverse.

Y and X1:
______________________________________________________________________________  

Y and X2:

______________________________________________________________________________  

Y and X3:

______________________________________________________________________________  

Y and X4:

______________________________________________________________________________  

Y and X5:

______________________________________________________________________________  

Y and X6:

______________________________________________________________________________  
Because each of you may come to different expectations and be able to defend them theoretically, we'll evaluate the significance of these associations using two-tailed tests (half the alpha goes on each end of the normal curve to define two rejection regions). That is, the null version of all of these expectations will be "there is no significant association between Xi and Y."

Pick an alpha level that you think optimizes the hazards of getting all excited over random associations (Type I error) or failing to detect associations that might really be there in our buggy data (Type II error). Briefly defend your "truth standard." Think about whether this is a more exploratory study with weaker standards or a final polished study with Type I consequences for policy that require more rigorous standards.

alpha = __________

______________________________________________________________________________

______________________________________________________________________________

==============================

Building Simple Linear Models of Each of the Associations

==========

Let's have a look at the array of X variables to see if any of them are likely to be significantly associated with the percentage of gun murders. In cell b49, enter "r"; in b50, "r sq"; in b51, "t"; in b52, "b"; in b53, "a"; in b54, "r sq. adj."; and in b55, "prob-value."

Now, copy the following formulas in the following cells.

  • In c49, put in =CORREL($b$2:$b$47;c2:c47)
  • In c50, put in =C49^2
  • In c51, put in =(C49*(SQRT(46-2)))/SQRT(1-C49^2)
  • In c52, put in =SLOPE($B$2:$B$47;C2:C47)
  • In c53, put in =INTERCEPT($B$2:$B$47;C2:C47)
  • In c54, put in =1-(1-C50)*((46-1)/(46-1-1))
  • In c55, put in =TDIST(ABS(C51);44;2)

The $ signs force Calc to keep absolute references to the Y variable in column B, so when you copy the formulas to cover the other five X variables, you'll be correlating each of them to Y (instead of to the preceding X variable!).
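If you'd like to double-check Calc's arithmetic, the same statistics can be reproduced in a few lines of Python. This is just a sketch on made-up numbers, not part of the lab deliverables:

```python
import math

def simple_regression_stats(y, x):
    """Compute the same statistics as the Calc formulas above:
    r (CORREL), r squared, t, b (SLOPE), a (INTERCEPT), and adjusted r squared."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)                   # =CORREL(...)
    r2 = r ** 2                                      # r squared
    t = (r * math.sqrt(n - 2)) / math.sqrt(1 - r2)   # t statistic
    b = sxy / sxx                                    # =SLOPE(...)
    a = my - b * mx                                  # =INTERCEPT(...)
    r2_adj = 1 - (1 - r2) * ((n - 1) / (n - 1 - 1))  # adjusted r squared
    return {"r": r, "r2": r2, "t": t, "b": b, "a": a, "r2_adj": r2_adj}
```

With the real data you would pass in the Y column and one X column at a time, just as the spreadsheet formulas do.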

TDIST is a prob-value generator built into Calc. Thought you'd appreciate that after struggling with the t-tables in your basic statistics class! The first number in the parentheses is your tcalc value. You can enter it manually or, in this case, by reference to a cell where it was calculated. The next number after the semi-colon is degrees of freedom (n-2, losing one degree of freedom for each variable in your bivariate analysis). The last number in the parentheses is the number of tails to be used in the test (1 or, in this case, 2).

The ABS before (C51), by the way, means that Calc should calculate the prob-value on the absolute value of C51's contents, whether it's negative or positive. Spreadsheets can't handle negative t scores for some boneheaded reason and, since these are two-tailed tests you're doing, the absolute value will work just fine. Dealing with software not written by statisticians (as here) or cartographers (as with GIS packages) sometimes means you have to find creative workarounds for their inadequate programming. This ABS bit is an example.
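TDIST itself is a black box in Calc, but if you're curious what it is computing, here is a rough numerical stand-in in Python. The trapezoidal tail integration is just an illustration of the idea, not how Calc actually does it:

```python
import math

def t_density(x, df):
    """Student's t probability density function."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def tdist_two_tailed(t, df, steps=20000, upper=60.0):
    """Approximate P(|T| > |t|), i.e. Calc's =TDIST(ABS(t); df; 2),
    by trapezoidal integration of the upper tail, doubled for two tails."""
    t = abs(t)                                        # the ABS() step
    h = (upper - t) / steps
    area = 0.5 * (t_density(t, df) + t_density(upper, df))
    for i in range(1, steps):
        area += t_density(t + i * h, df)
    return 2 * area * h                               # both rejection regions
```

For a t score around 2 with 44 degrees of freedom, this lands near the familiar 0.05 boundary.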

Having written your formula in C49, copy c49 to d49:h49 by clicking on cell c49 and moving the cursor until it turns into a skinny crosshair, holding down the left mouse button, dragging the mouse to cell h49 and letting the button go. Do the same thing for c50 through c55.

Format the whole block from b49 to h55 as numbers at a consistent level of decimal accuracy, perhaps 3 or 4 decimal places.

==============================

Interpretation of Your Simple Linear Regressions

==========

Okay, now interpret your results. Which associations turned out significant at your chosen alpha level? Write down the prob-values for all six of your regressions and then put a check mark by any of them that are significant (prob-value < alpha). Which associations turned out to have the direction you had predicted (direct or inverse)? Write down two letters in the boxes provided: D or I for your expected direction, followed by the actual direction (D or I). So, you might have "D, I" as your answer for one of the boxes. Now, write up whether you were surprised at any of your results. What do you think is going on (other than buggy data)?


____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

==============================

Building a First Draft Multiple Regression Model

==========

Let's see if we can improve the explanation offered by any one significant X variable by adding other X variables in with it. You build multiple regression models either by throwing everything you have in there and then eliminating variables that do not contribute significantly to the overall effect (backwards elimination) or by adding one X variable at a time, noting whether it has added anything significant to the model and, if it does, adding another and another until there is no more significant improvement (forward stepwise).
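In skeleton form, backward elimination is just a loop. This Python sketch uses hypothetical fit_model and p_values helpers (named here purely for illustration, since the actual fitting happens in Calc) to show the logic:

```python
def backward_eliminate(x_vars, alpha, fit_model, p_values):
    """Repeatedly drop the least significant X variable until every
    survivor's prob-value is at or below alpha.
    fit_model and p_values are hypothetical stand-ins for whatever
    regression machinery you are using."""
    surviving = list(x_vars)
    while surviving:
        model = fit_model(surviving)            # refit with the survivors
        pvals = p_values(model, surviving)      # prob-value per X variable
        worst = max(surviving, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:               # everything left is significant
            break
        surviving.remove(worst)                 # prune the worst and refit
    return surviving
```

In this lab you play the role of the loop yourself, rerunning LINEST after each pruning.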

Let's do this by backward elimination. Let's start with everything thrown in, running the regression with all six X variables. We'll note the multiple correlation coëfficient and the adjusted coëfficient of multiple determination and compare them with the original best single factor model to see if the improvement in the model warrants keeping everything in there.

To run a multiple regression with everything, put your cursor someplace where you have room for an array function to put out a table (maybe B60). Hit the FX button in the upper left to set off an array function. Scroll down and select LINEST. For Y range, put in B2:B47. For X range, put in C2:h47 (you have to have all the X variables in an adjacent group of columns). For the next two boxes, enter 1. Then, hit OK.

An unlabelled heap of numbers and #NAs comes up. Here is what they mean:

b6           b5           b4           b3           b2           b1           a
std err b6   std err b5   std err b4   std err b3   std err b2   std err b1   std err a
r2           std err of the y estimate
F ratio      deg. of freedom (n-k-1)
btwn-group ∑ of sq.       w/in-group ∑ of sq.

To keep yourself from going nuts, you might label the columns of all this output, too (Calc reverses the order of your Xi):

cell b58 = % yngmen; cell c58 = NRA carry99; cell d58 = %>24:BA; cell e58 = 2K per cap; cell f58 = %gbn Bush; cell g58 = %cen cty 99; cell h58 = intercept

cell B59 = X6; cell C59 = X5; cell d59 = X4; cell e59 = X3; cell f59 = X2; cell g59 = X1; cell h59 = a

You might want to brush up on ANOVA from your GEOG 200 or similar course to understand the between-group sum of squares and the within-group sum of squares. Basically, you divide the between-group SS by the number of Xi variables and the within-group SS by the degrees of freedom to get the between-group and the within-group mean squares. If you divide the between-group mean square by the within-group mean square, you get the F ratio.
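That chain of divisions is short enough to spell out in code. The sums of squares below are made-up numbers, purely to show the arithmetic, not values from the lab data:

```python
# Hypothetical sums of squares, purely for illustration
n, k = 46, 6                            # records and X variables
ss_between = 120.0                      # between-group (regression) sum of squares
ss_within = 60.0                        # within-group (residual) sum of squares

ms_between = ss_between / k             # between-group mean square
ms_within = ss_within / (n - k - 1)     # within-group mean square (df = 39)
f_ratio = ms_between / ms_within        # the F ratio LINEST reports
```

A bigger F ratio means the model explains a lot of variance relative to what it leaves unexplained.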

Okay, so that's what all the computer schmutz that comes out of OpenOffice Calc means when you ask for a LINEST multiple regression. The lack of labels is one thing I find hugely annoying about Calc, but I digress.

You're not done yet, though. We need to calculate the prob-value for our F-ratio, as well as the separate t-stats and the prob-values for each of our Xi variables as they affect variation in Y, holding all the others constant.

Assuming that you were in cell B60 when you asked for the LINEST array function and that the array came out to occupy cells B60 to H64, let's have you add a few more rows of statistics to complete your set. In cell B65, type =FDIST(B63;6;C63). That is, the prob-value for the F-ratio in cell B63, given the number of Xi variables (6) and the degrees of freedom (observations minus the number of Xi variables minus one).

In cell B66, type =B60/B61, hit return, and then drag that formula across from cell C66 through cell H66. That gets you the t-statistic for each variable by dividing its b coëfficient by the standard error for that b coëfficient.

You could go off to the back of a statistics book to figure out the prob-value of each t-score, but Calc can do it for you. In cell B67, type =tdist(abs(b66);$c63;2). This calculates the probability that you could have gotten a t-score as high as you did in cell B66 from random sorting, given the degrees of freedom, while requesting a two-tailed test (half of your alpha is put on each side of the bell curve, on the assumption that we can't predict the direction of significance). Now, drag that formula to fill C67 through H67.

And we need to calculate the ADJUSTED r2, which is going to become a very important diagnostic for evaluating our models. Obviously, if you keep adding variables, you will increase r and r2. To evaluate a model's performance, though, we need to know what the r2 would be if we adjusted it for the number of variables (and records). Are we getting better explanation by adding more complexity, or are we just adding complexity but not understanding?

Calc gives you r2. Here is how you adjust it for the number of variables (and records). In cell B68, type in =1-(1-B62)*((46-1)/(46-6-1)). It gets Calc's r2 and then adjusts it for the number of records (minus one) and, very importantly, for the number of independent variables.
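The same adjustment in Python, with a hypothetical r2 plugged in just to show how much six X variables cost you:

```python
n, k = 46, 6                                   # records and X variables
r2 = 0.75                                      # hypothetical r squared from LINEST
r2_adj = 1 - (1 - r2) * ((n - 1) / (n - k - 1))
# The penalty grows with k: the adjusted value is always below the raw r2,
# and more so as you pile on independent variables.
```

If adding a variable raises r2 but lowers the adjusted r2, the extra complexity is not earning its keep.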

Now, to interpret all this. Compare the r and adjusted r2 to the best of your original simple linear regression model's r and r2.

Name of variables in your best simple linear regression model:

_________________________ Y variable (dependent, criterion variable -- uhh, that would be gunmurder97% -- don't think I'm giving anything away here!)

_________________________ X variable (independent, predictor variable -- you're on your own here -- which one of those simple linear regressions had the highest r and r2 values?)

Best simple linear regression model r = __________

Everything-in-it regression model r = __________

Best simple linear regression model r2 = __________

Multiple regression adjusted r2 = __________

So, did this exercise in building the more elaborate model pay off in a noticeably better explanation of gun murder death rates (focus on the r2adj comparisons, rather than just the r comparison)?

______________________________________________________________________________

At four decimal places of accuracy, is there any noticeable improvement in significance from your best simple linear regression model to the prob-value for the F-ratio in the everything-in-it multiple regression model?


______________________________________________________________________________

The first row in Calc's array output gives your regression coëfficients (partial regression coëfficients, or b). The one farthest to the right in your first row is your intercept, or a. With the information in that row, you can now write up your model in the Y = a + bX1 + bX2 + ... + bX6 format. Oh, heck, why don't you do that now? Yes, some of them will look like 0. Make sure that you put in the negative sign for any b coëfficients that are negative (even if they look like zero).

Y = ____ + ____X1 + ____X2 + ____X3 + ____X4 + ____X5 + ____X6

 
Now, compare the t score for each partial correlation coëfficient and its associated prob-value, which you calculated below the LINEST array output. This way you can see the significance of each X variable, with the indirect influences it exerts through all other X variables partialed out or controlled.

Have a look at those prob-values and compare them with the ones you got from doing all those simple linear regressions. Two of the variables had very low prob-values when considered in stand-alone simple linear regressions but the highest prob-values when all the variables are considered together (meaning that they show high collinearity with other variables). Which two factors are they? Then put a check mark by the variables with prob-values smaller than your chosen alpha.


______________________________________________________________________________

==============================

Refining the Model by Backward Elimination

==========

Well, it seems that not all the variables in our original model are worth bothering with. This was apparent back when we did all those simple linear regressions: There were some high prob-values for individual X variables and gun murder rates.

The collection of useless variables grew upon close examination of the prob-values of the individual X variables in the everything-in-it multiple regression. A couple of the X variables had low prob-values considered alone in their association with Y, but those prob-values ballooned when all X variables were regressed with Y. This was because of collinearity: These X variables produce some of their effects on Y through their high correlations with other X variables.

So, let's whip out the pruning shears and dump all X variables that contribute trivially to understanding gun murder rates. In other words, identify the X variables that do not produce prob-values less than your pre-selected alpha standards. Using that standard, which variables are you going to eliminate?


______________________________________________________________________________

Now, rerun the multiple regression with the surviving variables. The safest thing to do is highlight and copy your original data and then click on Sheet 2 and, in cell A1, paste your data there. Now, delete the columns that you decided to get rid of above. Your new sheet should have gun murder rates in column B, and your surviving X variables should start in column C and go right from there. Spreadsheets need to have all the Xi variables adjacent to one another. Now, redo the LINEST process, this time with fewer X variables. Make sure to recalculate the prob-value for the F-ratio, the t-stats for each of the surviving Xi variables, and the prob-values for each of those t-stats.

Now, concentrate on the r and adjusted r2 statistic. The removal of some X variables will cause a drop in r, to be sure, but the question is what happens to the adjusted r2? Did it rise, stay about the same, or decline? In other words, did the reduced model improve explanation or hurt explanation? Another way to check on model performance is to look at the F statistic and the F Significance or prob-value (the bigger the F stat, the lower the prob-value). Do you see an increase, decrease, or stagnation in the F statistic?


______________________________________________________________________________
Examine the regression coëfficients block and write your new model:

Y = ______________________________________________

Now, look at the prob-values associated with your surviving X variables. Did the reconfiguration of the model cause any of them to get bumped above your alpha standard? Check yes or no. If yes, name which one(s)?


______________________________________________________________________________
Remove any X variable that no longer meets your truth standard and rerun the regression the way you did before (you may be down to one variable, so you may be able to recycle the simple linear regression for that variable from Sheet 1).

Using the regression coëfficients from this iteration, re-write your regression model:


______________________________________________________________________________
Now, let's have a look at the new r (which you expect to decline) and the new adjusted r2. Did the adjusted r2 rise, stay about the same, or decline compared to the everything-in-it model and the first reduced model? In other words, did the latest iteration improve explanation or hurt explanation?

______________________________________________________________________________

==============================

Getting Suspicious
Which, Really, You Should Have Been before Actually Doing the Regression

==========

Okay, now let's have a closer look at these data and models. Do X-Y scatterplots for all six simple linear regressions back in Sheet 1. Fit them with linear trendlines. Not exactly your homoscedastic balance of points on either side of the trendline, is it?

What you have is an "extreme outlier," a bizarre case that is skewing the associations you found out of all common sense. By examining your data table, can you figure out which record is the extreme outlier on all six graphs? What is the outlier?


______________________________________________________________________________
Okay, let's have a look at its effect. First, highlight your whole data table, including your simple linear regression outputs, from cell A1 through cell H55, then copy it (Control-C), then click on Sheet 3 at the bottom of your spreadsheet. In cell A1 of Sheet 3, paste your spreadsheet in (Control-V). You can run different scenarios or experiments on the same data in different sheets and keep everything in the same workbook.

Now, highlight the row containing the extreme outlier. Delete that row. You now have 45 records, instead of 46. And look what happens to all those simple linear regression outputs! Everything changes. Drastically.

Do a kitchen sink multiple regression. Not so impressive anymore, is it?

Eyeball those prob-values. Does any variable in the kitchen sink produce a prob-value below your alpha standard? Which?


______________________________________________________________________________
So, now re-do your regression for just that variable (actually, just use the simple linear regression for that variable in Sheet 3). Of the two regressions you did this way, which one produces the better overall significance (F sig or prob-value for the multiple regression and t-stat and associated prob-value for that one X variable model)?

______________________________________________________________________________


==============================

Interpretation

==========

In clear, readable English below, summarize the highlights of this process. All three models produced in the original multiple regression and backwards elimination section were highly significant, but only one of them combines explanatory power with economy of modelling. Which of the first round of regression models (before you removed the outlier) produced the best overall prob-value and proved the most powerful and economical modelling exercise?

How did your analysis change when you threw out the extreme outlier?

The gun control issue will certainly not be resolved by our quick 'n' dirty analysis of questionable and skewed databases, but, within the limits of these data, how did the relationship between gun control and gun murder rates turn out in either of the kitchen sink models? What seems to drive gun murder rates more than anything else? Why did this variable seem so insignificant in the original kitchen sink regression?


____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

____________________________________________________________________________________________________

If you're curious, you could try working the lab through with the ordinal NRA gun permit data properly broken out as three dummy variables using this version of the database: https://home.csulb.edu/~rodrigue/geog400/gunmurderdummies.ods. See if that makes any difference to your overall conclusions. This is not required -- I just wanted to make the data available in the correct format for the curious!

==============================

==============================

This document is maintained by Dr. Rodrigue
First placed on Web: 03/22/01
Last Updated: 09/14/21

==============================