CALIFORNIA STATE UNIVERSITY, LONG BEACH

GEOG 400/500
Geographical Analysis

Project: Linear Regression with Transformed Variables

==========

This lab has the following purposes:

  • to (re)familiarize you with the use of spreadsheets for basic statistics;
  • to refresh your memory of simple linear regression from your basic stats class;
  • to introduce you to the use of mathematical transformations to linearize certain kinds of curvilinear associations between variables; and
  • to illustrate a fundamental physical geography axiom about stream discharge levels and stream velocity.
Project deliverables:
  • a copy of this lab filled in and autographed;
  • your three regression models (simple linear, logarithmic, and power), autographed;
  • two histograms, autographed; and
  • three X-Y scatterplots with trendlines, autographed.

==============================

Background

==============================

A bivariate association is a relationship between two variables, something like
  • the heavier the soil, the greater the cation exchange capacity;
  • the greater the annual precipitation in an area, the lower the soil pH;
  • the greater the distance from the freeway, the lesser the concentration of lead in the soil.
Simple linear regression gives us the ability to describe bivariate associations in terms of the familiar statistical concepts of significance testing. Unfortunately, a lot of bivariate associations are not linear, so these advantages can't always be used.

Luckily, however, sometimes simple mathematical transformations of one or both variables can turn a curved scatterplot into a straight one. If you can do this, you may have found an inherently linear relationship. So? Well, if you can linearize an association, then you can validly use simple linear regression to analyze its direction, strength, and significance (because of algebraic invariance). This can be shaky, if the linearization makes the scatterplot heteroskedastic, so there is some controversy over this, but it remains a very popular simplification. Linearization can make the statistical analysis of some curvilinear relationships not much more complicated than simple linear regression.
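
If you'd like to see the idea in miniature before touching real data, here's a quick Python sketch with made-up numbers (everything in it -- the power-law rule, the noise level -- is hypothetical): a curvilinear power-law association straightens right out once both variables are logged.

    # Hypothetical demo: logging both variables linearizes a power-law relation.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(42)
    x = rng.uniform(1, 100, 50)                    # made-up X values
    y = 2 * x**0.5 * rng.lognormal(0, 0.05, 50)    # curved Y = 2*sqrt(X), plus noise

    r_raw, _ = pearsonr(x, y)                      # correlation on the raw scales
    r_log, _ = pearsonr(np.log10(x), np.log10(y))  # correlation after logging both
    print(f"raw r = {r_raw:.3f}, log-log r = {r_log:.3f}")  # log-log r sits closer to 1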

Your data set for this project comes from physical geography and geology, having to do with the association between stream discharge and stream velocity. You might want to go over your introductory physical geography or general geology textbook and class notes on the various associations among slope angle, stream discharge, bed shape and roughness, and velocity, in case you don't remember all the interactions. Here's an optional link to my own Geography 140 notes on the topic:
https://home.csulb.edu/~rodrigue/geog140/lectures/streams.html.

          Since regression is about specifying the direction of influence, which makes more physical sense?

  1. A. Velocity of stream flow governs discharge?

    B. Discharge of stream flow affects velocity?

    So, which is the X (independent) variable and which is the Y (dependent) variable?

  2. A. Velocity

    B. Discharge

This particular data set is created by the United States Geological Survey (USGS) from remotely sensed stream gaging stations, which record the height of the stream. Height of the stream can be converted to discharge, if prior studies have been done on the local riverbed structure (wetted bed channel perimeter, depth, velocity). These data are coupled with readings of stream flow velocity. This particular data set, from Soos Creek, Washington, was compiled from USGS data by Profs. Greg Langkamp and Joe Hull, Seattle Central Community College, as part of their NSF-funded Quantitative Environmental Learning Project (QELP) grant (NSF Division of Undergraduate Education #9980740).

In this lab, you will build models of the association of stream discharge and stream flow velocity by playing around with ways of linearizing the data to find the best model. In this project, we are less interested in inferential uses of regression modelling (testing hypotheses) than in using regression to describe an association. But we might as well pick an alpha standard anyhow, to avoid wasting time on insignificant associations. In case you've forgotten, alpha is the probability of seeing a pattern that really isn't there (a Type I error). What's the standard-issue prob-value used for alpha in a scientific study, in which human life is not at direct risk but you don't want to delude yourself into getting excited over patterns that don't exist?

  1.  
     
    _________________________________
     
    

==============================

Getting the Data

==============================

The data are in an OpenOffice Calc spreadsheet, https://home.csulb.edu/~rodrigue/geog400/soosdata.ods. It is formatted with variable names across the top. There are three columns: "A" is "Observations" (to keep track of the 50 random readings reported in this data set); "B" is "Discharge"; and "C" is "velocity." Discharge is given in cubic feet per second, and velocity in feet per second. You can download your file to the lab hard drives if you've forgotten a flash drive, but be aware that the data are purged each week. If you have to save it on the lab computers, create a directory with your last name and save your temporary copy of the file there, so you'll be able to find it when you remember to bring your own storage device. Alternatively, you can e-mail your work from that subdirectory to yourself, which is probably a good habit to get into at the end of each day.

==============================

Visualizing the Data

==============================

First, let's get an idea of how much each of the variables, well, varies. In Cell A53, type "minima"; in Cell A54, type "maxima" (minus the quotation marks). In Cell B53, type =min(b2:b51); in Cell B54, type =max(b2:b51). Highlight cells B53 and B54 and then move your cursor to the lower right side of the highlighted area until it changes from an arrow to cross-hairs. Left-click and drag to cells C53 and C54 to copy the formulas over to the second column. We'll be making use of this later.
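
If you'd rather cross-check your spreadsheet arithmetic outside Calc, here's a minimal Python sketch. The file name soosdata.csv and the column headers Discharge and Velocity are my assumptions -- export the .ods to CSV and adjust to taste.

    # Sketch: the same minima and maxima, computed with pandas.
    import pandas as pd

    df = pd.read_csv("soosdata.csv")            # hypothetical CSV export of soosdata.ods
    print(df[["Discharge", "Velocity"]].min())  # like =min(b2:b51) and =min(c2:c51)
    print(df[["Discharge", "Velocity"]].max())  # like =max(b2:b51) and =max(c2:c51)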

To make your lives easier, use Calc's graphics functions to create and view scatterplots of discharge and velocity. To do this, highlight cells B2:C51 (left-click your mouse as you drag it from B2 down to C51). Then click on the "Chart Wizard" box (the little icon with a multi-colored pie chart on it). Up will come the "Chart Wizard ... Chart Type" box.

Pick "XY (Scatter)." Then pick the box (probably upper left one) that shows your X and Y axes and a cloud of dots, but no lines. Then click "Next."

Accept Data range: $soosdata.$b$1:$c$51 and "Data series in columns" and First row as label. The dollar signs tell Calc that b1 through c51 are absolute positions (don't worry about that). Note the first draft of your graph coming up just like that. Yes, this is a non-linear association. Click "Next" and then "Next" again.

At this point, you can customize the graph a lot. You might want to give it a snappy title, such as "Raw discharge and velocity, Soos Creek, WA," or some such (you want to keep track of these scatterplots, so titles and axes are useful). For the X axis, you could label that "Discharge (cfs)" and the Y axis would be something like "Velocity (ft/sec)." When you're done with your creative artistry, hit "Finish."

Your graph will appear as a free-floating box within your spreadsheet. The grid of dots at the corners and the grey border tell you it's active, so you can put your cursor on the border somewhere and, holding down the left mouse button, move the graph someplace convenient and maybe resize it (by left-clicking one of the corner dots and dragging it around). You can also get rid of that annoying legend box by right-clicking on it and picking "Clear." I stuck my graphs at the right end of the spreadsheet. So, plop it somewhere out of your way.

You can also ask Calc to put a trendline through your data set. Once your graph is created, sitting in your spreadsheet somewhere, and in active status (grey border), put your mouse exactly over one of the data points and click your right mouse button. Select "Insert Trendline" and, under "Type," choose the linear version. Painless (but kind of ugly). Look how the trendline starts off with most of the dots below it, then, in the middle stretch of the line, most of the dots are above it, and the last stretch has the dots below the line again. Obviously, the linear model is not the happiest description of the data cloud: The data cloud is curved.
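
If you want to see that below/above/below pattern numerically rather than by squinting, here's a short Python sketch (same hypothetical CSV export as before) that fits the same straight line and prints the signs of the residuals in order of increasing discharge:

    # Sketch: signs of residuals from a linear fit, sorted by discharge.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("soosdata.csv").sort_values("Discharge")  # hypothetical file name
    b, a = np.polyfit(df["Discharge"], df["Velocity"], 1)      # slope b, intercept a
    residuals = df["Velocity"] - (a + b * df["Discharge"])     # vertical miss from the line
    print(np.sign(residuals).to_numpy())  # runs of -1, then +1, then -1 betray the curve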

Right-click the line, pick Format Trend Line, and then the Type tab. You can now experiment with some of the other mathematical functions Calc provides for you when you right-click. Which one produces the prettiest curve, the one that seems to reproduce the curving trend in the scatterplot?

  1.  
     
    _________________________________
     
    

==============================

Building a Simple Linear Model of the Raw Data

==============================

A simple linear regression model takes the form:

Y = a + bX + e

Where X is the independent or forcing variable (on the horizontal axis), Y is the dependent variable (on the vertical axis) that we're trying to predict with the model, a is the Y intercept (or point on the Y axis where the regression line would touch), b is the slope of the regression line (positive for a direct relationship and negative for an inverse one), and e is the error term (or the amount of residual variation in Y that can't be explained by variation in X). The error term implies that there are other X variables that we could explore in a more elaborate, multivariate model. We'll worry about that later. Most commonly, you'll see the equation shortened to (since everyone knows that there's bound to be at least some unexplained residual error):

Y = a + bX   (you may see it in some disciplines as   Y = bX + a   or   Y = b₁X + b₀   or   f(X) = b₁X + b₀)

At this point, let's build a simple linear regression model of what you have, curvilinear and all. We'll use this model as the baseline for comparing our other experimental models later.

At some point in the spreadsheet that's convenient to you, calculate the correlation coëfficient, or R. The Calc formula for this is =correl(c2:c51;b2:b51), that is, correl(dependent or Y variable range;independent or X variable range).

Below that, calculate the coëfficient of determination, or R², which is simply R squared. There are two ways to do that in Calc: =rsq(c2:c51;b2:b51) or you can simply take the cell containing R and square it (^2).

Below that, get the Y intercept or a. Type =intercept(c2:c51;b2:b51). This gives you the point on the Y axis where the regression line crosses it.

Below that, compute the slope or b. Type =slope(c2:c51;b2:b51). This tells you how much vertical movement (change in Y) occurs for each unit change along the horizontal axis (change in X).
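
All four of these quantities come out of a single call in Python, if you'd like a cross-check on your Calc formulas (a sketch, again assuming the hypothetical soosdata.csv export with Discharge and Velocity columns):

    # Sketch: r, R², intercept, and slope in one shot with scipy.
    import pandas as pd
    from scipy.stats import linregress

    df = pd.read_csv("soosdata.csv")                   # hypothetical file name
    fit = linregress(df["Discharge"], df["Velocity"])  # X first, Y second
    print("r  =", fit.rvalue)                          # =correl(c2:c51;b2:b51)
    print("R2 =", fit.rvalue**2)                       # =rsq(c2:c51;b2:b51)
    print("a  =", fit.intercept)                       # =intercept(c2:c51;b2:b51)
    print("b  =", fit.slope)                           # =slope(c2:c51;b2:b51)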

Below that, estimate the significance of the association between X and Y, that is, the probability that just random pairs of numbers would generate a correlation higher than the one you got. This is going to be a little hairy. Well below the end of your spreadsheet, we're going to use Calc's array functions to get at a measure of significance for your model. This is kind of a pain, because Calc doesn't label the box of numbers that pours out of it. I'll be your tour guide.

So, way down there, pick a cell to be the upper left corner of a box of schmutz. Then, touch the fx box to the left of the data editing box toward the top of Calc. In this function wizard's dialogue box, scroll down to LINEST and hit Next. Now, type c2:c51 next to data_Y, b2:b51 next to data_X, true next to Linear type, and true next to stats, these last two truths because we want the whole yazoo. Then, hit OK.

You get a 5 row by 2 column box of unlabeled stuff (one of the things I hate about Calc). The third one down the left side is your coëfficient of determination (a spot check for you). The fourth one is the F statistic from the ANOVA that constitutes the test of significance for regression/correlation analyses. The number to its right is DF2, or the second degrees of freedom (basically, observations minus the number of variables you're relating).

Now, to figure out how significant the F statistic is. Below that little table, type =fdist(a92;1;b92). Instead of a92 and b92, please substitute the cells in your spreadsheet where the F test and the DF2 landed. What you're telling Calc to do is figure out the probability that random processes could have generated an F statistic as big as the one you got, given 1 degree of freedom for DF1 (number of variables minus one) and 48 degrees of freedom for DF2. Pretty dinky, eh? As in you'd have to expand that cell out to a huge number of decimal places to find anything more than zeroes.
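
For the algebraically curious, the F statistic LINEST reports for a simple regression is just R² recycled: F = R²/(1 - R²) * (n - 2), with DF1 = 1 and DF2 = n - 2. Here's a sketch of the whole test in Python (same hypothetical CSV assumptions):

    # Sketch: rebuild LINEST's F statistic and its prob-value from R².
    import pandas as pd
    from scipy.stats import linregress, f

    df = pd.read_csv("soosdata.csv")                   # hypothetical file name
    fit = linregress(df["Discharge"], df["Velocity"])
    n = len(df)                                        # 50 observations
    r2 = fit.rvalue**2
    F = r2 / (1 - r2) * (n - 2)                        # LINEST's fourth-row F statistic
    p = f.sf(F, 1, n - 2)                              # same job as =fdist(F;1;DF2)
    print(F, p)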

So, at this point, write in your model (leaving the Y and X letters in there as placeholders):

  1. Y = ______ + ______ X

Something else reported in a regression analysis is the correlation coëfficient, or r. What did you get?

  1.  
     
    _________________________________
     
     
    

Below that is the coëfficient of determination. This is the proportion of variation in Y that is explained by variation in X. It is, basically, the square of the correlation coëfficient, or r² (you can cross-check in that LINEST table). What did you get?

  1.  
     
    _________________________________
     
    

What did you get for the F-statistic?

  1.  
     
    _________________________________
     
    

Okay, now just how significant is this model? What is the probability that a random mish-mash of 50 pairs of numbers would have produced results as extreme as this?

  1.  
     
    _________________________________
     
    

So, using the alpha standard you wrote down in the first part of this project, is this simple linear regression model significant?

  1.  
     
    _____  yes     _____  no
     
    

==============================

Interpretation

==============================

Briefly summarize in regular English what the nature of this association is in terms of direction (direct or inverse), how much of the variation in velocity is explained by variation in discharge, and how significant the association is.

  1.  
 
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 

==============================

Data Structure

==============================

Let's see if we can improve an already pretty strong model by transforming one of the variables. Those of you who actually played around with the trendline formatting function have already figured out that a logarithmic transformation might do the trick. Let's do a logarithmic transformation of one variable. But which one?

At this point, let's use Calc's graphing ability again to create a histogram, so that we can get a handle on which variable is the least conformal to the normal distribution. Simple linear regression should be used on scalar data from fairly normal distributions.

Calc doesn't exactly make histograms quick work, unfortunately, though it's faster than Excel and doesn't include an error baked into Excel. To make a histogram of the data structure, we need to come up with the classes into which we would like to divide the Discharge column and then the Velocity column. You already figured out the minimum and maximum values of each column. Divide each range into a nice number of bins that would cover the range, with each bin spanning an equal-sized range that is a nice round number. You probably don't want more than about 10 or so bins (say, a bin width of 70 in Discharge) or fewer than 5 (a bin width of 140 for that column). You need enough bins to see the pattern but not so many that you lose the pattern in small random variations.

Okay, let's say you picked 0 to 750 for the Discharge column, which creates 10 bins of 75 each. Create a bin column headed something like D classes (I started in column J1 -- substitute wherever you set yours up). In Cell J2, write 75. In Cell J3, write =j2+75 and hit enter. Now, drag Cell J3 down to Cell J10, which should now read 675. You need one more bin. In cell J11, manually write >675. This last entry with the greater than part is very important. There are your bins.

Now, let's have Calc figure out how many observations fit in each of these bins. It will assume that it should count the frequencies up to and including the values you put in the D classes column (e.g., 0 through 75). Label Cell K1 as Freq. Now, put your cursor in Cell K2 and then click on the function wizard, which is the fx button to the left of the data editing box near the top of Calc.

In the dialogue box that comes up, scroll down and select FREQUENCY and then hit Next. Two entry boxes come up on top of one another. In the top one, data, type in b2:b51 if you're working on the Discharge column. Then, in the classes box, enter J2:J11 (or wherever you put the classes/bins). Then, hit OK. Voilà! All your bins are populated with the number of observations that fit in each one. You are ready to graph.
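
If you'd like to double-check those frequencies outside Calc, here's a Python sketch with the same bins (hypothetical CSV assumptions as before). One caution: numpy drops a value that lands exactly on an interior boundary into the upper bin, while Calc's FREQUENCY counts it in the lower one, so the two can disagree for values sitting right on an edge.

    # Sketch: FREQUENCY-style bin counts with numpy.histogram.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("soosdata.csv")                  # hypothetical file name
    edges = np.append(np.arange(0, 676, 75), np.inf)  # 0, 75, ..., 675, then >675
    counts, _ = np.histogram(df["Discharge"], bins=edges)
    print(counts)                                     # one count per bin, catch-all last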

Now, let's turn this into an actual histogram. Highlight your D classes and Freq values and then click on the graphics button up top (looks like a loudly colored pie chart). Accept the default Column chart type and hit Next. Very important: Click the First column as label option, which will turn your graph into something vaguely resembling a histogram. Hit Next and then Next again. Now, you can fill in titles for the graph and for the axes. You can also unclick Display legend (it's kind of useless here). Then, hit Finish. While it's still active, touch one of the bars and right-click and then select Format Data Series. In the dialogue box, pick Options and then, under Settings -- Spacing, edit it to 0% (histogram bars are supposed to touch, expressing the scalar, continuous nature of the X axis).

There's your histogram. You can pretty it up with all kinds of options for coloring the chart area, the chart wall, the color of the bars, the font effects for the titles, etc. There remains one little highly aggravating problem, which bedevils both Calc and Excel, because they're programmed by programmers, not by statisticians. Look at the X axis. No 0 appears to start the axis, as it should. Even more offensively, the numbers marking the upper range of each bin appear, not at the ends of the bins where they should, but nonsensically right in the middle of each bin, as though they were text labels in a bar chart. I don't expect you to fix this right now, but you need to be aware of it. If you were to include such a histogram in a publication, conference viewgraph, term paper, or thesis, you would need to fix it.

What I do is highlight the graph, importantly including one row and one column on either side of the graph, and paste it into Photoshop or GIMP or even Paint and select the errant axis numbers and move them over. This takes some futzing around to copy in 700 for the last, right-most entry, and to create a zero, but it doesn't take all that long. Save it in whichever graphics format you choose and you have a histogram that's both pretty AND correct.

Now, do the same thing for Velocity (Column C).
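
If you'd like a number to back up your eyeballing of the next question, skewness is a handy summary: Values near 0 suggest a symmetric, bell-like shape, while large positive values flag that strongly concave, piled-up-at-the-left shape. A sketch (same hypothetical CSV assumptions):

    # Sketch: compare the skewness of the two raw variables.
    import pandas as pd
    from scipy.stats import skew

    df = pd.read_csv("soosdata.csv")                 # hypothetical file name
    print("Discharge skewness:", skew(df["Discharge"]))
    print("Velocity skewness: ", skew(df["Velocity"]))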

Now, look at your two hard-earned graphs. Which one does not show the sort of bell-shaped distribution that might come from a normal distribution (especially if you had a big enough sample to fill it in)? In other words, is there one that forms a really concave distribution and is, therefore, likelier to be creating that curvilinearity in the association? Okay, we're going to log that one.

  1.  
     
    _________________________________
     
     
    

==============================

Toward a Better Mousetrap: Transforming the Concave Data Distribution

==============================

First, in Cell D1, write Log D. Then, look up and down Column B to make sure there are no zero values. How come? Well, try typing =log(0) in some random cell: You'll get an error, because there is no log of 0. We have to make sure none of our values are exactly 0. Having assured yourself of that, in Cell D2, write =log(b2). Voilà: your first log transform!

Now, copy that formula from D2 down to D51. First, move your cursor to the bolded lower right corner of the cell, until the arrow cursor becomes a thin crosshair. Then, hold down the left button of the mouse, drag it down to D51, and let go. Instant painless log transform.
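
The whole Column D operation collapses to one line in Python, if you're cross-checking (Calc's =log() defaults to base 10, as does numpy's log10; same hypothetical CSV assumptions):

    # Sketch: the log transform of discharge, with the no-zeros check built in.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("soosdata.csv")        # hypothetical file name
    assert (df["Discharge"] > 0).all()      # no zeros allowed: log(0) is undefined
    df["LogD"] = np.log10(df["Discharge"])  # same as =log(b2) copied down the column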

Go on and start another X-Y graph using C2:D51 just the way you did your first scatterplot, but, this time, in the Customize data ranges for individual data series dialogue box, you need to transpose X and Y. Click on the X-Values line under Data Ranges. It now appears in the Range for X-Values editing box below. Substitute D for C in there (both times), so that Log Discharge becomes X. Now, click on the Y-Values line. When it appears in the Range for Y-Values box, substitute C for D in both parts of the range, so that raw Velocity becomes Y. That's all there is to that.

Create a title, remembering to put Log Discharge somewhere in the title and the X axis label, so you can tell this graph from the other one. This time, again, move the completed graph somewhere out of the way, someplace where you can compare it with your first scatterplot.

Ta da! Quite an improvement over your first graph, eh? Well, not so fast: First let's see if the visual improvement is echoed in the statistics (sometimes an improved visual rendering can actually be less effective statistically).

Go on ahead and fit a linear trendline to your data. Looks pretty good, hmmm? The line goes pretty much right through the center of the data cloud at any level of X values: There is no segregation of dots below the line in one area and above the line in another. The data seem to be homoskedastic, which makes us (who?) feel better about using the logarithmic transform.

==============================

Building a Logarithmic Regression Model of the Raw Data

==============================

Once more, let's do a regression model. Just as you did before, pick a place to put your model coëfficient and constant calculations. Again, calculate the correlation coëfficient, the coëfficient of determination, the Y intercept, and the slope, remembering that your X range is now the logged column, d2:d51. And way down below somewhere, redo the LINEST function to get your F statistic and then do an FDIST on it to get its prob-value.
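
In Python, the logarithmic model is the same linregress call as before with the logged column swapped in as X (a sketch, same hypothetical CSV assumptions):

    # Sketch: velocity regressed on log10(discharge).
    import numpy as np
    import pandas as pd
    from scipy.stats import linregress

    df = pd.read_csv("soosdata.csv")                             # hypothetical file name
    fit = linregress(np.log10(df["Discharge"]), df["Velocity"])  # log X, raw Y
    print("Y =", fit.intercept, "+", fit.slope, "log(X)")
    print("R2 =", fit.rvalue**2)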

Enter your new values in your regression model (Y = a + b(logX)):

  1. Y = ______ + ______ log(X)

If you wanted to read this relationship, you'd have to remember that Y reflects the LOG of X, not X itself. To figure out which X value predicted a given value of Y, you'd have to take the antilog (10^D2, for example) of the value on the X axis or be sure to go back to the original column of X values in your spreadsheet.
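
Here's what that back-and-forth looks like in a sketch, with made-up values of a and b standing in for whatever your spreadsheet actually reports:

    # Sketch: predicting Y from X, then recovering X from Y by antilog.
    import math

    a, b = 0.5, 1.2               # hypothetical intercept and slope
    Q = 300.0                     # some discharge, in cfs
    v = a + b * math.log10(Q)     # the model predicts velocity from log(discharge)
    Q_back = 10 ** ((v - a) / b)  # the antilog recovers the discharge: 300.0 again
    print(v, Q_back)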

Now, the new correlation coëfficient (holy Toledo)!

  1.  
     
    _________________________________
     
    

And the new coëfficient of determination:

  1.  
     
    _________________________________
     
    

And the new F-statistic?

  1.  
     
    _________________________________
     
     
    

What is the corresponding prob-value?

  1.  
     
    _________________________________
     
    

==============================

Interpretation

==============================

Again, in clear, readable English, summarize the highlights of this model. How much of the variation in stream flow velocity is accounted for in this model? How did the significance of this association change in comparison with your first, simple linear regression model? (You may not be able to see any difference even if you expand the prob-values out to crazy numbers of decimal places, so, instead, you can look at the F-statistic itself: The bigger it is, the more significant/better the model is.)

How æsthetically satisfying is the linear regression line when drawn through the data points as compared to the first model? Would you say that the extra hassle involved in doing a logarithmic transformation of X is worthwhile in terms of additional descriptive power and significance?

  1.  
 
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
 
==============================

If Logging One Variable is Great, What about Logging TWO?

==============================

Let's see if we can improve this model even MORE by logging both variables and creating a power curve regression model. Where a simple linear regression model is reported as Y = a + bX, and a logarithmic regression model is reported as Y = a + b(logX), a power curve is reported as Y = aX^b. This really is nothing more than a more elegant way of saying logY = loga + b(logX).
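
In Python, the power model is one more linregress call, this time with both columns logged, followed by an antilog of the intercept to recover a (a sketch, same hypothetical CSV assumptions):

    # Sketch: power-curve fit, Y = a * X^b, via log-log regression.
    import numpy as np
    import pandas as pd
    from scipy.stats import linregress

    df = pd.read_csv("soosdata.csv")        # hypothetical file name
    fit = linregress(np.log10(df["Discharge"]), np.log10(df["Velocity"]))
    a = 10 ** fit.intercept                 # the intercept in log space is log(a)
    b = fit.slope                           # the exponent b is the log-log slope
    print(f"Y = {a:.3g} * X^{b:.3g}, R2 = {fit.rvalue**2:.3f}")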

First, make sure there are no perfect zeros in Column C. Then, in E1, write Log V. In E2, type =log(c2) and then copy that all the way down to E51.

Repeat the whole sequence to get R, R², a, and b, and don't forget to do the LINEST function to get the F statistic.

Go on and build another X-Y graph just the way you did your first two (remembering to put Log Discharge and Log Velocity somewhere in the title and X and Y axes, so you can tell this graph from the other two). This time, however, select cells D2 through E51. Again, move the completed graph somewhere out of the way.

Add a linear trendline to your scatterplot (first intimations of trouble).

Enter your new values in your regression model (Y = aX^b or logY = loga + b(logX)):

  1. Y = ______ X^______

New correlation coëfficient:

  1.  
     
    _________________________________
     
     
    

New coëfficient of determination:

  1.  
     
    _________________________________
     
    

New F statistic:

  1.  
     
    _________________________________
     
    

And the new prob-value (yeah, yeah, you still can't tell the difference with so few decimal places, but...)?

  1.  
     
    _________________________________
     
    

==============================

Interpretation

==============================

Again, in clear, readable English, summarize the highlights of this model. How much of the variation in stream flow velocity is accounted for in this model? How did the significance of this association change (all you have to go on is the size of the F values) in comparison with your first two regression models? How æsthetically satisfying is the linear regression line when drawn through the data points as compared to the first and second models? Would you say that the extra hassle involved in doing a logarithmic transformation of BOTH X and Y is worthwhile in terms of descriptive power and significance, compared to just logging X?

  1.  
 
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 

==============================

Grand Overall Analysis

==============================

Again, in clear English, compare all three models with one another. Which one produced the highest coëfficient of determination (explained the most variation in velocity of stream flow) and the highest F ratio (key to significance)? Which model performed the least well? Was the weakest model still a decent description (look at the F ratio value and associated prob-value)? Which of the three would be the best choice in terms of effort and rewards (quality of descriptiveness)?

How could we have saved ourselves a little work in scatterplotting and model-building ahead of time and estimated which of the three models might be the most rewarding before actually doing the work? You could have predicted ahead of time that one of these models would likely be the best if you had done this preliminary step first (NOW she tells us!).

Let's get more adventuresome. Why is it that increasing discharge of a stream results in faster-flowing water? Time to dust off your old Geography 140 or Geology 102 textbooks! Think in terms of the flow patterns in water as the wetted perimeter declines in comparison with the stream cross-section as discharge increases. The point of this is that we have to link our statistical analyses with the processes they describe.

  1.  
 
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 
____________________________________________________________________________________________________
 

==============================

This document is maintained by Dr. Rodrigue
First placed on Web: 02/20/01
Last Updated: 08/20/16

==============================