CALIFORNIA STATE UNIVERSITY, LONG BEACH

GEOG 400/500
Multivariate Geographical Analysis

Lab Project: Data Mining in Large Databases
Principal Components Analysis Followed by K-Means Clustering

==========

This lab has the following purposes:

  • to introduce you to the processing of large, unruly databases
  • to encourage you to combine different multivariate techniques to tease out meaning in large databases
  • to introduce you to K-means clustering
  • to introduce you to the (free) multivariate statistical program, PAST
  • to illustrate a common type of applied human geographic problem.
Project deliverables:
  • a copy of your lab report (helpful form designed by Darrell Carvalho for this one), autographed;
  • your rotated component loadings table (quartimax probably easiest to use), autographed;
  • a table representing the character of the clusters you set up, showing their mean scores on each of the three principal components, autographed;
  • two graphs based on that table: a line graph and a stacked bar chart
  • if you choose to do extra credit, a printout of your choropleth map showing each county colored by its cluster classification (don't forget to put a legend explaining your color codes!)

==============================

Background

==============================

The datasets we've worked with this semester are fairly small scale, partly because I wanted you to focus on the technique at hand more than data management issues. In the "real world," you will encounter gigantic databases, many of them collections of whatever is at hand from various public and private sources and data collected internally by an enterprise. Some of them are so huge that the traditional statistical approach of formulating hypotheses, evaluating each variable for conformity to statistical requirements, selecting and implementing a technique, and then interpreting results in a logical way to make decisions is upended by the sheer scale of the data "warehouses" various entities have put together.

The situation has spurred the development of a different, more inductive approach to dealing with these monster databases, which is sometimes called "data mining." Data mining is about exploratory analysis and computerized pattern recognition. This can variously include the application of artificial intelligence, neural networks, and unsupervised classification. In many ways, principal component analysis is a form of data mining. Those of you working with GIS and remote sensing are probably already aware of unsupervised classification. The concept can be applied to all kinds of things, not just spectra and images.

One technique often used in data mining is K-means clustering. This entails picturing the same kind of p-dimensional hyperspace you met in the PCA lab. In this p-dimensional space, you have n points, each representing the coördinates of a given record in each of the p dimensions. K-means clustering tries to find, not component axes, but clusters of these n points in that hard-to-imagine hyperspace. It does this by having you specify how many clusters you want, and then centroids of these clusters are seeded or proposed (sometimes by you, other times automatically by a program). Once the initial centroids are seeded, K-means clustering entails finding the nearest cluster centroid for each point and then, tentatively, assigning that point to that nearest cluster. It can be thought of as creating a kind of geography in that hyperspace, like 2-dimensional Thiessen polygons used in economic geography to estimate market areas or in meteorology to create weather measurement interpolations. The generalized p-dimensional polygons are called Voronoi tessellations.

Then, for each of the clusters, the centroid of the cluster's newly assigned points is calculated. That calculated centroid is then used as the new seed, and the process is repeated all over again. This offset (of the new centroid from the originally seeded one) causes some of the points originally assigned to a given cluster to turn out to be closer to some other cluster's centroid. So, they are re-assigned, and the process keeps going (this takes a lot of computing power) until the iterations converge, with the cluster centers settling into smaller and smaller ranges of possible locations. At some point, further realignments are too small to bother with anymore (no more significant change). The software then reports the stable cluster centroids and the cluster to which each point belongs and, in some software packages, the actual distance from each point to its cluster's centroid.
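If it helps to see that loop spelled out, here is a bare-bones sketch of the K-means procedure just described, written in Python with NumPy. This is strictly an aside, not part of the lab workflow, and the array X of points and the number of clusters k are hypothetical inputs:

    import numpy as np

    def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
        """Bare-bones K-means: X is an (n, p) array of n points in p-dimensional space."""
        rng = np.random.default_rng(seed)
        # Seed the k cluster centroids by picking k of the n points at random.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assign each point to its nearest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its newly assigned points
            # (this sketch assumes no cluster ever ends up empty).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Stop once the centroids have essentially stopped moving.
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return labels, centroids

Notice that different random seeds can give different (equally "stable") solutions, which is exactly the behavior you'll notice later when you compare your PAST results with your neighbors'.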

Here, we're going to apply this approach to a fairly large database I put together from the (no longer supported) U.S. Census USA Counties database: http://censtats.census.gov/usa/usa.shtml. It consists of 3,143 counties and 33 variables. There are over 6,000 variables in the original database, some of them quite exotic (all kinds of things about indoor plumbing?). I went through and grabbed variables I remember using in my long-ago days as a business location analysis consultant: income, housing, ethnicity, education, and political behavior (the Census reports information collected by CQ Press on candidates voted for). I renamed the originally quite opaque field identifiers with new variable names, which I hope are a little more transparent to the reader! (a data dictionary is in the metadata tab)

==============================

Overview of Process

==============================

In this lab project, you will first reduce the complexity of the database by doing a PCA in SPSS to create three components, which you will save as regression variables and then place in your .ods spreadsheet.

You will create a new .xls spreadsheet and copy and paste into it the Areaname and GEOID10 variables from the original spreadsheet, along with the three principal components you brought in from SPSS. Then, you'll open (if you're "in" our Virtual Lab) or download PAST (if you're at home) and then have it look for and open your brand-new, slimmed-down database (the one with the five variables). Once your data are in PAST, you'll apply the Multivariate technique, Clustering, and then K-means clustering, requesting that PAST give you 5 clusters.

You'll then copy the contents of the PAST K-means box back into the Libre/OpenOffice Calc spreadsheet. Here, you'll graph the mean score of the counties in each of the 5 cluster designations, both as a stacked bar chart (clusters as X, mean component scores as Y) and as a regular line chart. Working from the graphs and consulting the original PCA rotated component matrix, develop a pithy characterization of the populations of these clusters.
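If you'd like the whole workflow as a single road map before diving into the individual programs, here is a rough sketch of the same sequence in Python (pandas and scikit-learn). The lab itself uses SPSS, PAST, and Calc; the file name below assumes you've already converted the .ods spreadsheet to .xls, and, because scikit-learn's PCA has no quartimax rotation, the scores here would only approximate what SPSS gives you:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Hypothetical converted copy of the census2010bycounty.ods spreadsheet.
    df = pd.read_excel("census2010bycounty.xls")
    ids = df[["Areaname", "GEOID10"]]

    # Standardizing first is the equivalent of running PCA on the correlation matrix.
    X = StandardScaler().fit_transform(df.drop(columns=["Areaname", "GEOID10"]))

    # Step 1: reduce the analysis variables to three principal components.
    scores = PCA(n_components=3).fit_transform(X)

    # Step 2: K-means on the three component scores, asking for five clusters.
    clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)

    # Step 3: the table you will graph -- mean component score for each cluster.
    out = ids.assign(PC1=scores[:, 0], PC2=scores[:, 1], PC3=scores[:, 2], Cluster=clusters)
    print(out.groupby("Cluster")[["PC1", "PC2", "PC3"]].mean().round(2))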

For extra credit, you can make a choropleth map of the counties by the cluster to which they were assigned and then inspect it for spatial and conceptual cohesiveness. Does the distribution make sense from what you know of American culture, politics, and economics? If you were a political geographic consultant for one of the political campaigns, which cluster do you think is the one most up for grabs, that might reward concentrated campaign targeting?

==============================

Getting the Data

==============================

The data are in a spreadsheet at https://home.csulb.edu/~rodrigue/geog400/census2010bycounty.ods. It is formatted with variable names across the top that I hope are somewhat clear. The original variable names were weird codes assigned by the Census. A list of the originals, my renamings, and the Census data source can be found in the tab "metadata." I've also separated out the national and state data in the third tab, "state and national data," because keeping them in the original database would allow double and triple counting of data, which would contaminate your results. Be sure to save your spreadsheet someplace where you can get to it later (and not have it purged during lab maintenance or, these days, Virtual Lab maintenance).

==============================

Moving the Data into SPSS and Doing a PCA

==============================

First, in order to get these data into SPSS and PAST, you need to convert them into an Excel file. Do this by choosing Save As, clicking on Save as type, and selecting Microsoft Excel 97/2000/XP (.xls).

Now, close the spreadsheet because the statistical packages won't open a file that is already open in another program. Fire up SPSS. Ask it to Open -- Data -- Files of Type -- Excel and navigate to wherever you downloaded and saved the spreadsheet. Be sure the box Read variable names from the first row of data is checked and hit OK. As soon as it loads up, save it as a native SPSS file with the .sav extension.

Now, click on Analyze up top, select Dimension Reduction from the drop-down menu, and then Factor. In the dialogue box that comes up, highlight all variables other than the two ID variables (Areaname and GEOID10), that is, everything from pctpoor09 through medianage10, and then click the right arrow to put them in the Variables box.

In the Extraction box, select Correlation matrix, Unrotated factor solution, and Scree plot; under Extract, choose Fixed number of factors, set it to 3, and then hit Continue. Note that this is different from what we did with the Mars data, where we just accepted the 1-Eigenvalue default setting. Under Rotation, pick Quartimax (which is what I did when I developed this lab, but you could play around and try Varimax or Equamax). Hit Continue.

Under Scores, click on Save as variables and select Regression and hit Continue. Now, hit OK.

At this point, you will find three new "variables" in your .sav file on the far right side, FAC1_1, FAC1_2, and FAC1_3 or something like that. Highlight them, Copy them, and then move back to your original .ods spreadsheet and Paste them on its far right side, making sure the three columns are properly aligned (go to the end and make sure they end on the same line as all your other data). Then, save your spreadsheet.
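As an aside for the curious, here is roughly what SPSS just did for you, sketched in Python using the third-party factor_analyzer package (my assumption; you are not asked to install or run it). It extracts three components from the correlation structure, rotates them with quartimax, and produces loadings and scores analogous to the rotated component matrix and the FAC columns:

    import pandas as pd
    from factor_analyzer import FactorAnalyzer  # third-party package: pip install factor_analyzer

    df = pd.read_excel("census2010bycounty.xls")       # hypothetical converted copy of the .ods file
    X = df.drop(columns=["Areaname", "GEOID10"])       # the analysis variables only

    # Principal-component extraction, three factors, quartimax rotation.
    fa = FactorAnalyzer(n_factors=3, method="principal", rotation="quartimax")
    fa.fit(X)

    # Rotated loadings (the analogue of SPSS's rotated component matrix).
    loadings = pd.DataFrame(fa.loadings_, index=X.columns, columns=["FAC1", "FAC2", "FAC3"])
    print(loadings.round(2))

    # Factor scores, comparable (though not identical) to the regression scores you just pasted into Calc.
    scores = fa.transform(X)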

Meanwhile, back in SPSS, go to the rotated component matrix now and try to figure out what the three components are. Highlight the highest factor loading (positive or negative) for each variable and then try to express what each factor is picking up on, based on the variables that you highlighted as loading highly on it. One of the factors produces virtually nothing but positive values; the other two produce both high negative and high positive loadings, reflecting a polarization in those factors. Try to come up with a short title for each factor.

  1. ________________________________________________________________________

  2. ________________________________________________________________________

  3. ________________________________________________________________________


==============================

Moving over to PAST

==============================

The SPSS version of K-means clustering produces weird results (clusters with 0 or 1 cases -- not useful). Let's move over to PAST, then.

To make life easier, first, open your newly revised .ods spreadsheet (if it's not still open). Now, File -- New -- Spreadsheet -- we're going to create a brand-new reduced database to get our components into PAST. Save that new file as an Excel (97/2000/XP) .xls file, perhaps something like censusPASTpca.xls. This is where we're going to put just our county names, a mapping ID key, and our three PCA components. In your revised .ods spreadsheet, highlight Columns A and B (Areaname and GEOID10). Hit Control-C to Copy them to the Clipboard. Now, Paste them into your new spreadsheet (censusPASTpca.xls) starting at Cell A1. Go back to the revised .ods spreadsheet and migrate to the far right, where you stored your three principal components from SPSS. Highlight those three columns and hit Control-C. Paste them at Cell C1. Voilà -- your PCA-reduced dataset, suitable for framing in PAST. Now, close it (PAST will have a fit if you leave it open).

Now, fire up PAST. You'll see it has a spreadsheet-like data editor. Ask it to navigate to censusPASTpca.xls and Open it. A dialogue box will appear asking if your rows and columns contain just names and data. Yup, so hit OK. And your database is now in PAST. At this point, please save it in PAST's native format (delete the .xls extension from the file name and manually enter .dat, and it will save as a .dat file).

Now, touch the grey box holding FAC1_1 or PCA1 or whatever you called it to highlight it. Holding down the Shift key, touch the box holding FAC1_3 (or PCA3). This should highlight all three of the factor columns (only, not the Areaname and GEOID10 columns!). Click on the Multivariate option up top and then Clustering and then K-means in the menu that drops down at each step (a lot like SPSS). A dialogue box comes up asking you how many clusters you want. Enter 5. This will bring up a box containing the cluster number for each county, in other words, the cluster the program assigned it to. The numbers don't have any numerical meaning: They are simply labels for the five clusters. Each of you may get different number-names for what are clearly the same clusters. That's okay.

Click the Copy data box at the bottom and move back to Calc. Put your cursor in the first cell of the column you want to move the cluster IDs into, probably cell AK2. Now, hit Paste and voilà, your cluster identifications will plop in. Actually, you'll get two columns, one for your counties and the second one for the cluster ID. You can delete that first one, as it's kind of useless. Name the surviving new column something like Cluster or K-Means.

==============================

Interpretation

==============================

Now for something puzzling and interesting: Look over at the classifications your neighbors wound up with and compare them with the one you got. They may not be identical! You could even go back into PAST and ask for K-means clustering again and copy those into your spreadsheet and compare them, and, again, they often won't match.

This variability is a function of the iterative process the software uses to assign cases to clusters. At each step of moving one case over to another cluster because it's now closer to that centroid, the mean centroids are shifted a little bit, meaning other cases may jump clusters if they are kind of outliers in a group. The software grinds away until the clusters are stable, with only tiny changes from iteration to iteration. The end result, though, will vary a little.

Before succumbing to despair, use the spreadsheet to count the number of cases in each of the clusters using =countif(X$2:X$3144; "1"), assuming X is the column the clusters went into -- use your actual cluster column!!! Drag the formula down through five cells and change the "1" in the formula to 2, 3, 4, 5. Now, compare with your colleagues. You will find you all wind up with clusters having roughly 24 or 25 in one of the clusters, 635 give or take in another cluster, 900-ish in another, and 1200-1250 in the remaining one. Each of you, though, will likely have a different allocation of the cluster names 1 through 5 ("a rose by any other name will smell as sweet"?).

Try to figure out what your clusters represent. To do this most easily, apply the AVERAGEIF function in Calc: =averageif($X$2:$X$3144;"1";Y$2:Y$3144). Here, I am pretending that your clusters are in column X and your first component is in column Y -- use the column letters you actually put them in!!! This formula will take the average of anything in the Y column whose row also contains a "1" in the X column. Once you've got it working, drag it down for Clusters 2 through 5 and substitute the cluster numbers in the formulas as appropriate. Once that works, highlight the whole collection of five cells and drag across two more columns to capture the means for PC2 and PC3. Format the cells to a consistent decimal place accuracy, 2, 3, or 4. You are now good to go for graphing and interpreting. Be sure to label your table with the five cluster numbers and the three PCs.
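If you ever find yourself doing this step in Python rather than Calc, the COUNTIF and AVERAGEIF work collapses into one groupby. A sketch, with hypothetical file and column names (substitute whatever you actually called your cluster and component columns):

    import pandas as pd

    # Hypothetical file with your cluster IDs and component scores side by side.
    df = pd.read_excel("censusPASTpca_withclusters.xls")

    # Cases per cluster (the COUNTIF step).
    print(df["Cluster"].value_counts().sort_index())

    # Mean component score per cluster (the AVERAGEIF table).
    print(df.groupby("Cluster")[["PC1", "PC2", "PC3"]].mean().round(3))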

Now, make two graphs of these (a line graph is the easiest, but a stacked bar chart creates an interesting pattern that may help you in interpretation). Looking at those graphs, characterize each cluster in language like "high FAC1_1, middling FAC1_2, and very negative FAC1_3 score" or whatever.
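Again purely as an aside, the same two graphs are one line each with pandas plotting, continuing from the groupby sketch above (same hypothetical file and column names; the lab only expects the charts from Calc):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_excel("censusPASTpca_withclusters.xls")            # same hypothetical file as above
    cluster_means = df.groupby("Cluster")[["PC1", "PC2", "PC3"]].mean()

    cluster_means.plot(kind="line", marker="o", title="Mean component scores by cluster")
    cluster_means.plot(kind="bar", stacked=True, title="Mean component scores by cluster (stacked)")
    plt.show()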

Now, go back to your interpretation of the rotated component matrix. What does a "strongly negative score" on FAC1_3 mean in terms of the original variables? So, using your interpretation of the principal components, how would you characterize the demographics, economic situation, and political leanings of the five clusters?

  1. ________________________________________________________________________

  2. ________________________________________________________________________

  3. ________________________________________________________________________

  4. ________________________________________________________________________

  5. ________________________________________________________________________

Marketing and political consultants do things like this, often constructing pretty elaborate lifestyle groupings of the American population. A famous example is the Claritas (now Nielsen) PRIZM system of market segmentation, which features something like 60 distinct clusters. These are used to peddle stuff to you, presumably "speaking" to your situation and cultural characteristics. Among the things peddled to you are political candidates ...

With that in mind, which of your five clusters do you think is most likely to be in play this election season? Which is likelier to be solid GOP country? Which is likelier to be solid Democrat country?

  1. ________________________________________________________________________

  2. ________________________________________________________________________

  3. ________________________________________________________________________

  4. ________________________________________________________________________

  5. ________________________________________________________________________
==============================

Extra Credit Opportunity!

==============================

Those of you proficient in GIS might enjoy a mapping exercise based on your clusters. You can download the system of TIGER shapefiles for 2010 county boundaries here; be sure to save them someplace you can get to that won't get wiped in lab maintenance.

Moi, I'm not an ArcGIS user, but I have found a fairly easy-to-use, free, open-source GIS package, Quantum GIS (QGIS). If you would like a free GIS that works with ESRI shapefiles and doesn't have a license that runs out, you can download it at http://qgis.org/ .

To make a simple choropleth map of the cluster to which each county belongs, you need to create a .csv file of just the GEOID10 column and the column you put your cluster numbers in. You would open a new spreadsheet and then immediately Save it as Type -- .csv, perhaps something like clusters.csv.

Then, in your original spreadsheet, highlight the GEOID10 column and hit Format -- Cells and, under the Numbers tab, select Text (this is VERY important!). If you don't, you'll get blanked out areas in your map. Save the file.

Then, Copy that text column and Paste it in your .csv file in Column A. Now, go back to the original spreadsheet and highlight your K-means cluster number column (make sure that is formatted as a number to 0 decimal places). Copy it and paste it into Column B of your .csv file. Make sure there's nothing at the bottom of the file from work you were doing in the original spreadsheet. Save that file.

To make sure it came out right, close the file and then immediately re-open it in Notepad. You should see one line per county: the GEOID10 code in quotation marks, followed by a comma, and then the cluster number.

Close it. Now open another Notepad. Save it as the exact same name, but with a different extension, clusters.csvt, with the extra t. In it, put a SINGLE line, consisting of "String", "Integer" --- just like that. This tells your GIS that your .csv has two variables, one a string (text name) and the other a number that can be used for choropleth shading. Save it.
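For the script-minded, here is a sketch of building the same clusters.csv and clusters.csvt pair in Python with pandas. The input file name is hypothetical; the important parts are keeping GEOID10 as text (so leading zeros survive) and writing the one-line .csvt exactly as described above:

    import csv
    import pandas as pd

    # Hypothetical input file; GEOID10 is read as text so any leading zeros survive.
    df = pd.read_excel("censusPASTpca_withclusters.xls", dtype={"GEOID10": str})
    df["GEOID10"] = df["GEOID10"].str.zfill(5)          # 2010 county GEOIDs are five characters

    # Quote only the text field, which matches the Notepad check described above.
    df[["GEOID10", "Cluster"]].to_csv("clusters.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)

    # The companion .csvt: one line telling QGIS the field types.
    with open("clusters.csvt", "w") as f:
        f.write('"String","Integer"\n')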

Now, in your GIS (I'm going off QGIS here), Project --> New and Project --> Save As and give it a snappy name, such as clusters. Then, in this newly created project, go to the Layer menu and select Add Layer and then Add Vector Layer. In the dialogue box, accept the File and Encoding defaults and then navigate to the tl_2010_us_county10.shp file. Click the Add box. It will show up in the Layers navigation bar on the left, and the county boundaries should appear in the main window of QGIS. Now, do it all again, but this time Layer --> Add Vector Layer --> and navigate to your .csv file (ask for All Files, so you can find clusters.csv quickly).

You now need to join the two layers on the field they both have in common (GEOID10), so they can share data. To do this, double-click on the shapefile in the Layers navigation bar, which will bring up the Layer Properties dialogue box. Click on the Joins tab and then the Plus button at the bottom left, which brings up the Add vector join dialogue box. Make sure the Join layer is your clusters .csv, then, in the drop-down menu beside Join field, pick GEOID10; do the same for the Target field. Then, hit OK.

Now, let's make sure there are no snafus. Back in the main QGIS window, under the Layer tab up top, pick Open attribute table. This brings up the merged database in spreadsheet format, and you should find your Kcluster as a variable on the far right side.

If so, close the attribute table and go back to the main QGIS window. Double-click again on the tl_2010_us_county10.shp layer, which brings up the Layer Properties dialogue box again. This time, select Symbology and then click on the "Single Symbol" drop menu and choose Categorized. This allows you to make a choropleth map, categorized by your clusters (make sure to pick Kcluster in the Column drop menu).

Pick an existing color ramp or go to the bottom and click on New color ramp and then either Gradient, Random, or ColorBrewer (which has a bunch of ramps designed by Dr. Cynthia Brewer for effective cartographic communication). Then, hit Classify and a legend of your color choices comes up. You can even click on each cluster and custom color it (remember you don't need to ramp these colors, because the numbers have no measurement value -- they're just names and each of you will have gotten different numbers for the same cluster, so you can go wild, off the ramp, for coloring your clusters). Then, click Okay and your map will display as designed. Given our crappy b/w printer, you might want to opt for one of the greyscale ramps, come to think of it (or include a second copy of your now printer-uglified glorious full-color map up on BeachBoard).
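And, purely as an aside for those who prefer scripting to clicking, the same join-and-categorize choropleth can be sketched in a few lines with geopandas. This is not part of the lab (QGIS is what I'm describing here), and the file names are the ones suggested earlier in this handout:

    import geopandas as gpd
    import matplotlib.pyplot as plt
    import pandas as pd

    counties = gpd.read_file("tl_2010_us_county10.shp")
    clusters = pd.read_csv("clusters.csv", dtype={"GEOID10": str})

    # Join on the shared GEOID10 key, then shade each county by its (categorical) cluster label.
    merged = counties.merge(clusters, on="GEOID10")
    merged.plot(column="Cluster", categorical=True, legend=True, cmap="Set2")
    plt.title("U.S. counties by K-means cluster")
    plt.show()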

Inspect your map. Do the places that light up for each cluster make sense to you from your analysis of the PC scores in your chart(s)? Do they form mostly spatially contiguous groups of clusters (er, clusters of clusters)? That is, do you see swaths of the country that are occupied by similar counties that are joined to one another? Do your statistical clusters seem to have geographical meaning?

It's always good form to output the map in a polished form. Aligning your map the way you want it to show on the page first, you can use QGIS' print layout manager to create a printable map. Under the Project tab up top, select Layout Manager. Up comes a dialogue box with a white work area on the left side. First, click the Add Item tab and select Add Map and define a box in the work area you'd like to put it. It should put what you have in your main QGIS map viewer in that box, which you can resize and move around a bit. Then, you can ask it to Add Legend and then define a box on the work area more or less where you want it. Then, you can ask for Add Label and define a box where you think a title would look nicest and then on the right side, look for the Item Properties tab to open a box that will let you write in title text and format it. You can put other things in there (such as Add Picture and putting one of your graphs in there?) but the map, title, and legend are the most important. When you're happy with it all, you can use Layout --> Export as image or pdf and then look at it in your browser and, if you're happy, send it to the printer (well, not during COVID-19 -- let's forget about the printing part). If the b/w just doesn't do it justice, upload the pretty color image or pdf to the Dropbox.

==============================

Lab Report

==============================

Write a brief lab report, which should cover the following points:
  1. What is the first principal component picking up on? (note that it doesn't polarize)
  2. What about the second? What would a high negative score mean and what would a high positive score mean?
  3. What about the third? Again, characterize high negative and high positive component scores.
  4. Come up with a pithy description of your first cluster, considering the mean PC scores it shows on the three components (whether they're high positive, high negative, or sort of middling, especially in comparison with the other four clusters).
  5. Do the same with your second cluster.
  6. Ditto with the third cluster.
  7. And the fourth.
  8. And the fifth.
  9. Looking at the distribution of PC scores by cluster and the meaning of high/low and positive/negative scores on each component, which of your clusters is likely to vote GOP next November and what makes you think so?
  10. Which cluster or clusters is likely to go Democratic and why?
  11. Which cluster or clusters is likeliest to see voters swing either way from one election cycle to the next and, so, is likeliest to be at play next November?
  12. How would you critique your use of PCA and K-means clustering on these data? Are their assumptions of scalar data and roughly normally distributed data met for all original variables? (hint: compare the means and medians for each variable and see if they are roughly similar -- that will save you the misery of histogramming; a quick way to do the comparison is sketched just after this list)
  13. If you have done the extra credit choropleth map of your clusters, please write a very generalized description about the major regional concentrations of each of your clusters. Spatially, name a few states that look as though they are predominantly of your "swing vote" cluster and, thus, might be where the next election is decided?
  14. Optional: If you did the map in QGIS, how would you rate it in terms of ease of learning enough to do the map? Have you ever used a GIS before? If you tried QGIS but have also used ArcGIS, how would you compare them for ease of use?
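For item 12, if you'd rather not build the mean/median comparison by hand, here is a quick sketch in Python with pandas (the file name assumes the converted .xls copy of the data; doing it in Calc or SPSS is just as good):

    import pandas as pd

    df = pd.read_excel("census2010bycounty.xls")        # hypothetical converted copy of the .ods file
    num = df.drop(columns=["Areaname", "GEOID10"])

    # For each variable, a big gap between mean and median hints at a skewed distribution.
    check = num.agg(["mean", "median"]).T
    check["gap"] = check["mean"] - check["median"]
    print(check.round(2))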

==============================

This document is maintained by Dr. Rodrigue
First placed on Web: 01/21/12
Last Updated: 10/13/22

==============================