3  Exploratory Data Analysis in R

4 Overview

There’s no data science without data, and the “learning” in ML is based on the data we observe. The field of statistics is all about the mathematics of data. Over the past century, the field engaged in a rigorous investigation of the origin, nature, transformation and prediction of data from a myriad of angles. That legacy feeds into the computer age and forms the theoretical foundation upon which data science and machine learning have pushed the boundaries of knowledge. What statistics could not do effectively was extend itself beyond tabular data, composed of quantitative and qualitative variables, to new data types. With help from computer science, analysis of these new kinds of data became practical.

Find a video and a webpage tutorial on one of the topics below. The basics: read in data and produce one basic visualization. Email me which of the topics below you would like to work on.

5 Tabular Data

5.1 Tabular Data in R

Intro to Data Frames

A data frame is an object in R that consists of data with labels for columns and/or rows. It is the most fundamental structure on which data analysis is performed in R. Essentially, it is an augmented matrix with the ability to include variable names for each column and identifiers for the individuals in your data in each row. The modern, “tidy” version of a data frame is called a tibble. Although there is straightforward base R syntax for working with tabular data, we’ll keep it tidy in anticipation of our use of the popular ggplot2 package for visualizations and tidymodels for statistical learning. Back to the subject at hand. If, for example, we wanted to create a data set in which the individuals are the top five scorers from the 2018-2019 NBA season, and the variables are their points per game and team name, as found here, we could do the following:

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
NBA_PPG <- tibble(
  Players = c("Harden", "George", "Antetokounmpo", "Embiid", "Curry"),
  Team    = c("HOU", "OKC", "MIL", "PHI", "GSW"),
  PPG     = c(36.1, 28, 27.7, 27.5, 27.3)
)

NBA_PPG
# A tibble: 5 × 3
  Players       Team    PPG
  <chr>         <chr> <dbl>
1 Harden        HOU    36.1
2 George        OKC    28  
3 Antetokounmpo MIL    27.7
4 Embiid        PHI    27.5
5 Curry         GSW    27.3

Note that this requires us to load the dplyr package from the tidyverse. To get a quick snapshot of the nature of the variables in your data frame, type

str(NBA_PPG)
tibble [5 × 3] (S3: tbl_df/tbl/data.frame)
 $ Players: chr [1:5] "Harden" "George" "Antetokounmpo" "Embiid" ...
 $ Team   : chr [1:5] "HOU" "OKC" "MIL" "PHI" ...
 $ PPG    : num [1:5] 36.1 28 27.7 27.5 27.3

Note that prior to R 4.0, a character variable like *Team* was automatically converted to a “factor” by data frames, but never by tibbles; since R 4.0, neither converts by default. Whether you need the conversion depends on the downstream statistical analyses intended for this data frame.
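A quick check of this behavior (the two-team column here is a toy example of our own):

```r
library(dplyr)

# data.frame() converts characters to factors only when asked (the pre-4.0
# default); tibble() never converts automatically
df <- data.frame(Team = c("HOU", "OKC"), stringsAsFactors = TRUE)
tb <- tibble(Team = c("HOU", "OKC"))

class(df$Team)  # "factor"
class(tb$Team)  # "character"

# convert explicitly when a downstream model requires a factor
tb <- tb %>% mutate(Team = factor(Team))
```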

At times it will be essential to append additional variables to our data frame. Suppose for the NBA data we’d like to include rebounds per game for these five players: 6.6, 8.2, 12.5, 13.6, 5.3. We can use the mutate function to do this:

NBA_data <- NBA_PPG %>% mutate(RPG = c(6.6, 8.2, 12.5, 13.6, 5.3))

View(NBA_data)

Note that the **View** function opens the data in an interactive window, where you can sort the data by any given column. The %>% operator is called a “pipe”; it passes the object on its left as the first argument of the function that follows it. Of course, this is the tip of the iceberg of what the mutate function can do. To learn more, type

help(mutate)
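For instance, mutate can also derive new variables from existing columns rather than typing values in by hand. A small sketch (PRPG is a made-up name for points plus rebounds per game):

```r
library(dplyr)

# NBA_data as built above, rebuilt here so the chunk stands alone
NBA_data <- tibble(
  Players = c("Harden", "George", "Antetokounmpo", "Embiid", "Curry"),
  Team    = c("HOU", "OKC", "MIL", "PHI", "GSW"),
  PPG     = c(36.1, 28, 27.7, 27.5, 27.3),
  RPG     = c(6.6, 8.2, 12.5, 13.6, 5.3)
)

# one new column, computed row-wise from two existing ones
NBA_data %>% mutate(PRPG = PPG + RPG)
```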

There are numerous ways to slice and dice our data, and Google knows them all. We highlight a couple here. To isolate cases in our data based on given characteristics, we can filter. Of the top 5 scorers, which grabbed double-digit rebounds as well?

NBA_data %>% filter(RPG>10)
# A tibble: 2 × 4
  Players       Team    PPG   RPG
  <chr>         <chr> <dbl> <dbl>
1 Antetokounmpo MIL    27.7  12.5
2 Embiid        PHI    27.5  13.6

Similarly, select isolates columns for us:

NBA_data %>% select(Team)
# A tibble: 5 × 1
  Team 
  <chr>
1 HOU  
2 OKC  
3 MIL  
4 PHI  
5 GSW  

Note that we can chain pipes but they are not commutative:

NBA_data %>% filter(RPG>10) %>% select(Team)
# A tibble: 2 × 1
  Team 
  <chr>
1 MIL  
2 PHI  
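To see the non-commutativity concretely, reverse the chain: select(Team) drops every other column, so the subsequent filter can no longer find RPG (a sketch; the exact error message wording depends on your dplyr version):

```r
library(dplyr)

# NBA_data as built above, rebuilt here so the chunk stands alone
NBA_data <- tibble(
  Players = c("Harden", "George", "Antetokounmpo", "Embiid", "Curry"),
  Team    = c("HOU", "OKC", "MIL", "PHI", "GSW"),
  PPG     = c(36.1, 28, 27.7, 27.5, 27.3),
  RPG     = c(6.6, 8.2, 12.5, 13.6, 5.3)
)

# works: filter first, then select
NBA_data %>% filter(RPG > 10) %>% select(Team)

# errors: RPG was already dropped by select()
try(NBA_data %>% select(Team) %>% filter(RPG > 10))
```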

Pre-packaged data sets

In addition to a set of built-in packages and functions, R also comes with pre-loaded data sets for your exploration and practice. Click on the Packages tab in the pane on the bottom right, and click on the datasets package. This will list the contents of the package, which are data sets we can use for myriad statistical and machine learning applications. By clicking on any data set, or typing its name in the help menu, you can get details about the included variables. For example, we can learn about the trees data set as follows:

?trees
str(trees)
'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
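You can also browse the same inventory from the console rather than the Packages tab:

```r
# list every data set shipped with the datasets package
res <- data(package = "datasets")
head(res$results[, "Item"])   # the names of the available data sets
```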

A common way to snapshot the data frame is through the head and tail functions:

head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
tail(trees)
   Girth Height Volume
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0

Other ways to get tabular data into R

Import Button

RStudio offers more convenient ways to get your data into a project. The easiest is to click on the Import Dataset icon in the Environment pane. From there you can import data from a website, or upload local data from your machine. You do need to be aware of how the data is delimited (commas, tabs, etc.) and other stylistic details. For example, we can import the first 10 rows of the Point-in-Time estimates of homelessness by state from the US Dept of Housing and Urban Development (HUD) by pasting the following link https://www.huduser.gov/portal/sites/default/files/xls/2007-2022-PIT-Counts-by-CoC.xlsx

into the URL field of the Import -> From Excel pop-up window. Note also that we can change the name of the dataset, as well as copy the source code from this dialog.

Again we may use the View function to scope out the data.
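The code that the Import dialog generates looks roughly like the sketch below (the object name PIT_counts is our own choice; read_excel() cannot read from a URL directly, so we download to a temporary file first):

```r
library(readxl)

url  <- "https://www.huduser.gov/portal/sites/default/files/xls/2007-2022-PIT-Counts-by-CoC.xlsx"
dest <- tempfile(fileext = ".xlsx")
download.file(url, dest, mode = "wb")

PIT_counts <- read_excel(dest, n_max = 10)  # first 10 rows, as in the dialog
View(PIT_counts)
```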

Fread Function

The fread function in the data.table package is another convenient and effective way to read data into R. We can copy and paste delimited data from the web or a document, and the format will be automatically detected. The data table from which we extracted the NBA data is much more extensive, and would be unreasonable to input by hand. So let’s copy the data for the top 30 scorers from BasketballReference.com.

library(data.table)

Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':

    between, first, last
NBA_Data_30<-fread(input="Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS▼,Player-additional
1,James Harden,PG,29,HOU,78,78,36.8,10.8,24.5,.442,4.8,13.2,.368,6.0,11.3,.528,.541,9.7,11.0,.879,0.8,5.8,6.6,7.5,2.0,0.7,5.0,3.1,36.1,hardeja01
2,Paul George,SF,28,OKC,77,77,36.9,9.2,21.0,.438,3.8,9.8,.386,5.4,11.1,.484,.529,5.9,7.0,.839,1.4,6.8,8.2,4.1,2.2,0.4,2.7,2.8,28.0,georgpa01
3,Giannis Antetokounmpo,PF,24,MIL,72,72,32.8,10.0,17.3,.578,0.7,2.8,.256,9.3,14.5,.641,.599,6.9,9.5,.729,2.2,10.3,12.5,5.9,1.3,1.5,3.7,3.2,27.7,antetgi01
4,Joel Embiid,C,24,PHI,64,64,33.7,9.1,18.7,.484,1.2,4.1,.300,7.8,14.6,.535,.517,8.2,10.1,.804,2.5,11.1,13.6,3.7,0.7,1.9,3.5,3.3,27.5,embiijo01
5,Stephen Curry,PG,30,GSW,69,69,33.8,9.2,19.4,.472,5.1,11.7,.437,4.0,7.7,.525,.604,3.8,4.2,.916,0.7,4.7,5.3,5.2,1.3,0.4,2.8,2.4,27.3,curryst01
6,Devin Booker,SG,22,PHO,64,64,35.0,9.2,19.6,.467,2.1,6.5,.326,7.0,13.1,.536,.521,6.1,7.1,.866,0.6,3.5,4.1,6.8,0.9,0.2,4.1,3.1,26.6,bookede01
7,Kawhi Leonard,SF,27,TOR,60,60,34.0,9.3,18.8,.496,1.9,5.0,.371,7.5,13.8,.542,.546,6.1,7.1,.854,1.3,6.0,7.3,3.3,1.8,0.4,2.0,1.5,26.6,leonaka01
8,Kevin Durant,SF,30,GSW,78,78,34.6,9.2,17.7,.521,1.8,5.0,.353,7.5,12.8,.587,.571,5.7,6.5,.885,0.4,5.9,6.4,5.9,0.7,1.1,2.9,2.0,26.0,duranke01
9,Damian Lillard,PG,28,POR,80,80,35.5,8.5,19.2,.444,3.0,8.0,.369,5.6,11.1,.499,.522,5.9,6.4,.912,0.9,3.8,4.6,6.9,1.1,0.4,2.7,1.9,25.8,lillada01
10,Bradley Beal,SG,25,WAS,82,82,36.9,9.3,19.6,.475,2.5,7.3,.351,6.8,12.4,.548,.540,4.4,5.5,.808,1.1,3.9,5.0,5.5,1.5,0.7,2.7,2.8,25.6,bealbr01
11,Kemba Walker,PG,28,CHO,82,82,34.9,8.9,20.5,.434,3.2,8.9,.356,5.7,11.6,.494,.511,4.6,5.5,.844,0.6,3.8,4.4,5.9,1.2,0.4,2.6,1.6,25.6,walkeke02
12,Blake Griffin,PF,29,DET,75,75,35.0,8.3,17.9,.462,2.5,7.0,.362,5.7,10.9,.525,.532,5.5,7.3,.753,1.3,6.2,7.5,5.4,0.7,0.4,3.4,2.7,24.5,griffbl01
13,Karl-Anthony Towns,C,23,MIN,77,77,33.1,8.8,17.1,.518,1.8,4.6,.400,7.0,12.5,.562,.572,4.9,5.8,.836,3.4,9.0,12.4,3.4,0.9,1.6,3.1,3.8,24.4,townska01
14,Kyrie Irving,PG,26,BOS,67,67,33.0,9.0,18.5,.487,2.6,6.5,.401,6.4,12.0,.533,.557,3.2,3.7,.873,1.1,3.9,5.0,6.9,1.5,0.5,2.6,2.5,23.8,irvinky01
15,Donovan Mitchell,SG,22,UTA,77,77,33.7,8.6,19.9,.432,2.4,6.7,.362,6.1,13.1,.468,.493,4.1,5.1,.806,0.8,3.3,4.1,4.2,1.4,0.4,2.8,2.7,23.8,mitchdo01
16,Zach LaVine,SG,23,CHI,63,62,34.5,8.4,18.0,.467,1.9,5.1,.374,6.5,12.9,.504,.520,5.0,6.0,.832,0.6,4.0,4.7,4.5,1.0,0.4,3.4,2.2,23.7,lavinza01
17,Russell Westbrook,PG,30,OKC,73,73,36.0,8.6,20.2,.428,1.6,5.6,.290,7.0,14.5,.481,.468,4.1,6.2,.656,1.5,9.6,11.1,10.7,1.9,0.5,4.5,3.4,22.9,westbru01
18,Klay Thompson,SG,28,GSW,78,78,34.0,8.4,18.0,.467,3.1,7.7,.402,5.3,10.3,.516,.553,1.7,2.0,.816,0.5,3.4,3.8,2.4,1.1,0.6,1.5,2.0,21.5,thompkl01
19,Julius Randle,PF,24,NOP,73,49,30.6,7.8,14.9,.524,0.9,2.7,.344,6.9,12.2,.564,.555,4.9,6.7,.731,2.2,6.5,8.7,3.1,0.7,0.6,2.8,3.4,21.4,randlju01
20,LaMarcus Aldridge,C,33,SAS,81,81,33.2,8.4,16.3,.519,0.1,0.5,.238,8.3,15.8,.528,.522,4.3,5.1,.847,3.1,6.1,9.2,2.4,0.5,1.3,1.8,2.2,21.3,aldrila01
21,DeMar DeRozan,SG,29,SAS,77,77,34.9,8.2,17.1,.481,0.1,0.6,.156,8.1,16.5,.492,.483,4.8,5.7,.830,0.7,5.3,6.0,6.2,1.1,0.5,2.6,2.3,21.2,derozde01
22,Luka Dončić,SG,19,DAL,72,72,32.2,7.0,16.5,.427,2.3,7.1,.327,4.7,9.3,.503,.497,4.8,6.7,.713,1.2,6.6,7.8,6.0,1.1,0.3,3.4,1.9,21.2,doncilu01
23,Jrue Holiday,SG,28,NOP,67,67,35.9,8.2,17.3,.472,1.8,5.4,.325,6.4,11.9,.539,.523,3.1,4.0,.768,1.1,3.9,5.0,7.7,1.6,0.8,3.1,2.2,21.2,holidjr01
24,Mike Conley,PG,31,MEM,70,70,33.5,7.0,16.0,.438,2.2,6.1,.364,4.8,9.9,.483,.507,4.9,5.8,.845,0.6,2.8,3.4,6.4,1.3,0.3,1.9,1.8,21.1,conlemi01
25,D'Angelo Russell,PG,22,BRK,81,81,30.2,8.1,18.7,.434,2.9,7.8,.369,5.2,10.9,.482,.512,2.0,2.5,.780,0.7,3.2,3.9,7.0,1.2,0.2,3.1,1.7,21.1,russeda01
26,CJ McCollum,SG,27,POR,70,70,33.9,8.2,17.8,.459,2.4,6.4,.375,5.8,11.4,.506,.527,2.3,2.7,.828,0.9,3.1,4.0,3.0,0.8,0.4,1.5,2.5,21.0,mccolcj01
27,Nikola Vučević,C,28,ORL,80,80,31.4,8.8,16.9,.518,1.1,2.9,.364,7.7,14.0,.549,.549,2.2,2.8,.789,2.8,9.2,12.0,3.8,1.0,1.1,2.0,2.0,20.8,vucevni01
28,Buddy Hield,SG,26,SAC,82,82,31.9,7.6,16.6,.458,3.4,7.9,.427,4.2,8.6,.487,.560,2.1,2.4,.886,1.3,3.7,5.0,2.5,0.7,0.4,1.8,2.5,20.7,hieldbu01
29,Nikola Jokić,C,23,DEN,80,80,31.3,7.7,15.1,.511,1.0,3.4,.307,6.7,11.7,.569,.545,3.6,4.4,.821,2.9,8.0,10.8,7.3,1.4,0.7,3.1,2.9,20.1,jokicni01
30,Tobias Harris,PF,26,TOT,82,82,34.7,7.5,15.3,.487,1.9,4.8,.397,5.5,10.5,.528,.549,3.2,3.7,.866,0.8,7.0,7.9,2.8,0.6,0.5,1.8,2.2,20.0,harrito02")

View(NBA_Data_30)

Visualizing Tabular Data in R

The nature of the visualization, modelling and application of our data depends closely on the types of variables we are analyzing. The primary categorization is either quantitative or qualitative. Let’s use the ggplot2 and related packages to see how we can create graphs that can provide insight into the NBA data.

To get a very effective snapshot of uni- and bi-variate characteristics of points and rebounds in the NBA data, the ggpairs function in the GGally package is a must.

library(GGally)
Loading required package: ggplot2
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Check it out:

ggpairs(data=NBA_Data_30,columns=c(24,30))

With a correlation coefficient of 0.073 and a relatively trend-free scatterplot, it seems that for top scorers there is no correlation between points and rebounds. However, we can add the qualitative position variable through the aesthetic (aes) argument to see how these vary and interact by position:

ggpairs(data=NBA_Data_30,columns=c(24,30),aes(color=Pos,alpha=0.2))

Now we see a great deal of variability in correlation values between positions. This is a simple example that illustrates the power visualization has to mine knowledge from data which otherwise might be difficult to ascertain.

6 Text Data

6.1 Text Data in R

7 Image Data

7.1 Image Data in R

8 Audio Data

8.1 Audio Data in R

Eventually we’ll want to delve into speech recognition using Wav2Vec2 and PyTorch.

9 Geospatial Data

9.1 Geospatial Data in R

9.1.1 Raster Tutorial

Raster Tutorial Source: https://www.neonscience.org/resources/learning-hub/tutorials/raster-data-r

library(raster)
Loading required package: sp

Attaching package: 'raster'
The following object is masked from 'package:dplyr':

    select
library(rgdal)
Please note that rgdal will be retired during 2023,
plan transition to sf/stars/terra functions using GDAL and PROJ
at your earliest convenience.
See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
rgdal: version: 1.6-4, (SVN revision 1196)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 3.4.2, released 2022/03/08
Path to GDAL shared files: /Users/kagbasuaray/Library/Caches/org.R-project.R/R/renv/cache/v5/R-4.2/x86_64-apple-darwin17.0/rgdal/1.6-4/92183bff0bac3a711fde35a22c1bf45b/rgdal/gdal
GDAL binary built with GEOS: FALSE 
Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
Path to PROJ shared files: /Users/kagbasuaray/Library/Caches/org.R-project.R/R/renv/cache/v5/R-4.2/x86_64-apple-darwin17.0/rgdal/1.6-4/92183bff0bac3a711fde35a22c1bf45b/rgdal/proj
PROJ CDN enabled: FALSE
Linking to sp version:1.5-1
To mute warnings of possible GDAL/OSR exportToProj4() degradation,
use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(raster)
DEM <-  raster(extent(188500, 190350, 227550, 229550), # xmin, xmax, ymin, ymax
         res = 50, # resolution of 50 meters
         crs = 31370) %>% 
  setValues(runif(ncell(.)))  # fill with random values
  
#raster("C:/Users/sean1/Documents/Stat 697/Homework/geospatial/NEON-DS-Field-Site-Spatial-Data/SJER/DigitalTerrainModel/SJER2013_DTM.tif")
DEM
class      : RasterLayer 
dimensions : 40, 37, 1480  (nrow, ncol, ncell)
resolution : 50, 50  (x, y)
extent     : 188500, 190350, 227550, 229550  (xmin, xmax, ymin, ymax)
crs        : +proj=lcc +lat_0=90 +lon_0=4.36748666666667 +lat_1=51.1666672333333 +lat_2=49.8333339 +x_0=150000.013 +y_0=5400088.438 +ellps=intl +units=m +no_defs 
source     : memory
names      : layer 
values     : 0.00131927, 0.9998783  (min, max)
DEM <- setMinMax(DEM)

DEM
class      : RasterLayer 
dimensions : 40, 37, 1480  (nrow, ncol, ncell)
resolution : 50, 50  (x, y)
extent     : 188500, 190350, 227550, 229550  (xmin, xmax, ymin, ymax)
crs        : +proj=lcc +lat_0=90 +lon_0=4.36748666666667 +lat_1=51.1666672333333 +lat_2=49.8333339 +x_0=150000.013 +y_0=5400088.438 +ellps=intl +units=m +no_defs 
source     : memory
names      : layer 
values     : 0.00131927, 0.9998783  (min, max)
# the distribution of values in the raster
hist(DEM, main="Distribution of elevation values", 
     col= "purple", 
     maxpixels=22000000)

image(DEM)

# add a color map with 5 colors
col=terrain.colors(5)

# add breaks to the colormap (6 breaks = 5 segments); this DEM holds
# random values in [0, 1], so the breaks must span that range
brk <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

plot(DEM, col=col, breaks=brk, main="DEM with more breaks")

Raster data is useful for displaying complex graphics, as images are built from individual pixels, making it fitting for digital photos or highly detailed maps (such as topographic maps).

Vector data, meanwhile, is more useful for maps that define regions of interest by broad shapes. Vector maps can be easier to parse due to their simpler display, and vector graphics allow the image to be scaled without loss of detail.

9.1.1.1 Data types for raster maps

As a raster map is generated pixel by pixel, image file formats such as PNG, PDF, JPG and GIF can all be used. However, one of the most useful formats for raster data is TIFF, the Tagged Image File Format, which allows metadata tags to be attached to the image data. This is particularly useful when generating maps, as information regarding the coordinate system used, the location, and the resolution can be included to provide context for the resulting image.
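As a small sketch of those tags in action, we can write a toy raster to GeoTIFF and read it back; the CRS and extent travel with the file (the values and CRS below are made up for illustration, and GTiff output assumes GDAL support is installed):

```r
library(raster)

# a tiny 10 x 10 raster with an arbitrary lat/long CRS and random values
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10,
            crs = "+proj=longlat +datum=WGS84")
values(r) <- runif(ncell(r))

f <- tempfile(fileext = ".tif")
writeRaster(r, f, format = "GTiff")  # CRS, extent and resolution stored as tags

reloaded <- raster(f)
crs(reloaded)     # the coordinate reference system survives the round trip
extent(reloaded)
```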

Here is another useful source for learning about raster data: https://datacarpentry.org/organization-geospatial/01-intro-raster-data/

9.1.2 Looking at a topographical map of Catalina Island and the South Bay

Data explanation and source: https://www.sciencebase.gov/catalog/item/5f77842182ce1d74e7d6c3cb

la <- raster('/Users/kagbasuaray/Downloads/MOD_LSTD_CLIM_M_2001-12-01_rgb_360x180.TIFF')
la <-setMinMax(la)
la
class      : RasterLayer 
dimensions : 180, 360, 64800  (nrow, ncol, ncell)
resolution : 1, 1  (x, y)
extent     : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MOD_LSTD_CLIM_M_2001-12-01_rgb_360x180.TIFF 
names      : MOD_LSTD_CLIM_M_2001.12.01_rgb_360x180 
values     : 1, 255  (min, max)

For the elevation data described at the source above, the extent (xmin: -119.0006, xmax: -117.9994, ymin: 32.99944, ymax: 34.00056) refers to the range of longitude and latitude covered by the image. The coordinate system NAD83 refers to the North American Datum of 1983, a model commonly used for accurately measuring and displaying locations in North America (see https://en.wikipedia.org/wiki/North_American_Datum). The values -5.670049 and 640.6235 refer to the lowest and highest points of elevation in meters.

plot(la)

The greener parts of the map show higher elevation, and the whiter parts show sea level. The legend is in units of meters. The highest value in the dataset is 640.6235 meters, which corresponds to the highest point of elevation on Catalina Island (according to Wikipedia).

Note: The map below doesn’t turn out as well so I wouldn’t include it in the final file, I just have it here as I included it in my demonstration on Monday 2/6. The map above I found on 2/7 and I believe is a much better fit for a raster demonstration.

9.1.3 Looking at raster data with Los Angeles raster file

Data from: https://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/

LOS_AN <- raster('/Users/kagbasuaray/Downloads/Los_Angeles/Los Angeles SEC.tif')
LOS_AN
class      : RasterLayer 
dimensions : 12349, 16645, 205549105  (nrow, ncol, ncell)
resolution : 42.33503, 42.33597  (x, y)
extent     : -351020.5, 353646.1, -245067.9, 277739  (xmin, xmax, ymin, ymax)
crs        : +proj=lcc +lat_0=34.1666666666667 +lon_0=-118.466666666667 +lat_1=38.6666666666667 +lat_2=33.3333333333333 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs 
source     : Los Angeles SEC.tif 
names      : Los.Angeles.SEC 

LCC map projection is a conical projection mainly used for aeronautical charts (see https://en.wikipedia.org/wiki/Lambert_conformal_conic_projection)

plot(LOS_AN)

image(LOS_AN)

While not perfect, these graphics still roughly display areas of higher and lower elevations, as well as water features. The darker the green, generally the higher the elevation.

9.1.4 OSMDATA Package

Video tutorial for an interesting use of geospatial data to map gun violence in Chicago: https://youtu.be/uZtto0cYjZM

Unfortunately, this video tutorial is not 100% complete, as it glosses over how it converts the data from city blocks into exact coordinates. With some effort, I’m sure it’s not too bad to figure out; a tutorial on geocoding could help in this matter.


9.1.4.1 Web tutorial used for this exercise:

https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/geospatial-data/#gathering-geospatial-data

library(osmdata)
Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
library(dplyr)
library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(ggsn)
Loading required package: grid

Attaching package: 'ggsn'
The following object is masked from 'package:raster':

    scalebar

This tutorial generates a map of the city of Mannheim in Germany, with streets and buildings.

osmdata is a package that imports map data from OpenStreetMap (OSM) into R as sf (simple features) or sp (spatial) objects.

Mannheim Border

This code chunk imports the map data and selects the data to generate the borders for the city of Mannheim

mannheim <-
  #Queries the database for places named Mannheim
  osmdata::getbb("Mannheim") %>% 
  
  #Adds a timeout function if the query takes too long
  osmdata::opq(timeout = 25*100) %>%
  
  #Limits the query to cities of the admin level 6. Lower admin levels refer to larger places (Germany is admin level 2)
  osmdata::add_osm_feature(
    key = "admin_level", 
    value = "6"
  ) %>% 
  
  #Returns the queried data in an sf(simple feature) format. This is a large amount of data (74.4 MB) containing info on names, codes and shape data (lines and points) for numerous features (everything from roads to borders to traffic sign locations)
  osmdata::osmdata_sf() %$% 
  
  #This focuses the data again to be particularly shape data for Mannheim at the city level
  osm_multipolygons %>% 
  dplyr::filter(name == "Mannheim") %>% 
  
  #Selects just the geometry for the city of Mannheim AKA Mannheim's borders 
  dplyr::select(geometry) %>%
  
  #Transforms the shape to fit the coordinate system for that region of the world. 3035 is the coordinate system code used to cover most of Europe. See https://epsg.io/3035. This transformation corrects the orientation of the generated shape.
  sf::st_transform(3035) 

mannheim
Simple feature collection with 1 feature and 0 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 4206252 ymin: 2923007 xmax: 4218767 ymax: 2943155
Projected CRS: ETRS89-extended / LAEA Europe
                        geometry
1 MULTIPOLYGON (((4209130 293...
#This code plots the shape generated above, creating a shape that depicts the borders of Mannheim
ggplot(mannheim, fill = NA) +
  geom_sf() +
  ggsn::blank()

Mannheim Roads

This code chunk selects road lines data for the city of Mannheim

roads <-
  #Queries the database for 'Mannheim'
  osmdata::getbb("Mannheim") %>% 
  
  #Adds a timeout feature for the query
  osmdata::opq(timeout = 25*100) %>%
  
  #Specifies the query to look at the highway data. Trunk, primary, secondary and tertiary are various highway classifications
  osmdata::add_osm_feature(
    key = "highway", 
    value = c(
      "trunk", 
      "primary", 
      "secondary", 
      "tertiary"
      )
  ) %>% 
  
  #Returns the queried data in an sf(simple feature) format
  osmdata::osmdata_sf() %$% 
  
  #Specifies the line data
  osm_lines %>% 
  
  #Selects the geometry data
  dplyr::select(geometry) %>%
  
  #Transforms the line shapes to the 3035(most of Europe) coordinate system
  sf::st_transform(3035) %>% 
  
  #st_intersection makes sure to only keep the parts of the lines that overlap with the 'mannheim' shape object generated earlier. Without this step, roads would be displayed going outside the borders.
  sf::st_intersection(mannheim) %>% 
  
  #Keeps only the line shape data. Without this step, unwanted points will also be placed on image.
  dplyr::filter(
    sf::st_is(., "LINESTRING")
    )

Mannheim Buildings

This code chunk selects building shape data for the city of Mannheim

buildings <-
  #Queries the database for 'Mannheim'
  osmdata::getbb("Mannheim") %>% 
  
  #Adds a timeout feature for the query
  osmdata::opq(timeout = 25*100) %>%
  
  #Specifies the query to look for buildings of various classifications
  osmdata::add_osm_feature(
    key = "building", 
    value = c(
      "apartments", "commercial", 
      "office", "cathredral", "church", 
      "retail", "industrial", "warehouse", 
      "hotel", "house", "civic", 
      "government", "public", "parking", 
      "garages", "carport", 
      "transportation", 
      "semidetached_house", "school", 
      "conservatory"
    )
  ) %>% 
  
  #Returns the queried data in an sf(simple feature) format
  osmdata::osmdata_sf() %$% 
  
  #Specifies the polygon data
  osm_polygons %>%
  
  #Selects the geometry data
  dplyr::select(geometry) %>%
  
  #Transforms the shapes to the appropriate coordinate system (3035 for this part of Europe)
  sf::st_transform(3035) %>% 
  
  #Keeps only the parts of the shapes that overlap with the 'mannheim' shape object created earlier, that way buildings that fall outside the border are not displayed.
  sf::st_intersection(mannheim)

Full Mannheim Map

This code chunk puts the previous data chunks together, overlaying the road lines and building shapes onto the Mannheim city shape, generating a simple map. The black lines are roads and the dark gray shapes are buildings.

ggplot() +
  geom_sf(data = mannheim) +
  geom_sf(data = roads) +
  geom_sf(data = buildings) +
  theme(legend.position = "none") +
  ggsn::blank() 

9.1.4.2 Bonus Map - Torrance

For fun I tried this exercise for the city of Torrance. There were a few tricky points in applying the same code to Torrance. One is that Torrance is at a higher admin level, 8, which you can see by checking Torrance’s information on OpenStreetMap. Two is that Torrance uses a different coordinate system than Europe, so without adjusting the transformation code, the city is oriented all wrong. After some searching I found a site that lists the coordinate codes used for different regions of the world: https://spatialreference.org/. For Southern California, the code is 2874.

torrance <-
  osmdata::getbb("torrance CA USA") %>% 
  osmdata::opq(timeout = 25*100) %>%
  osmdata::add_osm_feature(
    key = "admin_level", 
    value = "8"
  ) %>% 
  osmdata::osmdata_sf() %$% 
  osm_multipolygons %>% 
  dplyr::filter(name == "Torrance") %>% # filter on city level
  dplyr::select(geometry) %>%
  sf::st_transform(2874) 

torrance
Simple feature collection with 1 feature and 0 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 6415094 ymin: 1742587 xmax: 6468064 ymax: 1781454
Projected CRS: NAD83(HARN) / California zone 5 (ftUS)
                        geometry
1 MULTIPOLYGON (((6458208 174...

Note that I was having trouble getting the highway and building data for Torrance, so this is just the border of the city. The long prong is the part of Torrance that reaches the beach; apparently the jurisdiction extends a few miles into the ocean.

ggplot(torrance, fill = NA) +
  geom_sf() +
  ggsn::blank()


9.3 Geospatial Data in Python

10 Database Query with SQL

R allows you to run SQL queries from R by connecting to a database. Better yet, we can write dplyr code and have it translated into SQL for us, so we do not have to know SQL to query a database. Let’s use the https://data.world/ database to see how this is done. First we’ll install the package from GitHub:

library(devtools)
Loading required package: usethis
devtools::install_github("datadotworld/data.world-r", build_vignettes = TRUE)
Skipping install of 'data.world' from a github remote, the SHA1 (a1fd7656) has not changed since last install.
  Use `force = TRUE` to force installation
saved_cfg <- data.world::save_config("eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJyLWFuZC1yLXN0dWRpbzprc3VhcmF5IiwiaXNzIjoiY2xpZW50OnItYW5kLXItc3R1ZGlvOmFnZW50OmtzdWFyYXk6OjJjZDRmOWYzLWQ2YzgtNDJhNi1hMDgzLTg0MTZiYzZhMWZiOCIsImlhdCI6MTY3MjcxNDk1Mywicm9sZSI6WyJ1c2VyX2FwaV9hZG1pbiIsInVzZXJfYXBpX3JlYWQiLCJ1c2VyX2FwaV93cml0ZSJdLCJnZW5lcmFsLXB1cnBvc2UiOnRydWUsInNhbWwiOnt9fQ.Mxb-8ivufGegxMMsQ2E2KEcRZjHqOeCdw4Y35CCnxNxn6StLsWy2X-cfOFMVF82LzDny2mXFNNd2Ft_ouehsmg")
data.world::set_config(saved_cfg)
library("data.world")
Loading required package: dwapi

Attaching package: 'dwapi'
The following object is masked from 'package:usethis':

    create_project
The following object is masked from 'package:dplyr':

    sql
library(tidyverse)
── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ purrr   1.0.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ data.table::between() masks dplyr::between()
✖ tidyr::extract()      masks raster::extract()
✖ dplyr::filter()       masks stats::filter()
✖ data.table::first()   masks dplyr::first()
✖ dplyr::lag()          masks stats::lag()
✖ data.table::last()    masks dplyr::last()
✖ raster::select()      masks dplyr::select()
✖ dwapi::sql()          masks dplyr::sql()
✖ purrr::transpose()    masks data.table::transpose()
vignette("quickstart", package = "data.world")
starting httpd help server ... done
?data.world
ovrvw <- "https://data.world/health/big-cities-health"
bchdi <- data.world::query(
  data.world::qry_sql("SELECT * FROM Big_Cities_Health_Data_Inventory"),
  dataset = ovrvw
)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 13512 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): indicator_category, indicator, gender, race_ethnicity, place, bchc_...
dbl (2): year, value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(bchdi)
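The dplyr-to-SQL translation mentioned above can also be previewed locally with the dbplyr package, with no database connection at all: show_query() prints the SQL a pipeline would generate. A sketch using the built-in mtcars data and a simulated SQLite backend:

```r
library(dplyr)
library(dbplyr)

# a lazy table backed by a simulated SQLite connection; nothing is executed
mt <- tbl_lazy(mtcars, con = simulate_sqlite())

mt %>%
  filter(mpg > 25) %>%
  select(mpg, cyl) %>%
  show_query()   # prints the translated SELECT ... WHERE statement
```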

11 Interactive Data Visualization

11.1 Enhanced Graphics with Plotly

11.2 Shiny Apps

12 Building Apps for Data Exploration