5  Predictive Modeling in R

6 Linear Regression

As discussed in class, linear regression is the go-to tool when our task requires us to learn about the values and variation in a continuous response variable (\(y\)), given the values of related explanatory or predictor variables \(x_1,x_2,\ldots,x_{p-1}\). Here \(y\) is the dependent variable, and the \(x_j\)’s are the independent variables. The assumption is that there is some genuine relationship between these variables, and based on the data we collect, we would like to use the tools of statistical analysis to estimate this relationship effectively and to evaluate the performance of our estimator. The general form of the multiple linear regression model is

\[ y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_{p-1}x_{i,p-1}+\epsilon_i=\mu_i+\epsilon_i \]

for \(i=1,\ldots, n\), where \(\mu_i = E[y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_{p-1} x_{i,p-1}\),

\(y_i=\) value of response for individual \(i\)

\(x_{ij}=\) value of \(j^{th}\) predictor for individual \(i\)

\(\epsilon_i=\) random component (error term) for individual \(i\)

\(\beta_j=\) parameters that characterize the relationship between the predictors and response.

We can express this model in matrix notation as

\[ \vec{y} = X\vec{\beta} + \vec{\epsilon} \]
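
Written out, the response vector, the design matrix (with a leading column of ones for the intercept), the coefficient vector, and the error vector are

\[ \vec{y}=\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix},\qquad X=\begin{pmatrix}1 & x_{11} & \cdots & x_{1,p-1}\\ 1 & x_{21} & \cdots & x_{2,p-1}\\ \vdots & \vdots & & \vdots\\ 1 & x_{n1} & \cdots & x_{n,p-1}\end{pmatrix},\qquad \vec{\beta}=\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1}\end{pmatrix},\qquad \vec{\epsilon}=\begin{pmatrix}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\end{pmatrix}. \]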

Based on sample data \((X,\vec{y})\), we seek to identify the values of the parameters \(\vec{\beta}\) that minimize the sum of squared error criterion:

\[ Q(\beta_0, \beta_1 , \beta_2 , \ldots, \beta_{p-1} ) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_{p-1} x_{i,p-1})^2 \]
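
Setting the partial derivatives of \(Q\) with respect to each \(\beta_j\) equal to zero yields the normal equations \(X^\top X\vec{\beta}=X^\top\vec{y}\); assuming \(X^\top X\) is invertible, the minimizing values are \((X^\top X)^{-1}X^\top\vec{y}\).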

Denote these optimal statistics as \(\hat{\vec{\beta}}\). Let’s see this process in action for actual data.

The cereal.csv file on Kaggle lists a number of cereals (can you find your favorite?), along with variables that describe the nutritional and other qualities of each. The target variable here is a rating of each cereal (most likely from Consumer Reports). A listing of the variables is below:

Name: Name of cereal

mfr: Manufacturer of cereal

A = American Home Food Products; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina

type: cold or hot

calories: calories per serving

protein: grams of protein

fat: grams of fat

sodium: milligrams of sodium

fiber: grams of dietary fiber

carbo: grams of complex carbohydrates

sugars: grams of sugars

potass: milligrams of potassium

vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of the FDA recommended daily amount

shelf: display shelf (1, 2, or 3, counting from the floor)

weight: weight in ounces of one serving

cups: number of cups in one serving

rating: a rating of the cereals (Possibly from Consumer Reports?)

Let’s load the data and start to explore.
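
A minimal way to do this, assuming cereal.csv sits in the working directory, is the following; the messages that appear afterward come from attaching the tidyverse and from read_csv guessing the column types:

library(tidyverse)               # loads readr, dplyr, ggplot2, ...
data<-read_csv("cereal.csv")     # read the Kaggle cereal file into a tibble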

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ dplyr   1.1.0
✔ tibble  3.1.8     ✔ stringr 1.5.0
✔ tidyr   1.3.0     ✔ forcats 0.5.2
✔ purrr   1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Rows: 77 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): name, mfr, type
dbl (13): calories, protein, fat, sodium, fiber, carbo, sugars, potass, vita...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 16
  name        mfr   type  calor…¹ protein   fat sodium fiber carbo sugars potass
  <chr>       <chr> <chr>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1 100% Bran   N     C          70       4     1    130  10     5        6    280
2 100% Natur… Q     C         120       3     5     15   2     8        8    135
3 All-Bran    K     C          70       4     1    260   9     7        5    320
4 All-Bran w… K     C          50       4     0    140  14     8        0    330
5 Almond Del… R     C         110       2     2    200   1    14        8     -1
6 Apple Cinn… G     C         110       2     2    180   1.5  10.5     10     70
# … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
#   cups <dbl>, rating <dbl>, and abbreviated variable name ¹​calories

In preparation for the model validation process, let’s hold out a random subset of our data so that we can later validate the results we find using the majority of the observations. Let’s use an 80%-20% split.

n<-dim(data)[1]
n
[1] 77
0.8*n
[1] 61.6

Since \(0.8\times 77=61.6\) is not a whole number, we round up with ceiling() and randomly sample that many observations to build the model:

train_proportion<-0.8
nb<-ceiling(train_proportion*n)
set.seed(11)
train_ind<-sample(seq(1,77,1),nb)
data_training<-data[train_ind,]
head(data_training)
# A tibble: 6 × 16
  name        mfr   type  calor…¹ protein   fat sodium fiber carbo sugars potass
  <chr>       <chr> <chr>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1 Grape-Nuts  P     C         110       3     0    170   3    17        3     90
2 Puffed Whe… Q     C          50       2     0      0   1    10        0     50
3 Froot Loops K     C         110       2     1    125   1    11       13     30
4 Corn Chex   R     C         110       2     0    280   0    22        3     25
5 Honey Nut … G     C         110       3     1    250   1.5  11.5     10     90
6 Raisin Nut… G     C         100       3     2    140   2.5  10.5      8    140
# … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
#   cups <dbl>, rating <dbl>, and abbreviated variable name ¹​calories
dim(data_training)
[1] 62 16

Let’s tuck away our validation (testing) set for future reference:

data_testing<-data[-train_ind,]
head(data_testing)
# A tibble: 6 × 16
  name        mfr   type  calor…¹ protein   fat sodium fiber carbo sugars potass
  <chr>       <chr> <chr>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1 100% Natur… Q     C         120       3     5     15     2     8      8    135
2 All-Bran w… K     C          50       4     0    140    14     8      0    330
3 Cinnamon T… G     C         120       1     3    210     0    13      9     45
4 Cocoa Puffs G     C         110       1     1    180     0    12     13     55
5 Count Choc… G     C         110       1     1    180     0    12     13     65
6 Fruity Peb… P     C         110       1     1    135     0    13     12     25
# … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
#   cups <dbl>, rating <dbl>, and abbreviated variable name ¹​calories
dim(data_testing)
[1] 15 16
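
As a quick sanity check (a sketch, not part of the original workflow), we can confirm that the two pieces partition the 77 cereals with no overlap:

nrow(data_training)+nrow(data_testing)==n                 # should be TRUE
length(intersect(data_training$name,data_testing$name))   # should be 0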

A thorough investigation of the relationships between predictor and response variables involves a number of steps, but for illustrative purposes, let’s look at how three specific variables (sugars, protein, and potass) perform as predictors of rating:

fit<-lm(rating~sugars+protein+potass, data=data_training)
summary(fit)

Call:
lm(formula = rating ~ sugars + protein + potass, data = data_training)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.996  -5.285  -1.587   3.772  15.560 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.280050   3.219019  16.241  < 2e-16 ***
sugars      -2.298867   0.233191  -9.858 5.25e-14 ***
protein     -0.006724   1.074579  -0.006  0.99503    
potass       0.066174   0.016601   3.986  0.00019 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.145 on 58 degrees of freedom
Multiple R-squared:  0.6868,    Adjusted R-squared:  0.6706 
F-statistic: 42.39 on 3 and 58 DF,  p-value: 1.234e-14
plot(fit)

Let’s discuss the numerical output and diagnostic plots:

6.1 Model Output

We can see that our multiple R-squared is 0.6868 (adjusted R-squared 0.6706). This indicates that roughly 69% of the variation in rating can be explained by sugars, protein, and potass. Moreover, the F statistic yields significance for the overall model, with a p-value of \(1.234\times10^{-14}\). These facts are strong evidence that we gain much by incorporating the \(x\) variables in our inference process, instead of using \(y\) alone in univariate inference.
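
If we would rather pull these quantities out of the fitted object than read them off the printout, summary.lm stores them directly (a quick sketch):

s<-summary(fit)
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
s$fstatistic       # F statistic with its numerator and denominator df
coef(fit)          # fitted regression coefficients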

  1. State the fitted least squares regression equation in terms of rating, sugars, protein, and potass.

  2. How would you interpret the coefficient for the variable sugars? Is this what you expected?

6.1.1 Residual vs Fitted Plot

The plot of the residuals vs the fitted values includes a red lowess smooth, a locally weighted regression estimate that plays the same role for a regression relationship that a kernel density estimate plays for a distribution. We can see a curvilinear pattern, which is bad news: it suggests that the assumed linear mean structure may be misspecified, and it also raises doubts about the homoscedasticity (constant variance) assumption.
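
Each diagnostic panel can also be drawn on its own via the which argument of plot.lm, which is handy when we want to focus on one assumption at a time:

plot(fit,which=1)   # residuals vs fitted
plot(fit,which=2)   # normal Q-Q
plot(fit,which=3)   # scale-location
plot(fit,which=5)   # residuals vs leverage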

6.1.2 Normal Q-Q

The normal quantile-quantile plot takes the residuals and evaluates how well their quantiles follow the expected pattern of normally distributed variables. Normally distributed observations would fall very close to the line, so the deviations we see (particularly below \(-1\) and especially above \(1\) on the horizontal axis) are alarming indicators of a lack of normality. Unfortunately, this calls into question the appropriateness of the tests in the model output, since all of the distributional results (t, F) rely on the normality assumption.
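
If a formal check is desired to complement the visual assessment, one option (a supplementary step, not part of the analysis above) is the Shapiro-Wilk test applied to the residuals:

shapiro.test(residuals(fit))   # tests the null hypothesis that the residuals are normal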

6.1.3 Scale-Location Plot

This plot gives a clearer picture of whether the homoscedasticity assumption holds. The non-random appearance, emphasized by the curvilinear trend, confirms our suspicions from the residuals vs fitted plot.

6.1.4 Residuals vs Leverage

Occasionally our data contain outliers that drastically impact our analysis. These should be investigated, because sometimes they arise from a mistake in the data collection or recording process and, for that or similar reasons, should be removed. In this plot, we are looking for observations in the upper or lower right corners, beyond the Cook’s distance contours. Since there are none, no observations in our data set are exerting undue influence on our inference.
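
The quantities behind this plot can also be inspected numerically; the 4/n cutoff below is just one common rule of thumb for flagging influential points, not part of the original analysis:

hatvalues(fit)                    # leverages
cd<-cooks.distance(fit)           # Cook's distances
which(cd>4/nrow(data_training))   # rows exceeding the rule-of-thumb cutoff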

Due to the lack of normality, we can apply the Box-Cox procedure to look for a power transformation of the response. The following will produce a visual of a feasible range of transformations, as well as an approximate 95% confidence interval for the optimal one:
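
A minimal call that produces such a plot is the plotting (default) form of the boxcox call used below:

MASS::boxcox(fit)   # plotit=TRUE by default; draws the profile log-likelihood with a 95% interval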

To pinpoint the actual input \(\lambda\) that achieves the optimal transformation, we set the plotit argument to FALSE, which returns the grid of \(\lambda\) values and their log-likelihoods instead of the plot:

MASS::boxcox(fit,plotit=F)->BC
BC
$x
 [1] -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6
[16] -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
[31]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0

$y
 [1] -50.29084 -47.53700 -44.85793 -42.25802 -39.74188 -37.31435 -34.98044
 [8] -32.74535 -30.61436 -28.59282 -26.68603 -24.89920 -23.23734 -21.70515
[15] -20.30693 -19.04649 -17.92704 -16.95110 -16.12043 -15.43598 -14.89786
[22] -14.50533 -14.25683 -14.14996 -14.18163 -14.34808 -14.64497 -15.06750
[29] -15.61052 -16.26861 -17.03619 -17.90760 -18.87723 -19.93950 -21.08900
[36] -22.32052 -23.62905 -25.00983 -26.45838 -27.97049 -29.54223

This produces a list, which is an object in R that provides a convenient way of storing objects of different classes together. Let’s identify which value of \(\lambda\) provides the largest log-likelihood:

which(BC$y==max(BC$y),arr.ind=T)
[1] 24
BC$x[24]
[1] 0.3
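
The approximate 95% interval drawn on the Box-Cox plot can be recovered from this list as well: it consists of the \(\lambda\) values whose log-likelihood lies within \(\chi^2_{1,0.95}/2\) of the maximum (a sketch, not part of the original output):

cutoff<-max(BC$y)-qchisq(0.95,1)/2
range(BC$x[BC$y>cutoff])   # approximate 95% confidence limits for lambda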

The maximum occurs at position 24 of the grid, i.e. \(\lambda=0.3\). The neighboring value \(\lambda=0.4\) has an essentially identical log-likelihood (\(-14.18\) versus \(-14.15\)) and lies comfortably inside the approximate 95% interval, so we will work with the \(\lambda=0.4\) power transformation. (Raising rating to the 0.4 power differs from the formal Box-Cox transform \((y^{\lambda}-1)/\lambda\) only by a linear rescaling, so it yields the same fit quality and diagnostics.) Let’s see how this changes our model output and diagnostic plots:

mutate(data_training,rating.4=rating^0.4)->data_training_4
fit2<-lm(rating.4~sugars+protein+potass, data=data_training_4)
summary(fit2)

Call:
lm(formula = rating.4 ~ sugars + protein + potass, data = data_training_4)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59283 -0.19873 -0.03096  0.16583  0.64201 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.8110931  0.1318155  36.499  < 2e-16 ***
sugars      -0.0954711  0.0095489  -9.998 3.13e-14 ***
protein      0.0155851  0.0440029   0.354 0.724486    
potass       0.0026428  0.0006798   3.888 0.000263 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2926 on 58 degrees of freedom
Multiple R-squared:  0.6991,    Adjusted R-squared:  0.6835 
F-statistic: 44.91 on 3 and 58 DF,  p-value: 3.887e-15
plot(fit2)

Here we see slight improvements in the \(R^2\) value, in the overall model significance from the F test, and in the appearance of the diagnostic plots.
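
When we return to the held-out testing data, the comparison could look something like the sketch below: predict on the \(rating^{0.4}\) scale, back-transform, and summarize the prediction error (RMSE is just one reasonable choice of metric; this step is not carried out above):

pred.4<-predict(fit2,newdata=data_testing)    # predictions on the rating^0.4 scale
pred<-pred.4^(1/0.4)                          # back-transform to the original rating scale
sqrt(mean((data_testing$rating-pred)^2))      # root mean squared prediction error on the test set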