── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0 ✔ dplyr 1.1.0
✔ tibble 3.1.8 ✔ stringr 1.5.0
✔ tidyr 1.3.0 ✔ forcats 0.5.2
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Rows: 77 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, mfr, type
dbl (13): calories, protein, fat, sodium, fiber, carbo, sugars, potass, vita...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
5 Predictive Modeling in R
6 Linear Regression
As discussed in class, linear regression is the go-to tool when our task requires us to learn about the values and variation in a continuous response variable (\(y\)), given the values of related explanatory or predictor variables, \(x_1,x_2,\ldots,x_{p-1}\). Here \(y\) is the dependent variable, and the \(x_j\)'s are the independent variables. The assumption is that there is some genuine relationship between these variables, and based on the data we collect, we would like to use the tools of statistical analysis to effectively estimate this relationship and evaluate the performance of our estimator. The general form of the multiple linear regression model is as follows: \[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\ldots+\beta_{p-1}x_{i,p-1}+\epsilon_i=\mu_i+\epsilon_i\]
for \(i=1,\ldots, n\), where \(\mu_i = E[y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_{p-1} x_{i,p-1}\),
\(y_i=\) value of response for individual \(i\)
\(x_{ij}=\) value of \(j^{th}\) predictor for individual \(i\)
\(\epsilon_i=\) random component (error term) for individual \(i\)
\(\beta_j=\) parameters that characterize the relationship between the predictors and response.
We can express this model in matrix notation as \[\vec{y} = X\vec{\beta} + \vec{\epsilon}.\]
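For reference, the stacked components of this matrix form are (with \(X\) the \(n \times p\) design matrix whose first column is all ones):
\[
\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1,p-1} \\ 1 & x_{21} & \cdots & x_{2,p-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{n,p-1} \end{bmatrix}, \quad
\vec{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \quad
\vec{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}.
\]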
Based on sample data \((X,\vec{y})\), we seek to identify the values of the parameters \(\vec{\beta}\) that minimize the sum of squared error criterion:
\[ Q(\beta_0, \beta_1 , \beta_2 , \ldots, \beta_{p-1} ) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_{p-1} x_{i,p-1})^2 \]
Denote these optimal statistics as \(\hat{\vec{\beta}}\). Let’s see this process in action for actual data.
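As a quick illustration of what minimizing \(Q\) produces, here is a minimal R sketch (on simulated data, not the cereal data) of the normal-equations solution \(\hat{\vec{\beta}} = (X^TX)^{-1}X^T\vec{y}\), checked against lm():
# Sketch only (simulated data): the minimizer of Q solves the normal
# equations (X^T X) beta = X^T y.
set.seed(1)
m  <- 20
x1 <- rnorm(m); x2 <- rnorm(m)
y  <- 3 + 2*x1 - x2 + rnorm(m)            # simulated response
X  <- cbind(1, x1, x2)                    # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y) # solves the normal equations
beta_hat
coef(lm(y ~ x1 + x2))                     # lm() returns the same estimates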
The cereal.csv file on Kaggle lists a number of cereals (can you find your favorite?), along with variables that describe the nutritional and other qualities of each. The target variable here is a rating of each cereal (most likely from Consumer Reports). A listing of the variables is below:
name: name of the cereal
mfr: Manufacturer of cereal
A = American Home Food Products; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
type: cold or hot
calories: calories per serving
protein: grams of protein
fat: grams of fat
sodium: milligrams of sodium
fiber: grams of dietary fiber
carbo: grams of complex carbohydrates
sugars: grams of sugars
potass: milligrams of potassium
vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of the FDA recommended daily amount
shelf: display shelf (1, 2, or 3, counting from the floor)
weight: weight in ounces of one serving
cups: number of cups in one serving
rating: a rating of the cereals (Possibly from Consumer Reports?)
Let’s load the data and start to explore.
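The load step itself isn’t shown above; a minimal sketch, assuming the file is saved as cereal.csv in the working directory:
# Assumed load step: produces the attach messages and column
# specification shown at the top of this section.
library(tidyverse)
data <- read_csv("cereal.csv")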
head(data)
# A tibble: 6 × 16
name mfr type calor…¹ protein fat sodium fiber carbo sugars potass
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100% Bran N C 70 4 1 130 10 5 6 280
2 100% Natur… Q C 120 3 5 15 2 8 8 135
3 All-Bran K C 70 4 1 260 9 7 5 320
4 All-Bran w… K C 50 4 0 140 14 8 0 330
5 Almond Del… R C 110 2 2 200 1 14 8 -1
6 Apple Cinn… G C 110 2 2 180 1.5 10.5 10 70
# … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
# cups <dbl>, rating <dbl>, and abbreviated variable name ¹calories
In preparation for the model validation process, let’s hold out a random subset of our data so that we can later validate the results we obtain from the majority of the observations. Let’s use an 80%-20% split.
n <- dim(data)[1]
n
[1] 77
0.8*n
[1] 61.6
Randomly sample 0.8*n observations to build the model:
train_proportion <- 0.8
nb <- ceiling(train_proportion*n)
set.seed(11)
train_ind <- sample(seq(1, 77, 1), nb)
data_training <- data[train_ind, ]
head(data_training)
# A tibble: 6 × 16
name mfr type calor…¹ protein fat sodium fiber carbo sugars potass
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Grape-Nuts P C 110 3 0 170 3 17 3 90
2 Puffed Whe… Q C 50 2 0 0 1 10 0 50
3 Froot Loops K C 110 2 1 125 1 11 13 30
4 Corn Chex R C 110 2 0 280 0 22 3 25
5 Honey Nut … G C 110 3 1 250 1.5 11.5 10 90
6 Raisin Nut… G C 100 3 2 140 2.5 10.5 8 140
# … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
# cups <dbl>, rating <dbl>, and abbreviated variable name ¹calories
dim(data_training)
[1] 62 16
Let’s tuck away our validation (testing) set for future reference:
data_testing <- data[-train_ind, ]
head(data_testing)
# A tibble: 6 × 16
name mfr type calor…¹ protein fat sodium fiber carbo sugars potass
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100% Natur… Q C 120 3 5 15 2 8 8 135
2 All-Bran w… K C 50 4 0 140 14 8 0 330
3 Cinnamon T… G C 120 1 3 210 0 13 9 45
4 Cocoa Puffs G C 110 1 1 180 0 12 13 55
5 Count Choc… G C 110 1 1 180 0 12 13 65
6 Fruity Peb… P C 110 1 1 135 0 13 12 25
# … with 5 more variables: vitamins <dbl>, shelf <dbl>, weight <dbl>,
# cups <dbl>, rating <dbl>, and abbreviated variable name ¹calories
dim(data_testing)
[1] 15 16
A thorough investigation of the relationships between predictor and response variables involves a number of steps, but for illustrative purposes, let’s look at how three specific variables (sugars, protein, and potass) perform as predictors of rating:
fit <- lm(rating ~ sugars + protein + potass, data = data_training)
summary(fit)
Call:
lm(formula = rating ~ sugars + protein + potass, data = data_training)
Residuals:
Min 1Q Median 3Q Max
-10.996 -5.285 -1.587 3.772 15.560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.280050 3.219019 16.241 < 2e-16 ***
sugars -2.298867 0.233191 -9.858 5.25e-14 ***
protein -0.006724 1.074579 -0.006 0.99503
potass 0.066174 0.016601 3.986 0.00019 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.145 on 58 degrees of freedom
Multiple R-squared: 0.6868, Adjusted R-squared: 0.6706
F-statistic: 42.39 on 3 and 58 DF, p-value: 1.234e-14
plot(fit)
Let’s discuss the numerical output and diagnostic plots:
6.1 Model Output
We can see that our multiple R-squared is 0.6868 (adjusted R-squared 0.6706). This indicates that about 69% of the variation in rating can be explained by sugars, protein, and potassium. Moreover, the F statistic yields significance for the overall model, with a p-value of \(1.234\times10^{-14}\). These facts are strong evidence that we gain much by incorporating the \(x\) variables in our inference process, instead of using \(y\) alone in univariate inference.
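As an aside (not part of the original walkthrough), these quantities can be pulled directly from the summary object:
s <- summary(fit)
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F statistic with its degrees of freedom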
State the fitted least squares regression equation in terms of ratings, sugars, protein, and potass
How would you interpret the coefficient for the variable sugars? Is this what you expected?
6.1.1 Residual vs Fitted Plot
The plot of the residuals vs the fitted values includes a red lowess smooth, a nonparametric regression estimate (loosely, the regression analogue of a kernel density estimate). We can see a curvilinear pattern, which is bad news: it suggests the linear mean structure may be misspecified, and it raises the possibility that the homoscedasticity (constant variance) assumption is violated as well.
6.1.2 Normal Q-Q
The normal quantile-quantile plot takes the residuals and evaluates how well their quantiles follow the expected pattern for normally distributed variables. Normally distributed observations would fall very close to the line, so the deviations we see in the tails (particularly before -1 and especially after 1 on the horizontal axis) are alarming indicators of a lack of normality. Unfortunately, this calls into question the appropriateness of the tests in the model output, since all of the distributional results (t, F) rest on the normality assumption.
6.1.3 Scale-Location Plot
This plot gives a clearer picture of whether the homoscedasticity assumption holds. The non-random appearance, emphasized by the curvilinear trend, confirms our suspicions from the residual plot.
6.1.4 Residuals vs Leverage
Occasionally our data contain outliers that drastically impact our analysis. These should be investigated: sometimes they are due to a mistake in the data collection or recording process, or should be removed for similar reasons. In this plot, we are looking for observations in the upper or lower right corners (high leverage combined with a large residual). Since there are none, no observations in our data set are exerting undue influence on our inference.
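If you would rather inspect one panel at a time, base R’s plot() method for lm objects accepts a which argument (this is standard plot.lm behavior, not something specific to this analysis):
plot(fit, which = 1)  # residuals vs fitted
plot(fit, which = 2)  # normal Q-Q
plot(fit, which = 3)  # scale-location
plot(fit, which = 5)  # residuals vs leverage, with Cook's distance contours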
Due to the lack of normality, we can look for a Box-Cox transformation of the response. The following will produce a visual of a feasible range of transformations, as well as an approximate 95% confidence interval for the optimal one:
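The plotting chunk isn’t shown above; a minimal sketch of that call, which draws the profile log-likelihood with a dashed approximate 95% interval by default:
MASS::boxcox(fit)  # profile log-likelihood over lambda in [-2, 2]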
To pinpoint the actual input \(\lambda\) that achieves the optimal transformation, we set the plotit argument to FALSE:
MASS::boxcox(fit, plotit = FALSE) -> BC
BC
$x
[1] -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6
[16] -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
[31] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
$y
[1] -50.29084 -47.53700 -44.85793 -42.25802 -39.74188 -37.31435 -34.98044
[8] -32.74535 -30.61436 -28.59282 -26.68603 -24.89920 -23.23734 -21.70515
[15] -20.30693 -19.04649 -17.92704 -16.95110 -16.12043 -15.43598 -14.89786
[22] -14.50533 -14.25683 -14.14996 -14.18163 -14.34808 -14.64497 -15.06750
[29] -15.61052 -16.26861 -17.03619 -17.90760 -18.87723 -19.93950 -21.08900
[36] -22.32052 -23.62905 -25.00983 -26.45838 -27.97049 -29.54223
This produces a list, which is an object in R that provides a convenient way of storing objects of different classes together. Let’s identify which value of \(\lambda\) provides the largest log-likelihood:
which(BC$y==max(BC$y),arr.ind=T)
[1] 24
BC[1]$x[25]
[1] 0.4
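Equivalently (an aside, not part of the original code), which.max() returns the maximizing index in one step:
which.max(BC$y)         # index of the largest log-likelihood
BC$x[which.max(BC$y)]   # the corresponding lambda on the grid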
The maximum occurs at position 24, which corresponds to \(\lambda=0.3\); position 25 gives the neighboring grid value \(\lambda=0.4\), whose log-likelihood (\(-14.18\)) is essentially identical to the maximum (\(-14.15\)). Proceeding with the \(\lambda=0.4\) power transformation, let’s see how this changes our model output and diagnostic plots:
mutate(data_training,rating.4=rating^0.4)->data_training_4
fit2 <- lm(rating.4 ~ sugars + protein + potass, data = data_training_4)
summary(fit2)
Call:
lm(formula = rating.4 ~ sugars + protein + potass, data = data_training_4)
Residuals:
Min 1Q Median 3Q Max
-0.59283 -0.19873 -0.03096 0.16583 0.64201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8110931 0.1318155 36.499 < 2e-16 ***
sugars -0.0954711 0.0095489 -9.998 3.13e-14 ***
protein 0.0155851 0.0440029 0.354 0.724486
potass 0.0026428 0.0006798 3.888 0.000263 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2926 on 58 degrees of freedom
Multiple R-squared: 0.6991, Adjusted R-squared: 0.6835
F-statistic: 44.91 on 3 and 58 DF, p-value: 3.887e-15
plot(fit2)
Here we see slight improvements in the \(R^2\) value, in the overall model significance from the F test, and in the appearance of the diagnostic plots.
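The held-out set was tucked away earlier; as a quick sketch (one possible use, not part of the original walkthrough), the transformed model could be scored on data_testing and back-transformed to the original rating scale:
# Sketch only: predict on the test set, undo the power transformation,
# and compute the root mean squared prediction error on the rating scale.
pred_t <- predict(fit2, newdata = data_testing)
pred_rating <- pred_t^(1/0.4)
sqrt(mean((data_testing$rating - pred_rating)^2))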