Statistical Data Science has been characterized in a number of ways and with a myriad of diagrams. As George Box famously quipped, “all models are wrong”, and that includes the model we use to explain what data science is. That being said, it is helpful to summarize data science as the intersection of three fields:
Domain expertise will be context dependent, so this chapter is an effort to survey the necessary mathematics and computer science content that makes growth in data science possible.
To typeset mathematical text, we will be using the LaTeX system. Due to its decades of wide adoption in the scientific community, LaTeX has been natively incorporated through markdown into the most common coding environments, including Jupyter notebooks, Google Colab, and RStudio. Tutorialspoint offers a useful online LaTeX editor that includes a dictionary of common mathematical notation and expressions at the bottom of the page, which can expedite your typesetting.
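For example, placing the source \frac{3x^{10} - 7x^2}{6 - 3x} between math delimiters produces the fraction that appears a few lines below, and commands such as \sqrt, \sum, and \beta produce the corresponding symbols.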
So what concepts from mathematics and statistics will be necessary to review before we jump into data science and machine learning?
Input and Output
\(y=f(x)\)
Algebraic vs Transcendental - an algebraic function is a function that can be built from the basic arithmetic operations (addition, subtraction, multiplication, division) together with integer powers and roots; a transcendental function is one that cannot.
EXAMPLE:
\(y=\frac{3x^{10} -7x^2}{6-3x}\) is algebraic
\(y=\log x\) is transcendental
Calculus
Derivatives and Integrals
What other concepts will we need?
Optimization is the process of finding the minimums and/or maximums for a function on a given domain. For a two variable function, the minimum is at the point \((x_0, y_0)\) if \(f(x_0, y_0) \leq f(x,y)\) for all points \((x,y)\) within some range. Similarly, the maximum of a two variable function is at the point \((x_0, y_0)\) if \(f(x_0, y_0) \geq f(x,y)\) for all points \((x,y)\) within some range. These extremes can be local (holding true only within a certain range in the domain) or global (holding true for the entire domain of \(f\)).
When optimizing functions of a single variable, we look for critical points where the derivative of the function equals zero or does not exist. These critical points are the locations of potential maximums or minimums, but further checking is required to verify that a critical point is an extreme. A critical point can also be neither a maximum nor a minimum, in which case it is referred to as a saddle point.
To find a critical point of a single variable function, we differentiate with respect to the variable and set the derivative equal to zero. For example, if \(f(x) = x^2 + 1\), then \(f'(x) = 2x = 0\), meaning the critical point is at \(x=0\).
To verify whether this point is a maximum or minimum, we can use the second derivative test: evaluate the second derivative at the critical point. If the result is greater than zero, the point is a local minimum. If the result is less than zero, the point is a local maximum. If the result is zero, the test is inconclusive. Continuing the earlier example, the second derivative of \(f(x)=x^2+1\) is \(f''(x)=2 > 0\). Therefore, the critical point at \(x=0\) is a minimum. This can be confirmed visually as well.
Let us look at another function: \(f(x)=x^3\) .
Step 1: Take first derivative and set equal to zero to find critical point.
\(f'(x)=3x^2 = 0\)
Critical point at \(x=0\).
Step 2: Take the second derivative and evaluate it at the critical point. If the result is greater than zero, the point is a minimum. If the result is less than zero, the point is a maximum. If the result is zero, the test fails.
\(f''(x)=6x\), which equals zero at the critical point \(x=0\). Therefore, the test fails.
Graphing \(f(x)=x^3\) reveals the point at \(x=0\) to indeed be a saddle point.
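These checks can also be carried out in software. Below is a minimal sketch using R's symbolic differentiation function D() on the two examples above.
f1 <- expression(x^2 + 1)
D(f1, "x")                # first derivative: 2 * x, so the critical point is at x = 0
D(D(f1, "x"), "x")        # second derivative: 2, which is positive, so x = 0 is a minimum
f2 <- expression(x^3)
D(D(f2, "x"), "x")        # prints 3 * (2 * x), i.e. 6x, which is 0 at x = 0, so the test fails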
The process for optimizing two variable functions is similar to that of one variable functions, just with a couple of extra steps.
For a function \(z=f(x,y)\), the point \((x_0, y_0)\) is called a critical point if either:
i. \(f_x(x_0, y_0) = 0\) and \(f_y(x_0, y_0) = 0\), or
ii. \(f_x(x_0, y_0)\) or \(f_y(x_0, y_0)\) does not exist.
To find the critical points of a function \(f(x,y)\), we take the partial derivative with respect to both \(x\) and \(y\) , set the derivatives equal to zero, then solve the resulting system of equations.
Example:
\[ f(x,y) = 2x^2 - 4x + xy - y^2 + 3y -6 \\ f_x(x,y)=4x - 4 + y \\ f_y(x,y)= x -2y + 3 \]
\[ 4x-4 + y=0 \\ x-2y+3=0 \\ \]
\[ y=-4x+4\\ x-2(-4x+4) + 3=0 \\ x+8x-8+3=0 \\ 9x - 5 = 0 \\ 9x=5 \\ x=5/9 \]
\[ y=-4(5/9)+4 \\ y=-20/9 + 36/9 \\ y=16/9 \]
Therefore, the critical point is at \((5/9, 16/9)\).
To determine if a critical point is a maximum or minimum, we can use the second partials test described below:
For a function \(z=f(x,y)\) which has continuous first and second partial derivatives on some disk containing the point \((x_0, y_0)\), the discriminant \(D\) is found by the following equation:
\[ D = f_{xx}(x_0,y_0)f_{yy}(x_0,y_0) - (f_{xy}(x_0,y_0))^2 \]
Using D, we can determine the concavity of the function.
i. If \(D>0\) and \(f_{xx}(x_0,y_0) > 0\), then \(f\) is concave up at the critical point, meaning \(f\) has a local minimum at \((x_0,y_0)\).
ii. If \(D>0\) and \(f_{xx}(x_0,y_0) < 0\), then \(f\) is concave down at the critical point, meaning \(f\) has a local maximum at \((x_0,y_0)\).
iii. If \(D<0\), then \(f\) has a saddle point at \((x_0,y_0)\).
iv. If \(D=0\), then the test is inconclusive.
Applying this test to our previous example with critical point at \((5/9, 16/9)\) goes as follows:
Example continued:
\[ f_x(x,y)=4x - 4 + y \\ f_{xx}(x,y)=4 \] \[ f_y(x,y)= x -2y + 3 \\ f_{yy}(x,y)=-2\\ \] \[ f_{xy}(x,y)=1 \] \[ D = f_{xx}(5/9, 16/9)f_{yy}(5/9, 16/9) - (f_{xy}(5/9, 16/9))^2 \\ D = (4)(-2) - (1)^2 \\ D = -8 - 1 \\ D = -9 \]
As \(D\) is less than zero, we can deduce that at \((5/9, 16/9)\), the function has a saddle point. This can be confirmed visually as well below (equation was visualized in GeoGebra).
Source: https://math.libretexts.org/Courses/Monroe_Community_College/MTH_212_Calculus_III/Chapter_13%3A_Functions_of_Multiple_Variables_and_Partial_Derivatives/13.8%3A_Optimization_of_Functions_of_Several_Variables
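As a numerical check of this example, the sketch below uses R's D() to build the second-order partials of \(f(x,y) = 2x^2 - 4x + xy - y^2 + 3y - 6\) and evaluate the discriminant at the critical point.
f <- expression(2 * x^2 - 4 * x + x * y - y^2 + 3 * y - 6)
fxx <- D(D(f, "x"), "x")              # second partial with respect to x, twice
fyy <- D(D(f, "y"), "y")              # second partial with respect to y, twice
fxy <- D(D(f, "x"), "y")              # mixed partial
x <- 5/9; y <- 16/9                   # the critical point found above
eval(fxx) * eval(fyy) - eval(fxy)^2   # the discriminant D
[1] -9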
(Linear) Approximation - Taylor series
Linear Algebra
Decompositions - (SVD, QR)
Matrix manipulation
Tensors
Statistics/Probability
Bias-Variance tradeoff
Regression/Classification
Regression analysis is the use of statistical methods to estimate one or more dependent variables (also known as response or outcome variables) based on the values of one or more independent variables (also known as predictors, explanatory variables or covariates). A regression model represents this statistical relationship between the response variable(s) Y and the independent variable(s) X, with Y varying in response to X in a systematic fashion. This relation will not be perfect, as data points will generally be scattered around the curve of relation.
The primary purposes of a regression model are to predict new values for an outcome variable of interest and to discover potential causal relationships between an outcome and its set of predictors.
Two important characteristics of a regression model are that (1) there is a probability distribution for Y for each level of X, and (2) the means of these probability distributions vary systematically with X.
A basic regression model with one predictor and one response variable and a linear relationship takes the following form:
\[ Y_i = \beta_0 + \beta_1X_i + \varepsilon_i \]
where:
\(Y_i\) is the value of the response variable in the \(i\)th trial.
\(\beta_0\) is the intercept parameter.
\(\beta_1\) is the parameter that relates X to Y (can be considered as the slope in this particular linear example).
\(X_i\) is the known value of the predictor variable from the \(i\)th trial.
\(\varepsilon_i\) is a random error term from the \(i\)th trial with mean \(E\{\varepsilon_i\} = 0\) and variance \(\sigma^2\{\varepsilon_i\}=\sigma^2\).
To build this model, the beta parameters need to be estimated in some fashion. One approach for doing so is to use the method of least squares. In this method, the total squared differences between the observed data points and the hypothetical regression “line of best fit” are minimized. In other words, beta values are chosen which generate the line that will best run through the center of the data.
Using the simple linear regression model example above, the criterion \(Q\) that we are trying to minimize with this least squares method is expressed as:
\[ Q = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_i)^2 \]
with \(n\) being the number of observations in the regression data set.
Solving the normal equations
\[ \sum Y_i=nb_0+b_1\sum X_i \]
\[ \sum X_i Y_i=b_0\sum X_i + b_1\sum X_i^2 \]
simultaneously provides values for \(b_0\) and \(b_1\), which are unbiased estimators of the actual parameters \(\beta_0\) and \(\beta_1\). While this calculation can be done by hand, using a computer is significantly more efficient.
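To illustrate, here is a minimal sketch that sets up and solves the two normal equations in R; the x and y values are hypothetical, made up purely for the example.
x <- c(1, 2, 3, 4, 5)              # hypothetical predictor values
y <- c(2.1, 2.9, 4.2, 4.8, 6.1)    # hypothetical responses
A <- matrix(c(length(x), sum(x),
              sum(x),    sum(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))
solve(A, rhs)                      # b0 and b1
[1] 1.05 0.99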
Simple Linear Point Estimation Example
Once a regression model is created from the data, point estimates for Y can be generated based on values of X.
For example, if solving the normal equations gave us \(b_0=3\) and \(b_1=0.5\), we would have the equation \(\hat{Y} = 3 + 0.5X\). Plugging a value of 2 in for the predictor variable X gives us \(\hat{Y} = 3 + 0.5(2) = 4\). For this relation, we would estimate a value of 4 for our response given a value of 2 for our predictor.
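The same estimates and point predictions can be obtained with R's lm() and predict(). The sketch below reuses the hypothetical data from the previous sketch rather than the values that produced \(b_0=3\) and \(b_1=0.5\).
x <- c(1, 2, 3, 4, 5); y <- c(2.1, 2.9, 4.2, 4.8, 6.1)  # same hypothetical data as above
fit <- lm(y ~ x)                                        # least squares fit of y on x
coef(fit)                                               # intercept 1.05 and slope 0.99, matching the normal equations
predict(fit, newdata = data.frame(x = 2))               # 1.05 + 0.99 * 2 = 3.03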
The example above is the simplest case of regression. Regression can include numerous predictors, several response variables, and even have a nonlinear relation.
Categorical Variables
With the proper adjustments made to the basic linear regression model provided above, the predictor and response variables can accommodate not just numeric data, but categorical data as well.
A technique for writing a regression model with a categorical predictor variable is to treat each level as its own indicator variable. Each of these variables then takes on a value of either 0 (if the observation does not exhibit that level of the categorical variable) or 1 (if it does).
For a categorical variable with \(m\) levels, only \(m-1\) indicator variables are needed to fully express the original variable in the model; setting all \(m-1\) indicators to 0 represents the remaining level, called the reference level.
Categorical Variable Regression Example
For example, a model that regresses GPA \(Y\) on grade level \(X\) (Freshman, Sophomore, Junior, Senior) could look as follows:
\[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \varepsilon_i \]
Here:
\(Y_i\) is the GPA value for the \(i\)th student observed.
Freshman is the reference level of the grade level variable \(X\), so if a student is a freshman then \(X_{i1}\), \(X_{i2}\), and \(X_{i3}\) will all be 0.
\(X_{i1}\) is 1 if the \(i\)th student is a sophomore, and 0 otherwise.
\(X_{i2}\) is 1 if the \(i\)th student is a junior, and 0 otherwise.
\(X_{i3}\) is 1 if the \(i\)th student is a senior, and 0 otherwise.
\(\beta_0\) is the intercept parameter. As freshman is the reference level and there are no other predictors besides grade, this is also the estimated mean GPA for freshmen.
\(\beta_1\) is the parameter that relates being a sophomore to GPA.
\(\beta_2\) is the parameter that relates being a junior to GPA.
\(\beta_3\) is the parameter that relates being a senior to GPA.
\(\varepsilon_i\) is a random error term for the \(i\)th observation with mean \(E\{\varepsilon_i\} = 0\) and variance \(\sigma^2\{\varepsilon_i\}=\sigma^2\).
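In R, this dummy coding happens automatically when the categorical predictor is stored as a factor. Here is a minimal sketch using made-up GPA values:
grade <- factor(c("Freshman", "Sophomore", "Junior", "Senior", "Freshman", "Junior"),
                levels = c("Freshman", "Sophomore", "Junior", "Senior"))
gpa <- c(3.1, 3.4, 3.0, 3.6, 2.9, 3.3)   # hypothetical GPA values
fit <- lm(gpa ~ grade)
model.matrix(fit)   # intercept column plus three 0/1 indicator columns; freshman is the reference level
coef(fit)           # intercept is the freshman mean; the other coefficients are differences from it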
There is much more to regression analysis than covered here. It is possible to model response variables that are binary or follow a Poisson distribution for instance, but the techniques are more complex.
Sources:
https://en.wikipedia.org/wiki/Regression_analysis
Kutner, Michael, et al. Applied Linear Regression Models. 4th ed., McGraw-Hill, 2004.
Likelihood
Probability distributions (Normal, etc)
Randomization
Time Series, Markov chains
Fact or fiction: In theory, any mathematical or statistical computation could be done by hand.
The above statement is true… almost. If our inputs are rational numbers (meaning they can be written as a ratio of two integers), and we are dealing with an expression that is algebraic - meaning it can be represented as a composition of the basic arithmetic operations (addition, subtraction, multiplication, division) along with integer powers and roots - then we can do it by hand.
EXAMPLE 1: Calculate \(\frac{5^2-\frac{1}{2}}{3(4+\sqrt[3]{27})}\) by hand.
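Working it out step by step:
\[ \frac{5^2-\frac{1}{2}}{3(4+\sqrt[3]{27})} = \frac{25-\frac{1}{2}}{3(4+3)} = \frac{49/2}{21} = \frac{49}{42} = \frac{7}{6} \]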
However, once we transcend these operations, even the simplest calculation becomes impossible:
EXAMPLE 2: Calculate \(\sin\frac{\pi}{5}\) by hand.
True to form, here we are knee deep into the section on computer science and still talking math. Don’t worry, the coding part is coming. But it is important to explain what exactly we need computers for. That will help us understand why computing power has been so instrumental in pushing mathematical and scientific ideas as far as they’ve come.
So about \(\sin\frac{\pi}{5}\). If we ask R to calculate it, we get
sin(pi/5)
[1] 0.5877853
In Python,
import math
pi = math.pi
math.sin(pi / 5)
0.5877852522924731
If a computer is programmed by humans, how did the computer know the value of \(\sin\frac{\pi}{5}\)?
Taylor series approximation
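To see the idea, here is a minimal sketch that sums the first several terms of the Taylor series for sine at \(x=\pi/5\). It is only an illustration of the principle, not the routine R actually uses internally.
x <- pi / 5
k <- 0:5                                             # first six terms of the series
sum((-1)^k * x^(2 * k + 1) / factorial(2 * k + 1))   # x - x^3/3! + x^5/5! - ...
[1] 0.5877853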
In the marketplace, Python is by far the most commonly used language for exploring and mining data, as well as for training, testing, and deploying machine learning models. It is fast becoming the standard in academic circles as well. That being said, other programming platforms still have a die-hard base of support in a variety of domains (e.g., SPSS in social and health sciences, Stata in economics, SAS in medicine). R remains popular in statistics circles because it was built by statisticians for statisticians, and it has kept in step with modern advances in data science and machine learning. Many Python libraries are essentially R emulators (and vice versa). It is good practice for aspiring data scientists to be familiar with both.
Both R and Python start with base versions and are built up by installing crowd-sourced packages.
install.packages("tidyverse")
Installing tidyverse [1.3.2] ...
OK [linked cache]
install.packages("tidymodels")
Installing tidymodels [1.0.0] ...
OK [linked cache]
install.packages("reticulate")
Installing reticulate [1.28] ...
OK [linked cache]
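Once installed, a package is loaded into each R session with library() before its functions can be used:
library(tidyverse)   # data wrangling and visualization
library(tidymodels)  # modeling and machine learning framework
library(reticulate)  # run Python from within R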