Statistical Data Science has been characterized in a number of ways and with a myriad of diagrams. As George Box famously quipped, “all models are wrong”, and that includes the model we use to explain what data science is. That being said, it is helpful to summarize data science as the intersection of three fields:
Domain expertise will be context dependent, so this chapter is an effort to survey the necessary mathematics and computer science content that makes growth in data science possible.
To typeset mathematical text, we will be using the LaTeX system. Due to its decades of wide adoption in the scientific community, LaTeX has been natively incorporated through markdown into the most common coding environments, including Jupyter notebooks, Google Colab, and RStudio. Tutorialspoint offers a useful online LaTeX editor that includes a dictionary of common mathematical notation and expressions at the bottom of the page, which can expedite your typesetting.
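For example, placing the source \frac{3x^{10} - 7x^2}{6 - 3x} between math delimiters produces the fraction that appears a few lines below, and commands such as \sqrt, \sum, and \beta produce the corresponding symbols.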
So what concepts from mathematics and statistics will be necessary to review before we jump into data science and machine learning?
Input and Output
\(y=f(x)\)
Algebraic vs Transcendental - an algebraic function is a function that can be built from the basic arithmetic operations (addition, subtraction, multiplication, division) together with integer powers and roots; a transcendental function is one that cannot.
EXAMPLE:
\(y=\frac{3x^{10} -7x^2}{6-3x}\) is algebraic
\(y=\log x\) is transcendental
Calculus
Derivatives and Integrals
What other concepts will we need?
Optimization is the process of finding the minimums and/or maximums for a function on a given domain. For a two variable function, the minimum is at the point \((x_0, y_0)\) if \(f(x_0, y_0) \leq f(x,y)\) for all points \((x,y)\) within some range. Similarly, the maximum of a two variable function is at the point \((x_0, y_0)\) if \(f(x_0, y_0) \geq f(x,y)\) for all points \((x,y)\) within some range. These extremes can be local (holding true only within a certain range in the domain) or global (holding true for the entire domain of \(f\)).
When optimizing functions of a single variable, we look for critical points where the derivative of the function equals zero or does not exist. These critical points are the locations of potential maximums or minimums, but further checking is required to verify that a critical point is an extreme. A critical point can also be neither a maximum nor a minimum, in which case it is referred to as a saddle point.
To find a critical point of a single variable function, we differentiate with respect to the variable and set the derivative equal to zero. For example, if \(f(x) = x^2 + 1\), then \(f'(x) = 2x = 0\), meaning the critical point is at \(x=0\).
To verify whether this point is a maximum or minimum, we can use the second derivative test: evaluate the second derivative at the critical point. If the result is greater than zero, the point is a local minimum. If the result is less than zero, the point is a local maximum. If the result is zero, the test is inconclusive. Continuing the earlier example, the second derivative of \(f(x)=x^2+1\) is \(f''(x)=2 > 0\). Therefore, the critical point at \(x=0\) is a minimum. This can be confirmed visually as well.
Let us look at another function: \(f(x)=x^3\) .
Step 1: Take first derivative and set equal to zero to find critical point.
\(f'(x)=3x^2 = 0\)
Critical point at \(x=0\).
Step 2: Take the second derivative and evaluate it at the critical point. If the result is greater than zero, the point is a minimum. If the result is less than zero, the point is a maximum. If the result is zero, the test fails.
\(f''(x)=6x\), which equals zero at the critical point \(x=0\). Therefore, the test fails.
Graphing \(f(x)=x^3\) reveals the point at \(x=0\) to indeed be a saddle point.
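These checks can also be carried out in software. Below is a minimal sketch using R's symbolic differentiation function D() on the two examples above.
f1 <- expression(x^2 + 1)
D(f1, "x")                # first derivative: 2 * x, so the critical point is at x = 0
D(D(f1, "x"), "x")        # second derivative: 2, which is positive, so x = 0 is a minimum
f2 <- expression(x^3)
D(D(f2, "x"), "x")        # prints 3 * (2 * x), i.e. 6x, which is 0 at x = 0, so the test fails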
The process for optimizing two variable functions is similar to that of one variable functions, just with a couple of extra steps.
For a function \(z=f(x,y)\), the point \((x_0, y_0)\) is called a critical point if either:
i. \(f_x(x_0, y_0) = 0\) and \(f_y(x_0, y_0) = 0\), or
ii. \(f_x(x_0, y_0)\) or \(f_y(x_0, y_0)\) does not exist.
To find the critical points of a function \(f(x,y)\), we take the partial derivative with respect to both \(x\) and \(y\) , set the derivatives equal to zero, then solve the resulting system of equations.
Example:
\[ f(x,y) = 2x^2 - 4x + xy - y^2 + 3y -6 \\ f_x(x,y)=4x - 4 + y \\ f_y(x,y)= x -2y + 3 \]
\[ 4x-4 + y=0 \\ x-2y+3=0 \\ \]
\[ y=-4x+4\\ x-2(-4x+4) + 3=0 \\ x+8x-8+3=0 \\ 9x - 5 = 0 \\ 9x=5 \\ x=5/9 \]
\[ y=-4(5/9)+4 \\ y=-20/9 + 36/9 \\ y=16/9 \]
Therefore, the critical point is at \((5/9, 16/9)\).
To determine if a critical point is a maximum or minimum, we can use the second partials test described below:
For a function \(z=f(x,y)\) which has continuous first and second partial derivatives on some disk containing the point \((x_0, y_0)\), the discriminant \(D\) is found by the following equation:
\[ D = f_{xx}(x_0,y_0)f_{yy}(x_0,y_0) - (f_{xy}(x_0,y_0))^2 \]
Using D, we can determine the concavity of the function.
i. If \(D>0\) and \(f_{xx}(x_0,y_0) > 0\), then \(f\) is concave up at the critical point, meaning \(f\) has a local minimum at \((x_0,y_0)\).
ii. If \(D>0\) and \(f_{xx}(x_0,y_0) < 0\), then \(f\) is concave down at the critical point, meaning \(f\) has a local maximum at \((x_0,y_0)\).
iii. If \(D<0\), then \(f\) has a saddle point at \((x_0,y_0)\).
iv. If \(D=0\), then the test is inconclusive.
Applying this test to our previous example with critical point at \((5/9, 16/9)\) goes as follows:
Example continued:
\[ f_x(x,y)=4x - 4 + y \\ f_{xx}(x,y)=4 \] \[ f_y(x,y)= x -2y + 3 \\ f_{yy}(x,y)=-2\\ \] \[ f_{xy}(x,y)=1 \] \[ D = f_{xx}(5/9, 16/9)f_{yy}(5/9, 16/9) - (f_{xy}(5/9, 16/9))^2 \\ D = (4)(-2) - (1)^2 \\ D = -8 - 1 \\ D = -9 \]
As \(D\) is less than zero, we can deduce that at \((5/9, 16/9)\), the function has a saddle point. This can be confirmed visually as well below (equation was visualized in GeoGebra).
Source: https://math.libretexts.org/Courses/Monroe_Community_College/MTH_212_Calculus_III/Chapter_13%3A_Functions_of_Multiple_Variables_and_Partial_Derivatives/13.8%3A_Optimization_of_Functions_of_Several_Variables
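As a numerical check of this example, the sketch below uses R's D() to build the second-order partials of \(f(x,y) = 2x^2 - 4x + xy - y^2 + 3y - 6\) and evaluate the discriminant at the critical point.
f <- expression(2 * x^2 - 4 * x + x * y - y^2 + 3 * y - 6)
fxx <- D(D(f, "x"), "x")              # second partial with respect to x, twice
fyy <- D(D(f, "y"), "y")              # second partial with respect to y, twice
fxy <- D(D(f, "x"), "y")              # mixed partial
x <- 5/9; y <- 16/9                   # the critical point found above
eval(fxx) * eval(fyy) - eval(fxy)^2   # the discriminant D
[1] -9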
(Linear) Approximation - Taylor series
Linear Algebra
Decompositions - (SVD, QR)
Matrix manipulation
Tensors
Statistics/Probability
Bias-Variance tradeoff
Regression/Classification
Regression analysis is the use of statistical methods to estimate one or more dependent variables (also known as response or outcome variables) based on the values of one or more independent variables (also known as predictors, explanatory variables or covariates). A regression model represents this statistical relationship between the response variable(s) Y and the independent variable(s) X, with Y varying in response to X in a systematic fashion. This relation will not be perfect, as data points will generally be scattered around the curve of relation.
The primary purposes of a regression model are to predict new values for an outcome variable of interest and to discover potential causal relationships between an outcome and its set of predictors.
Two important characteristics of a regression model are that (1) there is a probability distribution for Y for each level of X, and (2) the means of these probability distributions vary systematically with X.
A basic regression model with one predictor and one response variable and a linear relationship takes the following form:
\[ Y_i = \beta_0 + \beta_1X_i + \varepsilon_i \]
where:
\(Y_i\) is the value of the response variable in the \(i\)th trial.
\(\beta_0\) is the intercept parameter.
\(\beta_1\) is the parameter that relates X to Y (can be considered as the slope in this particular linear example).
\(X_i\) is the known value of the predictor variable from the \(i\)th trial.
\(\varepsilon_i\) is a random error term from the \(i\)th trial with mean \(E\{\varepsilon_i\} = 0\) and variance \(\sigma^2\{\varepsilon_i\}=\sigma^2\).
To build this model, the beta parameters need to be estimated in some fashion. One approach for doing so is to use the method of least squares. In this method, the total squared differences between the observed data points and the hypothetical regression “line of best fit” are minimized. In other words, beta values are chosen which generate the line that will best run through the center of the data.
Using the simple linear regression model example above, the criterion \(Q\) that we are trying to minimize with this least squares method is expressed as:
\[ Q = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1X_i)^2 \]
with \(n\) being the number of observations in the regression data set.
Solving the normal equations
\[ \sum Y_i=nb_0+b_1\sum X_i \]
\[ \sum X_i Y_i=b_0\sum X_i + b_1\sum X_i^2 \]
simultaneously provides values for \(b_0\) and \(b_1\), which are unbiased estimators of the actual parameters \(\beta_0\) and \(\beta_1\). While this calculation can be done by hand, using a computer is significantly more efficient.
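To illustrate, here is a minimal sketch that sets up and solves the two normal equations in R; the x and y values are hypothetical, made up purely for the example.
x <- c(1, 2, 3, 4, 5)              # hypothetical predictor values
y <- c(2.1, 2.9, 4.2, 4.8, 6.1)    # hypothetical responses
A <- matrix(c(length(x), sum(x),
              sum(x),    sum(x^2)), nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))
solve(A, rhs)                      # b0 and b1
[1] 1.05 0.99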
Simple Linear Point Estimation Example
Once a regression model is created from the data, point estimates for Y can be generated based on values of X.
For example, if solving the normal equations gave us \(b_0=3\) and \(b_1=0.5\), we would have the equation \(\hat{Y} = 3 + 0.5X\). Plugging a value of 2 in for the predictor variable X gives us \(\hat{Y} = 3 + 0.5(2) = 4\). For this relation, we would estimate a value of 4 for our response given a value of 2 for our predictor.
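The same estimates and point predictions can be obtained with R's lm() and predict(). The sketch below reuses the hypothetical data from the previous sketch rather than the values that produced \(b_0=3\) and \(b_1=0.5\).
x <- c(1, 2, 3, 4, 5); y <- c(2.1, 2.9, 4.2, 4.8, 6.1)  # same hypothetical data as above
fit <- lm(y ~ x)                                        # least squares fit of y on x
coef(fit)                                               # intercept 1.05 and slope 0.99, matching the normal equations
predict(fit, newdata = data.frame(x = 2))               # 1.05 + 0.99 * 2 = 3.03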
The example above is the simplest case of regression. Regression can include numerous predictors, several response variables, and even have a nonlinear relation.
Categorical Variables
With the proper adjustments made to the basic linear regression model provided above, the predictor and response variables can accommodate not just numeric data, but categorical data as well.
A technique for writing a regression model with a categorical predictor variable is to treat each level as its own indicator variable. Each of these variables then takes on a value of either 0 (if the observation does not exhibit that level of the categorical variable) or 1 (if it does).
For a categorical variable with \(m\) levels, only \(m-1\) indicator variables are needed to fully express the original variable in the model; setting all \(m-1\) indicators to 0 represents the remaining level, called the reference level.
Categorical Variable Regression Example
For example, a model that regresses GPA \(Y\) on grade level \(X\) (Freshman, Sophomore, Junior, Senior) could look as follows:
\[ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \varepsilon_i \]
Here:
\(Y_i\) is the GPA value for the \(i\)th student observed.
Freshman is the reference level of the grade level variable \(X\), so if a student is a freshman then \(X_{i1}\), \(X_{i2}\), and \(X_{i3}\) will all be 0.
\(X_{i1}\) is 1 if the \(i\)th student is a sophomore, and 0 otherwise.
\(X_{i2}\) is 1 if the \(i\)th student is a junior, and 0 otherwise.
\(X_{i3}\) is 1 if the \(i\)th student is a senior, and 0 otherwise.
\(\beta_0\) is the intercept parameter. As freshman is the reference level and there are no other predictors besides grade, this is also the estimated mean GPA for freshmen.
\(\beta_1\) is the parameter that relates being a sophomore to GPA.
\(\beta_2\) is the parameter that relates being a junior to GPA.
\(\beta_3\) is the parameter that relates being a senior to GPA.
\(\varepsilon_i\) is a random error term for the \(i\)th observation with mean \(E\{\varepsilon_i\} = 0\) and variance \(\sigma^2\{\varepsilon_i\}=\sigma^2\).
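In R, this dummy coding happens automatically when the categorical predictor is stored as a factor. Here is a minimal sketch using made-up GPA values:
grade <- factor(c("Freshman", "Sophomore", "Junior", "Senior", "Freshman", "Junior"),
                levels = c("Freshman", "Sophomore", "Junior", "Senior"))
gpa <- c(3.1, 3.4, 3.0, 3.6, 2.9, 3.3)   # hypothetical GPA values
fit <- lm(gpa ~ grade)
model.matrix(fit)   # intercept column plus three 0/1 indicator columns; freshman is the reference level
coef(fit)           # intercept is the freshman mean; the other coefficients are differences from it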
There is much more to regression analysis than covered here. It is possible to model response variables that are binary or follow a Poisson distribution for instance, but the techniques are more complex.
Sources:
https://en.wikipedia.org/wiki/Regression_analysis
Kutner, Michael, et al. Applied Linear Regression Models. 4th ed., McGraw-Hill, 2004.
Likelihood
Probability distributions (Normal, etc)
Randomization
Time Series, Markov chains
Fact or fiction: In theory, any mathematical or statistical computation could be done by hand.
The above statement is true… almost. If our inputs are rational numbers (meaning they can be written as a ratio of two integers), and we are dealing with an expression that is algebraic - meaning it can be represented as a composition of the basic arithmetic operations (addition, subtraction, multiplication, division) along with integer powers and roots - then we can do it by hand.
EXAMPLE 1: Calculate \(\frac{5^2-\frac{1}{2}}{3(4+\sqrt[3]{27})}\) by hand.
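Working it out step by step:
\[ \frac{5^2-\frac{1}{2}}{3(4+\sqrt[3]{27})} = \frac{25-\frac{1}{2}}{3(4+3)} = \frac{49/2}{21} = \frac{49}{42} = \frac{7}{6} \]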
However, once we transcend these operations, even the simplest calculation becomes impossible:
EXAMPLE 2: Calculate \(\sin\frac{\pi}{5}\) by hand.
True to form, here we are knee deep into the section on computer science and still talking math. Don’t worry, the coding part is coming. But it is important to explain what exactly we need computers for. That will help us understand why computing power has been so instrumental in pushing mathematical and scientific ideas as far as they’ve come.
So about \(\sin\frac{\pi}{5}\). If we ask R to calculate it, we get
sin(pi/5)
[1] 0.5877853
In Python,
import math
pi = math.pi
math.sin(pi / 5)
0.5877852522924731
If a computer is programmed by humans, how did the computer know the value of \(\sin\frac{\pi}{5}\)?
Taylor series approximation
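To see the idea, here is a minimal sketch that sums the first several terms of the Taylor series for sine at \(x=\pi/5\). It is only an illustration of the principle, not the routine R actually uses internally.
x <- pi / 5
k <- 0:5                                             # first six terms of the series
sum((-1)^k * x^(2 * k + 1) / factorial(2 * k + 1))   # x - x^3/3! + x^5/5! - ...
[1] 0.5877853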
In the marketplace, Python is by far the most commonly used language for exploring and mining data, as well as for training, testing, and deploying machine learning models. It is fast becoming the standard in academic circles as well. That being said, other programming platforms still have a die-hard base of support in a variety of domains (e.g., SPSS in social and health sciences, Stata in economics, SAS in medicine). R remains popular in statistics circles because it was built by statisticians for statisticians, and it has kept in step with modern advances in data science and machine learning. Many Python libraries are essentially R emulators (and vice versa). It is good practice for aspiring data scientists to be familiar with both.
Both R and Python start with base versions and are built up by installing crowd-sourced packages.
install.packages("tidyverse")
Installing tidyverse [1.3.2] ...
OK [linked cache]
install.packages("tidymodels")
Installing tidymodels [1.0.0] ...
OK [linked cache]
install.packages("reticulate")
Installing reticulate [1.28] ...
OK [linked cache]
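Once installed, a package is loaded into each R session with library() before its functions can be used:
library(tidyverse)   # data wrangling and visualization
library(tidymodels)  # modeling and machine learning framework
library(reticulate)  # run Python from within R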