Statistical Foundations of Machine Learning
Author: Advait Lonkar (@advait_l)
See also: Probability, Linear Regression, R-Squared, p-value, Multiple Regression, t-tests, ANOVA
Mutual Information
- Mutual information measures how closely two variables are related, i.e., how much knowing one variable tells you about the other (see the definition sketched below).
- Joint probabilities: probability of two things occurring at the same time.
- Marginal probabilities: probability of just one thing occurring.
- A variable that never changes has zero mutual information with anything else (it provides no information).
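As a sketch of the standard definition (not spelled out in the notes above), mutual information combines the joint and marginal probabilities from the two bullets above:

$$
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x)\,p(y)}
$$

If one variable never changes, then $p(x, y) = p(x)\,p(y)$ for every pair, each log term is zero, and the mutual information is zero, matching the last bullet.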
Least Squares
- Fit a line through the points by minimizing the vertical distances (residuals) from the points to the line.
- Square each distance so contributions are non-negative.
- For a line $y = ax + b$, find the slope $a$ and intercept $b$ that minimize the sum of squared residuals.
- This method is called Least Squares.
- Conceptually, consider the sum of squared residuals as a surface over the parameters $(a, b)$; the optimal fit occurs where the derivatives are zero (a minimum).
We want to minimize the squared distance between the observed values and the line. Taking derivatives with respect to the line parameters, setting them to zero, and solving gives the line that minimizes the sum of squares.
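As a concrete version of that step (the standard textbook result, not specific to these notes), write the sum of squared residuals for the line $y = ax + b$ and set both partial derivatives to zero:

$$
\text{SSR}(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2, \qquad
\frac{\partial \text{SSR}}{\partial a} = 0, \quad
\frac{\partial \text{SSR}}{\partial b} = 0
$$

$$
\Longrightarrow \quad
a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad
b = \bar{y} - a\,\bar{x}
$$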
Linear Regression
Linear Regression consists of three steps:
- Use least squares to fit a line to the data.
- Calculate $R^2$.
- Calculate a $p$-value for $R^2$.
Example context: mouse size vs. mouse weight. Least squares estimates two parameters: slope and intercept.
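A minimal sketch of these three steps in Python, using made-up mouse data and scipy.stats.linregress (which reports the slope, intercept, correlation, and a p-value in one call); any least squares routine would work:

```python
import numpy as np
from scipy import stats

# Hypothetical mouse data: weight (x) and size (y).
weight = np.array([2.1, 2.5, 3.0, 3.4, 4.0, 4.4, 5.0, 5.5])
size = np.array([1.3, 1.9, 2.4, 2.3, 3.1, 3.4, 3.9, 4.5])

# Step 1: least squares fit of size = slope * weight + intercept.
fit = stats.linregress(weight, size)

# Step 2: R^2 (linregress reports the correlation r, so square it).
r_squared = fit.rvalue ** 2

# Step 3: p-value for the slope (equivalent to the F-test p-value
# when there is a single predictor).
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}")
print(f"R^2={r_squared:.3f}, p-value={fit.pvalue:.4f}")
```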
R²
- Calculate average mouse size (the mean).
- Sum of squares around the mean: $SS(\text{mean}) = \sum (\text{data} - \text{mean})^2$
- Variation around the mean: $\text{Var}(\text{mean}) = SS(\text{mean}) / n$, where $n$ is the sample size.
- Sum of squares around the least squares fit: $SS(\text{fit}) = \sum (\text{data} - \text{line})^2$
- Variation around the fit: $\text{Var}(\text{fit}) = SS(\text{fit}) / n$
Variance is the average of the sum of squares.
- $R^2$ measures how much of the variation in mouse size is explained by mouse weight.
- Formula: $R^2 = \dfrac{\text{Var}(\text{mean}) - \text{Var}(\text{fit})}{\text{Var}(\text{mean})} = \dfrac{SS(\text{mean}) - SS(\text{fit})}{SS(\text{mean})}$
- A perfect fit yields $R^2 = 1$ (100% of the variation explained).
- Adjusted $R^2$ penalizes additional parameters.
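Continuing the hypothetical mouse data, $R^2$ can be computed directly from the two sums of squares defined above (a sketch; the data values are invented):

```python
import numpy as np

# Hypothetical data: mouse weight (x) and size (y).
x = np.array([2.1, 2.5, 3.0, 3.4, 4.0, 4.4, 5.0, 5.5])
y = np.array([1.3, 1.9, 2.4, 2.3, 3.1, 3.4, 3.9, 4.5])

# Least squares fit (slope and intercept).
slope, intercept = np.polyfit(x, y, deg=1)

ss_mean = np.sum((y - y.mean()) ** 2)                  # SS(mean)
ss_fit = np.sum((y - (slope * x + intercept)) ** 2)    # SS(fit)

r_squared = (ss_mean - ss_fit) / ss_mean
print(f"SS(mean)={ss_mean:.3f}, SS(fit)={ss_fit:.3f}, R^2={r_squared:.3f}")
```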
How do we determine whether the $R^2$ value is significant?
p-value
- Conceptually, $F$ compares the variation in mouse size explained by weight to the variation not explained by weight.
- Define an F-statistic: $F = \dfrac{\bigl(SS(\text{mean}) - SS(\text{fit})\bigr) / (p_{\text{fit}} - p_{\text{mean}})}{SS(\text{fit}) / (n - p_{\text{fit}})}$
- Residuals are the variations not explained by weight (the errors around the fit).
- $p_{\text{fit}}$ (parameters in the fit line) = 2: intercept and slope
- $p_{\text{mean}}$ (parameters in the mean-only model) = 1: just the intercept (the mean itself)
- Numerator: variance explained by the extra parameter(s)
- Denominator: variance not explained by the fit (sum of squares of residuals)
How to turn F into a p-value?
- Simulate or consider the distribution of F under the null hypothesis by computing F for many random datasets.
- Compute F for the original dataset.
- The p-value is the fraction of simulated F values that are greater than or equal to the observed F (more extreme), i.e., “extreme count / total count.”
In practice, the p-value is computed from the theoretical F-distribution with $(p_{\text{fit}} - p_{\text{mean}})$ and $(n - p_{\text{fit}})$ degrees of freedom, rather than by simulating many random datasets.
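A sketch of both routes to the p-value, continuing the hypothetical data: the closed-form F-distribution tail (scipy.stats.f.sf) and the simulation idea from the bullets above, here approximated by shuffling the $y$ values to produce datasets with no real relationship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

x = np.array([2.1, 2.5, 3.0, 3.4, 4.0, 4.4, 5.0, 5.5])
y = np.array([1.3, 1.9, 2.4, 2.3, 3.1, 3.4, 3.9, 4.5])
n, p_fit, p_mean = len(y), 2, 1  # slope + intercept vs. mean only

def f_statistic(x, y):
    """F from the sums of squares around the mean and around the fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    ss_mean = np.sum((y - y.mean()) ** 2)
    ss_fit = np.sum((y - (slope * x + intercept)) ** 2)
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

f_obs = f_statistic(x, y)

# Closed form: upper tail of the F-distribution with
# (p_fit - p_mean, n - p_fit) degrees of freedom.
p_closed = stats.f.sf(f_obs, p_fit - p_mean, n - p_fit)

# Simulation: fraction of "no relationship" datasets with F at least as extreme.
f_null = [f_statistic(x, rng.permutation(y)) for _ in range(10_000)]
p_sim = np.mean(np.array(f_null) >= f_obs)

print(f"F={f_obs:.2f}, p (F-distribution)={p_closed:.4f}, p (simulation)={p_sim:.4f}")
```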
[!INFO] p-value
- A low p-value (typically < 0.05) means it is very unlikely to obtain such a high F-statistic by chance. The model is statistically significant and better than predicting just the mean.
- A high p-value means the result could have occurred by chance; you cannot conclude the model predicts better than the mean.
[!IMPORTANT] $R^2$, the $F$-statistic, and the $p$-value are related
- Higher $R^2$ (better fit) generally corresponds to a larger F-statistic and therefore a smaller p-value.
- A strong fit (high $R^2$) provides strong evidence (large F) against the null of a useless model, implying a low probability (p-value) that the result occurred by chance.
- Together these quantify model usefulness.
Multiple Regression
- Simple regression fits a line to data (one predictor).
- Multiple regression fits a plane (or higher-dimensional hyperplane) using multiple predictors.
- Use adjusted $R^2$ to compensate for additional parameters.
- Computing F and p-values follows the same principles as simple regression.
[!INFO] Is multiple regression worth it? If the increase in $R^2$ from adding predictors is large and the p-value is small, then adding the predictor(s) (e.g., tail length) is worth the added complexity.
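A sketch of that comparison under stated assumptions: invented weight and tail-length predictors, both models fit by least squares, and a nested-model F-test built from their sums of squares (all names and values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: size depends mostly on weight; tail length may add little.
n = 30
weight = rng.uniform(2.0, 5.0, n)
tail = rng.uniform(1.0, 3.0, n)
size = 0.8 * weight + 0.1 * tail + rng.normal(0.0, 0.3, n)

def sum_sq_resid(X, y):
    """Least squares fit, then sum of squared residuals around the fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ones = np.ones(n)
ss_simple = sum_sq_resid(np.column_stack([ones, weight]), size)        # 2 parameters
ss_multi = sum_sq_resid(np.column_stack([ones, weight, tail]), size)   # 3 parameters

p_simple, p_multi = 2, 3
f = ((ss_simple - ss_multi) / (p_multi - p_simple)) / (ss_multi / (n - p_multi))
p_value = stats.f.sf(f, p_multi - p_simple, n - p_multi)
print(f"F={f:.2f}, p={p_value:.4f}  # small p => the extra predictor is worth keeping")
```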
t-tests and ANOVA
Example: gene expression differences between control and mutant mice.
t-test vs. linear regression
- Ignore the x-axis and compute the overall mean.
- Compute SS(mean): sum of squared residuals around the mean.
- Fit models:
- Least squares line(s) for linear regression.
- Group means for a t-test (separate means for control and mutant).
- Combine via a design matrix that encodes group membership for each point (see the sketch after this list):
  - For a control point: row $[1, 0]$, so the prediction is the control mean.
  - For a mutant point: row $[1, 1]$, so the prediction is the control mean plus the control-to-mutant difference.
  - Equivalently: $y = \mu_{\text{control}} + (\mu_{\text{mutant}} - \mu_{\text{control}}) \times (\text{mutant indicator})$
- Compute SS(fit): sum of squared residuals around the fitted line(s) or group means.
- Calculate the F-statistic and p-value for the t-test.
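A sketch of the regression view of the t-test with made-up expression values for five control and five mutant mice; the design-matrix fit recovers the two group means, and the resulting F equals the square of the usual two-sample t-statistic:

```python
import numpy as np
from scipy import stats

# Hypothetical gene-expression values.
control = np.array([2.1, 2.4, 1.9, 2.3, 2.2])
mutant = np.array([3.0, 2.8, 3.3, 3.1, 2.9])
y = np.concatenate([control, mutant])
n = len(y)

# Design matrix: column 1 is always on (control mean),
# column 2 switches on the control-to-mutant difference.
is_mutant = np.concatenate([np.zeros(len(control)), np.ones(len(mutant))])
X = np.column_stack([np.ones(n), is_mutant])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # [control mean, difference]

ss_mean = np.sum((y - y.mean()) ** 2)   # mean-only model, 1 parameter
ss_fit = np.sum((y - X @ beta) ** 2)    # two group means, 2 parameters
p_fit, p_mean = 2, 1
f = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
p_value = stats.f.sf(f, p_fit - p_mean, n - p_fit)

t = stats.ttest_ind(mutant, control)  # ordinary two-sample t-test
print(f"F={f:.2f}, p={p_value:.4f}  (t^2={t.statistic**2:.2f}, t-test p={t.pvalue:.4f})")
```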
ANOVA
- ANOVA generalizes the t-test to compare more than two groups by partitioning variance into between-group and within-group components.
t-tests and linear regression are closely related; ANOVA provides the framework for multi-group comparisons. A common combined setup compares mutant vs. control mice while also modeling mouse size as a function of mouse weight (weight as a covariate).
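For more than two groups, the same partitioning of variance gives the one-way ANOVA F-test; a minimal sketch with three invented groups using scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

# Hypothetical expression values for three groups of mice.
wild_type = np.array([2.1, 2.4, 1.9, 2.3, 2.2])
mutant_a = np.array([3.0, 2.8, 3.3, 3.1, 2.9])
mutant_b = np.array([2.5, 2.6, 2.4, 2.8, 2.7])

# One-way ANOVA: between-group variance vs. within-group variance.
result = stats.f_oneway(wild_type, mutant_a, mutant_b)
print(f"F={result.statistic:.2f}, p={result.pvalue:.4f}")
```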