Evaluating ML Models: Cross Validation, Confusion Matrix, Bias–Variance

Machine learning is about making predictions and classifications. A central theme in building models that generalize is the bias–variance tradeoff.

  • Estimate parameters for ML methods → train on training data (about 75% of data)
  • Evaluate how well the methods work → test on held-out data (about 25% of data)
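
As a minimal sketch of this split (assuming scikit-learn and NumPy are available; the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 3 features, binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Train on ~75% of the data, hold out ~25% for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```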

Covered below: Cross Validation, Confusion Matrix, Bias, and Variance

Cross Validation

  • Start with a split (e.g., first 75% to train, last 25% to test)
  • Then rotate which 25% is used as the test set so each block is tested once
  • Keep track of the ML parameters and performance across folds (a four-fold sketch follows the list of schemes below)

Common schemes:

  • Four-fold cross validation (each fold is ~25%)
  • Leave-one-out cross validation (LOOCV)
  • Ten-fold cross validation (widely used in practice)
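
A rough sketch of four-fold cross validation using scikit-learn's KFold; the logistic regression model, the accuracy metric, and the made-up data are illustrative choices, not part of the original notes:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Four folds: each ~25% block serves as the test set exactly once
kf = KFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Performance is tracked per fold and then summarized
print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", round(float(np.mean(fold_scores)), 3))
```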

Confusion Matrix

After splitting with cross validation:

  • Train candidate ML methods on the training split(s)
  • Evaluate on the test split(s) and summarize the results in a confusion matrix (a sketch follows this list)
  • Rows = model predictions, columns = ground truth

                     True Yes            True No
  Predicted Yes      True Positives      False Positives
  Predicted No       False Negatives     True Negatives
  • The diagonal values indicate correct classifications
  • Confusion matrices from multiple ML methods can be compared
    • More sophisticated metrics can then guide model selection
  • For multi-class problems, the matrix expands with one row/column per class
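
A minimal sketch of filling in this matrix by hand with NumPy, using the layout above (rows = predictions, columns = ground truth); the labels and predictions are made-up examples:

```python
import numpy as np

# Made-up ground truth and model predictions (1 = Yes, 0 = No)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted Yes, actually Yes
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted Yes, actually No
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted No,  actually Yes
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted No,  actually No

# Rows = predictions, columns = ground truth, matching the table above
cm = np.array([[tp, fp],
               [fn, tn]])
print(cm)
# Note: scikit-learn's confusion_matrix uses the transposed convention
# (rows = true labels, columns = predictions), so be careful when comparing.
```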

Sensitivity and Specificity

Computed from the columns of the binary confusion matrix:

  • Sensitivity (Recall/TPR) = true positives / (true positives + false negatives)
    • In words: of all actual positives, what percentage were correctly identified?
  • Specificity (TNR) = true negatives / (true negatives + false positives)
    • In words: of all actual negatives, what percentage were correctly identified?

Depending on the application, one may prioritize sensitivity (e.g., when missing a true positive is costly) or specificity (e.g., when false alarms are costly). For multi-class settings, compute sensitivity and specificity separately for each class, treating that class as positive and all other classes as negative.
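
A short sketch of these two ratios, using made-up counts from a binary confusion matrix:

```python
# Made-up counts: tp = true positives, fp = false positives,
# fn = false negatives, tn = true negatives
tp, fp, fn, tn = 42, 7, 8, 43

sensitivity = tp / (tp + fn)  # of all actual positives, fraction correctly identified
specificity = tn / (tn + fp)  # of all actual negatives, fraction correctly identified

print(f"sensitivity = {sensitivity:.2f}")  # 42 / 50 = 0.84
print(f"specificity = {specificity:.2f}")  # 43 / 50 = 0.86
```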

Bias and Variance

  • Bias: systematic error from an algorithm’s inability to capture the true relationship
  • Example: a straight line fit to a curved relationship has high bias
  • Variance: error from sensitivity to the particular training set; an overly flexible model that hugs the training data closely can change a lot from one dataset to the next
  • Generalization error on held-out data can be decomposed (conceptually) into bias and variance contributions. In practice, we assess it with test error: measure the prediction residuals on the test set and aggregate them (e.g., as a sum of squared errors for regression).
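
As a rough, assumed illustration (not from the original notes): fit a straight line (high bias) and a very flexible polynomial (prone to high variance) to curved data, then compare sums of squared errors on the training and test sets. The data, split, and polynomial degrees are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up curved relationship with noise
x = rng.uniform(-1, 1, size=80)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

# Roughly 75% train / 25% test split
idx = rng.permutation(x.size)
train, test = idx[:60], idx[60:]

def sse(coeffs, xs, ys):
    """Sum of squared prediction residuals for a polynomial fit."""
    return float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))

line = np.polyfit(x[train], y[train], deg=1)     # straight line: high bias
wiggly = np.polyfit(x[train], y[train], deg=12)  # very flexible: can overfit

for name, coeffs in [("straight line", line), ("degree-12 polynomial", wiggly)]:
    print(f"{name:22s} train SSE = {sse(coeffs, x[train], y[train]):6.2f}  "
          f"test SSE = {sse(coeffs, x[test], y[test]):6.2f}")
```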