Evaluating ML Models: Cross Validation, Confusion Matrix, Bias–Variance

Machine learning is about making predictions and classifications. A central theme in building models that generalize is the bias–variance tradeoff.

  • Estimate parameters for ML methods → train on training data (about 75% of data)
  • Evaluate how well the methods work → test on held-out data (about 25% of data)
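
As a minimal sketch of this split (assuming scikit-learn and NumPy are available; the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 3 features, binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Train on ~75% of the data, hold out ~25% for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```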

Covered below: Cross Validation, Confusion Matrix, Bias, and Variance

Cross Validation

  • Start with a split (e.g., first 75% to train, last 25% to test)
  • Then rotate which 25% is used as the test set so each block is tested once
  • Keep track of the ML parameters and performance across folds (a four-fold sketch follows the list of schemes below)

Common schemes:

  • Four-fold cross validation (each fold is ~25%)
  • Leave-one-out cross validation (LOOCV)
  • Ten-fold cross validation (widely used in practice)
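
A rough sketch of four-fold cross validation using scikit-learn's KFold; the logistic regression model, the accuracy metric, and the made-up data are illustrative choices, not part of the original notes:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Four folds: each ~25% block serves as the test set exactly once
kf = KFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Performance is tracked per fold and then summarized
print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", round(float(np.mean(fold_scores)), 3))
```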

Confusion Matrix

After splitting with cross validation:

  • Train candidate ML methods on the training split(s)
  • Evaluate on the test split(s) and summarize the results in a confusion matrix (a sketch follows this list)
  • Rows = model predictions, columns = ground truth

                     True Yes            True No
  Predicted Yes      True Positives      False Positives
  Predicted No       False Negatives     True Negatives
  • The diagonal values indicate correct classifications
  • Confusion matrices from multiple ML methods can be compared
    • More sophisticated metrics can then guide model selection
  • For multi-class problems, the matrix expands with one row/column per class
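
A minimal sketch of filling in this matrix by hand with NumPy, using the layout above (rows = predictions, columns = ground truth); the labels and predictions are made-up examples:

```python
import numpy as np

# Made-up ground truth and model predictions (1 = Yes, 0 = No)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted Yes, actually Yes
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted Yes, actually No
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted No,  actually Yes
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted No,  actually No

# Rows = predictions, columns = ground truth, matching the table above
cm = np.array([[tp, fp],
               [fn, tn]])
print(cm)
# Note: scikit-learn's confusion_matrix uses the transposed convention
# (rows = true labels, columns = predictions), so be careful when comparing.
```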

Sensitivity and Specificity

Computed from the columns of the binary confusion matrix:

  • Sensitivity (Recall/TPR) = true positives / (true positives + false negatives)
    • In words: of all actual positives, what percentage were correctly identified?
  • Specificity (TNR) = true negatives / (true negatives + false positives)
    • In words: of all actual negatives, what percentage were correctly identified?

Depending on the application, one may prioritize sensitivity (e.g., when missing a true positive is costly) or specificity (e.g., when false alarms are costly). For multi-class settings, compute sensitivity and specificity separately for each class, treating that class as positive and all other classes as negative.
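
A short sketch of these two ratios, using made-up counts from a binary confusion matrix:

```python
# Made-up counts: tp = true positives, fp = false positives,
# fn = false negatives, tn = true negatives
tp, fp, fn, tn = 42, 7, 8, 43

sensitivity = tp / (tp + fn)  # of all actual positives, fraction correctly identified
specificity = tn / (tn + fp)  # of all actual negatives, fraction correctly identified

print(f"sensitivity = {sensitivity:.2f}")  # 42 / 50 = 0.84
print(f"specificity = {specificity:.2f}")  # 43 / 50 = 0.86
```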

Bias and Variance

  • Bias: systematic error from an algorithm’s inability to capture the true relationship
  • Example: a straight line fit to a curved relationship has high bias
  • Variance: error from sensitivity to the particular training set; an overly flexible model that hugs the training data closely can change a lot from one dataset to the next
  • Generalization error on held-out data can be decomposed (conceptually) into bias and variance contributions. In practice, we assess it with test error: measure the prediction residuals on the test set and aggregate them (e.g., as a sum of squared errors for regression).
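
As a rough, assumed illustration (not from the original notes): fit a straight line (high bias) and a very flexible polynomial (prone to high variance) to curved data, then compare sums of squared errors on the training and test sets. The data, split, and polynomial degrees are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up curved relationship with noise
x = rng.uniform(-1, 1, size=80)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

# Roughly 75% train / 25% test split
idx = rng.permutation(x.size)
train, test = idx[:60], idx[60:]

def sse(coeffs, xs, ys):
    """Sum of squared prediction residuals for a polynomial fit."""
    return float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))

line = np.polyfit(x[train], y[train], deg=1)     # straight line: high bias
wiggly = np.polyfit(x[train], y[train], deg=12)  # very flexible: can overfit

for name, coeffs in [("straight line", line), ("degree-12 polynomial", wiggly)]:
    print(f"{name:22s} train SSE = {sse(coeffs, x[train], y[train]):6.2f}  "
          f"test SSE = {sse(coeffs, x[test], y[test]):6.2f}")
```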