The Game of Supervised Machine Learning: Understanding the Setup, Players, and Rules
Five points deserve mention before beginning this whitepaper. First, I created a Python package, called smltheory, that contains all the functions provided in this whitepaper that are needed to demonstrate propositions of supervised machine learning theory. The package can be installed from the Python Package Index (PyPI) using the Bash command below.
pip install smltheory
Second, although I provide the source code for the Python functions in this whitepaper where appropriate, the functions can also be used by installing the smltheory package and prefixing each function with smltheory (or with a desired shorthand; I use sml to represent smltheory) so that the function is called from the loaded package.
Third, I used Python and R code throughout this whitepaper and often imported objects created in Python into R to create plots. To use Python and R together, I used the reticulate package for R and created a conda environment for Python (see the R code block below). Note that I installed the necessary Python packages in the conda environment.
library(reticulate)
#create and use conda environment
conda_create(envname = 'blog_posts', python_version = '3.10.11')
use_condaenv(condaenv = 'blog_posts')
#install packages in conda environment
py_packages <- c('numpy', 'pandas', 'scikit-learn', "plotnine", "statsmodels", "smltheory")
conda_install(envname = 'blog_posts', packages = py_packages, pip=T)
#useful for checking what packages are loaded
py_list_packages(envname = 'blog_posts', type = 'conda')
Fourth, any reader interested in using the source code for each function will need to load the trun_mvnt Python module from GitHub (see the code block below).
git clone https://github.com/ralphma1203/trun_mvnt.git
from trun_mvnt import trun_mvnt as tvm
Fifth, the packages and modules in the Python code block below must be loaded.
import numpy as np
import pandas as pd
import sklearn
from sklearn.metrics import mean_squared_error
import statsmodels.formula.api as smf
from functools import partial
import plotnine as pt
import concurrent.futures
import smltheory as sml
Make sure that version 0.1.10 of smltheory is loaded.
sml.version('smltheory')
'0.1.10'
Introduction
In supervised machine learning, a game is played with the goal of maximizing the ability to predict. Whether the focus is to predict churn rates (among employees and/or customers), regions with high potential for growth, or something as innocuous as fruit fly behaviour, supervised machine learning can be used to solve these problems. Although a nearly infinite number of models can be deployed to predict a given outcome, data scientists should not rely on trial and error to find effective models; instead, they should make purposeful decisions. To do so, it is important that they understand the game of supervised machine learning.
To understand the game of supervised machine learning, I will use the excess risk decomposition framework to situate the commonly known entities of bias, variance, and noise. Because excess risk decomposition is seldom presented in tandem with bias, variance, and noise, I hope to provide a deeper understanding of both topics by presenting them together.
This paper contains six parts. First, I will provide the necessary background by presenting a formal model of learning. Second, I will present the excess risk decomposition framework as a setup for understanding the game of supervised machine learning. Third, I will introduce the ‘players’ (i.e., bias, variance, and noise) of supervised machine learning by decomposing mean squared error. Fourth, I will explain the behaviours of bias, variance, and noise by presenting the bias-variance tradeoff. Fifth, I will explain the rules of supervised machine learning and how to succeed. Sixth, and last, I will provide an overview of situations where the rules of supervised machine learning cannot be applied.
A Formal Model of Learning
To provide the context of supervised machine learning, I begin with a formal model of learning and present it in three parts. First, I will present the learning context. Second, I will present the learning model. Third, and last, I will present model selection.
Learning Context
In this section, I will present the learning context. Consider an example where a tourist arrives in a new country and wants to learn how to identify a good wine without opening the bottle. Given her experience with wine, she decides to use two features: 1) the weather that occurred during the year the wine was produced and 2) the winemaking quality. Both features are rated on a 1–10 scale (where 1 indicates poor weather/winemaking quality and 10 indicates excellent weather/winemaking quality). To determine the effectiveness of this approach, she decides to visit several vineyards and, for each wine, rate the quality of the weather during the year it was produced, rate the winemaking quality, and then taste the wine and rate its quality (1–10 scale). After collecting the data, the tourist plans to take a subset of the data to train her models. In more formal terms, two components constitute the learning inputs:
- Feature space ($\mathbf{X}$): the set of features used for prediction. In the current example, the features of weather ($\mathbf{x}_1$) and winemaking quality ($\mathbf{x}_2$) for each wine are stored in a matrix, $\mathbf{X} = [\mathbf{x}_1^\top, \mathbf{x}_2^\top]$, with each row of $\mathbf{X}$ being an observation such that $\mathbf{x}_n = [\mathbf{x}_{1n}, \mathbf{x}_{2n}]$.
- Outcome space ($\mathbf{Y}$): the set of variables being predicted. Because there is only one outcome variable of the wine quality ratings, the outcome space is simply a vector, $\mathbf{Y} = \mathbf{y}$.
Together, the feature and outcome spaces comprise the data collected by the tourist such that, for each wine $n$, there is a feature vector/outcome pair, $(\mathbf{x}_n, y_n)$.
Importantly, the tourist never assumes realizability: that the features of weather and winemaking quality completely predict wine quality ratings. Instead, she assumes that, for a given pair of weather and winemaking quality values, there is a distribution of plausible wine ratings, which introduces stochasticity (randomness). Dropping the realizability assumption in the current example is certainly intuitive: wine quality is affected by many factors outside of weather and winemaking quality, such as soil quality, altitude, and age of the vines. Importantly, even if the tourist were to collect data for all the variables that affect wine quality, realizability (i.e., a one-to-one mapping between the predictors and the outcome) would not occur because of the aforementioned stochasticity (randomness) and measurement error.1
To reflect the uncertainty that follows from dropping realizability (also known as agnostic learning), the tourist assumes a joint distribution over the feature and outcome spaces. In contrast to the realizability setting, where there is a one-to-one mapping function between the feature and outcome spaces, $f: \mathbf{X} \rightarrow \mathbf{Y}$, a probabilistic relation between the feature and outcome spaces is specified. Specifically, she assumes a joint distribution over the feature and outcome spaces, $\mathcal{D}(\mathbf{X}, \mathbf{Y})$, such that a distribution of possible outcomes exists for any given set of feature values, $\mathcal{D}((\mathbf{x}, y)|\mathbf{x})$. By conceptualizing the feature-outcome space as probabilistic, the tourist can account for measurement error and stochasticity.
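To make this probabilistic feature–outcome relation concrete, the minimal Python sketch below draws several plausible wine ratings for one fixed pair of feature values. The conditional mean function and noise level here are illustrative assumptions, not the data-generating process used later in this paper.
import numpy as np

rng = np.random.default_rng(seed=27)
#one fixed feature vector: weather = 6, winemaking quality = 8
weather, winemaking_quality = 6.0, 8.0
#hypothetical conditional mean of wine quality given the features
conditional_mean = 0.4 * weather + 0.5 * winemaking_quality
#under agnostic learning, identical feature values map to a distribution of plausible ratings
plausible_ratings = rng.normal(loc=conditional_mean, scale=1.0, size=5)
print(plausible_ratings) #five different ratings for identical feature values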
Learning Model
With a setup for the learning context, I will present a learning model. Revisiting the wine assessment example, the tourist has now returned from the vineyards and recorded information on each $n$ wine regarding the weather, winemaking quality, and quality of the wine. The tourist then takes a large subset of the data for training models and uses weather and winemaking quality to predict wine quality. To evaluate her models, she decides to use the mean squared error loss such that
$$ \begin{align} MSE &= \frac{\sum^n_{i=1}\big(f(\mathbf{x}_i) - y_i\big)^2}{n} = \frac{\sum^n_{i=1} \ell \ \big(f(\mathbf{x}_i), y_i\big)^2}{n} \label{eq:mean-sq-error} \end{align} $$
where $n$ is the number of observations in the data set and $i$ is an individual observation. Under ideal circumstances, the tourist would know the joint distribution underlying the feature and outcome spaces, $\mathcal{D}(\mathbf{X}, \mathbf{Y})$, and could use it to compute the generalization/true error or risk of a function: the error on unseen data. Obtaining the generalization error is important because it indicates how well a model will perform when deployed. If the tourist knows the joint distribution, then the generalization error can be computed as the expected squared loss such that
$$ \begin{align} L_{\mathcal{D}}(f) = \mathbb{E}[\ell \ \big(f(\mathbf{x}), y\big)^2], \end{align} $$
where $L_D(f)$ is the loss of $f$ with respect to the joint distribution, $\mathcal{D}$. Unfortunately, because the joint distribution is seldom known in practice, generalization error cannot be directly computed.
Although generalization error is not directly computable, one solution is to compute loss on a collected data sample. By using a sample of collected data, the strong law of large numbers indicates that, as the number of observations approaches infinity, the loss obtained on the sample data will approach the generalization error (see Equation \ref{eq:gen-error-approx}):
$$ \begin{align} \lim_{n \rightarrow \infty} \frac{\sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2}{n} &= L_{\mathcal{D}}(f). \label{eq:gen-error-approx} \end{align} $$
Therefore, to obtain an approximation of the generalization error, loss can be computed on a sample, which is often called the empirical loss and is represented as
$$ \begin{align} L_S(f) = \frac{\sum_{i=1}^n\big(f(\mathbf{x}_i) - y_i\big)^2}{n}. \label{eq:training-loss} \end{align} $$
It should be noted that a minimum sample size is required for the empirical loss to estimate the generalization error within some margin of error. This minimum sample size depends on several factors, such as the complexity of the class of functions being used (or hypothesis class), the margin of error that is tolerated, and the accepted probability of obtaining a bad sample that leads to a generalization error above that margin (for detailed explanations, see this course on the theory of machine learning).
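As a rough illustration of this convergence (a minimal sketch using a simple simulated linear relation rather than the wine data introduced later), the Python code below shows that the empirical mean squared error of a fixed function stabilizes near its true risk as the sample grows.
import numpy as np

rng = np.random.default_rng(seed=27)

def fixed_function(x):
    #an arbitrary, fixed prediction rule (not fit to the data)
    return 2.0 * x + 1.0

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(low=0, high=10, size=n)
    y = 2.0 * x + 1.0 + rng.normal(loc=0, scale=1.5, size=n) #true relation plus noise
    empirical_loss = np.mean((fixed_function(x) - y) ** 2)
    print(n, round(empirical_loss, 3)) #approaches the true risk of 1.5**2 = 2.25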
Model Selection
Because the empirical loss provides an approximation of the generalization error, one strategy for finding the function with the lowest generalization error, $f^\ast$, is to find the function with the lowest empirical loss. This approach of finding the function with the lowest empirical loss is often called empirical risk minimization. Mathematically, empirical risk minimization can be represented as
$$ \begin{align} f^\ast &= \underset{f}{\arg\min} \ L_S(f), \label{eq:erm} \end{align} $$
where $f^\ast$ is the function, among the set of all possible functions $f$, that obtains the lowest empirical loss.
Two points deserve mention with respect to $f^\ast$. First, it is often called the empirical risk minimizer because it obtains the lowest empirical loss. Second, in the current example, because empirical risk minimization occurs over the set of all possible functions, $f$, the best possible function will inevitably exist in this set. The best possible function (i.e., the function with the lowest generalization error) is often called the Bayes decision function. Because we assume agnostic learning (i.e., that no function can perfectly predict the outcome), the Bayes decision function will have nonzero loss, which is often called the Bayes risk.
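As a toy illustration of empirical risk minimization over a small, finite set of candidate functions (the candidates and data below are made up for illustration and are unrelated to the wine example), the Python code below selects the candidate with the lowest empirical loss.
import numpy as np

rng = np.random.default_rng(seed=27)
x = rng.uniform(low=1, high=10, size=200)
y = 0.5 * x + rng.normal(loc=0, scale=1.0, size=200)

#three candidate prediction rules (a crude, finite set of functions)
candidates = {"f1: 0.25x": lambda x: 0.25 * x,
              "f2: 0.50x": lambda x: 0.50 * x,
              "f3: 0.75x": lambda x: 0.75 * x}

#empirical risk minimization: pick the candidate with the lowest empirical loss
empirical_losses = {name: np.mean((f(x) - y) ** 2) for name, f in candidates.items()}
print(min(empirical_losses, key=empirical_losses.get)) #f2: 0.50x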
Excess Risk Decomposition: A Setup for Understanding Supervised Machine Learning
In the sections that follow, I will provide a general framework or setup for understanding how three types of error inevitably emerge in the formal model of supervised learning presented above. This framework is often called excess risk decomposition because the errors occur in addition to the Bayes risk (Bottou & Bousquet, 2007). For a lecture on excess risk decomposition, see Bloomberg - Lecture 5. Excess risk can be represented mathematically as shown below in Equation \ref{eq:excess-risk}:
$$ \begin{align} \text{Excess risk } = \epsilon_{app} + \epsilon_{est} + \epsilon_{opt}, \label{eq:excess-risk} \end{align} $$
where $\epsilon_{app}$ represents approximation error, $\epsilon_{est}$ represents estimation error, and $\epsilon_{opt}$ represents optimization error.
In the sections that follow, I will explain each type of error and provide Python code to simulate it. Importantly, before explaining each type of error, a data set is first required. Therefore, I will first provide an overview of the data generation procedure for the wine assessment example.
Preamble: Data Generation Procedure
In the wine assessment example, there were two features and one outcome. For the two features, the tourist collected data on weather and winemaking quality and computed composite scores for each feature, thus yielding a continuous distribution of scores on a 1–10 scale. For the outcome, the tourist tasted each wine and calculated composite scores for wine quality, likewise yielding a continuous distribution of scores on a 1–10 scale (where 1 indicates a poor wine and 10 indicates an excellent wine). To simulate these data, I will assume agnostic learning and, therefore, use a joint distribution over the feature and outcome spaces, with the features of weather ($\mathbf{x}_1$) and winemaking quality ($\mathbf{x}_2$) being normally distributed such that
$$ \begin{spreadlines}{0.5em} \begin{align*} \mathbf{X} &= [\mathbf{x}_1, \mathbf{x}_2] \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \cdot \mathbb{I}(\mathbf{x}_1, \mathbf{x}_2 \in [1, 10]), \\ \boldsymbol{\mu} &= [5, 7] \text{ and} \\ \boldsymbol{\sigma} &= [1.2, 1.7] \\ \end{align*} \end{spreadlines} $$
with the covariance matrix
$$
\begin{spreadlines}{0.5em}
\begin{align*}
\rho_{\mathbf{x}_1, \mathbf{x}_2} &= .35 \text{ and} \\
\boldsymbol{\Sigma} &= \begin{bmatrix}
\boldsymbol{\sigma}^2_1 & \rho_{\mathbf{x}_1, \mathbf{x}_2} \boldsymbol{\sigma}_1\boldsymbol{\sigma}_2 \\
\rho_{\mathbf{x}_1, \mathbf{x}_2} \boldsymbol{\sigma}_1\boldsymbol{\sigma}_2 &\boldsymbol{\sigma}^2_2
\end{bmatrix}.
\end{align*}
\end{spreadlines}
$$
Note that a truncated normal distribution was used to generate data for the features such that all the values for each feature fell between 1 and 10 (inclusive). To generate the outcome values for wine quality, I used the regression equation shown below in Equation \ref{eq:outcome-generate}:
$$ \begin{spreadlines}{0.5em} \begin{align} \mathbf{y} &= \mathbf{Xw} + \mathbf{e}, \text { where} \label{eq:outcome-generate} \\ \mathbf{X} &= [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_1^2, \mathbf{x}_2^2, \mathbf{x}_1\mathbf{x}_2], \nonumber \\ \mathbf{w} &= [0.3 , 0.1 , 0.07, -0.1 , 0.1], \text{ and} \label{eq:pop-weights} \\ \mathbf{e} &\sim \mathcal{N}(\mu = 0, \sigma = 1.5). \nonumber \end{align} \end{spreadlines} $$
Because Equation \ref{eq:outcome-generate} produced outcome values that were not scaled within a 1–10 range, I used the min-max normalization formula shown below in Equation \ref{eq:rescale} to scale the outcome values of wine quality within a 1–10 range such that
$$ \begin{align} \mathbf{y}_{scaled} = a + \frac{(\mathbf{y} - \min(\mathbf{y}))(b-a)}{\max(\mathbf{y})-\min(\mathbf{y})} \label{eq:rescale}, \end{align} $$
where $a$ indicates the new lower limit (i.e., 1), $b$ indicates the new upper limit (i.e., 10), $\min(\mathbf{y})$ indicates the old lower limit, and $\max(\mathbf{y})$ indicates the old upper limit. To compute the old lower limit, I used the lowest possible value for $\mathbf{x}_1$ of 1 and the highest possible value for $\mathbf{x}_2$ of 10,2 which, in turn, set the values for the polynomial features ($\mathbf{x}^2_1, \mathbf{x}^2_2$) and the interaction feature ($\mathbf{x}_1\mathbf{x}_2$). I then pre-multiplied this vector by the vector of population weights (see Equation \ref{eq:pop-weights}) and subtracted an error value that corresponded to three standard deviations below the mean, yielding the following computation for the old lower limit in Equation \ref{eq:old-lower-limit}:
$$ \begin{spreadlines}{0.5em} \begin{align} \text{Old lower limit } &= \mathbf{w}^\top [1, 10, 1, 100, 10] - 3(1.5) \label{eq:old-lower-limit} \\ &=-12.13. \nonumber \end{align} \end{spreadlines} $$
To compute the old upper limit, I used the highest possible values for $\mathbf{x}_1$ and $\mathbf{x}_2$ of 10, which, in turn, set the values for the polynomial features ($\mathbf{x}^2_1, \mathbf{x}^2_2$) and interaction feature ($\mathbf{x}_1\mathbf{x}_2$). I then pre-multiplied this vector by the vector of population weights (see Equation \ref{eq:pop-weights}) and added an error value that corresponded to three standard deviations above the mean, yielding the following computation for the old upper limit in Equation \ref{eq:old-upper-limit}:
$$ \begin{spreadlines}{0.5em} \begin{align} \text{Old upper limit } &= \mathbf{w}^\top [10, 10, 100, 100, 100] + 3(1.5) \label{eq:old-upper-limit} \\ &=15.50. \nonumber \end{align} \end{spreadlines} $$
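As a quick check of this arithmetic, the short NumPy snippet below reproduces the two limits from the population weights (a verification sketch; the feature vectors are exactly those shown in Equations \ref{eq:old-lower-limit} and \ref{eq:old-upper-limit}).
import numpy as np

#population weights, ordered as [x1, x2, x1^2, x2^2, x1*x2]
w = np.array([0.3, 0.1, 0.07, -0.1, 0.1])
#old lower limit: x1 = 1, x2 = 10, minus three SDs of error
old_lower_limit = w @ np.array([1, 10, 1, 100, 10]) - 3 * 1.5
#old upper limit: x1 = 10, x2 = 10, plus three SDs of error
old_upper_limit = w @ np.array([10, 10, 100, 100, 100]) + 3 * 1.5
print(round(old_lower_limit, 2), round(old_upper_limit, 2)) #-12.13 15.5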
With the outcome variable of wine quality being scaled, the population regression coefficients can also be scaled by first computing a scaling factor (Equation \ref{eq:scaling}) and then multiplying the old regression weights by the scaling factor (Equation \ref{eq:scaled-weights}).
$$
\begin{spreadlines}{0.5em}
\begin{align}
\text{Scaling factor} &= \frac{\text{New scale range}}{\text{Old scale range}}
\label{eq:scaling} \\
&= \frac{10 - 1}{15.50 - (-12.13)} \nonumber \\
&= 0.33 \nonumber \\
\mathbf{w}_{scaled} &= \mathbf{w}(0.33)
\label{eq:scaled-weights} \\
&= [0.10, 0.03, 0.02, -0.03, 0.03] \label{eq:scaled-values}
\end{align}
\end{spreadlines}
$$
Lastly, the new scaled intercept value can be computed by first computing the mean of each feature in $\bar{\mathbf{X}}$, taking the weighted sum of the rescaled weights and mean of each feature, $ \mathbf{w}_{scaled}^\top\bar{\mathbf{X}}$, and then subtracting this value from the mean outcome value, $\bar{\mathbf{y}}$ (see Equation \ref{eq:scaled-intercept}).
$$ \begin{align} \text{intercept}_{ scaled} &= \bar{\mathbf{y}} - \mathbf{w}^\top_{scaled} \bar{\mathbf{X}} \label{eq:scaled-intercept} \end{align} $$
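To make the rescaling concrete, the snippet below computes the scaling factor and the rescaled weights from Equations \ref{eq:scaling} and \ref{eq:scaled-weights} (a verification sketch using the limits computed above; the rescaled intercept additionally requires the feature and outcome means from a generated sample, per Equation \ref{eq:scaled-intercept}).
import numpy as np

w = np.array([0.3, 0.1, 0.07, -0.1, 0.1])
#scaling factor: new scale range divided by old scale range
scaling_factor = (10 - 1) / (15.50 - (-12.13))
w_scaled = w * scaling_factor
print(round(scaling_factor, 2)) #0.33
print(np.round(w_scaled, 2)) #[ 0.1   0.03  0.02 -0.03  0.03]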
The Python code block below defines the functions that generate the data for the wine example such that there are two features (weather, winemaking_quality) and one outcome variable (wine_quality). The outcome variable is created using Equation \ref{eq:outcome-generate} and then scaled using Equation \ref{eq:rescale}.
def provide_weights():
b_weight_weather = 0.3
b_weight_winemaking = 0.1
b_weight_weather_squared = 0.07
b_weight_winemaking_squared = -0.1
b_weight_weather_winemaking = 0.1
weights = np.array([b_weight_weather, b_weight_winemaking,
b_weight_weather_squared, b_weight_winemaking_squared, b_weight_weather_winemaking])
return weights
def compute_old_scale_range():
# assume highest/lowest possible error scores are +-3 SDs
old_lower_limit = np.dot(a=provide_weights(), b=np.array([1, 10, 1, 100, 10])) - 3 * 1.5
old_upper_limit = np.dot(a=provide_weights(), b=np.array([10, 10, 100, 100, 100])) + 3 * 1.5
return old_upper_limit - old_lower_limit
def compute_rescaling_factor(new_upper_limit=10, new_lower_limit=1):
old_scale_range = compute_old_scale_range()
new_scale_range = new_upper_limit - new_lower_limit
rescaling_factor = new_scale_range / old_scale_range
return rescaling_factor
def create_covariance_matrix(sd, rho):
# Create a lower triangular matrix with zeros
n = len(sd)
cov_matrix = np.zeros((n, n))
# Fill lower and upper triangles of covariance matrix
cov_matrix[np.tril_indices(n, -1)] = rho * np.prod(sd)
cov_matrix = cov_matrix + cov_matrix.T
# Fill diagonal of covariance matrix
np.fill_diagonal(a=cov_matrix, val=sd)
return cov_matrix
def generate_trunc_predictors(mu, cov_matrix, sample_size,
lower_limits=1, upper_limits=10, seed=27):
# Upper and lower limits for variables
lower_limits = np.repeat(lower_limits, len(mu))
upper_limits = np.repeat(upper_limits, len(mu))
# Generate samples from multivariate distribution
sample_size = int(sample_size)
D = np.diag(np.ones(len(mu)))
np.random.seed(seed)
truncated_data = pd.DataFrame(tvm.rtmvn(n=sample_size, Mean=mu, Sigma=cov_matrix,
lower=lower_limits, upper=upper_limits, D=D),
columns=["weather", "winemaking_quality"])
return truncated_data
def compute_outcome_variable(data):
feature_cols = pd.concat(objs=[data, data ** 2, data.prod(axis=1)], axis=1)
# Error
error = np.random.normal(loc=0, scale=1.5, size=data.shape[0])
# Compute outcome variable of wine quality
data["wine_quality"] = np.dot(a=feature_cols, b=provide_weights()) + error
return data
def rescale_outcome_variable(data, new_lower_limit=1, new_upper_limit=10):
rescaling_factor = compute_rescaling_factor(new_upper_limit=new_upper_limit,
new_lower_limit=new_lower_limit)
old_lower_limit = np.dot(a=provide_weights(), b=np.array([1, 10, 1, 100, 10])) - 3 * 1.5
lower_limit_center = (data["wine_quality"] - old_lower_limit)
data["wine_quality"] = pd.DataFrame(data=new_lower_limit + lower_limit_center * rescaling_factor,
columns=["wine_quality"])
return data
def generate_mult_trunc_normal(cov_matrix, mu, sample_size, seed=27):
# generate predictors
data_mult_trunc_normal = generate_trunc_predictors(mu=mu, cov_matrix=cov_matrix,
sample_size=sample_size, seed=seed)
# generate outcome variable
data_mult_trunc_normal = compute_outcome_variable(data=data_mult_trunc_normal)
# scale outcome variable
data_mult_trunc_normal = rescale_outcome_variable(data=data_mult_trunc_normal)
return data_mult_trunc_normal
The Python code block below generates the multivariate truncated normal data such that all values for the features and the outcome variable fall between 1 and 10. Importantly, because we are generating the data, we can adopt an omniscient approach in the winemaking example and generate one data set for computing empirical loss and another data set for computing generalization error. Note that the sample size used for the generalization error data set is substantially larger so that generalization error can be accurately estimated.
#set the population means and SDs for weather and winemaking quality
mu = [5, 7]
sd = [1.2, 1.7]
#population correlation
rho_weather_winemaking = 0.35
cov_matrix = sml.create_covariance_matrix(sd = sd, rho = rho_weather_winemaking)
sample_size_emp_loss = 250
sample_size_gen_error = 1e4
#data used to compute empirical loss; note seed = 27 by default
data_emp_loss = sml.generate_mult_trunc_normal(cov_matrix = cov_matrix, mu = mu,
sample_size = sample_size_emp_loss, seed = 27)
#data used to compute generalization error
data_gen_error = sml.generate_mult_trunc_normal(cov_matrix = cov_matrix, mu = mu,
sample_size = sample_size_gen_error, seed = 21)
Because of the long processing time needed to generate the above data sets, I provide them in the smltheory package and show how they can be obtained in the Python code block below.
#data for empirical loss
data_emp_loss = sml.get_data_emp_loss()
#data for generalization error
data_gen_error = sml.get_data_gen_error()
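Before moving on, a quick inspection can confirm the structure of the bundled data sets (assuming, as described above, that each contains the columns weather, winemaking_quality, and wine_quality and that the sample sizes match those specified earlier).
#first rows and dimensions of the bundled data sets
print(data_emp_loss.shape) #expected: (250, 3)
print(data_gen_error.shape) #expected: (10000, 3)
print(data_emp_loss.head())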
Approximation Error: The Result of Having to Constrain Empirical Loss Minimization
To obtain an approximation of a function’s ability to make accurate predictions on unseen data (i.e., its generalization error), the strong law of large numbers indicated that, with a sufficiently large sample, the loss computed on sample data would suffice. As a corollary, empirical risk minimization then suggested that the best possible function could be identified by determining the function that obtains the lowest empirical loss. Unfortunately, empirical risk minimization inevitably fails when optimizing over all possible functions. One method to prevent empirical risk minimization from failing is to constrain optimization to a set of functions. Although constraining empirical risk minimization provides a solution, it creates approximation error.
In the sections that follow, I will explain three points. First, I will explain how empirical risk minimization fails. Second, I will explain how constraining empirical risk minimization can prevent overfitting. Third, I will explain how approximation error is an inevitable outcome of constraining empirical risk minimization.
Overfitting: The Failure of Empirical Risk Minimization
To recall, empirical risk minimization assumes that the function with the lowest generalization error can be identified by simply determining the function with the lowest empirical loss. Mathematically,
$$ \begin{align} f^\ast &= \underset{f}{\arg\min} \ L_S(f). \tag{\ref{eq:erm} revisited} \end{align} $$
Although optimizing empirical loss over all possible functions makes theoretical sense, it is impractical for two reasons. First, doing so is computationally demanding. Second, and perhaps most important, minimizing empirical loss over all possible functions will likely result in overfitting: when the generalization error of a function is considerably higher than its empirical loss because the function’s high complexity has caused it to (over)fit sample-specific information.
Overfitting commonly occurs as an outcome of minimizing empirical loss over a set of functions that is too complex or large. As an example of overfitting, I compute empirical loss and generalization error over the set of polynomial models up to the ninth order ($P = 9$) such that
$$ \begin{align} f \in \mathcal{O}:f(\mathbf{x}_i) &= \text{int} + \bigg(\sum_{p=1}^{P} w_{p_1} x^p_{1_i} + w_{p_2} x^p_{2_i}\bigg) + \mathbb{I}_{(P\ge 2)}\bigg(\sum_{p=1}^{P-1}\sum_{q=1}^{P-1} w_{pq} x^p_{1_i} x^q_{2_i}\bigg), \label{eq:all-functions} \end{align} $$
where $f(\mathbf{x}_i)$ represents the predicted wine quality value for the $i^{th}$ wine, which is the sum of an intercept ($\text{int}$), the weighted sum of the polynomial terms of the corresponding $i^{th}$ weather ($x_{1_i}$) and winemaking quality ($x_{2_i}$) values, and the weighted sum of all possible two-way interactions. Note that, in Equation \ref{eq:all-functions}, interactions are only included if the specified polynomial order is greater than or equal to 2, $\mathbb{I}_{(P \ge 2)}$. The Python code block below contains functions that compute the generalization error and empirical loss (using mean squared error; see Equation \ref{eq:mean-sq-error}) for each polynomial model with an order in the specified range.
def gen_poly_reg_eq(poly_order, include_interactions=False):
# compute polynomial terms for predictors
weather_pred = ' + '.join(['np.power(weather, {})'.format(ord_value) for ord_value in range(1, poly_order + 1)])
winemaking_quality_pred = ' + '.join(['np.power(winemaking_quality, {})'.format(ord_value)
for ord_value in range(1, poly_order + 1)])
# Compute all two-way interactions between weather and winemaking_quality
if include_interactions and poly_order > 1:
interaction_terms = ' + '.join(
['np.multiply(np.power(weather, {}), np.power(winemaking_quality, {}))'.format(p1, p2)
for p1 in range(1, poly_order)
for p2 in range(1, poly_order)])
predictors = ' + '.join([weather_pred, winemaking_quality_pred, interaction_terms])
else:
predictors = ' + '.join([weather_pred, winemaking_quality_pred])
# create regression equation
equation = "wine_quality ~ {}".format(predictors)
return equation
def compute_emp_gen_error(equation, data_emp_loss, data_gen_error):
import numpy as np
model = smf.ols(data=data_emp_loss, formula=equation).fit()
y_test = model.predict(data_gen_error)
emp_loss = mean_squared_error(y_true=data_emp_loss['wine_quality'], y_pred=model.fittedvalues)
gen_error = mean_squared_error(y_true=data_gen_error['wine_quality'], y_pred=y_test)
emp_loss_gen_error_dict = {'emp_loss': emp_loss, 'gen_error': gen_error}
return emp_loss_gen_error_dict
def compute_all_emp_gen_errors(data_emp_loss, data_gen_error, poly_order_range, include_interactions=False):
# create all polynomial equations within the desired range
gen_poly_reg_eq_partial = partial(gen_poly_reg_eq, include_interactions=include_interactions)
poly_equations = list(map(gen_poly_reg_eq_partial, poly_order_range))
# create partial version of function with data_emp_loss and data_gen_error fixed
emp_gen_error_partial = partial(compute_emp_gen_error, data_emp_loss=data_emp_loss,
data_gen_error=data_gen_error)
# for each polynomial equation, compute empirical loss and generalization error
all_emp_gen_errors = list(map(emp_gen_error_partial, poly_equations))
# convert dictionary to dataframe and then compute polynomial orders by using row indexes
df_emp_gen_errors = pd.DataFrame(all_emp_gen_errors)
poly_orders = pd.Series([poly_order for poly_order in poly_order_range], name="poly_order")
# concatenate poly_orders and dataframe to create complete dataframe
df_emp_gen_errors = pd.concat([poly_orders, df_emp_gen_errors], axis=1)
return df_emp_gen_errors
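As a quick illustration of the equation builder defined above, the call below prints the statsmodels/Patsy-style formula for a second-order model with interactions (assuming, per the preamble, that the function is exported by smltheory as sml.gen_poly_reg_eq; the printed formula string is abbreviated here).
#formula for a second-order polynomial model with two-way interactions
formula = sml.gen_poly_reg_eq(poly_order=2, include_interactions=True)
print(formula)
#wine_quality ~ np.power(weather, 1) + np.power(weather, 2) + ...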
If empirical risk minimization is followed, then the function with the highest polynomial order would be chosen because it obtains the lowest empirical loss, under the assumption that it also has the lowest generalization error. Figure 1 shows that empirical loss decreases almost linearly as polynomial order increases. The Python code block below computes and plots the empirical loss for each polynomial model up to the ninth order.
#compute empirical loss and generalization error for polynomial models up to ninth order
data_emp_gen_errors = sml.compute_all_emp_gen_errors(data_emp_loss = data_emp_loss,
data_gen_error = data_gen_error, include_interactions = True,
poly_order_range=range(1, 10))
#set inplace = False to keep gen_error in original data frame
data_poly_emp_loss = data_emp_gen_errors.drop("gen_error", axis=1, inplace = False)
#create empirical loss plot
emp_plot = (
pt.ggplot(data = data_poly_emp_loss,
mapping=pt.aes(x = "poly_order", y = "emp_loss")) +
pt.geom_line(size=1, color = "#9ECAE1") +
pt.scale_x_continuous(name = "Polynomial Order", breaks = range(1, 10)) +
pt.scale_y_continuous(name = "Error Value (Mean Squared Error)") + # breaks=range(0, 601, 100)) +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text = pt.element_text(color = "#002241"),
axis_line = pt.element_line(color = "#002241"),
axis_ticks = pt.element_line(color = "#002241"),
axis_text = pt.element_text(color = "#002241"))
)
#save as .png
emp_plot.save("images/plot_emp_loss.png", width = 8, height = 6, dpi = 1000)
[Figure 1. Empirical loss as a function of polynomial order.]
If generalization error is examined, however, then following empirical risk minimization and selecting the highest-order polynomial model leads to a highly undesirable outcome. Figure 2 shows that, while empirical loss decreases as a model’s polynomial order increases, generalization error increases exponentially after the sixth polynomial order. Specifically, while empirical loss remains below 1 across all polynomial orders, generalization error goes from a value of approximately 2.41 at the sixth polynomial order to a value of approximately 1389.20 at the ninth polynomial order. Thus, although the most complex polynomial model results in the lowest empirical loss, it results in the highest generalization error, and so would perform the least well if used to make predictions on new data. The Python code block below computes and plots the empirical loss and generalization error for each polynomial model up to the ninth order.
#convert to long format
data_emp_gen_errors_long = data_emp_gen_errors.melt(id_vars = "poly_order",
value_vars = ['emp_loss', 'gen_error'],
var_name = "error_type",
value_name = "error_value")
# Define color palette for each error type
color_palette = {'emp_loss': '#9ECAE1', 'gen_error': '#002241'}
overfitting_plot = (
pt.ggplot(data = data_emp_gen_errors_long,
mapping=pt.aes(x = "poly_order", y = "error_value",
factor = "error_type", color = "error_type")) +
pt.geom_line(size=1) +
pt.scale_x_continuous(name = "Polynomial Order", breaks = range(1, 10)) +
pt.scale_y_continuous(name = "Error Value (Mean Squared Error)",
breaks=range(0, 1501, 250), limits = [0, 1501]) +
pt.labs(color = "Error Type") +
pt.scale_color_manual(values = color_palette,
labels = {'emp_loss': 'Empirical Loss', 'gen_error': 'Generalization Error'}) +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text = pt.element_text(color = "#002241"),
axis_line = pt.element_line(color = "#002241"),
axis_ticks = pt.element_line(color = "#002241"),
axis_text = pt.element_text(color = "#002241"))
)
#save as .png
overfitting_plot.save("images/overfitting_curve.png", width = 8, height = 6, dpi = 1000)
[Figure 2. Empirical loss and generalization error as a function of polynomial order, showing overfitting at higher orders.]
Given that the previous example showed how overfitting occurred with just a subset of complex functions, it becomes apparent that overfitting is inevitable if all functions are considered. Overfitting presents a serious issue to practitioners because it can lead to the use of models that make highly inaccurate predictions and cause organizations to incur severe costs. Fortunately, the occurrence of overfitting can be reduced by modifying the method of empirical risk minimization.
Reducing the Chance of Overfitting: Constrained Empirical Risk Minimization
Although overfitting is inevitable when all possible functions are considered, a function with low generalization error may still be found if only a subset of functions is considered. By restricting the set of functions over which empirical loss is optimized, it may be possible to largely prevent overfitting. This approach is known as constrained empirical risk minimization and can be represented mathematically as
$$ \begin{align} f_\mathcal{F} = \underset{f \in \mathcal{F}}{\arg \min} \ L_S(f), \end{align} $$
where $\mathcal{F}$ represents the constrained set of functions.
Returning to the wine example, the tourist may have knowledge that leads her to constrain the set of functions over which to optimize empirical loss. Specifically, she will only consider polynomial functions up to the fifth order and omit the two-way interaction terms because the number of interaction terms grows quadratically with the polynomial order, adding considerable complexity. Thus, the restricted set of functions over which the tourist will optimize empirical loss is
$$ \begin{align} f \in \mathcal{F}:f(\mathbf{x}_i) &= \text{int} + \bigg(\sum_{p=1}^{P} w_{p_1} x^p_{1_i} + w_{p_2} x^p_{2_i}\bigg), \label{eq:constrained-functions} \end{align} $$
where $P = 5$. Figure 3 below shows the empirical loss and generalization error as a function of polynomial order under constrained empirical risk minimization (see the Python code block below for the corresponding code). Across all polynomial orders, empirical loss provides a close approximation of generalization error. Thus, the tourist has prevented the occurrence of overfitting by implementing constrained empirical risk minimization. As an aside, a proof of the learning success of constrained empirical risk minimization can be found in Section 2.3.1 of Shalev-Shwartz and Ben-David (2014).
#compute empirical loss and generalization error for polynomial models up to fifth order
data_emp_gen_errors = sml.compute_all_emp_gen_errors(data_emp_loss = data_emp_loss,
data_gen_error = data_gen_error,
include_interactions = False,
poly_order_range=range(1, 6))
#convert to long format
data_emp_gen_errors_long = data_emp_gen_errors.melt(id_vars = "poly_order",
value_vars = ['emp_loss', 'gen_error'],
var_name = "error_type",
value_name = "error_value")
# Define color palette for each error type
color_palette = {'emp_loss': '#9ECAE1', 'gen_error': '#002241'}
constrained_erm_plot = (
pt.ggplot(data = data_emp_gen_errors_long,
mapping=pt.aes(x = "poly_order", y = "error_value",
factor = "error_type", color = "error_type")) +
pt.geom_line(size=1) +
pt.scale_x_continuous(name = "Polynomial Order", breaks = range(1, 10)) +
pt.scale_y_continuous(name = "Error Value (Mean Squared Error)",
limits = [0.2, 0.28]) +
pt.labs(color = "Error Type") +
pt.scale_color_manual(values = color_palette,
labels = {'emp_loss': 'Empirical Loss', 'gen_error': 'Generalization Error'}) +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text = pt.element_text(color = "#002241"),
axis_line = pt.element_line(color = "#002241"),
axis_ticks = pt.element_line(color = "#002241"),
axis_text = pt.element_text(color = "#002241"))
)
#save as .png
constrained_erm_plot.save("images/constrained_erm_plot.png", width = 8, height = 6, dpi = 1000)
[Figure 3. Empirical loss and generalization error as a function of polynomial order under constrained empirical risk minimization.]
In using constrained empirical risk minimization, the tourist can select the empirical risk minimizer and accurately predict wine quality for unseen data. With constrained empirical risk minimization, an upper limit is placed on the complexity of the functions, and, to the extent that complexity is limited, the incidence of overfitting will be reduced. As an aside, other methods such as regularized regression and bagging (see here) can be used to combat overfitting. Although constrained empirical risk minimization appears to be a solution, it is by no means a perfect one, as will be discussed in the next section.
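As a brief illustration of one such alternative, the sketch below fits a ridge regression on a ninth-order polynomial expansion of the two features using scikit-learn (the penalty value alpha = 10 is an arbitrary illustrative choice). Note that this penalization approach differs from the constrained function class used in this paper; it is shown only to indicate how a complexity penalty can curb overfitting.
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error

X_train = data_emp_loss[["weather", "winemaking_quality"]]
y_train = data_emp_loss["wine_quality"]
X_test = data_gen_error[["weather", "winemaking_quality"]]
y_test = data_gen_error["wine_quality"]

#ninth-order polynomial expansion with an L2 penalty on the weights
ridge_model = make_pipeline(PolynomialFeatures(degree=9), StandardScaler(), Ridge(alpha=10.0))
ridge_model.fit(X_train, y_train)
print(mean_squared_error(y_true=y_test, y_pred=ridge_model.predict(X_test)))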
Model Selection in Practice: Train, Validate, and Test
Before proceeding to discuss the limitations of constrained empirical risk minimization, a discussion of model selection in practice is apropos. Up to this point, I used two data sets to compute empirical loss and generalization error, with the data set for generalization error being considerably larger in order to obtain an accurate estimate. Given that I generated the data with Python functions, I could easily use two different data sets for computing empirical loss and generalization error. Unfortunately, the privilege of generating data with functions does not exist in practice: data must be collected and are, therefore, finite.
Although data are finite in practice, practitioners still want to accomplish three goals: develop models, compare them, and estimate how well the best model performs on unseen data (i.e., its generalization error). The model that performs best on unseen data is chosen for deployment. Importantly, practitioners must prevent overfitting while accomplishing each goal. To do so, they split the data into the following three parts (a simple splitting sketch follows the list):
- Training set: data used to develop models and obtain parameter estimates.
- Validation set: data used to obtain an estimate of each model’s generalization error so that models can be compared. Because the validation data are new (with respect to the training data), overfitting is unlikely, and so the model that obtains the lowest validation error is assumed to be the best model.
- Test set: data used as a last check that the best model identified using the validation set is indeed the best model. Because many models are often compared using the validation set, it is conceivable that one model may outperform all the others by overfitting patterns/noise specific to the validation set. To ensure this is not the case, empirical loss is computed on the test set.
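The sketch below shows one simple way to carve a single data set into these three parts using scikit-learn's train_test_split (the 60/20/20 proportions are an illustrative assumption, not a recommendation from this paper, and data_emp_loss is used only for demonstration).
from sklearn.model_selection import train_test_split

#first split off the training set, then split the remainder into validation and test sets
data_train, data_temp = train_test_split(data_emp_loss, test_size=0.4, random_state=27)
data_validate, data_test = train_test_split(data_temp, test_size=0.5, random_state=27)
print(len(data_train), len(data_validate), len(data_test)) #roughly 60%, 20%, and 20% of the rows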
Three points deserve mention with respect to model selection in practice. First, more refined approaches to model selection exist, such as cross-validation (see Section 7.10 in Hastie et al., 2009) and bagging. Second, the way in which models are trained, validated, and tested should resemble how the model will be used after deployment. For example, if a model will be used to make predictions one month into the future, it should be trained, validated, and tested such that data points within one month of when predictions will be made are not included. Third, deciding the sample size of each set is an important decision, and a good discussion of this issue can be found here.
The Irreducible Error That Results From Constrained Empirical Risk Minimization: Approximation Error
Although constrained empirical risk minimization reduces the incidence of overfitting, it is not without its limitations. By constraining the set of functions over which to optimize empirical loss, it is unlikely that the best possible function (i.e., the Bayes decision function) will exist in the constrained set. Figure 4 depicts the set of all functions and the constrained set of functions, $\mathcal{F}$. Although the Bayes decision function, $f^\ast$, exists in the set of all functions, it does not exist in the constrained set of functions. Thus, even the best function in the constrained set, $f_\mathcal{F}$ (which I call the constrained empirical risk minimizer), will have a generalization error greater than that of the Bayes decision function, with the difference between the two constituting approximation error (see Equation \ref{eq:approx-error} below).
$$ \begin{align} \text{Approximation error }\hspace{0.25cm} (\epsilon_{app}) \hspace{0.25cm} &= L_D(f_{\mathcal{F}}) - L_D(f^\ast) \label{eq:approx-error} \end{align} $$ Moreover, Figure 4 also makes it apparent that approximation error increases to the extent that empirical risk minimization is constrained. As empirical risk minimization is further constrained (i.e., as the size of the circle in Figure 4 decreases), the generalization error of the constrained empirical risk minimizer, $L_D(f_{\mathcal{F}})$, will increase, causing approximation error to increase. Recall that, because a probabilistic relation is assumed between the features and the outcome (and not a one-to-one relation, as is the case under realizability), the Bayes decision function will have a generalization error greater than zero, which is often called the Bayes risk. The Bayes risk can be represented as the difference between the obtained value of the outcome, $y$, and the output of the Bayes decision function, $f^\ast$, which is shown below in Figure 4.
[Figure 4. The set of all functions and the constrained set of functions, $\mathcal{F}$, with the Bayes decision function, $f^\ast$, lying outside the constrained set and the constrained empirical risk minimizer, $f_\mathcal{F}$, inside it.]
To compute approximation error, I calculate the difference between the generalization errors of the constrained empirical risk minimizer and the Bayes decision function. Beginning with the Bayes decision function, the scaled population regression weights from Equation \ref{eq:scaled-values} can be used to compute predictions. The difference between these predictions and the actual wine quality values is used to compute the Bayes risk (i.e., the generalization error of the Bayes decision function). The Python code block below constructs a function for computing the Bayes risk.
def generate_bayes_features(data):
# Feature columns
predictors = data[["weather", "winemaking_quality"]]
bayes_features = pd.concat(objs=[predictors, predictors**2, predictors.prod(axis=1)], axis=1)
return bayes_features
def compute_bayes_intercept(data):
bayes_features = generate_bayes_features(data=data)
# Intercept
intercept = np.mean(data["wine_quality"]) - np.sum(provide_weights() * compute_rescaling_factor() *
np.mean(bayes_features, axis=0))
return intercept
def compute_bayes_risk(data):
bayes_features = generate_bayes_features(data=data)
# Compute intercept
intercept = compute_bayes_intercept(data=data)
# Compute predictions for wine quality
test_pred = intercept + np.dot(a=bayes_features, b=provide_weights() * compute_rescaling_factor())
# Compute mean squared error
bayes_risk = mean_squared_error(y_true=data['wine_quality'], y_pred=test_pred)
return bayes_risk
Ending with the constrained empirical risk minimizer, Figure 3 indicated that the second-order polynomial model resulted in the lowest generalization error. To get the best possible parameter estimates for the second-order polynomial model (with no interactions), I generated another large data set (data_best_in_class; see the Python code block below). As with the previous data sets, data_best_in_class can be obtained from the smltheory package.
sample_size_best_in_class = 1e4
#use large data set to obtain a close approximation of the best possible estimates for the second-order
#polynomial model (without interactions)
data_best_in_class = sml.generate_mult_trunc_normal(cov_matrix = cov_matrix, mu = mu,
sample_size = sample_size_best_in_class, seed=7)
The Python code block below computes the approximation error by subtracting the Bayes risk from the generalization error of the constrained empirical risk minimizer.
data_best_in_class = sml.get_data_best_in_class()
#Bayes risk
bayes_risk = sml.compute_bayes_risk(data = data_gen_error)
#compute empirical loss and generalization error for polynomial models up to fifth order
constrained_erm = sml.compute_all_emp_gen_errors(data_emp_loss = data_best_in_class,
data_gen_error = data_gen_error,
include_interactions = False,
poly_order_range=range(2, 3))
#compute approximation error
approx_error = constrained_erm["gen_error"][0] - bayes_risk
print('Approximation error:', np.round(a = approx_error, decimals = 5))
Approximation error: 0.00119
To show approximation error, I plot the Bayes decision function and the constrained empirical risk minimizer below in Figure 5. The left-hand plot is a magnified version of the right-hand plot between the y-axis values of .230 and .234. The difference in generalization error between the Bayes decision function ($f^\ast$, light blue line) and the constrained empirical risk minimizer ($f_\mathcal{F}$, medium blue line) constitutes approximation error (see Equation \ref{eq:approx-error}). Importantly, neither function varies with the sample size shown on the x-axis. Because both the constrained empirical risk minimizer and the Bayes decision function represent ideal functions, they are not affected by the sample size used for model development, and so appear as constant lines in Figure 5 (I plot the sample size used for model development on the x-axis because I build on Figure 5 throughout this paper). The R code block below contains the code for creating Figure 5.
[Figure 5. Generalization error of the Bayes decision function and the constrained empirical risk minimizer; the gap between them is the approximation error.]
#load the R packages used for data wrangling and plotting (assumed here: tidyverse for pivot_longer and ggplot2,
#ggforce for facet_zoom, ggbrace for geom_brace, and latex2exp for TeX)
library(tidyverse)
library(ggforce)
library(ggbrace)
library(latex2exp)
#create long version of data_app_error where function type (bayes or constrained minimizer) is
#a categorical variable
data_app_error <- data.frame("sample_size" = 5:1000,
"bayes_risk" = py$bayes_risk,
"constr_min" = py$constrained_erm$gen_error)
data_app_error_long <- data_app_error %>%
pivot_longer(cols = "bayes_risk":"constr_min", names_to = "function_type", values_to = "gen_error",
names_ptypes = factor())
data_app_error_long$function_type <- factor(x = data_app_error_long$function_type,
levels = c("constr_min", "bayes_risk"))
labels <- c(TeX('Constrained risk minimizer, $f_\u2131$'),
TeX('Bayes decision function, $f^*$'))
line_color <- setNames(c("#2171B5", "#9ECAE1"), levels(data_app_error_long$function_type))
#create data set for braces that show estimation and approximation error.
##see https://stackoverflow.com/questions/45221783/ggforce-facet-zoom-labels-only-on-zoomed-example
data_brace_app_error <- data.frame("x" = c(250, 300),
"y" = c(py$constrained_erm$gen_error, py$bayes_risk),
"zoom" = T)
label_app_error <- "Approximation error, $\\epsilon_{app}$"
app_error_plot <- ggplot(data = data_app_error_long, mapping = aes(x = sample_size, y = gen_error,
group = function_type,
color = function_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(name = "Function Type", values = line_color, labels = labels) +
labs(x = "Sample Size for Model Development",
y = "Generalization Error (Mean Squared Error)",
color = "Function Type") +
scale_y_continuous(limits = c(0.22, 0.250)) +
facet_zoom(ylim = c(0.23, .234), zoom.size = 1.2, zoom.data = zoom) +
#approximation error brace
geom_brace(inherit.aes = F, data = data_brace_app_error,
mapping = aes(x = x, y = y, label=TeX(label_app_error, output="character")),
color = '#002241', labelsize = 6, rotate = 90, labeldistance = 5,
parse = T) +
theme_classic(base_family = "Helvetica") +
theme(legend.text.align = 0, #left align text
legend.position = 'bottom',
legend.direction = 'vertical',
legend.justification = 'left',
legend.text = element_text(size = 16),
legend.title = element_text(size = 18),
axis.title = element_text(size = 15),
axis.text = element_text(size = 14, color = "#002241"),
text = element_text(color = "#002241"),
axis.line = element_line(color = "#002241"),
axis.ticks = element_line(color = "#002241"))
ggsave(plot = app_error_plot, filename = 'images/app_error_plot.png', width = 12, height = 6, dpi = 1000)
Estimation Error: The Result of Having a Limited Amount of Data
In constraining empirical risk minimization, the constrained empirical risk minimizer is only guaranteed to be found to the extent that the sample size is large. Given that large sample sizes cannot always be obtained in practice, the constrained empirical risk minimizer, $f_\mathcal{F}$, is unlikely to be obtained. Instead, some imperfect estimate of $f_\mathcal{F}$ will result, which I will call the sample risk minimizer and denote as $\hat{f}_s$. Figure 6 below shows that, as a result of having a limited sample size, the sample risk minimizer, $\hat{f}_s$, has a generalization error greater than that of the constrained empirical risk minimizer, $f_\mathcal{F}$. The extent to which the generalization error of the sample risk minimizer exceeds that of the constrained empirical risk minimizer is often called estimation error, which is represented below in Equation \ref{eq:est-error}:
$$ \begin{align} \text{Estimation error}\hspace{0.25cm} (\epsilon_{est}) \hspace{0.25cm} &= L_D(\hat{f}_s) - L_D(f_{\mathcal{F}}) \label{eq:est-error} \end{align} $$
[Figure 6. Generalization error of the sample risk minimizer and the constrained empirical risk minimizer; the gap between them is the estimation error.]
Figure 7 below shows that, with a limited sample size, the sample risk minimizer always has a greater generalization error than the constrained empirical risk minimizer. Note that the left-hand plot is a magnified version of the right-hand plot between the y-axis values of .230 and .236. The difference in generalization error between the sample risk minimizer ($\hat{f}_s$, blue line) and the constrained empirical risk minimizer ($f_\mathcal{F}$, medium blue line) constitutes estimation error (see Equation \ref{eq:est-error}). Importantly, the x-axis represents the sample size used to obtain the sample risk minimizer ($\hat{f}_s$); because the constrained empirical risk minimizer, $f_\mathcal{F}$, and the Bayes decision function, $f^\ast$, represent theoretical ideal functions, they are unaffected by these sample sizes and remain constant across the x-axis.
[Figure 7. Generalization error of the sample risk minimizer, the constrained empirical risk minimizer, and the Bayes decision function as a function of sample size.]
To construct Figure 7 above, I followed three steps. First, I used a large data set (data_best_in_class) to identify the constrained empirical risk minimizer, $f_\mathcal{F}$, from the (constrained) set of functions defined in Equation \ref{eq:constrained-functions} (see the Python code block above). Briefly, only polynomial functions up to the fifth order of weather and winemaking quality were allowed to predict wine quality in the constrained set of functions. Second, to obtain each sample risk minimizer, $\hat{f}_s$, I sampled data (without replacement) from the same large data set used to obtain the constrained empirical risk minimizer (i.e., data_best_in_class) and then extracted the lowest generalization error value from the functions in the constrained set (see Equation \ref{eq:constrained-functions} and the Python code block below).3 Third, I computed the Bayes risk, $L_D(f^\ast)$. Importantly, to obtain an accurate estimate of each function’s generalization error, I computed mean squared error on another large data set (data_gen_error). The R code block below contains the code for creating Figure 7.
def compute_sample_risk_gen_error(sample_size, data_best_in_class, data_gen_error, poly_order_range=range(1, 6)):
# Use random_state to ensure reproducibility and prevent resampling from adding noise to estimates
gen_errors = compute_all_emp_gen_errors(data_emp_loss=data_best_in_class.sample(n=sample_size, random_state=27),
data_gen_error=data_gen_error,
include_interactions=False,
poly_order_range=poly_order_range)['gen_error']
# Return generalization error of sample risk minimizer
return gen_errors.min()
# Fix the arguments except for sample_size and then use map to vectorize over sample size values
compute_sample_risk_gen_error_partial = partial(sml.compute_sample_risk_gen_error,
data_best_in_class = data_best_in_class,
data_gen_error = data_gen_error,
poly_order_range = range(1, 5))
# Call the partial function with est_sample_sizes
est_sample_sizes = range(5, 1000)
est_gen_errors = list(map(compute_sample_risk_gen_error_partial, est_sample_sizes))
#Create data frame
data_est_error = pd.DataFrame({"sample_size": np.array(est_sample_sizes),
"sample_risk_gen_error": est_gen_errors})
Because of the long computing time needed to compute the generalization error of each sample risk minimizer, I provide the data set that contains these generalization errors in the smltheory package; it can be obtained using the get_data_est_error() function (see the code line below).
data_est_error = sml.get_data_est_error()
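With these generalization errors in hand, the estimation error at any given sample size is simply the gap in Equation \ref{eq:est-error}. The short sketch below computes it for a sample size of 100 (assuming, as shown above, that data_est_error contains the columns sample_size and sample_risk_gen_error and that constrained_erm still holds the constrained risk minimizer's generalization error).
#generalization error of the constrained empirical risk minimizer (computed earlier)
constr_min_gen_error = constrained_erm["gen_error"][0]
#generalization error of the sample risk minimizer obtained with 100 observations
sample_risk_gen_error = data_est_error.loc[data_est_error["sample_size"] == 100, "sample_risk_gen_error"].iloc[0]
print("Estimation error:", round(sample_risk_gen_error - constr_min_gen_error, 5))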
#import data set from Python
data_est_error <- py$data_est_error
#create long version of data_est_error where function (bayes, constrained minimizer, or sample minimizer) is a
#categorical variable
data_est_error["bayes_risk"] <- py$bayes_risk
data_est_error["constr_min"] <- py$constrained_erm$gen_error
data_est_error_long <- data_est_error %>%
pivot_longer(cols = "sample_risk_gen_error":"constr_min", names_to = "function_type", values_to = "gen_error",
names_ptypes = factor())
data_est_error_long$function_type <- factor(x = data_est_error_long$function_type,
levels = c("sample_risk_gen_error", "constr_min", "bayes_risk"))
#legend labels
labels <- c(TeX('Sample risk minimizer, $\\hat{f}_s$'),
TeX('Constrained risk minimizer, $f_\u2131$'),
TeX('Bayes decision function, $f^*$'))
line_color <- setNames(c("blue", "#2171B5", "#9ECAE1"), levels(data_est_error_long$function_type))
#create data set for braces that show estimation and approximation error.
##see https://stackoverflow.com/questions/45221783/ggforce-facet-zoom-labels-only-on-zoomed-example
brace_x_start <- which.min(data_est_error$sample_risk_gen_error[1:900]) #find minimum value in first 900 values
brace_x_end <- brace_x_start -50
data_brace_est_error <- data.frame("x" = c(brace_x_start, brace_x_end),
"y" = c(data_est_error$sample_risk_gen_error[brace_x_start],
py$constrained_erm$gen_error),
"zoom" = T)
label_est_error <- "Estimation error, $\\epsilon_{est}$"
est_error_plot <- ggplot(data = data_est_error_long, mapping = aes(x = sample_size, y = gen_error,
group = function_type,
color = function_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(name = "Function Type", values = line_color, labels = labels) +
labs(x = "Sample Size for Obtaining Sample Risk Minimizer", y = "Generalization Error (Mean Squared Error)",
color = "Function Type") +
facet_zoom(ylim = c(0.23, .236), zoom.size = 1.2, zoom.data = zoom) +
#estimation error brace
geom_brace(inherit.aes = F, data = data_brace_est_error,
mapping = aes(x = x, y = y, label=TeX(label_est_error, output="character")),
color = '#002241', labelsize = 6, rotate = 270, labeldistance = 5,
parse=T) +
theme_classic(base_family = "Helvetica") +
theme(legend.text.align = 0, #left align text
legend.position = 'bottom',
legend.direction = 'vertical',
legend.justification = 'left',
legend.text = element_text(size = 16),
legend.title = element_text(size = 18),
legend.margin = margin(t = 1, unit = "cm"),
axis.title = element_text(size = 18),
axis.text = element_text(size = 15, color = "#002241"),
text = element_text(color = "#002241"),
axis.line = element_line(color = "#002241"),
axis.ticks = element_line(color = "#002241"))
#save plot
ggsave(plot = est_error_plot, filename = 'images/est_error_plot.png', width = 12, height = 9, dpi = 1000)
Optimization Error: The Result of Imperfect Optimization
To understand optimization error, I will first provide brief overviews of two topics. First, I will describe the typical way of obtaining regression weights (i.e., using closed-form solutions). Second, I will provide an overview of gradient descent and how, given enough iterations, it can be used to arrive at the weights obtained with the closed-form solution. After explaining these two topics, I will show how optimization error can be introduced by using an insufficient number of iterations in the gradient descent algorithm.
A Brief Review of Using Closed-Form Solutions to Obtain Regression Weights
Returning to the wine example, the tourist wanted to use weather and winemaking quality to predict wine quality. Together, weather ($\mathbf{x}_1$) and winemaking quality ($\mathbf{x}_2$) constituted the set of features, $\mathbf{X} = [\mathbf{x}_1^\top, \mathbf{x}_2^\top]$, with wine quality representing the outcome vector, $\mathbf{y}$. Thus, using $\mathbf{X}$, the tourist wanted to compute predicted values for wine quality such that
$$ \begin{align} \hat{\mathbf{y}} = \mathbf{Xw}. \end{align} $$
To determine the accuracy of predictions for wine quality, the tourist used the mean squared error such that
$$ \begin{align} \ell_{MSE} = \frac{1}{n} \lVert \mathbf{y} - \mathbf{Xw} \rVert^2_2. \label{eq:mse-matrix} \end{align} $$
Note that Equation \ref{eq:mse-matrix} is simply a matrix version of the scalar mean squared error function presented in Equation \ref{eq:mean-sq-error}. To obtain accurate predictions for wine quality, the tourist decides to find some set of weights that minimizes the mean squared error. Mathematically,
$$ \begin{align} \mathbf{w}_{MSE} &= \underset{\mathbf{w}}{\arg\min} \ \frac{1}{n} \lVert \mathbf{y} - \mathbf{Xw} \rVert^2_2 \nonumber \\ &= \underset{\mathbf{w}}{\arg\min} \ \frac{1}{n} (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw}) \nonumber \\ &= \underset{\mathbf{w}}{\arg\min} \ \frac{1}{n} (\mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}). \end{align} $$
To find $\mathbf{w}_{MSE}$, the gradient of the mean squared error with respect to $\mathbf{w}$ can be computed and set to zero (see below).4 Note that I provide Equation \ref{eq:grad-resid} because it shows how the gradient is positively related to the size of the residual (i.e., $\mathbf{y} - \mathbf{Xw}$).
$$ \begin{spreadlines}{0.5em} \begin{align} \nabla_w \ell_{MSE} &= \frac{\partial\ MSE}{\partial \ \mathbf{w}} = \frac{1}{n} \frac{\partial}{\partial \ \mathbf{w}} \bigg(\mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}\bigg) \nonumber \\ &=\frac{1}{n}\bigg(0 - 2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}\bigg) \nonumber \\ &= \frac{1}{n} \bigg(2\mathbf{X}^\top(\underbrace{\mathbf{Xw} - \mathbf{y}}_{\text{Residual}})\bigg) \label{eq:grad-resid} \\ \text{Set } \ \nabla_w \ell_{MSE} &= 0 \nonumber \\ 0 &= - 2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w} \nonumber \\ \mathbf{X}^\top\mathbf{y} &= \mathbf{X}^\top\mathbf{X}\mathbf{w} \nonumber \\ \mathbf{w}_{MSE} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \label{eq:weight-mse-min} \end{align} \end{spreadlines} $$
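Before turning to the wine data, a quick numerical check of Equation \ref{eq:weight-mse-min} may be helpful. The sketch below is illustrative only and is not part of the smltheory package: it generates a small synthetic data set with known weights and confirms that the normal-equations solution matches the solution returned by NumPy's least-squares routine.
import numpy as np #already imported earlier; repeated here so the sketch is self-contained
rng = np.random.default_rng(27)
#synthetic features (two columns) and outcome with known weights plus noise
X = rng.normal(size=(200, 2))
true_w = np.array([0.6, 1.1])
y = X @ true_w + rng.normal(scale=0.5, size=200)
#closed-form solution from the normal equations: w = (X'X)^(-1) X'y
w_closed_form = np.linalg.inv(X.T @ X) @ X.T @ y
#NumPy's least-squares routine minimizes ||y - Xw||^2 directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed_form, w_lstsq)) #True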
To show that $\mathbf{w}_{MSE}$ does indeed contain the set of weights that minimizes the mean squared error, I will first compute $\mathbf{w}_{MSE}$ using Equation \ref{eq:weight-mse-min} and then show that it minimizes the mean squared error. Beginning with the computation of $\mathbf{w}_{MSE}$, I use the wine data set I generated in the section on Data Generation (see the Python code block between lines 512–525, which I repeat below).
#set the population means and SDs for weather and winemaking quality
mu = [5, 7]
sd = [1.2, 1.7]
#population correlation
rho_weather_winemaking = 0.35
cov_matrix = create_covariance_matrix(sd = sd, rho = rho_weather_winemaking)
sample_size_emp_loss = 250
#data used to compute empirical loss; note seed = 27 by default
data_emp_loss = generate_mult_trunc_normal(cov_matrix = cov_matrix, mu = mu, sample_size = sample_size_emp_loss)
#note that the data_emp_loss data set can be loaded from the smltheory package
data_emp_loss = sml.get_data_emp_loss()
The Python code block below (see lines 526–557) computes $\mathbf{w}_{MSE}$ and finds that
$$ \begin{align} \mathbf{w}_{MSE} &= [0.64, 1.1, -0.01 ,-0.1 ] \label{eq:mse-min} \end{align} $$
def extract_feature_outcome_data(data, sample_size=10, sample_data=False, poly_order=2):
    if sample_data:
        data = data.sample(n=sample_size, random_state=27)
    # extract set of features and then expand according to polynomial order that is specified
    original_features = data[["weather", "winemaking_quality"]]
    features = pd.concat(objs=[original_features ** order for order in range(1, poly_order + 1)], axis=1).to_numpy()
    outcome = data[["wine_quality"]].to_numpy().ravel()
    dict_features_outcome = {"features": features, "outcome": outcome}
    return dict_features_outcome
def compute_weight_min_mse(data, poly_order=2):
    # gather necessary components for matrix-matrix multiplications
    dict_data = extract_feature_outcome_data(data=data, poly_order=poly_order)
    features = dict_data["features"]
    outcome = dict_data["outcome"]
    # compute weight values that result from using closed-form solution
    w_mse = np.linalg.inv(np.transpose(features).dot(features)).dot(np.transpose(features)).dot(outcome)
    return w_mse.ravel()
w_mse = sml.compute_weight_min_mse(data = data_emp_loss)
print("Best weights:", np.round(a = w_mse, decimals = 2))
Best weights: [ 0.64 1.1 -0.01 -0.1 ]
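As an additional check on the first-order condition used to derive Equation \ref{eq:weight-mse-min}, the short sketch below (illustrative only; it reuses data_emp_loss and w_mse from the code above and rebuilds the second-order polynomial feature matrix directly) verifies that the gradient in Equation \ref{eq:grad-resid} is numerically zero at $\mathbf{w}_{MSE}$.
#rebuild the second-order polynomial feature matrix and outcome vector used above
original_features = data_emp_loss[["weather", "winemaking_quality"]]
X = pd.concat(objs=[original_features ** order for order in range(1, 3)], axis=1).to_numpy()
y = data_emp_loss["wine_quality"].to_numpy()
#gradient of the mean squared error at w_mse; it should be approximately zero at the minimizer
gradient_at_min = 2 * X.T.dot(X.dot(w_mse) - y) / len(y)
print(np.round(a = gradient_at_min, decimals = 8))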
Ending with the visualization of $\mathbf{w}_{MSE}$, I now show that it indeed results in the lowest mean squared error value. To construct a more palatable visualization of the mean squared error (i.e., using two dimensions), I only allow the first weight to vary ($w_1$) and fix the values of the remaining three weights. Thus, the x-axis shows the values for the first weight and the y-axis represents the mean squared error value. I construct this plot using the Python code block below (see lines 558–618) and show it in Figure 8. When the first regression weight value ($w_1$) is 0.64, the lowest mean squared error value is obtained, and this matches with the result in Equation \ref{eq:mse-min} of using the closed-form solution in Equation \ref{eq:weight-mse-min}.
def compute_ind_mse(data, w_guess, poly_order=2):
    # gather necessary components for matrix-matrix multiplications
    dict_data = extract_feature_outcome_data(data=data, poly_order=poly_order)
    features = dict_data["features"]
    outcome = dict_data["outcome"]
    # compute weights that result from using closed-form solution
    w_mse = compute_weight_min_mse(data=data, poly_order=poly_order)
    # extract all the values of the closed-form solution except the first value
    w_initial = w_mse[1:]
    w_initial = np.insert(arr=w_initial, obj=0, values=w_guess)
    # compute mean squared error value (i.e., mse = yy - 2wXy + wXXw; the 1/n factor is omitted
    # because it does not change the location of the minimum)
    yy = np.transpose(outcome).dot(outcome)
    wXy = np.transpose(w_initial).dot(np.transpose(features)).dot(outcome)
    wXXw = np.transpose(w_initial).dot(np.transpose(features)).dot(features).dot(w_initial)
    mse = yy - 2 * wXy + wXXw
    return mse
def compute_all_mse_loss_values(data, w_guess_list):
    # return flattened array
    all_mse_loss_values = np.array([compute_ind_mse(data, guess) for guess in w_guess_list])
    # create dataframe with guesses and corresponding MSE values
    df_mse = pd.DataFrame({"w_guess": w_guess_list,
                           "mse_value": all_mse_loss_values})
    return df_mse
data_mse_values = sml.compute_all_mse_loss_values(data = data_emp_loss, w_guess_list = np.arange(start = 0, stop = 3, step= 0.01))
w_mse1 = data_mse_values.iloc[np.argmin(data_mse_values["mse_value"])].to_frame().transpose()
#values to insert into x-axis labels
values_to_insert = np.concatenate((np.array([0]), w_mse1['w_guess'].to_numpy()))
mse_plot = (pt.ggplot(data = data_mse_values, mapping = pt.aes(x = "w_guess", y = "mse_value")) +
pt.geom_line(size = 1, color = "#002241") +
pt.geom_point(data = w_mse1, size = 2) +
pt.scale_x_continuous(breaks = np.insert(arr = np.arange(1, 3, 0.5), obj = [0, 1],
values = values_to_insert)) +
pt.labs(x = "Weight Value (w₁)", y = "Mean Squared Error") +
pt.annotate(geom = 'text', x=w_mse1['w_guess'], y=w_mse1['mse_value'] + 2e3,
label="$w_{MSE}$", color="#002241") +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text = pt.element_text(color = "#002241"),
axis_line = pt.element_line(color = "#002241"),
axis_ticks = pt.element_line(color = "#002241"),
axis_ticks_minor_x = pt.element_blank()))
#save plot
mse_plot.save("images/mse_plot.png", width = 8, height = 6, dpi = 1000)

A Brief Review of Gradient Descent
In looking at Figure 8, the best set of regression weights ($\mathbf{w}_{MSE}$) occurs at the trough of the mean squared error function. In the previous section, the weight value at the trough was solved for analytically by setting the gradient to zero and solving for $\mathbf{w}$ (see Equation \ref{eq:weight-mse-min}). In this section, I will arrive at the trough not through an analytical method, but by descending the mean squared error function in Figure 8 until the trough is reached. This method of descending a function until a trough is reached is known as gradient descent. I present the gradient descent algorithm as pseudocode in Algorithm 1 below.
Algorithm 1 Gradient Descent
Require: Function for computing loss function, $\ell_{MSE}$ (see Equation $\ref{eq:mse-matrix}$)
Input: Initial guess for weights $\mathbf{w}^{(0)}$
1:procedure gradientDescent($K, \epsilon, \tau$)
2:$k = 0$
3:while $k < K$ and $\lVert \mathbf{w}^{(k)} - \mathbf{w}^{(k+1)} \rVert^2_2 > \epsilon$ do
4:$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \tau \nabla_{\mathbf{w}}\ell_{MSE}(\mathbf{w}^{(k)})$.
5:$k \mathrel{+}= 1$
6:end while
7:end procedure
In the gradient descent algorithm, weights are updated by travelling in the negative gradient direction until either the squared norm of the change in weights (i.e., the sum of squared differences between consecutive weight vectors) is less than some error value ($\epsilon$) or $K$ iterations are reached. Importantly, the step size ($\tau$) controls how far the algorithm moves in the negative gradient direction. Although outside the scope of the current discussion, the step size must be neither too small nor too large; if too small, too much time will be needed to reach the trough and, if too large, the trough may never be reached because the algorithm will jump around too much.
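To make the role of the step size concrete, the toy sketch below (not from the smltheory package) runs gradient descent on a one-dimensional quadratic, $\ell(w) = (w - 3)^2$, whose trough is at $w = 3$. With a very small step size the algorithm barely moves after 50 iterations, with a moderate step size it converges to the trough, and with an overly large step size it diverges.
def toy_gradient_descent(step_size, num_iterations=50, w_init=0.0):
    #minimize (w - 3)^2, whose gradient is 2(w - 3)
    w = w_init
    for _ in range(num_iterations):
        w -= step_size * 2 * (w - 3)
    return w

for tau in [0.001, 0.1, 1.05]:
    print(tau, round(toy_gradient_descent(step_size=tau), 3))
#0.001 -> w moves only slightly toward 3 (step size too small)
#0.1   -> w is approximately 3 (step size appropriate)
#1.05  -> w ends up far from 3 (step size too large; the algorithm diverges)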
The Python code block below (lines 619–681) contains functions necessary for computing gradient descent.
def compute_gradient(features, outcome, weights, num_samples):
    # Compute predictions (Xw) using current weights
    predictions = np.dot(features, weights)
    # Compute the residuals (difference between outcome and predictions, y - Xw)
    residuals = outcome - predictions
    # Compute the gradient of the cost function with respect to the weights, (2/n)X'(Xw - y)
    gradient = -2 * np.dot(features.T, residuals) / num_samples
    return gradient
def gradient_descent(data, initial_weights, learning_rate=0.01, num_iterations=1000, poly_order=2,
                     epsilon=1e-20, expedite_algorithm=False, return_log=False):
    dict_data = extract_feature_outcome_data(data=data, poly_order=poly_order)
    features = dict_data["features"]
    outcome = dict_data["outcome"]
    # Initialize weights and variable that stores all weights on each iteration
    all_weights = initial_weights  # variable necessary for logging weight values after each iteration
    weights = np.array(initial_weights)
    num_samples = len(features)
    iteration = 0
    diff = float('inf')  # Set an initial large value to ensure the loop runs at least once
    # loop proceeds until either the maximum number of iterations is reached or the difference
    # becomes less than the desired error
    while iteration < num_iterations and diff > epsilon:
        gradient = compute_gradient(features=features, outcome=outcome,
                                    weights=weights, num_samples=num_samples)
        # Update weights using the gradient and learning rate
        weights -= learning_rate * gradient
        # if desired, I expedite the algorithm by setting all the weights except the first one to the
        # values obtained by the closed-form solution in each iteration
        if expedite_algorithm:
            weights[1:] = compute_weight_min_mse(data=data, poly_order=poly_order)[1:]
        # update iteration number and check if difference in weights is less than epsilon
        iteration += 1
        diff = np.sum(np.square(learning_rate * gradient))
        if return_log:
            all_weights = np.vstack(tup=(all_weights, weights))
    if return_log:
        # create dataframe
        col_names = ["{}{}".format('w', str(weight_num)) for weight_num in range(1, all_weights.shape[1] + 1)]
        df_all_weights = pd.DataFrame(data=np.array(all_weights),
                                      columns=col_names)
        # insert iteration number
        df_all_weights.insert(0, "iteration_num", df_all_weights.index)
        return df_all_weights
    else:
        return weights
In Figure 9, I plot the same mean squared error function as in Figure 8, but show how the gradient descent algorithm arrives at the best regression weights. After each iteration, the mean squared error value decreases because the algorithm descends the function by travelling in the negative gradient direction (see the mean squared error values obtained after the 100$^{th}$ and 300$^{th}$ iterations). After iteration 1100, the algorithm arrives at the best regression weights, $\mathbf{w}_{MSE}$. Therefore, the gradient descent algorithm provides a viable alternative to the closed-form solution for obtaining regression weights.

Note that the Python code block below (lines 682–719) contains the code for producing Figure 9.
np.random.seed(27) #ensure reproducibility
initial_weights = np.random.uniform(low = 0, high = 1, size = 3)
initial_weights = np.insert(arr = initial_weights, obj=0, values=0)
data_gradient_weights = sml.gradient_descent(data=data_emp_loss, initial_weights=initial_weights, num_iterations=1100, learning_rate=0.0001, epsilon=1e-20, expedite_algorithm=True, return_log=True)
#obtain MSE values that correspond to each of the weight values in each gradient descent iteration
data_gradient_weights["mse_value"] = sml.compute_all_mse_loss_values(data = data_emp_loss, w_guess_list = data_gradient_weights["w1"])["mse_value"]
#only extract specific rows for plotting (rows 1, 101, and 301)
data_gradient_weights_plot = data_gradient_weights.iloc[[0,100, 300, data_gradient_weights.shape[0] - 1]]
#add a column for the latex math that will be printed
data_gradient_weights_plot["latex_math"] = ["${}^{}$".format('w', '{(' + str(i) + ')}') for i in data_gradient_weights_plot["iteration_num"]]
data_gradient_weights_plot["latex_math"].iloc[-1] = data_gradient_weights_plot["latex_math"].iloc[-1] + " = $w_{MSE}$"
gradient_plot = (pt.ggplot(data = data_mse_values, mapping = pt.aes(x = "w_guess", y = "mse_value")) +
pt.geom_line(size = 1, color = "#002241") +
pt.geom_point(size = 2, data = data_gradient_weights_plot, mapping = pt.aes(x = "w1", y = "mse_value")) +
pt.geom_text(data=data_gradient_weights_plot,
mapping=pt.aes(x = "w1", y = "mse_value", label = "latex_math"),
nudge_y = [2.5e3, 2e3, 1.5e3, 2e3],
nudge_x = [0.05, 0.05, 0.05, 0.2],
size = 10, color = "#002241") +
pt.scale_x_continuous(limits = [0, 2.5],
breaks = np.insert(arr = np.arange(1, 2.50, 0.5), obj = 0, values = 0.64)) +
pt.labs(x = "Weight Value (w₁)", y = "Mean Squared Error") +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text = pt.element_text(color = "#002241"),
axis_line = pt.element_line(color = "#002241"),
axis_ticks = pt.element_line(color = "#002241"),
axis_ticks_minor_x = pt.element_blank()))
#save plot
gradient_plot.save("images/gradient_plot.png", width = 8, height = 6, dpi = 1000)
Creating Optimization Error by Limiting the Number of Gradient Descent Iterations
Given that gradient descent requires many iterations to arrive at the best possible parameter estimates, the number of iterations can be limited to create optimization error. By limiting the number of gradient descent iterations, the set of obtained estimates will be less optimal than the set of estimates obtained with the closed-form solution and will, therefore, have greater generalization error. The increase in generalization error that results from using a suboptimal optimization algorithm is optimization error, and is represented below in Equation \ref{eq:opt-error}:
$$ \begin{align} \text{Optimization error } \hspace{0.25cm} (\epsilon_{opt}) = L_D(\tilde{f}_s) - L_D(\hat{f}_s), \label{eq:opt-error} \end{align} $$
where $L_D(\tilde{f}_s)$ represents generalization error of the model where parameter estimates were obtained with a suboptimal optimization algorithm and $L_D(\hat{f}_s)$ represents generalization error of the sample risk minimizer. Importantly, I will call $L_D(\tilde{f}_s)$ the optimization risk minimizer.
Figure 10 provides a visual for conceptualizing optimization error. If no closed-form solution exists for computing parameter estimates, then some optimization algorithm must be used. If optimization is imperfect, it will obtain estimates that are less optimal than those obtained with the closed-form solution. These suboptimal estimates introduce additional generalization error called optimization error (see Equation \ref{eq:opt-error}). Importantly, optimization error can be negative or positive because the optimization algorithm can converge on a set of estimates that is suboptimal on the training data, but more optimal than $\hat{f}_s$ on new data.
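To make the computation in Equation \ref{eq:opt-error} concrete, the short sketch below is a minimal illustration rather than the full procedure I use later in this section: it obtains one set of weights with a deliberately limited run of gradient descent, obtains the closed-form weights of the sample risk minimizer, evaluates both on the same holdout data, and takes the difference in generalization error. It assumes that data_emp_loss and data_gen_error (created earlier in this whitepaper) are available in the session and uses the smltheory versions of gradient_descent, compute_weight_min_mse, and the compute_opt_gen_error function shown later in this section.
#minimal illustration of computing optimization error; assumes data_emp_loss and data_gen_error exist
np.random.seed(27)
initial_weights = np.random.uniform(low=0, high=1, size=4)
#suboptimal weights: gradient descent stopped after only 100 iterations
weights_limited = sml.gradient_descent(data=data_emp_loss, initial_weights=initial_weights,
                                       num_iterations=100, learning_rate=0.005)
#closed-form weights of the sample risk minimizer
weights_closed_form = sml.compute_weight_min_mse(data=data_emp_loss)
#generalization error of each set of weights on the same holdout data
gen_error_limited = sml.compute_opt_gen_error(opt_weights=weights_limited, poly_order=2,
                                              data_gen_error=data_gen_error)
gen_error_closed_form = sml.compute_opt_gen_error(opt_weights=weights_closed_form, poly_order=2,
                                                  data_gen_error=data_gen_error)
#optimization error: difference in generalization error (can be positive or negative)
epsilon_opt = gen_error_limited - gen_error_closed_form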

To compute optimization error, $\epsilon_{opt}$, I followed two steps. First, in accordance with model development practices (see Model Development in Practice), I determined the polynomial function with the best generalization error. Second, to obtain parameter estimates that were suboptimal relative to the closed-form estimates obtained for the sample risk minimizer ($\hat{f}_s$), I limited the number of iterations in the gradient descent algorithm. Specifically, gradient descent was limited by setting the maximum number of iterations to 100 ($K = 100$), the learning rate to 0.005 ($\tau = 0.005$), and the error value to $1e-20$ ($\epsilon = 1e-20$; see Algorithm 2 below).
Algorithm 2 Limited Gradient Descent
Require: Function for computing loss function, $\ell_{MSE}$ (see Equation $\ref{eq:mse-matrix}$)
Input: Initial guess for weights $\mathbf{w}^{(0)}$, $K = 100$, $\epsilon = 1e-20$, $\tau = 0.005$
1:procedure gradientDescent($K, \epsilon, \tau$)
2:$k = 0$
3:while $k < K$ and $\lVert \mathbf{w}^{(k)} - \mathbf{w}^{(k+1)} \rVert^2_2 > \epsilon$ do
4:$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \tau \nabla_{\mathbf{w}}\ell_{MSE}(\mathbf{w}^{(k)})$.
5:$k \mathrel{+}= 1$
6:end while
7:end procedure
The Python code block below (lines 720–771) contains the functions needed to compute the generalization error of optimization risk minimizers.
def determine_best_polynomial_model(data_sample, data_gen_error, poly_order_range=range(1, 6)):
    # Use random_state to ensure reproducibility and prevent resampling from adding noise to estimates
    df_all_emp_gen_errors = compute_all_emp_gen_errors(data_emp_loss=data_sample,
                                                       data_gen_error=data_gen_error,
                                                       include_interactions=False,
                                                       poly_order_range=poly_order_range)
    # Identify best polynomial model
    best_poly_order = df_all_emp_gen_errors["poly_order"][df_all_emp_gen_errors["gen_error"].idxmin()]
    return best_poly_order
def compute_opt_gen_error(opt_weights, poly_order, data_gen_error):
    # gather necessary components for matrix-matrix multiplications
    dict_data = extract_feature_outcome_data(data=data_gen_error, poly_order=poly_order)
    features = dict_data["features"]
    outcome = dict_data["outcome"].ravel()
    # compute predictions and generalization error
    predictions = np.dot(features, opt_weights)
    gen_error = mean_squared_error(y_true=outcome, y_pred=predictions)
    return gen_error
def get_opt_risk_min(sample_size, data_best_in_class, data_gen_error, num_iterations=100):
    # Step 1: Determine the polynomial model order that results in the lowest generalization error
    data_sample = data_best_in_class.sample(n=sample_size, random_state=27)
    best_model_poly_order = determine_best_polynomial_model(data_sample=data_sample,
                                                            data_gen_error=data_gen_error)
    # Step 2: Obtain regression weights for the polynomial order model by running gradient descent.
    # I fix the learning rate to 0.005 and, by default, the number of iterations to 100
    np.random.seed(27)
    initial_weights = np.random.uniform(low=0, high=1, size=2 * best_model_poly_order)
    opt_weights = gradient_descent(data=data_sample, initial_weights=initial_weights,
                                   num_iterations=num_iterations, return_log=False, learning_rate=0.005,
                                   poly_order=best_model_poly_order, expedite_algorithm=True)
    # Step 3: Compute the generalization error of the weights obtained with the limited gradient descent
    opt_gen_error = compute_opt_gen_error(opt_weights=opt_weights, poly_order=best_model_poly_order,
                                          data_gen_error=data_gen_error)
    return opt_gen_error
The Python code block below (see lines 772–789) computes the generalization error of optimization risk minimizers for each sample size between 5 and 1000.
import functools
import time

# Fix the arguments except for sample_size and then use map to vectorize over sample size values
get_opt_risk_min_partial = functools.partial(get_opt_risk_min,
                                             data_best_in_class = data_best_in_class,
                                             data_gen_error = data_gen_error)
# Call the partial function with each sample size in opt_sample_sizes
start_time = time.time()
opt_sample_sizes = range(5, 1000)
with concurrent.futures.ThreadPoolExecutor(max_workers = 3) as executor:
    opt_gen_errors = list(executor.map(get_opt_risk_min_partial, opt_sample_sizes))
end_time = time.time()
end_time - start_time #468.67104387283325 seconds
#Create data frame
data_opt_error = pd.DataFrame({"sample_size": np.array(opt_sample_sizes),
                               "opt_gen_error": opt_gen_errors})
Because computing the generalization error of each optimization risk minimizer requires a long processing time, I provide the data set that contains these generalization errors in the smltheory package, which can be obtained using the get_data_opt_error() function (see line 790 below).
data_opt_error = sml.get_data_opt_error()
Figure 11 below shows the generalization error of optimization risk minimizers as a function of sample size. The left-hand plot represents a zoomed version of the right-hand plot between the y-axis values of 0.23 and 0.29. As sample size increases, the generalization error of the optimization risk minimizer, $L_D(\tilde{f}_s)$, tends to decrease. The difference between $L_D(\tilde{f}_s)$ (dark blue line) and the generalization error of the sample risk minimizer ($L_D(\hat{f}_s)$, blue line) is shown by the brace and represents optimization error (see Equation \ref{eq:opt-error}). For reference, I also show the constrained empirical risk minimizer ($f_\mathcal{F}$, medium blue line) and the Bayes decision function ($f^\ast$, light blue line).

The R code block below (see lines 791–860) plots Figure 11 above.
#import data set from Python
data_opt_error <- py$data_opt_error
#create long version of data_est_error where function (bayes, constrained minimizer, or sample minimizer) is a
#categorical variable
data_est_error["opt_gen_error"] <- data_opt_error$opt_gen_error
data_est_error_long <- data_est_error %>%
pivot_longer(cols = "sample_risk_gen_error":"opt_gen_error", names_to = "function_type", values_to = "gen_error",
names_ptypes = factor())
data_est_error_long$function_type <- factor(x = data_est_error_long$function_type,
levels = c("opt_gen_error", "sample_risk_gen_error", "constr_min",
"bayes_risk"))
labels <- c(TeX('Optimization risk minimizer, $\\tilde{f}_s$'),
TeX('Sample risk minimizer, $\\hat{f}_s$'),
TeX('Constrained risk minimizer, $f_\u2131$'),
TeX('Bayes decision function, $f^*$'))
line_color <- setNames(c("#002241", "blue", "#2171B5", "#9ECAE1"), levels(data_est_error_long$function_type))
#create data set for braces that show estimation and approximation error.
##see https://stackoverflow.com/questions/45221783/ggforce-facet-zoom-labels-only-on-zoomed-example
##negative optimization error
brace_x_start <- which.min(data_est_error$opt_gen_error[400:500])+400 #find minimum value between indices 400 and 500
brace_x_end <- brace_x_start + 50
data_brace_opt_error <- data.frame("x" = c(brace_x_start, brace_x_end),
"y" = c(data_est_error$opt_gen_error[brace_x_start],
data_est_error$sample_risk_gen_error[brace_x_end]),
"zoom" = T)
label_opt_error <- "Optimization error, $\\epsilon_{opt}$"
opt_error_plot <- ggplot(data = data_est_error_long, mapping = aes(x = sample_size, y = gen_error,
group = function_type,
color = function_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(name = "Function Type", values = line_color, labels = labels) +
labs(x = "Sample Size for Obtaining Sample and Optimization Risk Minimizers",
y = "Generalization Error (Mean Squared Error)",
color = "Function Type") +
facet_zoom(ylim = c(0.23, .27), zoom.size = 1.2, zoom.data = zoom) +
#optimization error brace for negative optimization error
geom_brace(inherit.aes = F, data = data_brace_opt_error,
mapping = aes(x = x, y = y, label=TeX(label_opt_error, output="character")),
color = '#002241', labelsize = 6, rotate = 90, labeldistance = 5,
parse=T) +
theme_classic(base_family = "Helvetica") +
theme(legend.text.align = 0, #left align text
legend.position = 'bottom',
legend.direction = 'vertical',
legend.justification = 'left',
legend.margin = margin(t = 1, unit = "cm"),
legend.text = element_text(size = 16),
legend.title = element_text(size = 18),
axis.title = element_text(size = 18),
axis.text = element_text(size = 15, color = "#002241"),
text = element_text(color = "#002241"),
axis.line = element_line(color = "#002241"),
axis.ticks = element_line(color = "#002241"))
ggsave(plot = opt_error_plot, filename = 'images/opt_error_plot2.png', width = 12, height = 9.5)
Summary of Excess Risk Decomposition
Excess risk decomposition provides a general framework for understanding the ‘game’ of machine learning. In this framework, three types of error exist. First, approximation error represents the difference in generalization error between the best possible function (Bayes decision function) and the best function in a constrained set of functions (constrained empirical risk minimizer). Thus, approximation error results from constraining empirical risk minimization, and increases to the extent that the constraint increases (see Figure 5).
Second, estimation error represents the difference in generalization error between the constrained empirical risk minimizer and the best function obtained on a sample (sample risk minimizer). With data often being limited in practice, estimation error increases as sample size decreases (see Figure 7).
Third, optimization error represents the difference in generalization error between the sample risk minimizer and the best function obtained with an optimization algorithm (i.e., the optimization risk minimizer). To the extent that optimization is suboptimal, optimization error will be nonzero. Although optimization error tends to be some positive nonzero value (see Figure 11), it is possible for optimization error to be negative.
The ‘Players’ of Supervised Machine Learning: Bias, Variance, and Noise
In supervised machine learning, bias, variance, and noise can be conceptualized as the players of the game because they can be derived from the mean squared error loss function. In this section, I will show how mean squared error can be decomposed into bias, variance, and noise. I will then use the setup provided by excess risk decomposition to provide a deeper understanding of bias, variance, and noise. For a great lecture on this topic, see Lecture 19 - Bias Variance Decomposition.
Decomposing Mean Squared Error Into Bias, Variance, and Noise
Before decomposing mean squared error, it is important to become familiar with the following two components:
- Expected classifier: the average function (i.e., the average of each parameter estimate) obtained with an infinite sampling of data sets of sample size $n$ from the joint distribution, $s \sim \mathcal{D}^n$. In the winemaking example, the expected classifier could be obtained if the tourist repeatedly sampled wines (making sure to record the weather, winemaking quality, and wine quality), modelled each data set with the same model, and then computed average values for each parameter (a small simulation sketch follows this list). Mathematically, the average function can be obtained by computing the integral of $f_s$ over all possible samples $s$ (see Equation \ref{eq:exp-class}) such that
\[ \begin{align} \bar{f} &= \mathbb{E}_{s\sim \mathcal{D}^n}[f_s] = \underset{s}{\int} f_s P(s) \partial s. \label{eq:exp-class} \end{align} \]
- Expected label: the average value, $\bar{y}$, for a given $\mathbf{x}$. In the winemaking example, the expected label represents the average wine quality value for a given pair of weather and winemaking quality values. Because computing the expected label requires knowledge of the underlying distribution, it can only be obtained with the Bayes decision function, and so can be represented mathematically in several ways as shown below in Equation \ref{eq:exp-label}:
$$ \begin{align} \bar{y} &= f^\ast(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}} [y] = \underset{y}{\int} y P(y) \partial y \label{eq:exp-label} \end{align} $$
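To make the expected classifier more concrete (as promised in the first bullet above), the short simulation below is an illustrative sketch that is not part of the smltheory package: it approximates $\bar{f}$ for a simple one-feature linear model by repeatedly drawing samples of size $n$ from a known joint distribution, fitting the model on each sample, and averaging the estimated weight.
import numpy as np

rng = np.random.default_rng(27)
n, num_samples = 100, 2000
estimated_weights = np.empty(num_samples)
for i in range(num_samples):
    #draw a sample s of size n from a simple joint distribution whose true weight is 2
    x = rng.normal(size=n)
    y = 2 * x + rng.normal(scale=1.0, size=n)
    #fit f_s on this sample (ordinary least squares without an intercept)
    estimated_weights[i] = np.sum(x * y) / np.sum(x ** 2)
#the expected classifier is approximated by averaging the fitted weight across samples
w_bar = estimated_weights.mean()
print(round(w_bar, 3)) #close to the true weight of 2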
Now I will decompose mean squared error into bias, variance, and noise. In decomposing mean squared error, we are interested in how well a function performs on average for a given feature-outcome pair with unlimited sampling. That is, we want to compute the expected generalization error: if we were to sample some data set $s$ of sample size $n$ from the joint distribution $\mathcal{D}$, how well would the resulting model perform on some new feature-outcome pair $(\mathbf{x}, y)$ drawn from $\mathcal{D}$? Mathematically, we want to know
$$ \begin{align} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ s \sim \mathcal{D}^n} \ \Big[(f_s(\mathbf{x}) - y)^2\Big] &= \underset{s\mathbf{x}y}{\iiint}\Big(f_s(\mathbf{x}) - y\Big)^2 P(\mathbf{x}, y)P(s) \ \partial \mathbf{x} \partial y \partial s. \label{eq:exp-gen-error} \end{align} $$
To decompose the expected generalization error shown in Equation \ref{eq:exp-gen-error} above, I will simply add and subtract the expected classifier, $\bar{f}(\mathbf{x})$, and simplify the resulting expression.
$$ \begin{spreadlines}{0.5em} \begin{align} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ s \sim \mathcal{D}^n} \ \Big[(f_s(\mathbf{x}) - y)^2\Big] &= \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ \hspace{-0.3cm} s \sim \mathcal{D}^n} \ \bigg[\Big(\underbrace{f_s(\mathbf{x}) - \bar{f}(\mathbf{x})}_{a} + \underbrace{\bar{f}(\mathbf{x}) - y)}_{b}\Big)^2\bigg] \nonumber \\ &= \mathbb{E}_{\mathbf{x} \sim \mathcal{D} \\ \hspace{-0.6cm}{s \sim \mathcal{D}^n}} \bigg[\Big(f_s(\mathbf{x})-\bar{f}(\mathbf{x})\Big)^2\bigg] + \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \bigg[ \Big(\bar{f}(\mathbf{x}) - y\Big)^2 \bigg] + 2 \ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ \hspace{-0.3cm} s \sim \mathcal{D}^n}\Bigg[\bigg(\Big(f_s(\mathbf{x}) − \bar{f}(\mathbf{x})\Big)\Big(\bar{f}(\mathbf{x})−y) \bigg) \Bigg] \label{eq:exp-gen-exp} \end{align} \end{spreadlines} $$
A closer inspection of the last term in Equation \ref{eq:exp-gen-exp} below shows that it is zero.
$$ \begin{spreadlines}{0.5em} \begin{align*} 2\ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ s \sim \mathcal{D}^n}\Bigg[\bigg(\Big(f_s(\mathbf{x}) − \bar{f}(\mathbf{x})\Big)\Big(\bar{f}(\mathbf{x})−y) \bigg) \Bigg] &= 2\ \mathbb{E}_{(\mathbf{x}, y)}\Bigg[ \mathbb{E}_{ s \sim \mathcal{D}^n} \bigg[\Big(\big(f_s(\mathbf{x}) − \bar{f}(\mathbf{x})\big)\big(\bar{f}(\mathbf{x}) - y\big) \Big) \bigg] \Bigg] \\ &= 2\ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\Bigg[ \bigg(\Big(\underbrace{\bar{f}(\mathbf{x}) - \bar{f}(\mathbf{x})\Big)}_{=0}\Big(\bar{f}(\mathbf{x}) - y\Big)\bigg)\Bigg] \\ &= 0 \end{align*} \end{spreadlines} $$
Thus, Equation \ref{eq:exp-gen-exp} becomes
$$ \begin{spreadlines}{0.5em} \begin{align} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ s \sim \mathcal{D}^n} \ \Big[(f_s(\mathbf{x}) - y)^2\Big] &= \underbrace{\mathbb{E}_{\mathbf{x} \sim \mathcal{D} \\ s \sim \mathcal{D}^n} \bigg[\Big(f_s(\mathbf{x})-\bar{f}(\mathbf{x})\Big)^2\bigg]}_{\text{variance}} + \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \bigg[ \Big(\bar{f}(\mathbf{x}) - y\Big)^2 \bigg], \label{eq:var-decomp} \end{align} \end{spreadlines} $$
with the first term representing variance: the expected squared difference between the prediction of a model trained on a single data set, $f_s(\mathbf{x})$, and the prediction of the expected classifier, $\bar{f}(\mathbf{x})$. Now I will decompose the second term in Equation \ref{eq:var-decomp} above into bias (squared) and noise by adding and subtracting the Bayes decision function, $f^\ast(\mathbf{x})$.
$$ \begin{spreadlines}{0.5em} \begin{align} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \bigg[ \Big(\bar{f}(\mathbf{x}) - y\Big)^2 \bigg] &= \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Bigg[\bigg( \underbrace{\Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big)}_{a} + \underbrace{\Big(f^\ast(\mathbf{x}) - y\Big)}_{b}\bigg)^2\Bigg] \nonumber \\ &= \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Bigg[ \underbrace{\Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big)^2}_{\text{Bias}^2} \Bigg] + \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Bigg[ \underbrace{\Big(f^\ast(\mathbf{x}) - y\Big)^2}_{\text{Noise}}\Bigg] + 2 \ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Bigg[ \Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big) \Big(f^\ast(\mathbf{x}) - y\Big) \Bigg] \label{eq:bias-noise} \end{align} \end{spreadlines} $$
Importantly, Equation \ref{eq:bias-noise} contains bias (squared) and noise. Using the math in Equation \ref{eq:bias-noise}, the following definitions can be deduced for bias and noise:
- Bias: the difference between the prediction of the average function and the Bayes decision function (i.e., best possible function), $\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})$.
- Noise: the squared difference between the Bayes decision function and the actual outcome value, $f^\ast(\mathbf{x}) - y$.
The last term in Equation \ref{eq:bias-noise} is also zero, which can be shown by expanding the expectation (see below).
$$ \begin{spreadlines}{0.5em} \begin{align*} 2 \ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Bigg[ \Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big) \Big(f^\ast(\mathbf{x}) - y\Big) \Bigg] &= 2\ \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Bigg[ \mathbb{E}_{y|\mathbf{x} } \bigg[ \Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big) \Big(f^\ast(\mathbf{x}) - y\Big) \bigg]\Bigg] \\ &= 2\ \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \bigg[ \Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big) \underbrace{\Big(f^\ast(\mathbf{x}) - f^\ast(\mathbf{x})\Big)}_{=0}\bigg] \\ &= 0 \end{align*} \end{spreadlines} $$
Therefore, the initial expression for the expected mean squared error presented in Equation \ref{eq:exp-gen-error} decomposes into the sum of variance, bias (squared), and noise, as shown below in Equation \ref{eq:exp-decomp}:
$$ \begin{spreadlines}{0.5em} \begin{align} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D} \\ s \sim \mathcal{D}^n} \ \Big[(f_s(\mathbf{x}) - y)^2\Big] &= \underbrace{\mathbb{E}_{\mathbf{x} \sim \mathcal{D} \\ s \sim \mathcal{D}^n} \bigg[\Big(f_s(\mathbf{x})-\bar{f}(\mathbf{x})\Big)^2\bigg]}_{\text{Variance}} + \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Bigg[ \underbrace{\Big(\bar{f}(\mathbf{x}) - f^\ast(\mathbf{x})\Big)^2}_{\text{Bias}^2} \Bigg] + \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Bigg[ \underbrace{\Big(f^\ast(\mathbf{x}) - y\Big)^2}_{\text{Noise}}\Bigg]. \label{eq:exp-decomp} \end{align} \end{spreadlines} $$
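Equation \ref{eq:exp-decomp} can also be verified numerically. The simulation sketch below is illustrative only: it uses a simple one-dimensional problem (a sine-shaped Bayes decision function with additive noise, fit by an underfitting first-order polynomial) rather than the wine data, estimates each term by Monte Carlo, and checks that variance, bias squared, and noise approximately sum to the expected generalization error.
import numpy as np

rng = np.random.default_rng(27)
n, num_samples, sigma = 30, 2000, 0.3

def f_star(x):
    #the Bayes decision function of this toy problem (known by construction)
    return np.sin(x)

x_test = np.linspace(-3, 3, 50) #fixed grid of test inputs
preds = np.empty((num_samples, x_test.size))
y_test = np.empty_like(preds)
for t in range(num_samples):
    #draw a training sample s of size n and fit an (underfitting) first-order polynomial
    x_train = rng.uniform(-3, 3, size=n)
    y_train = f_star(x_train) + rng.normal(scale=sigma, size=n)
    coefs = np.polyfit(x_train, y_train, deg=1)
    preds[t] = np.polyval(coefs, x_test)
    #draw new outcome values for the test inputs
    y_test[t] = f_star(x_test) + rng.normal(scale=sigma, size=x_test.size)

f_bar = preds.mean(axis=0) #expected classifier evaluated on the grid
variance = np.mean((preds - f_bar) ** 2)
bias_sq = np.mean((f_bar - f_star(x_test)) ** 2)
noise = np.mean((y_test - f_star(x_test)) ** 2) #approximately sigma**2
expected_mse = np.mean((preds - y_test) ** 2)
print(round(expected_mse, 3), round(variance + bias_sq + noise, 3)) #approximately equal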
Situating Bias, Variance, and Noise in the Excess Risk Decomposition Setup
With the mean squared error function decomposing into the sum of bias (squared), variance, and noise (see Equation \ref{eq:exp-decomp}), I will now present this decomposition in the excess risk decomposition setup. In Figure 12 below, the set of all functions is contained in the large rectangle and the constrained set of functions, $\mathcal{F}$, is contained in the smaller circle. As a reminder, a constrained set of functions, $\mathcal{F}$, must be considered to prevent overfitting and to reduce the computational burden of finding a model with low generalization error (for a review, see the section on Constrained Empirical Risk Minimization).
To present bias, variance, and noise within the excess risk decomposition setup, three points must be made. First, Bayes risk (panel A) corresponds to the same error defined by noise (panel B). With both the Bayes risk and noise, the error being defined corresponds to the generalization error of the Bayes decision function (i.e., the difference between the best possible prediction, $f^\ast(\mathbf{x})$, and the actual obtained value, $y$).
Second, the gap between approximation error (panel A) and bias squared (panel B) decreases as sample size increases (see medium blue text in Figure 12). As shown in panel A, approximation error represents the difference in generalization error between the Bayes decision function, $f^\ast$, and the constrained empirical risk minimizer, $f_\mathcal{F}$ (i.e., the best function in the constrained set $\mathcal{F}$). As shown in panel B, bias (squared) represents the difference in generalization error between the Bayes decision function, $f^\ast$, and the average sample risk minimizer, $\bar{f}$. With limited sample size, $\bar{f}$ will always have greater generalization error than $f_\mathcal{F}$. Thus, bias (squared) will always be larger than approximation error with limited sample size. As sample size approaches infinity, however, $\bar{f}$ will become $f_\mathcal{F}$. Thus, bias (squared) will become equivalent to approximation error with an infinite sample size.
Third, and similar to bias, the gap between the sum of estimation and optimization error (panel A) and variance (panel B) decreases as sample size increases (see medium blue text in Figure 12). As shown in panel A, estimation error represents the difference in generalization error between the constrained empirical risk minimizer, $f_\mathcal{F}$, and the sample risk minimizer, $\hat{f}_s$. As shown in panel B, variance represents the difference between the average sample risk minimizer, $\bar{f}$, and any given individual sample risk minimizer, $\hat{f}_s$. With limited sample size, $\bar{f}$ will always have greater generalization error than $f_\mathcal{F}$. Thus, variance will always be smaller than estimation error with limited sample size. As sample size approaches infinity, however, $\bar{f}$ will become $f_\mathcal{F}$. Thus, variance will become equivalent to estimation error with an infinite sample size.

How the ‘Players’ of Machine Learning Behave: The Bias-Variance Tradeoff
Up until this point, I have provided a setup for the game of supervised machine learning and introduced the players of the game. I used the framework of excess risk decomposition to provide a setup for the game of supervised machine learning and decomposed mean squared error loss into bias (squared), variance, and noise, which I introduced as the ‘players’ of the game. In this section, I will describe how bias, variance, and noise behave.
To understand the behaviour of bias, variance, and noise, I will explain two points that, together, make up the bias-variance tradeoff: The opposing behaviours of bias and variance as model complexity increases (see Figure 13). For the purpose of this conversation, I will define model complexity as the number of parameters in a model. Thus, a sixth-order polynomial model would be considered more complex than a first-order polynomial model.
First, bias (squared) decreases as model complexity increases. In Figure 13, the blue line shows the behaviour of bias (squared) as model complexity increases. When an overly simple model such as a first-order polynomial model is used, it will lead to underfitting: When a model does not capture important features that predict the outcome. In the wine example, wine quality was set as the outcome of a second-order polynomial model of weather and winemaking quality (along with an interaction term; see Equation \ref{eq:outcome-generate}). With a first-order polynomial model, the effect of the second-order predictors and the interaction will not be captured, and so, on average, the model’s predictions will err relative to the predictions of the population-generating model (i.e., the Bayes decision function). As model complexity increases, however, the effects of features previously uncaptured by simple models will be modelled. As a result of increasing complexity, predictions will, on average, converge to those of the Bayes decision function.
Second, and opposite to bias, variance increases as model complexity increases. In Figure 13, the medium blue line shows the behaviour of variance as model complexity increases. When an overly complex model such as a sixth-order polynomial model is used, the excessive number of parameters will model sample-specific patterns that, in reality, have no effect on the outcome variable. In other words, an overly complex model will overfit the data. In the wine example, wine quality was only set to be the outcome of a second-order polynomial model of weather and winemaking quality (along with an interaction term; see Equation \ref{eq:outcome-generate}). With a sixth-order polynomial model, weight values for features raised beyond the second order will be influenced by sample specific patterns. Given that sampling error causes considerable variability in sample-specific patterns, then the predictions of a sixth-order polynomial model on any given sample will, on average, differ considerably from the predictions of the average sixth-order polynomial model. As model complexity decreases, however, fewer parameters will exist, and so sample-specific patterns will be less likely to affect weights and, therefore, variance will decrease.
With bias and variance having opposite relations with model complexity, a tradeoff emerges. That is, an optimization problem emerges whereby variance tends to increase as bias decreases. Returning to the wine example, if the tourist focuses solely on reducing bias, then variance will increase exponentially, causing the generalization error (i.e., the dark blue line of mean squared error in Figure 13) to be considerably high. Conversely, if the tourist focuses solely on reducing variance, then bias will increase exponentially and will likewise cause the generalization error to be considerably high. Thus, to obtain a model with desirable generalization error, an optimum must be identified whereby bias is decreased without increasing variance too much.
One last point must be mentioned with respect to noise. Noise remains constant as model complexity increases. In Figure 13, the light blue line shows the behaviour of noise as model complexity increases. Noise represents the generalization error of the Bayes decision function; that is, the difference between the population model and the obtained outcome values. Because there is only one Bayes decision function and one population model, they are not affected by the polynomial model used to analyze the data. Thus, noise is constant across model complexity. If noise is too high, it could suggest that data cleaning is needed and/or that too many errors were incurred during data collection.
For the interested reader, I will provide the code for generating Figure 13 in a future demonstration post.
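In the meantime, the minimal sketch below (illustrative only; it uses scikit-learn on a toy one-feature problem rather than the wine data) shows the pattern that Figure 13 formalizes: training error keeps falling as polynomial order grows, whereas error on new data is lowest near the true order and typically worsens for overly complex models.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(27)
#toy data: quadratic signal plus noise, with a training sample and a larger sample of new data
x_train = rng.uniform(-2, 2, size=40).reshape(-1, 1)
y_train = 1 + 2 * x_train.ravel() - 0.5 * x_train.ravel() ** 2 + rng.normal(scale=0.4, size=40)
x_new = rng.uniform(-2, 2, size=400).reshape(-1, 1)
y_new = 1 + 2 * x_new.ravel() - 0.5 * x_new.ravel() ** 2 + rng.normal(scale=0.4, size=400)

for order in [1, 2, 6, 12]:
    model = make_pipeline(PolynomialFeatures(degree=order), LinearRegression())
    model.fit(x_train, y_train)
    train_error = mean_squared_error(y_true=y_train, y_pred=model.predict(x_train))
    new_data_error = mean_squared_error(y_true=y_new, y_pred=model.predict(x_new))
    print(order, round(train_error, 3), round(new_data_error, 3))
#training error decreases as the polynomial order increases, whereas error on new data
#is lowest near the true order (2) and typically worsens for high-order models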

Using the Bias-Variance Tradeoff to Develop Rules of Supervised Machine Learning
With an understanding of the behaviour of bias, variance, and noise, it is now possible to deduce rules of supervised machine learning. In Figure 14, I show two general patterns of empirical loss (light blue line) and generalization error (dark blue line) as a function of sample size. Before explaining each panel in Figure 14, three points need to be mentioned. First, empirical loss is always lower than generalization error because models are effectively always influenced by sample-specific features in the training data (for an example, see Figure 3). Second, to represent the common practice of designing models to meet some benchmark level of error, I used the dashed horizontal line in each panel to specify some target amount of generalization error, $\epsilon$. Third, text on each plot describes the solution for each problem.

Beginning with panel A in Figure 14, the pattern of generalization error and empirical loss is caused by high bias (or underfitting). In the case where bias is high, the empirical loss and generalization error are both above the target level of error. Because the model cannot even achieve the target error level, $\epsilon$, on the training data, this is an indication that important patterns are not being modelled. In other words, the model under consideration is underfitting the training data. If a model has high bias, adding more data will not help because the model lacks the parameters to model any additional patterns that may be discovered from the additional data. To resolve the problem of high bias, the model’s complexity must be increased, which can be accomplished by decreasing regularization, using boosting or kernelization, or adding features.
Ending with panel B in Figure 14, the pattern of generalization error and empirical loss is caused by high variance (or overfitting). In the case where variance is high, the empirical loss is below the target level of error, $\epsilon$, but the generalization error is not. Because only the empirical loss is below the target error level, this is an indication that the model is fitting sample-specific patterns. In other words, the model under consideration is overfitting the training data. To resolve the problem of high variance, two solutions exist. First, more data can be added. By adding more data, the model is exposed to a broader range of patterns, and so may be better able to discern noise from effect, thus reducing the effects of outliers. Second, model complexity can be decreased, which can be accomplished by increasing regularization and boosting.
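The decision logic in the two panels of Figure 14 can be summarized in a few lines of code. The helper below is a hypothetical illustration (it is not part of the smltheory package, and the function name is mine): it compares the empirical loss and generalization error against the target error $\epsilon$ and returns the corresponding diagnosis and remedies described above.
def diagnose_error_pattern(emp_loss, gen_error, target_error):
    #hypothetical helper that applies the rules summarized in Figure 14
    if emp_loss > target_error:
        #even the training data cannot be fit to the target level: high bias (underfitting)
        return "High bias: increase model complexity (add features, boost, kernelize, decrease regularization)."
    if gen_error > target_error:
        #the training data are fit well but new data are not: high variance (overfitting)
        return "High variance: add more data or decrease model complexity (e.g., increase regularization)."
    return "Both errors meet the target; no change needed."

print(diagnose_error_pattern(emp_loss=0.45, gen_error=0.70, target_error=0.30)) #panel A pattern
print(diagnose_error_pattern(emp_loss=0.15, gen_error=0.70, target_error=0.30)) #panel B pattern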
For the interested reader, I have provided the Python and R code below (see lines 860–924 and lines 925–999, respectively). For a deeper explanation of the function I used to generate the curves for empirical loss and generalization error in Figure 14, see Section 4.1.1.1.3 of my doctoral dissertation.
#create data sets for the patterns of empirical loss and generalization error that occur with
#high variance and high bias
##high variance
training_sample_size = np.linspace(start = 1, stop = 1000, num = 1000)
#plateau point
M = 0.5
#satiation points and values (point at which curve reaches satiation value)
satiation_value_gen_error_high_var = 0.4
satiation_point_gen_error_high_var = 250
satiation_value_emp_loss_high_var = 0.15
satiation_point_emp_loss_high_var = 250
##high bias
satiation_value_gen_error_high_bias = 0.4
satiation_point_gen_error_high_bias = 250
satiation_value_emp_loss_high_bias = 0.37
satiation_point_emp_loss_high_bias = 175
#growth rate parameter value such that the curve reaches the satiation value by the satiation point
a_gen_error_high_var = np.log(1 - satiation_value_gen_error_high_var)/-satiation_point_gen_error_high_var
a_emp_loss_high_var = np.log(1 - satiation_value_emp_loss_high_var)/-satiation_point_emp_loss_high_var
a_gen_error_high_bias = np.log(1 - satiation_value_gen_error_high_bias)/-satiation_point_gen_error_high_bias
a_emp_loss_high_bias = np.log(1 - satiation_value_emp_loss_high_bias)/-satiation_point_emp_loss_high_bias
##high variance
curve_gen_error_high_var = -M*(1 - np.exp(-a_gen_error_high_var*training_sample_size)) + 1
curve_emp_loss_high_var = M*(1 - np.exp(-a_emp_loss_high_var*training_sample_size))
##high bias
curve_gen_error_high_bias = -M*(1 - np.exp(-a_gen_error_high_bias*training_sample_size)) + 1
curve_emp_loss_high_bias = M*(1 - np.exp(-a_emp_loss_high_bias*training_sample_size))
data_emp_gen_error_curve = pd.DataFrame({'training_sample_size': training_sample_size,
'gen_error_high_var': curve_gen_error_high_var,
'emp_loss_high_var': curve_emp_loss_high_var,
'gen_error_high_bias': curve_gen_error_high_bias,
'emp_loss_high_bias': curve_emp_loss_high_bias})
cols_to_expand = data_emp_gen_error_curve.columns[1:5]
data_emp_gen_error_curve_long = data_emp_gen_error_curve.melt(id_vars = 'training_sample_size',
value_vars = cols_to_expand,
var_name = "error_type",
value_name = 'error_value')
# Define the regular expression pattern to match the second underscore and everything after it
pattern = r'^([^_]+_[^_]+)_(.*)$'
# Split each element using regular expressions
split_cols = data_emp_gen_error_curve_long['error_type'].str.extract(pattern)
#Create two new columns
data_emp_gen_error_curve_long["error_type"] = split_cols[0]
data_emp_gen_error_curve_long["problem"] = split_cols[1]
#Relevel error_type column
error_type_releveled = ["gen_error", "emp_loss"]
data_emp_gen_error_curve_long["error_type"] = data_emp_gen_error_curve_long['error_type'].astype('category').cat.reorder_categories(error_type_releveled)
#import Python data set
data_emp_gen_error_curve_long <- py$data_emp_gen_error_curve_long
data_emp_gen_error_curve_long$problem <- factor(
x = data_emp_gen_error_curve_long$problem,
labels = c("bold(A:~Error~Patterns~Caused~by~High~Bias~(Underfitting))",
"bold(B:~Error~Patterns~Caused~by~High~Variance~(Overfitting))"))
#text dataframe
data_text_high_bias = data.frame(
"label" = c(TeX(input = "$\\textbf{High\\phantom{.}Bias\\phantom{.}(Underfitting)}$",
output = "character"),
TeX(input = "$\\uparrow$features", output = "character"),
TeX(input = "$\\uparrow$complexity", output = "character"),
TeX(input = "$\\downarrow$regularization", output = "character")),
"x" = c(205, 67, 90, 115),
"y" = c(0.63, .56, .49, .42),
"problem" = factor("bold(A:~Error~Patterns~Caused~by~High~Bias~(Underfitting))"))
data_text_high_var = data.frame(
"label" = c(TeX(input = "$\\textbf{High\\phantom{.}Variance\\phantom{.}(Overfitting)}$",
output = "character"),
TeX(input = "$\\uparrow$data", output = "character"),
TeX(input = "$\\downarrow$complexity", output = "character"),
TeX(input = "$\\uparrow$regularization", output = "character")),
"x" = c(230, 37, 90, 115),
"y" = c(0.63, .56, .49, .42),
"problem" = factor("bold(B:~Error~Patterns~Caused~by~High~Variance~(Overfitting))"))
data_text_merged <- rbind(data_text_high_var, data_text_high_bias)
#colors
color_palette <-c('gen_error' = '#002241',
'emp_loss' ='#9ECAE1')
#plot
plot_bias_var_rules <- ggplot(data = data_emp_gen_error_curve_long,
mapping = aes(x = training_sample_size, y = error_value,
group = error_type, color = error_type)) +
#data aesthetics
geom_line(linewidth = 1) +
geom_hline(yintercept = 0.3, color = "#002241", linetype = "dashed") +
geom_text(inherit.aes = F, data = data_text_merged,
mapping = aes(x = x, y = y, label = label),
size = 6, color = "#002241", parse = T) +
#panel aesthetics
facet_wrap(facets = ~ problem, nrow = 2, ncol = 1, labeller = label_parsed, scales = 'free') +
#scale aesthetics
scale_y_continuous(name = 'Mean Squared Error', breaks = 0.3, labels = expression(epsilon)) +
scale_x_continuous(name = "Sample Size", labels = NULL) +
scale_color_manual(name = "Function Type",
values = color_palette,
labels = c(emp_loss = 'Empirical Loss', gen_error = 'Generalization Error')) +
#custom styling
theme_classic(base_family = 'Helvetica', base_size = 14) +
theme(legend.text = element_text(size = 14),
legend.title = element_text(size = 15),
axis.title = element_text(size = 15),
axis.text = element_text(size = 14, color = "#002241"),
text = element_text(color = "#002241"),
axis.line = element_line(color = "#002241"),
axis.ticks = element_blank(),
axis.text.y = element_text(size = 20),
strip.text = element_text(size = 17, color = "#002241", hjust = 0),
strip.background = element_blank(),
panel.spacing.y = unit(x = 3, units = 'cm'))
#save plot
ggsave(filename = "images/plot_bias_var_rules.png", plot = plot_bias_var_rules, width = 10, height = 12, dpi = 1000)
Boundary Conditions on the Rules of Machine Learning
Although the previous sections showed a tradeoff between bias and variance and showed that rules could be deduced from this tradeoff, these rules are, unfortunately, not universal. In perusing recent papers, I found that (at least) two boundary conditions exist for the rules of supervised machine learning presented in Figure 14.
First, the bias-variance tradeoff disappears with exceedingly complex functions (for great introductions, see Belkin et al., 2019; Neal, 2019; Dar et al., 2021). I have reprinted Figure 15 from Belkin et al. (2019) to show a more accurate depiction of the relation between generalization error and model complexity known as the double descent curve. Before explaining the double descent curve in Figure 15, three points must be made with respect to terminology. First, and beginning with the x-axis, the capacity of $\mathcal{H}$ can be conceptualized as the complexity of a function $f$ for the purpose of this discussion. Second, test risk is synonymous with generalization error and training risk is synonymous with empirical loss (whether the loss be on a training, validation, or test set). Third, and last, the interpolation threshold represents the point at which the complexity of the function is sufficient for it to achieve zero loss on the training data. After the interpolation threshold, functions can be said to be over-parameterized and, before the threshold, under-parameterized. Returning to Figure 15, the test risk curve follows a U-shaped curve as in the traditional bias-variance explanation (or “classical” regime; see Figure 13), but then decreases again as the capacity of $\mathcal{H}$ (i.e., model complexity) increases past the interpolation threshold. Thus, the bias-variance tradeoff ceases to be a tradeoff after the point at which a function has a sufficient number of parameters to achieve zero error on the training data. Examples of such violations of the bias-variance tradeoff have been observed with neural networks and decision trees (see Belkin et al., 2019).

Second, the rules of supervised machine learning presented in the previous section appear to depend on the specific loss function, as bias-variance tradeoffs and bias-variance-noise decompositions do not always occur with other loss functions. For example, although a bias-variance decomposition exists for mean absolute error, there appears to be no bias-variance tradeoff with this loss function (Gao, 2021).5 With respect to margin loss functions (e.g., logistic, hinge, exponential), a bias-variance decomposition only occurs with strictly convex gradient-symmetric loss functions (Wood et al., 2022). As one final example, James (2003) examined definitions of variance and bias for general loss functions.
Conclusion
In this paper, supervised machine learning was conceptualized as a game with the goal of maximizing predictive ability. To understand the game being played in supervised machine learning, I provided a setup, the players, and the rules of the game. Excess risk decomposition was used to provide a setup for the game. The players of bias (squared), variance, and noise were obtained by decomposing the mean squared error loss function. Finally, two rules of supervised machine learning were deduced using the bias variance tradeoff. I then ended the paper with a discussion on boundary conditions for the rules of supervised machine learning.
References
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854. https://doi.org/10.1073/pnas.1903070116
Dar, Y., Muthukumar, V., & Baraniuk, R. (2021). A farewell to the bias-variance tradeoff? An overview of the theory of overparameterized machine learning. https://doi.org/10.48550/ARXIV.2109.02355
Gao, J. (2021). Bias-variance decomposition of absolute errors for diagnosing regression models of continuous data. Patterns, 2(8), 100309. https://doi.org/10.1016/j.patter.2021.100309
James, G. (2003). Variance and bias for general loss functions. Machine Learning, 51(2), 115–135. https://doi.org/10.1023/a:1022899518027
Neal, B. (2019). On the bias-variance tradeoff: Textbooks need an update. https://doi.org/10.48550/ARXIV.1912.08286
Robeson, S., & Willmott, C. (2023). Decomposition of the mean absolute error (MAE) into systematic and unsystematic components. PLOS ONE, 18(2), e0279774. https://doi.org/10.1371/journal.pone.0279774
Wood, D., Mu, T., & Brown, G. (2022). Bias-variance decompositions for margin losses. arXiv. https://doi.org/10.48550/ARXIV.2204.12155
Although measurement error is unlikely with directly observable variables (or manifest variables) such as altitude, measurement error is likely to be incurred when measuring a variable such as winemaking quality because it is not directly observable; it is a latent variable that must be approximated using indicators (e.g., cleanliness of winemaking procedure, treatment of grapes, winemaker’s motives, etc.). ↩︎
Although counterintuitive, the highest value for $\mathbf{x}_2$ was used because of the negative-value weight applied on the square of the second feature, $\mathbf{x}^2_2$ of -0.1. ↩︎
The astute reader might notice that I eschewed the method of empirical risk minimization in selecting the sample risk minimizer, $\hat{f}_s$. That is, instead of picking the polynomial function that resulted in the lowest empirical loss, I picked the polynomial function with the lowest generalization error. Two reasons exist for why I used the function with the lowest generalization error as the sample risk minimizer. First, because I used an independent data set for computing generalization error that was much larger than the data set for computing empirical loss (i.e., data_gen_error), using it to identify the function with the lowest generalization error provided a stronger demonstration that the generalization error of the sample risk minimizer is always greater than or equal to that of the constrained empirical risk minimizer, $f_{\mathcal{F}}$. Second, using the function with the lowest generalization error aligns more closely with how models are selected in practice (see section on Model Selection in Practice). ↩︎
Note that, for the formula of $\mathbf{w}_{MSE}$ in Equation \ref{eq:weight-mse-min} to be computable, the columns of $\mathbf{X}$ must be linearly independent. Linear independence guarantees that $(\mathbf{X}^\top\mathbf{X})^{-1}$ exists and that $\mathbf{w}_{MSE}$ is the best solution. For a more detailed explanation, see Lecture 5 - Gradient Descent and Least Squares. Note that, although the lecture uses the sum of squares as the loss function, the conclusions obtained in this paper are identical to those obtained with mean squared error. ↩︎
See Robeson and Willmott (2023) for a different and arguably more useful decomposition of mean absolute error. ↩︎