Bias-variance decomposition

The bias-variance tradeoff arises from a frequentist viewpoint on model complexity (as opposed to the Bayesian one). The decomposition helps us understand how complexity affects generalization. Suppose we have a model $\hat{y}(x)$ and a dataset of inputs $x_i$ and outputs $y_i$, $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$. We would like to find a meaningful decomposition of the expected squared loss when we repeatedly train our model on different datasets $\mathcal{D}$: $\mathbb{E}_{\mathcal{D}}\mathcal{L}(\hat{y}(x), y) = \mathbb{E}_{\mathcal{D}}\mathbb{E}_x\big((\hat{y}(x) - y)^2\big) = \mathbb{E}_x\mathbb{E}_{\mathcal{D}}\big((\hat{y}(x) - y)^2\big)$. We further assume that the observations $y = t + \epsilon$ are the sum of a true target $t$ and intrinsic noise $\epsilon$ with zero mean and variance $\sigma^2$.
Let’s start by considering the standard variance decomposition of a random variable $x$:
\begin{align}
\mathbb{E}(x-\mathbb{E}x)^2 &= \mathbb{E}x^2 - 2\,\mathbb{E}(x\,\mathbb{E}x) + (\mathbb{E}x)^2\\
&= \mathbb{E}x^2 - (\mathbb{E}x)^2
\end{align}
where the expectation is taken over the probability distribution $p(x)$ of $x$. Rearranging the terms, we get an expression for the expected square $\mathbb{E}x^2$:
\begin{equation}
\mathbb{E}x^2 = (\mathbb{E}x)^2 + \mathbb{E}(x-\mathbb{E}x)^2
\end{equation}
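As a quick sanity check of (3), sample moments satisfy the same identity exactly; here is a minimal NumPy sketch (the distribution and sample size are arbitrary choices):

```python
import numpy as np

# Numerical check of identity (3): E[x^2] = (E[x])^2 + E[(x - E[x])^2].
# With sample moments (1/N normalization) the identity holds exactly,
# up to floating-point rounding.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

lhs = np.mean(x ** 2)
rhs = np.mean(x) ** 2 + np.mean((x - np.mean(x)) ** 2)
print(lhs, rhs)                  # the two values coincide up to rounding
assert np.isclose(lhs, rhs)
```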
Incidentally, since the variance term is non-negative, this identity yields $\mathbb{E}x^2 \geq (\mathbb{E}x)^2$, a special case of Jensen’s inequality for convex functions ($\mathbb{E}f(x) \geq f(\mathbb{E}x)$). Now, we can replace $x$ with the model error $y - \hat{y} = t + \epsilon - \hat{y}$ and use the above expression to obtain the expectation over different datasets $\mathcal{D}$ of the squared loss, while keeping the input $x_i$ fixed:
\begin{aligned}
\mathbb{E}_{\mathcal{D}}(t_i + \epsilon -\hat{y}(x_i))^2 &= \mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))^2 + \mathbb{E}_{\mathcal{D}}\epsilon^2 + 2\,\mathbb{E}_{\mathcal{D}}\big((t_i-\hat{y}(x_i))\,\epsilon\big) \\
&= \mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))^2 + \sigma^2
\end{aligned}
The last step holds because $\epsilon$ has zero mean and is independent of the training set, so the cross term vanishes and $\mathbb{E}\epsilon^2 = \sigma^2$.
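To illustrate why only $\sigma^2$ survives, here is a tiny simulation (the distribution used for the model error $t_i - \hat{y}(x_i)$ across datasets is an arbitrary stand-in): because the noise is zero-mean and independent of the trained model, the cross term averages to zero.

```python
import numpy as np

# With eps zero-mean and independent of the trained model, the cross term
# 2*E[(t - y_hat) * eps] averages to zero and only sigma^2 is added to the
# expected squared model error.
rng = np.random.default_rng(1)
sigma = 0.5
n_trials = 200_000

model_error = rng.normal(0.3, 1.0, size=n_trials)  # stand-in for t_i - y_hat(x_i) across datasets
eps = rng.normal(0.0, sigma, size=n_trials)        # intrinsic observation noise

lhs = np.mean((model_error + eps) ** 2)
rhs = np.mean(model_error ** 2) + sigma ** 2
print(lhs, rhs)                                    # agree up to Monte Carlo error
```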
Now, applying (3) to the first term:
\begin{align}
\mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))^2 &= \big(\mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))\big)^2 + \mathbb{E}_{\mathcal{D}}\big(\hat{y}(x_i) - t_i - \mathbb{E}_{\mathcal{D}}(\hat{y}(x_i) - t_i)\big)^2\\
&= \big(\mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))\big)^2 + \mathbb{E}_{\mathcal{D}}\big(\hat{y}(x_i) - \mathbb{E}_{\mathcal{D}}\hat{y}(x_i)\big)^2\\
&= \mathrm{bias}^2 + \mathrm{variance}
\end{align}
where the $t_i$ terms cancel in the second line because the true target does not depend on the dataset.
Putting the two equations together, we get:
\mathbb{E}_{\mathcal{D}}(y_i-\hat{y}(x_i))^2 = \mathrm{bias}^2 + \mathrm{variance} + \sigma^2
The first term, the squared bias, measures how much the average prediction over all datasets differs from the desired regression function. The second term, the variance, measures how much the solutions for individual datasets vary around their average, i.e. how sensitive $\hat{y}(x_i)$ is to the particular choice of training set. The last term, $\sigma^2$, is the intrinsic noise of the observations and cannot be reduced by any choice of model.
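Finally, here is a minimal Monte Carlo sketch of the whole decomposition (the true function $t(x) = \sin(2\pi x)$, the noise level, the polynomial model and all sizes are assumed purely for illustration): it fits the same model class on many independently drawn training sets and compares the estimated $\mathrm{bias}^2 + \mathrm{variance} + \sigma^2$ with the expected squared error on noisy observations.

```python
import numpy as np

# Monte Carlo illustration of the decomposition: fit the same model class on
# many independently drawn datasets D and estimate bias^2, variance and noise
# at a grid of test inputs.
rng = np.random.default_rng(42)

def true_fn(x):
    return np.sin(2 * np.pi * x)       # assumed true regression function t(x)

sigma, n_train, n_datasets, degree = 0.3, 25, 500, 3
x_test = np.linspace(0.0, 1.0, 50)
t_test = true_fn(x_test)

# Train on n_datasets independent datasets and store the predictions on x_test.
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x_train = rng.uniform(0.0, 1.0, size=n_train)
    y_train = true_fn(x_train) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    preds[d] = np.polyval(coeffs, x_test)

avg_pred = preds.mean(axis=0)                   # E_D[y_hat(x)]
bias2 = np.mean((t_test - avg_pred) ** 2)       # squared bias, averaged over x
variance = np.mean(preds.var(axis=0))           # variance, averaged over x

# Expected squared error against noisy test observations y = t + eps.
y_test = t_test + rng.normal(0.0, sigma, size=(n_datasets, x_test.size))
expected_error = np.mean((y_test - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias2 + variance + sigma**2:.4f}")
print(f"expected squared error      = {expected_error:.4f}")   # roughly equal
```

In this sketch, increasing the polynomial degree shifts the balance from bias to variance while their sum (plus the fixed $\sigma^2$) tracks the expected error, which is exactly the tradeoff the decomposition makes explicit.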