The bias-variance tradeoff arises from a frequentist viewpoint of the model-complexity issue (as opposed to the Bayesian one). This decomposition helps us understand how complexity affects generalization. Suppose we have a model $y(x; D)$ and a dataset $D$ composed of inputs $x_n$ and outputs $t_n$, $n = 1, \ldots, N$. We would like to find a meaningful decomposition of the expectation of the sum-of-squares loss when we repeatedly train our model over different datasets $D$. We further assume that the observations are the sum of a true target function $h(x)$ and intrinsic noise $\varepsilon$ having zero mean and variance $\sigma^2$, i.e. $t = h(x) + \varepsilon$.
Let’s start by considering the standard variance decomposition of a random variable $x$:

$$\operatorname{Var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$
where the expectation is taken over the probability distribution of $x$, $p(x)$. Rearranging the terms, we obtain an expression for the expected sum of squares:

$$\mathbb{E}[x^2] = \mathbb{E}[x]^2 + \operatorname{Var}[x] = \mathbb{E}[x]^2 + \mathbb{E}\left[(x - \mathbb{E}[x])^2\right]$$
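As a quick numerical sanity check of this identity, here is a minimal sketch; the Gaussian distribution and its parameters are arbitrary choices, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw samples from an arbitrary p(x) (here a Gaussian, purely for illustration).
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

# E[x^2] directly...
e_x2 = np.mean(x**2)
# ...versus the rearranged decomposition E[x]^2 + E[(x - E[x])^2].
decomposed = np.mean(x) ** 2 + np.mean((x - np.mean(x)) ** 2)
print(e_x2, decomposed)  # the two values agree up to floating-point error
```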
Incidentally, since it implies $\mathbb{E}[x^2] \ge \mathbb{E}[x]^2$, this can be viewed as a special case of Jensen’s inequality for convex functions. Now, we can turn to the model error $y(x; D) - t$: substituting $t = h(x) + \varepsilon$, we retrieve the expectation over different datasets $D$ of the sum-of-squares loss, while keeping $x$ fixed:

$$\mathbb{E}_D\left[(y(x; D) - t)^2\right] = \mathbb{E}_D\left[(y(x; D) - h(x) - \varepsilon)^2\right] = \mathbb{E}_D\left[(y(x; D) - h(x))^2\right] + \sigma^2$$
The last passage holds since $\varepsilon$ has zero mean and is independent of the fitted model: the cross term $-2\,\mathbb{E}\left[(y(x; D) - h(x))\,\varepsilon\right]$ vanishes, while $\mathbb{E}[\varepsilon^2] = \sigma^2$.
Now, applying the rearranged variance decomposition above to the first term:

$$\mathbb{E}_D\left[(y(x; D) - h(x))^2\right] = \left(\mathbb{E}_D[y(x; D)] - h(x)\right)^2 + \mathbb{E}_D\left[\left(y(x; D) - \mathbb{E}_D[y(x; D)]\right)^2\right]$$
Putting the two equations together, we get:

$$\mathbb{E}_D\left[(y(x; D) - t)^2\right] = \underbrace{\left(\mathbb{E}_D[y(x; D)] - h(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\left[\left(y(x; D) - \mathbb{E}_D[y(x; D)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$
The first term, called the squared bias, represents the extent to which the average prediction over all datasets differs from the desired regression function $h(x)$. The second term, called the variance, measures the extent to which the solutions for individual datasets vary around their average, and hence how sensitive the learned function is to the particular choice of dataset. The remaining $\sigma^2$ is the intrinsic noise, which no choice of model can reduce.
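To make the decomposition concrete, we can estimate each term by Monte Carlo: repeatedly sample a dataset, fit a model, and average. The sketch below assumes a sinusoidal target $h(x) = \sin(2\pi x)$, Gaussian noise, and cubic polynomial models fit with NumPy; none of these choices come from the derivation itself, which holds for any model and target:

```python
import numpy as np

rng = np.random.default_rng(42)

def h(x):
    """True regression function h(x) (an arbitrary illustrative choice)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                       # std of the intrinsic noise eps
x_grid = np.linspace(0, 1, 50)    # fixed inputs x
n_datasets = 500                  # number of independent datasets D
degree = 3                        # model complexity (polynomial degree)

# Train one model per sampled dataset and record its predictions y(x; D).
predictions = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    t = h(x_grid) + rng.normal(0, sigma, size=x_grid.size)  # t = h(x) + eps
    coeffs = np.polyfit(x_grid, t, deg=degree)
    predictions[d] = np.polyval(coeffs, x_grid)

avg_pred = predictions.mean(axis=0)                  # E_D[y(x; D)]
bias2 = np.mean((avg_pred - h(x_grid)) ** 2)         # squared bias
variance = np.mean(predictions.var(axis=0))          # variance
noise = sigma ** 2                                   # irreducible noise

# The expected loss on fresh targets should match bias^2 + variance + noise.
t_test = h(x_grid) + rng.normal(0, sigma, size=(n_datasets, x_grid.size))
expected_loss = np.mean((predictions - t_test) ** 2)
print(bias2, variance, noise, expected_loss)
```

Raising `degree` shifts the balance between the terms: a more flexible model lowers the squared bias but raises the variance, while the noise term stays fixed.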