Overfitting

Suppose you want to predict the winner of the next presidential election (seems like a good thing to have advance warning of, right?). You decide to create a statistical model by coming up with a number of possible influences on election outcomes and then looking at election results from previous years to see how each influence affected past outcomes.

So, for example, your model might start with just one input: economic growth. Maybe in years when the economy is growing well the incumbent president (or the incumbent president's party) will get re-elected, and in years when the economy is doing badly the incumbent president's party will get kicked out of office.

This seems kind of reasonable. You run a quick analysis on the previous data and see that economic growth definitely seems to have an influence on election outcomes: in many years with strong economic growth the incumbent party does indeed win, and in many years with bad economic growth the incumbent party does indeed lose.

However, you quickly realise that there are a bunch of data points that don't fit your model: there are a number of elections where the economy is doing badly but the incumbent wins anyway. You look into this data a little and realise that some of those years correspond to times the nation was at war. So you add another variable to your model, and now your prediction is that the incumbent will win if the economy's doing well and lose if the economy's doing badly, unless there's a war happening in which case the incumbent will win if the war's going well and lose if the war's going badly.

This turns out to make your model a little more successful on the past data: it now correctly predicts a few more of the elections that it had previously got wrong. But you're still not happy. You look at the data and try to find more factors that could explain the elections you haven't yet predicted correctly.

The risk in this situation is that you'll fall prey to a common statistical problem: overfitting. Overfitting is when you create a model that merely describes the particular quirks of a historical dataset, rather than identifying the true underlying patterns that cause a certain outcome to happen. If you add enough variables to a model, you can always find a way to explain every data point in your past data... but that doesn't mean your model will be any good at predicting the future, because an overfitted model is describing random historical noise rather than meaningful patterns.
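One way to see this effect concretely (using made-up numbers rather than real election data) is to fit models of increasing complexity to a small noisy dataset. The sketch below assumes a simple true pattern (a straight line plus noise) and compares a one-parameter fit against a many-parameter polynomial fit: the complex model explains the past data almost perfectly but does worse on fresh data.

```python
# Illustrative sketch of overfitting on hypothetical data (not election data):
# fit polynomials of different degrees to noisy samples of a simple line,
# then compare error on the training points vs. fresh held-out points.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    # True underlying pattern: y = 2x + 1, plus random noise
    y = 2 * x + 1 + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(10)   # small "historical" dataset
x_test, y_test = make_data(100)    # "future" data from the same process

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train error {train_err:.3f}, "
          f"test error {test_err:.3f}")
```

With 10 training points, the degree-9 polynomial has enough parameters to pass through every point, so its training error is near zero: it "explains" every historical data point, just like the ever-growing election model. But it does so by wiggling to fit the noise, so its error on new data is worse than the simple straight-line fit's.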

In our political example, we might add more and more variables until we can explain every previous election in our country's history... but later discover that a lot of those variables were not really helping us understand the true underlying reasons why presidents win or lose elections. If our final model is something like:

"the incumbent will win if the economy's doing well unless there's a war, in which case the incumbent will win if the war's going well, unless the incumbent is over 6 feet tall in which case she'll win regardless, unless her opponent's name starts with an M. in which case he'll always win"

then this model might well "correctly" describe all previous election results in some particular country, but it's a pretty safe bet that it's been overfitted: it fits the previous data too well, because we specifically tailored the model to our existing historical dataset, not because we understand the true underlying factors that will affect future elections as well.