What is Multiple Imputation and why does it work?

Imagine that you’re an investigative journalist, and you have a shredded document that you need to reconstruct for a story you’re writing. You have only some of the strips of the document. After piecing the strips back together, you are left with a series of sentences, each with some letters missing. Consider the following sentence from the document, which comes from a fictional company, Corporation X:

We are involved in ____ing toxic envi____ental waste in a local river.

If you’re skilled at Wheel of Fortune-style word games, you might be able to fill in the missing letters in some of these words. First, consider the word “envi____ental.” You might guess that this word is “environmental” (I can’t think of any other word that fits here, can you?). However, two words earlier, we have “____ing.” Is it “dumping”, “cleaning”, “jumping”, or any of a number of other words? How would you figure out what the document means here?

In practice, you would likely use context. In addition to the information in the sentence above, the sentences just prior and just after might suggest what that word is. Without context, you might pick the most common word ending in –ing in the English language.

One option is to pick the single most likely word; let’s say that is “dumping” in this case. This would be big news! The headline might read: Leaked internal documents from Corporation X say that it dumps toxic environmental waste in a local river!

However, another option for reconstructing the document would be to try several possibilities for the missing information, rather than just a single one. For example, we could say that, given the information in the rest of the document, there’s a 46% chance that the word is “dumping” or a synonym of it, a 44% chance that it is “cleaning” or a synonym, and a 10% chance that the word has some other meaning. We could then express the possible meanings of the document based on the different possible imputations, or fillings-in, of the missing words. This would give us a much better understanding of the document than looking at just one possibility.
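To make the idea of trying several fill-ins concrete, here is a minimal Python sketch. It takes the 46% / 44% / 10% figures from the example above as given and repeatedly draws a word for the blank in proportion to those chances. The names WORD_PROBS, impute_word, and impute_many are made up for this illustration, and the label “other” simply stands in for every remaining meaning.

```python
import random
from collections import Counter

# Chances for the missing word, taken from the example in the text.
# "other" is an illustrative stand-in for any other meaning.
WORD_PROBS = {"dumping": 0.46, "cleaning": 0.44, "other": 0.10}


def impute_word(rng):
    """Draw one plausible fill-in for the blank, weighted by its chance."""
    words = list(WORD_PROBS)
    weights = list(WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=1)[0]


def impute_many(n_draws, seed=0):
    """Fill in the blank n_draws times and count how often each word appears."""
    rng = random.Random(seed)
    return Counter(impute_word(rng) for _ in range(n_draws))


n_draws = 10_000
counts = impute_many(n_draws)
for word, count in counts.most_common():
    print(f"{word:>8}: {count / n_draws:.1%}")
```

Each draw corresponds to one possible reconstruction of the sentence. Looking at the whole collection of reconstructions, rather than only the most frequent one, keeps the uncertainty about the word visible.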

Multiple imputation is a statistical method that uses the principle described above to help make accurate estimates and inferences when some of the data are missing. Just as the journalist uses context to fill in the missing letters, in multiple imputation we use both prior knowledge and information that we learn from the data set to fill in the missing values. Furthermore, we incorporate a number of possibilities for the missing data rather than a single guess. In many applications of multiple imputation, we impute numbers rather than letters, but the principle is the same. Once we impute the missing data, we can complete a statistical analysis using any one of a variety of statistical methods, such as linear regression, with only minor modifications.
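For readers who like to see numbers, the following is a rough Python sketch of the same idea with numeric data. Everything in it is an assumption made for illustration: the data are simulated, the function names are invented, and the imputation step (a regression fitted to a bootstrap resample of the observed cases, plus random noise) is only one simple way to generate plausible values. The results from the completed data sets are combined with Rubin's rules, the standard formulas for pooling estimates across imputations.

```python
import numpy as np


def simulate(n=500, seed=0):
    """Simulated data: y depends on x, and y goes missing more often when x is large."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 2.0 + 3.0 * x + rng.normal(size=n)  # true mean of y is 2.0
    p_missing = 1.0 / (1.0 + np.exp(-x))    # chance that y is missing, given x
    missing = rng.random(n) < p_missing
    return x, np.where(missing, np.nan, y)


def impute_once(x, y, rng):
    """Fill in missing y values once, using x as the 'context'."""
    observed = ~np.isnan(y)
    # Resample the observed cases so each imputation uses slightly
    # different regression estimates (reflecting their uncertainty).
    idx = rng.choice(np.flatnonzero(observed), size=observed.sum(), replace=True)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    residuals = y[idx] - (intercept + slope * x[idx])
    sigma = residuals.std(ddof=2)
    completed = y.copy()
    n_missing = (~observed).sum()
    # Predicted value plus noise, so imputed values vary like real ones.
    completed[~observed] = (
        intercept + slope * x[~observed] + rng.normal(scale=sigma, size=n_missing)
    )
    return completed


def multiple_imputation_mean(x, y, m=20, seed=1):
    """Estimate the mean of y from m completed data sets, pooled with Rubin's rules."""
    rng = np.random.default_rng(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(x, y, rng)
        estimates.append(completed.mean())
        variances.append(completed.var(ddof=1) / len(completed))
    pooled = np.mean(estimates)
    within = np.mean(variances)           # average sampling variance
    between = np.var(estimates, ddof=1)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)


x, y = simulate()
pooled_mean, pooled_se = multiple_imputation_mean(x, y)
print(f"Ignoring missing values: mean of y = {np.nanmean(y):.2f}")
print(f"Multiple imputation:     mean of y = {pooled_mean:.2f} (SE {pooled_se:.2f})")
```

In this simulation the missing values are not a random subset, so simply averaging the observed values tends to land well below the true mean of 2.0, while the pooled estimate uses the relationship between x and y to correct for that. The between-imputation term in the pooled variance is what carries the extra uncertainty from not knowing the missing values.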

Multiple imputation and other good methods for handling missing data allow us to make accurate inferences in situations where other approaches (like ignoring the missing data) yield inaccurate ones. If you hadn’t considered other possibilities, you might wrongly conclude that Corporation X is dumping waste, when, based on the document, there’s an almost equal chance that it is cleaning up waste. While the details of multiple imputation can get a bit technical, you should now understand the concept behind it. For those interested in further reading about the more technical aspects of the method, see this site: The Multiple Imputation FAQ Page.



Amit Chowdhry is an MD-PhD student at the University of Rochester, currently working toward his PhD in statistics. His research focuses on making accurate inferences when combining the results of multiple studies (meta-analysis). In his free time, he enjoys cooking, reading about other branches of science, and volunteering at a student-run free clinic.