When I get a data set to explore, the first thing I think about is the cleanliness of the data. If it's survey data, there may be inconsistencies in the collection method, especially if the data spans years of surveys.

From reading the freely downloadable textbook Think Stats, I learned the term "recode," which refers to variables that have been recoded from the original, raw intake data to be more consistent or accurate for analysis.

To illustrate statistical techniques, the book uses the U.S. National Survey of Family Growth (NSFG), which is conducted periodically by the U.S. Centers for Disease Control and Prevention (CDC). As I dug into that data set, I learned that missing data was often recoded using imputed values. The abstract for this article on the survey's sample design and analysis states:

"Imputation was accomplished using a multiple regression procedure with software called IVEware, available from the University of Michigan website."

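If you're wondering why regression is used to fill in missing values, think about what would happen if you just removed the records containing missing values. Unless the values are missing completely at random, the remaining records no longer represent the full sample, so any estimate computed from them is biased toward the kinds of respondents who happened to answer. Filling every gap with the mean of the observed values has a similar problem and also understates the variability in the data. By predicting each missing value from the other variables in the same record, this survey's regression-based imputation attempts to improve on those estimates.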
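To make that concrete for myself, here is a minimal sketch in Python using NumPy. It is not the IVEware procedure the NSFG uses (that is a multiple imputation method); it is a single, simple linear regression on synthetic data that I made up, and the names `regression_impute`, `x`, and `y_obs` are my own. The assumption baked in is that `y` goes missing more often when `x` is large, so dropping incomplete records skews the mean of `y`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not NSFG data):
# y depends linearly on x, plus some noise.
n = 5000
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)

# Make y missing far more often when x is positive,
# so missingness is related to another variable in the record.
missing = rng.random(n) < np.where(x > 0, 0.6, 0.1)
y_obs = np.where(missing, np.nan, y)


def regression_impute(x, y_obs):
    """Fill NaNs in y_obs with predictions from a line fitted to the observed pairs."""
    observed = ~np.isnan(y_obs)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x[observed], y_obs[observed], 1)
    filled = y_obs.copy()
    filled[~observed] = intercept + slope * x[~observed]
    return filled


y_imputed = regression_impute(x, y_obs)

true_mean = y.mean()              # mean if nothing had been missing
dropped_mean = np.nanmean(y_obs)  # mean after dropping incomplete records
imputed_mean = y_imputed.mean()   # mean after regression imputation

print("true mean:   ", round(true_mean, 3))
print("dropped mean:", round(dropped_mean, 3))
print("imputed mean:", round(imputed_mean, 3))
```

In this setup the mean after dropping records lands well below the true mean, while the imputed mean lands close to it. One caveat: filling in exact predictions makes the imputed values look less variable than real data, which is one reason survey statisticians use more elaborate procedures.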

If a large proportion of the values in a data set are missing, imputation is not a reasonable technique, because there is too little observed data to predict from. Changing the intended use of the data set or re-surveying may be necessary in that case.

In the case of the NSFG data, however, the scientists working on the project report that only a small share of the data required imputed values. About 10 percent of the approximately 6,000 variables had any missing values. Of the overall number of individual data items collected, only between 1 and 2 percent were missing and required imputation.
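Those two percentages measure different things, which confused me at first. The toy table below is entirely hypothetical (6 respondents by 5 variables, not NSFG data) and just shows how the share of variables with at least one gap can be much larger than the share of individual items that are missing.

```python
import numpy as np

# Hypothetical toy table: 6 respondents (rows) by 5 variables (columns),
# with a single missing item marked as NaN.
data = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, np.nan, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
])

is_missing = np.isnan(data)

# Share of variables (columns) that have at least one missing item: 1 of 5.
share_of_variables = is_missing.any(axis=0).mean()

# Share of all individual items (cells) that are missing: 1 of 30.
share_of_items = is_missing.mean()

print("variables with missing values:", share_of_variables)
print("individual items missing:     ", share_of_items)
```

Here 20 percent of the variables have a gap, but only about 3 percent of the items are missing, the same kind of spread as the NSFG's 10 percent versus 1 to 2 percent.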

Learning more about how regression is used to impute missing values is now on my to-do list. I hope you enjoyed this TIL in data science.