
Issue #8 - Handling Missing Data For Machine Learning: Proxy Features & Imputation

This Week’s Tutorial

While missing data is common in DIY data science, it is problematic for machine learning.

Most machine learning algorithms cannot handle missing data. Even a single missing value in a 100,000-row dataset will cause these algorithms to throw an error.
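
As a quick illustration, here's a minimal sketch of the error a single NaN triggers. I'm using scikit-learn's LinearRegression, but most of the library's estimators behave the same way:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A tiny dataset with a single missing value (NaN) in the features.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.0])

try:
    LinearRegression().fit(X, y)
except ValueError as e:
    # scikit-learn validates its inputs and rejects NaN values outright.
    print(f"Training failed: {e}")
```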

As a DIY data scientist, you must have strategies for handling missing data in your machine learning models.

Here are the strategies I teach my corporate clients for handling missing data:

  1. Fix the data.

  2. Use an algorithm that can handle missing values.

  3. If only a small percentage of observations have missing data, remove those observations.

  4. Remove features with missing data.

  5. Find a proxy feature for the feature with missing data.

  6. Fill in the missing data (i.e., impute the missing data).

The above strategies are listed in decreasing order of desirability. Don't make the mistake of jumping down the list just because a strategy looks cool.

This week’s issue will cover strategies 5 and 6.

Think of a proxy feature as a column where:

  • There is no missing data.

  • The column's information significantly overlaps with a feature that has missing data.

The technical name for this information overlap is correlation. A good proxy feature will be highly correlated with the feature that's missing data.

If you are unfamiliar with correlation, I can teach you how to perform correlation analysis in my new Visual Analysis with Python online course.
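
To give you a taste, here's a minimal sketch of a correlation analysis using pandas. The data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical data - the column names are for illustration only.
df = pd.DataFrame({
    "Age": [34, 45, None, 29, 52, None, 41],       # feature with missing data
    "YearsAtAddress": [8, 15, 12, 3, 20, 10, 13],  # candidate proxy
    "Income": [52000, 88000, 61000, 47000, 95000, 58000, 72000],
})

# pandas computes each pairwise correlation using only the rows where both
# values are present, so the missing Age values don't break the analysis.
print(df.corr()["Age"].sort_values(ascending=False))
```

A candidate column with a correlation close to +1 or -1 is a strong proxy; values near 0 mean the column won't stand in well for the missing feature.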

Here's the critical thing to remember with proxy features:

The correlation between a feature with missing data and a proxy is almost always imperfect. Because of this, proxy features can help - or hurt - your model's predictive performance.

This is why it's always a good idea, if possible, to use ML algorithms that can natively handle missing data (e.g., RandomForestClassifier in scikit-learn).
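
Here's a minimal sketch, assuming scikit-learn 1.4 or later (the release where the random forest estimators gained native missing-value support):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# NaNs in the features - no imputation or row removal needed.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

# Requires scikit-learn >= 1.4, where random forests learned to route
# samples with missing values during tree splits.
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
print(clf.predict([[np.nan, 4.0]]))
```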

If you're in an experimental mood (which is always a good thing), you can compare a model using the feature with missing data to a model using the proxy feature instead.

When it comes to handling missing data, experimentation is going to be your best bet. Every strategy in this series (excluding strategy #1) has tradeoffs due to the imperfect data given to the ML algorithm.

For an example of engineering a proxy feature, check out my Introduction to Machine Learning online course (offered in partnership with TDWI).

Replacing (imputing) missing data is the siren's call for professionals new to real-world machine learning. I know because I was seduced by the call early in my ML journey.

Imputation seems so much more "data science" than the other options.

Here's why you need to fight the siren's call.

Imputation can range from the simple to the complex. Here are some examples at the simple end of the spectrum:

  • Replacing missing numeric data with the average (mean) of the values present in the feature.

  • Replacing missing categorical values with the mode (i.e., the feature's most common value).
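
Here's a minimal sketch of both simple techniques using scikit-learn's SimpleImputer - the data and column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one numeric and one categorical feature.
df = pd.DataFrame({
    "Age": [34.0, np.nan, 29.0, 52.0],               # numeric feature
    "City": ["Austin", "Dallas", np.nan, "Dallas"],  # categorical feature
})

# Replace missing numeric values with the mean of the values present.
df["Age"] = SimpleImputer(strategy="mean").fit_transform(df[["Age"]]).ravel()

# Replace missing categorical values with the mode (most frequent value).
df["City"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["City"]]).ravel()

print(df)
```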

And here are some complex examples:

  • Training a DecisionTreeRegressor to predict the missing values of a numeric feature.

  • Training a DecisionTreeClassifier to predict the missing values of a categorical feature.

Whether using simple or complex techniques, imputation is about using predictive models (yes, the mean is a predictive model) to replace missing data.

And every predictive model is wrong to one degree or another.

Conceptually, let's see why this is such a big problem.

Let's say that you've got a numeric feature with missing values, and you've decided to train a DecisionTreeRegressor to impute the missing data.

Let's say the feature in question is Age, and this feature is the only one missing data.

Here's how the process works at a high level:

  1. Working with your original training dataset, you create a 2nd training dataset consisting of only those rows where Age has data present.

  2. Your goal is to train a DecisionTreeRegressor that accurately predicts Age using the other features of this 2nd training dataset.

  3. This effectively becomes a 2nd ML project within your original project, including profiling the data, feature engineering, tuning, etc.

  4. You use a metric like mean absolute error (MAE) to optimize and evaluate the effectiveness of your imputation model (i.e., the DecisionTreeRegressor).

Let's say your MAE is 5.2. That means your imputation model's predictions for Age are off by 5.2 years on average (either high or low). An MAE of 5.2 can be pretty respectable, by the way.
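
Here's a minimal sketch of steps 1 through 4, assuming a small hypothetical dataset where only Age has missing values:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical data - only the Age feature has missing values.
df = pd.DataFrame({
    "Age": [34, 45, np.nan, 29, 52, np.nan, 41, 38],
    "YearsAtAddress": [8, 15, 12, 3, 20, 10, 13, 9],
    "Income": [52000, 88000, 61000, 47000, 95000, 58000, 72000, 66000],
})
feature_cols = ["YearsAtAddress", "Income"]

# Step 1: the 2nd training dataset - only the rows where Age is present.
known = df[df["Age"].notna()]

# Steps 2-4: train the imputation model and evaluate it with MAE.
X_train, X_val, y_train, y_val = train_test_split(
    known[feature_cols], known["Age"], test_size=0.2, random_state=42)
imputer = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_val, imputer.predict(X_val)))

# Use the model's (imperfect) predictions to fill in the missing Age values.
missing = df["Age"].isna()
df.loc[missing, "Age"] = imputer.predict(df.loc[missing, feature_cols])
```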

When you use this imputation model to replace missing Age values, you incorporate data with errors into the original training dataset.

You then train another model using the imputed data. This model will also be imperfect. And its imperfections will very likely be exacerbated by the imputed data.

Additionally, you will use the imputation model to replace missing Age values in the test dataset. This will usually negatively impact your accuracy estimates during the final testing of your model.

This is why imputation is the least desirable of all the strategies for handling missing data - despite arguably being the "most cool."

As always, experimentation is a very good idea if you're considering imputation. Be sure to compare a model trained on imputed data to one trained using another strategy.
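
For example, here's a minimal sketch that uses cross-validation to compare strategy #6 (mean imputation) against strategy #4 (removing the feature). The data is synthetic and purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: Age drives the label, then loses 15% of its values;
# Income is a correlated, fully populated feature.
rng = np.random.default_rng(42)
age = rng.uniform(20, 65, 300)
income = age * 1500 + rng.normal(0, 8000, 300)
churn = (age + rng.normal(0, 8, 300) < 40).astype(int)
age[rng.random(300) < 0.15] = np.nan

X = pd.DataFrame({"Age": age, "Income": income})

# Strategy #6: impute the mean, then train.
imputed = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
])

# Strategy #4: drop the feature with missing data, then train.
dropped = DecisionTreeClassifier(max_depth=4, random_state=42)

print("Imputed Age:", cross_val_score(imputed, X, churn, cv=5).mean())
print("No Age     :", cross_val_score(dropped, X[["Income"]], churn, cv=5).mean())
```

Whichever version scores better in cross-validation is the one to carry forward - there's no way to know in advance.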

Check out my Introduction to Machine Learning self-paced online course to learn how to craft an imputation model.

This Week’s Book

I recommend the following book in my new Visual Analysis with Python online course. It's my favorite book on visual data analysis.

I own multiple books by Stephen Few. If your work has a visual aspect to it (e.g., dashboards), you can't go wrong studying the works of Mr. Few.

That's it for this week.

Stay tuned for next week's newsletter, where I will begin a new tutorial series on interpreting machine learning models.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and it works with Python in Excel!

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. My self-paced online course teaches you how to use cluster analysis to extract insights from your unlabeled data.

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI; use code LANGER to save 20%.

4 - Think machine learning might be right for your business, but don't know where to start? Check out my Machine Learning Accelerator.