
Issue #6 - Handling Missing Data For Machine Learning: Part 1

This Week’s Tutorial

While missing data is common in DIY data science, it is problematic for machine learning.

Most machine learning algorithms cannot handle missing data. Even a single missing value in a 100,000-row dataset will cause these algorithms to throw an error.

As a DIY data scientist, you must have strategies for handling missing data in your machine learning models.

Here are the strategies I teach my corporate clients for handling missing data:

  1. Fix the data.

  2. Use an algorithm that can handle missing values.

  3. If only a small percentage of observations have missing data, remove the observations.

  4. Remove features with missing data.

  5. Find a proxy feature for the feature with missing data.

  6. Fill in the missing data (i.e., impute the missing data).

The above strategies are listed in decreasing order of desirability. Don't make the mistake of jumping down the strategies list because something looks cool.
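To make strategies 3, 4, and 6 concrete, here is a minimal pandas sketch on made-up data (the column names are hypothetical, purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (hypothetical columns)
df = pd.DataFrame({
    "Age": [34, np.nan, 52, 41],
    "Income": [72000, 58000, np.nan, 90000],
    "Churn": [0, 1, 0, 1],
})

# Strategy 3: drop the few rows (observations) with missing values
rows_dropped = df.dropna()

# Strategy 4: drop entire features (columns) with missing values
cols_dropped = df.dropna(axis=1)

# Strategy 6 (last resort): impute, e.g., with each column's median
imputed = df.fillna(df.median(numeric_only=True))
```

Note how each strategy trades away something different - rows, features, or fidelity to the real values.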

For example, when teaching live, it's common for me to see eyes light up when I discuss filling in (i.e., imputing) missing data. I know the excitement because I felt the same way many years ago.

However, imputing missing data should be a last resort - not the first. I will cover why in a later newsletter issue.

That said, first up is the best option for handling missing data - fix it.

The idea is simple: Fill in the missing data with the values that should have been there in the first place. Nothing beats using the actual data values.

I will be the first to admit that this is far from easy. Researching the data and the processes that created it, and assessing whether the original values can still be collected, can require a lot of work.

In most organizations, however, this typically means finding the professionals closest to the data generation/collection process (i.e., the SMEs) and asking for their help. In my experience, the SMEs will tell you one of three things:

  • There's no way to determine the original values.

  • Getting the original values is too time-consuming or expensive (e.g., contacting customers).

  • They're too busy to help you in the first place.

Despite the hurdles, the best DIY data scientists always investigate if fixing the data is possible.

If not, it's on to the next strategy.

The scikit-learn library in Python supports many machine learning algorithms that handle missing data automatically.

You can see the complete list of these algorithms here. Out of the algorithms listed, these four are the most useful for you as a DIY data scientist:

  • DecisionTreeClassifier

  • RandomForestClassifier

  • DecisionTreeRegressor

  • RandomForestRegressor

The above algorithms not only handle missing data, they are also accessible to any professional (e.g., you don't need advanced math) and are state-of-the-art with data that comes in a tabular format.

You know. Data that comes in the form of CSVs, Excel worksheets, database tables, etc.

As covered in my Introduction to Machine Learning​ self-paced online course (offered in partnership with TDWI), these four algorithms learn by converting the features in your datasets into categories.

For example, these algorithms can take a numeric feature and learn a categorical rule like:

  • Age < 42.5

This rule effectively changes the Age numeric feature into a series of True/False values (i.e., categories).

When it comes to missing values, these algorithms create another rule based on missingness. To expand on the above example, a DecisionTreeClassifier predictive model could learn the following rules in different parts of the tree:

  • Age < 42.5

  • Age is missing

Given this information, it might be tempting to throw a random forest at your data and see what happens.

However, even when using these algorithms, you must profile your data. Just because a RandomForestClassifier can handle a feature with missing data doesn't mean that feature is suitable for machine learning.

When using these algorithms, you need to assess whether there is a potentially predictive relationship between a feature's missing data and your label/target.
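One simple way to do that profiling (a sketch, assuming a pandas DataFrame with a binary label - the column names here are made up) is to compare the label rate for rows where the feature is missing versus present:

```python
import numpy as np
import pandas as pd

# Hypothetical data: does missingness in Age relate to the label?
df = pd.DataFrame({
    "Age": [34, np.nan, 52, np.nan, 41, np.nan],
    "Churn": [0, 1, 0, 1, 0, 1],
})

# Label rate grouped by whether Age is missing
rates = df.groupby(df["Age"].isna())["Churn"].mean()
print(rates)
```

If the two rates differ substantially, the missingness itself may carry predictive signal - which is exactly what these tree-based algorithms can exploit.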

Oh, and one last thing.

More recent versions of scikit-learn are required to get algorithmic support for missing data - decision trees gained native missing-value support in version 1.3, and random forests in version 1.4. For example, the link above is for version 1.6.

If you're using Anaconda Python (which is what I use in my consulting work and all my courses), here's how to check your scikit-learn version:

  • On Mac, fire up a Terminal window.

  • On Windows, fire up the Anaconda Prompt.

  • Type conda list and hit <enter>.

  • Scroll down the list of libraries until you see scikit-learn.

  • The version will be listed.
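Alternatively, you can check the version from a Python session itself:

```python
# Print the installed scikit-learn version
import sklearn

print(sklearn.__version__)
```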

At the time of this writing, my scikit-learn version on my Mac is 1.5.1.

Be sure to confirm your scikit-learn version and support for missing data with the algorithms you will be using.

This Week’s Book

Machine learning algorithms based on decision trees (e.g., RandomForestClassifier) are state-of-the-art for DIY data science. They handle missing data and consistently provide the most useful predictive models for tabular data (i.e., most real-world datasets).

Here's the definitive book on the Classification and Regression Tree (CART) algorithm:

NOTE - This book is heavy on the mathematics of decision trees and is not cheap. However, nothing beats this book if you want the deepest understanding of decision trees.

That's it for this week.

Stay tuned for next week's newsletter, the next installment in this tutorial series on strategies for handling missing data with your ML models.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required.

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. My self-paced online course teaches you how to extract insights from your unlabeled data.

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI; use code LANGER to save 20%.

4 - Need personalized help in making an impact at work? I offer 1-on-1 coaching for students of my paid live and online courses.