Issue #7 - Handling Missing Data For Machine Learning: Part 2
This Week’s Tutorial
While missing data is common in DIY data science, it is problematic for machine learning.
Most machine learning algorithms cannot handle missing data. Even a single missing value in a 100,000-row dataset will throw an error with these algorithms.
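Here's a minimal sketch of what that looks like in practice. It uses scikit-learn's LinearRegression purely as an assumed example; the specific algorithm doesn't matter, only that it can't handle NaN values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A tiny feature matrix with a single missing value (np.nan).
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X, y)
except ValueError as err:
    # scikit-learn refuses to fit because X contains a NaN value.
    print(f"Fit failed: {err}")
```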
As a DIY data scientist, you must have strategies for handling missing data in your machine learning models.
Here are the strategies I teach my corporate clients for handling missing data:
1. Fix the data.
2. Use an algorithm that can handle missing values.
3. If only a small percentage of observations have missing data, remove the observations.
4. Remove features with missing data.
5. Find a proxy feature for the feature with missing data.
6. Fill in the missing data (i.e., impute the missing data).
The above strategies are listed in decreasing order of desirability. Don't make the mistake of jumping down the strategies list because something looks cool.
For example, when teaching live, it's common for me to see eyes light up when I discuss filling in (i.e., imputing) missing data. I know the excitement because I felt the same way many years ago.
However, imputing missing data should be a last resort - not the first. I will cover why in the next newsletter issue.
Last week's newsletter covered strategies 1 and 2. This week's issue will cover strategies 3 and 4.
While it may seem radical, a viable approach to handling missing data is to simply remove any rows (i.e., observations) that have missing data.
Before removing observations with missing data, you must consider the following:
Will your business stakeholders object?
What percentage of the data would you remove?
What is the nature of the observations that will be removed?
Let's take a look at each of these considerations.
First, I must confess that I've been burned by removing observations without consulting my business stakeholders first.
When the stakeholders found out that I had removed rows, they didn't care about the technical justifications for why removing the rows was OK. Their immediate reaction was to be biased against the findings because they didn't trust my methodology.
Don't repeat my mistake.
If you believe your stakeholders would object to, or be biased by, removing observations, consult them first and clearly articulate the tradeoffs of keeping the rows.
For example, imputation can fill in the missing data, but it comes with all the risks associated with that strategy.
Second, you want to assess what percentage of the data you would be removing.
While there is no hard and fast rule on how much is too much to remove, a general guideline is that the larger the dataset, the more you can potentially remove without a problem.
For example, if you have 1,000,000 rows in your dataset, you might be OK removing 50,000 observations (i.e., 5%).
Quantifying the amount of data that might be removed is a must if your stakeholders object. They will want to know the specifics before approving any removals.
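Here's a minimal sketch of how you might quantify this with pandas, assuming a hypothetical DataFrame loaded from my_data.csv (swap in your own data):

```python
import pandas as pd

# Hypothetical dataset - replace with your own DataFrame.
df = pd.read_csv("my_data.csv")

# Count rows that have at least one missing value.
rows_with_missing = df.isna().any(axis=1).sum()
pct = rows_with_missing / len(df) * 100
print(f"{rows_with_missing:,} of {len(df):,} rows ({pct:.1f}%) have missing values")

# Only after vetting the percentage (and your stakeholders) - drop those rows.
df_clean = df.dropna()
```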
Another consideration is the nature of the observations that will be removed. Here are your rules of thumb:
Observations that represent common occurrences in the business process are good candidates for removal.
Observations that represent edge cases in the business process need to be treated with care.
Common occurrences are good candidates for removal because there are many other observations that can be used in your dataset. The larger your dataset, the more this is true.
To revisit the hypothetical example, if the 50,000 observations to be removed are all common occurrences, you're likely OK if you have 950,000 observations remaining.
Edge cases are, of course, the exact opposite of this situation. Removing these observations is problematic as you can deplete your dataset of the raw materials needed to craft the most useful machine learning predictive models.
Basically, if you don't have enough edge case observations, your models might not learn how to predict them.
Here's something that you have to keep in mind.
To understand if the observations to be removed are edge cases, you must look at them. If you're building an ML predictive model, this can result in data leakage.
The best way to avoid data leakage is to split your data into training and test sets and only look at the training observations.
This is absolutely needed if you're splitting your data based on dates (e.g., training on older data and testing on newer data).
This is less of a concern if you're splitting your data randomly. However, you should take great pains to minimize the amount of profiling you do on the pre-split data.
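A minimal sketch of that workflow, assuming a hypothetical DataFrame from my_data.csv with a hypothetical target column, might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a hypothetical "target" column.
df = pd.read_csv("my_data.csv")

# Split first so any profiling only touches the training data.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Look only at the training rows that would be removed.
missing_mask = train_df.isna().any(axis=1)
removed = train_df[missing_mask]
kept = train_df[~missing_mask]

# Compare removed vs. kept rows to spot edge cases - for example,
# does the target distribution differ noticeably between the two groups?
print(removed["target"].value_counts(normalize=True))
print(kept["target"].value_counts(normalize=True))
```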
While there are several things you must vet/consider, removing observations with missing data can be a viable strategy - especially for large datasets.
In some datasets, a subset of the features are the cause for observations to be missing data. Rather than removing observations, another strategy is to remove these features.
Over the years, this has been the most common reason I see missing data in observations - a few features ruin it for all the others! 🤣
As with removing observations with missing data, consider if your stakeholders will object before you remove a feature from the dataset.
On many occasions, I've encountered business stakeholders who were passionate about including certain features in a predictive model and who would strongly object to - or dismiss out of hand - any model that didn't contain their favorite features.
Again, there are no hard and fast rules regarding how much missing data is too much for an individual feature.
You want to apply the same considerations you learned in the last strategy.
In my work, I've found that features missing more than 5-10% of their values are especially problematic.
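Here's a minimal sketch of how you might check per-feature missingness against that guideline, again assuming a hypothetical DataFrame loaded from my_data.csv:

```python
import pandas as pd

# Hypothetical dataset - replace with your own DataFrame.
df = pd.read_csv("my_data.csv")

# Percentage of missing values per feature, highest first.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Flag features above a 10% threshold as candidates for removal
# (the 5-10% range is a guideline, not a hard rule).
candidates = missing_pct[missing_pct > 10].index.tolist()
print("Candidate features to drop:", candidates)
df_reduced = df.drop(columns=candidates)
```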
When a feature has too much missing data, I've found these strategies most useful:
Using ML algorithms that can handle missing data (covered in the last issue).
Finding a proxy feature (covered in the next issue).
If you want a worked example of this situation and what to do about it, my Introduction to Machine Learning self-paced online course (offered in partnership with TDWI) covers a potentially highly useful feature that is missing a large amount of data.
This Week’s Book
This week I want to recommend my favorite book on feature engineering. This book is freely available online:
NOTE - While this book uses the R programming language, the concepts are what truly matter. Also, with ChatGPT, translating the R code to Python is easy.
That's it for this week.
Stay tuned for next week's newsletter, where I will conclude the tutorial series on handling missing data in your ML models.
Stay healthy and happy data sleuthing!
Dave Langer
Whenever you're ready, there are 4 ways I can help you:
1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required.
2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. My self-paced online course teaches you how to extract insights from your unlabeled data.
3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.
4 - Need personalized help in making an impact at work? I offer 1-on-1 coaching for students of my paid live and online courses.