Issue #2 - Profiling Numeric Data for Machine Learning
This Week’s Tutorial
Numeric features (i.e., columns) of data are common in real-world business analytics.
Because these features are so common, ensuring they are of good quality is critical to your success as a DIY data scientist using machine learning (ML).
Luckily, the mighty ydata-profiling library makes profiling numeric features easy.
You can get the CSV files from the newsletter's GitHub repository if you want to follow along.
First up, the code to get everything working:
Next, select a numeric feature from the ydata-profiling report:
Click CapitalLoss, as that is a numeric feature that has an alert:
In the image above, I'd like to highlight how the Zeros alert draws your attention. Having these alerts right there as you review your features is super handy.
It's a reminder of what you need to pay attention to.
The first thing to evaluate with a numeric feature is what is known as cardinality. That's just a fancy way of saying how many unique values are present in the numeric feature:
The CapitalLoss feature has a cardinality of 72 distinct values.
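You can also check cardinality directly in pandas. A sketch with stand-in values (the real feature has 72 distinct values):

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column dominated by zeros
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Cardinality = the number of distinct values in the feature
print(capital_loss.nunique())  # -> 4 (0, 1485, 1887, 1902)
```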
When profiling this feature for use in machine learning, you have to ask yourself a couple of questions:
Does the level of cardinality make sense given the business nature of the data?
Is there sufficient variability in the feature's data to be useful for ML?
Regarding the second question, think of the most extreme case of low cardinality - there's only one distinct value in the feature (e.g., the column is all zeros).
ML algorithms can't do anything with a feature like that. It's all the same data. There's no signal.
You must have some variation in your data for ML to learn patterns.
Next is the count of missing data:
As was discussed in the last newsletter, missing data is a big problem for many ML algorithms.
While algorithms like DecisionTreeClassifier and RandomForestClassifier from the scikit-learn library in Python can handle missing data, most ML algorithms cannot.
For example, clustering algorithms (which are wildly useful for DIY data science) cannot handle missing values.
In the case of the CapitalLoss feature, it isn't missing any data.
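The same check is a one-liner in pandas. A sketch with stand-in values:

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column with no missing values
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Count of missing (NaN) values -- 0 here, matching the report
print(capital_loss.isna().sum())  # -> 0
```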
Next is the count of zeros in the feature:
Notice that the font is red because ydata-profiling has flagged this as an alert.
A whopping 93.7% of the values of the CapitalLoss feature are zero!
Numeric features with high counts of zeros are not uncommon in real-world business analytics.
For example, think of this feature as recording the amount of money a US citizen lost selling shares on the stock market.
How many US citizens are actively trading stocks? Not that many.
So, lots of zeros in this feature makes sense.
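If you want to compute this share yourself, here's a sketch — the stand-in values below give 70%, while the real feature sits at 93.7%:

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column dominated by zeros
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Share of values that are exactly zero
zero_share = (capital_loss == 0).mean()
print(f"{zero_share:.1%}")  # -> 70.0% for this stand-in
```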
However, one thing you have to ask yourself is this:
Does the data collection process use zeros when the value is unknown?
This happens more often than you think. Rather than leaving the data empty, a zero is used instead.
If you see lots of zeros in your numeric feature, you will want to know if zeros are placeholders for missing data.
If you find out that many of the zeros are, in fact, missing values, you may not want to use the feature in your ML model.
Clicking the More details button takes you to the next level of profiling:
On the Statistics tab, you should check the following:
Do the min/max values make sense from a business perspective?
How does the median compare to the average?
What is the monotonicity?
Regarding the second bullet, we can see the CapitalLoss feature's median is 0, while the mean (or average) is 121.227. Given that so many values are zeros, this makes sense.
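You can see the same median-vs-mean effect in pandas. With the stand-in values below, the median stays at 0 while the three non-zero values pull the mean up:

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column dominated by zeros
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Zeros dominate, so the median is 0 while the mean is pulled upward
print(capital_loss.median())  # -> 0.0
print(capital_loss.mean())    # -> 527.4
```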
The third bullet is of special importance. Intuitively, think of monotonic data as regularly increasing values.
For example, unique row identifiers like 10, 20, 30, 40, 50, 60, etc. are monotonic.
Monotonic data can be problematic for ML algorithms. If you see monotonic data in your dataset, you must validate that it is not some form of unique identifier.
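pandas can flag monotonic columns for you. A sketch contrasting an identifier-style column with CapitalLoss-style stand-in values:

```python
import pandas as pd

# A unique-identifier-style column increases with every row
row_id = pd.Series([10, 20, 30, 40, 50, 60])

# Stand-in CapitalLoss-style values bounce around -- not monotonic
capital_loss = pd.Series([0, 0, 1902, 0, 1887, 0])

print(row_id.is_monotonic_increasing)        # -> True
print(capital_loss.is_monotonic_increasing)  # -> False
```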
If you are familiar with statistics, you can certainly use the other metrics depicted above (e.g., Kurtosis) to glean more insights into the feature.
However, these metrics do not impact some of the most useful ML techniques for DIY data science (e.g., decision trees, random forests, k-means, etc.).
Clicking on Histogram displays the distribution of the CapitalLoss feature values:
When looking at the histogram of a numeric feature, here's what you're asking yourself:
Does the spread of values make sense from a business perspective?
Where does the center (i.e., a typical value) of the distribution lie? Does it make sense from a business perspective?
What is the overall shape of the distribution? Does it have peaks and valleys? If so, do they make sense from a business perspective?
Do you notice how I keep using the word "business"?
You aren't going to be successful with real-world machine learning if you don't have business subject matter expertise (or access to it).
Looking at the histogram for CapitalLoss doesn't show anything concerning given the business nature of the feature.
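The histogram is just binned counts, and you can reproduce the idea in pandas. A sketch with stand-in values — the zeros pile up in the first bin, just like the report's big left-hand bar:

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column dominated by zeros
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Bin the values and count them -- the same idea as the Histogram tab
counts = pd.cut(capital_loss, bins=4).value_counts().sort_index()
print(counts)
```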
Clicking on Common values provides additional insight into the feature:
Not to sound like a broken record, but take a look at this information and ask whether it makes sense given the business nature of the feature.
In the case of CapitalLoss, it does.
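The Common values tab is essentially a frequency table, which pandas produces directly. A sketch with stand-in values:

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column dominated by zeros
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Most common values first -- mirrors the report's Common values tab
common = capital_loss.value_counts()
print(common.head())
```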
Lastly, clicking on Extreme values:
Per the usual, you'll check the minimum and maximum extreme values and ask if they make sense from a business perspective.
If anything doesn't look right (e.g., a value so high it doesn't make sense), it warrants further investigation.
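pandas can pull the extremes directly, mirroring the Extreme values tab. A sketch with stand-in values:

```python
import pandas as pd

# Stand-in data: a CapitalLoss-style column dominated by zeros
capital_loss = pd.Series([0, 0, 0, 1902, 0, 1887, 0, 0, 1485, 0],
                         name="CapitalLoss")

# Smallest and largest values -- check each against business expectations
print(capital_loss.nsmallest(3).tolist())  # -> [0, 0, 0]
print(capital_loss.nlargest(3).tolist())   # -> [1902, 1887, 1485]
```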
You may be asking yourself if you need to do this for every numeric feature in your dataset.
Yes, you do. If you don't, you're playing with 🔥.
Can you see why 60-80% of your time is devoted to working with the data? 🤣
To close out this week's tutorial, it's important to understand where we are in the overall process of evaluating data for use with machine learning.
At this stage, you are looking for reasons to eliminate problematic features.
For example, a feature with a massive amount of missing data.
None of what is covered in today's tutorial evaluates how useful a feature might be for a machine learning model.
To really cement this idea: the profiling gives us no reason to remove the CapitalLoss feature.
However, there's no guarantee that CapitalLoss will be useful in a predictive model.
How would you investigate this?
That's coming up in a future tutorial.
This Week’s Book
I used the term "business" a lot in this week's newsletter.
As it turns out, business knowledge is vital in real-world machine learning. This is part of the reason why so many DIY data scientists will come from the ranks of business professionals.
If you come from a heavy technical background, the following book is a great resource to learn how to be more effective with data science:
That's it for this week.
Stay tuned for next week's tutorial where I will teach you how to profile your categorical features.
Stay healthy and happy data sleuthing!
Dave Langer
Whenever you're ready, there are 4 ways I can help you:
1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required.
2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course teaches you how to extract insights from your unlabeled data.
3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.
4 - Need personalized help in making an impact at work? I offer 1-on-1 coaching for students of my paid live and online courses.