Issue #3 - Profiling Categorical Data for Machine Learning
This Week’s Tutorial
Categorical features are far more common in business analytics than in other fields (e.g., the physical sciences). Examples of categorical data include geographies, product lines, customer types, etc.
The most useful real-world machine learning (ML) models typically use many categorical features. But before you build anything, you need to make sure these features belong in your ML models.
Enter the ydata-profiling library.
You can get the CSV files from the newsletter's GitHub repository if you want to follow along.
First up, the code to get everything working:
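Something along these lines will do it (a minimal sketch - the file name adult_census.csv is a placeholder, so adjust it to match the CSV you downloaded from the repo):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the dataset - swap in the actual file name from the repo
data = pd.read_csv("adult_census.csv")

# Build the profiling report and save it as HTML for browsing
profile = ProfileReport(data, title="Categorical Profiling Report")
profile.to_file("categorical_profiling_report.html")
```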
Next, select a categorical feature from the ydata-profiling report:
Click MaritalStatus, as that is a categorical feature that has an alert:
The MaritalStatus feature has a High correlation alert with the Label feature. I will cover this in an upcoming newsletter. You can ignore it for now.
As with numeric features, the first thing to evaluate with a categorical feature is what is known as cardinality. That's just a fancy way of saying how many unique values are present in the feature:
The MaritalStatus feature has a cardinality of 7 distinct values.
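If you want to confirm the report's number directly in pandas, a one-liner does it (assuming the data DataFrame from the setup code above):

```python
# Count the number of distinct values (i.e., the cardinality)
data["MaritalStatus"].nunique()
```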
When profiling categorical features, you think about cardinality differently than you do with numeric features.
First, you certainly want to verify that each of the unique values makes sense, given the business nature of the data.
Second, many machine learning algorithms run into issues when a feature has many unique values (i.e., when cardinality is high). As a general guideline, you typically want a cardinality of 35 or fewer. I will cover exceptions to this later on.
At a cardinality of 7, the MaritalStatus feature passes this guideline.
Next is the count of missing data:
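The report shows this count directly. As a quick cross-check in pandas (again assuming the data DataFrame):

```python
# Count how many values are missing from the feature
data["MaritalStatus"].isna().sum()
```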
In future newsletters, I will cover strategies for dealing with missing data.
However, the MaritalStatus feature isn't missing any data, so it's on to the counts of the most frequent values:
While the above bar chart can help you understand the relative counts of unique values (also known as levels), clicking the More details button provides a lot more information:
Selecting Categories provides much more information regarding the levels of the MaritalStatus feature.
When profiling categorical features, ask yourself the following questions:
Do the level values make sense, given the business nature of the data?
Do the counts of each level make sense?
Are any level values far more frequent than others?
Are any level values far less frequent than others?
For the MaritalStatus feature, all the level values make sense. It also makes sense that Married-civ-spouse represents 59% of the data (i.e., people who are married and not serving in the military).
Now consider the Married-AF-spouse level. It occurs only 14 times in the data!
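A quick way to spot both dominant and rare levels outside of the report is pandas' value_counts() (a sketch, assuming the data DataFrame from earlier):

```python
# Absolute counts of each level, rarest at the bottom
print(data["MaritalStatus"].value_counts())

# Relative frequencies - handy for spotting dominant levels like the 59% above
print(data["MaritalStatus"].value_counts(normalize=True))
```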
Rare categorical levels, like Married-AF-spouse, are often problematic for machine learning algorithms - especially when using a technique like one-hot encoding (a topic to be covered in a future newsletter).
Here's the reasoning behind the guideline of 35 levels or fewer. The more levels a categorical feature has, the more likely it is to contain rare levels like Married-AF-spouse.
However, if you have a very large dataset, you could have more than 35 levels that are not rare. In these cases, you're likely OK with using the categorical feature "as-is."
When you do have rare levels in a categorical feature, you can experiment with consolidating the rare levels into a custom level you create.
For example, you could create a new feature named MaritalStatusWrangled from the MaritalStatus feature and substitute the custom level of Miscellaneous for the following values:
Married-spouse-absent
Married-AF-spouse
NOTE - It's best to create a new feature so that you can always revert to the original one if needed.
Here's some sample code to illustrate:
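(A minimal sketch of the idea - the DataFrame name data carries over from the setup code above.)

```python
# Levels too rare to keep as-is
rare_levels = ["Married-spouse-absent", "Married-AF-spouse"]

# Create MaritalStatusWrangled without modifying the original feature
data = data.assign(
    MaritalStatusWrangled=data["MaritalStatus"].replace(rare_levels, "Miscellaneous")
)
```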
In case you're curious, I use the assign() method in the above code because I'm a big fan of treating pandas DataFrames as immutable. I cover this in my free Python Crash Course.
And here's the ydata-profiling output for the new MaritalStatusWrangled feature:
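To regenerate the report so it includes the new feature, the same pattern as the setup code works (a sketch):

```python
# Re-profile the data so the report covers MaritalStatusWrangled
profile = ProfileReport(data, title="Categorical Profiling Report - Wrangled")
profile.to_file("categorical_profiling_report_wrangled.html")
```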
At this stage of the data profiling process, it's worth noting that there's no guarantee that the MaritalStatusWrangled feature will be better than the MaritalStatus feature.
However, this is a strategy I often use with my categorical features so I can compare the original and wrangled versions in terms of their potential predictive performance in an ML model.
Once again, you may ask yourself if you need to do this for every categorical feature in your dataset.
Yes, you do. If you don't, you're playing with 🔥.
Your goal at this stage is to ensure you don't include any problematic features in your ML model.
In the case of MaritalStatus and MaritalStatusWrangled, nothing is jumping out for elimination. These features will continue to the next stage, which will be covered in a future newsletter.
This Week’s Book
I've recommended this book before, but it's worth including in today's newsletter because of its relevance.
If you're serious about analyzing categorical data, this book will teach you powerful techniques for extracting insights:
The analysis techniques taught in this book have broad applicability.
For example, my Cluster Analysis with Python online course teaches you how to apply these techniques to handle categorical data correctly for cluster analysis.
That's it for this week.
Stay tuned for next week's tutorial where I will teach you how to profile your date-time and text features.
Stay healthy and happy data sleuthing!
Dave Langer
Whenever you're ready, there are 4 ways I can help you:
1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required.
2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course teaches you how to extract insights from your unlabeled data.
3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.
4 - Need personalized help in making an impact at work? I offer 1-on-1 coaching for students of my paid live and online courses.