Issue #20 - Hierarchical Clustering Part 7: Interpreting Your Clusters
This Week’s Tutorial
In Part 5 of this tutorial series, you learned how to tune the number of clusters. The tuning process showed that a reasonable number of clusters might be 2, 3, 4, or 5.
On the surface, I wouldn't blame you if you thought this wasn't very helpful. However, please consider this.
At the time of this writing, selecting the "right" number of clusters requires interpretation based on the nature of your business process.
Even in the age of ChatGPT, this work is best done by humans.
In this tutorial, you will learn how to interpret your clusters using a machine learning predictive model.
First up, loading the dataset, which is available from the newsletter GitHub:
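The dataset's exact file name isn't repeated here, so what follows is only a minimal sketch of the loading step. It assumes the data is a CSV and uses a few made-up rows held in memory as a stand-in. The column names are borrowed from features mentioned in this tutorial; the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV in the newsletter GitHub (made-up rows).
# With the real file, pass its path or URL to pd.read_csv instead.
csv_text = io.StringIO(
    "Income,Age,MeatProductPurchases,NumDealsPurchases\n"
    "58138,57,546,3\n"
    "46344,60,6,2\n"
    "71613,49,127,1\n"
    "26646,30,20,2\n"
)

customers = pd.read_csv(csv_text)

# Quick sanity checks: row/column counts and the data type of each column
print(customers.shape)
customers.info()
```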
Next, the data needs to be preprocessed because some of the features are on different scales (e.g., Income vs. Age):
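Here's a minimal sketch of this kind of preprocessing. It assumes standardization with scikit-learn's StandardScaler (the exact scaler is my assumption) and uses made-up stand-in data rather than the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data; the real dataset comes from the newsletter GitHub
rng = np.random.default_rng(42)
n = 200
customers = pd.DataFrame({
    "Income": rng.normal(52_000, 20_000, n).round(),
    "Age": rng.integers(18, 80, n),
    "MeatProductPurchases": rng.gamma(2.0, 80.0, n).round(),
})

# Standardize each feature to mean 0 and standard deviation 1 so that
# Income (tens of thousands) doesn't dominate Age (tens) in distances
scaler = StandardScaler()
customers_scaled = pd.DataFrame(
    scaler.fit_transform(customers),
    columns=customers.columns,
    index=customers.index,
)

print(customers_scaled.describe().loc[["mean", "std"]].round(2))
```

Keeping the unscaled customers DataFrame around matters later, because the interpretation step works best with the original units.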
Interpreting clusters is at the heart of cluster analysis. It's an iterative process of experimentation. Given the results of tuning, the first iteration will be to experiment with 2 clusters:
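A minimal sketch of this first iteration might look like the following. The data is a made-up stand-in and the variable names are my own, so treat it as an illustration rather than the exact newsletter code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data; the real dataset comes from the newsletter GitHub
rng = np.random.default_rng(42)
n = 200
customers = pd.DataFrame({
    "Income": rng.normal(52_000, 20_000, n).round(),
    "Age": rng.integers(18, 80, n),
    "MeatProductPurchases": rng.gamma(2.0, 80.0, n).round(),
    "NumDealsPurchases": rng.integers(0, 10, n),
})
customers_scaled = StandardScaler().fit_transform(customers)

# Hierarchical clustering with Ward linkage, cut at 2 clusters
agg_clust = AgglomerativeClustering(n_clusters=2, linkage="ward")
agg_clust.fit(customers_scaled)

# labels_ holds one cluster assignment per row of the dataset
print(pd.Series(agg_clust.labels_).value_counts())
```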
The code above uses Ward's method as the linkage criterion: at each step, it merges the two clusters whose union produces the smallest increase in within-cluster variance. In case you missed it, this was covered in the last tutorial.
The clustering object produced by the above code contains the cluster assignment for each row in the dataset.
Since the AgglomerativeClustering class was set to find 2 clusters, these cluster assignments have the values of 0 and 1.
Interpreting clusters involves characterizing the rows of data assigned to each cluster in terms that business stakeholders can understand.
While there are several options for doing this (e.g., visual data analysis), a powerful technique is to use a machine learning predictive model.
In this tutorial, we will train a DecisionTreeClassifier to learn how to predict the cluster assignments.
To ensure that the model's findings will be understandable to business stakeholders, we will train the ML model using the original (i.e., unscaled) dataset:
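As a rough sketch of this step (made-up stand-in data, my own variable names), the clustering runs on the scaled features while the decision tree is trained on the unscaled ones:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data; the real dataset comes from the newsletter GitHub
rng = np.random.default_rng(42)
n = 200
customers = pd.DataFrame({
    "Income": rng.normal(52_000, 20_000, n).round(),
    "Age": rng.integers(18, 80, n),
    "MeatProductPurchases": rng.gamma(2.0, 80.0, n).round(),
    "NumDealsPurchases": rng.integers(0, 10, n),
})

# Cluster assignments come from the scaled features
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(
    StandardScaler().fit_transform(customers)
)

# NOTE - This model is not tuned; max_depth=3 only keeps the tree small
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train on the original (unscaled) features so split values stay in business units
tree_model.fit(customers, labels)

print(f"Training accuracy: {tree_model.score(customers, labels):.3f}")
```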
The note in the above code deserves some explanation.
When interpreting clusters with an ML predictive model, you typically want to start with a highly accurate model for predicting the cluster assignments.
You typically need to tune the ML predictive model to achieve the high accuracy you want for interpreting clusters. The above code has not been tuned.
Why?
Because tuned models tend to be larger than can be easily seen in an email newsletter format, I'm limiting the decision tree model's size to three layers deep for this tutorial.
In a real-world setting, you would likely use a more complex model for better cluster interpretations than what you'll see in this tutorial.
BTW - My Introduction to Machine Learning online course will teach you everything you need to know about properly tuning decision tree predictive models.
Visualizing the decision tree predictive model is the next step:
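If you'd like a quick look without a graphical plot, one lightweight option is a text rendering using scikit-learn's export_text function. This is a sketch on made-up stand-in data, so the splits it prints will not match the tree discussed in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up stand-in data; the real dataset comes from the newsletter GitHub
rng = np.random.default_rng(42)
n = 200
customers = pd.DataFrame({
    "Income": rng.normal(52_000, 20_000, n).round(),
    "Age": rng.integers(18, 80, n),
    "MeatProductPurchases": rng.gamma(2.0, 80.0, n).round(),
    "NumDealsPurchases": rng.integers(0, 10, n),
})
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(
    StandardScaler().fit_transform(customers)
)
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(customers, labels)

# Each indented line is a split on a feature; "class:" lines are the
# cluster predicted at the end of that path through the tree
rules = export_text(tree_model, feature_names=list(customers.columns))
print(rules)
```

Reading from the top of the output down to a "class:" line gives the same kind of rule list you get by tracing a path through a tree diagram.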
Check out this previous tutorial if you're unfamiliar with reading decision tree visualizations.
Consider cluster 0. The following visualization highlights the 3 locations in the decision tree model where cluster 0 is predicted:
Going from left to right, the decision tree model tells us this about the rows of data that are predicted to be cluster 0 in the first highlighted box:
MeatProductPurchases <= 261.5
SweetProductPurchases <= 88.5
FruitProductPurchases <= 72.5
And now the second highlighted box:
MeatProductPurchases <= 261.5
SweetProductPurchases > 88.5
NumDealsPurchases > 7.0
And the last highlighted box:
MeatProductPurchases > 261.5
NumDealsPurchases > 2.5
FruitProductPurchases <= 65.0
How would we characterize cluster 0 given the above? In natural language, we could say something like:
"Cluster 0 represents approximately 76% of our customers. These customers have relatively small meat and fruit purchases."
Going through the tree for cluster 1, here's what we could say in natural language:
"Cluster 1 represents about 24% of our customers. These customers rarely purchase our deals."
OK, compare the above to the results of using 3 clusters:
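One way to rerun the whole process for a different number of clusters is to wrap the steps in a small helper. The describe_clusters function below is hypothetical (my own name and design) and runs on made-up stand-in data, so its output will differ from the results discussed here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text


def describe_clusters(data, n_clusters, max_depth=3):
    """Cluster the scaled data, then fit a small tree on the unscaled data."""
    scaled = StandardScaler().fit_transform(data)
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward"
    ).fit_predict(scaled)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    tree.fit(data, labels)
    # Fraction of rows assigned to each cluster, ordered by cluster number
    shares = pd.Series(labels).value_counts(normalize=True).sort_index()
    rules = export_text(tree, feature_names=list(data.columns))
    return shares, rules


# Made-up stand-in data; the real dataset comes from the newsletter GitHub
rng = np.random.default_rng(42)
n = 200
customers = pd.DataFrame({
    "Income": rng.normal(52_000, 20_000, n).round(),
    "Age": rng.integers(18, 80, n),
    "MeatProductPurchases": rng.gamma(2.0, 80.0, n).round(),
    "NumDealsPurchases": rng.integers(0, 10, n),
})

shares, rules = describe_clusters(customers, n_clusters=3)
print(shares.round(2))
print(rules)
```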
Going through the above decision tree model gives us the following natural language interpretations for the clusters:
"Cluster 0 represents about 35% of our customers. These customers have a fair amount of total food purchases, but tend to purchase less meat and sweet products."
"Cluster 1 represents about 24% of our customers. These customers have a fair amount of total food purchases, but tend to purchase more meat products than the other clusters."
"Cluster 2 represents about 41% of our customers. These customers have the smallest total food purchases of all the clusters."
Compare the natural language descriptions of 2 clusters vs. 3 clusters. IMHO, the interpretations provided using 3 clusters are more information-rich than those using only 2 clusters.
I wouldn't be surprised if 4 clusters, maybe even 5, provided better interpretations. Since you've got the code, I'll leave that up to you to explore.
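As a starting point for that exploration, here's a sketch that computes two things for 4 and 5 clusters: the share of customers in each cluster and the per-cluster feature averages. Those are the raw ingredients of the natural language descriptions above. It uses made-up stand-in data and my own variable names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data; the real dataset comes from the newsletter GitHub
rng = np.random.default_rng(42)
n = 200
customers = pd.DataFrame({
    "Income": rng.normal(52_000, 20_000, n).round(),
    "Age": rng.integers(18, 80, n),
    "MeatProductPurchases": rng.gamma(2.0, 80.0, n).round(),
    "NumDealsPurchases": rng.integers(0, 10, n),
})
scaled = StandardScaler().fit_transform(customers)

results = {}
for k in (4, 5):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(scaled)
    # Share of customers per cluster (the "represents about X%" part)
    share = pd.Series(labels).value_counts(normalize=True).sort_index()
    # Unscaled feature averages per cluster (the "tend to purchase..." part)
    profile = customers.assign(Cluster=labels).groupby("Cluster").mean()
    results[k] = (share, profile)
    print(f"--- {k} clusters ---")
    print((share * 100).round(1))
    print(profile.round(1))
```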
This Week’s Book
Structured Query Language (SQL) is arguably the most valuable data skill. I've been using SQL for 25 years, and it's as relevant today as ever.
Not surprisingly, I'm regularly asked for SQL self-study resources. This is the book I always recommend:
This book teaches the variant of SQL used by Microsoft SQL Server. Here's why I think of this as a good thing:
SQL is highly portable, so it doesn't matter which variant you learn.
SQL Server is used in many, many organizations.
That's it for this week.
Next week's newsletter will start a new tutorial series covering machine learning using Python in Excel.
Yes. You read that correctly.
You can do both Python and machine learning using Microsoft Excel now. 🤯
Stay healthy and happy data sleuthing!
Dave Langer