
Issue #17 - Hierarchical Clustering Part 4: Python Code

This Week’s Tutorial

NOTE - If you would like to follow along with your own code (highly recommended), you can get the dataset from the newsletter GitHub.

After completing the first three tutorials, it's time to apply what you've learned and perform agglomerative hierarchical clustering on a dataset.

Today's tutorial will use a dataset of customer behaviors—specifically, grocery purchases. The dataset also includes customer characteristics (e.g., Income and Age).

Here's the code for loading the dataset:
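If you're coding along, a minimal sketch of the loading code looks like this (the file name below is an assumption - use whatever name the CSV has in the newsletter GitHub):

    import pandas as pd

    # Load the customer grocery purchases dataset downloaded from the
    # newsletter GitHub. NOTE - the file name below is an assumption.
    customer_df = pd.read_csv('customer_purchases.csv')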

And information about the DataFrame:
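A call to info() produces the summary (assuming the customer_df name from the sketch above):

    # Column names, non-null counts, and dtypes - with this dataset you
    # should see 2205 non-null values for every column.
    customer_df.info()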

We can use this dataset with the AgglomerativeClustering class from scikit-learn for the following reasons:

  • There are 2,205 rows of data and no data is missing (i.e., 2205 non-null in the output above).

  • All of the features are numeric.

Like AgglomerativeClustering, the most commonly used clustering algorithms from scikit-learn (e.g., KMeans and DBSCAN) will not work if data is missing in the dataset.

Additionally, these algorithms use Euclidean distance, so they will only work with genuinely numeric columns.

I say "genuinely" here because one-hot encoded categorical features are not genuinely numeric!

If you need to cluster categorical data (which is the norm), my Cluster Analysis with Python online course will teach you how.

Now that we know these requirements are fulfilled, we can take a peek at the data:
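A quick call to head() does the trick (again assuming the customer_df name from the loading sketch):

    # Take a peek at the first five rows of the data
    customer_df.head()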

When looking at the data in this way, the primary thing you want to know is whether any features are of a different magnitude than the others. We see precisely this in the output above:

  • The Income feature is far larger than the other features.

More precisely, the Income feature is on a different scale than the other features.

Features of different scales impact the Euclidean distance calculations and can result in suboptimal clusters. We will tackle this problem later by transforming all the features to be on a similar scale.
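As an aside, scikit-learn's StandardScaler is one common way to put features on a similar scale. The sketch below is illustrative only and is not necessarily the transformation this series will use:

    from sklearn.preprocessing import StandardScaler

    # Illustrative only - transform every feature to have a mean of 0 and a
    # standard deviation of 1 so that no single feature (e.g., Income)
    # dominates the Euclidean distance calculations.
    scaler = StandardScaler()
    customer_scaled = scaler.fit_transform(customer_df)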

As covered in the first issues of this newsletter, it's always a good idea to profile your data before applying machine learning to it. My favorite way to profile data is using the mighty ydata-profiling library:
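A minimal sketch of generating the report (the report title is an assumption):

    from ydata_profiling import ProfileReport

    # Profile the dataset - the resulting report includes the Alerts
    # section discussed below.
    profile = ProfileReport(customer_df, title='Customer Purchases Profile')
    profile.to_notebook_iframe()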

If you're new to ydata-profiling, check out the newsletter back issues, as I won't go through the entire profiling process in this tutorial. Here's the ydata-profiling Alerts report:

To summarize the profiling:

  • The duplicate rows are not a problem as the data is at an abstracted level (i.e., duplicate rows are to be expected).

  • Unlike other machine learning techniques (e.g., OLS linear regression), the correlation between features is not a deal-breaker for this clustering technique.

  • The high percentage of zeros in some features is expected, given the nature of the dataset.

The net-net is that this dataset is good to go for clustering.

For this tutorial, we'll use the default way the AgglomerativeClustering algorithm evaluates the quality of clusters (i.e., we'll use the default linkage), and we'll also tell the algorithm to find as many clusters as it can:
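A sketch of that setup (the variable name is an assumption; passing distance_threshold=0 with n_clusters=None is what asks scikit-learn for as many clusters as possible):

    from sklearn.cluster import AgglomerativeClustering

    # Default linkage (ward) and no fixed number of clusters - setting
    # distance_threshold=0 with n_clusters=None tells scikit-learn to
    # compute the full tree.
    hierarchical_model = AgglomerativeClustering(distance_threshold=0,
                                                 n_clusters=None)
    hierarchical_model = hierarchical_model.fit(customer_df)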

The above parameters are passed to the AgglomerativeClustering constructor, which tells the algorithm to compute the complete taxonomy (i.e., what the scikit-learn documentation calls the "full tree") from the data.

Unfortunately, the scikit-learn library doesn't provide an easy way to create a dendrogram for hierarchical clusterings. However, the scikit-learn online documentation does give the following code for a custom plotting function:
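The function below follows that documentation example ("Plot Hierarchical Clustering Dendrogram"), with comments on what each step does:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram

    def plot_dendrogram(model, **kwargs):
        # Count the number of original samples under each merge node so the
        # linkage matrix has the format SciPy expects
        counts = np.zeros(model.children_.shape[0])
        n_samples = len(model.labels_)
        for i, merge in enumerate(model.children_):
            current_count = 0
            for child_idx in merge:
                if child_idx < n_samples:
                    current_count += 1  # leaf node
                else:
                    current_count += counts[child_idx - n_samples]
            counts[i] = current_count

        # Build the SciPy-style linkage matrix from the fitted model
        linkage_matrix = np.column_stack(
            [model.children_, model.distances_, counts]
        ).astype(float)

        # Plot the corresponding dendrogram
        dendrogram(linkage_matrix, **kwargs)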

Given that we told the AgglomerativeClustering algorithm to find as many clusters as possible (i.e., compute the entire tree), we'll need a large canvas for the dendrogram. The following code sets this up and calls the custom plotting function:
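A sketch of that setup (the exact figure size is an assumption - use whatever canvas works for your display):

    from matplotlib import pyplot as plt

    # A large canvas is needed to display the full tree for 2,205 customers
    plt.figure(figsize=(25, 10))
    plt.title('Hierarchical Clustering Dendrogram')

    # Plot the full dendrogram from the fitted model
    plot_dendrogram(hierarchical_model)
    plt.show()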

And the dendrogram:

The above dendrogram illustrates a critical idea in real-world cluster analysis.

To get usable clusters, you must tune the algorithm. This tuning will be different for each dataset that you cluster.

A future tutorial will also discuss the various linkage options and how they can impact the number of clusters found.

This Week’s Book

I was privileged to teach my Machine Learning Bootcamp to a diverse group of professionals last week. I received the following question from one of the bootcamp attendees:

"Dave, can we use random forests to make better forecasts?"

This is a common question I receive (BTW - the answer is a resounding "Yes!"), and I always recommend the following book in response to the question:

Don't let the "Supply Chain" part of the book title fool you. The techniques taught in the book can be used for forecasting in any domain.

That's it for this week.

Stay tuned for next week's newsletter covering how to tune the agglomerative hierarchical clustering algorithm for better results.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and it works with Python in Excel!

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course comes in, teaching you how to extract insights from your unlabeled data.

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.

4 - Think machine learning might be right for your business, but don't know where to start? Check out my Machine Learning Accelerator.