
Issue #15 - Hierarchical Clustering Part 2: Algorithm Intuition

This Week’s Tutorial

NOTE - I will assume you've read the first tutorial in this series.

There are two strategies commonly used to perform hierarchical clustering:

  • Divisive is a top-down approach.

  • Agglomerative is a bottom-up approach.

In real-world DIY data science, agglomerative hierarchical clustering is overwhelmingly used. So, this will be the focus of the tutorial series.

This tutorial will help you intuitively understand how this bottom-up approach works - no math required.

Imagine you work for a company that deals in physical products (e.g., retail) and want to mine a taxonomy from your product data.

To keep the examples easy to understand, let's assume that your product data consists of only the length and width measurements of the products.

Yep. Only two features.

Don't worry.

Everything you will learn in this tutorial series applies whether you have two features or 100.

Also, since I will use visualizations in this tutorial, let's assume the product catalog consists of six products.

Again, this is OK because what you will learn scales to much larger datasets.

Here's the data:

Learning how clustering algorithms work is much easier with visualizations. Here's the above data visualized as a scatter plot:
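If you'd like to follow along in code, here's a minimal sketch of building and plotting a tiny two-feature product table. The measurements below are made-up placeholders (chosen to loosely match the visualizations in this tutorial), not the actual values from the table above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the six-product catalog - the real values are in
# the table above; these placeholders just loosely match the plots.
products = pd.DataFrame({
    "Length": [4.0, 9.0, 1.5, 4.3, 7.5, 1.7],
    "Width":  [1.5, 8.5, 4.0, 1.8, 3.0, 4.2],
})

# One point per product, labeled with its row number.
ax = products.plot.scatter(x="Length", y="Width")
for row_num, row in products.iterrows():
    ax.annotate(str(row_num), (row["Length"], row["Width"]))
plt.show()
```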

Agglomerative hierarchical clustering works by starting with each data point as its own cluster and figuring out which of the other data points are most similar.

As with most clustering algorithms, "similar" is defined by the distance between points (see the code sketch just below the list):

  1. Close data points are more similar.

  2. Distant data points are less similar.
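Here's a rough sketch of those pairwise distance calculations, assuming the hypothetical products DataFrame from the earlier sketch:

```python
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances between all six products - smaller distance
# means more similar.
dist_matrix = squareform(pdist(products[["Length", "Width"]].to_numpy()))

# Distances from one product (row 1, the top-right point in the placeholder
# data) to every other product.
print(dist_matrix[1].round(2))
```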

Consider the point in the top right of the visualization. The algorithm calculates the distance (i.e., similarity) between this point and all the other points (I didn't draw all the lines to reduce clutter):

Compare the above distances to the distances for the point below:

And compare the following:

The algorithm continues in this fashion, calculating distances and clustering the most similar data points together. The algorithm is greedy, meaning that at each step it makes the best merge it can find right then, without revisiting earlier decisions.

This is an important idea.

The clusters found early on may not be the "best" later on. This greediness is necessary for the algorithm to have any chance of running fast enough to be useful.

Greediness is a recurring theme in machine learning (e.g., decision tree-based algorithms) and works well in practice.
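To make the greedy idea concrete, here's a deliberately simplified sketch of the bottom-up merging loop. It's not how real libraries implement the algorithm (they're far more efficient), and it measures cluster-to-cluster distance using the closest pair of member points just to keep the code short:

```python
import numpy as np

def naive_agglomerative(points):
    """Deliberately simple sketch of greedy, bottom-up clustering.

    Every point starts as its own cluster. At each step, the two closest
    clusters are merged (the greedy choice) and never revisited, until
    only one cluster remains.
    """
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []

    while len(clusters) > 1:
        best = None
        # Find the pair of clusters that is closest *right now*.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)

        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))  # record the merge
        clusters[a] = clusters[a] + clusters[b]  # greedy merge...
        del clusters[b]                          # ...never reconsidered later

    return merges

# Example, using the hypothetical products DataFrame from earlier:
# for left, right, dist in naive_agglomerative(products.to_numpy()):
#     print(left, "+", right, f"at distance {dist:.2f}")
```

Notice that once two clusters are merged, the loop never goes back and reconsiders - that's the greediness.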

Based on all the distance calculations, the algorithm greedily finds the first cluster:

This distance comparison process continues and the algorithm greedily finds a second cluster:

At this stage, the visualization above shows that, based on the distances, four products belong to clusters while two do not.

Now, here's the interesting thing.

Conceptually, the algorithm treats found clusters like individual data points.

For example, in the visualization above, the distance from the top right data point will be calculated to each cluster as a whole, not to the individual data points within the clusters.

A future tutorial will teach you the strategies used to make this happen. But for now, it's not necessary.

Next, the algorithm considers the distances between clusters and any data points that do not belong to clusters.

This is where something interesting happens.

The algorithm finds that the two clusters are closer to each other than they are to the two data points that don't belong to any cluster. In response, the algorithm clusters the clusters:

The visualization above demonstrates how the algorithm constructs the hierarchy of nested clusters. Looking at the data points provides intuition about what's happening:

  • The two unclustered data points have relatively large Length and/or Width values.

  • The cluster of clusters contains the four data points with smaller values.

  • The contained cluster on the left has two data points with relatively low Length values and relatively high Width values.

  • The contained cluster on the right has two data points with relatively high Length values and relatively low Width values.

This is how the algorithm builds a taxonomy directly from the data.
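In practice, you wouldn't write the merging loop yourself. Here's a rough sketch using SciPy's linkage function, which runs the entire bottom-up process and returns the hierarchy as a merge table (again assuming the hypothetical products DataFrame; the method choice is just a placeholder - how cluster-to-cluster distances are measured is a topic for a later tutorial):

```python
from scipy.cluster.hierarchy import linkage

# Each row of Z records one merge:
# [cluster id, cluster id, merge distance, number of points in the new cluster].
# Ids 0-5 are the original rows; ids 6 and up are clusters created by earlier
# merges - the "clusters treated like data points" idea from above.
Z = linkage(products[["Length", "Width"]].to_numpy(), method="average")
print(Z)
```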

However, it's more common that you don't need a taxonomy. What you need is to find the individual clusters. My Cluster Analysis with Python course will teach you to do this.

The algorithm isn't done yet, though, because it needs to cluster all the data. So, the algorithm compares the cluster of clusters to the two remaining data points and finds the following:

At this point, only one cluster remains - the one that includes all the data points:

In the visualization above, I drew the last cluster using a rectangle simply because it fit better. 🤣

While going through the images above step-by-step helps to build your intuition, the final result is cluttered and hard to read (I added the row numbers from the dataset for context):

This is where a dendrogram comes in handy:

In the above dendrogram, the numbers along the bottom correspond to the rows of the dataset. When using real (i.e., much larger) datasets, these numbers usually represent the number of rows of data in each cluster.

The heights of the lines in the dendrogram indicate how similar clusters are - the lower two clusters merge, the more similar they are. For example, the cluster of rows 2 and 5 is the shortest.

This aligns with what you saw in the previous visualizations - these two rows are the closest.

Like most data visualizations, dendrograms break down when the scale becomes large (e.g., many nested clusters). A later tutorial will cover a strategy for optimizing the number of clusters found.
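If you're following along with the sketches above, here's roughly how you'd draw a dendrogram from the SciPy merge table Z:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Plot the hierarchy encoded in Z (from the earlier linkage sketch). The
# labels along the bottom are dataset row numbers, and the height of each
# merge reflects the distance at which two clusters were joined.
dendrogram(Z, labels=products.index.to_list())
plt.xlabel("Dataset row")
plt.ylabel("Merge distance")
plt.show()
```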

This Week’s Book

I've recommended this book previously and I'm recommending it again because it is one of the best resources for cluster analysis that I've ever found:

Not only does it cover hierarchical clustering, it also covers k-means, DBSCAN, and association rule mining (market basket analysis).

It's a tremendous resource if you're serious about building DIY data science skills.

That's it for this week.

Stay tuned for next week's newsletter covering how distances are calculated in agglomerative hierarchical clustering.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and it works with Python in Excel!

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. That's where my self-paced online course comes in - it teaches you how to extract insights from your unlabeled data.

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.

4 - Think machine learning might be right for your business, but don't know where to start? Check out my Machine Learning Accelerator.