Issue #18 - Hierarchical Clustering Part 5: Tuning

This Week’s Tutorial

NOTE - If you would like to follow along with your own code (highly recommended), you can get the dataset from the newsletter GitHub.

As you saw in last week’s tutorial, allowing agglomerative hierarchical clustering to find all possible clusters is usually not a good idea except for the simplest of datasets.

In this tutorial, you will learn a powerful technique for tuning the clustering process to find an optimal number of clusters.

First up is the code for loading the dataset:
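
Here's a minimal sketch of that cell, assuming the dataset is a CSV file - the file name below is a placeholder for the file on the newsletter GitHub:

    import pandas as pd

    # Load the dataset - the file name is a placeholder for the CSV
    # available on the newsletter GitHub.
    customer_df = pd.read_csv('customer_data.csv')
    customer_df.head()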

As covered in last week's tutorial, the dataset has been profiled and can be used with the AgglomerativeClustering class from scikit-learn (e.g., there's no missing data).

Last week's tutorial covered how the dataset must be scaled to accommodate features that are much larger than others (e.g., Income vs. Age):
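
A minimal sketch of that scaling step using scikit-learn's StandardScaler (the customer_df name carries over from the hypothetical loading code above):

    from sklearn.preprocessing import StandardScaler

    # Standardize every feature (mean 0, standard deviation 1) so that
    # large features like Income don't dominate smaller ones like Age.
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(customer_df)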

By default, the AgglomerativeClustering class' n_clusters hyperparameter is set to 2. The way to think about this is that the algorithm defaults to the simplest clustering scenario - only two clusters.
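
You can see the default by instantiating the class with no arguments (a quick illustration, not a cell from the tutorial):

    from sklearn.cluster import AgglomerativeClustering

    # With no arguments, n_clusters defaults to 2.
    default_model = AgglomerativeClustering()
    print(default_model.n_clusters)   # 2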

However, to use clustering to drive real-world insights, you need to find the optimal number of clusters. The default is almost always too low and letting the algorithm find as many clusters as it can usually produces unusable results.

Clustering, like so much of machine learning, needs to be tuned if you want to be successful.

At a high level, here is a process for tuning agglomerative hierarchical clustering:

  1. Start with n_clusters = 2 and evaluate the quality of the clusters.

  2. Move to n_clusters = 3 and evaluate the quality.

  3. Repeat this process with higher values of n_clusters.

  4. Pick the n_clusters value with the highest quality score.

The above tuning process is conceptually simple, but there's a catch.

How do you define "cluster quality"?

In this tutorial, we will use a cluster quality calculation known as the silhouette coefficient. This calculation gives each data point (i.e., row in the dataset) a score ranging from -1 to 1.

Generally speaking, scores close to 1 indicate high quality, and scores close to -1 indicate poor quality.

The score is dependent on the value used for n_clusters. The goal of the tuning process is to find a value of n_clusters that accomplishes the following simultaneously:

  • The highest average silhouette score across all the data points.

  • A collection of clusters that provide business insights.

As we will see later, sometimes you will give up some of the first bullet to get more of the second bullet.

The first step in the tuning process is to define the values of n_clusters to be evaluated. The following code creates a Python range object to hold these values:
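
A sketch of that cell (the variable name is my own):

    # Candidate values of n_clusters to evaluate - 2 through 14.
    n_clusters_range = range(2, 15)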

Ultimately, the code in this tutorial will evaluate n_clusters values ranging from 2 through 14.

Next, we need to store the average silhouette score for each unique value of n_clusters. An empty Python list will handle this nicely:
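
Again, a sketch with an assumed variable name:

    # Holds the average silhouette score for each value of n_clusters.
    silhouette_scores = []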

A simple for loop will iterate through each value of n_clusters and run the AgglomerativeClustering algorithm on the dataset for each unique value, as shown in the code sketch below.

The scikit-learn library provides the silhouette_score function to calculate the average silhouette score for all data points in a given clustering:
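
Here's a sketch of the combined loop, using the variable names assumed above:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import silhouette_score

    for n_clusters in n_clusters_range:
        # Cluster the scaled dataset with the current number of clusters.
        model = AgglomerativeClustering(n_clusters=n_clusters)
        model.fit(scaled_features)

        # Score the clustering and keep the average silhouette score.
        silhouette_scores.append(silhouette_score(scaled_features, model.labels_))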

The above code runs rather quickly, but know that the running time can increase dramatically if you're evaluating many values of n_clusters and/or your dataset is large.

The last line in the above code cell warrants some explanation. The silhouette score works by looking at each data point in a clustering and then comparing two things:

  • The average distance from the data point to all the other data points in the same cluster.

  • The average distance from the data point to all the other data points in the next nearest cluster.
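
For a single data point, these two averages (call them a and b, respectively) combine into the score (b - a) / max(a, b), which is why the values fall between -1 and 1.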

This is why the silhouette_score function needs both the original dataset used for the clustering and the resulting cluster assignments (i.e., labels_).

The silhouette_scores list object now contains all the scores:
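
One way to inspect them (a sketch, not the original output):

    # Pair each candidate n_clusters value with its average silhouette score.
    for n_clusters, score in zip(n_clusters_range, silhouette_scores):
        print(n_clusters, round(score, 3))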

Rather than looking at the raw values, I like to visualize the scores:
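
A minimal matplotlib sketch of that plot (the tutorial's actual styling may differ):

    import matplotlib.pyplot as plt

    # Plot the average silhouette score for each candidate n_clusters value.
    plt.plot(list(n_clusters_range), silhouette_scores, marker='o')
    plt.xlabel('n_clusters')
    plt.ylabel('Average silhouette score')
    plt.title('Tuning agglomerative hierarchical clustering')
    plt.show()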

The above visualization shows that the highest average silhouette score is where n_clusters = 2. It is tempting to say that the optimal number of clusters is two and leave it at that.

However, there's more going on.

Clustering quality metrics should guide your analysis, not dictate it. For example, the above diagram shows:

  • Consistently low values when n_clusters > 5.

  • Potentially reasonable values of n_clusters are 2, 3, 4, and 5.

As the analyst, you should investigate reasonable values. It's possible that a "suboptimal" value of n_clusters (e.g., 4) might produce the best business interpretation/insights.

As I mentioned earlier, sometimes you give up some silhouette score in the pursuit of a better analysis outcome.

This Week’s Book

The best way to make an impact at work with data is to ensure what you're working on is aligned to the business. You've probably seen advice like this before (e.g., on LinkedIn).

However, how do you know for sure that you're aligned? This week's book recommendation teaches you a great framework for analyzing how businesses work:

I've used the techniques taught in this book to identify the key drivers of success in diverse situations (e.g., when I worked at Microsoft).

This allowed me to not only speak "the language of the business," but also guided my analyses to focus on what really moves the needle for the business.

That's it for this week.

Stay tuned for next week's newsletter, which will cover the various ways agglomerative hierarchical clustering builds clusters (i.e., linkages).

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and it works with Python in Excel!

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course teaches you how to extract insights from your unlabeled data.

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.

4 - Think machine learning might be right for your business, but don't know where to start? Check out my Machine Learning Accelerator.