Issue #14 - Hierarchical Clustering Part 1: Introduction
This Week’s Tutorial
Early in my data science journey, I ignored cluster analysis and missed many opportunities to have impact.
Please don't make the same mistake I did when you get started with machine learning.
There's an unfortunate reality when it comes to how data science is defined on social media and in most organizations:
Data science == machine learning.
Machine learning == predictive ML models.
Predictive ML models == production deployments.
Before I get a bunch of 🔥 email replies, let me state something for the record.
When done well, the business value of production ML predictive models can be substantial.
However, these situations are typically the exception rather than the rule. This has been my hands-on experience and is also reported in industry data collected by TDWI, Forrester, and Gartner.
For example, the percentage of ML projects that were intended for production but never made it there is very high.
This is unfortunate, because what often gets lost in the discussions about data science is that there are two forms of ML commonly used in business analytics:
Supervised Learning: The machine learns from labeled examples.
Unsupervised Learning: The machine learns from unlabeled examples.
Supervised Learning is how you craft ML predictive models like decision trees and random forests. These models learn from datasets where each row of data has an outcome of interest (i.e., the label) recorded.
For example, suppose you work for a governmental agency and want to craft an ML model to predict claims fraud. Every row of your historical dataset needs a label indicating whether that claim was fraudulent.
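To make the difference concrete, here is a minimal sketch using scikit-learn. The claims data, the two column meanings, and the fraud labels are all made up for illustration. The point is only that the supervised model needs the label array y, while the unsupervised one is fit on the rows alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is a claim, columns are
# [claim_amount, days_to_file]. These values are invented.
X = np.array([
    [200.0, 2.0],
    [250.0, 3.0],
    [220.0, 1.0],
    [9000.0, 45.0],
    [9500.0, 50.0],
    [8800.0, 40.0],
])

# Supervised learning: every row needs a label (1 = fraudulent, 0 = not).
y = np.array([0, 0, 0, 1, 1, 1])
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
predicted = tree.predict(np.array([[9100.0, 42.0]]))

# Unsupervised learning: no labels are passed in, only the rows themselves.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = kmeans.labels_

print("Predicted label for the new claim:", predicted[0])
print("Groups found without labels:", groups)
```

Note that the group numbers K-means assigns are arbitrary identifiers. They say which rows belong together, not which group is "fraud."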
Supervised learning gets all the love in social media, but there's a problem.
Most of the world's data is unlabeled - including the data in your organization.
So what do you do?
You use Unsupervised Learning.
More specifically, you use a form of Unsupervised Learning called cluster analysis. Here's a definition from my favorite machine learning textbook:
"Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships.
The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering."
Because so much real-world data is unlabeled, cluster analysis is a widely used tool in analytics to discover structure in data and produce new insights.
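One rough way to see the "similar within, different between" idea from the definition is to compare average distances. The sketch below assumes Euclidean distance as the measure of similarity and uses invented points with invented group assignments. It illustrates the idea only and is not a formal clustering quality metric.

```python
import numpy as np

# Hypothetical 2-D observations with made-up group assignments.
points = np.array([
    [1.0, 1.0], [1.0, 2.0], [2.0, 1.0],   # group 0
    [8.0, 8.0], [8.0, 9.0], [9.0, 8.0],   # group 1
])
groups = np.array([0, 0, 0, 1, 1, 1])

# Pairwise Euclidean distances between all observations.
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Boolean masks: pairs in the same group, excluding each point paired with itself.
same_group = groups[:, None] == groups[None, :]
not_self = ~np.eye(len(points), dtype=bool)

within = dist[same_group & not_self].mean()   # average distance inside groups
between = dist[~same_group].mean()            # average distance across groups

print(f"Average within-group distance:  {within:.2f}")
print(f"Average between-group distance: {between:.2f}")
```

The smaller the within-group number is relative to the between-group number, the more distinct the grouping.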
While many forms of cluster analysis have been invented over the years, the three clustering algorithms that are most used in business analytics are:
K-means clustering
DBSCAN clustering
Hierarchical clustering
If you're ready, I can teach you how to use the first two over a weekend with my Cluster Analysis with Python online course.
The third is the subject of this newsletter tutorial series.
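For orientation, here is a minimal sketch of how the three algorithms listed above can be called, assuming scikit-learn's implementations. The synthetic data and the parameter values (n_clusters, eps, min_samples) are assumptions picked to suit this toy dataset, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Hypothetical unlabeled data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

# K-means: you choose the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: you choose a neighborhood radius (eps) and a density threshold.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# Hierarchical (agglomerative) clustering: builds a tree of merges,
# then cuts it to produce the requested number of clusters.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("K-means:     ", kmeans_labels)
print("DBSCAN:      ", dbscan_labels)
print("Hierarchical:", hier_labels)
```

On data this clean, all three should recover the same two groups. They differ in what you have to specify and in how they behave on messier data.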
Based on the above definition, hierarchical clustering mines groupings from unlabeled datasets. What distinguishes hierarchical clustering is how the mined groupings are defined.
The easiest way to intuit how hierarchical clustering works is to see a typical real-world example:
The image above is a typical representation of a company - an org chart. This is an example of hierarchical clustering. Organizations worldwide cluster employees based on management hierarchies.
BTW - In machine learning terminology, the diagram above is known as a dendrogram and is commonly used to visualize hierarchical clustering results.
Hierarchical clustering can take an unlabeled dataset and mine a hierarchical structure (often referred to as a taxonomy) directly from the data.
You can then analyze the hierarchical clustering to derive new insights based on your business/processes.
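As a small sketch of mining a hierarchy and then reading it at different levels, the example below assumes SciPy's hierarchy module and Ward linkage (one of several linkage choices). The eight observations are invented: four tight pairs that sit in two larger groups, loosely mirroring teams within divisions on an org chart.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical observations: four tight pairs forming two larger groups.
X = np.array([
    [0.0, 0.0], [0.2, 0.1],        # pair A
    [1.5, 0.0], [1.6, 0.2],        # pair B
    [10.0, 10.0], [10.1, 10.2],    # pair C
    [11.5, 10.0], [11.6, 10.1],    # pair D
])

# Build the hierarchy bottom-up. Each row of Z records one merge.
Z = linkage(X, method="ward")

# Compute the dendrogram layout without drawing it.
# Drop no_plot=True (with matplotlib installed) to render the diagram.
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["ivl"]

# Cut the same tree at two levels of the hierarchy.
top_level = fcluster(Z, t=2, criterion="maxclust")     # two broad groups
lower_level = fcluster(Z, t=4, criterion="maxclust")   # four finer groups

print("Leaf order in the dendrogram:", leaf_order)
print("Top-level clusters:  ", top_level)
print("Lower-level clusters:", lower_level)
```

The same tree gives both views: a coarse cut for the big picture and a finer cut for the detail underneath it.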
For example, consider the highlighted portion of the dendrogram below:
Let's assume you're unfamiliar with the above organization and its people. You can use hierarchical clustering to derive insights like:
"The lower left cluster comprises observations (i.e., employees) with titles indicative of supply chain management functions."
"The lower right cluster comprises observations with titles indicative of manufacturing functions."
"The upper cluster appears to represent the organization's manufacturing and supply chain division."
While this is a contrived example to be sure, it illustrates that cluster analysis is a universally applicable skill:
Marketing: Segmenting customers into groups for more effective campaigns.
IT Operations: Detecting anomalies in network operations and security.
Text Analytics: Grouping documents based on similar content.
Healthcare: Mining patient data for groups to improve outcomes.
The list is endless!
If you're serious about DIY data science, you want skills with cluster analysis.
This Week’s Book
I regularly speak to professionals who are frustrated because they can't make an impact at work using data. One of the most common reasons is that their leadership isn't data literate.
If this sounds familiar, do yourself a favor and gift your leaders this book:
I wish I could wave a magic wand and make every manager read this book. It's short. It's entertainingly written.
AND it debunks so many things managers believe make them "data savvy."
It's a classic.
That's it for this week.
Stay tuned for next week's newsletter covering the hierarchical clustering algorithm.
Stay healthy and happy data sleuthing!
Dave Langer