Issue #5 - Profiling Correlation & Duplicate Data for Machine Learning
This Week’s Tutorial
You can get the CSV files from the newsletter's GitHub repository if you want to follow along.
Here's the code to get the tutorial's dataset loaded and profiled:
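Here's a minimal sketch of what that looks like; the CSV file name is an assumption, so substitute the file from the repository:

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the tutorial dataset (file name is an assumption; use the CSV from the repo)
df = pd.read_csv("adult.csv")

# Build the profiling report and save it as a standalone HTML file
profile = ProfileReport(df, title="Adult Census Data Profile")
profile.to_file("adult_profile.html")
```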
While correlation has numerous mathematical definitions, today's tutorial will take an intuitive approach. Intuitively, correlation quantifies the relationship between two features (i.e., columns) within a dataset.
In the context of machine learning, you can interpret this relationship as quantifying how well you can predict one feature using the other.
For reasons beyond the scope of this tutorial, you cannot assume that correlation tells the whole story regarding prediction. However, correlation is a great initial indicator during the profiling phase of your ML projects.
This is why ydata-profiling provides correlations in the profiling report.
To access the correlations, scroll down in the profile report. Here's what you will see:
I find the Heatmap tab the most useful, but the same information is presented as a table if you prefer.
The first thing to notice is the dark blue diagonal line stretching from the heatmap's top left to the bottom right.
This line is the correlation of each feature with itself, which illustrates some key points of the visualization:
Darker colors represent higher levels of correlation (i.e., predictability).
Every feature is very highly correlated with itself.
The heatmap is symmetric about the diagonal: each half is a mirror image of the other.
The second thing to consider is the vertical color scale on the right side of the heatmap. The color intensity corresponds to correlation values ranging from -1.00 to 1.00.
A value of -1.0 is a perfect negative correlation, while a 1.0 is a perfect positive correlation.
Notice that every feature has a perfect positive correlation with itself. What this means is most easily demonstrated with a visualization known as a scatter plot:
The Age feature is on both the x-axis and y-axis in the scatter plot. Notice how all the dots in the plot are in a straight diagonal line going from the lower left to the upper right. This is because the data for the visual looks like this:
Row 37 is (44, 44)
Row 1,234 is (31, 31)
Row 9,745 is (57, 57)
Etc.
The plot of Age demonstrates a perfect positive correlation. In natural language, you can think of the plot like so:
"If I know someone's Age, I can perfectly predict their Age."
Kind of silly, I know. But we're building an intuition here.
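If you want to recreate a plot like this yourself, here's a quick matplotlib sketch (assuming the df loaded earlier and the Age column name used in this tutorial):

```python
import matplotlib.pyplot as plt

# Age plotted against itself - every point falls on the diagonal
plt.scatter(df["Age"], df["Age"], alpha=0.3)
plt.xlabel("Age")
plt.ylabel("Age")
plt.title("Perfect positive correlation")
plt.show()
```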
A perfect negative correlation would have an opposite line. The scatter plot would be a straight diagonal line starting at the top left and stretching to the lower right (i.e., a downward line).
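You can verify both extremes numerically with pandas; here's a tiny sketch using a made-up series:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

print(x.corr(x))   # 1.0  -> perfect positive correlation
print(x.corr(-x))  # -1.0 -> perfect negative correlation
```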
Take a look at the heatmap again. Notice that the Education and EducationNum features have a perfect positive correlation. This makes total sense because:
Education is the categorical representation of the highest level of education attained.
EducationNum is the numeric representation of the highest level of education attained.
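You can confirm this one-to-one mapping with a couple of lines of pandas (column names as shown in the profile report):

```python
# Each Education category should map to exactly one EducationNum value
print(df.groupby("Education")["EducationNum"].nunique())

# The mapping itself, one row per education level
print(df[["Education", "EducationNum"]].drop_duplicates().sort_values("EducationNum"))
```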
Now, take a look at the Sex and Relationship features. Notice that the correlation between them is not perfect but is relatively high (i.e., a darker shade of blue). This tells us that there is a predictive relationship between these two features.
In general, we want to see darker colors. When it comes to making high-quality predictions, we don't care whether the correlation is positive or negative; machine learning algorithms can use both kinds.
I'm going to display the heatmap again for convenience:
When profiling correlation for machine learning, we're looking for two things:
High levels of correlation (i.e., dark colors) between features.
High levels of correlation between features and the label.
As a refresher, the label is the outcome of interest that the ML model is learning to predict. In the case of this dataset, the label is named Label (handy, huh?).
As already noted, we have some highly correlated features:
Education and EducationNum.
Sex and Relationship.
This is important to note as some machine learning algorithms (e.g., logistic regression) have difficulty with highly correlated features.
Other algorithms (e.g., decision trees and random forests) are not as adversely impacted by highly correlated features.
A typical outcome of profiling highly correlated features is picking one to use in your machine learning model. For example, you may choose to use Education and leave out EducationNum.
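In code, that decision is a one-liner; a sketch assuming the df from earlier:

```python
# Keep Education and drop its redundant numeric twin before modeling
features = df.drop(columns=["EducationNum"])
```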
Next up, you want to take a look at the correlation between the features and your label. Here's the heatmap again, focusing on the label:
Going across the Label row shows that some features have little correlation with the label. A prime example is the Fnlwgt feature.
Other features are more highly correlated with the label. Examples include MaritalStatus and Relationship (note how these two features are also correlated with each other).
At this stage of your ML projects, you're looking for features that are highly correlated with your label. These are good candidates for your ML model.
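If you'd like a quick numeric version of this check outside the report, here's one possible sketch. Encoding the label assumes it holds the usual census income strings (e.g., '>50K'), so adjust to match your data:

```python
# Encode the binary label as 0/1 (label values are an assumption)
label_num = (df["Label"] == ">50K").astype(int)

# Correlation of each numeric feature with the encoded label,
# sorted by strength regardless of sign
numeric_features = df.select_dtypes("number")
print(numeric_features.corrwith(label_num).sort_values(key=abs, ascending=False))
```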
The next step after profiling correlation is a deeper dive into the potential predictive relationship between your features and your label. My Introduction to Machine Learning self-paced online course (offered in partnership with TDWI) covers this in great detail.
The profile report also includes a section that covers duplicate data. You can find this by scrolling down the report until you see the following:
As covered in a previous newsletter tutorial, the Overview section of the profile shows the number of duplicate rows in the dataset.
The Duplicate rows section lets you examine the duplicated rows themselves to assess whether anything warrants investigation.
As discussed in the previous tutorial, duplicate rows are not automatically a problem. For example, this dataset represents US citizens at a high level of abstraction. Given the large number of US citizens and the level of abstraction, duplicate rows are expected.
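If you want to run the same check directly in pandas, here's a short sketch:

```python
# Count rows that are exact copies of an earlier row
print(df.duplicated().sum())

# Pull every copy (keep=False flags all members of each duplicate group)
duplicates = df[df.duplicated(keep=False)]
print(duplicates.sort_values(list(df.columns)).head(10))
```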
This Week’s Book
Given that this week's newsletter focused so much on features, I wanted to recommend a book dedicated to feature engineering with Python:
This is a solid first self-study resource for feature engineering.
That's it for this week.
Stay tuned for next week's newsletter, the first in a tutorial series on strategies for handling missing data with your ML models.
Stay healthy and happy data sleuthing!
Dave Langer
Whenever you're ready, there are 4 ways I can help you:
1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required, and it works with Python in Excel!
2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. My self-paced online course teaches you how to extract insights from your unlabeled data.
3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI, use code LANGER to save 20% off.
4 - Think machine learning might be right for your business, but don't know where to start? Check out my Machine Learning Accelerator.