
Issue #4 - Profiling Date-Time & Text Data for Machine Learning

This Week’s Tutorial

Categorical features are far more common in business analytics than in other fields (e.g., the physical sciences). Examples of categorical data include geographies, product lines, customer types, etc.

You can get the CSV files from the newsletter's GitHub repository if you want to follow along.

Date-time data can be some of the most valuable in real-world business analytics because almost every business process has a time component. Likewise, text data is commonly produced by business processes and can be valuable for analytics.

But there's a problem.

You can't use date-time and text data "as-is" with your machine learning (ML) models. Both of these types of data have to be transformed into forms that can be used by your ML models.

This process of transforming the data for use by ML algorithms is known as "feature engineering." Feature engineering is where you make your money as a DIY data scientist, but first you need to know if your data is of sufficient quality.

In other words, why go through the trouble of feature engineering if the data stinks?

Here's how you profile date-time and text data.

First up, get the data ready for profiling:
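If you're coding along, the setup looks something like this. Note that I'm using placeholder filenames here - adjust them to match the CSV files in the repository:

```python
import pandas as pd

# Load the two CSV files from the newsletter's GitHub repository.
# NOTE: the filenames below are placeholders - adjust them to match the repo.
reseller_sales = pd.read_csv("reseller_sales.csv")
products = pd.read_csv("products.csv")
```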

The first dataset comes from Microsoft's AdventureWorksDW sample database. Imagine a company (i.e., AdventureWorks) that sells products to resellers (e.g., bicycle shops). Here's a preview of the dataset:
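To see the preview yourself in a notebook:

```python
# Preview the reseller sales data.
print(reseller_sales.shape)
reseller_sales.head()
```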

The next dataset contains product data:
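And the same quick preview for the product data:

```python
# Preview the product data.
print(products.shape)
products.head()
```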

These datasets are screaming to be joined together:
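In AdventureWorksDW, reseller sales link to products through a product key. Assuming the CSVs share a ProductKey column (adjust the key name if yours differs), the join is a one-liner:

```python
# Join the sales data to the product data on the shared product key.
# NOTE: ProductKey is an assumption - use whatever key column your CSVs share.
df = reseller_sales.merge(products, on="ProductKey", how="left")
print(df.shape)
```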

And now to profile the joined dataset:
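Profiling the joined DataFrame with ydata-profiling:

```python
from ydata_profiling import ProfileReport

# Generate the profile report for the joined dataset and save it as HTML.
profile = ProfileReport(df, title="AdventureWorks Reseller Sales Profile")
profile.to_file("reseller_sales_profile.html")
```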

Here's the Overview that ydata-profiling produces:

As I don't want this tutorial to be too long, I'm not going to cover the Overview and Alerts sections. For more information on these, take a look at the back issue tutorials.

Starting with date-time data, select OrderDate from the Variables drop-down:

As detailed in previous newsletter tutorials, your questions focus on the business nature of the data:

  • Does the number of Distinct values make sense for the data?

  • Does the Minimum date-time value make sense for the data?

  • Does the Maximum date-time value make sense for the data?

In the case of the OrderDate feature, the answer to all three questions is "Yes."

Next, as usual, comes checking for missing data. Missing data is problematic for many machine learning algorithms, so determining whether any is present is super important.

Unfortunately, with date-time data simply checking the Missing count above isn't sufficient. Here's why.

Many software and IT systems will use default date-time values when a date is missing. A classic example of this is Microsoft Excel's use of 1900-01-01. Typically, these default values are either far in the past or far in the future.

This makes checking the Minimum and Maximum values critical. If you see values that are very far in the future or past, you'll want to consult with subject matter experts to understand why these values are appearing in the data.
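If you want to sanity-check this outside the report, here's a minimal sketch. The cutoff years are placeholders - pick values that make sense for your business context:

```python
# Parse OrderDate and check for missing values and suspicious default dates.
df["OrderDate"] = pd.to_datetime(df["OrderDate"])

print("Missing OrderDate values:", df["OrderDate"].isna().sum())
print("Earliest OrderDate:", df["OrderDate"].min())
print("Latest OrderDate:", df["OrderDate"].max())

# Flag implausible dates (e.g., Excel's 1900-01-01 default or far-future values).
suspicious = df[(df["OrderDate"] < "1995-01-01") | (df["OrderDate"] > "2030-01-01")]
print("Rows with suspicious OrderDate values:", len(suspicious))
```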

Next up, click the More details button:

The Histogram tab provides an enlarged visualization of the OrderDate feature. As usual, you want to ask whether the distribution of date-time values makes sense given the business nature of the data.

For example, we see a general trend of OrderDate frequencies increasing over time. Generally speaking, this is what most businesses want to see. 😁

When looking at the histogram, here are a few things you want to ask yourself:

  • Are there cycles in the bars? For example, some business processes run on a quarterly cycle, and you may see regular spikes toward the end of each quarter.

  • Are there any outliers in the data? This would take the form of unexpectedly high bars in the histogram.
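If you'd rather check for cycles and spikes numerically instead of eyeballing the histogram, a quick monthly count does the trick (a minimal sketch, assuming OrderDate has been parsed as shown above):

```python
# Count orders per month to look for cycles (e.g., quarter-end spikes) and outliers.
orders_per_month = df.set_index("OrderDate").resample("ME").size()  # use "M" on pandas < 2.2
print(orders_per_month)

# The largest months stand out as candidates to ask a subject matter expert about.
print(orders_per_month.sort_values(ascending=False).head())
```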

As always, if you see something in the data that doesn't make sense, consult a subject matter expert to find out the "why."

Overall, the OrderDate feature looks good. Nothing at this stage rules it out from being used in a machine learning model.

I would encourage you to profile the DueDate and ShipDate features if you're following along with your own code.

As mentioned above, you must not use date-time features "as-is" in your machine learning models.
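Just to make the idea concrete, one common approach is extracting numeric date parts. This is an illustrative sketch only, not a full treatment, and the new column names are ones I made up:

```python
# Illustrative only: derive simple numeric features from OrderDate.
df["OrderYear"] = df["OrderDate"].dt.year
df["OrderMonth"] = df["OrderDate"].dt.month
df["OrderQuarter"] = df["OrderDate"].dt.quarter
df["OrderDayOfWeek"] = df["OrderDate"].dt.dayofweek  # Monday=0 ... Sunday=6
```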

If you're interested, my Introduction to Machine Learning self-paced online course (offered in partnership with TDWI) teaches you how to engineer date-time features.

Moving on to text data, select the ProductSubcategory feature from the Variables drop-down:

Technically, the ProductSubcategory feature is text data. However, ydata-profiling classifies this feature as Categorical because it has a relatively low count of distinct values.

This situation is very common in business analytics - text is used to denote categories. Here are some examples from the joined dataset:

  • SalesTerritoryCountry

  • PromotionName

  • ProductCategory
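You can confirm how category-like these text features are by counting their distinct values directly:

```python
# Low distinct counts relative to the row count signal category-like text features.
for col in ["SalesTerritoryCountry", "PromotionName", "ProductCategory", "ProductSubcategory"]:
    print(col, "->", df[col].nunique(), "distinct values")
```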

With this context, I'm going to provide a working definition of text data:

  • A text feature with many distinct values, most commonly the result of human-entered data.

A prime example of business data that fits this definition is customer survey feedback.

However, the definition would also include a feature like ProductName. Select this feature from the Variables drop-down:

The first thing to notice about ProductName is that it has 250 distinct values. This is why ydata-profiling designated the feature as Text rather than Categorical.

However, in all likelihood, ProductName is more akin to categorical data because these string values would be critical to various business processes.

As such, ProductName is likely monitored closely for data quality - the kind of thing that can't be done for true text data (e.g., customer survey feedback). For the purposes of this tutorial, we'll assume that ydata-profiling is correct and ProductName is text data.

It's worth noting that ProductName has no missing data, so we can move on to the word cloud.

If you're unfamiliar, a word cloud is a technique for visualizing text data. Word clouds depict the relative frequency of words in a set of text data by the size and intensity of each word in the visual.

For example, the word black occurs most frequently in the ProductName data, followed by mountain, frame, and red.

In practice, word clouds are not that useful for profiling text data. Clicking the More details button provides far more useful information:

The Words tab shows the result of ydata-profiling processing the raw text data:

  • All the data is first converted to lower case.

  • Any data separated by spaces is considered a distinct word.

  • Spaces are also counted as words.

The above processing is known as tokenization and is required to transform raw text data into something that can be used with analytics and machine learning.

We can see from the output above that the three most common words are black, <space>, and mountain.
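If you want to approximate the Words tab yourself, here's a rough sketch. Because it splits on whitespace, it won't report the <space> token that ydata-profiling counts:

```python
# Lowercase each product name, split on whitespace, and count the resulting tokens.
tokens = df["ProductName"].str.lower().str.split().explode()
print(tokens.value_counts().head(10))
```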

The Words tab is super useful for getting a high-level feel for what's contained within a text feature. Unfortunately, that's the extent of what ydata-profiling provides.

A lot more can be done to make text data useful for analytics and machine learning. I'm currently building a new self-paced online course on text mining with Python. More information to come.

At this stage there's nothing present in ProductName that would make us want to rule it out. If you're following along with your own code, be sure to profile the SalesOrderNumber and ProductAlternateKey features.

This Week’s Book

I've recommended this book before, but I'm including it in today's newsletter because of its relevance.

If you're serious about working with date-time data using pandas, treat yourself to this book:

This is my favorite book on using pandas, and includes great resources for working with date-times (e.g., how to handle time zones).

That's it for this week.

Stay tuned for next week's newsletter, where I cover ydata-profiling's correlation and duplicate data reporting.

Stay healthy and happy data sleuthing!

Dave Langer

Whenever you're ready, there are 4 ways I can help you:

1 - Are you new to data analysis? My Visual Analysis with Python online course will teach you the fundamentals you need - fast. No complex math required.

2 - Cluster Analysis with Python: Most of the world's data is unlabeled and can't be used for predictive models. This is where my self-paced online course teaches you how to extract insights from your unlabeled data.

3 - Introduction to Machine Learning: This self-paced online course teaches you how to build predictive models like regression trees and the mighty random forest using Python. Offered in partnership with TDWI; use code LANGER to save 20%.

4 - Need personalized help in making an impact at work? I offer 1-on-1 coaching for students of my paid live and online courses.