What is good training data and how do you use it?
July 27, 2021 · 5 min read
Machine learning is all about using data to make smarter, faster, more accurate decisions and predictions. To accomplish this, we build models that understand the subtle and complex relationships within the data, so those patterns can be relied on in the future. But in order to actually find those relationships, a model has to have some initial set of data it can scour and learn from. This first round of data that teaches the model what it needs to know is called “training data,” which is actually one of the few data science terms that plainly describes what it is.
You can think of training data as your way of showing the model examples of the “right” answer, so that it can learn how to find the right answer on its own later.
Let’s say our mobile app generates push notifications every so often so users remember we’re on their phone and keep using our software. Currently that happens more or less at random, or based on a set of hard-coded rules we’ve created (e.g., if a user hasn’t opened the app in 7 days, send them a notification). We can make this far more intelligent, and far more successful, with machine learning: we can build a model that tells us the optimal moment to push a notification to each individual user.
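To make that concrete, here’s a rough sketch of the difference in Python. Everything in it is hypothetical (the function names, the feature list, the 0.5 threshold); the point is just that the fixed rule gets replaced by a learned prediction:

```python
from datetime import datetime, timedelta

DAYS_INACTIVE_THRESHOLD = 7

def should_notify_rule_based(last_opened_at: datetime) -> bool:
    # Hard-coded rule: notify anyone who hasn't opened the app in 7 days.
    return datetime.utcnow() - last_opened_at > timedelta(days=DAYS_INACTIVE_THRESHOLD)

def should_notify_model_based(user_features: list[float], model) -> bool:
    # Model-driven: notify only when the predicted probability of a "successful"
    # notification (a tap plus 3 minutes of engagement) clears a threshold.
    # Assumes a scikit-learn-style model that exposes predict_proba().
    return model.predict_proba([user_features])[0][1] > 0.5
```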
To do so, I’ll want to build a model that learns from all my past successes and failures pushing notifications to users. First, we need to define “success.” In this case, it’s when a user taps on the notification, opens the app, and then engages with it for at least the next 3 minutes. So to train my model, I’ll want to include lots of data on the users I’ve pushed notifications to in the past and their activity: for example, how much money the user has ever spent in the app, how active they’ve been in the past, their most recent activity, and so on. I’ll also want to include data about the push notifications themselves, such as how close each one was to the previous notification, the time of day it was sent, and whether it included a promotional offer. And, most importantly, I’ll need to indicate how each user responded to those previous push notifications. Were they a “success”? I also need to include “wrong” examples, or examples of failure, so the model can learn to distinguish between the two.
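Assembled as a table, that training data might look something like the sketch below. The column names and values are made up for illustration; the important part is that each row is one past notification, with features describing the user and the notification, plus a label marking whether it was a “success”:

```python
import pandas as pd

# Hypothetical historical data: one row per push notification sent.
notifications = pd.DataFrame({
    "user_id":               [1, 1, 2, 3],
    "total_spend":           [120.0, 120.0, 0.0, 35.5],  # money spent in the app
    "sessions_last_30d":     [14, 14, 2, 6],             # recent activity
    "hours_since_last_push": [72, 12, 200, 48],          # proximity to the previous notification
    "hour_of_day_sent":      [19, 9, 14, 21],
    "included_promo":        [1, 0, 0, 1],
    "tapped":                [1, 0, 0, 1],
    "minutes_engaged":       [6.5, 0.0, 0.0, 2.0],
})

# Label each example: a "success" is a tap followed by at least 3 minutes in the app.
notifications["success"] = (
    (notifications["tapped"] == 1) & (notifications["minutes_engaged"] >= 3)
).astype(int)

features = notifications.drop(columns=["user_id", "tapped", "minutes_engaged", "success"])
labels = notifications["success"]
```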
That’s your training data. It’s your historical data, full of “right” (they responded well to the notification) and “wrong” (they didn’t) outcomes, that will be used to build a model that helps you get more “right” answers in the future. Here, the model will use that data to understand whatever relationships exist between the user, their activity, and the notifications that ultimately lead to someone responding favorably to a push. Armed with that model, we can serve push notifications not at a random or rule-based interval, but at the precise moment when each one is most likely to succeed for that particular user.
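With that table in hand, training and using the model could look roughly like the sketch below. It continues from the hypothetical features and labels above, and it uses scikit-learn’s logistic regression purely as a stand-in for whatever model you’d actually choose:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit on the historical "right" and "wrong" outcomes assembled above.
model = LogisticRegression(max_iter=1000)
model.fit(features, labels)

# At serving time, score a candidate moment for a specific user and send the
# notification only when the predicted chance of success clears a threshold.
candidate = pd.DataFrame(
    [[120.0, 14, 36, 20, 1]],  # same feature order as the training table
    columns=features.columns,
)
if model.predict_proba(candidate)[0, 1] > 0.5:
    print("Send the notification now")
```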
Your training data is “good” if it successfully creates a model that makes accurate predictions. There are two things to keep in mind when pulling together your training data that can help accomplish that:
Everyone will tell you more data is better, and it’s hard to argue with that, but it’s also not especially helpful advice. In fact, you can build very predictive, accurate models with relatively little training data. How? Generally speaking, the more predictive the data you have, the less of it you need to train a sound model. There may be very subtle relationships among all the various ways a user can engage with my app that ultimately predict whether they’re about to churn, so having lots of training data is necessary to pick up on all those subtleties and accurately identify users headed for the door. But if I’m trying to predict how much a car will sell for, a few key characteristics of the car (e.g., the year, the make, the model) might be so strongly related to its price that a great model could be built without a whole lot of training data.
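One way to sanity-check whether more data would actually help is to look at a learning curve: score the model at increasing training-set sizes and see whether performance has already flattened out. The sketch below does this with scikit-learn on synthetic data, so the dataset and model are stand-ins for your own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for historical data with a few strongly predictive features.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)

# Score the model at increasing training-set sizes using cross-validation.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# If validation accuracy flattens out early, more data is unlikely to help much.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training examples -> validation accuracy {score:.3f}")
```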
If the data your model is encountering in the world is very different from the data it was trained on, it’s probably a good idea to retrain the model. For example, let’s say you trained a model to classify leads into personas so they could be served more targeted marketing campaigns that increase the chance they’ll convert. One key datapoint in the training data may have been the lead’s job title, but when the model was trained we had never encountered an “ETL Developer,” and now leads with that job title are entering our funnel every day. Our current model probably isn’t making good use of that novel datapoint, so we should retrain it on data that includes examples of leads with that job title.
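A lightweight way to catch this kind of shift is to compare the categories the model was trained on with what’s arriving now, and flag when too many incoming values were never seen. The job titles and the 10% threshold below are made up for illustration:

```python
# Job titles that appeared in the training data (hypothetical).
trained_job_titles = {"Data Analyst", "Data Engineer", "Marketing Manager"}

def unseen_fraction(incoming_titles: list[str]) -> float:
    # Share of new leads whose job title never appeared in the training data.
    unseen = [title for title in incoming_titles if title not in trained_job_titles]
    return len(unseen) / len(incoming_titles) if incoming_titles else 0.0

new_leads = ["ETL Developer", "Data Analyst", "ETL Developer", "Marketing Manager"]
if unseen_fraction(new_leads) > 0.1:
    print("Novel job titles are common in new leads; consider retraining the model")
```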