Figuring out what data you should use to train your model is important, and it can sometimes feel overwhelming. On the one hand, it can feel like you have troves of data and don’t know where to start picking and preparing the data that might be relevant. On the other hand, you might feel like you don’t have enough of the right data.
Chances are, however, that you actually have a pretty good intuition about the data you need and how to get it. This article will help you summon that intuition.
Rather than think about this as a data question, think about it as a problem-solving exercise. You have a problem you’re trying to solve using machine learning, and so the question is really: what information do you need to solve this problem?
Let’s say I’m trying to predict how much money a customer is likely to spend on my app, so I can dedicate more support and service resources to likely higher-value customers. If I were trying to make that prediction today, without machine learning, what information would I want? I would want to know, at least: how much time they spent engaging with more expensive content, whether they had already made any purchases, some details about those purchases, and maybe some demographic data about the customer. If I had all that information and I was an expert on my customers and my app, I could feel comfortable making an educated guess about whether this was likely to be a high-value customer. Conversely, if all I knew was the time and date of their last login and the time they had so far spent in my app, I would probably not even venture a guess.
I like to think of this as the Expert Gut Check: if you asked an expert in the given topic to make a rough prediction based on these data points, would they be willing to make such a prediction or would they tell you the data provided was not nearly sufficient to even make one?
Once you’ve thought through the kind of data you want to use to train your model, the next step is pulling it all together. It may be the case that you don’t yet have all the data you need to fully satisfy the Expert Gut Check. If that’s the case, you should consider how you might begin to collect the necessary data to help train and retrain your model in the future.
In the example above, I probably already have ready access to information about the user’s prior purchases on my app, but perhaps I am not yet tracking the amount of time they spend viewing and engaging with expensive content. If I train my model without that data, I’ll want to be sure to start collecting it so that I can retrain my model later and see how, if at all, it affects the model’s performance.
Sometimes I’ll need to combine my data with third-party data sources in order to have the complete view of data I believe is relevant to solving my problem. In our example, I think some demographic data might be useful in predicting the value of the customer, so it may be worthwhile to purchase data on a user’s credit history and estimated net worth, or even just data on average income or home value in their zip code.
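To make that concrete, here is a minimal sketch of attaching zip-code-level data to my own customer records. Everything in it is an assumption for illustration: the field names (`customer_id`, `zip_code`, `median_income`, `median_home_value`), the zip codes, and the numbers are all made-up placeholders, not real data or a real provider's format.

```python
# Hypothetical customer records from my own app.
customers = [
    {"customer_id": 1, "zip_code": "11111"},
    {"customer_id": 2, "zip_code": "22222"},
    {"customer_id": 3, "zip_code": "33333"},
]

# Hypothetical third-party data keyed by zip code (placeholder values).
zip_stats = {
    "11111": {"median_income": 85000, "median_home_value": 600000},
    "22222": {"median_income": 52000, "median_home_value": 250000},
}

THIRD_PARTY_FIELDS = ("median_income", "median_home_value")


def add_zip_features(customers, zip_stats):
    """Return new customer rows with zip-level fields attached.

    Customers whose zip code is missing from the third-party data are
    kept, with None in the new fields, so no customer is silently dropped.
    """
    enriched = []
    for customer in customers:
        stats = zip_stats.get(customer["zip_code"], {})
        row = dict(customer)  # copy so the original record is untouched
        for field in THIRD_PARTY_FIELDS:
            row[field] = stats.get(field)
        enriched.append(row)
    return enriched


enriched = add_zip_features(customers, zip_stats)
```

Keeping the unmatched customer with empty values, rather than dropping them, is a deliberate choice here: it lets me see how much of my customer base the purchased data actually covers before I rely on it.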
Aside from collecting completely new pieces of data, one of the most important ways to put the right training data into a model is to perform calculations, often very simple ones, on the data you already have. That’s because there are often interesting insights to be gleaned not just from the raw data, but also from looking at that data in some aggregated or transformed way.
In our example, we know we want to consider the prior purchases a customer has made as one factor that may predict their ultimate total value to us. But in addition to a bare accounting of each prior purchase, I might also want to calculate the average amount of each purchase and include that as an additional factor. Maybe the number or total value of purchases isn’t as predictive as the average per-purchase cost, so by calculating that value and adding it to our data we’re giving our model the chance to consider that possibility. Perhaps I want to see how quickly a user seems to be making purchases, so I calculate the time between when they first logged into the app and when they completed their first transaction, and throw that data point into the mix.
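Here is a rough sketch of what those calculations could look like for a single customer. The function name `derive_features`, the output keys, and the shape of the input (a first-login timestamp plus a list of `(timestamp, amount)` purchases) are my own assumptions for illustration; your data will almost certainly be laid out differently.

```python
from datetime import datetime
from statistics import mean


def derive_features(first_login, purchases):
    """Compute simple derived features for one customer.

    first_login: datetime of the customer's first login.
    purchases: list of (timestamp, amount) tuples; may be empty.
    """
    if not purchases:
        # No purchases yet: counts are zero and time-to-first-purchase is unknown.
        return {
            "purchase_count": 0,
            "total_spend": 0.0,
            "avg_purchase": 0.0,
            "hours_to_first_purchase": None,
        }

    amounts = [amount for _, amount in purchases]
    first_purchase = min(timestamp for timestamp, _ in purchases)
    elapsed = first_purchase - first_login

    return {
        "purchase_count": len(amounts),
        "total_spend": sum(amounts),
        "avg_purchase": mean(amounts),
        "hours_to_first_purchase": elapsed.total_seconds() / 3600,
    }


# Example: first login at 9:00, purchases of 20.00 and 40.00.
features = derive_features(
    datetime(2024, 1, 1, 9, 0),
    [
        (datetime(2024, 1, 1, 15, 0), 20.0),
        (datetime(2024, 1, 3, 10, 0), 40.0),
    ],
)
```

For this made-up customer, the average purchase works out to 30.0 and the time to first purchase to 6.0 hours. Those two numbers would sit alongside the raw count and total as extra columns in the training data.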
It may seem like you could come up with these sorts of calculations endlessly. You could. But you shouldn’t. Here, again, the Expert Gut Check should be your guide: if the calculation seems so convoluted that an expert would never conceive of trying to incorporate it into their consideration of the problem, you’re probably over-engineering. That said, the very best judge will be whether or not the model is successfully solving the problem. So go ahead and test it out! Some of the best models are built with this kind of experimentation.