By now, most people understand the importance of data, and of having proper, clean data in particular. In this article, I would like to take this a step further and explore the questions you have to keep in mind when working with data in the context of machine learning.
1. What am I trying to achieve with my data?
The most important question to ask is: what are you trying to do? If you don’t know what the business use is, then stop right there! It is important to have a clear purpose, or else there would be no way to validate any model you come up with. A business use could be “Are sales of ice cream affected by football matches in the area?” or “Which customers are likely to come again?”.
2. Can I do what I want with my data?
Having understood your business case, you have to figure out whether your data can solve your problem. This is the biggest challenge in data science: finding the right data for a problem. One way to know if your data is suitable is to look at past use cases by others; it is likely that someone has already done what you want to do. If not, you will need to analyze and think about which features are likely to impact your results. For example, trying to understand whether a football match impacts ice cream sales will require time-series data about historical and upcoming football matches, pedestrian traffic in the area, car traffic in the area, and maybe a list of other events in the area. I say maybe because it is possible that there is no correlation between, say, pedestrian traffic and ice cream sales. That is why you have to do some exploratory analysis on the available data to see what is suitable.
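As a rough illustration of that kind of exploratory check, here is a minimal sketch that computes the Pearson correlation between two columns using only the Python standard library. The function name `pearson` and the numbers are made up for illustration; they are not real measurements, and a real analysis would normally use a statistics library and far more data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily values, invented purely for this example.
pedestrians = [120, 340, 150, 400, 180, 90]
ice_cream_sales = [30, 80, 35, 95, 50, 20]

r = pearson(pedestrians, ice_cream_sales)
print(f"pedestrians vs. ice cream sales: r = {r:.2f}")
```

A value of r close to 1 or -1 suggests the feature is worth keeping for further analysis, while a value near 0 suggests there is little linear relationship. Keep in mind that correlation alone does not show that one thing causes the other.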
3. Is my data in the right format?
Once you have the right features, the next step is to make sure everything is in the right format. This is an open-ended question because there may be more than one way to do something, so you first have to decide on an approach and then ensure your data is formatted to suit it. In the ice cream example, since we are making a prediction, we might want to look at a binary classifier. This classifier returns a binary result (1-0 or yes-no) after analyzing a set of features such as football match, traffic level, pedestrian congestion, and the number of other events. As you may have noticed, I have started refining the types of the features: each row (record) would be a date in the past, football match would be a binary feature (yes-no), and traffic level, pedestrian congestion, and the number of other events would each be a number.
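To make the row format concrete, here is a small sketch of what one record might look like in Python. The class name `DayRecord`, the field names, and in particular the `sales_rose` label are assumptions made for illustration; the right target for the classifier depends on your business case.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DayRecord:
    """One row of the data set: a single day in the past (hypothetical layout)."""
    day: date
    football_match: bool         # binary feature: was there a match that day?
    traffic_level: int           # numeric feature
    pedestrian_congestion: int   # numeric feature
    other_events: int            # numeric feature: number of other events
    sales_rose: bool             # assumed label the classifier would predict


def to_features(record):
    """Turn a record into the list of numbers a classifier would take as input."""
    return [
        int(record.football_match),
        record.traffic_level,
        record.pedestrian_congestion,
        record.other_events,
    ]


def to_label(record):
    """Turn the assumed yes-no label into 1 or 0."""
    return int(record.sales_rose)


# Invented example values.
example = DayRecord(date(2020, 6, 13), True, 3, 7, 2, True)
print(to_features(example), to_label(example))
```

Note that the date identifies the row but is not passed to the classifier as a feature in this sketch.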
However, before actually agreeing on any of these features, you should do some statistical analysis to see if they are actually relevant to your use case. You can use correlation, ANOVA, or a similar technique to find a relationship between a feature and your result. You may also need to normalize or otherwise transform the data, because some models can’t take negative numbers in a feature, and others can’t take a categorical (non-numeric) feature.
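Two common transformations for those model constraints are min-max scaling, which maps a numeric feature into the range 0 to 1 so that it has no negative values, and one-hot encoding, which turns a categorical feature into 0/1 columns. The sketch below uses only the standard library. The function names are my own, and the `match_type` feature is a hypothetical example that is not part of the data described above.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant feature carries no information; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as 0/1 columns, one column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Invented values: a numeric feature that includes negative numbers.
print(min_max_scale([-5, 0, 10, 15]))

# Hypothetical categorical feature.
match_type = ["league", "none", "cup", "league"]
categories, encoded = one_hot(match_type)
print(categories)
print(encoded)
```

Whether scaling or encoding is needed at all depends on the model you choose, so treat this as one option rather than a required step.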
4. Are my results significant?
Before assessing significance, you have to ensure that the model produces accurate results. Each kind of model has its own accuracy measures, but typically the data you have is split 80:20 to test the model: 80% of the data for training and 20% for testing.
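Here is a minimal sketch of an 80:20 split, along with a simple accuracy measure for a binary classifier. The function names, the `seed` parameter, and the numbers are assumptions for illustration; machine learning libraries provide their own, more complete versions of both.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# 100 placeholder rows: 80 go to training, 20 to testing.
train, test = train_test_split(range(100))
print(len(train), len(test))

# Invented predictions compared against invented true labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

The model is trained only on `train`, and its predictions for `test` are then compared against the true labels. For time-series data like daily sales, splitting by date (older days for training, newer days for testing) can be more appropriate than a random shuffle.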
To understand the significance of your results, you need to revisit your hypothesis and business case. Are ice cream sales affected by football games? Maybe the answer is yes, but football games play only a small part in the relationship; if so, your model should highlight that. Once the significance is ascertained, you can deploy the model and start using it to predict when the next rise in ice cream sales will happen.
Even though I summarised this process in 4 steps, it is actually much more complex. There are many different kinds of models, each with its own constraints and use cases, and choosing the right one needs a blog of its own. Transforming data also involves so many considerations that it needs another blog. However, understanding the big picture is the first step to using data science in your business.