Data lies at the heart of AI and ML. Poor data quality is perhaps the biggest enemy of the profitable use of artificial intelligence (AI) and machine learning (ML). Bad data can affect the predictive output of your machine learning model in two ways: through the historical data used to train the model, and through the new data you feed it to make future predictions.
People often make the mistake of compensating for bad-quality data with improvements to the predictive model. Despite these efforts, you will still end up with bad predictions, because garbage in means garbage out. In fact, as the complexity of the problem increases, so does the demand not just for more data but for more diverse and comprehensive data.
In this article, I will discuss some of the most frequently made data mistakes and how you can avoid them.
Before you launch into compiling data from various sources, you need to have an overall effective data collection plan. This is a four-step process.
Step 1: Clearly state the purpose of your ML project. What is the problem you are looking to solve using machine learning? Once you have clarified your objective, assess whether you have the right data to help you achieve it. If you do not have the necessary data, you need to either find new data or scale back your objectives.
Step 2: Ensure data quality throughout the overall project plan. Feeding your ML model with data is not a one-time event. You need to make sure the data you feed into your ML model is cleaned at every stage of the ML process. This requires measuring quality levels, assessing sources, removing duplicate data, and so on.
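To make this concrete, here is a minimal sketch of the kind of routine quality check described above, using pandas. The DataFrame, column names, and the "plausible age range" rule are all illustrative assumptions, not part of any specific pipeline:

```python
import pandas as pd

# Hypothetical training records; column names are illustrative only.
records = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "age":     [34, 41, 41, 29, 29],
    "label":   [0, 1, 1, 0, 0],
})

# Remove exact duplicate rows, keeping the first occurrence of each.
deduped = records.drop_duplicates().reset_index(drop=True)

# A simple quality gate: flag rows with values outside a plausible range.
suspect = deduped[(deduped["age"] < 0) | (deduped["age"] > 120)]

print(len(records), len(deduped), len(suspect))  # 5 3 0
```

Checks like these are cheap to run at every stage of the pipeline, which is exactly the point: cleaning is not a one-time event.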
Step 3: Maintain an audit trail as you prepare the data for training your predictive model. Keep a copy of the original training data, the data you actually used in training, and a record of the steps taken to get from one to the other. This may seem like an extra step in the process, but you’ll be glad you did it because it speeds up your efforts at process improvement. It also helps you understand the biases and limitations in your model so you can avoid them in the future.
Step 4: Assign the responsibility of quality assurance to a specific individual or team. This person or team should be the single point of contact for anything related to the data and its quality. They should possess complete knowledge of the data, set and enforce quality standards for it, and lead any ongoing efforts to identify and weed out causes of error.
Collecting data is the first step in a data process, but it also sets the stage for the biggest mistake when training your ML model. If your dataset is too small, your ML model does not have enough examples or information to find discriminative features, so the model overfits the data. This results in low training error but high test error. Simply put, your ML project will not make it to production because the model will fail to generalize. Even if you have your own dataset, it’s good practice to consider utilizing external datasets to enrich the data you already have.
If you are light on training data, make sure you collect all relevant data from the sources available to you. If that still isn’t enough, I’d recommend acquiring data from data providers. You could also crowdsource data or try data pooling. The challenge with this approach is the cost and time required to analyze how much data you already have and how much more you need. You can try comparing your results at different dataset sizes and then extrapolate. Another caveat of turning to third-party sources or data providers is data quality.
You could also try combining data augmentation techniques. Data augmentation is a technique that enables you to increase the diversity of your data significantly without actually collecting new data. Augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train large neural networks that process visual input. Though this approach can be powerful, it is still inferior to collecting more raw data. Additionally, not all augmentation techniques can be applied to your particular data problem.
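The three techniques named above can be sketched in a few lines of NumPy. The 8x8 array standing in for an image and the padding amount are assumptions for illustration; a real pipeline would operate on actual image tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 8x8 grayscale "image"; real pipelines would load actual image data.
image = rng.random((8, 8))

def horizontal_flip(img):
    """Mirror the image left-to-right."""
    return img[:, ::-1]

def pad_and_random_crop(img, pad=2):
    """Zero-pad the borders, then crop back to the original size at a random offset."""
    h, w = img.shape
    padded = np.pad(img, pad, mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

augmented = [horizontal_flip(image), pad_and_random_crop(image)]
# Each augmented sample keeps the original shape but shows new pixel patterns.
assert all(a.shape == image.shape for a in augmented)
```

Because every augmented sample is derived from an existing one, augmentation adds variety but no genuinely new information, which is why it remains inferior to collecting more raw data.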
Data scientists often find it daunting to clean up data because it is a highly time-consuming task. This is not a step in the process that you can dismiss as unimportant.
As mentioned earlier, poor-quality data leads to poor-quality predictions. Even if you have sufficient data, if it is far removed from what your model actually needs, it will only confuse the model. Conflicting and misleading data is bound to make the model fail.
Another common issue with dirty data is a dataset that does not match real-world conditions, so the model learns patterns it will never see in production.
One way to go about this is to remove outliers, identify and address missing values, and normalize the spread of the data. Once you are done, you can proceed to reduce dimensionality and decide whether undersampling or oversampling is required. This can be a lengthy process but is proven to be useful.
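The first three steps can be sketched as follows with pandas. The `income` column, the specific values, the median imputation, the interquartile-range (IQR) outlier rule, and the min-max normalization are all illustrative choices; the right techniques depend on your data:

```python
import pandas as pd

# Hypothetical numeric feature with one missing value and one extreme outlier.
df = pd.DataFrame({"income": [52_000, 48_000, None, 51_000, 49_000, 1_000_000]})

# 1. Address missing values: impute with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

# 2. Remove outliers: keep values within 1.5 * IQR of the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 3. Normalize the spread: min-max scale the surviving values into [0, 1].
normalized = (clean["income"] - clean["income"].min()) / (
    clean["income"].max() - clean["income"].min()
)

print(len(clean), normalized.min(), normalized.max())  # 5 0.0 1.0
```

Note the ordering: imputing before outlier removal keeps the row count intact, and normalizing last prevents the outlier from squashing the legitimate values into a tiny range.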
The number of samples per class should be roughly the same across all classes. If it isn’t, the model will tend to favor the dominant class, since doing so yields a lower training error. But this means the model is biased because the class distribution is skewed.
There are different ways in which you can address this problem.
· Use the right evaluation metrics.
· Gather more samples of the underrepresented classes.
· Resample the training dataset using either the under-sampling or over-sampling approach.
· Normalize the data so that every sample has its data in the same value range.
· Use data augmentation.
· A slightly unconventional method is to design a model that is suited to the imbalanced classes.
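As a concrete example of the resampling option above, here is a minimal sketch of random over-sampling in NumPy. The 90/10 class split and feature shapes are assumptions for illustration; libraries such as imbalanced-learn offer more sophisticated variants:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1.
X = rng.random((100, 3))
y = np.array([0] * 90 + [1] * 10)

# Random over-sampling: repeat minority-class rows until the classes balance.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=90 - 10, replace=True)

X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])

counts = np.bincount(y_balanced)
print(counts)  # [90 90]
```

Over-sampling duplicates existing minority samples rather than creating new information, so it should be applied only to the training split, never before splitting, or the duplicated rows will leak into your test set.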
Once you have cleaned your dataset and labeled it properly, you need to split it. Usually, people split it in an 80:20 ratio: 80% of the data is used for training while the remaining 20% is used for testing. This makes overfitting easy to spot. But what if you train multiple models against the same test set? You will eventually pick the model that provides the best test accuracy, and so you end up overfitting the test set. This happens because you are picking the model for its performance on one specific test set and not for its true generalization ability.
Instead of splitting your dataset into two parts, you can split it into three – training, validation, and testing. This approach will protect the model you choose from overfitting. Therefore, the selection process now involves training the models on the training set, testing them on the validation set, picking the most promising model, and then testing it on the testing set. This way you aren’t just shielding the model from overfitting but also finding the true accuracy of your model.
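A minimal sketch of such a three-way split in NumPy follows. The 70/15/15 proportions and the synthetic dataset are assumptions for illustration; scikit-learn's `train_test_split` (applied twice) is a common alternative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset of 1,000 samples with 5 features each.
X = rng.random((1000, 5))
y = rng.integers(0, 2, size=1000)

# Shuffle once, then carve out 70% train / 15% validation / 15% test.
idx = rng.permutation(len(X))
n_train = int(0.70 * len(X))
n_val = int(0.15 * len(X))

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

The validation set absorbs the model-selection decisions, so the test set stays untouched until the very end and gives an honest estimate of accuracy.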
Once you have decided on the model that you will move to production, make sure you retrain it on the entire dataset; the more data, the better.
I hope you understand the true importance of data in your ML training. You also know the rookie mistakes that you must avoid at all costs so that your ML model provides you with accurate predictions. Always remember – garbage in is garbage out!