It's not magic, it's data!

Countless companies are either working on machine learning projects or dreaming of using the technology as quickly as possible. Over time, however, many of these ambitious projects fail to deliver the desired outcome. This is often due to the poor quality of the available data that is fed into the algorithms. “Garbage in, garbage out” is an iron law in the field of machine learning. That’s why data scientists are of crucial importance in any machine learning project. They analyze and clean the data and transform it into the desired format with the required quality.

Julie Derumeaux

9 May 2019

Data is the new gold

Machine learning sometimes sounds, feels and seems like something magical. However, it is important to understand that predictive analysis is not magic: although the computer algorithm learns, it can only extract as much valuable information as the data provided contains.

Data of sufficiently high quality is rare. The gap between market and book values is constantly growing, and that's why companies are racing to implement machine learning. Later comes the realization that the data they need is of insufficient quality or does not even exist.

Combatting dirty data

In nearly every machine learning project, a major part of the resources goes to preprocessing the data. All hail the data scientists, who are fundamental to achieving good performance and accuracy in machine learning projects. They use their skills to turn a raw data set with deviations and errors into consistent data that can be manipulated and analyzed. You can compare data preprocessing to preparing your house for guests: you clean or move things and adjust the interior to create a nice atmosphere.

knight fighting a dragon that spews bad data fire

Good data is half the battle

Although machine learning can sometimes surprise us by discovering patterns invisible to humans, it is unable to deliver magical solutions and insights from bad data. The performance of every machine learning model depends on the quality of the data.

You can compare the entire process of generating a predictive model with preparing a delicious meal. The ingredients are our data, and the recipe is the algorithm: if the ingredients are of poor quality, no matter how good the recipe is, the dish will disappoint.

The cleaning process

The purpose of data preprocessing is to make sure the data accurately represents a real-world phenomenon and to modify it so that an algorithm can most easily capture the useful patterns in it.

Data preprocessing may involve the removal of old, incomplete or duplicated data, but the primary focus remains on cleaning, transforming and consolidating information to ensure effectiveness. Outliers are identified, incorrectly labeled data is corrected, data entry errors are fixed and incomplete records are supplemented or deleted.
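To make these steps a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the record layout (`id`, `age`, `spend`), the function names, the choice to fill gaps with the median, and the outlier rule based on the median absolute deviation with a cutoff of `k=5.0`. A real project would pick its own rules based on what the data means.

```python
from statistics import median


def drop_duplicates(records):
    """Keep only the first occurrence of each identical record."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def fill_missing(records, field):
    """Supplement missing values of `field` with the median of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    fallback = median(known)  # raises StatisticsError if no value is known
    return [{**r, field: fallback} if r[field] is None else r for r in records]


def drop_outliers(records, field, k=5.0):
    """Drop records whose `field` lies more than k MADs away from the median.

    MAD is the median absolute deviation. The cutoff k is an illustrative
    assumption, not a universal rule.
    """
    values = [r[field] for r in records]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return list(records)  # no spread to measure against, keep everything
    return [r for r in records if abs(r[field] - center) <= k * mad]


def clean(records):
    """Deduplicate, supplement incomplete records, then remove outliers."""
    records = drop_duplicates(records)
    records = fill_missing(records, "age")
    return drop_outliers(records, "spend")


# Hypothetical raw customer records with typical problems.
RAW_RECORDS = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 95.0},   # incomplete record
    {"id": 1, "age": 34, "spend": 120.0},    # duplicate
    {"id": 3, "age": 41, "spend": 110.0},
    {"id": 4, "age": 29, "spend": 9999.0},   # likely data entry error
    {"id": 5, "age": 52, "spend": 105.0},
    {"id": 6, "age": 38, "spend": 130.0},
]

if __name__ == "__main__":
    for row in clean(RAW_RECORDS):
        print(row)
```

Note that the sketch simply deletes the suspicious record. In practice, a data scientist would first check whether an outlier is an error or a genuine, interesting observation before removing it.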

Turn your data into gold

Differentiated data is the key to a successful AI project. With data that your competitors also have, you will likely not discover anything new. Therefore, you should search, both internally and externally, for the datasets (or combinations of datasets) that can help you find new insights. Focus your data efforts where your organization can differentiate, and remember that high-quality datasets are often better than very large, poorly structured ones.

When adopting machine learning, it is important to start with a unique view of what your company considers most important for making certain decisions. That view tells you which data to collect and which technologies to use. A simple place to start is to nurture and structure the knowledge that your company already has and that can create more value for the company.

pot of gold at the end of the rainbow

Takeaway

Unfortunately, clean data is rare. Even more unfortunate is that dirty data costs your company money in different ways. By purifying your data, your company prepares itself to take the step towards applying machine learning. Eventually, this will lead to more efficient business processes, improved products or services, better informed decision-making, improved marketing campaigns and... more profit.