What’s so important about data preparation? How does it affect leaders who make decisions with data?
The advancement of technology has led to an increase in data volume. 80% of the world’s data today is unstructured. Experts are predicting a 4,300% increase in annual data production by 2020, by which time every human will produce 1.7 megabytes of data per second, growing our digital universe to 10 times its current size.
The emergence of advanced analytics techniques such as artificial intelligence and machine learning is the outcome of faster, cheaper technology and the availability of large quantities of data. This raises the question: what role does data preparation play in this ecosystem?
Before discussing the importance of data preparation, one must consider the data lifecycle. The simplest way to describe the data lifecycle is in four stages: the data capture stage, the data cleanse stage, the data storage/use stage, and the data care stage. It is worth noting that the order of the data cleanse and data storage/use stages can be swapped. For example, data lake evangelists will argue that it makes more sense to store the data in the lake before cleansing it. In this scenario, data is cleansed only when it is identified for use.
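The two orderings can be sketched in a few lines of Python. This is only an illustration: the `Warehouse` and `Lake` classes and the `cleanse` function are hypothetical names invented for this example, and trimming whitespace stands in for real cleansing logic.

```python
Record = dict[str, str]


def cleanse(record: Record) -> Record:
    """Trim whitespace from every value (a stand-in for real cleansing rules)."""
    return {key: value.strip() for key, value in record.items()}


class Warehouse:
    """Cleanse, then store: every record is cleansed on the way in."""

    def __init__(self) -> None:
        self.rows: list[Record] = []

    def load(self, record: Record) -> None:
        self.rows.append(cleanse(record))

    def read(self) -> list[Record]:
        return list(self.rows)


class Lake:
    """Store, then cleanse: raw records are kept and cleansed only when read."""

    def __init__(self) -> None:
        self.raw: list[Record] = []

    def load(self, record: Record) -> None:
        self.raw.append(record)

    def read(self) -> list[Record]:
        return [cleanse(record) for record in self.raw]


# Hypothetical captured records with untidy whitespace.
captured = [{"name": "  Ada "}, {"name": "Grace  "}]
warehouse, lake = Warehouse(), Lake()
for rec in captured:
    warehouse.load(rec)
    lake.load(rec)
```

Both stores return the same cleansed records when read. The difference is that the lake still holds the raw values, so the cleansing cost is paid only for data that is actually used.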
The data preparation exercise starts at the data capture stage of the data lifecycle.
The first thing to consider before embarking on a data project is what data one must have, should have, could have, and would like to have for the organization or for specific model builds.
Documenting and prioritizing the data you must have, should have, could have, and would like to have at the start of the project, whether it is a full enterprise data warehouse build or a simple data reporting dashboard, will provide a reference point for the analysts at several levels. The document is a checkpoint for large teams to understand which data points are available and, in turn, to agree on the limitations of the project.
For example, if a field or variable is unattainable, the model may be unable to predict a certain outcome. The document can also be used for reporting at project meetings to maintain project governance. It acts as an audit trail on long projects too, typically those running longer than six months, where personnel may change.
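As a rough sketch, such a requirements document could be kept as structured data so that the project's hard limits can be listed on demand. Everything here is an assumption made for illustration: the `Priority` levels mirror the must/should/could/want split described above, and the field names and notes are invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    MUST = 1
    SHOULD = 2
    COULD = 3
    WANT = 4


@dataclass(frozen=True)
class DataRequirement:
    name: str           # field or variable the project asks for
    priority: Priority  # how badly the project needs it
    available: bool     # whether it can actually be obtained
    note: str = ""      # free-text context for the audit trail


def blocking_gaps(requirements):
    """Return must-have fields that cannot be obtained (the project's hard limits)."""
    return [
        r.name
        for r in requirements
        if r.priority == Priority.MUST and not r.available
    ]


def by_priority(requirements):
    """Order requirements for reporting: most important first, then by name."""
    return sorted(requirements, key=lambda r: (r.priority, r.name))


# Hypothetical entries for a churn model.
requirements = [
    DataRequirement("customer_id", Priority.MUST, True),
    DataRequirement("churn_date", Priority.MUST, False, "not recorded before migration"),
    DataRequirement("postcode", Priority.SHOULD, True),
    DataRequirement("referral_source", Priority.WANT, False),
]

print(blocking_gaps(requirements))
```

Keeping the list under version control alongside the project gives new team members the audit trail with little extra effort.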
Once the required data has been captured, the next step is the data cleansing stage. Several technical methodologies can be applied here, but the crux is to identify and resolve missing data (gaps), inconsistencies, a lack of data standardization, and varying formats, because clean data is a basic requirement for any subsequent use. Left unaddressed, these issues can affect the outcome of the research or prediction, and the business decisions based on it.
Negligence can lead to project delays, unreliable findings and forecasts, and ineffective data and analytics capabilities, resulting in a lack of confidence and trust in the data. So, ignore the importance of data preparation at your own peril.
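The checks described above (gaps, inconsistent values, and non-standard formats) can be illustrated with a small profiling function in plain Python. This is a minimal sketch, not a complete cleansing methodology: the `profile` function, the sample rows, and the choice of ISO 8601 (`YYYY-MM-DD`) as the standard date format are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed standard for dates in this example: ISO 8601 (YYYY-MM-DD).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def profile(rows, date_field, category_field):
    """Summarise gaps, non-standard dates, and inconsistent category spellings."""
    missing = Counter()
    bad_dates = []
    spellings = {}
    for index, row in enumerate(rows):
        # Gaps: a value that is None or blank counts as missing.
        for key, value in row.items():
            if value is None or str(value).strip() == "":
                missing[key] += 1
        # Formats: record the row index of any date not in the assumed standard.
        date_value = row.get(date_field)
        if date_value and not ISO_DATE.match(str(date_value).strip()):
            bad_dates.append(index)
        # Inconsistencies: group spellings that differ only by case or whitespace.
        category = row.get(category_field)
        if category and str(category).strip():
            canonical = str(category).strip().lower()
            spellings.setdefault(canonical, set()).add(category)
    inconsistent = {k: sorted(v) for k, v in spellings.items() if len(v) > 1}
    return {"missing": dict(missing), "bad_dates": bad_dates, "inconsistent": inconsistent}


# Hypothetical captured rows.
rows = [
    {"signup": "2019-03-01", "country": "Singapore", "age": "34"},
    {"signup": "01/04/2019", "country": "singapore ", "age": ""},
    {"signup": "", "country": "Malaysia", "age": None},
]
report = profile(rows, date_field="signup", category_field="country")
print(report)
```

A report like this only identifies the problems. Deciding how to fill gaps or which spelling to standardize on remains a business decision.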
Talent is key to operating and managing the data lifecycle.
Without talent skilled in data science, analytics, and AI, an organization’s data will not bring it any value. In the past 20 years, organizations have spent fortunes on IT infrastructure to incorporate digital pillars within a traditional business. However, success stories most often come from tech companies with great talent in computer science and information technology. Such talent builds the company from the ground up, incorporating data and information technology within the entire supply chain and business ecosystem.
Do you have future-ready talent with the skills to drive business growth?
Form a Data Science Team to manage the lifecycle of a Data-Driven Organization.
Interested in learning how? Talk to our friendly Data-Driven Organization experts to find out your Data-Driven Maturity Level and develop a roadmap to fill the gaps.