Advances in technology have driven a sharp increase in data volume. Some 80% of the world’s data today is unstructured. Experts predict a 4,300% increase in annual data production by 2020, by which time every person will produce 1.7 megabytes of data per second and our digital universe will be 10 times its current size.
The emergence of advanced analytics techniques such as artificial intelligence and machine learning is the outcome of faster, cheaper technology and the availability of large quantities of data. This raises the question: how important is data preparation in this ecosystem?
Before discussing the importance of data preparation, one must consider the data lifecycle. The simplest way to describe the data lifecycle is in four stages: the data capture stage, the data cleanse stage, the data storage/use stage and the data care stage. It is worth noting that the order of the data cleanse and data storage/use stages can be swapped. For example, data lake evangelists will argue that it makes more sense to store the data in the lake before cleansing it. In this scenario, data is cleansed only when it is identified for use.
The data preparation exercise starts at the data capture stage of the data lifecycle. The first thing to consider before embarking on a data project is what data one must have, should have, could have and would like to have for the organisation or for specific model builds.
Documenting and prioritising the data you must have, should have, could have and would like to have at the start of the project, whether it is a full enterprise data warehouse build or a simple data reporting dashboard, provides a reference point for the analysts at several levels. The document is a checkpoint that lets large teams understand which data points are available and, in turn, agree on the limitations of the project. For example, if a field or variable is unattainable, the model may be unable to predict a certain outcome. The document is also used for reporting at project meetings to maintain project governance, and it acts as an audit trail on long projects, typically those lasting more than six months, where there might be a change in personnel.
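As an illustration only, such a document could be kept as a small structured register rather than free text. The sketch below assumes four priority levels matching the must, should, could and want categories; the class names, field names and example data points are all hypothetical and not taken from any particular project.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; lower numbers sort first."""
    MUST = 1
    SHOULD = 2
    COULD = 3
    WANT = 4


@dataclass
class DataRequirement:
    name: str           # the field or variable being requested
    priority: Priority  # how essential it is to the project
    available: bool     # whether the data point can actually be obtained
    note: str = ""      # free-text context for the audit trail


def by_priority(register):
    """Order the register by priority, then by name, for project reporting."""
    return sorted(register, key=lambda r: (r.priority, r.name))


def blocked_requirements(register):
    """List must-have data points that are unavailable and so limit the project."""
    return [r.name for r in register
            if r.priority == Priority.MUST and not r.available]


# Illustrative entries only.
register = [
    DataRequirement("customer_postcode", Priority.SHOULD, True),
    DataRequirement("purchase_history", Priority.MUST, True),
    DataRequirement("household_income", Priority.MUST, False, "not collected"),
    DataRequirement("social_media_handle", Priority.WANT, False),
]

for req in by_priority(register):
    status = "available" if req.available else "unavailable"
    print(f"{req.priority.name:<6} {req.name:<22} {status} {req.note}")

print("Limits the project:", blocked_requirements(register))
```

Listing the unavailable must-have items separately gives the team a concrete starting point for agreeing on what the project cannot deliver.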
Once the data needed has been identified at the capture stage, the next part is the data cleansing stage. Several technical methodologies could be applied here, but the crux is to identify missing data (gaps), inconsistencies, a lack of data standardisation and varying formats. Resolving these is a basic requirement for any subsequent use of the data. Left unaddressed, they could affect the outcome of the research or prediction, and the business decisions that follow.
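To make these checks concrete, here is a minimal sketch of a profiling pass that counts gaps, flags inconsistent spellings of the same label and tallies the date formats in use. It is one possible approach, not a prescribed methodology; the function names, the two date patterns and the sample records are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical date patterns, chosen only for this illustration.
DATE_PATTERNS = {
    "iso (YYYY-MM-DD)": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "slash (DD/MM/YYYY)": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
}


def is_missing(value):
    """Treat None and blank strings as gaps."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_records(records, fields, date_field, category_field):
    """Report gaps, inconsistent labels and date formats for a list of dicts."""
    # Missing data: count the gaps in each field.
    missing = {f: sum(1 for r in records if is_missing(r.get(f))) for f in fields}

    # Inconsistencies: group raw labels by a normalised form, so that
    # different spellings of the same value show up together.
    variants = defaultdict(set)
    for r in records:
        raw = r.get(category_field)
        if not is_missing(raw):
            variants[raw.strip().lower()].add(raw)
    inconsistent = {k: sorted(v) for k, v in variants.items() if len(v) > 1}

    # Varying formats: tally which pattern each non-missing date follows.
    formats = defaultdict(int)
    for r in records:
        raw = r.get(date_field)
        if is_missing(raw):
            continue
        name = next((n for n, p in DATE_PATTERNS.items() if p.match(raw.strip())),
                    "unrecognised")
        formats[name] += 1

    return {
        "missing": missing,
        "inconsistent_labels": inconsistent,
        "date_formats": dict(formats),
    }


# Illustrative records only.
records = [
    {"customer_id": "1", "country": "UK", "signup_date": "2018-01-05"},
    {"customer_id": "2", "country": "uk ", "signup_date": "05/02/2018"},
    {"customer_id": "3", "country": "", "signup_date": None},
]

report = profile_records(
    records, ["customer_id", "country", "signup_date"], "signup_date", "country"
)
print(report)
```

A report like this only identifies the problems. Deciding how to fill gaps or which format to standardise on remains a judgement for the project team.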
Neglecting data preparation can lead to project delays, unreliable findings and forecasts, and ineffective data and analytics capabilities, resulting in a lack of confidence and trust in the data. So, ignore the importance of data preparation at your peril.