Real-Time Data Analytics: Mitigating the Risks and the Challenges
In the early days of Data Analytics (DA), companies would sort the data, assemble the data sets and establish the respective requirements before carrying out the analysis. No wonder that by the time the analytics activities were completed and the reports generated, the findings would often turn out to be outdated and could no longer be used for anything other than historic analysis. As the main purpose of employing DA in a commercial environment is performance improvement and optimization rather than making sense of historic studies, once Real-Time Data (RTD) analysis tools and patterns emerged, they were immediately embraced by DA practitioners. While the practical benefits of Real-Time DA over historic data analytics are obvious, it is also essential to understand and cater for its risks, challenges and limitations.
Real-time (aka live) data is processed and analysed immediately upon collection, so there are no delays in the timeliness of the information provided. For example, if we are to follow up on the current state of the COVID-19 spread in Victoria, Australia (where I happen to reside), only the latest daily figures can provide a real-time snapshot of the situation. Live processing systems enable companies to adjust their activities and processes based on the latest data if required. So what could possibly go "wrong" if we rely on the "very latest facts"? Can Real-Time Data let us down?
The author believes that while the importance and effectiveness of Real-Time DA are beyond doubt, we should acknowledge and consider the following four key challenges:
- Data Validation
- DA Process Optimization
- Trend Representation
- "Seasonal" Data
Data Validation
Data Validation involves confirming the accuracy and quality of both the data sources and the data sets collected and assembled from those sources. It is essential for ensuring the fitness and consistency of the data before further analytical studies are carried out. Needless to say, it is the initial task that every DA process has to go through, and should it fail, the entire process is going to be compromised.
Validation of historic data can be carried out at one's convenience. The validation planning process can start by establishing the validation requirements and the time frame needed to fulfil them. However, in the case of Real-Time Data, it is essential that the validation process is completed promptly; otherwise, by the time the data is validated, it will no longer be current and fit for analysis. This forces analytics teams to adopt very tight timelines (as opposed to some projects that involve working with historic data) for getting the validation processes completed.
So how can Real-Time Data be processed quickly, given that speedy processing itself opens the door to further concerns? On the one hand, sticking to strict timelines will result in a greater percentage of discrepancies in data source management, formatting, error identification and correction, etc. On the other hand, extending the processing timelines increases the risk of the data losing a significant share of its value and relevance altogether. With Real-Time data, the principles of diminishing data value (e.g. 90/90) are particularly telling.
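One way to picture diminishing data value is a simple decay curve. The sketch below uses an exponential half-life, which is purely an illustrative assumption, not a statement of the 90/90 principle itself or of any industry constant:

```python
# Illustrative sketch only: modelling diminishing data value as exponential
# decay. The half-life figure is an assumption for demonstration purposes.
def remaining_value(initial_value, age_hours, half_life_hours=2.0):
    """Analytical value left after age_hours, given an assumed half-life."""
    return initial_value * 0.5 ** (age_hours / half_life_hours)
```

Under this toy model, data worth 100 units when fresh is worth only 50 two hours later, which is why tight validation timelines matter so much.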
In cases where the currency of the data is short-lived, validation processes have to be fully automated, as there is no time left to apply "additional tests and check-ups", even on a random basis. This increases the risk of data validation failures significantly.
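A minimal sketch of what fully automated, per-record validation might look like. The field names, the freshness window and the value bounds are all assumptions made for illustration, not a prescribed schema:

```python
from datetime import datetime, timedelta, timezone

def validate_record(record, max_age_seconds=60):
    """Return a list of failure reasons; an empty list means the record passed."""
    failures = []
    # Completeness: required fields must be present and non-null.
    for field in ("source_id", "timestamp", "value"):
        if record.get(field) is None:
            failures.append(f"missing field: {field}")
    # Freshness: stale records are rejected outright -- there is no time
    # to queue them for manual review.
    ts = record.get("timestamp")
    if ts is not None:
        age = (datetime.now(timezone.utc) - ts).total_seconds()
        if age > max_age_seconds:
            failures.append(f"stale record: {age:.0f}s old")
    # Plausibility: values outside the assumed range are flagged immediately.
    value = record.get("value")
    if value is not None and not 0 <= value <= 10_000:
        failures.append(f"value out of range: {value}")
    return failures
```

The point of the sketch is that every check is cheap and unattended; anything requiring human judgement has no place on the real-time path.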
DA Process Optimization
DA works best as a dynamic process that is subject to ongoing reviews and updates, rather than a static process pre-set at the commencement of a DA project or operation. The analysis patterns and dimensions need to be fine-tuned throughout. For example, the initial approaches to identifying, processing and analysing the COVID-19 data (e.g. daily cases, sources of the cases, clusters, etc.) have been updated as we have learned more about the virus patterns as well as the analysis patterns to be used. Likewise, investment companies keep upgrading the DA applications they require for understanding trading data successfully. These updates and upgrades take place in response to ongoing process reviews.
While DA Process Optimization appears to be standard for "grand-scale" projects, from a smaller firm's perspective it is a tedious task. Historic DA enables such firms to use a "lessons learned" approach, where they keep reviewing completed projects. This approach is significantly harder to implement for Real-Time DA process optimization. As discussed in the Data Validation section, the timelines are very tight, and the process optimization has to be as real-time as the data.
Another aspect of DA Process Optimization (at times even more critical than fixing errors) is the incorporation of additional tools and technologies. Any such update requires preliminary pilot testing. In a dynamic environment, testing also often has to be done in live mode, with the changes accepted or rejected "on the move". This is particularly demanding with multi-level analysis. At each of the levels, findings are consolidated and passed on to the next stage of the analysis. The true impact of a process optimization may not be transparent without assessing the full impact on the later (more advanced) stages.
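The multi-level flow described above can be sketched as a chain of stages, each consuming the previous stage's consolidated output. The stage names and logic here are illustrative only; the point is that a change at one level (say, a new filter in the cleaning stage) only reveals its true impact once the later stages have run:

```python
# Sketch of a multi-level DA pipeline with consolidation between levels.
def stage_clean(records):
    """Level 1: drop unusable records."""
    return [r for r in records if r is not None]

def stage_aggregate(records):
    """Level 2: consolidate the cleaned records into a single figure."""
    return sum(records) / len(records)

def stage_classify(average):
    """Level 3: turn the consolidated figure into a finding."""
    return "high" if average > 50 else "normal"

def run_pipeline(records):
    # An optimization to any one stage changes what every later stage sees.
    return stage_classify(stage_aggregate(stage_clean(records)))
```

Testing a stage in isolation is not enough; the accepted/rejected decision has to be made on the end-to-end result.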
Last but not least, as DA may involve the creation of Data Marts, further optimization challenges can be anticipated. Data Marts are subsets of a data warehouse that focus on specific aspects of the data. In other words, they are condensed and more focused than the complete data warehouse. Therefore, the DA process may involve the creation of multiple Data Marts, with each requiring a separate range of optimization processes. I can recall several instances where, instead of a single team overseeing the entire DA process from start to finish, the Data Marts were managed by various teams throughout the enterprise. That required further consolidation of the findings and greater collaboration throughout the optimization processes. If not for Agile (aka collaborative) approaches, the DA process optimization would be mission impossible.
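As a rough picture of how a mart narrows a warehouse, the snippet below filters a hypothetical warehouse table (the column names are invented for illustration) into a focused subset; real marts are of course built with dedicated ETL tooling rather than list comprehensions:

```python
# Hypothetical warehouse rows; column names are assumptions for the sketch.
warehouse = [
    {"region": "VIC", "product": "A", "sales": 120},
    {"region": "VIC", "product": "B", "sales": 80},
    {"region": "NSW", "product": "A", "sales": 95},
]

def build_mart(rows, **filters):
    """Carve a focused data mart out of the full warehouse table."""
    return [row for row in rows
            if all(row.get(k) == v for k, v in filters.items())]
```

Each such subset then carries its own optimization workload, which is exactly why several marts managed by several teams demand so much coordination.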
Trend Representation
With Real-Time DA, we aim not only to identify the current situation but also to look into future developments. When drilling down into the data, we need to establish sufficient trends to model the possible scenarios as well as the relative probability of those scenarios. In a dynamic environment, multiple trends (at times even conflicting ones) can be anticipated. The role of DA is to establish those trends accurately.
Real-Time data is "hot out of the oven", but it is not always sufficient for pinpointing trends, as many of them can only be identified correctly through studies of longitudinal data. This entails combining Real-Time data with historic data and balancing between the two data types accordingly. If historic data is to be brought into the DA for the purpose of trend identification, the project will have to incorporate a number of additional dimensions, such as historic data "cut-off" dates, additional formatting and source verification.
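A minimal sketch of the cut-off-date dimension: historic records are kept strictly before the cut-off and real-time records from the cut-off onwards, so no period is double-counted. The record shape and field name are assumptions for illustration:

```python
from datetime import datetime

def combine_series(historic, real_time, cutoff):
    """Merge the two data types around a cut-off date, oldest first."""
    merged = [r for r in historic if r["ts"] < cutoff]
    merged += [r for r in real_time if r["ts"] >= cutoff]
    return sorted(merged, key=lambda r: r["ts"])
```

Everything else the paragraph mentions (reformatting the historic records, re-verifying their sources) would have to happen before a merge like this is meaningful.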
"Seasonal" Data
Seasonality of data refers to the data experiencing regular and predictable changes. Any predictable fluctuation that occurs over a distinct time period (e.g. winter) can be referred to as seasonal, but with Real-Time data it is not always easy to establish the "seasons". Furthermore, Real-Time data sets may cover time frames that do not match (e.g. are shorter than) entire seasons.
Longitudinal studies incorporate seasonality as one of their dimensions, but with stand-alone sets of Real-Time data it is sometimes difficult to establish whether a particular phenomenon is within the data's character, out of character …or simply a seasonal factor. The shorter the data life cycle, the harder it is for DA teams to address data seasonality!
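One simple way to separate "out of character" from "merely seasonal" is to compare a fresh observation against historic values from the same season. The sketch below uses a deviation threshold that is an assumption chosen for illustration, not a recommended setting:

```python
from statistics import mean, stdev

def is_out_of_character(current, same_season_history, threshold=3.0):
    """Flag current only if it deviates strongly from the seasonal baseline."""
    if len(same_season_history) < 2:
        return False  # not enough history to judge either way
    mu = mean(same_season_history)
    sigma = stdev(same_season_history)
    if sigma == 0:
        return current != mu
    # A large deviation relative to the normal seasonal spread is suspect.
    return abs(current - mu) / sigma > threshold
```

Note the failure mode the paragraph warns about: with a very short data life cycle there simply is no `same_season_history` to compare against, and the check cannot fire.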
To sum up, it should be emphasized once again that this paper has not been put together with the aim of discouraging Real-Time DA. Overall, the Real-Time approach is definitely far more dynamic and responsive to the fast-paced environment we are currently operating within. However, the more complex the DA gets, the more diligent the Data Analysts need to be when executing the analytics processes. It shows once again that DA is NOT about the tools used but about the PEOPLE who are using them!