As Data Scientists, we are expected to be prepared to face data both "big" and "simple" and to scrutinise it to the extent required. In recent years, we have seen a number of encouraging developments that promise to make our jobs easier: dramatic improvements to data storage and security patterns and facilities (blockchain), growing ease of access to data analysis tools (often available online), and the emergence of global, industry-wide data analysis standards. These highly positive developments may lead us to the optimistic assumption that our jobs will only get easier in the future, as we increase the accuracy of our analyses while keeping costs down.
Unfortunately, the developments above also bring a number of additional challenges, particularly when dealing with Big Data. In this paper, I would like to briefly outline some of the emerging Big Data analysis problems that I have been facing in my field of User-Centred Design (UCD).
Definitions of Big Data have evolved over time. Today, "Big Data" describes large and complex data sets, with complexity commonly referring to data that arrives in a range of formats rather than a single one. In practice, big data cannot be analysed or managed without first integrating all of the data sets, i.e. bringing them to common denominators.
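To make the "common denominator" idea concrete, here is a minimal sketch in Python, assuming two hypothetical feedback exports (a survey tool and a support-ticket system); all field names and records are invented for illustration, not taken from any specific product.

```python
# Two hypothetical feedback exports arriving in different formats,
# mapped onto one shared schema ("common denominators") before analysis.
# All field names and records are invented for illustration.

# Export A: survey tool with nested ratings
survey_rows = [
    {"user": "u1", "scores": {"usability": 4, "content": 5}},
]

# Export B: support tickets with flat free text
ticket_rows = [
    {"user_id": "u2", "comment": "Navigation is confusing", "rating": 2},
]

def to_common_schema(row):
    """Map a record from either source onto one shared set of fields."""
    if "scores" in row:  # survey format
        return {
            "user": row["user"],
            "rating": sum(row["scores"].values()) / len(row["scores"]),
            "comment": None,
        }
    return {  # ticket format
        "user": row["user_id"],
        "rating": row["rating"],
        "comment": row["comment"],
    }

integrated = [to_common_schema(r) for r in survey_rows + ticket_rows]
print(integrated)
```

Only once every record sits in the same schema can the sets be analysed as one body of data; that mapping step is what integration amounts to, however many source formats there are.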
The Good News
Working on data analysis projects (mostly focusing on the UCD design of service delivery systems such as e-learning platforms and customer portals) has prompted me to reconsider some of my initial perceptions of Big Data mining. First of all, I feel that the traditional perception of Big Data as "large" is no longer valid. There appear to be few, if any, differences in the analysis process based on the SIZE of the data sets. Larger data sets (e.g. processing data on the user experiences of 10,000 users of an e-learning platform as opposed to 500 users) can be mined in exactly the same ways as smaller ones, and the resulting increases in the cost of analysis are purely nominal. Likewise, the cost of data processing does not increase significantly with every additional data file added, and with the rapid development of blockchain technologies, data expansion costs can be expected to decrease even further.
The "Bad News"
The "bad news" for UCD analysts is the ever-increasing VARIETY of the data. Out of the 10 Vs of Big Data, VARIETY is the dimension that poses challenges technology alone often cannot ease. In the past, user expectations were significantly easier to identify. Ford's saying that "any customer could have any car painted any colour as long as it was black" was reflective of user requirements and expectations at the time. The user data sets would share common denominators that could be used to address the challenge of data variety. These denominators were limited in number and tended to appear throughout the data sets consistently, and the ways they were expressed within the data sets were also fairly consistent. The variety of the data could therefore often be addressed through either text mining or the identification of consistent patterns to look out for, as in the sketch below.
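As a toy illustration of that older approach (and nothing more), consider matching free-text feedback against a small, stable set of known denominators; the colour list and feedback strings here are invented for the example.

```python
import re

# Toy illustration of the older approach: a small, stable set of known
# denominators (here, colour preferences) matched against free text.
# The colour list and feedback strings are invented for the example.

KNOWN_COLOURS = {"black", "white", "red", "blue"}
pattern = re.compile(
    r"\b(" + "|".join(sorted(KNOWN_COLOURS)) + r")\b", re.IGNORECASE
)

feedback = [
    "I would like the car in black, please.",
    "Black is fine for me as well.",
]

counts = {}
for text in feedback:
    for match in pattern.findall(text):
        colour = match.lower()
        counts[colour] = counts.get(colour, 0) + 1

print(counts)  # {'black': 2}
```

When the set of expected values is this small and this stable, simple pattern matching goes a long way; the trouble starts when it is neither.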
Today, UCD-driven analysis has to deal with far greater variety across the data sets. For example, e-learning projects focusing on learner requirements cannot rely on technology alone to collect and analyse user feedback on the learning materials. There are dramatic differences across both the e-learning systems available (and the ways these systems are managed) and the users' expectations of those systems. These differences increase the number of potential common denominators to be considered. Even when the denominators are already established, many users may provide data that does not fall within the pre-set range. Why? Because users not only have unique requirements but also expect those requirements to be met!
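A minimal sketch of that "out-of-range" problem, with hypothetical categories and responses: anything that does not fit the pre-set denominators is set aside for attention rather than silently forced into an existing category.

```python
# Hypothetical learner responses checked against pre-set denominators:
# anything outside the expected range is set aside for later attention
# rather than silently forced into an existing category.

ACCEPTED_FORMATS = {"video", "text", "quiz"}

responses = ["video", "text", "podcast", "quiz", "VR walkthrough"]

in_range = [r for r in responses if r in ACCEPTED_FORMATS]
out_of_range = [r for r in responses if r not in ACCEPTED_FORMATS]

print("fits pre-set denominators:", in_range)
print("needs analyst attention:", out_of_range)
```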
Similar challenges are faced across other UCD-driven data analysis projects. Real estate portals, banks, insurance companies: all appear to be struggling to identify ways of collecting and integrating user data across 3-4 denominators (something that was still possible 10-15 years ago). Emerging technologies can simplify data processing and increase the speed and accuracy of analysis, but they have little impact on how "capricious" customers are in their expectations and aspirations.
What is to be Done?
Addressing the data VARIETY challenge is not mission impossible. It increases the complexity and cost of the analysis, but it does not "kill" it altogether. The main remedy is ongoing review and validation of the data standards and denominators throughout the analysis process. Most importantly, these reviews and validations require the proactive involvement of data analysts (that is, proactive human involvement) rather than the use of analysis tools and applications alone. I have worked with a number of user data analysis tools, and all of them struggle to identify some emerging trends and out-of-range data accurately, while experienced analysts can usually pick up the discrepancies and ensure that the accuracy and validity of the analysis are maintained. The sketch below shows what such a review loop might look like.
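Here is one hedged sketch of that review-and-validate loop: automated processing tallies values that fall outside the current denominator set, and anything recurring often enough is surfaced for a human analyst to approve as a new denominator. The threshold and category names are assumptions for illustration only.

```python
from collections import Counter

# Sketch of the ongoing review-and-validate loop: tally values outside
# the current denominator set and surface recurring ones for a human
# analyst to approve. Threshold and categories are assumptions.

denominators = {"video", "text", "quiz"}
REVIEW_THRESHOLD = 3  # surface values seen at least this many times

def surface_for_review(batch, denominators):
    """Return recurring out-of-range values an analyst should review."""
    unexpected = Counter(v for v in batch if v not in denominators)
    return [value for value, n in unexpected.items() if n >= REVIEW_THRESHOLD]

batch = ["text", "podcast", "podcast", "podcast", "AR demo", "quiz"]
candidates = surface_for_review(batch, denominators)
print("for analyst review:", candidates)  # ['podcast']

# Promotion to a new denominator is the human step no tool replaces:
denominators.update(candidates)
```

The point of the sketch is the division of labour: the tooling is good at counting and flagging, while the decision to evolve the data standard stays with the analyst.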
To sum up, when analysing user data we need to bear in mind that:
- User-Centred Design inevitably involves dealing with Big Data
- As time goes by, the VARIETY of Big Data will only grow more complex
- The focus of UCD-driven analysis should be on the format of the data batches collected and analysed, rather than on their volume
It is up to the end users to decide how they want their systems and user experiences designed, and their requirements and preferences are subject to continuous change. Our job is to "listen" to these requirements carefully and adjust our analysis processes accordingly!