The conundrum of data quality for Big Data

Big data is nowadays a popular term. However, upon probing deeper into the concept, there are still many unknowns, or loosely defined ideas, which researchers all over the world are working on. Big Data started with changes in information technology that allow greater value to be derived from data. The volume of data handled by companies has been increasing to the point that, in some cases, datasets are measured in petabytes. Retail organisations like Walmart and Tesco handle millions of customer transactions per hour. Billions of people around the world work with different types of data through their mobile phones and other smart devices. Moreover, with the increased use of networks, sensors, transaction processing systems and social media, amongst others, organisations are facing a deluge of data estimated to reach a staggering worldwide volume of 40 ZB by 2020, where a zettabyte (ZB) is equivalent to a trillion gigabytes.

However, there is no standard definition for the term ‘Big Data’. The most widely accepted explanation refers to datasets which cannot be processed with traditional relational database and data warehouse tools; this limitation has prompted the development of a myriad of tools and techniques for the storage, analysis and display of data. This amalgamation of different tools and techniques gravitating around the concept of data is known as ‘Big Data’.

The value of possessing such gigantic volumes of data resides in the capacity to make sense out of the data, an area which currently suffers from huge inefficiencies. The use of data analytics in the field of Information Systems (IS) has been present for a number of years with systems such as ‘business intelligence’ and ‘data mining’.

Unfortunately, current data analytics tools and techniques are no longer adequate due to the following characteristics of Big Data: volume, velocity, variety and veracity.

The velocity aspect refers to the speed with which collected data is analysed so that timely use can be made of it, whereas variety refers to the different formats, both structured and unstructured, of the data being collected and analysed. Veracity refers to the level of data quality. The main argument here is that data in extremely large datasets needs to conform to a certain quality level, given the doubts related to its provenance. Making use of quality data, or data ‘fit for purpose’, is also very important in order to produce actionable decisions, insights, knowledge and even intelligence from information systems.

This rationale has prompted many research studies aiming to improve the quality of data. This field is often referred to as ‘pre-processing’ and consists of activities such as data cleaning, data transformation, data integration and data reduction, amongst others.
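To make these activities concrete, here is a minimal sketch using a small pandas DataFrame; the data and the column names (customer_id, amount, ts) are hypothetical and chosen only for illustration, not a prescribed pipeline.

```python
# Illustrative pre-processing sketch; the data and column names are
# hypothetical and only serve to show the four activities named above.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, None, 3],
    "amount": ["10.5", "10.5", "7.0", "3.2", "oops"],
    "ts": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
})

# Data cleaning: remove duplicate rows and rows with a missing identifier
clean = raw.drop_duplicates().dropna(subset=["customer_id"]).copy()

# Data transformation: coerce types; unparseable values become NaN/NaT
clean["customer_id"] = clean["customer_id"].astype(int)
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean["ts"] = pd.to_datetime(clean["ts"])

# Data integration: join against a second (hypothetical) reference table
segments = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["A", "B", "A"]})
merged = clean.merge(segments, on="customer_id", how="left")

# Data reduction: keep only the columns needed for downstream analytics
reduced = merged[["customer_id", "amount", "segment"]]
print(reduced)
```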

Unfortunately, pre-processing can degrade the response time and overall efficiency of the whole IS, especially in the case of Big Data systems. The traditional tools and techniques used to improve data quality are not adequate for Big Data, due to the 3Vs (volume, velocity and variety). The data quality community is calling for more appropriate tools and systems to address the veracity characteristic of Big Data.

Data quality for Big Data is, in general, a relatively under-researched topic, with differing schools of thought about its importance. However, increasing regulatory activity and a growing understanding of the value of data have raised the importance of data quality as a discipline across organisations.

Some years ago, there were doubts about whether data quality initiatives were needed for Big Data. This view is no longer supported, as most Big Data analytics are negatively impacted by dirty data. The models created by the analytics become erroneous due to issues such as outliers and incomplete data, amongst others.
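As a toy illustration of this effect (the values below are made up), a single outlier, such as an age recorded with an extra digit, pulls a basic statistic far away from its true value, and any model built on top of that statistic inherits the error.

```python
# Toy illustration with made-up values: one outlier distorts the mean.
import numpy as np

ages = np.array([34, 29, 41, 38, 35])
with_outlier = np.append(ages, 340)   # a data-entry error (extra digit)

print(ages.mean())          # 35.4
print(with_outlier.mean())  # ~86.2 -- pulled far off by a single dirty value
```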

In order to improve Big Data quality, it is essential to understand more precisely what is meant by quality data. This understanding cannot be generic and is highly contextual, differing according to industries and users’ requirements.

Data quality dimensions each denote a particular notion or characteristic of quality. Traditional data quality dimensions such as timeliness, availability, accuracy, precision, consistency, security and accessibility might need to be reconsidered in light of the specific features of volume, velocity and variety associated with Big Data.

For example, data coming from sensor sources needs to be timely and accurate, whereas related and similar data coming from social media sources might not possess the same degree of accuracy.

The Canadian Institute for Health Information (CIHI) has identified accuracy, timeliness, comparability, usability and relevance as the main data quality dimensions. On the other hand, completeness, correctness, concordance, plausibility and currency are referred to as the main dimensions for electronic health records. Having a standard set of the most important data quality dimensions for Big Data per industry would already be a meaningful step towards creating a methodology to improve Big Data quality.
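As a hedged sketch of how such dimensions could be measured in practice, the fragment below scores completeness and timeliness over a small pandas DataFrame; the column names, the freshness window and the scoring rules are assumptions made for illustration, not the CIHI definitions.

```python
# Assumed example: scoring two dimensions (completeness, timeliness)
# over a tiny, invented table of health-style records.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [101, 102, 103, None],
    "reading": [7.1, None, 6.4, 5.9],
    "recorded_at": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 08:05", "2023-12-20 09:00", "2024-01-01 08:10"]),
})

# Completeness: share of non-missing cells across the whole table
completeness = records.notna().mean().mean()

# Timeliness: share of records no older than a chosen freshness window
as_of = pd.Timestamp("2024-01-02")
freshness = pd.Timedelta(days=7)
timeliness = (as_of - records["recorded_at"] <= freshness).mean()

print(f"completeness={completeness:.2f}, timeliness={timeliness:.2f}")
```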

Determining data quality relative to a set of dimensions could be accomplished by the application of machine learning classifier techniques. Classifiers are either supervised learning models such as Support Vector Machines or unsupervised learning models such as Probabilistic Latent Semantic Analysis.

Supervised learning models need a priori information about the data to build training sets, most probably in the form of a reference dataset. Statistically based generative data models are well established unsupervised learning models applied in pre-processing tasks for data quality management activities.
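The following is a minimal supervised-learning sketch along these lines, assuming a small labelled reference set and two hand-crafted features per record (the fraction of missing fields and a count of out-of-range values); it shows the idea of an SVM flagging records as clean or dirty, not a production pipeline.

```python
# Assumed setup: an SVM trained on a small labelled reference set to
# flag records as clean (0) or dirty (1) from simple quality features.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical features per record: [fraction_missing, out_of_range_count]
X = [[0.0, 0], [0.1, 0], [0.0, 1], [0.5, 2], [0.6, 3], [0.4, 1],
     [0.05, 0], [0.7, 4], [0.0, 0], [0.55, 2]]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]   # labels taken from a curated reference dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict([[0.6, 3], [0.0, 0]]))   # expected: dirty, clean
```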

An example is ‘BayeSwipe’, a tool based upon Bayes’ theorem that statistically predicts occurrences of incorrect data (Sushovan De, 2014). However, the efficiency of this technique in detecting and correcting incorrect data is only around 40%, which is quite limited.
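BayeSwipe’s internals are not reproduced here; the toy fragment below only illustrates the general statistical idea of scoring how plausible a value is given a co-occurring attribute, so that unlikely combinations can be flagged as probably incorrect. The city/state data is invented for the example.

```python
# Not BayeSwipe: a toy plausibility score based on observed conditional
# frequencies, flagging value pairs that rarely co-occur. Data is invented.
from collections import Counter

rows = [("Austin", "TX"), ("Austin", "TX"), ("Austin", "TX"),
        ("Dallas", "TX"), ("Dallas", "TX"), ("Dallas", "TX"),
        ("Boston", "MA"), ("Boston", "MA"), ("Boston", "TX")]  # last pair is suspect

pair_counts = Counter(rows)
state_counts = Counter(state for _, state in rows)

def plausibility(city, state):
    # Estimated conditional frequency of seeing this city given this state
    return pair_counts[(city, state)] / state_counts[state]

for city, state in sorted(set(rows)):
    print(city, state, round(plausibility(city, state), 2))
# ('Boston', 'TX') scores lowest and would be flagged as likely incorrect
```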

There is a lack of clear knowledge as to whether machine learning models could be effective in improving data quality in the context of Big Data for the health industry. Even if machine learning models can improve data quality initiatives, there remains a gap in knowledge as to which machine learning models are more suitable in a particular context.

An article by M. Suraj Juddoo, Senior Lecturer at Middlesex University (Mauritius Branch)

Suraj is an IT professional with almost 20 years of academic and industrial experience. He is currently doing doctoral research on data quality models for Big Data, looking for more efficient ways to locate and clean up data before it is used in subsequent data analytics, which in turn powers predictive analytics.
