by Craig S. Mullins
"Big data" and the impact of analytics on large quantities of data are a persistent meme in today’s information technology market. But what exactly is big data? The most common definition, coined by Forrester Research, describes big data in terms of “The 4 V’s.”
The first V is volume and that is the obvious one, right? In order for “data” to be “big” you have to have a lot of it. And most of us do in some form or another. A recent survey published by IDC claims that the volume of data under management by the year 2020 will be 44 times greater than what was managed in 2009. But volume is only the first dimension of the big data challenge; the others are velocity, variety, and variability. Velocity refers to the increased speed of data arriving in our systems along with the growing number and frequency of business transactions being conducted. Variety refers to the growth in both structured and unstructured data being managed. And the fourth V, variability, refers to the increasing variety of data formats (as opposed to just relational data). Others have tried to add more V’s to the big data definition as well. I’ve seen and heard people add verification, value, and veracity to this discussion.
Management and Administration
One of the big questions looming in IT departments about big data is what, exactly, it means in terms of management and administration. Will traditional data management concepts such as data modeling, database administration, data quality, data governance, and data stewardship apply in the new age of big data?
Well, according to analysts at Wikibon, big data refers to datasets whose size, type and speed of creation make them impractical to process and analyze with traditional tools (see http://t.co/awsPyuqXjZ). So, given that definition, it would seem that traditional concepts are at the very least “impractical,” right?
But of course traditional data management concepts should apply in the age of big data management. Failing to apply these concepts will result in poor data quality, and analytics performed on poor-quality data will produce bad results. And the whole purpose of big data is to glean intelligence from the large amounts of data we accumulate.
Issues and Adaptations
Yet there are issues to address and adaptations to make as we apply data quality, data governance and data stewardship to big data management. For example, with traditional data quality, some amount of cleansing can occur as humans eyeball the data. But most raw big data is not eyeballed because there is simply too much of it.
In some applications, big data is generated by automated machinery. In those cases (e.g., medical devices and automated metering), only rudimentary cleansing, if any, may be needed, at least as long as the meters are calibrated and maintained properly.
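As a rough illustration of what rudimentary, automated cleansing might look like, the following Python sketch applies a simple range check to meter readings. The `Reading` type, the `cleanse` function, and the plausible range of 0 to 500 are hypothetical choices made for this example; they do not come from any particular product or standard.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Reading:
    meter_id: str
    value: Optional[float]  # None models a reading the device failed to report


def cleanse(
    readings: Iterable[Reading], low: float, high: float
) -> Tuple[List[Reading], List[Reading]]:
    """Split readings into (accepted, rejected) using a simple range check."""
    accepted: List[Reading] = []
    rejected: List[Reading] = []
    for r in readings:
        # Accept only readings that are present and inside the plausible range.
        if r.value is not None and low <= r.value <= high:
            accepted.append(r)
        else:
            rejected.append(r)
    return accepted, rejected


# Hypothetical sample data: one good reading, one missing, two out of range.
sample = [
    Reading("m1", 12.5),
    Reading("m2", None),
    Reading("m3", -4.0),
    Reading("m4", 980.0),
]
good, bad = cleanse(sample, low=0.0, high=500.0)
print([r.meter_id for r in good])
print([r.meter_id for r in bad])
```

Rejected readings are set aside rather than silently dropped, so that a person can review them later; a steady stream of rejections from one device may simply mean the meter needs recalibration.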
The bottom line with big data management is that the speed of data accumulation and the overall data volume can make traditional data management techniques challenging. Policies, procedures, automation and education are needed to ensure that the big data makes its way to the right systems and people.
But let’s not burden big data management with things we have yet to master and incorporate into all of our traditional data systems. Sometimes we forget that -
Data Stores for Big Data Processing
Another consideration is the data stores used for big data processing and how they are to be managed. Frequently, big data is coupled with NoSQL database systems. The biggest difference between a NoSQL DBMS and a relational DBMS is that NoSQL does not rely on SQL for accessing data. Additionally, a NoSQL DBMS typically does not require a fixed table schema, does not provide ACID properties (instead delivering “eventually consistent” data), and is highly scalable. There are no hard-
With big data we may be shifting into a new paradigm, and we need to take advantage of that shift to implement the data management practices that will ensure success. In other words, we need to ensure that we treat (big) data like the corporate asset it is.
From Database Trends and Applications, April 2013.
© 2013 Craig S. Mullins,