Suddenly, everyone seems to be discussing “Big Data”. The cloud has facilitated the digitization of information and extended the reach of the enterprise database. Organizations today have access to more information than they can handle, and that information is heterogeneous, varied in form and highly volatile. Decision making on the basis of big data analysis is here to stay!
Big data has been defined variously as large volumes of data or high-dimensional data. Gartner defines “big data” as data characterized by volume, variety and velocity. The data can be video, audio, text or social media files, and it may reside in the organization’s data center or elsewhere on the Internet. The speed required to process this data is enormous and is directly determined by the nature of the data and the period of time for which it needs to be retained.
As a result, modeling for the use of this data is the biggest challenge organizations face. It is obvious that entity-relationship modeling, long in vogue for organizational databases, will work only in a limited fashion with big data. This is because much of the big data lies outside the scope of the database, and some of the data is yet to come into existence! So how does one plan to analyze data that is yet to be created, or that is to be found in tables residing in external databases?
One solution offered up on the altar of big data is that entity-relationship modeling can be a starting point. It can help the organization identify the required data, create the necessary communication protocols, define the desirable attributes and pinpoint possible relationships between the pieces of information. But the model will fail if efforts are made to stretch it beyond these limits. Newer constructs will have to be conceived and developed.
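To make that starting point concrete, the sketch below uses Python’s built-in sqlite3 module to express a tiny, hypothetical entity-relationship model: entities become tables, desirable attributes become columns, and possible relationships become foreign keys. The entity and attribute names (customer, social_post) are invented purely for illustration, not drawn from any real schema.

```python
import sqlite3

# In-memory database standing in for an organizational data store.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Entity: an internal customer record (names are hypothetical).
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")

# Entity: an external social-media post tied back to a customer.
# The relationship is expressed as a foreign key.
conn.execute("""
    CREATE TABLE social_post (
        post_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        source      TEXT,   -- e.g. which external platform
        body        TEXT
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO social_post VALUES (10, 1, 'forum', 'Great service!')")

# The model lets us join internal and external data along the relationship.
rows = conn.execute("""
    SELECT c.name, p.body
    FROM customer c JOIN social_post p USING (customer_id)
""").fetchall()
print(rows)
```

The value of the exercise lies in the identification work, pinning down which attributes matter and how the pieces relate, rather than in the relational schema itself, which is exactly where the model’s usefulness for big data ends.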
It follows that the challenges big data poses for the cloud demand a rethink of existing applications. Some of the technology platforms currently available can be retained; others will have to be constructed from scratch. The sheer size of the data must be recognized, its high dimensionality appreciated, and its heterogeneity understood. The questions put to the data analysis pipeline cannot be predicted in advance and may be built up on the fly. Bottlenecks may emerge when too many people acquire the power to ask questions of the data and analyze it. Incremental improvements to the technology will not meet this challenge. The Structured Query Language (SQL) currently in vogue in cloud technologies may have to be modified and empowered to cater to the needs of the generations that will use these systems.
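One direction such empowerment of SQL has already taken is support for semi-structured data. The sketch below, again using Python’s sqlite3 module, stores heterogeneous records as JSON text in a single column and reaches inside them with the json_extract function; this assumes the underlying SQLite build includes the JSON1 extension (as most modern builds do), and the table and field names are invented for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# One table holding heterogeneous records: each row's shape can differ,
# so the schema need not be fixed when the data is first collected.
conn.execute("CREATE TABLE events (doc TEXT)")
conn.execute("INSERT INTO events VALUES (?)",
             (json.dumps({"type": "video", "seconds": 120}),))
conn.execute("INSERT INTO events VALUES (?)",
             (json.dumps({"type": "text", "words": 300}),))

# json_extract lets plain SQL look inside the documents, so a question
# not anticipated at design time can still be asked on the fly.
result = conn.execute("""
    SELECT json_extract(doc, '$.type'), json_extract(doc, '$.seconds')
    FROM events
    WHERE json_extract(doc, '$.type') = 'video'
""").fetchall()
print(result)
```

Rows whose documents lack a requested field simply yield NULL, which is one small way an SQL dialect can bend toward data whose structure is not known in advance.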