There is much talk about big data and how to handle it. What exactly is big data?

Big data is a collection of large data sets that require high-capacity, state-of-the-art data management tools and applications. The challenge lies in capturing, storing, searching, sharing and analyzing this data to identify correlations, business trends, and links to additional or derived information. The data is on the order of exabytes: the amount being created daily around the world in 2012 is roughly 2.5 quintillion (2.5×10^18) bytes, and that volume is set to grow. It follows that big data analysis demands parallel software running on tens to thousands of servers.

The primary question is: Will the cloud drown in the data deluge? Let us look at some of the drivers of big data and understand how the cloud will work for us.

Big data is all about data capture at the user-interaction level, not at the transaction level. With every user interaction, the repository grows. The need is to analyze this data and understand user behavior, so that the power of the cloud can be harnessed in the service of the user.
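To make the distinction concrete, here is a minimal Python sketch of interaction-level capture. The event fields and the log_interaction helper are illustrative assumptions, not any particular product's API; the point is that a single purchase is preceded by many recorded interactions.

    import json
    import time
    import uuid

    def log_interaction(user_id, action, metadata):
        """Record one user interaction as a structured event.

        Every click, search, or page view becomes a record, so the
        repository grows with each interaction, not just each sale."""
        event = {
            "event_id": str(uuid.uuid4()),  # unique id for de-duplication
            "timestamp": time.time(),       # when the interaction happened
            "user_id": user_id,
            "action": action,               # e.g. "click", "search", "view"
            "metadata": metadata,           # free-form context for analysis
        }
        # In practice this would be appended to a distributed log or store;
        # here we simply emit one JSON line per event.
        print(json.dumps(event))

    # One browsing session: many interaction events, one transaction.
    log_interaction("u42", "search", {"query": "running shoes"})
    log_interaction("u42", "view", {"product_id": "p901"})
    log_interaction("u42", "purchase", {"product_id": "p901", "amount": 79.99})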

Cloud data management experience shows that the cloud is capable of scaling up to very large data volumes on demand. Data partitioning is a reality. Parallel database management system (DBMS) technologies, first proposed in the 1980s, have matured over the last few decades, and today any number of proprietary DBMS and relational DBMS (RDBMS) engines can serve as data warehousing solutions for managing big data.
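The data partitioning these engines rely on can be illustrated with a toy example. The hash-based scheme below is one common approach, sketched under the assumption of a fixed cluster of four server nodes; a stable hash (rather than Python's per-run built-in hash) keeps the key-to-node mapping consistent across processes.

    import hashlib

    NUM_NODES = 4  # assumed fixed cluster size for this sketch

    def node_for_key(key: str) -> int:
        """Map a row's partitioning key to one of the nodes.

        A stable hash spreads rows roughly evenly, so lookups on a
        single key hit a single node, while scans run in parallel."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % NUM_NODES

    rows = [("alice", "..."), ("bob", "..."), ("carol", "..."), ("dave", "...")]
    partitions = {n: [] for n in range(NUM_NODES)}
    for key, payload in rows:
        partitions[node_for_key(key)].append((key, payload))

    for node, part in partitions.items():
        print(f"node {node}: {[k for k, _ in part]}")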

Concepts like indexing, metadata development, advanced query optimization and data modeling are well understood, studied and implemented. The alternative Hadoop technology (an open-source implementation of the MapReduce programming model) is also gaining popularity. It can instantiate many MAP tasks and REDUCE tasks while leveraging advanced RDBMS technology features. Data partitioning, task scheduling, handling of machine failures and inter-machine communication are all managed at run time, and the entire process is transparent to the user.
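The division of labour between MAP and REDUCE tasks can be shown with the classic word-count example. The single-process version below is only a sketch of the programming model; the shuffle loop stands in for the framework, which in reality performs the grouping, scheduling and failure handling across machines, transparently.

    from collections import defaultdict

    def map_phase(document: str):
        """MAP: emit one (word, 1) pair per word; many instances of
        this task can run independently on different machines."""
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        """REDUCE: combine all counts for one word; each reducer
        handles a disjoint subset of the keys."""
        return word, sum(counts)

    documents = ["big data in the cloud", "the cloud scales with data"]

    # Shuffle: group map output by key before reduction. A real
    # framework does this across the cluster on the user's behalf.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = [reduce_phase(w, c) for w, c in grouped.items()]
    print(sorted(results))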

In short, any number of parallel data programming models have been implemented on commodity clusters and used by big names like Facebook, Yahoo and Google. Automatic fault tolerance and ease of use (fewer programmers and administrators in the fray) add to the positives on our list. The future cloud will include complex data processing and multidimensional data analytics, and will straddle the physical and virtual worlds with ease.