While data availability is always a priority, it is important to understand that blindly replicating existing data "as is, where is" is wasteful. Data must be examined, analyzed, and categorized, so that only the unique pieces of data are filtered out of the volumes and sent to offsite backup and recovery sites in the Cloud.
When data volumes are large, however, minute manual examination of the data is almost impossible. Cloud backup vendors ease this process with de-duplication technologies. De-duplication can be initiated at the source or at the destination. It is a process in which duplicate data elements are identified and eliminated from a data set while referential integrity is maintained, so that the duplicates can be rebuilt if a recovery action is initiated.
How does de-duplication work? The first, or seed, backup to the Cloud is a full backup: every element of the data set is uploaded to the remote server. De-duplication is triggered when the next data set is uploaded. Each data element is compared with existing or archived data, and only unique elements are selected for upload. Wherever the backup process encounters a duplicate element, a reference to the original (existing) element is recorded at the point of occurrence. When an attempt is made to recover the data set, the references are filled in with the original elements, and the complete, un-de-duplicated data set is recoverable on the user system.
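The seed-then-reference cycle described above can be sketched in Python. This is a minimal illustration using fixed-size chunks and SHA-256 hashes as references; the chunk size, class, and method names are assumptions for the example, not any vendor's actual implementation (real systems typically use much larger, often variable-size, chunks):

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use ~4 KB to 128 KB


def chunk(data: bytes):
    """Split a data set into fixed-size elements."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


class DedupStore:
    """Remote-side store that keeps only unique chunks, keyed by hash."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes

    def backup(self, data: bytes):
        """Upload only unseen chunks; return a 'recipe' of references."""
        recipe = []
        for c in chunk(data):
            h = hashlib.sha256(c).hexdigest()
            if h not in self.chunks:   # unique element: upload it
                self.chunks[h] = c
            recipe.append(h)           # duplicates cost only a reference
        return recipe

    def restore(self, recipe):
        """Fill each reference back in with the original element."""
        return b"".join(self.chunks[h] for h in recipe)


store = DedupStore()
r1 = store.backup(b"AAAABBBBAAAACCCC")  # seed backup: 3 unique chunks stored
r2 = store.backup(b"AAAABBBBDDDDCCCC")  # second set: only DDDD is uploaded
assert store.restore(r2) == b"AAAABBBBDDDDCCCC"
```

Note that the second backup adds just one new chunk to the store even though the whole data set was presented again; recovery rebuilds the full set from the recipe alone.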
De-duplication allows Cloud vendors to optimize the backup and recovery process for their customers. The process reduces the volume of data that needs to be uploaded to the remote server, saving both bandwidth and capacity. This results in lower storage costs and minimized WAN requirements.
De-duplication at source can introduce latency in the upload of backup sets if the algorithms are not properly designed. Sophisticated Cloud backup software takes advantage of bandwidth throttling requirements and de-duplicates blocks of data in the background while the local system is engaged in other bandwidth-intensive activities. When bandwidth is released, the de-duplicated data is uploaded to the remote server. This leaves the user with the impression that the Cloud backup software is constantly active in the background, continuously sending data offsite to the remote server.
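The split between background de-duplication and deferred, throttled upload can be sketched with two worker threads: one hashes and queues unique blocks immediately, while the other sends them only once a bandwidth-free signal is raised. The names and the in-memory "upload" are illustrative assumptions, not a real client's architecture:

```python
import hashlib
import queue
import threading

bandwidth_free = threading.Event()   # set when other local traffic finishes
upload_queue = queue.Queue()
uploaded = []                        # stand-in for the remote server


def dedup_worker(blocks, seen):
    """Background pass: hash blocks and queue only the unique ones."""
    for b in blocks:
        h = hashlib.sha256(b).hexdigest()
        if h not in seen:
            seen.add(h)
            upload_queue.put(b)
    upload_queue.put(None)           # sentinel: de-duplication is done


def upload_worker():
    """Send queued blocks, but only after bandwidth is released."""
    while True:
        b = upload_queue.get()
        if b is None:
            break
        bandwidth_free.wait()        # throttle: defer until bandwidth frees up
        uploaded.append(b)           # stand-in for the actual network send


blocks = [b"alpha", b"beta", b"alpha", b"gamma"]
t1 = threading.Thread(target=dedup_worker, args=(blocks, set()))
t2 = threading.Thread(target=upload_worker)
t1.start(); t2.start()
t1.join()                            # de-duplication completes in the background
bandwidth_free.set()                 # bandwidth-intensive local work done
t2.join()
assert uploaded == [b"alpha", b"beta", b"gamma"]
```

The hashing work finishes regardless of network conditions, so only the cheap final transfer waits on bandwidth, which is what gives the user the impression of continuous offsite activity.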
De-duplication is often used in conjunction with compression technologies. Compression can itself be considered a kind of de-duplication, in that repetitive elements within a data set are stored referentially, reducing the data set to the minimum possible number of bytes.
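The referential nature of compression can be seen with Python's standard zlib module, whose underlying DEFLATE algorithm replaces repeated runs with back-references to their first occurrence. The sample data below is an illustrative assumption chosen to make the effect obvious:

```python
import zlib

data = b"The quick brown fox. " * 200   # a highly repetitive data set
compressed = zlib.compress(data, level=9)

# Repeated runs are stored as back-references to the first occurrence,
# so the compressed size grows far more slowly than the raw size.
assert len(compressed) < len(data) // 10
assert zlib.decompress(compressed) == data
```

In practice a backup pipeline typically de-duplicates first, then compresses the surviving unique blocks before they travel over the WAN, so the two techniques compound rather than overlap.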