Data de-duplication, also known as “intelligent compression”, is a technique used to reduce the volume of data sent over the Internet, saving bandwidth and/or disk space. The technology may kick in at the source or the destination to eliminate duplicate pieces of information in the data identified for backup, so that only unique instances are transmitted to or stored in the backup repository. All subsequent iterations of the information are referenced at the point of occurrence for re-insertion during recovery operations.

Data de-duplication can be initiated at the file, block or bit level. File-level de-duplication eliminates duplicate files from the backup set; subsequent iterations of a file are stored as pointers that are resolved during recovery. This is not regarded as a highly efficient de-duplication technique, because changing even a single byte within a file categorizes it as a non-duplicate. As a result, files that are largely duplicates of each other may still be backed up and stored in the backup repository as non-duplicate files.
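
As a minimal sketch, file-level de-duplication can be pictured as fingerprinting whole files and keeping only a pointer when a fingerprint repeats. The function names and the choice of SHA-1 below are illustrative assumptions, not the implementation of any particular backup service:

```python
import hashlib

def file_fingerprint(path: str) -> str:
    """Return a SHA-1 digest of the file's full contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_files(paths: list[str]) -> tuple[list[str], dict[str, str]]:
    """Split a backup set into unique files and pointer entries."""
    index: dict[str, str] = {}       # fingerprint -> first file stored
    unique, pointers = [], {}
    for path in paths:
        fp = file_fingerprint(path)
        if fp in index:
            pointers[path] = index[fp]   # duplicate: keep only a reference
        else:
            index[fp] = path
            unique.append(path)          # new content: queue for backup
    return unique, pointers
```

Because the whole file is hashed as one unit, a one-byte edit yields a completely different fingerprint, which is exactly why this level of de-duplication misses near-duplicates.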

Block-level de-duplication, which most Cloud backup services adopt, checks for duplicates within blocks of data; bit-level de-duplication checks for duplicates at the bit level. Only the data that has changed within the block or bits is incrementally uploaded to the backup repository. Although block- and bit-level de-duplication can achieve data reduction ratios of up to 50:1, the de-duplication process slows down the backup considerably. The trade-off is acceptable given the cost savings when the volume of data to be transmitted is very large.
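
A hedged sketch of the block-level idea follows, using a fixed block size and an in-memory repository index purely for illustration; real services choose their own block sizes (or variable-size chunks) and persist the index:

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; services tune this or use variable-size chunks

def changed_blocks(path: str, repo_index: set[str]) -> list[bytes]:
    """Return only the blocks whose fingerprints are not yet in the repository."""
    to_upload = []
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            fp = hashlib.sha1(block).hexdigest()
            if fp not in repo_index:
                repo_index.add(fp)
                to_upload.append(block)   # only unseen blocks are sent
    return to_upload
```

Here an edit to one block leaves the fingerprints of the other blocks unchanged, so only the modified block is uploaded, which is the source of the high reduction ratios.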

What really happens under the hood when a de-duplication process starts? The technology uses a hash algorithm (such as MD5 or SHA-1) to examine each block, bit or file. The process generates a unique number for each unit examined and compares it with the numbers already indexed. If the number exists, the unit is omitted; otherwise, it is added to the index and queued for backup. It follows that the more granular the de-duplication, the larger the index will be.
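
The same lookup-and-queue loop can be sketched independently of the unit size. The class and method names below are hypothetical, but the logic follows the description above: fingerprint the unit, omit it if the fingerprint is already indexed, otherwise index it and queue it.

```python
import hashlib

class DedupIndex:
    """Fingerprint index for any unit: a whole file, a block or a bit string."""

    def __init__(self) -> None:
        self.seen: set[str] = set()   # grows with the number of unique units

    def admit(self, unit: bytes) -> bool:
        """Return True if the unit is new and should be queued for backup."""
        fp = hashlib.sha1(unit).hexdigest()   # MD5 could be used instead
        if fp in self.seen:
            return False      # duplicate: omitted from the backup stream
        self.seen.add(fp)
        return True           # unique: indexed and queued
```

Finer granularity means more units and therefore more fingerprints, which is why the index grows as de-duplication becomes more granular.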

De-duplication is not without problems. Hash collisions can corrupt the backup: a collision occurs when the algorithm generates the same hash number for two different chunks of data, so the second chunk is discarded as a duplicate, resulting in data loss. Cloud vendors may combine two or more hash algorithms to guard against this possibility.
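
One way to picture the combined-algorithm safeguard is a composite fingerprint in which both digests must match before a chunk is treated as a duplicate. This is only an illustrative sketch, not how any specific vendor implements it:

```python
import hashlib

def composite_fingerprint(chunk: bytes) -> str:
    """Combine two digests so a collision in one algorithm alone is not enough."""
    md5 = hashlib.md5(chunk).hexdigest()
    sha1 = hashlib.sha1(chunk).hexdigest()
    return f"{md5}:{sha1}"   # both parts must match for a "duplicate" verdict
```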

Data de-duplication is often used with other data processing technologies such as “compression” and “delta referencing”.