Duplicates may occur when users store multiple copies of the same file or create full backups of their system to safeguard against accidental deletion. Cloud-based backup systems can also duplicate data if full backups are performed repeatedly without any check for data that is already stored, and incremental backups can add further duplicates under certain circumstances. De-duplication systems may be designed to identify duplicates at the whole-file level or at a more granular chunk level, and to weed them out either before data is encrypted/compressed and stored or afterwards.
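The difference between file-level and chunk-level detection can be shown with a minimal Python sketch (the 4 KiB chunk size, SHA-256 fingerprints, and file contents are illustrative choices, not a reference to any particular product):

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Whole-file fingerprint: any edit changes it completely."""
    return hashlib.sha256(data).hexdigest()

def chunk_digests(data: bytes, chunk_size: int = 4096) -> list:
    """Per-chunk fingerprints: edits only disturb the chunks they touch."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

original = b"Q1-financials" * 2000           # 26,000 bytes
edited   = original + b"one new paragraph"   # a lightly edited copy

# File-level comparison sees two different files -- nothing to deduplicate.
print(file_digest(original) == file_digest(edited))  # False

# Chunk-level comparison still finds the shared data, chunk by chunk.
shared = set(chunk_digests(original)) & set(chunk_digests(edited))
print(len(shared))  # 6 -- all but the last of the 7 chunks match
```

Chunk-level systems therefore reclaim space even when no two files are byte-for-byte identical, at the cost of tracking many more fingerprints.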

One of the biggest challenges faced by de-duplication algorithm designers is the question of "when": when should de-duplication of data kick in? The rule of thumb is that de-duplication should be completed before encryption or compression, because both processes scramble data, and a de-duplication system may then fail to identify duplicates in a compressed or encrypted data set. While database administrators may favour encryption and compression for reasons of security or speed, de-duplication designers may quarrel with that order of operations. If de-duplication is to be done at the destination, data will have to be decompressed or decrypted before de-duplication can be attempted. It may be simpler, and even cheaper, to complete de-duplication at the source before encryption and compression, even though it is time consuming.
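The effect is easy to demonstrate. In the sketch below, a toy XOR stream cipher stands in for real encryption (a production system would use something like AES-GCM; the key and nonce sizes here are arbitrary): identical plaintext blocks produce identical fingerprints, but once each copy is encrypted with a fresh nonce, the fingerprints no longer match and hash-based de-duplication finds nothing.

```python
import hashlib
import os

def toy_encrypt(data: bytes, key: bytes, nonce: bytes) -> bytes:
    """Stand-in stream cipher: XOR with a SHA-256-derived keystream.

    Illustrative only -- not secure, but it captures the property that
    matters here: the same plaintext encrypted under different nonces
    yields unrelated ciphertexts.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

block = b"identical backup data" * 100
key = b"secret-key"
nonce1, nonce2 = os.urandom(12), os.urandom(12)

# Before encryption: two identical copies hash identically -> dedupe works.
print(hashlib.sha256(block).hexdigest() == hashlib.sha256(block).hexdigest())  # True

# After encryption with fresh nonces: ciphertexts differ -> dedupe is blind.
ct1 = toy_encrypt(block, key, nonce1)
ct2 = toy_encrypt(block, key, nonce2)
print(ct1 == ct2)  # almost certainly False: unrelated keystreams
```

Because XOR is its own inverse, `toy_encrypt(ct1, key, nonce1)` recovers the plaintext, which is exactly the decryption step a destination-side de-duplicator would be forced to perform first.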

The next big challenge of de-duplication is "how". Where enterprises multiplex different kinds of backup and one of the backup systems involves the use of virtual tape libraries, de-duplication becomes complex. Multiplexing several backups onto a single tape drive interleaves their data, which scrambles chunk boundaries and confounds the dedupe engine even when it can identify that the backup contains different streams.
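A small simulation, assuming fixed-size 64-byte chunking and hypothetical 48-byte tape records (both made-up parameters), shows why: once two streams are interleaved onto one tape image, none of the original stream's chunk fingerprints reappear in the image, even though all of its data is still there.

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int = 64) -> set:
    """Fixed-size chunk fingerprints, as a dedupe engine might compute them."""
    return {hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)}

# Two deterministic, distinct backup streams.
stream_a = bytes(range(256)) * 4
stream_b = bytes(reversed(range(256))) * 4

# Multiplexing to one tape drive interleaves 48-byte records from each stream.
RECORD = 48
records_a = [stream_a[i:i + RECORD] for i in range(0, len(stream_a), RECORD)]
records_b = [stream_b[i:i + RECORD] for i in range(0, len(stream_b), RECORD)]
tape_image = b"".join(a + b for a, b in zip(records_a, records_b))

# Every 64-byte chunk of the tape image straddles records from both streams,
# so none of stream_a's fingerprints survive: the data is still on tape,
# but the dedupe engine can no longer see it.
overlap = chunk_digests(tape_image) & chunk_digests(stream_a)
print(len(overlap))  # 0
```

Had the streams been written back to back instead of interleaved, every chunk of a repeated stream would have matched, which is why some products de-multiplex or chunk per-stream before fingerprinting.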

Restore is another major challenge in de-duplication, often referred to as the "dedupe tax". De-duplication causes recent backups to be recorded in a fragmented way: a new backup largely consists of references to chunks of original files that were backed up earlier and never stored again. Restoring a complete backup by chasing all of those references can be a time-consuming process. In systems that have adopted continuous backup, however, this challenge is largely addressed and restore takes less time.
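The reference-chasing involved in a restore can be sketched as follows (a minimal model with a shared chunk store; the chunk size and file contents are invented for illustration):

```python
import hashlib

def backup(data: bytes, chunk_store: dict, chunk_size: int = 4096) -> list:
    """Store each unique chunk once; return the recipe for this backup."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # duplicates are not re-stored
        recipe.append(digest)
    return recipe

def restore(recipe: list, chunk_store: dict) -> bytes:
    """Reassemble a backup by chasing every chunk reference.

    Each lookup may hit a chunk written during an earlier backup, so the
    reads scatter across the store -- the 'dedupe tax' on restore speed.
    """
    return b"".join(chunk_store[digest] for digest in recipe)

store = {}
monday = b"report-v1" * 1000
tuesday = monday + b" appended edits"   # mostly references Monday's chunks

recipe_mon = backup(monday, store)
recipe_tue = backup(tuesday, store)

# Tuesday's restore reads two of Monday's chunks plus one new tail chunk.
print(restore(recipe_tue, store) == tuesday)  # True
```

On real storage the scattered lookups translate into random I/O across old backup generations, which is what makes fragmented restores slow.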

Despite all the above challenges, data de-duplication systems make a significant contribution to the efficiency of backup and recovery. The benefits outweigh the problems, and cloud service providers consequently take on the challenges head on and resolve them in innovative ways.