Data transformation is inherently risky: any transformation of data can potentially cause data loss. Data de-duplication systems transform the way data is written to the backup repository, so if the de-duplication algorithm is poorly designed, or its implementation falls short, it can compromise the integrity of your data.

However, the good news is that data de-duplication has matured over the years, and a number of sophisticated de-duplication technologies are available today. Commercial implementations harness these technologies to the advantage of their customers.

Most de-duplication problems arise from the methodology adopted for identifying and eliminating duplicates in a given data set:

  1. A poorly designed fingerprinting scheme (cryptographic hash function) can cause data corruption: if two different chunks produce the same hash value, one of them may be silently discarded. The solution lies in building additional validation into the algorithm to verify that two or more data chunks with the same hash value really are identical (see the hash-verification sketch after this list).
  2. Data de-duplication can be resource intensive. Inefficient de-duplication of large volumes of data, whether inline or post-process, can bog down systems. The solution is to use inline de-duplication for small volumes of data and post-process de-duplication for larger volumes, where data need not be mirrored for disaster recovery on an ongoing basis.
  3. Snapshots sent to primary storage after de-duplication are fully rehydrated when secondary copies are made, which can leave the secondary copy larger than the primary copy. This problem is often resolved by taking snapshots of the data before de-duplication is applied.
  4. Though de-duplication is itself a form of compression, it works in tension with encryption. While the goal of encryption is to remove discernible patterns from data, de-duplication depends on identifying and exploiting those repeating patterns (see the encryption sketch after this list).
  5. Finally, scaling is a challenge for every de-duplication algorithm. The hash table or de-dupe namespace has to be shared across devices, or space efficiency suffers; yet even a shared namespace is not entirely reliable in practice, as it often brings performance degradation (see the sharding sketch after this list).
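
The hash-verification point in item 1 can be made concrete. Below is a minimal sketch in Python (all names here are hypothetical, not taken from any product): on a fingerprint match, the candidate chunk is compared byte for byte with the stored chunk before it is treated as a duplicate, so a hash collision cannot silently discard unique data.

```python
import hashlib

class ChunkStore:
    """Toy in-memory de-duplication store (illustrative only)."""

    def __init__(self):
        self._chunks = {}   # fingerprint -> chunk bytes

    def put(self, chunk: bytes) -> str:
        """Store a chunk and return the reference recorded in the backup index."""
        fingerprint = hashlib.sha256(chunk).hexdigest()
        stored = self._chunks.get(fingerprint)
        if stored is None:
            self._chunks[fingerprint] = chunk          # first time this chunk is seen
        elif stored != chunk:
            # Same digest, different bytes: a collision. Never discard the
            # unique chunk; keep it under a disambiguated key instead.
            fingerprint = f"{fingerprint}:{len(self._chunks)}"
            self._chunks[fingerprint] = chunk
        # If stored == chunk, it really is a duplicate and nothing new is written.
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        return self._chunks[fingerprint]
```

The byte-for-byte comparison costs an extra read of the stored chunk, which is exactly the kind of design trade-off worth asking a vendor about.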

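Item 4 can be demonstrated in a few lines of Python. The "cipher" below is a deliberately simplified stand-in (a keyed, nonce-randomized keystream built from SHA-256, not a real cryptographic design); the only property that matters for the demonstration is that encryption randomizes its output, so two identical plaintext blocks no longer produce identical fingerprints and a fingerprint-based de-duplicator finds nothing to eliminate.

```python
import hashlib
import os

def toy_encrypt(block: bytes, key: bytes) -> bytes:
    """Stand-in for a real cipher: XOR with a keystream derived from key + random nonce."""
    nonce = os.urandom(16)
    keystream = b""
    counter = 0
    while len(keystream) < len(block):
        keystream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return nonce + bytes(a ^ b for a, b in zip(block, keystream))

block = b"identical application data " * 64
key = b"backup encryption key"

# Plaintext: identical blocks have identical fingerprints -> one stored copy.
assert hashlib.sha256(block).digest() == hashlib.sha256(block).digest()

# Ciphertext: the same block encrypted twice yields different bytes,
# so fingerprint-based de-duplication stores both copies in full.
c1, c2 = toy_encrypt(block, key), toy_encrypt(block, key)
assert hashlib.sha256(c1).digest() != hashlib.sha256(c2).digest()
```

This is why de-duplication is generally applied before encryption in the data path.
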
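The scaling trade-off in item 5 comes down to where the fingerprint index lives. A minimal sketch, assuming a hypothetical three-node cluster: routing every fingerprint to a single owning node gives one shared de-dupe namespace (good space efficiency), but every lookup may now cross the network, which is where the performance degradation creeps in; keeping a private index per node avoids that hop but lets the same chunk be stored on several nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical appliances in a cluster

def owning_node(fingerprint: str) -> str:
    """Map a chunk fingerprint to the node that owns that slice of the
    shared de-dupe namespace (simple modulo sharding, for illustration only)."""
    return NODES[int(fingerprint, 16) % len(NODES)]

fingerprint = hashlib.sha256(b"a 4 KB chunk of backup data").hexdigest()
print(owning_node(fingerprint))   # every node computes the same owner,
                                  # but the lookup may be a remote call
```
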
It follows that any evaluation of de-duplication technology must take into consideration how the software vendor has resolved these issues in the design and deployment of the algorithm. You, as a customer, must be risk-aware and must understand how the integrity of your mission-critical data will remain uncompromised. You need to make sure that the algorithm does not discard unique data blocks while writing your backup to disk on the server. Otherwise, your disaster recovery process will be a disaster you can never recover from.