Cloud vendors tout de-duplication as one of the simplest ways of managing exploding data growth. The technology does what it promises, but there are several things end users must be aware of before subjecting their data to the de-duplication process.
1/ De-duplication is not new. It has long been used alongside compression in Zip utilities, and it is supported by many applications.
2/ De-duplication is measured as a ratio, and a higher ratio is generally taken to mean the technology is being put to better use. Remember, though, that the gains diminish as the ratio climbs: no file or block can be shrunk by 100%, so each increase in the ratio saves a smaller additional slice of space.
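The diminishing returns behind point 2 are easy to see with a little arithmetic. A minimal illustration (the function name is ours, not an industry term):

```python
def space_saved_percent(ratio: float) -> float:
    """Percentage of raw capacity saved at a given de-duplication ratio (N:1)."""
    return (1 - 1 / ratio) * 100

# Each doubling of the ratio buys a smaller extra saving.
for ratio in (2, 5, 10, 20, 40):
    print(f"{ratio}:1 ratio -> {space_saved_percent(ratio):.1f}% saved")
```

Going from 10:1 to 20:1 only moves the saving from 90% to 95%, and 40:1 adds just 2.5 points more, so ever-higher ratios are worth ever less.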
3/ De-duplication is a CPU-intensive process.
4/ If de-duplication is performed at the source, it consumes significant processing power on the client and slows down the backup.
5/ If de-duplication is performed at the destination, it degrades the backup server's performance instead.
6/ If de-duplication is performed on the target server, there is no bandwidth or space saving in the initial stages. In fact, the target volume may temporarily need more space to initiate and complete the de-duplication process.
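Whether it runs at the source or the target (points 4 to 6), the core mechanism is the same: fingerprint each block and store each unique block only once. A minimal sketch, assuming fixed-size 4 KB blocks and SHA-256 fingerprints (real products typically use variable-size chunking and their own metadata formats):

```python
import hashlib

def dedup_store(data: bytes, block_size: int = 4096):
    """Minimal fixed-size-block de-duplication sketch (illustrative only).

    Returns a block store (digest -> unique block) plus the ordered list
    of digests ("recipe") needed to reconstruct the original data."""
    store, recipe = {}, []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # a duplicate block is stored once
        recipe.append(digest)
    return store, recipe

# Data whose blocks mostly repeat: four blocks referenced, two stored.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, recipe = dedup_store(data)
print(len(recipe), "blocks referenced,", len(store), "stored uniquely")
# -> 4 blocks referenced, 2 stored uniquely
```

The hashing loop is where the CPU cost of points 3 to 5 comes from: every block must be digested before it can be matched, wherever the loop runs.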
7/ Hash collisions do occur. Because hash functions map arbitrary data to fixed-size digests, two blocks of different data can, in principle, produce the same hash, and one of them may then be wrongly discarded as a duplicate. Cloud service providers mitigate this risk by using multiple hash algorithms to identify and eliminate duplicates. However, this increases processing time and CPU usage.
8/ De-duplication of media files is not very efficient. Formats such as MP3, MP4 and JPEG are already compressed, so most of the redundancy that de-duplication relies on has been removed from the file.
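Point 8 can be demonstrated by counting unique blocks: a redundant document yields many repeated blocks, while high-entropy data (standing in here for compressed media payloads) yields essentially none to eliminate. A small sketch, assuming 4 KB blocks:

```python
import hashlib, os

def unique_block_fraction(data: bytes, block_size: int = 4096) -> float:
    """Fraction of fixed-size blocks that are unique (1.0 = nothing to dedupe)."""
    digests = {hashlib.sha256(data[i:i + block_size]).digest()
               for i in range(0, len(data), block_size)}
    n_blocks = -(-len(data) // block_size)  # ceiling division
    return len(digests) / n_blocks

doc_like = b"\x00" * 4096 * 32      # highly redundant data
media_like = os.urandom(4096 * 32)  # high-entropy stand-in for MP3/JPEG payloads
print(unique_block_fraction(doc_like))    # -> 0.03125 (dedupes very well)
print(unique_block_fraction(media_like))  # -> 1.0 (nothing to eliminate)
```

This is why backup sets dominated by photos, music and video show much lower de-duplication ratios than sets of documents or virtual machine images.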
9/ Windows Server 8 offers a native file-system de-duplication feature that works post-process, after data has been written. It complements Hyper-V by reducing redundancy across virtual machines, and the file system guards against data corruption by using a copy-on-write algorithm.
10/ De-duplication makes solid state drives more practical for virtualization hosts. These drives offer less capacity than hard disks but deliver far better performance, and de-duplication stretches their limited space further.
In this context, it is a good idea to read between the lines and understand the nuances of the de-duplication technology on offer before signing the service level agreement (SLA) with your Cloud service vendor.