A lot of people assume that data de-duplication and data compression are the same thing, or at least similar. At a very high level that is fair: online backup service providers use both technologies to reduce data and help you save on storage space. But this is where the similarity ends.
De-duplication is often referred to as a compression technique. It is clearer to describe de-duplication as a technology that facilitates compression and storage optimization by removing duplicate files or blocks of data.
Compression is a technology that removes redundancies from data at granular levels and reduces the size of files or blocks of data.
At best, de-duplication and compression can be defined as twin concepts.
Compression transforms data in order to shrink it. Primarily, data compression reduces the size of a file by removing redundant data within that file, for example by encoding runs of extra spaces compactly or by replacing long character strings with shorter representations of them. This makes the file smaller. The process identifies small groupings and repetitive patterns and replaces them with representative patterns. During de-compression, the algorithm reads the representations and replaces the representative patterns with the original groupings, regenerating the original file with all its spaces and long character strings.
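The round trip described above can be sketched in a few lines of Python. This is only an illustration using the standard library's zlib module (a DEFLATE implementation); the article does not say which algorithm a backup provider uses, and the sample text and the helper name compress_roundtrip are assumptions made for the example.

```python
import zlib


def compress_roundtrip(data):
    """Compress data, de-compress it again, and report what happened.

    Returns (original size, compressed size, whether the restored bytes
    are identical to the input).
    """
    compressed = zlib.compress(data)
    restored = zlib.decompress(compressed)
    return len(data), len(compressed), restored == data


# Hypothetical sample: a short phrase repeated many times, so the file
# contains the kind of repetitive pattern compression can exploit.
sample = b"the quick brown fox " * 200

original_size, compressed_size, lossless = compress_roundtrip(sample)
print("original bytes:  ", original_size)
print("compressed bytes:", compressed_size)
print("restored exactly:", lossless)
```

Because the input is highly repetitive, the compressed form is far smaller than the original, and de-compression regenerates every byte of it.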
De-duplication looks for large groupings and repetitive byte patterns across streams of data and replaces the repeats with references. It is a process in which previously seen data is cached so that duplicate patterns of data can be matched and then eliminated. So, the larger the cache, the more strings and bytes are available for comparison and elimination. Each block of data is “fingerprinted” using a cryptographic hash, which gives the block an effectively unique identifier. The fingerprint is much smaller than the original block. Only one copy of each distinct block is written to disk, together with the fingerprints that refer to it. The duplicated data can then be reconstructed from the fingerprints.
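A minimal sketch of block-level de-duplication follows, under stated assumptions: fixed-size 4096-byte blocks, SHA-256 as the fingerprint, and an in-memory dictionary standing in for the disk. The names BLOCK_SIZE, deduplicate and rebuild are hypothetical, and real products often use variable-size chunking and persistent indexes instead.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def deduplicate(stream, block_size=BLOCK_SIZE):
    """Split a byte stream into blocks and keep one copy of each distinct block.

    Returns (store, recipe): store maps fingerprint -> block, and recipe is
    the ordered list of fingerprints needed to rebuild the stream.
    """
    store = {}
    recipe = []
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()  # 32 bytes per block
        if fingerprint not in store:
            store[fingerprint] = block  # first time this block is seen
        recipe.append(fingerprint)      # duplicates cost only a fingerprint
    return store, recipe


def rebuild(store, recipe):
    """Reconstruct the original stream from the stored blocks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Hypothetical stream: five blocks, of which only two are distinct.
stream = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
store, recipe = deduplicate(stream)

print("blocks in stream:", len(recipe))
print("blocks stored:   ", len(store))
print("rebuilt exactly: ", rebuild(store, recipe) == stream)
```

Here the recipe plays the role of the fingerprints written alongside the single stored copy: the stream is rebuilt by looking each fingerprint up in the store.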
Data compression ratios and data de-duplication ratios are different. Data de-duplication and data compression are often combined to cumulatively reduce the size of the data that is written to storage disks. Depending on the type of de-duplication technology used, de-duplication reduces the size of data by 20-80%. Compression reduces data by 10-50%.
The combination of the de-duplication ratio and the compression ratio is defined as the “data reduction ratio”. The data reduction ratio is the de-duplication ratio multiplied by the compression ratio. This makes the compression ratio a multiplier on data reduction effectiveness, and compressing blocks of data as they are de-duplicated is an advantage to the user when transmitting large volumes of data to remote storage servers. Storage savings can reach or even exceed 90% as a consequence.
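To see how the percentages and the multiplied ratios relate, the arithmetic can be sketched as follows. The conversion assumes a ratio means original size divided by reduced size, so a saving of s corresponds to a ratio of 1 / (1 - s); that convention and the function names are assumptions made for illustration, and the inputs are simply the upper ends of the ranges quoted above.

```python
def ratio_from_savings(savings):
    """Convert a fractional saving (0.80 means 80%) into a reduction ratio."""
    return 1.0 / (1.0 - savings)


def savings_from_ratio(ratio):
    """Convert a reduction ratio back into a fractional saving."""
    return 1.0 - 1.0 / ratio


def data_reduction_ratio(dedup_ratio, compression_ratio):
    """Data reduction ratio = de-duplication ratio x compression ratio."""
    return dedup_ratio * compression_ratio


# Upper ends of the quoted ranges: 80% from de-duplication, 50% from compression.
dedup_ratio = ratio_from_savings(0.80)        # about 5:1
compression_ratio = ratio_from_savings(0.50)  # 2:1
combined_ratio = data_reduction_ratio(dedup_ratio, compression_ratio)
combined_savings = savings_from_ratio(combined_ratio)

print("combined ratio:   %.1f:1" % combined_ratio)
print("combined savings: %.0f%%" % (combined_savings * 100))
```

With these inputs the combined ratio works out to roughly 10:1, which corresponds to a saving of about 90%. Note that the ratios multiply while the percentages do not simply add.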