What is the significance of Chunking in de-duplication? Data passed through the de-duplication engine is chunked into smaller units, and each chunk is assigned an identity using a cryptographic hash function. Two chunks are then compared to ascertain whether they have the same identity. If they do, a link to the already-stored data (not the chunk itself) is included in the incremental backup. If they do not, the chunk is accepted as an independent unit of data and uploaded to the backup server.
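As a rough sketch of this identity check (in Python, with hypothetical names such as chunk_store and backup_manifest, and an assumed fixed 4 KB chunk size), the code below hashes each chunk with SHA-256 and either records a reference to an already-stored chunk or stores the chunk as new:

```python
import hashlib

CHUNK_SIZE = 4096        # assumed fixed chunk size, purely for illustration
chunk_store = {}         # hypothetical backup repository: identity -> chunk bytes
backup_manifest = []     # ordered chunk references describing this backup

def backup_stream(data: bytes):
    """Split the stream into chunks, identify each chunk by its SHA-256
    digest, and store only chunks whose identity has not been seen before."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        identity = hashlib.sha256(chunk).hexdigest()
        if identity not in chunk_store:
            chunk_store[identity] = chunk     # new chunk: store/upload it
        backup_manifest.append(identity)      # always record a reference (link)

backup_stream((b"A" * CHUNK_SIZE) * 3 + b"trailing bytes")
print(len(backup_manifest), "chunk references,", len(chunk_store), "unique chunks stored")
# -> 4 chunk references, 2 unique chunks stored
```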

However, the assumption that chunks of data with identical identities are themselves identical is subject to debate. Some algorithms accept this premise and identify duplicate chunks purely by matching identities, while others take the pigeonhole principle, and hence the possibility of hash collisions, into consideration when designing the algorithm.

The pigeonhole principle in mathematics and computer science states that if "n" items are put into "m" pigeonholes with n > m, then at least one pigeonhole must contain more than one item. Applied to chunking, the number of possible chunk contents far exceeds the number of possible hash values, so distinct chunks can, in principle, be assigned the same identity.
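The toy sketch below makes this concrete by deliberately shrinking the identity to a single byte (256 possible values, an assumption purely for demonstration); real systems use 160-bit or 256-bit digests, for which collisions are astronomically unlikely but not mathematically impossible:

```python
import hashlib
import os

def tiny_identity(chunk: bytes) -> int:
    """Deliberately truncate SHA-256 to its first byte: only 256 'pigeonholes'."""
    return hashlib.sha256(chunk).digest()[0]

seen = {}
for n in range(257):                 # 257 distinct chunks must share a pigeonhole
    chunk = os.urandom(64)           # random 64-byte 'chunk'
    ident = tiny_identity(chunk)
    if ident in seen and seen[ident] != chunk:
        print(f"after {n + 1} chunks, two different chunks share identity {ident}")
        break
    seen[ident] = chunk
```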

Chunking for de-duplication can be frequency-based or content-based. Frequency-based chunking identifies data chunks that occur with high frequency and uses this frequency information to improve the de-duplication gain. Content-based chunking is a stateless approach that partitions a long stream of data into smaller chunks and removes the duplicates; however, because the resulting chunk boundaries are effectively random, it provides no deterministic performance guarantee.
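As a simplified illustration of the frequency-based idea (a sketch only, not any particular published algorithm), the code below counts how often each candidate chunk identity occurs in a sample of data; a frequency-based chunker would use such statistics to concentrate its effort on the high-frequency chunks:

```python
import hashlib
from collections import Counter

CHUNK_SIZE = 4096  # assumed candidate chunk size for the statistics pass

def chunk_frequencies(data: bytes) -> Counter:
    """First pass: gather occurrence counts of candidate chunk identities."""
    counts = Counter()
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        counts[hashlib.sha256(chunk).hexdigest()] += 1
    return counts

sample = (b"HEADER" + b"\x00" * (CHUNK_SIZE - 6)) * 5 + b"unique payload"
frequent = [ident for ident, n in chunk_frequencies(sample).items() if n > 1]
print(len(frequent), "high-frequency chunk identity(ies) found")  # -> 1
```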

Commercial implementations of de-duplication vary primarily in the chunking method and architecture used. Some online backup algorithms chunk data according to physical-layer constraints (for instance, a 4 KB block size); others use only complete files as data chunks, as in single-instance storage. The most intelligent, though CPU-intensive, chunking methodology is generally considered to be the sliding-block methodology, in which a window is passed along the file stream to identify the natural internal file boundaries.
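A minimal sketch of the sliding-window idea is shown below, assuming a toy rolling hash, a boundary condition of "the low 12 bits of the hash are zero", and arbitrary minimum and maximum chunk sizes; production systems typically use Rabin fingerprints, but the principle of content-dependent boundaries is the same:

```python
import os

MIN_CHUNK = 2048       # assumed minimum chunk size, avoids tiny chunks
MAX_CHUNK = 65536      # assumed maximum chunk size, bounds the worst case
MASK = (1 << 20) - 1   # bounds the rolling value; old bytes age out as bits shift off
BOUNDARY = 1 << 12     # cut when the low 12 bits are zero -> roughly 4 KB average

def content_defined_chunks(data: bytes):
    """Slide a toy rolling hash along the stream and cut a chunk wherever the
    hash meets the boundary condition, so boundaries follow the content itself
    rather than fixed offsets."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & MASK
        length = i - start + 1
        if (length >= MIN_CHUNK and rolling % BOUNDARY == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])        # trailing partial chunk
    return chunks

pieces = content_defined_chunks(os.urandom(1 << 20))   # 1 MB of sample data
print(len(pieces), "content-defined chunks, average size",
      sum(len(p) for p in pieces) // len(pieces), "bytes")
```

Because the boundaries depend on the content rather than on offsets, inserting a few bytes near the start of a file shifts only the chunks around the edit; the later chunks keep their boundaries and identities, which is what makes this approach so effective for incremental backup.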

However, it is important to understand the risk involved in data transformation or compression. Whenever data is transformed from its original state, there is a potential risk of data loss. Chunking transforms data, and the de-duplication evaluation algorithms generate automated decisions about which data is actually written to the backup repository. Added to this is the fact that de-duplication systems store data differently from the way in which it was written, and different systems employ different techniques. Data integrity therefore depends on the type and design of the algorithm being used.
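For example, because a de-duplicating repository holds unique chunks plus references rather than the original byte stream, a restore must reassemble the data from the chunk store. The sketch below (using hypothetical chunk_store and manifest structures) shows such a reassembly with a re-hash check, one way a design can protect the integrity of data it has transformed:

```python
import hashlib

# Hypothetical de-duplicated repository: unique chunks keyed by identity,
# plus an ordered manifest of identities describing one backed-up stream.
chunk_store = {}
manifest = []
for piece in (b"first block ", b"second block ", b"first block "):
    ident = hashlib.sha256(piece).hexdigest()
    chunk_store.setdefault(ident, piece)
    manifest.append(ident)

def restore(manifest, chunk_store) -> bytes:
    """Reassemble the original stream from references, re-hashing each chunk
    so a corrupted or mis-indexed chunk is detected rather than silently
    written into the restored data."""
    out = bytearray()
    for ident in manifest:
        chunk = chunk_store[ident]
        if hashlib.sha256(chunk).hexdigest() != ident:
            raise ValueError(f"integrity check failed for chunk {ident}")
        out.extend(chunk)
    return bytes(out)

print(restore(manifest, chunk_store))   # -> b'first block second block first block '
```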