De-duplication technologies can be designed to work on primary storage as well as on secondary storage. Primary storage holds active or near-active transactional data that is directly accessible to the CPU. The volume of this type of data is typically small. Secondary storage holds less active or even historical data and is not directly accessible by the CPU. Its data volumes are typically very large.
De-dupe for primary data sets works with active, online or near-line data, where duplication has generally been assumed to be almost impossible. This is in complete contrast to de-dupe for secondary backup data, which is backed up daily, weekly or monthly, and where duplicate data is assumed to be almost a certainty. It is therefore not surprising that de-dupe has, to date, largely been used with secondary storage and rarely with primary storage.
However, it is fast dawning on users and industry experts that primary storage is not free from duplicate data. Research indicates that, on average, up to 40% of the data in primary storage is duplicate data that should be cleaned up. This is largely due to a lack of best-of-breed management practices. For instance, whenever engineers want to test a new application, they make a copy of the primary data on the primary storage and test the application against it. The copy is then forgotten, or no one bothers to clean it up later. Over time, this impacts the performance of primary storage.
Should the approaches to primary storage and secondary storage de-dupe be the same? Primary storage de-dupe will need more performance-accelerating technology built into it, whereas secondary de-dupe may focus more on archival, with little emphasis on performance acceleration. For instance, in block and hash de-duplication schemes, identifying duplicate data requires breaking the data up into blocks in such a way that the same data falls into the same block. Each block is then hashed, and blocks with matching hashes need to be stored only once.
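To make the block and hash idea concrete, here is a minimal sketch in Python that assumes fixed-size blocks and SHA-256 as the hash. The function names, the 4096-byte block size and the in-memory dictionary are illustrative assumptions, not any vendor's implementation.

```python
import hashlib


def dedupe_fixed_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps a block's hash to its bytes, and
    recipe is the ordered list of hashes needed to rebuild the original data.
    """
    store = {}   # hash -> block bytes (each unique block stored once)
    recipe = []  # ordered hashes describing the original byte stream
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy seen
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the block store and the recipe."""
    return b"".join(store[digest] for digest in recipe)
```

With input made of three 4096-byte blocks in which the first and third are identical, the store would hold two blocks while the recipe lists three hashes. Note that fixed blocks only match when duplicate data lines up on the same block boundaries, which is why block placement matters so much in these schemes.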
Vendors may have to create sophisticated algorithms to determine where each variable-size block begins and ends in order to maximize the probability of recognizing duplicate data. Working out block boundaries consequently requires a lot of processing power and resources. This makes the approach unsuitable for primary storage, which has to balance latency against de-dupe efficiency. Moreover, primary storage is designed for fixed block sizes rather than variable block sizes, and de-dupe there is implemented as an extension of the file system. Therefore, primary storage de-dupe will have to identify duplicate files and work with file systems that hold a large number of small files.
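The cost of finding variable-size block boundaries can be seen in a small sketch. The Python below uses a gear-style rolling hash, one common way to do content-defined chunking, chosen here purely for illustration; the table construction, mask, size limits and function name are all assumptions rather than a description of any particular product.

```python
import hashlib

# Assumed lookup table: one pseudo-random 32-bit value per possible byte,
# derived from SHA-256 so the sketch is deterministic.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
    for i in range(256)
]


def variable_chunks(data: bytes, mask: int = 0xFC000000,
                    min_size: int = 16, max_size: int = 256):
    """Cut data into variable-size chunks at content-defined boundaries.

    A boundary is declared when the masked rolling hash is zero (once the
    chunk has reached min_size) or when the chunk reaches max_size.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling update: older bytes shift out of the 32-bit hash over time.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Every byte of input goes through a hash update and a boundary test, which gives a sense of why this style of chunking is heavier than cutting at fixed offsets. Because boundaries follow the content rather than the position, data that has shifted by a few bytes can still tend to produce matching chunks, which is the benefit being paid for.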