How many times have you saved a copy of a file "just in case"? The fear is legitimate, but it can lead to all kinds of complications in the long run. Your systems may end up holding multiple copies of a single file, and you may be left wondering which is the right version, or the latest version, to work on now! Operation clean-up, better known as de-duplication, is the process of eliminating duplicate files on your computers and storing just a single copy of each file in the backup repository.

The process of identifying duplicate files is not simple. There is a "no change" scenario, where the file being backed up matches the seed file exactly. There is also the scenario where the file in the backup set resembles the seed file but is not identical to it. The de-duplication system has to identify the differences and store the file as a new version of the seed file.
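To make the comparison concrete, here is a minimal Python sketch of how exact duplicates and changed files might be told apart using content hashes. The `file_digest` and `classify` helpers and the `seed_index` mapping are hypothetical names used purely for illustration; they are not part of any particular backup product.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def classify(candidate: Path, seed_index: dict[str, str]) -> str:
    """Compare a candidate file against the seed backup's digest index.

    seed_index maps a file name to the digest recorded in the seed
    backup (an assumed, illustrative structure).
    """
    digest = file_digest(candidate)
    recorded = seed_index.get(candidate.name)
    if recorded == digest:
        return "no change"   # exact duplicate of the seed copy
    if recorded is not None:
        return "changed"     # same file, different content -> new version
    return "new"             # not present in the seed backup at all
```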

Obviously, the de-duplication process is initiated by creating a seed backup. Every backup file set thereafter has to be compared with the file sets contained in the seed backup. In "no change" scenarios, duplicate files have to be identified, segregated and deleted. Every deletion has to be logged, and the log constantly updated, to keep track of the de-duplication steps. In "changed file" scenarios, versioning has to kick in, in synchronization with incremental backup processes, so that only the changes to a file are stored separately in the system and the unchanged portions of the file are referentially integrated on call.
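One common way to store "only the changes" is chunk-level de-duplication: each version of a file is recorded as a list of chunk digests, and only chunks not already in the repository are written. The sketch below illustrates the idea under simplifying assumptions (fixed-size chunks, an in-memory `chunk_store` standing in for the repository); real systems typically use variable-size chunking and persistent, logged indexes.

```python
import hashlib

def store_version(data: bytes, chunk_store: dict[str, bytes],
                  chunk_size: int = 4096) -> list[str]:
    """Store one version of a file as a list of chunk digests.

    Only chunks whose digest is not already in chunk_store are written;
    unchanged chunks are simply referenced by their existing digest.
    chunk_store is an illustrative stand-in for the backup repository.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:   # new or changed content only
            chunk_store[digest] = chunk
        recipe.append(digest)           # unchanged chunks are referenced, not re-stored
    return recipe

def restore_version(recipe: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Reassemble a file version from its chunk recipe."""
    return b"".join(chunk_store[d] for d in recipe)
```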

The task assumes gigantic proportions when duplicate files are scattered across computing devices connected to a WAN or LAN and all these devices are backed up to a single backup repository on a remote or local device. The backup process has to find and compare file sets against the seed backup, eliminate duplicates, triplicates and quadruplicates, and store a single copy (or an incremented version) of the document so that it remains accessible to authorized users.
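A toy sketch of how a repository might keep one stored copy per unique content while tracking which device and path refer to it is shown below. The `DedupRepository` class and its `backup` method are assumptions made for illustration; a production repository would also handle access control, deletion of orphaned copies, and the audit logging described above.

```python
import hashlib
from collections import defaultdict

class DedupRepository:
    """Toy single-instance store: one copy per unique content, with
    back-references recording which device/path owns it (illustrative only)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}               # digest -> single stored copy
        self.owners: dict[str, set] = defaultdict(set)  # digest -> {(device, path), ...}

    def backup(self, device: str, path: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:     # first copy seen anywhere on the LAN/WAN
            self.blobs[digest] = data
        self.owners[digest].add((device, path))  # further copies become references
        return digest

repo = DedupRepository()
repo.backup("laptop-1", "/docs/report.docx", b"quarterly numbers")
repo.backup("desktop-2", "/shared/report.docx", b"quarterly numbers")
# Both devices now reference one stored copy instead of two duplicates.
```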

If you pause and ruminate on what has been said above, it is clear that files undergoing de-duplication before or after backup are vulnerable. If the process is not well defined, or the algorithm is poorly crafted or executed, de-duplication can corrupt files and cause data loss. So, check whether your cloud vendor's de-duplication technology is proven. Make sure it is not the Achilles' heel in your backup system!