Pre-process de-duplication is often referred to as source de-duplication. Source de-duplication removes duplicate data before it reaches the point of transmission to the storage device: all data is piped through the source de-dupe hardware or software before it is considered ready for transmission to the storage device.

Source de-duplication aims to avoid transferring duplicate data over the network to the storage device. The source de-duplication device establishes a connection with the target storage device and evaluates data before initiating the de-duplication process. Synchronization with the destination (target) disk is maintained throughout the operation, so that data already present on the target can be identified and duplicate files eliminated at the source. This saves bandwidth.
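The negotiation described above can be sketched in a few lines of Python. This is a minimal, hypothetical model, not any vendor's protocol: the source splits data into fixed-size chunks, hashes each chunk, asks the target which hashes it already holds, and transmits only the chunks the target lacks. The `TargetStore` class and `CHUNK_SIZE` value are illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size


def chunk_hashes(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks and hash each one."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


class TargetStore:
    """Stands in for the target storage device's chunk index."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes

    def missing(self, hashes):
        """Tell the source which hashes it does not yet hold."""
        return [h for h in hashes if h not in self.chunks]

    def store(self, pairs):
        self.chunks.update(pairs)


def send_deduplicated(data, target):
    """Transfer only the chunks the target is missing; return (sent, total)."""
    pairs = chunk_hashes(data)
    wanted = set(target.missing([h for h, _ in pairs]))
    # Only chunks the target lacks cross the "network".
    payload = {h: c for h, c in pairs if h in wanted}
    target.store(payload)
    return len(payload), len(pairs)


target = TargetStore()
# 12 KB made of two distinct 4 KB patterns: three chunks, two unique.
sent, total = send_deduplicated(b"A" * 8192 + b"B" * 4096, target)
```

Resending the same data afterwards transfers zero chunks, which is exactly the bandwidth saving the paragraph describes.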

Byte-level scans are performed by the source de-dupe hardware or software to identify "changed bytes". Only the changed bytes are sent to the target device, along with pointers to the original files, and indexes are updated with those pointers for ease of recovery. The entire operation happens on the fly and is efficient and accurate, and it is light on processing power at the target compared to post-process de-duplication, which must re-read and de-duplicate data after it has already been written.
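A toy version of this changed-byte scan can be written directly: compare the updated file against the stored original, record each run of differing bytes with its offset, and reconstruct the file later from the original plus the delta. This is a simplified sketch of the idea, assuming the updated file is the same length as or longer than the original; real products use far more sophisticated delta encodings.

```python
def byte_delta(original, updated):
    """Return a list of (offset, bytes) runs where updated differs from
    original. Bytes past the end of the original form one appended run.
    Assumes updated is at least as long as original (a simplification)."""
    deltas = []
    run_start, run = None, bytearray()
    limit = min(len(original), len(updated))
    for i in range(limit):
        if original[i] != updated[i]:
            if run_start is None:
                run_start = i
            run.append(updated[i])
        elif run_start is not None:
            deltas.append((run_start, bytes(run)))
            run_start, run = None, bytearray()
    if run_start is not None:
        deltas.append((run_start, bytes(run)))
    if len(updated) > limit:
        deltas.append((limit, updated[limit:]))
    return deltas


def apply_delta(original, deltas):
    """Reconstruct the updated file: the delta entries act as the
    'pointers plus changed bytes' an index would record for recovery."""
    out = bytearray(original)
    for offset, data in deltas:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

In a real system, only the `deltas` list would cross the network, and the recovery index would map each file version to its base file plus the delta, which is what the pointers in the paragraph refer to.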

Source de-duplication devices have the capacity to classify data in real time. Policy-based device configurations can categorize data at a granular level and filter it even as it passes through the source de-dupe device. Files can be included or excluded on the basis of domain, group, owner, user, path, age, storage type or file type, or on the basis of recovery point objectives or retention periods.

Source de-duplication has its flip side, however. While it reduces the bandwidth needed to transmit data to the destination drive, the process imposes a higher processing load on the client: CPU consumption is estimated to rise by 25-50%, and that may not really be what you are looking for.

Source-based de-duplication nodes may have to be deployed at every connected location. This has cost implications and is certainly more expensive than target de-duplication, in which all de-duplication is done on a single device located at a nodal point on the network.

Finally, the source software may have to be redesigned if the existing software does not support de-duplication algorithms or hardware. This is not the case with target de-duplication, where the de-dupe software and hardware are isolated from the enterprise software and hardware, and no changes need to be made at the source.