Intelligent data management can play a big role in reducing the amount of data permanently stored in the cloud's capacity tier while improving data durability, resulting in lower capex and opex for cloud implementers. I will cite two examples of intelligent data management: in-node and across-nodes.
In-node data management techniques such as compression have long been popular for reducing the number of bits stored in backup applications. Likewise, data de-duplication is a technique that improves storage utilization. De-duplication algorithms are applied at the block or file level, either in-flight or at-rest. A typical de-duplication scheme computes a hash of each unit of data to check whether an identical unit is already stored. If a duplicate is found, the new data is discarded and only a new reference to the existing copy is created. Commercial implementations of data de-duplication, for example EMC Data Domain, claim 10x to 30x improvements in storage utilization for backup applications.
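The hash-and-reference scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation; the `BlockStore` class and its method names are invented for the example.

```python
import hashlib

class BlockStore:
    """Toy block-level de-duplicating store: each unique block is
    kept once, and duplicates become references to the stored copy."""

    def __init__(self):
        self.blocks = {}   # hash -> block data, stored exactly once
        self.refs = []     # logical stream: ordered list of hashes

    def put(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:   # first time we see this block
            self.blocks[digest] = block
        self.refs.append(digest)        # always record a reference
        return digest

    def dedup_ratio(self) -> float:
        logical = sum(len(self.blocks[h]) for h in self.refs)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical if physical else 1.0

store = BlockStore()
for chunk in [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]:
    store.put(chunk)
print(len(store.blocks))    # 2 unique blocks physically stored
print(store.dedup_ratio())  # 2.0: logical data is twice the physical data
```

Real products chunk data more cleverly (variable-length chunking, for instance) and must handle hash collisions and persistence, but the core bookkeeping is exactly this: one physical copy per unique hash, plus references.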
The open-source file system ZFS implements in-line block-level data de-duplication, while another open-source file system, Lessfs, implements in-line file-level de-duplication. Commercial de-duplication solutions are also available, such as EMC Data Domain, NetApp, ExaGrid, and Druva.
Erasure coding is a forward-error-correction technique that is successfully employed to reduce the amount of data stored while achieving redundancy across multiple nodes. An erasure code transforms a dataset of X blocks into Z blocks (Z > X), such that the original dataset can be reconstructed from any Y available blocks (X ≤ Y < Z).
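A toy single-parity code (RAID-5 style) shows the reconstruct-from-Y-of-Z idea in its simplest form: X data blocks become Z = X + 1 blocks by adding one XOR parity block, and any X of the Z blocks suffice to rebuild the original. Production systems use stronger codes such as Reed-Solomon that tolerate multiple failures; this sketch only illustrates the principle.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_blocks):
    """X data blocks -> Z = X + 1 blocks (data plus one parity block)."""
    parity = reduce(xor_blocks, data_blocks)
    return data_blocks + [parity]

def reconstruct(blocks, lost_index):
    """Rebuild one lost block: XOR of all survivors recovers it,
    because every data block cancels out except the missing one."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return reduce(xor_blocks, survivors)

data = [b"abcd", b"efgh", b"ijkl"]       # X = 3 data blocks
coded = encode(data)                     # Z = 4 blocks spread across nodes
recovered = reconstruct(coded, lost_index=1)
print(recovered == data[1])              # True: lost block rebuilt
```

Here the storage overhead is Z/X = 4/3, far below the 2x of a single mirror, yet the loss of any one node is survivable.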
Traditionally, redundancy is achieved by mirroring data across systems or sites to ensure availability. Suppose, for example, that a cloud provider’s SLA assures that data will be stored across 3 geographically separate data centers. If the provider uses mirroring, it requires 3 times the raw capacity to deliver the specified usable capacity. Alternatively, the provider may implement similar redundancy with erasure coding, which can require only about 1.6 times the raw capacity for the same usable capacity.
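The arithmetic behind these figures is straightforward. The (10, 16) erasure-code geometry below is an assumed example chosen because it yields the 1.6x overhead cited above; actual providers pick their own X and Z.

```python
usable_tb = 100                        # desired usable capacity, in TB

# 3-way mirroring: one full copy in each of 3 data centers.
mirror_raw = usable_tb * 3             # 300 TB of raw disk

# Erasure coding: X = 10 data blocks expanded to Z = 16 coded blocks
# spread across sites; raw overhead is Z / X = 1.6.
x_data, z_total = 10, 16
ec_raw = usable_tb * z_total // x_data # 160 TB of raw disk

print(mirror_raw, ec_raw)              # 300 160
print(mirror_raw / ec_raw)             # mirroring needs ~1.9x more raw disk
```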
By lowering the volume of data stored using techniques such as compression, data de-duplication and erasure coding, we effectively lower the cost of maintaining data integrity and protection as well. Less hardware needed to store data means lower acquisition and operating costs. In particular, it reduces the power consumed to keep data on disks and the energy required to cool the data center, resulting in a smaller carbon footprint. Faster data replication across slower WANs is an added benefit.
If you have experience with data de-duplication, erasure coding or other computational methods that increase effective capacity, please post a response to this blog or email me directly.