HashTag Erasure Codes: From Theory to Practice

8 September 2016

Abstract

Erasure coding is increasingly deployed as an alternative to data replication for fault tolerance in distributed storage systems. Conventional erasure codes such as Reed-Solomon (RS) save storage space, but at the cost of higher repair bandwidth and more complex computations than replication. Minimum-Storage Regenerating (MSR) codes have emerged as a viable alternative to RS codes because they minimize the repair bandwidth while remaining optimal in terms of reliability and storage overhead. Although several MSR code constructions exist, they have so far not been implemented in practice. One of the main reasons is that existing MSR code constructions require a much larger number of I/O operations than RS codes. In this paper, we analyze high-rate MDS codes that are simultaneously optimized in terms of storage, reliability, I/O operations, and repair bandwidth for single and multiple failures of the systematic nodes. The codes were recently introduced in \cite{7463553} without any specific name. Because the hashtag sign \# resembles the procedure used to construct these codes, we call them \emph{HashTag Erasure Codes (HTECs)}. HTECs provide the lowest data-read and data-transfer, and thus the lowest repair time, for an arbitrary sub-packetization level $\alpha$, where $\alpha \leq r^{\lceil \sfrac{k}{r} \rceil}$, among all existing MDS codes proposed for distributed storage. The repair process is linear and highly parallel. Additionally, we show that HTECs are the first high-rate MDS codes that reduce the repair bandwidth for more than one failure. Practical implementations of HTECs in HDFS release 3.0.0-alpha2 demonstrate the great potential of HTECs.
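The bound on the sub-packetization level can be made concrete with a small calculation. The sketch below assumes the usual notation for an $(n, k)$ MDS code, which the abstract does not spell out: $k$ systematic nodes and $r = n - k$ parity nodes. The function name `max_subpacketization` and the $(14, 10)$ and $(9, 6)$ parameters are illustrative choices, not taken from the paper.

```python
def max_subpacketization(n: int, k: int) -> int:
    """Upper bound r^ceil(k/r) on the sub-packetization level alpha.

    Assumes r = n - k is the number of parity nodes (an assumption about
    the notation; the abstract only states the bound itself).
    """
    r = n - k
    if k <= 0 or r <= 0:
        raise ValueError("need n > k > 0")
    # Integer ceiling of k / r without floating-point arithmetic.
    ceil_k_over_r = -(-k // r)
    return r ** ceil_k_over_r


if __name__ == "__main__":
    # Illustrative code parameters only.
    for n, k in [(9, 6), (14, 10)]:
        print(f"({n},{k}): alpha <= {max_subpacketization(n, k)}")
```

Under these assumptions, any $\alpha$ from 1 up to the computed bound is an admissible sub-packetization level in the sense of the inequality above.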
