Reliability of Erasure Coded Storage Systems: A Geometric Approach
We consider the probability of data loss, or equivalently the reliability function, of an erasure coded distributed data storage system. Data loss in such a system depends on both the repair duration and the failure probability of individual disks, and the dependence on the repair duration complicates the calculation of the reliability function. Previous works studied the data loss probability of such systems under the assumption of exponentially distributed disk lifetimes and repair durations, using well-known analytic methods from the theory of Markov processes. These methods yield an estimate of the integral of the reliability function, i.e., the mean time to data loss. Here, we address the problem of calculating the data loss probability directly, under the assumption that the repair duration is constant. After characterizing the error event, we provide an exact calculation as well as an upper bound on the probability of data loss (equivalently, a lower bound on the reliability function), and show that the problem reduces to computing the volumes of specific polytopes determined by the code. Closed-form bounds are given for general codes, along with simulation results.
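The setting above can be illustrated with a small Monte Carlo estimate of the data loss probability over a finite time horizon. This is a minimal sketch, not the authors' simulator. It assumes an (n, k) MDS code that loses data once more than n - k disks are down at the same time. It also assumes independent exponential disk lifetimes with a hypothetical rate `fail_rate`, and a constant repair duration `repair_time` applied to every failed disk in parallel. The function names are illustrative only.

```python
import random


def simulate_loss(n, k, fail_rate, repair_time, horizon, rng):
    """One trial: True if more than n - k disks are down at once before `horizon`.

    Assumptions (illustrative, not from the paper): exponential disk lifetimes,
    constant repair time per failure, repairs run in parallel.
    """
    events = []  # (time, +1 for a failure, -1 for a repair completion)
    for _ in range(n):
        t = rng.expovariate(fail_rate)
        while t < horizon:
            events.append((t, 1))
            events.append((t + repair_time, -1))
            # Disk is back up after repair_time, then runs until its next failure.
            t += repair_time + rng.expovariate(fail_rate)
    # Sort by time; at equal times a repair completion (-1) precedes a failure (+1).
    events.sort()
    down = 0
    for time, delta in events:
        if time >= horizon:
            break
        down += delta
        if down > n - k:  # an MDS code tolerates at most n - k erasures
            return True
    return False


def estimate_loss_probability(n, k, fail_rate, repair_time, horizon, trials, seed=0):
    """Fraction of trials in which data is lost before `horizon`."""
    rng = random.Random(seed)
    losses = sum(
        simulate_loss(n, k, fail_rate, repair_time, horizon, rng) for _ in range(trials)
    )
    return losses / trials


if __name__ == "__main__":
    # Hypothetical parameters: a (6, 4) code, rates in failures per unit time.
    print(estimate_loss_probability(6, 4, fail_rate=0.1, repair_time=1.0,
                                    horizon=10.0, trials=10000))
```

As a sanity check, with k = n any single failure causes loss, so the estimate should approach 1 - exp(-n * fail_rate * horizon). An exact or polytope-volume calculation as described in the abstract would replace this sampling step.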