Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems (repost)

Original paper: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf

Introduction

We found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design.

 

We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs.
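The three rules are spelled out later in this post (empty or log-only handlers, aborting on an overly general exception, and TODO/FIXME placeholders left in the handler). As a rough, hypothetical sketch of what a checker for the first rule might look like (this is not the authors' Aspirator implementation), the open-source JavaParser library can be used to flag catch blocks that are empty or contain nothing but logging; the heuristic below is illustrative only.

```java
// Hypothetical sketch of a checker for one Aspirator-style rule: warn on catch
// blocks that are empty or contain nothing but logging. Built on the open-source
// JavaParser library; this is NOT the authors' Aspirator implementation.
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.stmt.CatchClause;
import com.github.javaparser.ast.stmt.Statement;

import java.nio.file.Path;

public class EmptyHandlerChecker {
    public static void main(String[] args) throws Exception {
        CompilationUnit cu = StaticJavaParser.parse(Path.of(args[0]));
        for (CatchClause cc : cu.findAll(CatchClause.class)) {
            var statements = cc.getBody().getStatements();
            // Crude heuristic: a statement "only logs" if it mentions a logger.
            boolean onlyLogs = !statements.isEmpty() && statements.stream()
                    .allMatch(EmptyHandlerChecker::looksLikeLogging);
            if (statements.isEmpty() || onlyLogs) {
                int line = cc.getBegin().map(p -> p.line).orElse(-1);
                System.out.printf("WARN: empty or log-only handler for %s at line %d%n",
                        cc.getParameter().getType(), line);
            }
        }
    }

    private static boolean looksLikeLogging(Statement s) {
        String text = s.toString();
        return text.contains("LOG.") || text.contains("log.") || text.contains("logger.");
    }
}
```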

 

Real-world distributed systems inevitably experience outages.

Given that many of these systems were designed to be highly available, generally developed using good software engineering practices, and intensely tested, this raises the questions of why these systems still experience failures and what can be done to increase their resiliency.

Our goal is to better understand the specific failure manifestation sequences that occurred in these systems in order to identify opportunities for improving their availability and resiliency. Specifically, we want to better understand how one or multiple errors (defined in the note at the end of this post) evolve into component failures and how some of them eventually evolve into service-wide catastrophic failures.

Individual elements of the failure sequence have previously been studied in isolation, including categorizations of root causes [33, 52, 50, 56], different types of causes including misconfigurations [43, 66, 49], bugs [12, 41, 42, 51], and hardware faults [62], as well as the failure symptoms [33, 56]. Many of these studies have had a significant impact in that they led to tools capable of identifying many bugs (e.g., [16, 39]). However, the entire manifestation sequence connecting them is far less well understood.

Overall, we found that the error manifestation sequences tend to be relatively complex: more often than not, they require an unusual sequence of multiple events with specific input parameters from a large space to lead the system to a failure. This is perhaps not surprising considering that these systems have undergone thorough testing using unit tests, random error injection [18], and static bug-finding tools such as FindBugs [32], and they are deployed widely and in constant use at many organizations. But it does suggest that top-down testing, say using input and error injection techniques, will be challenged by the large input and state space. This is perhaps why these failures escaped the rigorous testing used in these software projects.

 

We further studied the characteristics of a specific subset of failures — the catastrophic failures that affect all or a majority of users instead of only a subset of users. Catastrophic failures are of particular interest because they are the most costly ones for the vendors, and they are not supposed to occur as these distributed systems are designed to withstand and automatically recover from component failures. Specifically, we found that:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.

 

While it is well known that error handling code is often buggy [24, 44, 55], its sheer prevalence among the causes of the catastrophic failures is still surprising. Even more surprising, given that the error handling code is the last line of defense against failures, we further found that in 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of the error handling code.

 

In fact, in 35% of the catastrophic failures, the faults in the error handling code fall into three trivial patterns:

(i) the error handler is simply empty or only contains a log printing statement,

(ii) the error handler aborts the cluster on an overly-general exception, and

(iii) the error handler contains expressions like “FIXME” or “TODO” in the comments.

These faults are easily detectable by tools or code reviews without a deep understanding of the runtime context.
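To make the three patterns concrete, here is a hedged, self-contained illustration; the class and method names are invented for this post and are not taken from the studied systems.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hypothetical code illustrating the three trivial fault patterns; none of the
// names come from HDFS, MapReduce, HBase, Cassandra, or Redis.
public class ErrorHandlingAntiPatterns {
    private static final Logger LOG = Logger.getLogger("demo");

    // Patterns (i) and (iii): the handler is effectively empty -- it only logs --
    // and a leftover "TODO" admits the handling was never finished.
    void flushToDisk() {
        try {
            commitBlock();
        } catch (IOException e) {
            // TODO: retry or propagate the error
            LOG.warning("commit failed: " + e);  // error is swallowed; execution continues
        }
    }

    // Pattern (ii): aborting the process on an overly general exception, so a
    // benign, recoverable error can take the node (or the whole cluster) down.
    void rebalance() {
        try {
            moveRegions();
        } catch (Throwable t) {                  // catches everything, fatal or not
            LOG.severe("unexpected error: " + t);
            System.exit(1);
        }
    }

    private void commitBlock() throws IOException { /* stub */ }
    private void moveRegions() { /* stub */ }
}
```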

 

In another 23% of the catastrophic failures, the error handling logic of a non-fatal error was so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs.
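As a hedged sketch of what such simple testing could look like, the unit test below injects the non-fatal error and checks what the handler does with it; BlockWriter and every other name here are hypothetical, and the test framework assumed is JUnit 5.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.io.IOException;
import org.junit.jupiter.api.Test;

// Sketch of statement-coverage testing of an error handler: force the error
// path and assert on the handler's outcome. All names are hypothetical.
class BlockWriterErrorHandlingTest {

    /** Stand-in component whose flush can be forced to fail. */
    static class BlockWriter {
        private final boolean injectFlushFailure;
        private boolean resourcesReleased = false;

        BlockWriter(boolean injectFlushFailure) { this.injectFlushFailure = injectFlushFailure; }

        void write(byte[] data) throws IOException {
            try {
                flush(data);
            } catch (IOException e) {
                resourcesReleased = true;   // handler under test: clean up, then propagate
                throw e;
            }
        }

        private void flush(byte[] data) throws IOException {
            if (injectFlushFailure) throw new IOException("injected disk error");
        }

        boolean resourcesReleased() { return resourcesReleased; }
    }

    @Test
    void errorHandlerIsExercisedWhenFlushFails() {
        BlockWriter writer = new BlockWriter(true);             // inject the non-fatal error
        assertThrows(IOException.class, () -> writer.write(new byte[8]));
        assertTrue(writer.resourcesReleased(), "handler must release resources on error");
    }
}
```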

 

All of this indicates that the failures can be diagnosed and reproduced in a reasonably straightforward manner, with the primary challenge being sifting through relatively noisy logs.

Methodology and Limitations

The failures studied come from five widely used data-intensive systems:

HDFS, a distributed file system [27];

Hadoop MapReduce, a distributed data-analytic framework [28];

HBase and Cassandra, two NoSQL distributed databases [2, 3];

and Redis, an in-memory key-value store supporting master/slave replication [54].

 

HDFS and Hadoop MapReduce are the main elements of the Hadoop platform, which is the predominant big-data analytic solution [29];

HBase and Cassandra are the two most popular wide-column store systems [30];

Redis is the most popular key-value store system [53].

General Findings

Overall, our findings indicate that the failures are relatively complex, but they identify a number of opportunities for improved testing.

We also show that the logs produced by these systems are rich with information, making the diagnosis of the failures mostly straightforward.

Finally, we show that the failures can be reproduced offline relatively easily, even though they typically occurred on long-running, large production clusters.

Specifically, we show that most failures require no more than 3 nodes and no more than 3 input events to reproduce, and most failures are deterministic. In fact, most of them can be reproduced with unit tests.

Complexity of Failures

 

To expose these failures in testing, we need not only to explore combinations of multiple input events from an exceedingly large event space, but also to explore different permutations of those events.

 

Opportunities for Improved Testing

Starting up services: More than half of the failures require the start of some services. This suggests that the starting of services, especially the more obscure ones, should be more heavily tested. About a quarter of the failures triggered by starting a service occurred on systems that had been running for a long time; e.g., the HBase "Region Split" service is started only when a table grows larger than a threshold. While such a failure may seem hard to test, since it requires a long-running system, it can be exposed intentionally by forcing a start of the service during testing.
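A hedged sketch of the "force the service to start" idea follows: the test starts the service directly instead of waiting for a table to cross the size threshold. The RegionSplitService interface and its fake implementation are invented for this post and are not real HBase APIs; a real test would force the actual system's split service to start.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

// Illustrative only: all names here are hypothetical stand-ins.
class ForcedServiceStartTest {

    interface RegionSplitService {
        void start();          // normally triggered only when a region exceeds its size threshold
        boolean isRunning();
    }

    static class FakeSplitService implements RegionSplitService {
        private boolean running;
        @Override public void start() { running = true; }
        @Override public boolean isRunning() { return running; }
    }

    @Test
    void splitServiceCanBeStartedWithoutALongRunningCluster() {
        RegionSplitService split = new FakeSplitService();
        split.start();                         // force the start; no long-running cluster needed
        assertTrue(split.isRunning());
    }
}
```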


Note:

A fault is the initial root cause, which could be a hardware malfunction, a software bug, or a misconfiguration. A fault can produce abnormal behaviors referred to as errors, such as system call error returns or Java exceptions. Some of the errors will have no user-visible side effects or may be appropriately handled by the software; other errors manifest into a failure, where the system malfunction is noticed by end users or operators.
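A compact, hypothetical illustration of that chain: a misconfigured path (fault) produces an IOException (error); the handler swallows it, and the malfunction only becomes visible to the user later (failure). None of the names below come from the studied systems.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example of the fault -> error -> failure chain.
public class FaultErrorFailureDemo {
    private byte[] metadata;   // stays null when loading fails

    void load(String configuredPath) {
        try {
            // FAULT: the operator configured a path that does not exist (misconfiguration).
            metadata = Files.readAllBytes(Path.of(configuredPath));
        } catch (IOException e) {
            // ERROR: explicitly signaled, but incorrectly handled (silently swallowed).
        }
    }

    int metadataSize() {
        return metadata.length;   // FAILURE: user-visible crash, far from the root cause
    }

    public static void main(String[] args) {
        FaultErrorFailureDemo demo = new FaultErrorFailureDemo();
        demo.load("/no/such/path");
        System.out.println(demo.metadataSize());   // throws NullPointerException
    }
}
```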

 
