Michael Nygard on Building Resilient Systems
Original article @ InfoQ.
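- Feature-complete software and production-ready software are different things. Developers often don't know what conditions are like in production, so they don't adequately consider how the software will behave there. For example, in the development environment the load on Server A and Server B may be 1:1, but in production it could be 20:1, in which case Server B may run into trouble. Developers usually don't know this, so they overlook it.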
Hi, my name is Ryan Slobojan. I'm here with Michael Nygard. Michael, what's the difference between feature complete software and production ready software? Is there a difference?
There is definitely a difference. In fact, I think to some extent, the only possible answer for that question is Mu. You have to unask the question. Feature completeness really tells us nothing at all about a software's ability to survive the real world of production. Feature complete tells us that it has passed QA, which means that, by and large, when I click this button, that label gets activated, or when I enter a date, it's in the proper format. It says nothing at all about whether the software will handle continuous traffic from millions of users 4 weeks at a time.
- Circuit Breaker
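- Logs are very useful. Keep monitoring as separate from the business logic as possible, because monitoring policies change often. Many monitoring settings should ideally be configurable at runtime.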
Anywhere there is a pool, definitely track who's blocking and how often, the high-water and low-water marks, and some stats about the number of times things are being checked in and out. Other kinds of health indicators: any place you've got a cache, keep track of how many items are in the cache, what the hit rate is, what the eviction rate is; any place you've got circuit breakers, keep track of how many times the circuit breakers are flipping from an open to a closed state or from closed to open, the current state of all of them, of course, and the thresholds that are configured into them. Those are all useful things to expose through a monitoring and management interface.
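To make the cache indicators concrete, here is a minimal sketch of a cache that counts what the interview suggests tracking. The class name `InstrumentedCache`, its `max_items` parameter, and the `stats()` method are hypothetical names chosen for illustration, not anything from the interview. The sketch reports a raw eviction count and assumes the monitoring system derives a rate from successive samples.

```python
from collections import OrderedDict


class InstrumentedCache:
    """Small LRU cache that counts hits, misses, and evictions (illustrative only)."""

    def __init__(self, max_items=128):
        self.max_items = max_items
        self._items = OrderedDict()  # oldest entry first, most recently used last
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key, default=None):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        return default

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1

    def stats(self):
        """Snapshot suitable for exposing through a monitoring interface."""
        lookups = self.hits + self.misses
        return {
            "items": len(self._items),
            "hit_rate": self.hits / lookups if lookups else 0.0,
            "evictions": self.evictions,
        }


cache = InstrumentedCache(max_items=2)
for key in ("a", "b", "c"):  # the third put evicts "a"
    cache.put(key, key.upper())
cache.get("a")  # miss
cache.get("c")  # hit
print(cache.stats())
```

Keeping the counters inside the cache and exposing only a `stats()` snapshot is one way to keep monitoring separate from business logic: callers use `get` and `put` as usual and never touch the counters.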
It can also be useful to expose controls on these things - for instance, with the circuit breaker, a control to reset it; with a pool, a control to change what the high-water and low-water marks will be. I can think of several cases where we've had an ongoing partial failure mode and we needed to go in and change the maximum number of connections in a connection pool and dial it down, so that the front end system would stop crushing the back end system. That's a very useful kind of control to have at runtime.
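As a rough illustration of a circuit breaker that exposes both its statistics and a reset control, consider the sketch below. It is not Nygard's implementation: the names `CircuitBreaker`, `failure_threshold`, `reset_timeout`, `stats()`, and `reset()` are assumptions made for this example, and the half-open behaviour is simplified to a single trial call once the timeout has elapsed.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker with monitoring counters (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = None
        self.trips = 0   # closed -> open transitions
        self.resets = 0  # open -> closed transitions

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "open" or self._failures >= self.failure_threshold:
                self._trip()
            raise
        if self.state == "open":
            self.reset()  # the trial call succeeded, so close the breaker
        self._failures = 0
        return result

    def _trip(self):
        if self.state != "open":
            self.trips += 1
        self.state = "open"
        self._opened_at = self._clock()  # restart the timeout window

    def reset(self):
        """Operator control: force the breaker back to the closed state."""
        if self.state == "open":
            self.resets += 1
        self.state = "closed"
        self._failures = 0
        self._opened_at = None

    def stats(self):
        """Snapshot suitable for exposing through a management interface."""
        return {
            "state": self.state,
            "trips": self.trips,
            "resets": self.resets,
            "consecutive_failures": self._failures,
            "failure_threshold": self.failure_threshold,
        }


def flaky():
    raise ConnectionError("backend unavailable")


breaker = CircuitBreaker(failure_threshold=2)
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
print(breaker.stats())  # the breaker has tripped open
breaker.reset()         # manual reset, as an operator might do at runtime
print(breaker.stats())
```

In a real system `stats()` and `reset()` would be wired to whatever management interface is in use. The same pattern applies to a pool whose maximum size can be lowered at runtime.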
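- Some problems, if solved during development, save a great deal of maintenance cost later. He runs the numbers: for a site with 1 million hits per day, if each page request takes an extra 250 milliseconds, that seemingly trivial 250 ms adds up to about 70 hours of additional computing time, which requires 4 more servers. Besides the purchase and maintenance costs of the servers, there are license fees and support contracts, staff are needed to maintain the servers, and those staff in turn need to be managed... It works like the butterfly effect.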
If we do that, we will make the decision differently in some cases and we'll make the decision the same way in some cases. By that I mean we'll sometimes choose to incur that ongoing operational cost, and we'll sometimes choose to spend some additional development time to avoid the ongoing operations cost. One of the examples that I use when I talk about capacity is if you're handling, say, web page requests and you have 1 million hits per day - 1 million hits per day is not all that large these days - and each one takes just an extra 250 milliseconds.
First of all, that's going to have an impact on your revenues, and companies like Google and Amazon have identified that very clearly, but secondly an extra 250 milliseconds on 1 million hits per day is about 70 hours of additional computing time, which means roughly you need 4 additional servers to handle the load. 4 additional servers draw power every month, they require administration every month, they may or may not require software licensing every month, they probably have support contracts. Once you get enough administrators, you need managers of administrators to keep the organization in check, so really, that 250 milliseconds per page that seems pretty small in development, translates into a pretty substantial ongoing operations cost.
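The arithmetic behind those figures can be checked with a short calculation. The interview does not say how the "roughly 4 servers" figure was derived. The sketch below assumes each server handles one request at a time and is run at about 75% utilization to leave headroom. Both are assumptions made here, and the function names are hypothetical.

```python
import math


def extra_compute_hours(hits_per_day, extra_seconds_per_hit):
    """Additional compute time per day, in hours."""
    return hits_per_day * extra_seconds_per_hit / 3600


def extra_servers(hits_per_day, extra_seconds_per_hit, target_utilization=0.75):
    """Servers needed to absorb the extra time (assumed utilization model)."""
    hours = extra_compute_hours(hits_per_day, extra_seconds_per_hit)
    usable_hours_per_server = 24 * target_utilization
    return math.ceil(hours / usable_hours_per_server)


hours = extra_compute_hours(1_000_000, 0.250)
print(f"{hours:.1f} extra compute hours per day")  # about 70 hours, as quoted
print(extra_servers(1_000_000, 0.250))             # 4 under the 75% assumption
```

At 100% utilization the same calculation gives 3 servers, so the quoted figure of 4 is consistent with leaving some headroom.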