《Cloud Native Infrastructure》CHAPTER 7(2)
Application Life Cycle
云原生应用程序的生命周期与传统应用程序没有什么不同,除了它们的阶段应该由软件管理。
Life cycles for cloud native applications are no different than traditional applications, except their stages should be managed by software.
本章不打算解释管理应用程序中涉及的所有模式和选项。 我们将简要讨论在云原生基础设施上运行云原生应用程序特别受益的几个阶段:部署,运行和销毁。
This chapter is not intended to explain all the patterns and options involved in man‐ aging applications. We will briefly discuss a few stages that particularly benefit from running cloud native applications on top of cloud native infrastructure: deploy, run, and retire.
这些主题并不包含所有的内容,但是应用程序的体系结构,语言和选择的库的不同,存在许多其他书籍和文章可以参考。
These topics are not all inclusive of every option, but many other books and articles exist to explore the options, depending on the application’s architecture, language, and chosen libraries.
Deploy
部署是应用程序最依赖基础设施的一个领域。虽然没有什么能阻止应用程序部署自身,但基础设施仍然可以管理许多其他方面。
Deployments are one area where applications rely on infrastructure the most. While there is nothing stopping an application from deploying itself, there are still many other aspects that the infrastructure manages.
如何进行集成和交付是我们不会在此处讨论的主题,但此Topic中的一些实践很清楚。应用程序部署不仅仅是获取代码并运行它。云原生应用程序旨在由软件在所有阶段进行管理。这包括持续的健康检查以及初始的部署阶段。来尽可能消除技术,流程和原则中的“人”方面的瓶颈。
How you do integration and delivery are topics we will not address here, but a few practices in this space are clear. Application deployment is more than just taking code and running it.
Cloud native applications are designed to be managed by software in all stages. This includes ongoing health checks as well as initial deployments. Human bottlenecks should be eliminated as much as possible in the technology, processes, and policies.
应用程序的部署应该是自动化的,自服务的,并且如果经常开发的话,则应是频繁进行的。它们也应该经过测试,验证,并且平安无事。
Deployments for applications should be automated, self-service, and, if under active development, frequent. They should also be tested, verified, and uneventful.
每一次同时更新应用程序的每个实例,不应该是新版本和功能兼容性升级的解决方案。新功能的开关应该是一个配置项,可以在不重启应用程序的情况下有选择地动态启用。版本升级时,先部分更新,然后通过测试验证,并且当所有测试通过时,以受控方式发布。
Replacing every instance of an application at once is rarely the solution for new versions and features. New features are “gated” behind configuration flags, which can be selectively and dynamically enabled without an application restart. Version upgrades are partially rolled out, verified with tests, and, when all tests pass, rolled out in a controlled manner.
当启用新功能或部署新版本时,应该存在控制流量或隔离应用程序的流量的机制(请参阅附录A)。这可以限制中断的影响,并允许缓慢的发布和更快的反馈循环,以评估应用程序性能和新功能的使用。
When new features are enabled or new versions deployed, there should exist mechanisms to control traffic toward or away from the application (see Appendix A). This can limit outage impact and allows slow rollouts and faster feedback loops for application performance and feature usage.
基础设施应该处理部署软件的所有细节。工程师可以定义应用程序版本,基础设施要求和依赖关系,基础设施将向该状态驱动,直到满足所有要求或要求发生变化。
The infrastructure should take care of all details of deploying software. An engineer can define the application version, infrastructure requirements, and dependencies, and the infrastructure will drive toward that state until it has satisfied all requirements or the requirements change.
Run
应用程序的运行阶段应该是应用程序生命周期中最平静,最稳定的阶段。第1章讨论了运行软件的两个最重要的方面:可观察性以了解应用程序正在做什么,以及能够根据需要更改应用程序的可操作性。
Running the application should be the most uneventful and stable stage of an application’s life cycle. The two most important aspects of running software are discussed in Chapter 1: observability to understand what the application is doing, and operability to be able to change the application as needed.
我们已经通过报告健康和遥测数据在第1章详细介绍了应用程序的可观察性,但是当它无法正常工作时,您会怎么做?如果应用程序的遥测数据表明它不符合SLO,您如何排除故障并调试应用程序?
We already went into detail in Chapter 1 about observability for applications by reporting health and telemetry data, but what do you do when things don’t work as intended? If an application’s telemetry data says it’s not meeting the SLO, how can you troubleshoot and debug the application?
使用云原生应用程序,您不应该通过SSH连接到服务器并分析日志。甚至可能值得考虑是否需要ssh、日志文件或服务器。
With cloud native applications, you should not SSH into a server and dig through logs. It may even be worth considering if you need SSH, log files, or servers at all.
您仍然需要AP),日志数据(云日志记录)和堆栈中某些位置的服务器,但是值得一看,您是否需要传统工具。当程序崩溃时,您需要一种方法来调试应用程序和基础设施组件。
You still need application access (API), log data (cloud logging), and servers somewhere in the stack, but it’s worth going through the exercise to see if you need the traditional tools at all. When things break, you need a way to debug the application and infrastructure components.
在调试损坏的系统时,应首先查看基础设施测试,如第5章中所述。测试应公开未正确配置或未提供预期性能的任何基础设施组件。
When debugging a broken system, you should first look at your infrastructure tests, as explained in Chapter 5. Testing should expose any infrastructure components that are not configured properly or are not providing the expected performance.
仅仅因为您不管理底层基础设施并不意味着基础设施不能成为您的问题的原因。通过测试来验证期望将确保您的基础设施按预期执行。
Just because you don’t manage the underlying infrastructure doesn’t mean the infrastructure cannot be the cause of your problems. Having tests to validate expectations will ensure your infrastructure is performing how you expect.
在排除基础设施因素后,您应该查看应用程序以获取更多信息。应用程序调试的最佳方式是通过应用程序性能管理(APM)以及可能通过OpenTracing等进行分布式应用程序进行跟踪。
After infrastructure has been ruled out, you should look to the application for more information. The best places to turn for application debugging are application performance management (APM) and possibly distributed application tracing via standards such as OpenTracing.
OpenTracing示例,实现和APM超出了本书的范围。作为一个非常简短的概述,OpenTracing允许您跟踪整个应用程序中的调用,以更轻松地识别网络和应用程序通信问题。 OpenTracing的示例可视化可以在图7-1中看到。 APM为您的应用程序添加了工具,用于向收集服务报告指标和错误。
OpenTracing examples, implementation, and APM are out of scope for this book. As a very brief overview, OpenTracing allows you to trace calls throughout the application to more easily identify network and application communication problems. An example visualization of OpenTracing can be seen in Figure 7-1. APM adds tools to your applications for reporting metrics and faults to a collection service.
Figure 7-1. OpenTracing visualization
当测试和跟踪仍然没有暴露问题时,有时您只需要在应用程序上启用更详细的日志记录。但是如何在不破坏问题现场的情况下启用调试?
When tests and tracing still do not expose the problem, sometimes you just need to enable more verbose logging on the application. But how do you enable debugging without destroying the problem?
运行时配置对于应用程序很重要,但在云原生环境中,配置应该是动态的,无需重新启动应用程序。配置选项仍然通过应用程序中的库实现,但标志值应该能够通过集中式协调器,应用程序API调用,HTTP头信息或多种方式动态更改。
Configuration at runtime is important for applications, but in a cloud native environment, configuration should be dynamic without application restarts. Configuration options are still implemented via a library in the application, but flag values should have the ability to dynamically change through a centralized coordinator, application API calls, HTTP headers, or a myriad of ways.
动态配置的两个例子是Netflix的Archaius和Facebook的Gatekeeper。前Facebook工程经理Justin Mitchell在Quora帖子中分享,[Gatekeeper]在代码部署发布中隔离feature。当我们观察用户指标,性能并确保服务到位以便扩展时,功能可能会在几天或几周内发布。
Two examples of dynamic configuration are Netflix’s Archaius and Facebook’s Gatekeeper. Justin Mitchell, a former Facebook engineering manager, shared in a Quora post that:
[Gatekeeper] decoupled feature releasing from code deployment. Features might be released over the course of days or weeks as we watched user metrics, performance, and made sure services were in place for it to scale.
允许对应用程序配置进行动态控制,可以更好地控制公开的feature,并更好地测试部署代码,发布新代码很容易并不意味着它是适合所有情况的正确解决方案
Allowing dynamic control over application configuration enables more control over exposed features and better test coverage of deployed code. Just because pushing new code is easy doesn’t mean it is the right solution for every situation.
基础架构可以通过协调何时启用功能并根据高级网络策略路由流量来帮助解决此问题并启用更灵活的应用程序。此模式还允许更精细的控件和更好地协调全量或回滚方案。
Infrastructure can help solve this problem and enable more flexible applications by coordinating when features are enabled and routing traffic based on advanced net‐ work policies. This pattern also allows finer-grained controls and better coordination of roll-out or roll-back scenarios.
在动态的自服务的环境中,将部署的应用程序数量将迅速增长。您需要确保在类似的自助服务模型中动态调试应用程序以便部署应用程序。
In a dynamic, self-service environment the number of applications that will get deployed will grow rapidly. You need to make sure you have an easy way to dynamically debug applications in a similar self-service model to deploy the applications.
尽管工程师喜欢推出新的应用程序,但反过来很难让他们下线旧的应用程序。即使如此,它仍然是应用程序生命周期中的关键阶段。
As much as engineers love pushing new applications, it is conversely difficult to get them to retire old applications. Even still, it is a crucial stage in an application’s life cycle.
Retire
在快速变化的环境中,部署新的应用程序和服务很常见。注销应用程序应该以与创建它们相同的方式自服务。
Deploying new applications and services is common in a fast-moving environment. Retiring applications should be self-service in the same way as creating them.
如果自动部署和监控新服务和新资源,则也应按相同标准销毁。尽快部署新服务而不删除未使用的服务是最容易产生技术债务的。
If new services and resources are deployed and monitored automatically, they should also be retired under the same criteria. Deploying new services as quickly as possible with no removal of unused services is the easiest way to accrue technical debt.
识别应注销的服务和资源是特定于业务的。您可以使用应用程序的遥测测量中的经验数据来了解是否正在使用应用程序,但是应该由业务部门决定是否注销应用程序。
Identifying services and resources that should be retired is business specific. You can use empirical data from your application’s telemetry measurements to know if an application is being used, but the decision to retire applications should be made by the business.
不需要时,应自动清除基础架构组件(例如,VM实例和负载平衡器端点)。一个自动组件清理的例子是Netflix的Janitor Monkey。该公司在一篇博客文章中解释道:
Infrastructure components (e.g., VM instances and load balancer endpoints) should be automatically cleaned up when not needed. One example of automatic component cleanup is Netflix’s Janitor Monkey. The company explains in a blog post:
Janitor Monkey通过对其应用一组规则来确定资源是否应该是被清理的候选者。如果每一个规则都确定资源是清理候选者,则Janitor Monkey会标记资源并安排时间来清理它。
Janitor Monkey determines whether a resource should be a cleanup candidate by applying a set of rules on it. If any of the rules determines that the resource is a cleanup candidate, Janitor Monkey marks the resource and schedules a time to clean it up.
所有这些应用阶段的目标是让基础设施和软件来管理,而不是传统的由人来管理。我们采用协调模式与组件元数据相结合,不再编写由人工运行的自动化脚本,而是根据当前上下文不断运行并根据需要在高级别上执行的操作做出决策。
The goal in all of these application stages is to have infrastructure and software manage the aspects that would traditionally be managed by a human. Instead of writing automation scripts that run once by a human, we employ the reconciler pattern combined with component metadata to constantly run and make decisions about actions that need to be taken on a high level based on current context.
应用程序生命周期阶段不是应用程序依赖于基础设施的唯一方面。还有一些基础服务,每个阶段的应用程序都依赖于基础设施。我们将在下一节中介绍一些为应用程序提供的支持服务和API基础结构。
Application life cycle stages are not the only aspects where applications depend on infrastructure. There are also some fundamental services for which applications in every stage will depend on infrastructure. We will look at some of the supporting services and APIs infrastructure provides to applications in the next section.