Cloud Native Infrastructure, Chapter 7 (Part 3)
Application Requirements on Infrastructure
Cloud native applications expect more from infrastructure than just executing a binary. They need abstractions, isolation, and guarantees about how they’ll run and be managed. In return, they are required to provide hooks and APIs that allow the infrastructure to manage them. To be successful, there needs to be a symbiotic relationship.
We defined cloud native applications in Chapter 1 and have just discussed some life cycle requirements. Now let’s look at more expectations they have of an infrastructure built to run them:
- Runtime and isolation (isolation of resources such as CPU, memory, and disk)
- Resource allocation and scheduling
- Environment isolation (e.g., dev, test, beta, and online)
- Service discovery
- State management
- Monitoring and logging
- Metrics aggregation
- Debugging and tracing
All of these should be default options for services or be provided through self-service APIs. We will explain each requirement in more detail to make sure the expectations are clearly defined.
Application Runtime and Isolation
Traditional applications only needed a kernel and possibly an interpreter to run. Cloud native applications still need that, but they also need to be isolated from the operating system and from the other applications running alongside them. Isolation enables multiple applications to run on the same server while keeping control of their own dependencies and resources.
Application isolation is sometimes called multitenancy. That term can be used for multiple applications running on the same server and for multiple users running applications in a shared cluster. The users can be running verified, trusted code, or they may be running code you have no control over and do not trust.
Being cloud native does not require using containers. Netflix pioneered many of the cloud native patterns, and when the company transitioned to running on a public cloud, it used VMs as its deployment artifact, not containers. Function-as-a-service (FaaS) offerings (e.g., AWS Lambda, a commercial serverless product) are another popular cloud native technology for packaging and deploying code. In most cases, they use containers for application isolation, but the container packaging is hidden from the user.
What Is a Container?
There are many different implementations of containers. Docker popularized the term container to describe a way to package and run an application in an isolated environment. Fundamentally, containers use kernel primitives or hardware features to isolate processes on a single operating system.
Levels of container isolation can vary, but usually it means the application runs with a root filesystem, namespaces, and resource allocation (e.g., CPU and RAM) that are isolated from other processes on the same server. The container format has been adopted by many projects and led to the creation of the Open Container Initiative (OCI), which defines standards for how to package and run an application container.
Isolation also puts a burden on the engineers writing the application. They are now responsible for declaring all software dependencies. If they fail to do so, the application will not run, because undeclared libraries will not be available inside the isolated environment.
Containers are often chosen for cloud native applications because better tooling, processes, and orchestration tools have emerged for managing them. While containers are currently the easiest way to implement runtime and resource isolation, this has not always been the case and likely will not always be.
Resource Allocation and Scheduling
Historically, applications would provide rough estimates of minimum system requirements, and it was the responsibility of a human to figure out where the application could run. With human scheduling, it can take a long time to prepare the operating system and dependencies the application needs to run.
The deployment can be automated through configuration management and provisioning, but it still requires a human to verify resources and tag a server to run the application. Cloud native infrastructure relies on dependency isolation and allows applications to run wherever resources are available.
With isolation, as long as a system has available processing, storage, and access to dependencies, applications can be scheduled anywhere. Dynamic scheduling removes the human bottleneck from decisions that are better left to machines. A cluster scheduler gathers resource information from all systems and figures out the best place to run the application.
Having humans schedule application placement doesn’t scale. Humans get sick, take vacations (or at least they should), and are generally bottlenecks. As scale and complexity increase, it also becomes impossible for a human to remember where applications are running. Many companies try to scale by hiring more people. This exacerbates the problem because scheduling then needs to be coordinated between multiple people. Eventually, human scheduling resorts to keeping a spreadsheet (or similar solution) of where each application runs.
Dynamic scheduling doesn’t mean operators have no control. There are still ways an operator can override or force a scheduling decision based on knowledge the scheduler may not have. Overrides and manual resource scheduling should be provided via an API and not a meeting request.
Solving these problems is one of the main reasons Google wrote its internal cluster scheduler named Borg. In the Borg research paper, Google points out that: “Borg provides three main benefits: it (1) hides the details of resource management and failure handling so its users can focus on application development instead; (2) operates with very high reliability and availability, and supports applications that do the same; and (3) lets us run workloads across tens of thousands of machines effectively.”
The role of a scheduler in any cloud native environment is much the same. Fundamentally, it needs to abstract away the many machines and allow users to request resources, not servers.
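To make "request resources, not servers" concrete, here is a minimal sketch of a scheduler placing a resource request. The `Node` and `Request` types and the "most free CPU wins" rule are assumptions chosen for illustration; real schedulers such as Borg weigh many more signals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float    # free CPU cores (hypothetical unit)
    free_mem_mb: int   # free memory in MiB


@dataclass
class Request:
    app: str
    cpu: float
    mem_mb: int


def schedule(request: Request, nodes: List[Node]) -> Optional[str]:
    """Place the request on the fitting node that has the most free CPU."""
    candidates = [
        n for n in nodes
        if n.free_cpu >= request.cpu and n.free_mem_mb >= request.mem_mb
    ]
    if not candidates:
        return None  # nothing fits; a real scheduler might queue the request
    best = max(candidates, key=lambda n: (n.free_cpu, n.free_mem_mb))
    # Reserve the resources so the next request sees the updated capacity.
    best.free_cpu -= request.cpu
    best.free_mem_mb -= request.mem_mb
    return best.name


nodes = [Node("node-a", 2.0, 4096), Node("node-b", 8.0, 16384)]
placement = schedule(Request("web", cpu=1.5, mem_mb=2048), nodes)
print(placement)  # node-b
```

The caller never names a server: it states what it needs, and the placement decision stays with the scheduler.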
Environment Isolation
When applications are made of many services, infrastructure needs to provide a way to isolate them, together with all of their dependencies, in a well-defined manner. Separating dependencies is traditionally managed by duplicating servers, networks, or clusters into development or testing environments. Infrastructure should be able to logically separate dependencies through application environments without full cluster duplication.
Logically splitting environments allows for better utilization of hardware, less duplication of automation, and easier testing for the application. On some occasions, a separate testing environment is required (e.g., when low-level changes need to be made). However, application testing is not a situation where a full duplicate infrastructure should be required.
Environments can be traditional permanent dev, test, stage, and production, or they can be dynamic and based on a branch or commit. They can even be segments of the production environment with features enabled via dynamic configuration and selective routing to the instances.
Environments should consist of all the data, services, and network resources needed by the application. This includes things such as databases, file shares, and any external services. Cloud native infrastructure can create environments with very low overhead.
Infrastructure should be able to provision the environment however it’s used. Applications should follow best practices that allow flexible configuration to support environments, and they should discover the endpoints of supporting services through service discovery.
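One common way to get this flexibility is to read environment-specific settings at start-up instead of hard-coding them. The sketch below assumes hypothetical variable names (`APP_ENV`, `DATABASE_URL`) and a made-up internal hostname pattern; none of these come from the text.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    environment: str
    database_url: str


def load_settings(env=os.environ) -> Settings:
    """Build settings from the process environment (names are hypothetical)."""
    environment = env.get("APP_ENV", "dev")
    # Fall back to a per-environment default when no explicit URL is injected.
    default_db = f"postgres://db.{environment}.internal:5432/app"
    return Settings(environment, env.get("DATABASE_URL", default_db))


settings = load_settings({"APP_ENV": "test"})
print(settings.database_url)  # postgres://db.test.internal:5432/app
```

The same build artifact can then run unchanged in dev, test, or production, with the infrastructure injecting the values.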
Service Discovery
Applications almost certainly depend on one or more services to provide business benefit. It is the responsibility of the infrastructure to provide a way for services to find each other on a per-environment basis.
Some service discovery mechanisms require applications to make an API call, while others work transparently through DNS or network proxies. It does not matter which tool is used, but it is important that services use service discovery.
While service discovery is one of the oldest network services (e.g., ARP and DNS), it is often overlooked and underutilized. Statically defining service endpoints in a per-instance text file or in code is not scalable and not suitable for a cloud native environment. Endpoint registration and deregistration should happen automatically as services are created and as endpoints become available or go away.
Cloud native applications work together with infrastructure to discover the services they depend on. Discovery mechanisms include, but are not limited to, DNS, cloud metadata services, and standalone service discovery tools (e.g., etcd and Consul).
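The registration and lookup flow can be sketched with a small in-memory registry. This is an illustrative stand-in, not the API of etcd, Consul, or any real tool; the class, method names, and addresses are hypothetical.

```python
from collections import defaultdict


class ServiceRegistry:
    """In-memory registry keyed by (environment, service)."""

    def __init__(self):
        self._endpoints = defaultdict(set)  # (env, service) -> {"host:port"}

    def register(self, environment, service, endpoint):
        # Called automatically when an instance becomes available.
        self._endpoints[(environment, service)].add(endpoint)

    def deregister(self, environment, service, endpoint):
        # Called automatically when an instance goes away.
        self._endpoints[(environment, service)].discard(endpoint)

    def lookup(self, environment, service):
        # Return a sorted list so callers get a stable order.
        return sorted(self._endpoints.get((environment, service), ()))


registry = ServiceRegistry()
registry.register("test", "orders", "10.0.0.5:8080")
registry.register("test", "orders", "10.0.0.6:8080")
registry.register("prod", "orders", "10.1.0.9:8080")
registry.deregister("test", "orders", "10.0.0.5:8080")
print(registry.lookup("test", "orders"))  # ['10.0.0.6:8080']
```

Keying on the environment keeps a test instance from ever resolving a production endpoint, which is the per-environment behavior described above.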
State Management
State management is how infrastructure knows what needs to be done, if anything, to an application instance. This is distinctly different from application life cycle because the life cycle applies to applications throughout their development process. States apply to instances as they are started and stopped.
It is the application’s responsibility to provide an API or hook so that its current state can be checked. The infrastructure’s responsibility is to monitor the instance’s current state and act accordingly.
The following are some application states:
- Submitted
- Scheduled (the instance has been assigned for deployment)
- Ready
- Healthy
- Unhealthy
- Terminating
A brief overview of the states and corresponding actions would be as follows:
- An application is submitted to be run.
- The infrastructure checks the requested resources and schedules the application. While the application is starting, it will provide a ready/not ready status.
- The infrastructure will wait for a ready state and then allow for consumption of the application’s resources (e.g., adding the instance to a load balancer). If the application is not ready before a specified timeout, the infrastructure will terminate it and schedule a new instance of the application.
- Once the application is ready, the infrastructure will watch the liveness status and wait for an unhealthy status or until the application is set to no longer run.
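The overview above can be sketched as a small state machine. The state names come from the list; the transition rules, the 30-second timeout, and the function signature are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    SCHEDULED = "scheduled"
    READY = "ready"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    TERMINATING = "terminating"


def next_state(state, *, ready=False, alive=True, stop_requested=False,
               seconds_since_scheduled=0.0, ready_timeout=30.0):
    """Return the state the infrastructure should move the instance to."""
    if stop_requested:
        return State.TERMINATING
    if state is State.SUBMITTED:
        return State.SCHEDULED  # resources were checked and a place was found
    if state is State.SCHEDULED:
        if ready:
            return State.READY  # safe to add to a load balancer
        if seconds_since_scheduled > ready_timeout:
            return State.TERMINATING  # a replacement would be scheduled
        return State.SCHEDULED
    if state in (State.READY, State.HEALTHY):
        return State.HEALTHY if alive else State.UNHEALTHY
    if state is State.UNHEALTHY:
        return State.TERMINATING
    return state


print(next_state(State.SCHEDULED, seconds_since_scheduled=45.0))
# State.TERMINATING
```

Keeping the transitions in one pure function makes the policy easy to test separately from whatever polls the application.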
There are more states than those listed. States need to be supported by the infrastructure if they are to be correctly checked and acted upon. Kubernetes implements application state management through events, probes, and hooks, but every orchestration platform should have similar application management capabilities.
A Kubernetes event is triggered when an application is submitted, scheduled, or scaled. Probes are used to check when an application is ready to serve traffic (readiness) and to make sure an application is healthy (liveness). Hooks are used for actions that need to happen at life cycle boundaries, such as right after a container starts or just before it stops.
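On the application side, supporting probes usually means exposing small HTTP endpoints. The sketch below uses only the Python standard library; the `/healthz` and `/readyz` paths are a common convention assumed here, not something the platform mandates.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()  # set once start-up work has finished


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":        # liveness: the process responds
            status = 200
        elif self.path == "/readyz":       # readiness: safe to send traffic
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr


def probe(port, path):
    """Act like the infrastructure and return the HTTP status of a probe."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), ProbeHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

before = probe(server.server_port, "/readyz")
ready.set()  # the application signals that it finished starting
after = probe(server.server_port, "/readyz")
live = probe(server.server_port, "/healthz")
print(before, after, live)  # 503 200 200

server.shutdown()
server.server_close()
```

Separating the two endpoints lets the platform hold traffic back from a slow-starting instance without restarting it.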
The state of an application instance is just as important as application life cycle management. Infrastructure plays a key role in making sure instances are available and in acting on them accordingly.
Monitoring and Logging
Applications should never have to request to be monitored or logged; monitoring and logging are basic assumptions for running on the infrastructure. More importantly, configuration for monitoring and logging, if required, should be declared as code in the same way that application resource requests are made. If you have all the automation to deploy applications but can’t dynamically monitor services, there is still work to be done.
State management (i.e., process health checks) and logging deal with individual instances of an application. The logging system should be able to consolidate logs based on the applications, environments, tags, or any other useful metadata.
Applications should, as much as possible, not have single points of failure and should have multiple instances running. If an application has 100 instances running, the monitoring system should not trigger an alert when a single instance becomes unhealthy.
Monitoring looks holistically at applications and is used for debugging and verifying desired states. Monitoring is different from alerting, because alerting should be triggered based on the metrics and service-level objectives (SLOs) of the applications.
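As a rough illustration of alerting on a service-level threshold rather than on individual instances, consider the check below. The 90% minimum-healthy ratio is an assumed objective, and the function name is hypothetical.

```python
def should_alert(healthy_instances, total_instances, min_healthy_ratio=0.9):
    """Alert only when the service as a whole drops below its objective."""
    if total_instances == 0:
        return True  # nothing is running at all
    return healthy_instances / total_instances < min_healthy_ratio


print(should_alert(99, 100))  # False: one unhealthy instance is tolerated
print(should_alert(80, 100))  # True: the service is below its objective
```

The unhealthy instance is still visible to monitoring and can be replaced by the infrastructure; it just does not page a human.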
Metrics Aggregation
Metrics are required to know how applications behave when they’re in a healthy state. They can also provide insight into what may be broken when applications are unhealthy. Just like monitoring, metrics collection should be requested as code as part of an application definition.
The infrastructure can automatically gather metrics about resource utilization, but it is the application’s responsibility to expose metrics for its service-level indicators.
While monitoring and logging are application health checks, metrics provide the needed telemetry data. Without metrics, there is no way of knowing whether the application is meeting the service-level objectives that provide business value.
It may be tempting to pull telemetry and health check data from logs, but be careful, because logging requires post-processing and more overhead than application-specific metric endpoints. When it comes to gathering metrics, you want data that is as close to real time as possible. This requires a simple, low-overhead solution that can scale. Logging should be used for debugging, and a delay for data processing should be expected.
As with logging, metrics are usually gathered at the instance level and then aggregated to provide a view of the complete service instead of individual instances.
Once applications present a way to gather metrics, it is the infrastructure’s job to scrape, consolidate, and store the metrics for analysis. Endpoints for gathering metrics should be configurable on a per-application basis, but the data format should be standardized so that all metrics can be viewed in a single system.
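A minimal sketch of the consolidation step, assuming each scraped sample arrives in one standardized shape (the `service`, `instance`, `requests`, and `errors` field names are hypothetical):

```python
def aggregate(samples):
    """Roll per-instance samples up into one view per service."""
    totals = {}
    for sample in samples:
        service = totals.setdefault(
            sample["service"], {"instances": 0, "requests": 0, "errors": 0}
        )
        service["instances"] += 1
        service["requests"] += sample["requests"]
        service["errors"] += sample["errors"]
    for service in totals.values():
        # Derive a service-level indicator from the summed counters.
        requests = service["requests"]
        service["error_rate"] = service["errors"] / requests if requests else 0.0
    return totals


scraped = [
    {"service": "orders", "instance": "orders-1", "requests": 100, "errors": 1},
    {"service": "orders", "instance": "orders-2", "requests": 300, "errors": 3},
    {"service": "billing", "instance": "billing-1", "requests": 50, "errors": 0},
]
summary = aggregate(scraped)
print(summary["orders"]["instances"], summary["orders"]["requests"])  # 2 400
```

Because every sample shares one format, adding a new application requires no change to the aggregation code.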
Debugging and Tracing
Applications are easy to debug during development. Integrated development environments (IDEs), code breakpoints, and running in debug mode are all tools engineers have at their disposal when writing code.
Introspection is much more difficult for deployed applications. This problem is more acute when applications are composed of tens or hundreds of microservices or independently deployed functions. It may also be impossible to have tooling built into applications when services are written in multiple languages and by different teams.
The infrastructure needs to provide a way to debug a whole application and not just the individual services. Debugging can sometimes be done through the logging system, but reproducing bugs requires a shorter feedback loop.
Debugging is a good use of dynamic configuration, discussed earlier. When issues are found, applications can be switched to verbose logging without restarting, and traffic can be routed selectively to those instances through application proxies.
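Switching to verbose logging without a restart can be as simple as adjusting the logger level when new configuration arrives. In this sketch, `apply_config` and the `log_level` key are hypothetical; how the configuration is delivered (a watch, a poll, a signal) is left out.

```python
import io
import logging

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}

stream = io.StringIO()  # stands in for stdout or a log shipper
logger = logging.getLogger("orders")
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False
logger.setLevel(logging.INFO)


def apply_config(config):
    """Apply a dynamic configuration update while the process keeps running."""
    level_name = str(config.get("log_level", "INFO")).upper()
    logger.setLevel(LEVELS.get(level_name, logging.INFO))


logger.debug("cache miss for order 42")   # dropped: level is INFO
apply_config({"log_level": "debug"})      # operator enables verbose logging
logger.debug("cache miss for order 42")   # now emitted
print(stream.getvalue(), end="")
```

Only the instances that receive the update become verbose, which pairs naturally with routing a slice of traffic to them.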
If the issue cannot be resolved through log output, then distributed tracing provides a different interface to visualize what is happening. Distributed tracing systems such as OpenTracing can complement logs to help humans debug problems. (OpenTracing provides a platform-neutral, vendor-neutral API that lets developers add or swap tracing implementations easily, and it aims to give distributed tracing a common set of concepts and data standards.)
Tracing provides shorter feedback loops for debugging distributed systems. If it cannot be built into applications, it can be done transparently by the infrastructure through proxies or traffic analysis. When you are running any coordinated applications at scale, it is a requirement that the infrastructure provide a way to debug applications.
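The core idea behind tracing, a shared identifier carried across service calls, can be sketched without any tracing library. The `x-trace-id` header name and the `traced` helper are assumptions for illustration, not part of OpenTracing.

```python
import time
import uuid

spans = []  # stands in for a trace collector


def traced(service, headers, work):
    """Run work under the caller's trace ID, creating one if none was passed."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    start = time.perf_counter()
    result = work({"x-trace-id": trace_id})  # propagate the ID downstream
    spans.append({
        "trace_id": trace_id,
        "service": service,
        "seconds": time.perf_counter() - start,
    })
    return result


def backend(headers):
    return traced("backend", headers, lambda h: "ok")


def frontend(headers):
    return traced("frontend", headers, backend)


frontend({})
print([s["service"] for s in spans])  # ['backend', 'frontend']
```

A proxy could attach and forward the same header on behalf of services that cannot be modified, which is the transparent option mentioned above.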
While there are many benefits and implementation details for setting up tracing in a distributed system, we will not discuss them here. Application tracing has always been important, and it is increasingly difficult in a distributed system.
Conclusion
Application requirements have changed: a server with an operating system and package manager is no longer enough. Applications now require coordination of services and higher levels of abstraction. The abstractions allow resources to be separated from servers and consumed programmatically as needed.
The requirements laid out in this chapter are not all the services that infrastructure can provide, but they are the basis for what cloud native applications expect. If the infrastructure does not provide these services, then applications will have to implement them, or they will fail to reach the scale and velocity required by modern business.
Infrastructure won’t evolve on its own; people need to change their behavior and fundamentally rethink what it takes to run an application. Luckily, there are projects that build on the experience of companies that have pioneered these solutions.
Applications depend on the features and services of infrastructure to support agile development. Infrastructure requires applications to expose endpoints and integrations so that they can be managed autonomously. Engineers should use existing tools when possible and build with the goal of designing resilient, simple solutions.