[Repost] Graceful termination on Container Apps

Overview

Graceful shutdown refers to the “window of opportunity” an application has to programmatically clean up logic, connections, or other application behavior after SIGTERM is sent to the container(s) in a pod.

SIGTERM is one of the standard POSIX signals - a signal sent to process(es) requesting them to shut down (it can be caught or ignored, unless SIGKILL is sent) - more on this can be read here

This may be logic such as:

  • Closing database connections
  • Waiting for any long running operations to finish
  • Clearing out a message queue
  • Ensuring any file handles or file operations are cleaned
  • etc.

This “window” is beneficial for applications where interrupting these kinds of operations or behaviors could adversely impact systems, user experience, or other aspects of a program.
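
As a rough sketch (not from the original article - the db handle and inFlight wait group are hypothetical names), this is what such cleanup might look like in Go when SIGTERM arrives:

package main

import (
    "database/sql"
    "os"
    "os/signal"
    "sync"
    "syscall"
)

// Hypothetical application state - e.g. a pool opened with sql.Open and a
// WaitGroup that tracks long running operations
var (
    db       *sql.DB
    inFlight sync.WaitGroup
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM)

    // Block until SIGTERM arrives - everything below must complete within
    // terminationGracePeriodSeconds, otherwise SIGKILL follows
    <-sigs

    inFlight.Wait() // wait for any long running operations to finish
    if db != nil {
        db.Close() // close database connections
    }
    os.Exit(0)
}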

SIGTERM can be sent to containers in pods when scaling in, when restarting a specific revision (which causes new pods/replicas to be created for that revision), or in the other scenarios described in Container Apps - Demystifying restarts

NOTE: It is possible for containers to immediately receive SIGKILL (exit code 137), which forcefully kills processes (i.e. containers), but that’s not described here. See Container Apps - Backoff restarts and container exits

This behavior is dictated by the property terminationGracePeriodSeconds - which can be set through ARM or the Azure Portal on Container Apps

  • Portal: Go to the Revisions blade -> Create a new revision

[Image: Azure Portal - graceful termination period setting when creating a new revision]

  • ARM: This is set under the resources.properties.template.terminationGracePeriodSeconds property:

          "template": {
            "terminationGracePeriodSeconds": 35,
            "containers": [
              {
                "image": "someregistry.com/image:tag",
                "name": "some-container",
                "resources": {
                  "cpu": 0.5,
                  "memory": "1.0Gi"
                },
              ....

terminationGracePeriodSeconds ultimately comes from Kubernetes - a more detailed explanation of pod termination can be found here - Pod Lifecycle - Kubernetes

You can only set a maximum value of 600 seconds (10 minutes) for terminationGracePeriodSeconds. If an application needs upwards of 10 minutes, or more, to run its cleanup logic, this can pose challenges - especially if the application is scaling out to many replicas (or even just a few). It is heavily recommended to revisit the application's design around cleanup logic to reduce this:

  • Additionally, since the pod (and therefore the container(s) within it) will still exist, if many pods are pending termination for minutes at a time while new pods/replicas are created, this can start presenting resource contention issues - depending on how many resources already exist within the environment

Below is an overview of what this looks like in the pod lifecycle - “Window to shut down the application” is the value a user defines in terminationGracePeriodSeconds, and represents the window to clean up logic before a SIGKILL is sent:

[Image: Pod lifecycle - the window to shut down the application before SIGKILL is sent]
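
If cleanup time is a concern, one pattern (a sketch, not from the original article - the drain function and the 30 second budget are assumptions) is to bound the cleanup itself with a timeout comfortably below terminationGracePeriodSeconds, so the process can still exit cleanly before SIGKILL arrives:

package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

// drain is a hypothetical cleanup function that respects context cancellation
func drain(ctx context.Context) error {
    select {
    case <-time.After(5 * time.Second): // pretend the cleanup takes 5 seconds
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM)
    <-sigs

    // Allow at most 30 seconds of cleanup - below the
    // terminationGracePeriodSeconds of 35 used in the ARM example above
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := drain(ctx); err != nil {
        log.Printf("cleanup did not finish in time: %v", err)
    }
    os.Exit(0)
}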

Logging

You may not always see a message regarding which exit code a container exited with (in this case, 137 or 143) - but you can get an idea of when SIGTERM was sent to the container by looking in the ContainerAppSystemLogs_CL Log Analytics table (or the Azure Monitor equivalent, ContainerAppSystemLogs).

ContainerAppSystemLogs_CL
| where ContainerAppName_s =~ "some-container"
| where Reason_s == "StoppingContainer"
| project TimeGenerated, Log_s, Reason_s

Using something like the above query, we can find a message like this:

TimeGenerated [UTC]         Log_s                                Reason_s
5/27/2024, 7:26:42.894 PM   Stopping container some-container    StoppingContainer

If our application happens to be writing to stdout when a SIGTERM is received, we can correlate these two events: seeing Stopping container [container-name] also means that a SIGTERM has been sent to the container(s) running in that pod or replica. Note the timestamps.

ContainerAppConsoleLogs_CL
| where ContainerAppName_s =~ "some-container"
| project TimeGenerated, Log_s

TimeGenerated [UTC]          Log_s
5/27/2024, 7:26:42.345 PM    {"level":"warn","ts":1716838001.9346442,"caller":"app/main.go:36","msg":"SIGTERM received.., shutting down the application.."}

Catching signals

There are various ways to catch signals, depending on the language. It’s heavily advised to test this on your local machine first. You can mimic a SIGTERM by running a container on your local machine and then running something like docker stop [container_id] (if you’re using Docker - docker stop sends SIGTERM and then SIGKILL after a grace period, 10 seconds by default), or the relevant stop command for your container runtime.

Below are some quick examples:

Go:

Below is an example with Fiber

...other code
func main() {
    app := fiber.New()

    app.Get("/", controllers.IndexController)
    // Run this as a goroutine in your application entrypoint
    signalChannel := make(chan os.Signal, 2)
    signal.Notify(signalChannel, os.Interrupt, syscall.SIGTERM)
    go func() {
        sig := <-signalChannel
        switch sig {
        case syscall.SIGTERM:
            zap.L().Info("Caught SIGTERM..")
            zap.L().Info("Calling os.Exit(0)..")
            os.Exit(0)
        }
    }()
    app.Listen(":3000")
}
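
Note that os.Exit(0) ends the process without draining in-flight requests. A variation (a sketch, assuming Fiber v2 and not part of the original example) is to call app.Shutdown() from the signal handler so active connections are allowed to finish before app.Listen returns:

    go func() {
        <-signalChannel
        zap.L().Info("Caught SIGTERM, shutting down the Fiber server..")
        // Shutdown waits for active connections to finish before returning
        if err := app.Shutdown(); err != nil {
            zap.L().Error("Error during shutdown", zap.Error(err))
        }
    }()

    // app.Listen returns after Shutdown completes, then main exits normally
    app.Listen(":3000")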

Node:

Below is an example with Express

import express from "express";
import { homeController } from "./controllers/indexController.js";

const port = process.env.PORT || 3000;
const app = express()

app.use(homeController)

process.on("SIGTERM", () => {
    console.log("SIGTERM received, exiting application with exit(0)");
    process.exit(0);
});

app.listen(port, () => {
    console.log(`Server listening on port ${port}`)
})

Python:

Below is an example with Flask

import signal
import sys
from flask import Flask, jsonify

app = Flask(__name__)

def shutdown_function(signum, frame):
    print('Received SIGTERM, exiting with exit(0)')
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown_function)

@app.route('/')
def index():
    return jsonify({'message': 'sigterm-handlers-python'})

Java:

Below is an example with Spring Boot - note, there are lifecycle hook methods/annotations (such as @PreDestroy) that can be used when SIGTERM is received as well.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import sun.misc.Signal;
import sun.misc.SignalHandler;

@SpringBootApplication
public class AzureApplication {

	private static void addSignalHandler() {
		SignalHandler signalHandler = new SignalHandlerImpl();
		Signal.handle(new Signal("TERM"), signalHandler);
	}

	private static class SignalHandlerImpl implements SignalHandler {

		@Override
		public void handle(Signal signal) {
			switch (signal.getName()) {
				case "TERM":
					System.out.println("Caught signal SIGTERM, exiting application with exit(0)");
					System.exit(0);
					break;
				default:
					break;
			}
		}
	}

	public static void main(String[] args) {
		SpringApplication.run(AzureApplication.class, args);
		addSignalHandler();
	}
}

Dotnet:

Below is an example with .NET 8 - note that the default host automatically handles SIGTERM - and there are various ways to hook into lifecycle events through .NET once SIGTERM has been sent. This example shows how to listen for the signal - if an explicit exit is not called, the default host will handle the shutdown

using System.Runtime.InteropServices;

var builder = WebApplication.CreateBuilder(args);

var app = builder.Build();
... other code

// Listen for signals
PosixSignalRegistration.Create(PosixSignal.SIGTERM, (ctx) =>
{
    Console.WriteLine("Caught SIGTERM, default host is shutting down");
});

app.Run();

Signals aren’t being received by the application process

For users who have a Dockerfile that uses ENTRYPOINT to invoke a shell script, like:

ENTRYPOINT [ "/usr/src/app/init_container.sh" ]

You’ll notice that SIGTERM may not be received by the application process. This is because the shell (through init_container.sh) is now PID 1.

    7     1 root     S    10.2g  62%   6   0% node server.js
   18     0 root     S     1684   0%   3   0% /bin/sh
    1     0 root     S     1616   0%   2   0% {init_container.} /bin/sh /usr/src/app/init_container.sh

This causes the signal to not propagate to the application process. More importantly, this would happen on a local machine as well, or anywhere a container may run. To circumvent this, try:

  • Change ENTRYPOINT [ "/usr/src/app/init_container.sh" ] to something like ENTRYPOINT [ "node", "server.js" ]
  • If you don’t want the application process as PID 1, use an init manager like tini. You can then use it like: ENTRYPOINT ["/sbin/tini", "--", "node", "server.js"]
    • If we look below, we can see that the node process is now not PID 1 - but it still receives the SIGTERM signal properly
    7     1 root     S    10.2g  62%   7   0% node server.js
   18     0 root     S     1684   0%   2   0% /bin/sh
   24    18 root     R     1612   0%   2   0% top
    1     0 root     S      804   0%   2   0% /sbin/tini -- node server.js

NOTE: Invocation of tini depends on the OS/distro and how it was installed - this is due to differences in where binaries are installed across distributions and installation methods

  • You can also use CMD [ "/usr/src/app/init_container.sh" ], and then within the shell script, use exec to invoke your application entrypoint - such as exec node server.js:

    #!/bin/sh
    
    exec node server.js
    

