Easily Deploy a Same-City Disaster Recovery Application with the ROS Terraform Managed Service

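Introduction

Enterprises need disaster recovery for business-critical online applications: if the primary data center in a city fails, traffic must be switched to a standby data center that can take over in real time. This article describes how to combine Alibaba Cloud products such as NLB, MSE, and ACK to build a same-city multi-active application.

Resource Orchestration Service (ROS) is Alibaba Cloud's automated deployment service based on the Infrastructure as Code (IaC) approach. By defining a single Terraform template, we can deploy a same-city multi-active application on the cloud.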

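Solution Architecture

The solution implements same-city multi-active deployment with a cloud-native gateway plus MSE registries. The gateway's active health checks enable automatic failover within seconds. The cloud-native gateway can manage multiple ACK clusters and merges services that share a name across clusters, which provides global load balancing even when the clusters are deployed unequally. With one registry deployed per cluster, microservice calls stay within their own availability zone, and the gateway's traffic routing capabilities can then be used for release strategies such as blue-green and canary releases.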
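To make the idea of global load balancing under unequal deployment concrete, here is a minimal sketch of the traffic share each zone would receive. It assumes the gateway spreads requests evenly over all healthy replicas of the merged service; that policy, the function name expected_share, and the zone labels G and I are illustrative assumptions, not a description of the gateway's actual algorithm.

```shell
#!/bin/bash
# Hypothetical helper: expected traffic share (in percent) per zone.
# ASSUMPTION: requests are spread evenly over all healthy replicas of the
# merged service, so a zone's share is proportional to its replica count.
expected_share() {
  local replicas_g=$1 replicas_i=$2
  local total=$(( replicas_g + replicas_i ))
  if (( total == 0 )); then
    echo "no healthy replicas"
    return 1
  fi
  echo "G=$(( 100 * replicas_g / total ))% I=$(( 100 * replicas_i / total ))%"
}

expected_share 2 2   # equal deployment
expected_share 2 0   # zone I has no healthy replicas
```

Under this assumption, two replicas per zone give an even split, and a zone with no healthy replicas receives no traffic.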

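Implementation Steps

  1. Log on to the ROS console and open the "Deploy a same-city disaster recovery application with MSE" page.

  2. Configure the template parameters: select the availability zones and instance types for the ECS instances.

  3. Click [Next], then [Create], and wait for the stack to be created. ROS deploys the ACK clusters in the two availability zones and the MSE gateway.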

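  4. Create sources in the cloud-native gateway
    Log on to the MSE console and create two sources in the cloud-native gateway, associating them with your container clusters in the two availability zones.

  5. Create a service in the cloud-native gateway
    Create a service in the cloud-native gateway and select the service under the corresponding namespace.

    The cloud-native gateway manages services across clusters in a unified way: the service created here automatically merges the same-name services under the same namespace in the two Kubernetes clusters.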

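  6. Create a route in the cloud-native gateway
    Create a route and set the domain name and matching rules.

    Select the target service, then save the route and bring it online.

  7. Verify traffic switching
    The cloud-native gateway dynamically adjusts traffic based on the number of workloads and the health status of the backend clusters. The scenarios below demonstrate automatic traffic switching when the two clusters have equal workload counts, unequal workload counts, and a data center failure.
    Run the following shell script to send requests to the gateway in a loop: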

#!/bin/bash
for (( i = 1; i < 100; i++ )); do
      curl http://{ip}/helloDuohuo     # replace {ip} with the public IP address of your gateway
      sleep 1
      echo
done
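  1. Equal workload counts in the two clusters
    The workload replica counts in the two clusters are kept identical.

    Sending requests with the shell script above shows that traffic is load balanced across availability zone G and availability zone I, with the cluster in each zone handling 50% of the traffic.

  2. Unequal workload counts in the two clusters
    The replica count of the cluster in availability zone I is scaled in to 0.

    The output then shows that the cluster in availability zone G handles all of the traffic.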
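The split described in the two scenarios can be checked by tallying responses per zone. The sketch below assumes that each response body contains the zone ID of the pod that served it (the template passes the zone ID to the pods as the env variable, but the actual response format of /helloDuohuo is not shown in this article). The helper name tally_zones and the sample response lines are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count responses per zone.
# ASSUMPTION: every response body contains a zone ID such as cn-hangzhou-g.
tally_zones() {
  # Extract zone IDs, group identical ones, and print "<zone> <count>".
  grep -oE 'cn-[a-z]+-[a-z]' | sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Canned sample responses; in a real run, pipe the output of the curl loop in.
printf '%s\n' \
  'hello from cn-hangzhou-g' \
  'hello from cn-hangzhou-i' \
  'hello from cn-hangzhou-g' \
  'hello from cn-hangzhou-i' | tally_zones
```

With equal replica counts the two zones should show similar counts; after zone I is scaled in to 0, only zone G should appear.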

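How the Deployment Works

As shown above, ROS makes it quick to deploy all kinds of Alibaba Cloud resources (for example alicloud_vpc, alicloud_vswitch, and alicloud_instance). Read this section if you want to understand how it works.

  1. Write the Terraform template. The template shown below defines:
  • resource: the VPC, VSwitches, Kubernetes clusters, MSE gateway, and other resources.
  • variable: commonly used parameters, such as the availability zones and ECS instance types.
  • output: custom outputs, such as the public IP address of the MSE gateway.
  2. Use this template to create a stack in the ROS console. The Terraform managed service provided by ROS automatically resolves the dependencies between the resources in the template and creates the cloud resources in dependency order. Resources that do not depend on each other are created concurrently, which speeds up deployment. ROS places all resources created in this run into a "stack", so the whole group can be managed together afterwards. For example:
  • Apply a new template to the stack to update the resources in it.
  • Delete the stack to delete all of its resources.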
ROSTemplateFormatVersion: '2015-09-01'
Transform: Aliyun::Terraform-v1.5
Workspace:
  main.tf: |-
    variable "zone_id1" {
      type        = string
      description = <<EOT
      {
        "AssociationProperty": "ZoneId",
        "Label": {
            "zh-cn": "可用区ID1"
        }
      }
      EOT
      default     = "cn-hangzhou-g"
    }

    variable "zone_id2" {
      type        = string
      description = <<EOT
      {
        "AssociationProperty": "ZoneId",
        "Label": {
            "zh-cn": "可用区ID2"
        }
      }
      EOT
      default     = "cn-hangzhou-i"
    }


    variable "instance_type1" {
      description = <<EOT
      {
        "AssociationProperty": "ALIYUN::ECS::Instance::InstanceType",
        "AssociationPropertyMetadata": {
          "ZoneId": "$${zone_id1}"
        },
        "Label": {
            "zh-cn": "ECS实例规格1"
        }
      }
      EOT
      default     = "ecs.g5.xlarge"
    }

    variable "instance_type2" {
      description = <<EOT
      {
        "AssociationProperty": "ALIYUN::ECS::Instance::InstanceType",
        "AssociationPropertyMetadata": {
          "ZoneId": "$${zone_id2}"
        },
        "Label": {
            "zh-cn": "ECS实例规格2"
        }
      }
      EOT
      default     = "ecs.g6e.xlarge"
    }

    resource "random_string" "s" {
      length    = 5
      lower     = true
      min_lower = 5
      special   = false
    }


    resource "alicloud_vpc" "vpc1" {
      vpc_name   = "duohuo-vpc1"
      cidr_block = "10.0.0.0/8"
    }

    resource "alicloud_vswitch" "vswitch1" {
      vpc_id     = alicloud_vpc.vpc1.id
      zone_id    = var.zone_id1
      cidr_block = "10.0.0.0/24"
    }

    resource "alicloud_vswitch" "vswitch2" {
      vpc_id     = alicloud_vpc.vpc1.id
      zone_id    = var.zone_id2
      cidr_block = "10.0.1.0/24"
    }


    resource "alicloud_cs_managed_kubernetes" "k8s1" {
      name               = format("duohuo_k8s1_%s", random_string.s.id)
      cluster_spec       = "ack.pro.small"
      version            = "1.28.9-aliyun.1"
      worker_vswitch_ids = [alicloud_vswitch.vswitch1.id]
      service_cidr       = "192.168.0.0/24"
      pod_cidr           = "192.168.1.0/24"
      node_cidr_mask     = 24
      new_nat_gateway    = true
    }

    resource "alicloud_cs_managed_kubernetes" "k8s2" {
      name               = format("duohuo_k8s2_%s", random_string.s.id)
      cluster_spec       = "ack.pro.small"
      version            = "1.28.9-aliyun.1"
      worker_vswitch_ids = [alicloud_vswitch.vswitch2.id]
      service_cidr       = "192.168.2.0/24"
      pod_cidr           = "192.168.3.0/24"
      node_cidr_mask     = 24
      new_nat_gateway    = true
      depends_on = [alicloud_cs_managed_kubernetes.k8s1]
    }

    resource "alicloud_cs_kubernetes_node_pool" "node_pool1" {
      name                 = "duohuo_node_pool1"
      cluster_id           = alicloud_cs_managed_kubernetes.k8s1.id
      vswitch_ids          = [alicloud_vswitch.vswitch1.id]
      instance_types       = [var.instance_type1]
      system_disk_category = "cloud_essd"
      system_disk_size     = 40
      desired_size         = 5
    }

    resource "alicloud_cs_kubernetes_node_pool" "node_pool2" {
      name                 = "duohuo_node_pool2"
      cluster_id           = alicloud_cs_managed_kubernetes.k8s2.id
      vswitch_ids          = [alicloud_vswitch.vswitch2.id]
      instance_types       = [var.instance_type2]
      system_disk_category = "cloud_essd"
      system_disk_size     = 40
      desired_size         = 5
    }


    resource "alicloud_mse_cluster" "mse1" {
      cluster_specification = "MSE_SC_1_2_60_c"
      cluster_type          = "Nacos-Ans"
      cluster_version       = "NACOS_2_0_0"
      instance_count        = 3
      net_type              = "privatenet"
      pub_network_flow      = "1"
      connection_type       = "slb"
      cluster_alias_name    = format("duohuo_mse1_%s", random_string.s.id)
      mse_version           = "mse_pro"
      vswitch_id            = alicloud_vswitch.vswitch1.id
      vpc_id                = alicloud_vpc.vpc1.id
    }

    resource "alicloud_mse_cluster" "mse2" {
      cluster_specification = "MSE_SC_1_2_60_c"
      cluster_type          = "Nacos-Ans"
      cluster_version       = "NACOS_2_0_0"
      instance_count        = 3
      net_type              = "privatenet"
      pub_network_flow      = "1"
      connection_type       = "slb"
      cluster_alias_name    = format("duohuo_mse2_%s", random_string.s.id)
      mse_version           = "mse_pro"
      vswitch_id            = alicloud_vswitch.vswitch2.id
      vpc_id                = alicloud_vpc.vpc1.id
    }

    data "alicloud_mse_clusters" "mse1" {
      ids = [alicloud_mse_cluster.mse1.id]
    }

    data "alicloud_mse_clusters" "mse2" {
      ids = [alicloud_mse_cluster.mse2.id]
    }


    locals {
      ros_tpl = file("${path.cwd}/ros_tpl")
    }

    resource "alicloud_ros_stack" "stack1" {
      stack_name    = format("duohuo1_%s", random_string.s.id)
      template_body = local.ros_tpl
      parameters {
        parameter_key   = "ClusterId"
        parameter_value = alicloud_cs_managed_kubernetes.k8s1.id
      }
      parameters {
        parameter_key   = "MseAddress"
        parameter_value = data.alicloud_mse_clusters.mse1.clusters.0.intranet_domain
      }
      parameters {
        parameter_key   = "Env"
        parameter_value = var.zone_id1
      }
      depends_on = [alicloud_cs_kubernetes_node_pool.node_pool1]
    }

    resource "alicloud_ros_stack" "stack2" {
      stack_name    = format("duohuo2_%s", random_string.s.id)
      template_body = local.ros_tpl
      parameters {
        parameter_key   = "ClusterId"
        parameter_value = alicloud_cs_managed_kubernetes.k8s2.id
      }
      parameters {
        parameter_key   = "MseAddress"
        parameter_value = data.alicloud_mse_clusters.mse2.clusters.0.intranet_domain
      }
      parameters {
        parameter_key   = "Env"
        parameter_value = var.zone_id2
      }
      depends_on = [alicloud_cs_kubernetes_node_pool.node_pool2]
    }

    resource "alicloud_mse_gateway" "gateway1" {
      gateway_name      = "duohuo_gateway1"
      replica           = 2
      spec              = "MSE_GTW_2_4_200_c"
      vswitch_id        = alicloud_vswitch.vswitch1.id
      vpc_id            = alicloud_vpc.vpc1.id
      internet_slb_spec = "slb.s1.small"
      timeouts {
        create = "1200s"
      }
    }

    output "public_ip" {
      value = alicloud_mse_gateway.gateway1.slb_list[0].slb_ip
    }
  ros_tpl: |-
    {
      "ROSTemplateFormatVersion": "2015-09-01",
      "Parameters": {
        "ClusterId": {
          "Type": "String"
        },
        "MseAddress": {
          "Type": "String"
        },
        "Env": {
          "Type": "String"
        }
      },
      "Resources": {
        "CSClusterApplication": {
          "Type": "ALIYUN::CS::ClusterApplication",
          "Properties": {
            "ClusterId": {
              "Ref": "ClusterId"
            },
            "YamlContent": {
              "Fn::Sub": "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: duohuo-provider\n  labels:\n    app: duohuo-provider\nspec:\n  replicas: 2\n  selector:\n    matchLabels:\n      app: duohuo-provider\n  template:\n    metadata:\n      labels:\n        app: duohuo-provider\n    spec:\n      containers:\n        - env:\n            - name: nacos_address\n              value: '${MseAddress}:8848' \n            - name: env\n              value: ${Env}\n          image: 'registry.cn-hangzhou.aliyuncs.com/jinfengdocker/mse-duohuo-provider:v1'\n          imagePullPolicy: Always\n          name: duohuo-provider\n          ports:\n            - containerPort: 8080\n              protocol: TCP\n          resources:\n            limits:\n              cpu: '2'\n              memory: 1Gi\n            requests:\n              cpu: '0.5'\n              memory: 0.1Gi\n---\napiVersion: apps/v1         \nkind: Deployment\nmetadata:\n  name: duohuo-consumer\n  labels:\n    app: duohuo-consumer\nspec:\n  replicas: 2\n  selector:\n    matchLabels:\n      app: duohuo-consumer\n  template:\n    metadata:\n      labels:\n        app: duohuo-consumer\n    spec:\n      containers:\n        - env:\n            - name: nacos_address\n              value: '${MseAddress}:8848' \n            - name: env\n              value: ${Env}\n          image: 'registry.cn-hangzhou.aliyuncs.com/jinfengdocker/mse-duohuo-consumer:v1'\n          imagePullPolicy: Always\n          name: duohuo-consumer\n          ports:\n            - containerPort: 8080\n              protocol: TCP\n          resources:\n            limits:\n              cpu: '2'\n              memory: 1Gi\n            requests:\n              cpu: '0.5'\n              memory: 0.1Gi\n---\napiVersion: v1\nkind: Service\nmetadata:\n  name: duohuo-consumer-service\nspec:\n  ports:\n    - name: http\n      port: 8080\n      protocol: TCP\n      targetPort: 8080\n  selector:\n    app: duohuo-consumer\n  sessionAffinity: None\n  type: ClusterIP\n"
            },
            "DefaultNamespace": "default"
          }
        }
      }
    }
  .metadata: |-
    {
      "ALIYUN::ROS::Interface": {
        "ParameterGroups": [
          {
            "Parameters": [
              "zone_id1",
              "zone_id2"
            ],
            "Label": {
              "default": {
                "zh-cn": "可用区配置",
                "en": "Zone Configuration"
              }
            }
          },
          {
            "Parameters": [
              "instance_type1",
              "instance_type2"
            ],
            "Label": {
              "default": {
                "zh-cn": "Ecs实例类型配置",
                "en": "Ecs InstanceType Configuration"
              }
            }
          }
        ],
        "TemplateTags": [
          "acs:integrate:landing_zone:mse_disaster_tolerance"
        ]
      }
    }
Metadata:
  ALIYUN::ROS::Interface:
    ParameterGroups:
      - Parameters:
          - zone_id1
          - zone_id2
        Label:
          default:
            zh-cn: 可用区配置
            en: Zone Configuration
      - Parameters:
          - instance_type1
          - instance_type2
        Label:
          default:
            zh-cn: Ecs实例类型配置
            en: Ecs InstanceType Configuration
    TemplateTags:
      - acs:integrate:landing_zone:mse_disaster_tolerance
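
The dependency-ordered creation described in this section can be illustrated with the standard tsort utility, which prints a topological order for a list of "prerequisite dependent" pairs. This is only a sketch of the ordering idea, not how ROS or Terraform implement it. The edges are a hand-picked subset read from the references and depends_on entries in the template above, and the short node names (vpc1, k8s1, and so on) and the function name creation_order are illustrative.

```shell
#!/bin/bash
# Each line is "<prerequisite> <dependent>", read by hand from the template.
# Note the explicit depends_on that makes k8s2 wait for k8s1.
deps='vpc1 vswitch1
vpc1 vswitch2
vswitch1 k8s1
vswitch2 k8s2
k8s1 k8s2
k8s1 node_pool1
k8s2 node_pool2
vswitch1 mse1
vswitch2 mse2
node_pool1 stack1
mse1 stack1
node_pool2 stack2
mse2 stack2
vswitch1 gateway1'

# Print one valid creation order: every resource appears after its prerequisites.
creation_order() {
  printf '%s\n' "$deps" | tsort
}

creation_order
```

Resources with no path between them in this graph, such as mse1 and k8s1, have no ordering constraint, which is what allows them to be created concurrently.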

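Summary

Following the IaC approach, you define a template and let the Terraform managed service provided by ROS deploy it automatically, which makes it efficient to deploy cloud resources and applications such as this same-city disaster recovery application. Compared with manual deployment or deployment through APIs and SDKs, this approach is more efficient and more stable, and it is a best practice for moving services to the cloud.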

posted @ 2024-07-15 17:44  阿里云CloudOps