Envoy fault-handling mechanisms

 

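1. Upstream health checking

Health checking prevents traffic from being routed to upstream hosts that are not working properly.

Active checking: Envoy itself sends probe requests and judges the health of each upstream host from the responses.

Passive checking: Envoy detects unhealthy hosts from abnormal behavior in regular traffic (outlier detection),

 e.g. consecutive 5xx responses, consecutive responses from a subset of 5xx (such as 502, 503), or the success rate.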

2. Active health check example

admin:
  profile_path: /tmp/envoy.prof
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: webservice
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: web_cluster_01 }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: web_cluster_01
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: web_cluster_01
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webserver-0.envoy, port_value: 80 }
        - endpoint:
            address:
              socket_address: { address: webserver-1.envoy, port_value: 80 }
    health_checks:
    - timeout: 5s
      interval: 10s
      unhealthy_threshold: 2
      healthy_threshold: 2
      http_health_check:
        path: /livez
        expected_statuses:
          start: 200
          end: 400  # end is exclusive, so this accepts 200-399

 

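The health_checks block added to the cluster makes Envoy actively request /livez on each upstream host and inspect the response status. Codes 200-399 count as healthy; after 2 consecutive failed checks Envoy marks the backend unhealthy and stops forwarding traffic to it.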

For example, requests succeed at first.

The health check shows that both backend nodes are healthy.

 

Now set the /livez value on one upstream host to something other than OK, for example:

 curl -X POST -d 'livez=failed' http://webserver-0.envoy/livez 

The upstream's /livez endpoint now responds with status code 506.

 

The Envoy admin interface also shows the node as unhealthy.
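The same state can be read from the command line through the admin interface's /clusters endpoint, which prints one `cluster::host::stat::value` line per statistic. The following is a minimal sketch, not the author's method: `unhealthy_hosts` is a hypothetical helper name, and the admin address `localhost:9901` is an assumption based on the port configured above. It splits fields on `::`, so it does not handle IPv6 host addresses.

```shell
# Hypothetical helper: read the text output of Envoy's admin /clusters
# endpoint on stdin and print every host whose health_flags value is not
# "healthy", as "host:port flag".
unhealthy_hosts() {
  awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $2, $4 }'
}

# Usage against a running Envoy (ENVOY_ADMIN is an assumed address):
#   curl -s "${ENVOY_ADMIN:-http://localhost:9901}/clusters" | unhealthy_hosts
```

A host that failed active health checking is reported with the flag `/failed_active_hc`; an empty result means every host is currently considered healthy.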

 

Client requests are now answered only by the healthy server.
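One way to confirm this from the client side is to send a batch of requests through the listener and tally the status codes that come back. This is only a sketch: `tally_codes` is a hypothetical helper name, and the listener address `http://localhost:80` is an assumption based on the port configured above.

```shell
# Hypothetical helper: count how often each HTTP status code appears
# (one code per line on stdin) and print the result as "code: count".
tally_codes() {
  sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage against a running Envoy (ENVOY_URL is an assumed address):
#   for i in $(seq 1 20); do
#     curl -s -o /dev/null -w '%{http_code}\n' "${ENVOY_URL:-http://localhost:80}/"
#   done | tally_codes
```

While the failed host is excluded from load balancing, the tally should contain only the codes returned by the remaining healthy server.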

Manually set /livez back to OK; once Envoy's active check sees the host as healthy again, traffic is scheduled to it normally.

 curl -X POST -d 'livez=OK' http://webserver-0.envoy/livez 

3. Passive health check example

admin:
  profile_path: /tmp/envoy.prof
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: webservice
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: web_cluster_01 }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: web_cluster_01
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: web_cluster_01
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webserver-0.envoy, port_value: 80 }
        - endpoint:
            address:
              socket_address: { address: webserver-1.envoy, port_value: 80 }
    outlier_detection:
      consecutive_5xx: 3
      base_ejection_time: 10s
      max_ejection_percent: 10
      enforcing_consecutive_gateway_failure: 10

 

Normal case:

Set /livez on one node, webserver-0, to failed. After 3 consecutive 5xx responses the host is ejected for 10s.
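The 10s applies to the first ejection. As far as the Envoy outlier detection documentation describes it, the ejection time is base_ejection_time multiplied by the number of times the host has been ejected, capped by max_ejection_time. The arithmetic can be sketched as follows; `ejection_seconds` is a hypothetical helper name, the 300s cap is an assumption based on the documented default, and real Envoy versions may adjust the multiplier differently.

```shell
# Sketch of the ejection-duration rule (all values in seconds):
# base_ejection_time * number of ejections, capped at max_ejection_time.
# The function body runs in a subshell so its variables stay local.
ejection_seconds() (
  base=$1
  count=$2
  max=${3:-300}   # assumed default max_ejection_time of 300s
  t=$(( base * count ))
  if [ "$t" -gt "$max" ]; then t=$max; fi
  echo "$t"
)

ejection_seconds 10 1   # first ejection with base_ejection_time: 10s
ejection_seconds 10 3   # third ejection of the same host
```

A host that keeps failing therefore stays out of rotation for progressively longer periods.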

 

 

After the node's /livez is set back to OK, it returns to normal.

Many more parameters can be configured; search online for the details.

posted @ 2022-04-19 17:31  西风发财