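Envoy fault handling mechanisms
1. Upstream health checking
Health checking prevents traffic from being routed to upstream hosts that are not working properly.
Active checking: Envoy sends probe requests itself and judges each upstream host's health from the responses.
Passive checking: Envoy detects abnormal hosts from the traffic it already proxies (outlier detection),
for example consecutive 5xx responses, a subset of consecutive 5xx responses (502, 503 and so on), or the success rate.
2. Active health check example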

admin:
  profile_path: /tmp/envoy.prof
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: webservice
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: web_cluster_01 }
          http_filters:
          - name: envoy.filters.http.router
            # recent Envoy releases require an explicit typed_config here
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: web_cluster_01
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: web_cluster_01
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webserver-0.envoy, port_value: 80 }
        - endpoint:
            address:
              socket_address: { address: webserver-1.envoy, port_value: 80 }
    # active health checking: probe /livez on every endpoint
    health_checks:
    - timeout: 5s
      interval: 10s
      unhealthy_threshold: 2
      healthy_threshold: 2
      http_health_check:
        path: /livez
        # expected_statuses is a list of ranges; end is exclusive, so this accepts 200-399
        expected_statuses:
        - start: 200
          end: 400
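Adding health_checks to the upstream cluster makes Envoy actively probe the /livez path on each upstream host. Status codes 200-399 count as healthy. After 2 consecutive failed checks Envoy marks the backend unhealthy and stops forwarding traffic to it.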
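The threshold behaviour described above can be sketched as a small state machine. This is a simplified model written for illustration, not Envoy's implementation: the class name HostHealth and its fields are hypothetical, and real Envoy applies additional rules (for example, depending on the version, certain HTTP responses can mark a host unhealthy immediately, without waiting for unhealthy_threshold).

```python
from dataclasses import dataclass


@dataclass
class HostHealth:
    """Simplified model of one upstream host under active health checking."""

    unhealthy_threshold: int = 2  # failed probes needed to leave the healthy state
    healthy_threshold: int = 2    # passed probes needed to return to it
    healthy: bool = True
    streak: int = 0               # consecutive probes that contradict the current state

    def observe(self, status: int) -> bool:
        """Record one probe result and return the resulting health state."""
        probe_ok = 200 <= status <= 399
        if probe_ok == self.healthy:
            # The probe agrees with the current state, so any streak is broken.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            # Enough contradicting probes in a row: flip the state.
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


host = HostHealth()
# One good probe, two failing probes (506), then two good probes again.
states = [host.observe(code) for code in (200, 506, 506, 200, 200)]
print(states)
```

In this model a single failed probe changes nothing; the host only leaves the rotation on the second consecutive failure, and it needs two consecutive successes to come back.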
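For example, at the start requests are served normally,
and the health check shows both backend nodes as healthy.
Now change the /livez value on one upstream host to something other than OK, for example:
curl -X POST -d 'livez=failed' http://webserver-0.envoy/livez
Requests to that host's /livez now return status code 506.
The Envoy admin interface also shows the node as unhealthy.
Client requests are now answered only by the remaining healthy server.
Manually change the /livez value back to OK. Once Envoy's active check detects that the host is healthy again, traffic is scheduled to it normally:
curl -X POST -d 'livez=OK' http://webserver-0.envoy/livez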
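The health state shown in the admin interface can also be read programmatically. The sketch below assumes the plain-text format of the admin /clusters endpoint, where each host statistic appears on a line of the form cluster::address::stat::value and health_flags is either healthy or a flag such as /failed_active_hc. The function name and the sample text (including its IP addresses) are made up for illustration, and the parser assumes IPv4 addresses.

```python
def parse_health_flags(clusters_text: str, cluster: str) -> dict[str, str]:
    """Map each host address in `cluster` to its health_flags value."""
    flags = {}
    for line in clusters_text.splitlines():
        parts = line.strip().split("::")
        # Host-level lines look like: <cluster>::<ip:port>::<stat>::<value>
        if len(parts) == 4 and parts[0] == cluster and parts[2] == "health_flags":
            flags[parts[1]] = parts[3]
    return flags


# Hand-written sample in the shape of GET /clusters output; the addresses are invented.
SAMPLE = """\
web_cluster_01::default_priority::max_connections::1024
web_cluster_01::10.244.1.5:80::cx_active::0
web_cluster_01::10.244.1.5:80::health_flags::/failed_active_hc
web_cluster_01::10.244.2.7:80::cx_active::1
web_cluster_01::10.244.2.7:80::health_flags::healthy
"""

flags = parse_health_flags(SAMPLE, "web_cluster_01")
unhealthy = sorted(addr for addr, value in flags.items() if value != "healthy")
print(flags)
print(unhealthy)
```

Against a running proxy, the text would come from the admin listener configured earlier (port 9901), for example by fetching http://localhost:9901/clusters and passing the response body to parse_health_flags.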
3. Passive health check example

admin:
  profile_path: /tmp/envoy.prof
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: webservice
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: web_cluster_01 }
          http_filters:
          - name: envoy.filters.http.router
            # recent Envoy releases require an explicit typed_config here
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: web_cluster_01
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: web_cluster_01
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webserver-0.envoy, port_value: 80 }
        - endpoint:
            address:
              socket_address: { address: webserver-1.envoy, port_value: 80 }
    # passive health checking: eject hosts based on the responses to real traffic
    outlier_detection:
      consecutive_5xx: 3
      base_ejection_time: 10s
      max_ejection_percent: 10
      enforcing_consecutive_gateway_failure: 10
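Normal case: requests are distributed across both backends.
Set /livez on one node, webserver-0, to failed. After 3 consecutive 5xx responses the host is ejected for 10s.
After setting that node's /livez back to OK, it returns to normal service.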
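The ejection rule above can be sketched in a few lines. This is a deliberately reduced model under stated assumptions: the class OutlierTracker and its fields are hypothetical, the ejection window is fixed at base_ejection_time (Envoy scales it by the number of times the host has been ejected), and max_ejection_percent and the enforcing_* percentages are ignored.

```python
from dataclasses import dataclass


@dataclass
class OutlierTracker:
    """Reduced model of consecutive-5xx ejection for a single upstream host."""

    consecutive_5xx: int = 3
    base_ejection_time: float = 10.0  # seconds
    streak: int = 0                   # current run of consecutive 5xx responses
    ejected_until: float = 0.0        # time at which the host may serve again

    def is_ejected(self, now: float) -> bool:
        return now < self.ejected_until

    def record(self, status: int, now: float) -> None:
        """Record the status code of one proxied response observed at time `now`."""
        if 500 <= status <= 599:
            self.streak += 1
            if self.streak >= self.consecutive_5xx:
                # Threshold reached: remove the host from rotation for a fixed window.
                self.ejected_until = now + self.base_ejection_time
                self.streak = 0
        else:
            # Any non-5xx response resets the run.
            self.streak = 0


tracker = OutlierTracker()
for second in (0, 1, 2):
    tracker.record(506, now=second)  # the third consecutive 5xx arrives at t=2
print(tracker.is_ejected(now=5))     # inside the ejection window
print(tracker.is_ejected(now=12))    # the window has elapsed
```

Unlike active checking, nothing here sends probes: the tracker only reacts to responses that real client traffic produced, which is why a host with no traffic is never ejected.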
Many more parameters can be configured; see the Envoy documentation on outlier detection for details.