
A Complete Distributed Tracing Setup (Zipkin Cluster)

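1. Architecture of the Zipkin-based tracing data flow

1.1 The application sends the span data it collects to Kafka.

1.2 Zipkin consumes the messages from Kafka.

1.3 Zipkin writes the data to MySQL.

1.4 A reverse proxy in front of the Zipkin cluster routes web UI requests.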

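2. The procedure for each step

2.1 Taking a Java application as an example: Spring Cloud Sleuth already wraps most of this, so after adding the dependency only some configuration is needed. Only the sender part is shown here. The default topic is zipkin. Because Zipkin is deployed as a cluster, the topic's partition count should be greater than or equal to the number of Zipkin instances; otherwise, within a single consumer group, the extra instances are assigned no partition and sit idle.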

spring.zipkin.sender.type=kafka
spring.kafka.bootstrap-servers=127.0.0.1:9092
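
To make the partition-count advice concrete, here is a minimal sketch of how partitions spread across the consumers of one group. It uses a simplified round-robin rule, which is an assumption for illustration and not Kafka's actual assignor; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified round-robin assignment, not Kafka's real assignor.
final class PartitionAssignmentSketch {

    // Returns, for each consumer index, the partition numbers it would own.
    static List<List<Integer>> assign(int partitions, int consumers) {
        if (consumers <= 0) {
            throw new IllegalArgumentException("consumers must be positive");
        }
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        // Each partition goes to exactly one consumer in the group.
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    // Counts consumers that received no partition and would therefore sit idle.
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        // Three Zipkin instances, three partitions: every instance gets work.
        System.out.println(assign(3, 3));
        // Three Zipkin instances, two partitions: one instance is idle.
        System.out.println(idleConsumers(2, 3));
    }
}
```

With fewer partitions than instances, at least one Zipkin instance consumes nothing, which is why the partition count should be at least the instance count.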

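2.2 The Zipkin documentation describes a concise way to configure the server: Zipkin is a Spring Boot application, so it can be configured at startup like any other Spring Boot project.

Create a zipkin-server.properties file in the directory that contains zipkin.jar; values set in this file override the defaults.

To have the Zipkin collector consume data from Kafka, configure it as follows: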

zipkin.collector.kafka.enabled=true
zipkin.collector.kafka.bootstrap-servers=172.16.101.74:9092,172.16.101.75:9092,172.16.101.76:9092
zipkin.collector.kafka.topic=zipkin
zipkin.collector.kafka.group-id=zipkin

Note that Zipkin can run several collectors at once (HTTP, gRPC, Kafka, and so on), so a single Zipkin instance can accept spans sent over HTTP and consume messages from Kafka at the same time.
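
For comparison with the Kafka path, the HTTP collector accepts a JSON list of spans posted to /api/v2/spans. The sketch below only builds such a request with the JDK's java.net.http API (Java 11 or later) and does not send it. The base URL, service name, IDs, and class name are placeholder assumptions, and the JSON is assembled by hand on the assumption that none of the values need escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: builds (but does not send) a span report for Zipkin's HTTP collector.
final class HttpSpanReportSketch {

    // Builds a one-span JSON list in Zipkin's v2 format.
    // Assumes the string values contain nothing that needs JSON escaping.
    static String spanJson(String traceId, String spanId, String name,
                           String serviceName, long timestampMicros, long durationMicros) {
        return "[{\"traceId\":\"" + traceId + "\","
             + "\"id\":\"" + spanId + "\","
             + "\"name\":\"" + name + "\","
             + "\"timestamp\":" + timestampMicros + ","
             + "\"duration\":" + durationMicros + ","
             + "\"localEndpoint\":{\"serviceName\":\"" + serviceName + "\"}}]";
    }

    // Wraps the JSON in a POST request aimed at the v2 span endpoint.
    static HttpRequest buildRequest(String zipkinBaseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(zipkinBaseUrl + "/api/v2/spans"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values; a real tracer generates the IDs and timings.
        String json = spanJson("0000000000000001", "0000000000000002", "get /demo",
                "demo-service", 1_700_000_000_000_000L, 1500L);
        HttpRequest request = buildRequest("http://127.0.0.1:9411", json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
        // Sending would be: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding())
    }
}
```

In practice a tracer library such as brave produces and reports these payloads; the sketch is only meant to show what travels over the HTTP path.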

2.3 Zipkin storage supports Elasticsearch, MySQL, Cassandra, and others. This article uses MySQL; the schema script ships with the Zipkin project and is reproduced below:

CREATE TABLE IF NOT EXISTS zipkin_spans (
  `trace_id_high` BIGINT NOT NULL DEFAULT 0 COMMENT 'If non zero, this means the trace uses 128 bit traceIds instead of 64 bit',
  `trace_id` BIGINT NOT NULL,
  `id` BIGINT NOT NULL,
  `name` VARCHAR(255) NOT NULL,
  `remote_service_name` VARCHAR(255),
  `parent_id` BIGINT,
  `debug` BIT(1),
  `start_ts` BIGINT COMMENT 'Span.timestamp(): epoch micros used for endTs query and to implement TTL',
  `duration` BIGINT COMMENT 'Span.duration(): micros used for minDuration and maxDuration query',
  PRIMARY KEY (`trace_id_high`, `trace_id`, `id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED CHARACTER SET=utf8 COLLATE utf8_general_ci;

ALTER TABLE zipkin_spans ADD INDEX(`trace_id_high`, `trace_id`) COMMENT 'for getTracesByIds';
ALTER TABLE zipkin_spans ADD INDEX(`name`) COMMENT 'for getTraces and getSpanNames';
ALTER TABLE zipkin_spans ADD INDEX(`remote_service_name`) COMMENT 'for getTraces and getRemoteServiceNames';
ALTER TABLE zipkin_spans ADD INDEX(`start_ts`) COMMENT 'for getTraces ordering and range';

CREATE TABLE IF NOT EXISTS zipkin_annotations (
  `trace_id_high` BIGINT NOT NULL DEFAULT 0 COMMENT 'If non zero, this means the trace uses 128 bit traceIds instead of 64 bit',
  `trace_id` BIGINT NOT NULL COMMENT 'coincides with zipkin_spans.trace_id',
  `span_id` BIGINT NOT NULL COMMENT 'coincides with zipkin_spans.id',
  `a_key` VARCHAR(255) NOT NULL COMMENT 'BinaryAnnotation.key or Annotation.value if type == -1',
  `a_value` BLOB COMMENT 'BinaryAnnotation.value(), which must be smaller than 64KB',
  `a_type` INT NOT NULL COMMENT 'BinaryAnnotation.type() or -1 if Annotation',
  `a_timestamp` BIGINT COMMENT 'Used to implement TTL; Annotation.timestamp or zipkin_spans.timestamp',
  `endpoint_ipv4` INT COMMENT 'Null when Binary/Annotation.endpoint is null',
  `endpoint_ipv6` BINARY(16) COMMENT 'Null when Binary/Annotation.endpoint is null, or no IPv6 address',
  `endpoint_port` SMALLINT COMMENT 'Null when Binary/Annotation.endpoint is null',
  `endpoint_service_name` VARCHAR(255) COMMENT 'Null when Binary/Annotation.endpoint is null'
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED CHARACTER SET=utf8 COLLATE utf8_general_ci;

ALTER TABLE zipkin_annotations ADD UNIQUE KEY(`trace_id_high`, `trace_id`, `span_id`, `a_key`, `a_timestamp`) COMMENT 'Ignore insert on duplicate';
ALTER TABLE zipkin_annotations ADD INDEX(`trace_id_high`, `trace_id`, `span_id`) COMMENT 'for joining with zipkin_spans';
ALTER TABLE zipkin_annotations ADD INDEX(`trace_id_high`, `trace_id`) COMMENT 'for getTraces/ByIds';
ALTER TABLE zipkin_annotations ADD INDEX(`endpoint_service_name`) COMMENT 'for getTraces and getServiceNames';
ALTER TABLE zipkin_annotations ADD INDEX(`a_type`) COMMENT 'for getTraces and autocomplete values';
ALTER TABLE zipkin_annotations ADD INDEX(`a_key`) COMMENT 'for getTraces and autocomplete values';
ALTER TABLE zipkin_annotations ADD INDEX(`trace_id`, `span_id`, `a_key`) COMMENT 'for dependencies job';

CREATE TABLE IF NOT EXISTS zipkin_dependencies (
  `day` DATE NOT NULL,
  `parent` VARCHAR(255) NOT NULL,
  `child` VARCHAR(255) NOT NULL,
  `call_count` BIGINT,
  `error_count` BIGINT,
  PRIMARY KEY (`day`, `parent`, `child`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED CHARACTER SET=utf8 COLLATE utf8_general_ci;


The configuration for using MySQL as Zipkin storage is as follows:

zipkin.storage.type=mysql
zipkin.storage.mysql.host=127.0.0.1
zipkin.storage.mysql.port=3306
zipkin.storage.mysql.username=root
zipkin.storage.mysql.password=Root_2023
zipkin.storage.mysql.db=zipkin

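Note that with a high volume of trace data the MySQL tables grow very quickly, so old rows need to be purged on a schedule. MySQL's scheduled events are enough for this; the events below run once a day and delete rows older than 24 hours (the MySQL event scheduler must be enabled for them to fire):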

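-- Runs once a day and deletes spans that started more than 24 hours ago.
-- start_ts is in epoch microseconds; comparing the bare column lets MySQL use the start_ts index.
CREATE EVENT clean_zipkin_spans
ON SCHEDULE EVERY 24 HOUR
DO
DELETE FROM zipkin.zipkin_spans
WHERE start_ts < (UNIX_TIMESTAMP(NOW()) - 24 * 60 * 60) * 1000000;

-- Same retention rule for annotations, keyed on a_timestamp (epoch microseconds).
CREATE EVENT clean_zipkin_annotations
ON SCHEDULE EVERY 24 HOUR
DO
DELETE FROM zipkin.zipkin_annotations
WHERE a_timestamp < (UNIX_TIMESTAMP(NOW()) - 24 * 60 * 60) * 1000000;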
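
The arithmetic in the WHERE clauses can be checked outside the database. The sketch below reproduces the cutoff calculation in Java, given that start_ts and a_timestamp hold epoch microseconds while UNIX_TIMESTAMP returns seconds. The class and method names are hypothetical and exist only for illustration.

```java
// Illustrative only: mirrors the retention arithmetic used by the cleanup events.
final class RetentionCutoffSketch {

    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long SECONDS_PER_HOUR = 3_600L;

    // Rows whose timestamp (epoch micros) is strictly below this value are past retention.
    static long cutoffMicros(long nowEpochSeconds, long retentionHours) {
        return (nowEpochSeconds - retentionHours * SECONDS_PER_HOUR) * MICROS_PER_SECOND;
    }

    // True when a span that started at startTsMicros should be deleted.
    static boolean isExpired(long startTsMicros, long nowEpochSeconds, long retentionHours) {
        return startTsMicros < cutoffMicros(nowEpochSeconds, retentionHours);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        long twoDaysAgoMicros = (now - 48 * SECONDS_PER_HOUR) * MICROS_PER_SECOND;
        long oneHourAgoMicros = (now - SECONDS_PER_HOUR) * MICROS_PER_SECOND;
        System.out.println(isExpired(twoDaysAgoMicros, now, 24)); // true
        System.out.println(isExpired(oneHourAgoMicros, now, 24)); // false
    }
}
```

Because the event fires only once every 24 hours, a row can be up to roughly 48 hours old just before a run; a shorter schedule interval would tighten that window.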

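2.4 Taking a three-node cluster as an example: every Zipkin instance uses the same configuration, and once a reverse proxy is set up in front of the cluster, the deployment is complete.

As an aside: while working with tracing, have you ever wondered how a trace ID is generated? In brave, the code that generates it is: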

// brave.internal.Platform.Jre7#randomLong
@IgnoreJRERequirement @Override public long randomLong() {
    return java.util.concurrent.ThreadLocalRandom.current().nextLong();
}
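
To connect that random long with what shows up in the UI and in the trace_id BIGINT column, here is a small sketch. It draws a 64-bit value the same way and renders it as the 16-character lower-hex string Zipkin uses for 64-bit trace IDs. The class and method names are hypothetical, and this is not brave's own encoding code.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: shows the relationship between the random long and its hex form.
final class TraceIdSketch {

    // Draws a random 64-bit value, as the brave snippet above does.
    static long randomLong() {
        return ThreadLocalRandom.current().nextLong();
    }

    // Renders the value as 16 lower-hex characters, left-padded with zeros.
    // Negative longs are formatted as their unsigned two's-complement bits.
    static String toLowerHex(long id) {
        return String.format("%016x", id);
    }

    public static void main(String[] args) {
        System.out.println(toLowerHex(1L));  // 0000000000000001
        System.out.println(toLowerHex(-1L)); // ffffffffffffffff
        System.out.println(toLowerHex(randomLong()));
    }
}
```

The MySQL schema stores the signed long itself, so a trace ID shown in hex in the UI may appear as a negative number in the trace_id column.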