Trace Context (w3.org)
In a distributed environment, context information is propagated and extended so that the call chain across different services can be traced.
Table of Contents
- 1.Conformance
- 2.Overview
- 3.Trace Context HTTP Headers Format
- 4.Processing Model
- 5.Other Communication Protocols
- 6.Privacy Considerations
- 7.Security Considerations
- 8.Considerations for trace-id field generation
- A.Acknowledgments
- B.Glossary
- C.References
Problem
tracing tools to follow, analyze and debug a transaction across multiple software components
Typically, a distributed trace traverses more than one component which requires it to be uniquely identifiable across all participating systems. Trace context propagation passes along this unique identification.
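1. How are traces correlated with one another?
2. How is a trace propagated across service boundaries?
3. What common standard can cloud vendors, middleware services, and service providers share?
Distributed services bring these problems to the fore => a distributed tracing context propagation standard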
Solution:
A standard for trace context information
1. a unique identifier for individual traces and requests => link together
2. an agreed-upon mechanism => avoid broken traces
3. an industry standard => everyone follows it => consistency
Improves visibility into the behavior of distributed applications, facilitating problem and performance analysis.
Essential for microservices
Design:
supporting interoperability and vendor-specific extensibility:
traceparent describes the position of the incoming request in its trace graph in a portable, fixed-length format. Its design focuses on fast parsing. Every tracing tool MUST properly set traceparent even when it only relies on vendor-specific information in tracestate.
tracestate extends traceparent with vendor-specific data represented by a set of name/value pairs. Storing information in tracestate is optional.
Tracing tools provide two levels of compliant behavior when interacting with trace context:
- At a minimum they MUST propagate the traceparent and tracestate headers and guarantee traces are not broken. This behavior is also referred to as forwarding a trace.
- In addition they MAY also choose to participate in a trace by modifying the traceparent header and relevant parts of the tracestate header containing their proprietary information. This is also referred to as participating in a trace.
Trace Context HTTP Headers Format
This section describes the binding of the distributed trace context to the traceparent and tracestate HTTP headers.
Relationship Between the Headers
The traceparent header represents the incoming request in a tracing system in a common format, understood by all vendors. Here’s an example of a traceparent header.
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
The tracestate header includes the parent in a potentially vendor-specific format:
tracestate: congo=t61rcWkgMzE
Example:
For example, say a client and server in a system use different tracing vendors: Congo and Rojo. A client traced in the Congo system adds the following headers to an outbound HTTP request.
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Note: In this case, the tracestate value t61rcWkgMzE is the result of Base64 encoding the parent ID (b7ad6b7169203331), though such manipulations are not required.
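The relationship between the parent ID and the Congo tracestate value can be checked directly. This is a minimal sketch, assuming standard Base64 over the raw ID bytes with the trailing `=` padding dropped (the padding handling is inferred from the example value, and `encode_parent_id` is a hypothetical helper name):

```python
import base64


def encode_parent_id(parent_id_hex: str) -> str:
    """Base64-encode the raw bytes of a hex parent ID, dropping '=' padding."""
    raw = bytes.fromhex(parent_id_hex)
    return base64.b64encode(raw).decode("ascii").rstrip("=")


print(encode_parent_id("b7ad6b7169203331"))  # t61rcWkgMzE
```

Any other reversible or opaque encoding would be equally acceptable, since tracestate values are vendor-defined.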
The receiving server, traced in the Rojo tracing system, carries over the tracestate it received and adds a new entry to the left.
traceparent: 00-0af7651916cd43dd8448eb211c80319c-00f067aa0ba902b7-01
tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE
You'll notice that the Rojo system reuses the value of its traceparent for its entry in tracestate. This means it is a generic tracing system (no proprietary information is being passed). Otherwise, tracestate entries are opaque and can be vendor-specific.
If the next receiving server uses Congo, it carries over the tracestate from Rojo and adds a new entry for the parent to the left of the previous entry.
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b9c7c989f97918e1-01
tracestate: congo=ucfJifl5GOE,rojo=00f067aa0ba902b7
Note: ucfJifl5GOE is the Base64 encoded parent ID b9c7c989f97918e1.
Notice that when Congo wrote its traceparent entry, the value is not encoded, which helps consistency for those doing correlation. However, the value of its tracestate entry is encoded and differs from traceparent. This is OK.
Finally, you'll see tracestate retains an entry for Rojo exactly as it was, except pushed to the right. The left-most position lets the next server know which tracing system corresponds with traceparent. In this case, since Congo wrote traceparent, its tracestate entry should be left-most.
Traceparent Header
The traceparent HTTP header field identifies the incoming request in a tracing system. It has four fields:
- version
- trace-id
- parent-id
- trace-flags
trace-id: a group of requests
This is the ID of the whole trace forest and is used to uniquely identify a distributed trace through a system. It is represented as a 16-byte array, for example, 4bf92f3577b34da6a3ce929d0e0e4736. All bytes as zero (00000000000000000000000000000000) is considered an invalid value.
If the trace-id value is invalid (for example if it contains non-allowed characters or all zeros), vendors MUST ignore the traceparent.
See considerations for trace-id field generation for recommendations on how to operate with trace-id.
parent-id: a single request
This is the ID of this request as known by the caller (in some tracing systems, this is known as the span-id, where a span is the execution of a client request). It is represented as an 8-byte array, for example, 00f067aa0ba902b7. All bytes as zero (0000000000000000) is considered an invalid value.
Vendors MUST ignore the traceparent when the parent-id is invalid (for example, if it contains non-lowercase hex characters).
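The validity rules for trace-id and parent-id can be sketched as a small parser. This is an illustrative sketch for version 00 only, not a reference implementation. The names `TraceParent` and `parse_traceparent_v00` are hypothetical, and returning None stands for "ignore the traceparent":

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-trace_id-parent_id-trace_flags, lowercase hex only.
_TRACEPARENT_V00 = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<trace_flags>[0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    trace_flags: int


def parse_traceparent_v00(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None when the header must be ignored."""
    match = _TRACEPARENT_V00.fullmatch(header)
    if match is None or match["version"] != "00":
        return None
    trace_id, parent_id = match["trace_id"], match["parent_id"]
    # All-zero IDs are explicitly invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent("00", trace_id, parent_id, int(match["trace_flags"], 16))


parsed = parse_traceparent_v00(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
print(parsed)
```

Using `fullmatch` rather than `match` rejects trailing characters, which a version 00 header must not carry.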
trace-flags: tagging
An 8-bit field that controls tracing flags such as sampling, trace level, etc. These flags are recommendations given by the caller rather than strict rules to follow for three reasons:
- Trust and abuse
- Bug in the caller
- Different load between caller service and callee service might force callee to downsample.
You can find more in the section Security considerations of this specification.
Sampled flag
The current version of this specification (00) only supports a single flag called sampled.
When set, the least significant bit (right-most) denotes that the caller may have recorded trace data. When unset, the caller did not record trace data out-of-band.
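Reading and writing the sampled bit is plain bit masking on the trace-flags byte. This is a minimal sketch, assuming the flags arrive as the two-character hex field of traceparent. The helper names are hypothetical:

```python
SAMPLED_FLAG = 0x01  # least significant bit of the 8-bit trace-flags field


def is_sampled(trace_flags_hex: str) -> bool:
    """True when the sampled bit is set in a 2-character hex flags field."""
    return bool(int(trace_flags_hex, 16) & SAMPLED_FLAG)


def with_sampled(trace_flags_hex: str, sampled: bool) -> str:
    """Return the flags field with the sampled bit set or cleared."""
    flags = int(trace_flags_hex, 16)
    if sampled:
        flags |= SAMPLED_FLAG
    else:
        flags &= 0xFF & ~SAMPLED_FLAG
    return f"{flags:02x}"


print(is_sampled("01"), is_sampled("00"))
```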
There are a number of recording scenarios that may break distributed tracing:
- Only recording a subset of requests results in broken traces.
- Recording information about all incoming and outgoing requests becomes prohibitively expensive under load.
- Making random or component-specific data collection decisions leads to fragmented data in all traces.
Because of these issues, tracing vendors make their own recording decisions, and there is no consensus on what is the best algorithm for this job.
Various techniques include:
- Probability sampling (sample 1 out of 100 distributed traces by flipping a coin)
- Delayed decision (make collection decision based on duration or a result of a request)
- Deferred sampling (let the callee decide whether information about this request needs to be collected)
How these techniques are implemented can be tracing vendor-specific or application-defined.
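As one illustration, probability sampling can be as small as a single comparison. This is a sketch under assumptions: the function name and the injectable `rng` parameter are hypothetical, and real tracers often derive the decision from the trace-id instead, so that every component reaches the same answer:

```python
import random


def probability_sample(rate: float, rng: random.Random) -> bool:
    """Return True for roughly `rate` of calls, e.g. 0.01 for 1 in 100."""
    return rng.random() < rate


rng = random.Random(42)  # seeded only to make the demo repeatable
sampled = sum(probability_sample(0.01, rng) for _ in range(100_000))
print(f"sampled {sampled} of 100000 traces")
```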
The tracestate field is designed to handle the variety of techniques for making recording decisions (or other specific information) specific for a given vendor. The sampled flag provides better interoperability between vendors. It allows vendors to communicate recording decisions and enable a better experience for the customer.
For example, when a SaaS service participates in a distributed trace, this service has no knowledge of the tracing vendor used by its caller. This service may produce records of incoming requests for monitoring or troubleshooting purposes. The sampled flag can be used to ensure that information about requests that were marked for recording by the caller will also be recorded by the SaaS service downstream so that the caller can troubleshoot the behavior of every recorded request.
The sampled flag has no restriction on its mutations except that it can only be mutated when parent-id is updated.
The following are a set of suggestions that vendors SHOULD use to increase vendor interoperability.
- If a component made a definitive recording decision, this decision SHOULD be reflected in the sampled flag.
- If a component needs to make a recording decision, it SHOULD respect the sampled flag value. Security considerations SHOULD be applied to protect from abusive or malicious use of this flag.
- If a component deferred or delayed the decision and only a subset of telemetry will be recorded, the sampled flag should be propagated unchanged. It should be set to 0 as the default option when the trace is initiated by this component.
There are two additional options that vendors MAY follow:
- A component that makes a deferred or delayed recording decision may communicate the priority of a recording by setting the sampled flag to 1 for a subset of requests.
- A component may also fall back to probability sampling and set the sampled flag to 1 for the subset of requests.
Versioning of traceparent
This specification is opinionated about future versions of trace context. The current version of this specification assumes that future versions of the traceparent header will be additive to the current one.
Vendors MUST follow these rules when parsing headers with an unexpected format:
- Pass-through services should not analyze the version. They should expect that headers may have larger size limits in the future and only disallow prohibitively large headers.
- When the version prefix cannot be parsed (it's not 2 hex characters followed by a dash (-)), the implementation should restart the trace.
- If a higher version is detected, the implementation SHOULD try to parse it by trying the following:
  - If the size of the header is shorter than 55 characters, the vendor should not parse the header and should restart the trace.
  - Parse trace-id (from the first dash through the next 32 characters). Vendors MUST check that the 32 characters are hex, and that they are followed by a dash (-).
  - Parse parent-id (from the second dash at the 35th position through the next 16 characters). Vendors MUST check that the 16 characters are hex and followed by a dash.
  - Parse the sampled bit of flags (2 characters from the third dash). Vendors MUST check that the 2 characters are either at the end of the string or followed by a dash.
If all three values were parsed successfully, the vendor should use them.
Vendors MUST NOT parse or assume anything about unknown fields for this version. Vendors MUST use these fields to construct the new traceparent field according to the highest version of the specification known to the implementation (in this specification it is 00).
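The parsing steps for a higher-version header can be sketched with fixed string offsets. This is an illustrative sketch only: `parse_future_traceparent` and `downgrade_to_v00` are hypothetical names, "hex" is taken to mean lowercase hex, returning None stands for "restart the trace", and the all-zero ID checks are left out for brevity:

```python
from typing import Optional, Tuple

_HEX = set("0123456789abcdef")


def _is_hex(text: str) -> bool:
    return len(text) > 0 and all(ch in _HEX for ch in text)


def parse_future_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_id, sampled), or None to restart the trace."""
    # Version prefix: two hex characters followed by a dash.
    if len(header) < 3 or not _is_hex(header[:2]) or header[2] != "-":
        return None
    # Anything shorter than a complete version 00 header is not parsed.
    if len(header) < 55:
        return None
    trace_id = header[3:35]
    if not _is_hex(trace_id) or header[35] != "-":
        return None
    parent_id = header[36:52]
    if not _is_hex(parent_id) or header[52] != "-":
        return None
    flags = header[53:55]
    if not _is_hex(flags):
        return None
    # The flags must end the string or be followed by a dash.
    if len(header) > 55 and header[55] != "-":
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def downgrade_to_v00(header: str) -> Optional[str]:
    """Rebuild the header at version 00, dropping all unknown fields."""
    parsed = parse_future_traceparent(header)
    if parsed is None:
        return None
    trace_id, parent_id, sampled = parsed
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"


print(downgrade_to_v00(
    "01-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01-extra"
))
```

A participating vendor would also replace parent-id before sending the rebuilt header onward. The sketch keeps it unchanged so the downgrade step is visible in isolation.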
Tracestate Header: trace identification
The main purpose of the tracestate HTTP header is to provide additional vendor-specific trace identification information across different distributed tracing systems. It is a companion header for the traceparent field. It also conveys information about the request’s position in multiple distributed tracing graphs.
If the vendor failed to parse traceparent, it MUST NOT attempt to parse tracestate. Note that the opposite is not true: failure to parse tracestate MUST NOT affect the parsing of traceparent.
tracestate Header Field Values: key/value format
The tracestate field may contain any opaque value in any of the keys. Tracestate MAY be sent or received as multiple header fields. Multiple tracestate header fields MUST be handled as specified by RFC7230 Section 3.2.2 Field Order. The tracestate header SHOULD be sent as a single field when possible, but MAY be split into multiple header fields. When sending tracestate as multiple header fields, it MUST be split according to RFC7230. When receiving multiple tracestate header fields, they MUST be combined into a single header according to RFC7230.
This section uses the Augmented Backus-Naur Form (ABNF) notation of [RFC5234], including the DIGIT rule in appendix B.1 of RFC5234. It also includes the OWS rule from RFC7230 section 3.2.3.
The DIGIT rule defines numbers 0-9.
The OWS rule defines an optional whitespace character. To improve readability, it is used where zero or more whitespace characters might appear.
Where whitespace aids readability, the caller SHOULD generate it as a single space; otherwise, a caller SHOULD NOT generate optional whitespace. See details in the corresponding RFC.
The tracestate field value is a list of list-members separated by commas (,). A list-member is a key/value pair separated by an equals sign (=). Spaces and horizontal tabs surrounding list-members are ignored. There can be a maximum of 32 list-members in a list.
Empty and whitespace-only list members are allowed. Vendors MUST accept empty tracestate headers but SHOULD avoid sending them. Empty list members are allowed in tracestate because it is difficult for a vendor to recognize the empty value when multiple tracestate headers are sent. Whitespace characters are allowed for a similar reason, as some vendors automatically inject whitespace after a comma separator, even in the case of an empty header.
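The list rules above translate into a short parser. This is a sketch under assumptions: `parse_tracestate` is a hypothetical name, multiple header fields are combined by joining them with commas in order, a member without `=` or with an empty key is treated as unparsable, and the allowed character sets for keys and values are not checked:

```python
from typing import List, Optional, Tuple

MAX_LIST_MEMBERS = 32


def parse_tracestate(header_values: List[str]) -> Optional[List[Tuple[str, str]]]:
    """Parse one or more tracestate header fields into ordered (key, value) pairs.

    Returns None when the header cannot be parsed.
    """
    combined = ",".join(header_values)
    members: List[Tuple[str, str]] = []
    for raw in combined.split(","):
        member = raw.strip(" \t")  # surrounding spaces and tabs are ignored
        if not member:
            continue  # empty and whitespace-only members are allowed
        key, separator, value = member.partition("=")
        if not separator or not key:
            return None
        members.append((key, value))
    if len(members) > MAX_LIST_MEMBERS:
        return None
    return members


print(parse_tracestate(["rojo=00f067aa0ba902b7,congo=t61rcWkgMzE"]))
```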
Combined Header Value: keys are unique => entries get overwritten
The tracestate value is the concatenation of trace graph key/value pairs.
Example: vendorname1=opaqueValue1,vendorname2=opaqueValue2
Only one entry per key is allowed because the entry represents the last position in the trace. Hence vendors must overwrite their entry upon reentry to their tracing system.
For example, if a vendor name is Congo and a trace started in their system and then went through a system named Rojo and later returned to Congo, the tracestate value would not be:
congo=congosFirstPosition,rojo=rojosFirstPosition,congo=congosSecondPosition
Instead, the entry would be rewritten to only include the most recent position: congo=congosSecondPosition,rojo=rojosFirstPosition
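The overwrite-on-reentry rule can be sketched over a list of (key, value) pairs. `upsert_entry` and `format_tracestate` are hypothetical helper names:

```python
from typing import List, Tuple

Entry = Tuple[str, str]


def upsert_entry(entries: List[Entry], key: str, value: str) -> List[Entry]:
    """Put (key, value) left-most and drop any older entry with the same key."""
    return [(key, value)] + [(k, v) for k, v in entries if k != key]


def format_tracestate(entries: List[Entry]) -> str:
    return ",".join(f"{k}={v}" for k, v in entries)


# Congo re-enters a trace that previously passed through Congo, then Rojo.
incoming = [("rojo", "rojosFirstPosition"), ("congo", "congosFirstPosition")]
print(format_tracestate(upsert_entry(incoming, "congo", "congosSecondPosition")))
# congo=congosSecondPosition,rojo=rojosFirstPosition
```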
tracestate Limits:
Vendors SHOULD propagate at least 512 characters of a combined header. This length includes commas required to separate list items and optional white space (OWS) characters.
There are systems where propagating 512 characters of tracestate may be expensive. In this case, the maximum size of the propagated tracestate header SHOULD be documented and explained. The cost of propagating tracestate SHOULD be weighed against the value of monitoring scenarios enabled for the end users.
In a situation where tracestate needs to be truncated due to size limitations, the vendor MUST truncate whole entries. Entries larger than 128 characters long SHOULD be removed first. Then entries SHOULD be removed starting from the end of tracestate. Note that other truncation strategies like safe list entries, blocked list entries, or size-based truncation MAY be used, but are highly discouraged. Those strategies decrease the interoperability of various tracing vendors.
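The recommended truncation order can be sketched as two passes over whole entries. This is a sketch under assumptions: `truncate_tracestate` is a hypothetical name, entries are "key=value" strings, and oversized entries are dropped from the right first (the text does not say which oversized entry goes first):

```python
from typing import List


def truncate_tracestate(entries: List[str], max_len: int = 512) -> List[str]:
    """Drop whole entries until the comma-joined header fits in max_len."""

    def total(items: List[str]) -> int:
        return len(",".join(items))

    kept = list(entries)
    # Pass 1: remove entries longer than 128 characters, right to left.
    for index in range(len(kept) - 1, -1, -1):
        if total(kept) <= max_len:
            break
        if len(kept[index]) > 128:
            del kept[index]
    # Pass 2: remove remaining entries from the end of the list.
    while kept and total(kept) > max_len:
        kept.pop()
    return kept


entries = ["a=" + "x" * 200, "b=1", "c=" + "y" * 300, "d=2"]
print([e[:8] for e in truncate_tracestate(entries)])
```

In the demo the combined length is 513, so only the right-most oversized entry (`c`) is removed and the shorter entries after it survive.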
Versioning of tracestate: whether to reuse the parent's tracestate information
The version of tracestate is defined by the version prefix of the traceparent header. Vendors need to attempt to parse tracestate if a higher version is detected, to the best of their ability. It is the vendor’s decision whether to use partially-parsed tracestate key/value pairs or not.
Mutating the traceparent Field
A vendor receiving a traceparent request header MUST send it to outgoing requests. It MAY mutate the value of this header before passing it to outgoing requests.
If the value of the traceparent field wasn't changed before propagation, tracestate MUST NOT be modified as well. Unmodified header propagation is typically implemented in pass-through services like proxies. This behavior may also be implemented in a service which currently does not collect distributed tracing information.
Following is the list of allowed mutations:
- Update parent-id: The value of the parent-id field can be set to the new value representing the ID of the current operation. This is the most typical mutation and should be considered a default.
- Update sampled: The value of the sampled field reflects the caller's recording behavior: either trace data was dropped or may have been recorded out-of-band. This can be indicated by toggling the flag in both directions. This mutation gives the downstream vendor information about the likelihood that its parent's information was recorded. The parent-id field MUST be set to a new value with the sampled flag update.
- Restart trace: All properties (trace-id, parent-id, trace-flags) are regenerated. This mutation is used in services that are defined as a front gate into secure networks and eliminates a potential denial-of-service attack surface. Vendors SHOULD clean up the tracestate collection on traceparent restart. There are rare cases when the original tracestate entries must be preserved after a restart. This typically happens when the trace-id is reverted back at some point of the trace flow, for instance, when it leaves the secure network. However, it SHOULD be an explicit decision, and not the default behavior.
- Downgrade the version: This version of the specification (00) defines the behavior for a vendor that receives a traceparent header of a higher version. In this case, the first mutation is to downgrade the version of the header. Other mutations are allowed in combination with this one.
Vendors MUST NOT make any other mutations to the traceparent header.
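The first three mutations can be sketched as follows. This is an illustrative sketch, assuming the incoming header was already validated as version 00. The function names are hypothetical, and using Python's `secrets` module for ID generation is an assumption of this sketch, not something the notes prescribe:

```python
import secrets
from typing import Optional


def _random_hex(n_bytes: int) -> str:
    """Random lowercase hex for n_bytes bytes, never all zeros."""
    while True:
        value = secrets.token_hex(n_bytes)
        if value.strip("0"):
            return value


def update_parent_id(traceparent: str, sampled: Optional[bool] = None) -> str:
    """Set a new parent-id; optionally change sampled in the same step."""
    _version, trace_id, _old_parent_id, flags_hex = traceparent.split("-")
    flags = int(flags_hex, 16)
    if sampled is True:
        flags |= 0x01
    elif sampled is False:
        flags &= 0xFE
    return f"00-{trace_id}-{_random_hex(8)}-{flags:02x}"


def restart_trace(sampled: bool = False) -> str:
    """Regenerate trace-id, parent-id and trace-flags."""
    return f"00-{_random_hex(16)}-{_random_hex(8)}-{'01' if sampled else '00'}"


incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
print(update_parent_id(incoming))
print(restart_trace())
```

Because `update_parent_id` is the only place where sampled can change, the rule that sampled is mutated only together with parent-id holds by construction.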
Mutating the tracestate Field
Vendors receiving a tracestate request header MUST send it to outgoing requests. They MAY mutate the value of this header before passing it to outgoing requests. When mutating tracestate, the order of unmodified key/value pairs MUST be preserved. Modified keys MUST be moved to the beginning (left) of the list.
Following are allowed mutations:
- Update key value. The value of any key can be updated. Modified keys MUST be moved to the beginning (left) of the list. This is the most common mutation resuming the trace.
- Add a new key/value pair. The new key-value pair SHOULD be added to the beginning of the list.
- Delete a key/value pair. Any key/value pair MAY be deleted. Vendors SHOULD NOT delete keys that were not generated by them. The deletion of an unknown key/value pair will break correlation in other systems. This mutation enables two scenarios. The first is that proxies can block certain tracestate keys for privacy and security concerns. The second scenario is a truncation of long tracestates.
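The proxy scenario for deletion might look like the sketch below: blocked keys are removed and the order of the remaining pairs is left untouched. `drop_blocked_keys` and the example block list are hypothetical:

```python
from typing import Set


def drop_blocked_keys(tracestate: str, blocked: Set[str]) -> str:
    """Remove list-members whose key is blocked, preserving the order of the rest."""
    kept = []
    for raw in tracestate.split(","):
        member = raw.strip(" \t")
        if not member:
            continue  # skip empty members
        key = member.partition("=")[0]
        if key not in blocked:
            kept.append(member)
    return ",".join(kept)


print(drop_blocked_keys("congo=ucfJifl5GOE,rojo=00f067aa0ba902b7", {"rojo"}))
# congo=ucfJifl5GOE
```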
A traceparent is Received
If a traceparent header is received:
- The vendor checks an incoming request for a traceparent and a tracestate header.
- Because the traceparent header is present, the vendor tries to parse the version of the traceparent header.
- If the version cannot be parsed, the vendor creates a new traceparent header and deletes tracestate.
- If the version number is higher than supported by the tracer, the vendor uses the format defined in this specification (00) to parse trace-id and parent-id. The vendor will only parse the trace-flags values supported by this version of this specification and ignore all other values. If parsing fails, the vendor creates a new traceparent header and deletes the tracestate. Vendors will set all unparsed / unknown trace-flags to 0 on outgoing requests.
- If the vendor supports the version number, it validates trace-id and parent-id. If either trace-id, parent-id or trace-flags are invalid, the vendor creates a new traceparent header and deletes tracestate.
- The vendor MAY validate the tracestate header. If the tracestate header cannot be parsed the vendor MAY discard the entire header. Invalid tracestate entries MAY also be discarded.
- For each outgoing request the vendor performs the following steps:
  - The vendor MUST modify the traceparent header:
    - Update parent-id: The value of property parent-id MUST be set to a value representing the ID of the current operation.
    - Update sampled: The value of sampled reflects the caller's recording behavior. The value of the sampled flag of trace-flags MAY be set to 1 if the trace data is likely to be recorded or to 0 otherwise. Setting the flag is no guarantee that the trace will be recorded but increases the likeliness of end-to-end recorded traces.
  - The vendor MAY modify the tracestate header:
    - Update a key value: The value of any key can be updated. Modified keys MUST be moved to the beginning (left) of the list.
    - Add a new key/value pair: The new key-value pair MUST be added to the beginning (left) of the list.
    - Delete a key/value pair: Any key/value pair MAY be deleted. Vendors SHOULD NOT delete keys that weren't generated by themselves. Deletion of any key/value pair MAY break correlation in other systems.
  - The vendor sets the traceparent and tracestate header for the outgoing request.
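The steps above can be sketched end to end as one function that maps incoming headers to outgoing ones. This is a simplified sketch, not a reference implementation. `outgoing_headers` and `vendor_key` are hypothetical, header names are assumed to be lower-cased already, only the sampled bit of trace-flags is kept, and the vendor writes its new parent-id as its tracestate value, as Rojo does in the earlier example:

```python
import re
import secrets
from typing import Dict

# Version 00 must match exactly; higher versions may carry extra "-..." fields.
_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")
_HIGHER = re.compile(r"[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})(-.*)?")


def _new_id(n_bytes: int) -> str:
    """Random lowercase hex ID that is never all zeros."""
    while True:
        value = secrets.token_hex(n_bytes)
        if value.strip("0"):
            return value


def outgoing_headers(incoming: Dict[str, str], vendor_key: str) -> Dict[str, str]:
    traceparent = incoming.get("traceparent", "")
    tracestate = incoming.get("tracestate", "")

    pattern = _V00 if traceparent.startswith("00-") else _HIGHER
    match = pattern.fullmatch(traceparent)
    valid = (
        match is not None
        and match[1].strip("0") != ""  # trace-id must not be all zeros
        and match[2].strip("0") != ""  # parent-id must not be all zeros
    )
    if valid:
        trace_id = match[1]
        flags = int(match[3], 16) & 0x01  # unknown flags are set to 0
    else:
        # Missing or invalid traceparent: new trace, tracestate is dropped.
        trace_id, flags, tracestate = _new_id(16), 0, ""

    parent_id = _new_id(8)  # ID of the current operation
    members = [part.strip(" \t") for part in tracestate.split(",")]
    others = [m for m in members if m and m.partition("=")[0] != vendor_key]
    return {
        "traceparent": f"00-{trace_id}-{parent_id}-{flags:02x}",
        "tracestate": ",".join([f"{vendor_key}={parent_id}"] + others),
    }


print(outgoing_headers(
    {
        "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
        "tracestate": "congo=t61rcWkgMzE",
    },
    "rojo",
))
```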
Privacy Considerations
Requirements to propagate headers to downstream services, as well as storing values of these headers, open up potential privacy concerns. Tracing vendors MUST NOT use traceparent and tracestate fields for any personally identifiable or otherwise sensitive information. The only purpose of these fields is to enable trace correlation.
Vendors MUST assess the risk of header abuse. This section provides some considerations and initial assessment of the risk associated with storing and propagating these headers. Tracing vendors may choose to inspect and remove sensitive information from the fields before allowing the tracing system to execute code that can potentially propagate or store these fields. All mutations should, however, conform to the list of mutations defined in this specification.
Privacy of tracestate field
The tracestate field may contain any opaque value in any of the keys. The main purpose of this header is to provide additional vendor-specific trace-identification information across different distributed tracing systems.
Vendors MUST NOT include any personally identifiable information in the tracestate header.
Vendors extremely sensitive to personal information exposure MAY implement selective removal of values corresponding to the unknown keys. Vendors SHOULD NOT mutate the tracestate field, as it defeats the purpose of allowing multiple tracing systems to collaborate.
Security Considerations
There are two types of potential security risks associated with this specification: information exposure and denial-of-service attacks against the vendor.
Vendors relying on traceparent and tracestate headers should also follow all best practices for parsing potentially malicious headers, including checking for header length and content of header values. These practices help to avoid buffer overflow and HTML injection attacks.
Denial of Service
When distributed tracing is enabled on a service with a public API and the service naively continues any trace with the sampled flag set, a malicious attacker could overwhelm an application with tracing overhead, forge trace-id collisions that make monitoring data unusable, or run up your tracing bill with your SaaS tracing vendor.
Tracing vendors and platforms should account for these situations and make sure that checks and balances are in place to protect against denial of monitoring by malicious or badly authored callers.
One example of such protection may be different tracing behavior for authenticated and unauthenticated requests. Various rate limiters for data recording can also be implemented.
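One possible shape for such protection is a token-bucket limiter applied only to unauthenticated callers. This is a sketch of one interpretation, not a prescribed mechanism. `RecordingRateLimiter`, `honor_sampled_flag`, and the choice to always honor the flag for authenticated callers are assumptions made for illustration:

```python
import time
from typing import Callable


class RecordingRateLimiter:
    """Token bucket: at most `burst` recordings at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def honor_sampled_flag(sampled: bool, authenticated: bool,
                       limiter: RecordingRateLimiter) -> bool:
    """Decide whether to honor the caller's sampled flag for this request."""
    if not sampled:
        return False
    if authenticated:
        return True
    return limiter.allow()  # unauthenticated callers share a limited budget


limiter = RecordingRateLimiter(rate_per_sec=10.0, burst=20)
print(honor_sampled_flag(True, False, limiter))
```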