Drawing the boundaries between logs, metrics, and tracing
Metrics detect anomalies, tracing narrows down the path, and logs reconstruct the scene; blurring the three only drives up the cost of troubleshooting.
When teams set out to improve observability, the first instinct is usually to "wire up logs, metrics, and tracing".
That sounds right, but the real problem is the next step: all three are wired up, yet no boundaries are drawn, so each one ends up cleaning up after the other two.
The result is usually familiar: metric labels explode, tracing sampling keeps climbing, log volume doubles every month, alerts are still inaccurate, and troubleshooting still comes down to a human running grep. There is no shortage of tools, but the system has not actually become more observable.
My judgment is this: **the boundaries between logs, metrics, and tracing should be drawn not by what the data looks like, but by what decision needs to be made next.**
- Metrics answer: does this need handling right now, and how large is the blast radius?
- Tracing answers: which segment of the call path is the problem most likely stuck in?
- Logs answer: what exactly happened to that request, in that piece of code?
The three sit on a ladder of rising decision cost. Metrics are the cheapest and suit continuous, always-on monitoring of the big picture; tracing is more expensive and suits narrowing the search; logs are the heaviest and suit the final reconstruction of the scene.
If logs handle alerting, tracing handles auditing, and metrics carry per-user, per-order, per-SQL dimensions, the system will still run, but the price shows up honestly in storage, query latency, sampling, cardinality control, and human troubleshooting time.
The first decision should go to metrics
When something looks off in production, the first step is usually to judge whether it is worth acting on immediately.
That step belongs to metrics. They may look like the bluntest instrument of the three, but they sit closest to this decision and are by far the cheapest.
A good metrics setup should be able to answer questions like these within tens of seconds:
- Is the error rate actually climbing, or is it just occasional noise?
- Is the latency shift global, or concentrated on one endpoint?
- Is the impact limited to one machine, one data center, or the whole call chain?
- Did the problem just start, or has it been going on for half an hour?
These questions are all doing the same thing: **trading cheap, aggregated information for high-value decisions about what to do next.**
So the most important design principle for metrics is that they must still support decisions after aggregation.
Many teams ruin their metrics by shoehorning in high-cardinality fields that do not belong there: user_id, order_id, trace_id, full URLs, dynamic error messages. In the short term this feels great, as if metrics can answer everything; in the mid term storage balloons, queries slow down, and alert dimensions drift out of control; in the long run the team ends up afraid of its own metrics platform.
Metrics are suited to expressing trends and distributions over stable dimensions: endpoint name, status code, data center, downstream dependency, cache hit or miss. The value of these dimensions is not to reconstruct a specific request, but to tell quickly whether a problem is clustered.
The moment the question becomes "which order failed?", it is no longer a question metrics should answer on their own.
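As a minimal sketch of what "still supports decisions after aggregation" can look like, here is how such metrics might be declared with the Prometheus Go client; the metric names and label sets are illustrative, not taken from any particular system:

```go
package metricsdemo

import "github.com/prometheus/client_golang/prometheus"

// Every label is a stable, low-cardinality dimension (endpoint, status class,
// data center); never user_id, order_id, or a raw URL.
var (
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Requests by endpoint, status class and data center.",
		},
		[]string{"endpoint", "status_class", "dc"},
	)

	httpLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Latency by endpoint; answers trend questions after aggregation.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequests, httpLatency)
}
```

Any question these labels cannot answer, such as "which order failed?", is deliberately pushed down to tracing and logs.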
The value of tracing is shortening the path to the fault
Tracing is often misunderstood as "a more advanced form of logging". That misunderstanding alone is enough to ruin it.
What tracing is actually good at is describing the path a request takes across services and components, and where the time goes. It is naturally suited to questions like:
- Is the slowness coming from the gateway, the application, the database, or an external dependency?
- When a retry happens, which hop gets amplified?
- When an endpoint's P99 spikes, is it tied to a specific downstream service?
- After which service does an abnormal request start to diverge from the normal path?
In other words, **tracing gives you the structure of the call path, not the full scene.**
So what it should carry most is path information, timings of key stages, and a small number of tags that help with filtering, not entire business objects, and certainly not every local variable promoted to a span attribute.
I have seen plenty of teams try to turn tracing into the single source of truth the moment they adopt it: raw SQL text attached, user input attached, full response bodies attached, every retry parameter attached. The result is predictable:
- Span size balloons, and the sampling rate is forced down;
- When you query a trace, noise drowns out signal;
- When an incident actually happens, the key request is missing because it was sampled away;
- Sensitive-data governance becomes a new maintenance cost.
Tracing earns its keep when a single trace lets you quickly decide which service, which span, and which slice of logs to look at next.
If it becomes so heavy that it cannot be retained reliably across the board, or so heavy that even a single query is painful, the boundary has already been crossed.
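One way to keep that boundary honest is to make the sampling budget explicit at the tracer provider. A minimal sketch with the OpenTelemetry Go SDK, assuming a parent-based 10% ratio is acceptable for the service in question (the ratio is a placeholder, not a recommendation):

```go
package tracingdemo

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newTracerProvider keeps child spans consistent with their parent's sampling
// decision and samples 10% of new traces. If spans grow heavy, this ratio is
// the first thing that gets pushed down.
func newTracerProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
		sdktrace.WithBatcher(exporter),
	)
}
```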
Logs are not a second metrics system; they are the final forensic material
Logs are the easiest to misjudge, because they look like they can hold anything.
And they really are the closest to the scene. Which branch was taken, what the parameters looked like, how many retries happened, which fallback path was hit, why a response that looked successful was semantically wrong: often only the logs can tell you.
But precisely because they carry the highest information entropy, they are the least suitable for continuous, global observation.
Using logs as the primary monitoring source has three common consequences.
**First, queries are expensive.** Every question is answered by deriving an ad-hoc conclusion from raw facts, which is slow and unreliable.
**Second, the signal-to-noise ratio is terrible.** When log volume grows, the team naturally prints less; once it prints less, key branches lack evidence; in the end everyone swings back and forth between "too much to read" and "too little to see".
**Third, it breeds bad engineering habits.** Many developers react to every problem by adding another layer of logging, which only makes the system noisier. What actually needs adding is event logs with clear boundaries, error logs with context, and request ids that can be correlated, not more "entering method", "leaving method", "started processing", and "processing done" lines.
The right place for logs is after metrics have confirmed "there really is a problem" and tracing has said "the problem is probably here": then use logs to reconstruct a specific request.
In other words, logs are for forensics, not for patrol.
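To make "forensics, not patrol" concrete, a small sketch using Go's standard log/slog: one structured event on the branch worth reconstructing later, instead of enter/exit noise. The event, field names, and helper are hypothetical:

```go
package logdemo

import "log/slog"

// logFallback records a single forensic event for the fallback branch,
// carrying enough context to reconstruct this request later; it replaces
// "entering method" / "processing completed" style noise.
func logFallback(logger *slog.Logger, requestID, provider string, retries int, cause error) {
	logger.Warn("order fell back to async payment",
		"request_id", requestID,
		"provider", provider,
		"retries", retries,
		"cause", cause,
	)
}
```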
A simple division of labor beats "doing all three"
The approach I recommend is simple: fix the troubleshooting sequence first, then fix the collection boundaries.
An online investigation breaks down into three steps:
- First, use metrics to decide whether there is a systemic problem;
- Then, use tracing to narrow it down to a service and a stage of the call path;
- Finally, use logs to explain why this particular request behaved the way it did.
If your instrumentation cannot support this sequence, the responsibilities have usually been assigned to the wrong layer.
Here is a deliberately restrained way to write it:
```go
func CreateOrder(ctx context.Context, req CreateOrderReq) error {
	// Metric: one latency observation per call, measured when the function returns.
	// (Wrapping in a closure so time.Since is evaluated at return, not at defer time.)
	start := time.Now()
	defer func() {
		metrics.OrderCreateLatency.Observe(time.Since(start).Seconds())
	}()

	// Trace: one span with a handful of low-cardinality, filterable attributes.
	ctx, span := tracer.Start(ctx, "order.create")
	defer span.End()
	span.SetAttributes(
		attribute.String("payment_provider", req.Provider),
		attribute.Bool("has_coupon", req.CouponID != ""),
	)

	err := service.Create(ctx, req)
	if err != nil {
		// Metric: errors counted by coarse error class, not by raw message.
		metrics.OrderCreateErrors.WithLabelValues(classify(err)).Inc()
		// Log: context only on the failure path, correlated via request_id.
		logger.Error("create order failed",
			"request_id", requestid.FromContext(ctx),
			"provider", req.Provider,
			"err", err,
		)
		return err
	}
	return nil
}
```
Three boundaries here are deliberately restrained.
- The metric keeps only dimensions that still carry decision value after aggregation, such as the error class;
- Tracing carries only a few attributes that help filter paths, not the whole request body;
- The log records key context only on the failure path, and stays correlated with the trace via request_id.
This design is nothing fancy, but it solves a very real problem: after an incident, can the team work toward the answer from the cheapest data to the most expensive, instead of diving straight into the heaviest data from the start?
A common misunderstanding: treating "unified observability" as "piling all the data together"
Many platforms now pitch unified observability. Nothing wrong with that in itself; the problem is how it gets implemented.
Some teams read "unified" as "all data lands in one platform, every field joins every other field, and ideally one query answers everything". The idea is tempting because it seems to erase tool boundaries; in practice it tends to erase cost boundaries instead.
The right direction for unification is connecting the three with low friction.
For example:
- A metric alert can jump straight to the relevant service and time window;
- Tracing can find the corresponding logs via request_id or an error code;
- Logs can carry trace context instead of living in isolation.
That is connecting them, not blending them.
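As a minimal sketch of the last bullet, assuming OpenTelemetry and log/slog: the log record carries the trace and span ids from the active context, so traces and logs can be joined without being merged into one store (the helper and field names are illustrative):

```go
package correlate

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// errorWithTrace writes an error log that carries the active trace context,
// so a trace can jump to its logs and a log line can lead back to its trace.
func errorWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	if sc := trace.SpanFromContext(ctx).SpanContext(); sc.IsValid() {
		args = append(args,
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)
	}
	logger.ErrorContext(ctx, msg, args...)
}
```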
What is truly dangerous is when every question can only be answered by the heaviest layer of data.
Counterexamples and limits: not every system needs the full trio
It is also worth admitting that some scenarios do not need the boundaries drawn this strictly.
For low-QPS internal tools, single-machine batch scripts, and very stable offline jobs, logs alone may be enough; for services with short request paths and simple dependencies, tracing may pay off less than expected; and for audit scenarios under compliance constraints, neither tracing nor ordinary application logs should be expected to serve as the formal audit record.
What this article argues is narrower: **once you decide to run all three, do not let them stand in for one another.**
The more complex the system, the narrower each layer's responsibility needs to be. Otherwise, what you erase today is a design boundary, and what you pay tomorrow is the platform bill, the troubleshooting time, and the team's cognitive load.
Summary
The scariest failure mode in observability is every layer trying to answer every question.
Metrics should let you decide, as fast as possible, whether to act; tracing should let you decide, as fast as possible, where to look first; logs should let you find out, as fast as possible, what actually happened.
Get that order wrong, and no matter how complete the three pieces are, they only bury the cost of troubleshooting deeper.
What to read next
If you want to keep digging, here are a few directions to continue with.