
The fault-localization problem after cost reduction with multi-model routing

What cost reduction saves you is the price of a single call; what you pay for online is the cost of reproduction, the cost of attribution, and the time lost repeatedly misjudging a “quality” problem as “randomness”.

The first time I sensed something was wrong online was a complaint that was hard to explain: the same user asked the same question three times within five minutes. The first answer was fine, the second was nonsense, and the third was back to normal.

There were no anomalies in the logs, latency was stable, and token usage had not surged. The only visible change was that we had just turned on “multi-model routing” and started covering part of the traffic with cheaper models.

At the time, the team’s intuition was unanimous: a model is a probability distribution, fluctuations are normal. Besides, routing is just one layer in the gateway, so how much harm can it do?

Over the next two weeks, that sentence came back to bite us again and again.

Core judgment

The core cost of multi-model routing is that it turns the behavior of the same request from “reproducible” into “a probability distribution”.

In the single-model era, no matter how hard an online problem was, once you had the input, prompt, context, and version number, you could almost always reproduce it in some environment and then follow the call chain to the root cause.

Once routing is involved, the questions become:

  • Which model, which version, and which set of parameters did this request use?
  • Why was this route chosen, and which feature thresholds did it hit?
  • Did two requests in the same session take different paths?
  • When a failure occurs, can the “decision made at that time” be re-run?

Without traceable logs and a rollback mechanism, online faults escalate from “the model was inaccurate” to “the root cause cannot be located”.

How things get harder step by step

At the beginning we adopted only the simplest strategy: use the small model whenever possible, and escalate to the large model only when a request “looks complicated”.

The so-called “looks complicated” came down to a few cheap features: input length, whether the input contains code blocks, whether the dialogue has multiple turns, and the confidence score of a small classifier.
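As a concrete sketch, that first routing logic was roughly the shape below; the feature names and thresholds are illustrative placeholders, not our production values:

```python
from dataclasses import dataclass

@dataclass
class RoutingFeatures:
    input_chars: int               # length of the raw user input
    has_code_block: bool           # does the input contain a fenced code block
    dialogue_turns: int            # number of prior turns in this session
    classifier_confidence: float   # small classifier's "complex task" score, 0..1

def choose_model(f: RoutingFeatures) -> str:
    """Cheap-feature routing: default to the small model, escalate to the
    large model when the request "looks complicated". Thresholds are made up."""
    looks_complicated = (
        f.input_chars > 2000
        or f.has_code_block
        or f.dialogue_turns > 4
        or f.classifier_confidence > 0.7
    )
    return "large-model" if looks_complicated else "small-model"
```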

The first wave of problems after launch was that our usual troubleshooting methods stopped working.

The same prompt could not be reproduced by testers in the gray-release environment, nor locally by developers. In the end, only online users could trigger it reliably.

We suspected a bug in context concatenation, caching, or some post-processing step. Only when we captured the full input of one request did we discover that the small model had been used online that time, while our reproduction attempts were hitting the large model by default.

This is a “path change”.

A path change turns troubleshooting from “reproducing the input” into “reproducing the decision”, and at that point the decision could not be replayed.

Misconception 1: treating routing as pure cost optimization

What you see in the cost table is:

  • 30% of traffic goes to small models
  • Average cost per call dropped by 18%

But what you can’t see in the fault table is:

  • Every quality problem takes an extra 1–2 days just to determine whether routing caused it
  • Reproducing an issue online requires a much more complete “decision context”
  • Rollback is no longer “roll back the model” but “roll back the policy + roll back the thresholds + roll back the feature-extraction logic”

If you treat routing as a lightweight change like “switching vendors”, you will inevitably pay the interest later, during troubleshooting.

Misconception 2: interpreting instability as “LLMs are inherently random”

Most problems caused by the randomness of a single model look like “the same input, sampled multiple times, gives different outputs”.

The problem caused by the randomness of routing is that “the same input goes to different systems.”

Both look like fluctuations, but are diagnosed in completely different ways.

For the former, you typically adjust temperature and system prompts and add constraints; for the latter, you must first answer: did this request take the wrong path?

Without routing decision logs, the team slides into a very bad habit: attributing every anomaly to “model instability”. The policy then grows more and more aggressive, and quality looks more and more like a dice roll.

Three kinds of traceability that actually need to be in place

To make routing a system you can actually troubleshoot, at least three kinds of records must be in place, and they must be joinable at the level of a single request.

1) Routing decision log (decision log)

Do not just record “which model was selected”; also record:

  • The candidate set (which models were available at that time)
  • The score or threshold judgment for each candidate
  • The feature values used (input length, turn count, classifier output, etc.)
  • The policy version number (critical)

Only then can you answer “why did it choose this path this time?”
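A minimal sketch of one such decision record, assuming a JSON-lines log keyed by request ID; the field names and values are illustrative, and the point is that everything that influenced the choice is captured together with the policy version:

```python
import json
import time
import uuid

def log_routing_decision(request_id: str, candidates: list[str],
                         scores: dict[str, float], features: dict,
                         policy_version: str, chosen: str) -> str:
    """Serialize one routing decision so "why did it choose this path"
    can be answered later without guessing."""
    record = {
        "request_id": request_id,
        "timestamp": time.time(),
        "candidates": candidates,          # models available at decision time
        "scores": scores,                  # per-candidate score or threshold result
        "features": features,              # input length, turn count, classifier output...
        "policy_version": policy_version,  # which version of the routing policy ran
        "chosen_model": chosen,
    }
    return json.dumps(record, ensure_ascii=False)

# Example: one logged decision for one request
line = log_routing_decision(
    request_id=str(uuid.uuid4()),
    candidates=["small-model", "large-model"],
    scores={"small-model": 0.62, "large-model": 0.91},
    features={"input_chars": 3100, "dialogue_turns": 5, "classifier_confidence": 0.74},
    policy_version="route-policy-v7",
    chosen="large-model",
)
```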

2) Request snapshot (replay snapshot)

At minimum, the following should be recoverable when a failure occurs:

  • Raw user input
  • The prompt actually sent to the model (including the system prompt, the concatenated context, and tool results)
  • Key configuration (temperature, top_p, max_tokens, stop, and our own post-processing switches)

Without snapshots, reproduction is just guesswork.
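A minimal sketch of such a snapshot, assuming the fields listed above; the names are illustrative, and retention, encryption, and access control are deliberately left out even though they matter for user data:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RequestSnapshot:
    """Everything needed to re-send "the same request" during an incident review."""
    request_id: str
    raw_user_input: str
    final_prompt: str                  # what was actually sent: system prompt + context + tool results
    model: str
    temperature: float
    top_p: float
    max_tokens: int
    stop: list[str] = field(default_factory=list)
    postprocess_enabled: bool = True   # our own post-processing switch

def save_snapshot(snapshot: RequestSnapshot, path: str) -> None:
    # One JSON line per request; appending keeps the write path simple.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(snapshot), ensure_ascii=False) + "\n")
```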

3) Routing rollback (rollback primitive)

A rollback should be “blunt” enough to execute with one click:

  • Force all traffic onto a known-stable model
  • Or pin a specific policy version

Don’t expect to fine-tune thresholds in the middle of an incident. What an incident needs is certainty.
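A minimal sketch of such a blunt override, assuming a hypothetical in-process kill switch and policy registry; in practice this would sit in a config store that can be flipped with a single action:

```python
from typing import Callable, Optional

# Deliberately blunt kill switch: when either field is set, the normal routing
# policy is bypassed or pinned to a known version.
FORCE_MODEL: Optional[str] = None           # e.g. "large-model-stable"
PIN_POLICY_VERSION: Optional[str] = None    # e.g. "route-policy-v6"

# Hypothetical registry mapping policy versions to routing functions.
POLICIES: dict[str, Callable[[dict], str]] = {
    "route-policy-v7": lambda f: "large-model" if f.get("input_chars", 0) > 2000 else "small-model",
    "route-policy-v6": lambda f: "large-model",   # older, conservative version
}

def route(features: dict, current_version: str = "route-policy-v7") -> str:
    if FORCE_MODEL is not None:
        return FORCE_MODEL                        # incident mode: certainty over cleverness
    version = PIN_POLICY_VERSION or current_version
    return POLICIES[version](features)
```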

Failure case: the seemingly smart “adaptive threshold”

We later tried something “smarter”: dynamically adjusting the threshold based on the quality signal of the past 10 minutes, so the small model could swallow more traffic.

The result was a textbook self-oscillation:

  • The small model swallows more traffic, and the quality signal gets worse
  • The threshold is raised, the large model swallows more, and the quality signal recovers
  • Then the threshold is lowered again, and the small model swallows more

From the outside it looks like “good days and bad days”; on the inside, the policy is swinging back and forth.
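A simplified sketch of the feedback loop we had effectively built; all numbers and the stand-in quality function are invented, but running it for a few steps already shows the threshold chasing a signal that it is itself moving:

```python
threshold = 0.5        # minimum confidence required before a request may go to the small model
target_quality = 0.8

def small_model_share(th: float) -> float:
    # Stand-in: the higher the bar, the less traffic the small model takes.
    return max(0.0, 1.0 - th)

def quality_last_10m(share: float) -> float:
    # Stand-in for the aggregated quality signal: more small-model traffic, lower quality.
    return 1.0 - 0.5 * share

for step in range(6):
    share = small_model_share(threshold)
    quality = quality_last_10m(share)
    if quality < target_quality:
        threshold += 0.3   # quality bad -> raise the bar, large model swallows more
    else:
        threshold -= 0.3   # quality good -> lower the bar, small model swallows more
    print(f"step={step} share={share:.2f} quality={quality:.2f} next_threshold={threshold:.2f}")
```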

Without a policy version number and decision logs, this kind of problem is basically impossible even to explain, let alone fix.

Applicable boundaries

Multi-model routing is not unworkable, but it is better suited to teams that meet the following prerequisites:

  • They can accept the extra storage and privacy-compliance cost that traceability requires
  • They have clear quality metrics and alerts, instead of relying on user complaints
  • They can maintain the routing policy as an “online system” with versions, gray releases, and rollbacks

If your observability is still limited to “request volume, latency, error codes”, don’t rush into complex routing yet. The money saved may well be lost again in troubleshooting time.

Summary

The most underestimated aspect of multi-model routing is that it changes what you have to reproduce when troubleshooting.

What used to be reproduced was the input; what now has to be reproduced is the decision. Without decision logs, request snapshots, and rollback primitives, online failures become “randomness” that can neither be explained nor fixed.

The cost-reduction ledger is easy to calculate; the reproduction ledger is the hardest to calculate, but sooner or later it shows up in the incident review.
