High-cost decision-making chain in AI systems
Saving money on inference is easy. The real cost control is turning online behavior into a reproducible chain of evidence.
Online costs will keep going up. Often it is not just the token unit price that is expensive, but the same class of problem that has to be investigated over and over. On the surface you think you are buying an inference service; in reality you are buying a system whose behavior can change at any time, and when something goes wrong you cannot produce a complete chain of evidence.
This is why I increasingly distrust the formula “AI unit = token”.
For the same call, whether it is reproducible or not determines what you pay downstream in engineering cost, review cost, and compliance cost.
How things got out of control
At the beginning, our cost accounting was very simple; the whole bill came down to three numbers:
- Token unit price
- Number of input and output tokens
- Call volume
The resulting report looked great: a clean cost-reduction curve we could even show the outside world as “how much unit cost has dropped.”
The real problem occurred in the second week after launch.
Customer service staff began reporting that “the same question is sometimes answered correctly and sometimes not.” Product asked, “Is the model getting worse?” Our first reaction was to check the model version; it had not changed.
Then we checked the prompt; the prompt had not changed either.
Digging further into the logs, I found that the request had actually gone through multi-model routing, hitting different models from different providers, with inconsistent tool calls. Worse, the logs at the time recorded only the “final output”: no reason for the routing decision, and no context snapshot.
This kind of problem becomes a classic troubleshooting dead end:
- Cannot be reproduced
- Cannot be attributed
- Can only guess
Guessing usually produces one of two conclusions, both wrong:
- Blame “model randomness”, then suppress it by lowering the temperature and adding penalties.
- Blame “the prompt was not written well”, then pile on instructions until the prompt itself becomes another uncontrollable system.
Both approaches make tokens more expensive on the books without making the system any more controllable.
This type of cost will blow through the budget
Token cost is linear: a call that costs 10% more really does cost about 10% more.
The cost of non-reproducibility compounds, because it amplifies the handling of every single online incident (a back-of-envelope comparison follows this list):
- Troubleshooting time goes from 30 minutes to 3 hours because the same request cannot be replayed locally.
- Rollback decisions are slower because nobody knows what to roll back: the model, the route, or the tool.
- Compliance evidence collection becomes difficult because no one can answer “why was this conclusion output at the time, and based on what inputs.”
- Rework costs rise because patches take the form of “adding more guardrails”, and the guardrails themselves need maintenance.
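To make the asymmetry concrete, here is a back-of-envelope comparison in Python. Every number below is a made-up assumption (call volume, unit price, incident rate, engineer rate); only the 30-minutes-versus-3-hours gap comes from the experience above.

```python
# Back-of-envelope comparison; every number here is an assumption.
calls_per_month = 1_000_000
token_cost_per_call = 0.002        # USD per call, assumed
token_saving_rate = 0.10           # the "10% cheaper" optimization

monthly_token_saving = calls_per_month * token_cost_per_call * token_saving_rate

incidents_per_month = 20           # assumed
engineer_rate = 80                 # USD per hour, assumed
hours_with_replay = 0.5            # 30 minutes when the request can be replayed
hours_without_replay = 3.0         # 3 hours when it cannot

irreproducibility_tax = (
    incidents_per_month * engineer_rate * (hours_without_replay - hours_with_replay)
)

print(f"token saving:          ${monthly_token_saving:,.0f}/month")   # $200/month
print(f"irreproducibility tax: ${irreproducibility_tax:,.0f}/month")  # $4,000/month
```

Under these assumptions, one month of slower troubleshooting alone dwarfs the token saving, before counting slower rollbacks and compliance work.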
The most hidden cost: teams end up forced to pour engineering resources into “stabilizing online behavior” instead of “improving capability.”
That is why many teams increasingly look like they are maintaining a complex rule system; in the end they neither save money nor get smarter.
What I recalculate the AI unit into
If you count “AI units” purely in tokens, you tend to optimize toward a set of very dangerous strategies:
- To save money, route and downgrade more aggressively.
- To save money, move more logic into prompts and tools.
- To save money, leave more judgment calls for the model to “decide automatically”.
All of these push the system toward irreproducibility.
I prefer to split the AI unit into two parts:
- Inference unit: tokens, latency, throughput.
- Evidence unit: how much it costs to make a decision traceable.
The inference unit answers “how much does it cost to run”.
The evidence unit answers “how much does it cost when something goes wrong”.
The really expensive one is usually the second.
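As a minimal sketch of the split in code (the types and field names are mine, not a standard):

```python
from dataclasses import dataclass

@dataclass
class InferenceUnit:
    """Answers 'how much does it cost to run'."""
    tokens_in: int
    tokens_out: int
    latency_ms: float

@dataclass
class EvidenceUnit:
    """Answers 'how much does it cost when something goes wrong'."""
    ledger_bytes: int     # storage for the decision ledger entry
    snapshot_bytes: int   # storage for context snapshots and summaries
    replay_hours: float   # expected engineer-hours to reproduce a decision path
```

The point of making the second type explicit is that it can be budgeted per call class, instead of disappearing into “engineering overhead”.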
What a reproducible decision chain should look like, at minimum
I treat it as a “ledger”: every request must be able to string its key nodes together.
At least these kinds of fields are required; if any one is missing, the chain breaks in some class of incident:
- Routing decision: which model was hit, why, what the candidates were, and whether a downgrade occurred.
- Prompt version: system + developer + template version numbers, plus key parameters.
- Context snapshot: a summary of the retrieval results that fed the generation, document versions, and permission-filtering results.
- Tool call chain: which tools were called, with what input parameters, what they returned, and how long each took.
- Output and post-processing: the final output, which filtering rules were hit, and the reason for rejection (if rejected).
I deliberately do not treat the “full-text context” as a required item here, because in many scenarios it cannot be stored, or storing it carries too much compliance risk.
But at minimum, it must be possible to replay the “same decision path” when needed.
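As a minimal sketch of one ledger entry (the schema and field names are illustrative assumptions, not a standard):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DecisionLedgerEntry:
    """One entry per request; the field groups mirror the list above."""
    request_id: str

    # Routing decision
    model_hit: str                   # e.g. "provider-a/model-x"
    routing_reason: str              # why this model was chosen
    candidates: list[str]            # models that were considered
    downgraded: bool

    # Prompt version
    prompt_versions: dict[str, str]  # {"system": "v12", "developer": "v3", "template": "v7"}
    key_params: dict[str, float]     # e.g. {"temperature": 0.2, "top_p": 0.9}

    # Context snapshot: summaries and versions, not full text
    retrieval_summary: str
    document_versions: dict[str, str]
    permission_filter_result: str

    # Tool call chain
    tool_calls: list[dict]           # [{"tool": ..., "args": ..., "result": ..., "ms": ...}]

    # Output and post-processing
    final_output: str
    filter_rules_hit: list[str] = field(default_factory=list)
    rejection_reason: Optional[str] = None
```

With an entry like this, replaying the “same decision path” means re-running the request pinned to model_hit, prompt_versions, and the recorded tool results.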
The most common misconceptions
Misconception 1: Relying on temperature to suppress randomness
Randomness is not the core issue.
The real problem is that I cannot even explain where this output came from. Lowering the temperature only makes the system “a more stable black box”.
Misconception 2: Treating the prompt as a configuration center
When the prompt carries more and more business rules, it is no longer a prompt; it is “runtime configuration” with no type system, no tests, and no rollback mechanism.
That directly drives up the evidence unit.
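A minimal sketch of the alternative, pulling one rule out of the prompt into versioned, testable code. RefundPolicy and its fields are hypothetical, invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundPolicy:
    """A hypothetical business rule: typed, testable, versioned."""
    version: str
    max_auto_refund: float  # refunds above this must escalate to a human

POLICY = RefundPolicy(version="2024-06-01", max_auto_refund=50.0)

def build_prompt(question: str) -> str:
    # The prompt references the rule; the rule itself lives in code,
    # where its version can be logged per request and rolled back cleanly.
    return (
        f"[policy {POLICY.version}] You may approve refunds up to "
        f"{POLICY.max_auto_refund:.2f}; escalate anything larger to a human.\n"
        f"Customer question: {question}"
    )
```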
Misconception 3: Recording only the final output, not the intermediate path
Recording only the output turns troubleshooting into guessing.
Many online problems come from one bad tool call, one bad retrieval hit, or one bad routing downgrade. If you do not record the path, you can only reason backward from the result, and backward inference usually fails.
Where this applies
Not all systems require a complete ledger for every request.
I use three conditions to decide whether to pay for the evidence unit:
- Does this output enter a business closed loop (affecting transactions, approvals, risk control, or external commitments)?
- Can this output be held to account by users or external audits?
- If this output is wrong, does fixing it cost more than one inference?
If any two of the three are met, I treat the “reproducible decision chain” as the first priority of cost control.
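The rule is mechanical enough to write down; a one-line check, with my own names for the three conditions:

```python
def needs_evidence_unit(enters_business_loop: bool,
                        externally_accountable: bool,
                        repair_costs_more_than_inference: bool) -> bool:
    """The two-of-three rule from the list above."""
    conditions = [enters_business_loop, externally_accountable,
                  repair_costs_more_than_inference]
    return sum(conditions) >= 2

# A refund-approval bot: enters the business loop and is auditable.
assert needs_evidence_unit(True, True, False)
# An internal brainstorming assistant: none of the three apply.
assert not needs_evidence_unit(False, False, False)
```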
Summary
Tokens are an explicit cost; non-reproducibility is a hidden tax.
A truly cost-effective AI system turns every online behavior into a traceable chain of evidence.
What it saves you is those late nights during the next incident.
What to read next
If you want to keep digging, here are a few directions to continue reading.