Cache hit rate and cache policy quality
The hit rate is only an outcome metric. What really determines system cost is how invalidation behaves, how much pressure goes back to the origin, and how far errors propagate when something breaks.
When many teams talk about caching, the first thing they say is hit rate.
A hit rate of 95% sounds good; 98% looks well optimized; 99% even gives the illusion that the system has been “tamed”.
But when something goes wrong in production, the review usually surfaces a different set of problems: the database gets hammered after a batch of keys expires at the same moment, users still see stale data an hour after a price change, or a cache node wobbles and the most expensive queries all land directly on the primary database.
My judgment is simple: **the goal of a caching strategy is to control the cost of misses.** A high hit rate only means that many requests are served from the cache; it does not mean the invalidation design is correct, that the cost of going back to the origin is under control, or that failures will stay contained to a small area.
If a cache raises the hit rate from 92% to 99% at the cost of being harder to invalidate, harder to trace dirty data through, and more prone to back-to-origin storms under jitter, then it is most likely not an optimization; it just moves the complexity from the database to another corner.
The question is not “how often you hit”, but “what you hit when you miss”
The hit rate indicator has a natural flaw: it flattens the hot and cold distribution.
Suppose an interface receives one million requests a day: 990,000 read popular product details and the remaining 10,000 read long-tail products. Cache the popular data and the hit rate looks great immediately. But if those 10,000 long-tail misses land on the slowest, most expensive multi-join queries, the database pressure may be far higher than the hit rate suggests.
In other words, **the hit rate is counted in “times”, but system cost is settled in “price”.**
This also explains why some teams see beautiful cache metrics while the database CPU refuses to come down: the cache absorbs the cheap requests, and the really expensive misses still run unprotected.
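To make those numbers concrete, here is a back-of-the-envelope sketch; the request counts and per-query costs are invented for illustration, not measurements:

```go
package main

import "fmt"

func main() {
	// Assumed numbers, for illustration only.
	hotHits, hotCostMs := 990_000.0, 2.0       // cheap indexed reads, served from cache
	tailMisses, tailCostMs := 10_000.0, 300.0  // long-tail multi-join queries, all go to the DB

	hitRate := hotHits / (hotHits + tailMisses) * 100
	dbSecondsFromMisses := tailMisses * tailCostMs / 1000
	dbSecondsIfHotMissed := hotHits * hotCostMs / 1000

	fmt.Printf("hit rate: %.1f%%\n", hitRate)                                             // 99.0%
	fmt.Printf("DB time from the 1%% of misses: %.0f s/day\n", dbSecondsFromMisses)       // 3000 s
	fmt.Printf("DB time the cached 99%% would have cost: %.0f s/day\n", dbSecondsIfHotMissed) // 1980 s
}
```

Even at a 99% hit rate, the 1% of misses burns more database time per day than the entire cached hot path would have cost.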
When judging a caching strategy, I prefer to look at three things first:
- How expensive the most costly missed requests are;
- Whether misses arrive in concentrated bursts rather than spread evenly;
- Whether the pressure from a miss is passed on to downstream systems.
These three questions are closer to the true cost than “did the overall hit rate go up another 2 points?”
What really determines cache quality is usually how invalidation is designed.
Many cache incidents boil down to “the invalidation design was too crude”.
The most common approach is to just set a TTL. It is simple to implement and the metrics are easy to improve, because as long as the TTL is long enough the hit rate is usually not bad. But TTL-only schemes have two typical problems.
**First, it lets the clock, rather than business changes, decide when data expires.**
Prices change, inventory changes, permissions change, yet the cached value is still being served. The TTL can of course be shortened, but a shorter TTL pushes the back-to-origin frequency right back up. Many teams end up oscillating between “the data is too stale” and “the database is too busy”.
**Second, it easily produces synchronized expiration.**
If a batch of hot keys is written at around the same time with the same TTL, those keys will most likely also expire at around the same time. Under normal conditions the hit rate looks fine, but the moment the expiration point arrives there is an instantaneous back-to-source spike. On the dashboard it may be a five-minute blip; to the database and downstream services it is a very real punch.
Therefore, the most important question in cache design is often whether the invalidation mechanism is aligned with business changes.
Common modifications include:
- Use event-driven invalidation instead of relying on TTL alone;
- Add random jitter to the TTL so that hot keys do not expire at the same time;
- Put a version number or timestamp into the key so that switching to a new data generation is explicit;
- Allow stale values to be returned for a short window while refreshing asynchronously in the background, instead of letting every request go back to the origin at once.
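As a minimal sketch of two of these techniques, jittered TTLs and versioned keys; the base TTL, the jitter window, and the key layout here are assumptions for illustration:

```go
package cachex

import (
	"fmt"
	"math/rand"
	"time"
)

// ttlWithJitter spreads expirations across a window around the base TTL, so keys
// that were written together do not all expire together.
func ttlWithJitter() time.Duration {
	base := 10 * time.Minute
	jitter := time.Duration(rand.Int63n(int64(4*time.Minute))) - 2*time.Minute // ±2 min
	return base + jitter
}

// productKey puts the data generation into the key. Bumping the version on a business
// change is an explicit switch: readers move to the new generation immediately, and the
// old one simply stops being read instead of waiting for a TTL to run out.
func productKey(id, version int64) string {
	return fmt.Sprintf("product:%d:v%d", id, version)
}
```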
The core here is that **you have to decide when stale data is acceptable, when data must be immediately consistent, and when you would rather return a degraded result than overwhelm the primary database.** That is the caching strategy; the hit rate is just its shadow.
Extending the TTL often just defers the problem.
I have seen plenty of “cache optimization” done like this: the production database comes under pressure, so the TTL goes from 5 minutes to 30 minutes, then to 1 hour. The hit rate goes up, the database curve flattens a little, and the team declares the optimization complete.
The most dangerous part of this approach is how easily it looks effective in the short term.
Most metric systems are good at recording “how many queries were saved” and bad at recording “how much wrong data was served”. When stale prices, stale configurations, and stale permissions are read by more users, the losses rarely show up on the cache dashboard right away; they surface later as complaints, compensation, manual verification, and incident reviews.
It is usually a bad idea to crudely extend the TTL for the following kinds of data in particular:
- Data that affects monetary settlement;
- Data that affects permissions and visibility;
- Data that is updated infrequently but must converge as quickly as possible once it changes.
For this kind of data, **a hit rate bought by “recognizing changes late” is an overdraft on correctness.**
The hardest part of caching is limiting how far errors spread
A common situation is to think of caching as “an extra layer in front of the database.” This understanding is not wrong, but it is too optimistic.
Caching in a real system is a pressure surface that amplifies design flaws. Make keys too coarse and the blast radius of an invalidation becomes too large; make them too fine and memory usage and maintenance complexity go up; put data assembly in the cache layer and the computation gets repeated on every back-to-source; stack a local cache on top of a distributed cache and you add another layer of consistency problems.
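To make the granularity trade-off concrete, here is an illustrative pair of key layouts for the same product detail data; the names and fields are made up for the example:

```go
package cachex

import "fmt"

// A coarse key caches the fully assembled result: one price change invalidates the whole
// page for every region/variant combination it was assembled into.
func pageKey(productID int64, region, variant string) string {
	return fmt.Sprintf("product_page:%d:%s:%s", productID, region, variant)
}

// Fine-grained keys cache the parts: invalidation is narrow, but there are more keys to
// manage and every read pays an assembly step.
func priceKey(productID int64) string { return fmt.Sprintf("product_price:%d", productID) }
func stockKey(productID int64) string { return fmt.Sprintf("product_stock:%d", productID) }
```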
So I will divide the caching solution into two questions to review:
- How well does it behave when everything is normal?
- When something goes wrong, will it drag down other parts of the system?
The latter question is usually more important.
A more reliable cache implementation will often explicitly handle the following things:
```go
// group is a shared singleflight.Group; cache and db stand for whatever clients the service uses.
v, ok := cache.Get(key)
if ok && !v.SoftExpired() {
	return v.Data, nil
}
// On a miss or soft expiration, only one request per key actually hits the database.
data, err, _ := group.Do(key, func() (interface{}, error) {
	fresh, err := db.Load(id)
	if err != nil {
		return nil, err
	}
	cache.Set(key, fresh, ttlWithJitter()) // jittered TTL keeps hot keys from expiring together
	return fresh, nil
})
return data, err
```
What’s really valuable here are two constraints:
- When a key misses, only one request per key is allowed to actually go back to the database;
- Soft expiration gives the system a controllable refresh window.
This kind of design will not necessarily achieve the best hit rate, but it significantly reduces back-to-source storms and instantaneous amplification. When things jitter, the system “slows down” rather than “collapses”.
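One way the SoftExpired check above could be backed, sketched with assumed field names and an assumed refresh hook; the real refresh path would reuse the same single-flight load:

```go
package cachex

import "time"

// entry models a cached value with a soft and a hard deadline; the names are assumptions.
type entry struct {
	Data       any
	SoftExpiry time.Time // past this, keep serving the old value but refresh in the background
	HardExpiry time.Time // past this, the value must not be served at all
}

func (e entry) SoftExpired() bool { return time.Now().After(e.SoftExpiry) }
func (e entry) HardExpired() bool { return time.Now().After(e.HardExpiry) }

// getWithAsyncRefresh serves stale-but-usable data and refreshes it in the background,
// so a soft expiration never turns into a synchronized back-to-source wave.
func getWithAsyncRefresh(lookup func(string) (entry, bool), refresh func(string), key string) (any, bool) {
	e, ok := lookup(key)
	if !ok || e.HardExpired() {
		return nil, false // caller falls through to the single-flight load path
	}
	if e.SoftExpired() {
		go refresh(key) // assumed to share the same single-flight group as the snippet above
	}
	return e.Data, true
}
```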
A common misunderstanding: treating cache hit rate as a team KPI
Once the hit rate becomes a KPI, it is easy for a team to optimize in the wrong direction.
Because the easiest ways to raise the hit rate often only improve the statistics:
- Stretch the TTL;
- Cache more data that should not be cached;
- Use coarser-grained keys that mix different scenarios together;
- Reshape interfaces to be cache-friendly even when it makes the business semantics worse.
These moves may make the chart look better, but they push the cost elsewhere: the dirty-data window gets longer, invalidation gets harder, troubleshooting gets harder, memory gets more expensive, and business boundaries get blurrier.
A healthier approach is to look at caching along with the following metrics:
- P95/P99 back-to-source latency on misses;
- Number of concurrent back-to-source loads on hot keys;
- Error rate and database load peak after cache invalidation;
- The time it takes for data to converge to new values after updating;
- How many additional code paths and operational actions are introduced to maintain the cache.
Only when these costs are put together does the hit rate have explanatory power. Looking at it alone, it’s too easy to be fooled.
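A minimal sketch of measuring the miss path itself rather than only the hit rate; the wrapper name and the log format are stand-ins for whatever metrics system is actually in use:

```go
package cachex

import (
	"log"
	"sync/atomic"
	"time"
)

// inflight counts concurrent back-to-source loads across all keys; a per-key map would be
// needed to spot a single hot key, this is the coarsest useful version.
var inflight int64

// loadWithMetrics wraps the origin load so that misses are measured directly:
// back-to-source latency (feeding P95/P99) and how many loads run concurrently.
func loadWithMetrics(key string, load func() (any, error)) (any, error) {
	n := atomic.AddInt64(&inflight, 1)
	defer atomic.AddInt64(&inflight, -1)

	start := time.Now()
	v, err := load()
	log.Printf("cache_miss key=%s back_to_source=%s inflight=%d err=%v",
		key, time.Since(start), n, err)
	return v, err
}
```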
Counterexample: in some scenarios the hit rate really should be as high as possible
That said, this should not be turned into an absolute.
If the data is naturally stable, or simply an immutable resource, such as static files, versioned configurations, CDN-distributed content, and dictionary tables that rarely change, then the hit rate really is a very important goal. In those scenarios, “slightly old” is not a serious problem, invalidation is relatively simple, and the back-to-source path is usually clear and controllable.
There are also internal tooling systems where the data volume is small, writes are infrequent, and the business cost of an occasional dirty read is very low. There, a plain TTL solution is probably the right call: it does not look sophisticated, but it is cheap enough.
What this article argues against is taking the hit rate out of context and using it on its own as evidence that a cache is working.
Summary
To judge whether a caching solution is reliable, I first ask: **what happens when the most expensive miss in this system occurs?**
If the answer is “things get a little slower, but the system stays stable”, then the cache is most likely well designed.
If the answer is “the hit rate is usually very high, but once invalidation goes wrong it punches through to the primary database and drags correctness down with it”, then it is just a latent hazard that looks hard-working in ordinary times.
Caching is never just about “putting data closer”. What it really tests is how you deal with change, invalidation, and cost. A high hit rate means, at most, that you have done the first half well; it is the second half that determines whether the strategy is worth it.