
Confident errors brought about by high RAG recall

What spirals out of control first is conflicting evidence, expired documents, and content with inconsistent permissions entering the context together. The answers start to sound complete, but the chain of evidence comes loose.

When teams first bring RAG into production, the metric they focus on is usually recall volume.

If 3 hits are not enough, raise it to 8; if 8 is still unstable, keep relaxing the vector similarity threshold, then stack BM25, tag filtering, and synonym expansion on top. The hit rate on the dashboard does look nicer, and many issues seem to be "covered." But after running in production for a while, a harder class of problem emerges: the answers begin to sound more and more like the truth, in an increasingly complete tone, yet once the sources are carefully checked, they turn out to mix in old-version rules, other tenants' documents, obsolete SOPs, and even conflicting instructions.

My judgment on this type of problem is: **RAG's reliability often suffers from "recalling too many things that should not appear at the same time." Once the context is stuffed with conflicting information, expired documents, and content with inconsistent permissions, the model will not honestly tell you "the evidence conflicts, so this cannot be answered." Far more often, it follows the inertia of language and stitches these fragments into an answer that looks complete, even though the evidence chain has already come loose.**

At first this kind of problem looks like insufficient recall; it later turns out to be context pollution.

The case that first made me articulate this judgment clearly looked, on the surface, like the model's answers being too short; in reality it was closer to them being too smooth.

The scenario was an internal enterprise knowledge Q&A system. A user asked a very specific question about the reimbursement approval flow: when an overseas business trip exceeds the expense limit, should it go to the direct supervisor for approval first, or to the cost center for review first? At first the system often failed to answer, and the reason was simple: the relevant policies were scattered across different knowledge bases, and vector search could usually retrieve only half of them.

So the team made a very typical round of enhancements (a minimal sketch of the resulting anti-pattern follows the list):

  • Raised topK from 4 to 10;
  • Added keyword recall as a fallback channel;
  • Relaxed synonym matching;
  • Put historical announcements, FAQs, and policy texts into the same candidate set.
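
To make the failure mode concrete, here is a minimal sketch of what this kind of recall widening often looks like in code. The helpers (vector_search, bm25_search, expand_synonyms, dedupe_by_doc_id) are hypothetical stand-ins for whatever retrieval stack is actually in use, not any specific library's API:

```python
# A minimal sketch of the "recall widening" anti-pattern.
# All helper functions here are hypothetical stand-ins.

def retrieve(query: str) -> list[dict]:
    candidates: list[dict] = []

    # 1. Vector recall, topK raised from 4 to 10, threshold relaxed
    candidates += vector_search(query, top_k=10, min_similarity=0.55)

    # 2. Keyword (BM25) recall as a fallback channel
    candidates += bm25_search(query, top_k=10)

    # 3. Synonym expansion widens the query itself
    for variant in expand_synonyms(query):
        candidates += vector_search(variant, top_k=5, min_similarity=0.50)

    # 4. Announcements, FAQs, and policy texts all land in one pool.
    # Note what is missing: no tenant, region, version, or permission
    # constraint. Anything "related in wording" flows into the prompt.
    return dedupe_by_doc_id(candidates)
```

Every step here raises the hit rate on the dashboard; none of them asks whether the candidates belong to the same decision space.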

It worked well in the short term. The reply was no longer "No relevant information found"; it could now lay out complete steps. The problem starts here: users reported that the answers "sounded like the right answer," but anyone who actually followed them got the order of steps wrong.

Later, when I took one wrong answer apart, three kinds of material had appeared simultaneously in the model's context:

  1. The approval chain from the old policy, six months out of date;
  2. Exception clauses from the new policy text;
  3. An FAQ description that applied to a different regional entity.

None of these three is garbage; each even looks "highly relevant" on its own. The problem is that they do not belong to the same decision space. What the model receives is a pile of fragments that are related in wording but inconsistent in business boundaries. The answer it generated kneaded the three into a brand-new process.

This is where many RAG projects are most easily misjudged: on the surface, it looks like "recall got stronger," but in essence the retrieval error has been upgraded from "lack of evidence" to "dirty evidence entering the generation stage."

More recall does not make the model more cautious, only better at stitching seams

A common default assumption is that giving the model more information will, at worst, simply give it more to choose from.

But the reality is closer to another mechanism: **the longer the context, the more fragments, and the looser their semantic relationships, the easier it is for the model to splice "partly reasonable" into "globally true."**

This is because the generation stage faces a linearized string of text. As long as the passages can bridge one another at the surface level, the model will naturally tend to bridge them. This tendency is especially strong when:

  • Two documents reach different conclusions but share a lot of business terminology;
  • The new policy superseded the old one but never explicitly said "the old rules are repealed";
  • The FAQ paraphrases the policy colloquially but omits the conditions under which it applies;
  • Multi-tenant, multi-region, and multi-version content is recalled together, distinguished only in metadata.

At that point, the model rarely surfaces "I see a conflict." Instead, it tends to do three things:

  1. Prioritize the sentences that best form a complete narrative;
  2. Fill in causal connections that the context never explicitly states;
  3. Swallow boundary conditions and replace them with phrasing that sounds like a general rule.

What the user finally sees is an answer that is smooth, full, and appears comprehensively reasoned. The real danger is that it papers over the conflict.

Outdated documents are not noise; they actively dilute the weight of new evidence

When teams troubleshoot wrong RAG answers, they tend to treat expired documents as "low-quality noise" and assume that as long as there are few of them, it is not a big problem.

But in the generation stage, expired documents often act as competing evidence that actively shifts the focus of the answer.

A typical example I have seen is a customer-service knowledge base. A refund rule had been changed in the new policy version, but the old FAQ, thanks to its high view count and more colloquial wording, was more likely to rank high at the recall stage. The new policy text was accurate but stiff; the old FAQ read smoothly and followed a complete rhetorical template. As a result, when answering, the model very easily treated the new rule as a local restriction and the old FAQ as the main narrative.

The final answer often looks like this:

Normally, the user can first request a refund to the original payment method; promotional items require further review.

What makes this sentence so potent is that almost every word of it can be found in the context, yet the sentence as a whole exists in no single source. The actual new rule may have been changed to "promotional items do not support refunds to the original payment method," while the "normally" from the old FAQ was taken by the model as the general clause, demoting the new rule to a mere exception.

Therefore, the problem with expired documents is never just that "old information got mixed in." It is that **old information often reads more like natural speech, which makes it easier for the model to use as the skeleton of an answer**.

Recall that crosses permission boundaries is worse than a wrong answer: it creates "seemingly well-founded" answers that exceed the user's authority

Another issue that is often underestimated is permission boundaries.

Many internal RAG systems put permission checks at the level of "can this document be opened," assuming that as long as the original text is never shown to the user, all is well. The real danger of a generative system is this: **once a restricted document enters the context, even if its text is never quoted, the answer itself may already reveal judgments the user should not know.**

For example, a salesperson asks about contract approval. The public knowledge base contains only the general procedure, while the legal knowledge base contains an exception clause for special customers. If the retrieval stage does "recall first, crop later," the model may already have absorbed that exception rule at the drafting stage, and finally output a seemingly neutral suggestion:

Such customers usually require additional regional head approval.

The user never sees the restricted document, yet they have been handed an organizational rule they were not supposed to know. More troubling still, this sentence is hard to identify as a leak by its form, because it looks less like copy-and-paste and more like something the model "summarized on its own."

Therefore, permissions cannot be understood only as access control; they must be understood as evidence-source control. The moment materials outside the same visibility scope are fed to the model together, the system has already crossed the line. Downstream redaction and citation restrictions only clean up contamination that has already happened.
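
One way to express evidence-source control rather than display control is to apply the permission filter when the candidate set is constructed, before anything can reach the context. A minimal sketch, assuming a metadata-filterable vector store; search, the acl field, and permission_scopes are illustrative names, not any specific product's API:

```python
def retrieve_for_user(query: str, user) -> list[dict]:
    # Permission filtering happens at candidate-set construction,
    # not after generation. Restricted documents never enter the
    # prompt, so the answer cannot "summarize" what the user is
    # not allowed to know.
    visible_scopes = list(user.permission_scopes)  # e.g. ["public", "sales"]

    return search(
        query,
        top_k=8,
        # Pre-recall constraint, not post-generation cropping
        filter={"acl": {"$in": visible_scopes}},
    )
```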

What really needs optimizing: converge the evidence along decision boundaries first

Many RAG systems grow more and more chaotic over time. On the surface it looks like the model is too weak; in reality, it is closer to the optimization direction of the retrieval stage being off from the start.

The trap teams fall into most easily is treating recall as a search-engine problem:

  • If relevance is lacking, add another recall channel;
  • If coverage is lacking, bump topK a little higher;
  • If users phrase queries inconsistently, do more query rewriting.

None of these actions is necessarily wrong, but without "decision boundary" constraints, they ship ever more material that should never co-occur into the generation stage.

What I now pay more attention to is a different convergence sequence:

1. Do scope convergence first, then relevance ranking

Many questions can have their scope narrowed before semantic retrieval even runs, for example by:

  • Organizational entity;
  • Region or country;
  • Effective time;
  • Document type;
  • User permission fields.

If these conditions are not applied first and ranking relies solely on embedding similarity, the results are guaranteed to include things that are merely "similar" but out of scope. That is not a ranking failure; the candidate set itself was defined incorrectly.
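
A minimal sketch of this ordering, with hypothetical names (vector_search, the filter syntax, and the ctx fields are assumptions, not a particular store's API): scope is a hard constraint that defines the candidate set, and similarity only ranks within it.

```python
from datetime import date

def build_candidate_set(query: str, ctx) -> list[dict]:
    # Step 1: scope convergence — the candidate set is defined by
    # decision boundaries before any embedding similarity is computed.
    scope_filter = {
        "entity": ctx.entity,             # organizational entity
        "region": ctx.region,             # region or country
        "status": "active",               # currently effective documents only
        "doc_type": {"$in": ["policy", "faq"]},
        "acl": {"$in": ctx.user_scopes},  # user permission fields
    }

    # Step 2: relevance ranking runs *inside* the converged scope.
    candidates = vector_search(query, top_k=8, filter=scope_filter)

    # Effective time as a hard condition, not a soft ranking signal.
    today = date.today().isoformat()
    return [c for c in candidates if c["effective_from"] <= today]
```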

2. Treat version and effective time as first-class citizens, not incidental metadata

Many knowledge bases plainly have updated_at, version, and status fields, but these are used only in the presentation layer and barely participate in retrieval or context assembly. Old and new documents are treated as equals, and the model has no idea which should override which.

A more robust approach is to handle the override relationship explicitly (a sketch follows the list):

  • Deprecated documents do not enter the generation context by default;
  • When old and new rules conflict, mark the conflict explicitly rather than letting the model synthesize freely;
  • An FAQ may never override the policy text; it can only serve as an explanatory layer on top of it.
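
Here is a minimal sketch of those three rules as a resolution step between retrieval and generation. The document fields (rule_id, version, doc_type, status) are assumed metadata, named for illustration:

```python
def resolve_overrides(candidates: list[dict]) -> dict:
    # Group candidates by the rule they describe, then apply override
    # relations explicitly instead of letting the model arbitrate.
    by_rule: dict[str, list[dict]] = {}
    for c in candidates:
        if c.get("status") == "deprecated":
            continue  # rule 1: deprecated docs never enter the context
        by_rule.setdefault(c["rule_id"], []).append(c)

    kept: list[dict] = []
    conflicts: list[str] = []
    for rule_id, docs in by_rule.items():
        policies = [d for d in docs if d["doc_type"] == "policy"]
        faqs = [d for d in docs if d["doc_type"] == "faq"]

        if len({p["version"] for p in policies}) > 1:
            conflicts.append(rule_id)  # rule 2: surface it, don't synthesize
            continue

        kept.extend(policies)
        # Rule 3: FAQs only explain a policy text that is present;
        # an FAQ without its policy is dropped rather than promoted.
        if policies:
            kept.extend(faqs)

    return {"context_docs": kept, "conflicts": conflicts}
```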

3. Expose conflicts instead of making the model the referee on the system's behalf

Many systems default to splicing multiple candidate materials together and handing them to the model, hoping it will "comprehensively understand" them on its own. That step is precisely the most dangerous one, because it outsources the handling of evidence conflicts to the layer best at patching over gaps.

If two high-weight documents reach conflicting conclusions, the more reasonable system behavior is usually to tell the user explicitly:

  • That conflicting rules were found;
  • Where the points of conflict are;
  • Which version is used by default, or that manual confirmation is required.

It does not sound as silky, but it is genuinely controllable. A system that acknowledges conflict is more trustworthy than one that returns a complete but adulterated answer.
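
Continuing the sketch above, a conflict detected by resolve_overrides can short-circuit generation entirely; llm_generate and pick_default are hypothetical placeholders:

```python
def answer(query: str, resolved: dict) -> dict:
    # If high-weight evidence conflicts, say so instead of generating
    # one seamless answer over contradictory sources.
    if resolved["conflicts"]:
        return {
            "type": "conflict",
            "message": "Conflicting rules were found for this question.",
            "conflicting_rules": resolved["conflicts"],
            "default_version": pick_default(resolved),  # hypothetical policy
            "action": "Please confirm which version applies before acting.",
        }

    context = "\n\n".join(d["text"] for d in resolved["context_docs"])
    return {"type": "answer", "text": llm_generate(query, context)}
```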

A particularly common failure: treating reranking as the final solution

When teams discover that "more recall means more chaos," their immediate move is to add a reranker. Ranking quality does improve, so they consider the problem solved.

But what a reranker solves is mainly "which candidate looks most like the answer to this question"; it cannot solve "whether these candidates belong to the same mergeable fact space."

If the candidate set simultaneously contains:

  • Region A's 2024 rules;
  • Region B's 2025 rules;
  • Internal exception instructions for administrators;
  • The FAQ for ordinary employees,

then the reranker merely ranks two or three of them higher. It fundamentally cannot decide, on the system's behalf, whether these materials may be fed to the model together.

This also explains why many RAG evaluations look good offline but start to drift as soon as they hit complex scenarios online. Offline test sets tend to contain single, standardized questions with clean boundaries; the real complexity of online questions is that they are entangled with versions, permissions, organizational structures, and exceptions. Ranking only puts the most similar materials first; it does not police the candidate set for you.
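
One way to keep the reranker in its lane is to partition candidates into fact spaces first and rerank only within one of them. A minimal sketch; the grouping fields (region, effective_year, audience) are illustrative, drawn from the example above:

```python
from itertools import groupby

def partition_fact_spaces(candidates: list[dict]) -> list[list[dict]]:
    # A reranker orders candidates; it does not decide whether they may
    # be merged. Group by the boundaries that define one fact space first.
    key = lambda c: (c["region"], c["effective_year"], c["audience"])
    ordered = sorted(candidates, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]

# Usage: rerank *within* a single fact space, never across all of them.
#   spaces = partition_fact_spaces(candidates)
#   space = select_space_for(user_context, spaces)  # hypothetical selector
#   context_docs = rerank(query, space)[:4]
```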

Boundaries of applicability: not every scenario should shrink recall

Saying "over-recall breeds errors" does not mean every system should cut topK to the bone.

For exploratory Q&A, data gathering, and research assistance, providing more material is reasonable, and users happily accept "there are multiple views here." In those scenarios, the system's goal is to help users navigate the information space.

Where the recall boundary genuinely needs strict control is in scenarios where the answer will be executed directly, such as:

  • Policy Q&A;
  • Approval workflows;
  • Official customer-service wording;
  • Ops runbooks;
  • Medical, financial, and compliance decision support.

In these scenarios, the system's most important ability is "do not combine mutually incompatible evidence into an executable instruction." Once a wrong answer costs more than a non-answer, the retrieval strategy can no longer revolve around coverage alone.

Summary

The most addictive thing about RAG is that "recalling a little more" can always make the dashboard numbers look better in the short term.

But once a knowledge system actually goes live, the hardest thing to keep in check is whether the materials entering the context share the same factual boundaries, the same version semantics, and the same permission scope.

Until that question is settled, the more you recall, the more the model resembles someone exceptionally good at writing summaries: it may not deliberately talk nonsense, but it will stitch evidence that should never be combined into an answer that reads very much like a conclusion.

Therefore, the next step of RAG optimization often should not ask "how much more can we recall," but first ask: **which content should never appear together in the same prompt at all.**
