
Swift Concurrency Series 09 | When cancellation semantics break down in Swift Concurrency

The hard part is making sure the cancellation signal actually passes through Task boundaries, bridging layers, and side-effect boundaries, so that stale results are never written back to the page.

After a project migrates its callbacks to async/await, it is common to fall under the illusion that the concurrency problems have been contained.

The function signatures are cleaner, the call chain can be read by following the awaits, and Xcode even shows fewer warnings. But the most annoying production problems often start at exactly this stage: the user has left the page, but the request has not stopped; the search term has changed, but the old results come back; the user manually cancels an upload, but the underlying task keeps running.

Problems like these are easily blamed on “a slow endpoint” or “the main thread refreshing at the wrong time”. But if you take the chain apart, the core issue is usually that the cancellation signal was not propagated all the way along the task tree, through the bridging layer, and up to the side-effect boundary.

My judgment is: **After a Swift Concurrency migration, the most common concurrency bug comes from assuming that “once the parent task is cancelled, everything below it stops by itself.” In reality, a single unmanaged Task, a wrapper that bridges an old API, or a side effect that does not check the cancellation state is enough to break cancellation semantics at that layer. In the end the page appears to jitter occasionally, but what has actually happened is that the state has forked.**

These problems usually surface when state is written back to the UI

The first time I dealt with this problem systematically, the trigger was not a crash but a search page: users kept reporting that “the results jump back by themselves”.

The page logic is not complicated:

  • The user types a keyword;
  • The ViewModel starts a search;
  • When a new keyword arrives, the previous task is cancelled;
  • When the request returns, the list is refreshed.

On the surface, this flow matches the recommended Swift Concurrency pattern exactly. Yet screen recordings from production showed something very odd:

  • The user first searches for swift;
  • Then changes the keyword to swift concurrency;
  • The new results appear on screen first;
  • Half a second later, the old results overwrite the list again.

Simple “out-of-order responses” cannot explain this, because searchTask?.cancel() is clearly present in the code and the cancel call shows up in the logs.

The real problem is this: **the upper-level task was cancelled, but the lower layer did not treat “cancellation” as a state change that must take effect immediately.**

As long as some layer in the system keeps sending old results upward, the UI will accept them as legitimate.

Many cancellations break in the most harmless-looking layer: the bridging code

The most common break point is the async wrapper around an old callback API: it handles “wait for the result to come back” but not “what to do when the result should no longer come back”.

For example, a network request is often wrapped like this:

func loadUser(id: String) async throws -> User {
  try await withCheckedThrowingContinuation { continuation in
    apiClient.loadUser(id: id) { result in
      continuation.resume(with: result)
    }
  }
}

The syntax is fine and the function works. But this code relies on two fatal implicit assumptions:

  1. When the outer task is cancelled, the underlying request stops on its own;
  2. Even if the underlying request does not stop, a late callback will not affect the current state.

Neither assumption tends to hold in real projects.

If apiClient sits on top of a URLSessionDataTask, a third-party SDK, or its own callback storage layer, cancelling the outer Task is not forwarded automatically. The async wrapper above only changes the calling style to await; it does not give the underlying layer any cancellation semantics.

What the bridging layer really has to do is “translate outer cancellation into a cancel action the underlying layer can execute.” Something like this:

func loadUser(id: String) async throws -> User {
  var request: Cancellable?

  return try await withTaskCancellationHandler {
    try await withCheckedThrowingContinuation { continuation in
      request = apiClient.loadUser(id: id) { result in
        continuation.resume(with: result)
      }
    }
  } onCancel: {
    request?.cancel()
  }
}

Only with this code does cancellation start to be genuinely passed down.

But this is still not enough: it addresses “try not to keep running”, not “how to discard late results”. If the underlying SDK's cancel() is best-effort rather than a strong guarantee, the callback may still fire under a race condition. The upper layer therefore still has to check for cancellation before accepting a result.
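The snippet above also shares a mutable `request` variable between the operation and the cancellation handler, and the handler can run before that variable has been assigned. The following is a minimal sketch of one way to close that gap, assuming the underlying API always invokes its completion handler exactly once, even after cancel(). `Cancellable`, `RequestBox`, and `bridged` are hypothetical names used only for illustration; they are not part of any SDK.

```swift
import Foundation

// Hypothetical stand-in for the cancel handle returned by a callback-based SDK.
protocol Cancellable: Sendable { func cancel() }

/// Lock-protected slot for the underlying request handle.
/// Covers the case where cancellation arrives before the handle is stored.
final class RequestBox: @unchecked Sendable {
    private let lock = NSLock()
    private var request: Cancellable?
    private var isCancelled = false

    /// Stores the handle, or cancels it right away if cancellation already happened.
    func set(_ newRequest: Cancellable) {
        lock.lock()
        let alreadyCancelled = isCancelled
        if !alreadyCancelled { request = newRequest }
        lock.unlock()
        if alreadyCancelled { newRequest.cancel() }
    }

    /// Called from the task's cancellation handler, possibly on another thread.
    func cancel() {
        lock.lock()
        isCancelled = true
        let pending = request
        request = nil
        lock.unlock()
        pending?.cancel()
    }
}

/// Bridges a callback API into async/await and forwards task cancellation.
/// `start` is an assumed shape: it begins the work and returns a cancel handle.
func bridged<T: Sendable>(
    _ start: @escaping @Sendable (@escaping @Sendable (Result<T, Error>) -> Void) -> Cancellable
) async throws -> T {
    let box = RequestBox()
    let value: T = try await withTaskCancellationHandler {
        try await withCheckedThrowingContinuation { continuation in
            box.set(start { result in continuation.resume(with: result) })
        }
    } onCancel: {
        box.cancel()
    }
    // A late success from a best-effort cancel() is still rejected here.
    try Task.checkCancellation()
    return value
}
```

The final `Task.checkCancellation()` is the “discard late results” half: even if the SDK delivers a success after cancel(), the caller sees a CancellationError instead of a stale value. If the underlying API might never call back after cancel(), the continuation would need to be resumed from the cancel path instead, which this sketch does not attempt.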

What really corrupts the page is that stale results are still treated as valid

Many teams relax once they see Task.isCancelled, but it only answers “has the current task been marked as cancelled?”, not “should this result still land on the current page?”

In scenarios such as search, autocomplete, and switching between detail pages, what really needs guarding is the ownership of the result.

The following ViewModel pattern is very common:

final class SearchViewModel: ObservableObject {
  @Published private(set) var items: [Item] = []
  private var searchTask: Task<Void, Never>?

  func search(keyword: String) {
    searchTask?.cancel()
    searchTask = Task {
      do {
        let items = try await repository.search(keyword: keyword)
        self.items = items
      } catch {
        self.items = []
      }
    }
  }
}

At first glance the cancel() call looks like all that is needed, but two layers of protection are missing:

  1. After a successful return, confirm that the current task is still valid;
  2. On the failure path, cancellation must not be handled as an ordinary error.

A more robust version looks like this:

final class SearchViewModel: ObservableObject {
  @MainActor @Published private(set) var items: [Item] = []
  private var searchTask: Task<Void, Never>?

  func search(keyword: String) {
    searchTask?.cancel()

    searchTask = Task { [weak self] in
      guard let self else { return }

      do {
        let items = try await repository.search(keyword: keyword)
        try Task.checkCancellation()
        await MainActor.run {
          self.items = items
        }
      } catch is CancellationError {
        // Cancellation is not a failure: do not clear the UI or show an error
      } catch {
        await MainActor.run {
          self.items = []
        }
      }
    }
  }
}

What really matters here is the attitude behind the code: **cancellation is normal control flow, not an exceptional accident.**

Many pages jitter because the code turns “the user has left” into “the request failed, so clear the UI.” As a result, before the new task has rendered, the error branch of the old task resets the page to an empty state, which looks like random flickering.
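The idea of guarding result ownership, rather than only the cancellation flag, can be sketched with a generation counter. This is a minimal illustration under assumptions: `SearchModel`, its `fetch` closure, and the `generation` counter are hypothetical names, and the model is isolated to the main actor so the counter needs no extra locking.

```swift
@MainActor
final class SearchModel {
    private(set) var items: [String] = []
    private var generation = 0            // identifies the request that owns `items`
    private var searchTask: Task<Void, Never>?
    private let fetch: @Sendable (String) async throws -> [String]

    init(fetch: @escaping @Sendable (String) async throws -> [String]) {
        self.fetch = fetch
    }

    @discardableResult
    func search(keyword: String) -> Task<Void, Never> {
        searchTask?.cancel()
        generation += 1
        let token = generation            // ownership token for this request

        let task = Task<Void, Never> { [weak self, fetch] in
            do {
                let result = try await fetch(keyword)
                // Guard ownership: a late result from an older request
                // must never overwrite a newer one.
                guard let self, token == self.generation else { return }
                self.items = result
            } catch is CancellationError {
                // Cancellation is control flow: leave the UI untouched.
            } catch {
                guard let self, token == self.generation else { return }
                self.items = []
            }
        }
        searchTask = task
        return task
    }
}
```

The token check still works when a lower layer ignores cancellation and returns a normal success late, a case that a cancellation check at the call site can only catch if the task itself was cancelled.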

A subtler problem: the task tree has long been broken, yet everyone believes they are still inside structured concurrency

One benefit of Swift Concurrency is that structured concurrency makes the lifetime relationship between parent and child tasks much clearer. But what most easily loses that relationship in a project is the Task {} that people casually drop in to “save trouble.”

For example, when a list page needs to load details, recommendations, and highlights on entry, a lot of code ends up written like this:

func refresh() async {
  Task {
    async let detail = repository.loadDetail()
    async let recommendation = repository.loadRecommendation()
    let result = try await (detail, recommendation)
    render(result)
  }
}

It looks like async/await, but the key problem is that refresh() and the Task {} inside it no longer have a structured parent-child relationship.

That is to say:

  • refresh() returns to its caller immediately;
  • Even if the page is destroyed,
  • Even if the outer task is cancelled,

the Task created at this level can keep running.

This is why many pages keep issuing requests after they have been dismissed: the code has actively opted out of structured concurrency.

If the goal is simply to fetch results in parallel, writing it directly in the current async context is enough:

func refresh() async throws -> ScreenData {
  async let detail = repository.loadDetail()
  async let recommendation = repository.loadRecommendation()
  return try await ScreenData(
    detail: detail,
    recommendation: recommendation
  )
}

This way, cancellation semantics follow the call chain: whoever starts the work owns it, and when the owner is cancelled, everything under it stops together.
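The difference between the two shapes can be observed directly. The sketch below uses hypothetical function names and a short sleep in place of real work; each function reports whether the work it started ever saw the caller's cancellation.

```swift
/// Sleeps briefly, then reports whether the task running it was cancelled.
func sleepThenReportCancellation() async -> Bool {
    try? await Task.sleep(nanoseconds: 50_000_000) // returns early if cancelled
    return Task.isCancelled
}

/// Structured: `async let` creates a child task of the caller,
/// so cancelling the caller also cancels the child.
func structuredRefresh() async -> Bool {
    async let observed = sleepThenReportCancellation()
    return await observed
}

/// Unstructured: `Task {}` starts a new top-level task that does not
/// inherit cancellation from the task that created it.
func unstructuredRefresh() async -> Bool {
    let inner = Task { await sleepThenReportCancellation() }
    return await inner.value
}
```

If an unstructured Task is genuinely required, one option is to keep its handle and cancel it explicitly, for example from a cancellation handler or when the owning screen goes away.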

If the side-effect boundary does not check for cancellation, the hardest-to-explain dirty state appears

A request that is not stopped only wastes resources. A side effect that is not stopped writes dirty state.

I later investigated a hard-to-reproduce class of problem: after a user switched accounts quickly, data from the previous account occasionally appeared in the cache. The root cause turned out to be that cancellation semantics stopped at “fetching the data” and never extended to “committing the side effects”.

Code like this is dangerous:

func refreshProfile() async throws {
  let profile = try await repository.fetchProfile()
  cache.save(profile)
  analytics.trackProfileLoaded(profile.id)
  state = .loaded(profile)
}

If the task has already been cancelled by the time fetchProfile() returns and there is no cancellation check, the cache write, the analytics event, and the state update all still happen.

On the UI this may show up as nothing more than an occasional flicker, but inside the system dirty data has already been written to disk, and the cost of troubleshooting rises sharply.

A more prudent approach is usually to add an explicit check just before the side-effect boundary:

func refreshProfile() async throws {
  let profile = try await repository.fetchProfile()
  try Task.checkCancellation()

  cache.save(profile)
  analytics.trackProfileLoaded(profile.id)
  state = .loaded(profile)
}

This step may look mechanical, but it solves a very real problem: **cancellation should cancel not only the “waiting” but also the “commit”.**

What really needs protecting is often the next few actions, the ones that would overwrite current state with results from the old world.

The most common mistake on the failure path is handling all errors uniformly

Many concurrency migrations leave a long tail of issues because teams like to fold error handling into one uniform template:

do {
  let data = try await service.load()
  state = .loaded(data)
} catch {
  state = .error(error)
}

This is fine for ordinary failures, but in scenarios such as rapid page switching, search suggestions, input debouncing, and upload cancellation, a CancellationError is not the same thing as a real business failure.

Mixing the two has at least three consequences:

  • The user deliberately left the page, but it is recorded as a failure;
  • The error rate in analytics is artificially inflated, which misleads stability assessments;
  • Because the UI handles errors uniformly, toasts, empty states, or retry buttons appear when they should not.

Once cancellation has been presented as a failure anywhere in the project, a string of odd, seemingly unrelated reports follows:

  • The list is repeatedly cleared while searching;
  • An error occasionally appears after a pull-to-refresh completes;
  • A load-failure state flashes when navigating back to the previous page.

These symptoms look scattered, but they share one root cause: **control-flow cancellation is mistaken for a business exception.**
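One way to keep the two apart is to route every thrown error through a single decision point. The sketch below is illustrative only: `LoadState` and `state(after:current:)` are hypothetical names, and it only recognises `CancellationError`. An API that reports cancellation through its own error type would need an additional case.

```swift
enum LoadState: Equatable {
    case idle
    case loaded([String])
    case failed(String)
}

/// Decides the next UI state after a load attempt throws.
/// Cancellation keeps the current state; only real failures become `.failed`.
func state(after error: Error, current: LoadState) -> LoadState {
    if error is CancellationError {
        return current   // control flow, not a business failure
    }
    return .failed(String(describing: error))
}
```

With this in place, the unified catch block can assign `state = state(after: error, current: state)`, so that a cancelled load neither clears the list nor shows an error.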

Where this applies: not every async function needs to be stuffed with cancellation checks

Cancellation semantics matter, but that does not mean every layer should mechanically call Task.checkCancellation().

There are three places I now pay the most attention to:

  1. The entry point that bridges an old API: it is responsible for translating outer cancellation into something the underlying layer can act on;
  2. Phase boundaries in long-running chains: for example, after the network call finishes, before decoding, and before writing to the cache;
  3. Just before side effects are committed: anywhere that changes state, writes a cache, posts a notification, or writes to the database deserves one more check.

Conversely, if a function is pure computation with no suspension points and no side effects, inserting a cancellation check adds little. What cancellation really has to achieve is “stop writing results from the old world.”
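The second and third positions can be sketched as a small pipeline with a check at each boundary. `Profile`, `ProfileStore`, and the `fetch` closure are assumptions made for this example; the fetch is deliberately allowed to ignore cancellation so that the checks are what stop the commit.

```swift
import Foundation

struct Profile: Decodable, Equatable { let id: String }

/// Minimal in-memory store standing in for a cache or database (hypothetical).
actor ProfileStore {
    private(set) var saved: [Profile] = []
    func save(_ profile: Profile) { saved.append(profile) }
}

/// Fetch, decode, commit, with a cancellation check at each phase boundary.
/// `fetch` is an assumed async data source that may ignore cancellation.
func refreshProfile(
    fetch: @Sendable () async throws -> Data,
    store: ProfileStore
) async throws {
    let data = try await fetch()
    try Task.checkCancellation()   // phase boundary: network finished, decode next

    let profile = try JSONDecoder().decode(Profile.self, from: data)
    try Task.checkCancellation()   // side-effect boundary: just before the commit

    await store.save(profile)
}
```

The pure decoding step in the middle carries no check of its own, which matches the point above: checks belong at the boundaries, not inside side-effect-free computation.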

Summary

The illusion Swift Concurrency creates most easily is that once the code has moved from callbacks to await, the system has automatically entered a more reliable era of concurrency.

But a real project does not gain cancellation semantics just because the syntax is new.

Can the parent task still control its child tasks? Does the bridging layer pass cancel downward? Are stale results blocked before side effects are committed? If any one of these three is missed, what appears on the page is a forked state system.

So the real question to ask in this kind of issue is: at which layer did cancellation stop? Until that question has a clear answer, the newer the syntax, the easier it is to believe the concurrency has been written correctly.
