Rust/Wasm runtime reliability requires handling both panic and abort recovery

Once a shared Wasm instance starts serving calls over a long lifetime, a crash stops being a single failed operation and becomes a problem of state recovery and fault isolation.

It is easy to treat Wasm as a porting layer at first: the code compiles, the page runs, performance is acceptable, and everything seems about the same as before. The real difficulty usually starts after the demo passes. Once modules such as editors, renderers, and document parsers move from single-page experiments to long-lived resident runtimes, the fault model changes immediately.

At that point, panic and abort are no longer just exception branches at the language level. They decide whether the instance can keep accepting work, whether its in-memory state is contaminated, whether the host layer should discard the instance immediately, and whether the instance pool needs to be refilled. When a mobile team moves a kernel that has long run inside native containers to the Web, this is the change that is most easily underestimated.

After the demo passes, the fault model is only getting started.

A crash in a single call is easy to reason about. A button click triggers a Wasm call; if the call fails, that operation reports an error. The user refreshes the page and tries again, and the cost stays contained.

The trouble starts once the runtime reuses instances. When the same Wasm instance opens several documents in a row, receives many rounds of input events, and serves many JS bridge calls, the impact of a panic or abort no longer stops at the current action. A failure that is not fully cleaned up can drag down the requests that follow.

These risks rarely surface on the first day. Early on you usually see only scattered error reports: an occasional rendering failure, an export that hangs, a document left in the wrong state after being closed and reopened. Dig further and the clues converge on one pattern: the failure happened in a single call chain, but the damage stayed behind in the shared instance.

At this point the question is no longer “will the Rust code panic?” but “is this runtime still fit to serve the next call after a panic?”

A panic can be caught; an abort can only be handled by replacing the instance.

The most important distinction to draw in Rust/Wasm is between the failure semantics of panic and those of abort.

A panic still has the chance to unwind back to an established boundary, provided the build is configured to unwind rather than abort on panic. As long as the binding layer and the host layer have agreed on a recovery method in advance, the current call can fail while the rest of the instance’s state remains usable. An abort offers no such path. It means execution has reached an unrecoverable state, and continuing to route requests to the same instance is a bet that memory and resources were not left damaged partway through.

Once the runtime treats the two as the same thing, downstream handling goes wrong:

  • If an abort is swallowed as an ordinary exception, the instance pool keeps reusing instances that can no longer be trusted.
  • If every panic is treated as a reason to destroy the instance, throughput drops for no benefit.
  • If the JS host only learns that “the call failed”, it cannot tell whether to retry, discard the instance, or end the current session.

This is the most practical point about Wasm runtime reliability: recovery semantics have to be defined first, before isolation and scheduling can be built on top of them.
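
The distinction above can be made explicit at the host boundary. The following is a minimal sketch, not a definitive implementation. It relies on one fact and one assumption: a Wasm trap reaches JavaScript as a `WebAssembly.RuntimeError`, and the binding layer is assumed to tag panics it has already caught with a hypothetical `recoverable: true` property. The names `Action` and `classifyFailure` are illustrative only.

```javascript
// Hypothetical recovery actions the host can take after a failed call.
const Action = Object.freeze({
  FAIL_CALL: "fail-call",               // only this call fails; instance stays pooled
  REPLACE_INSTANCE: "replace-instance", // instance is discarded and rebuilt
});

// Assumption: the binding layer marks caught panics with `recoverable: true`.
// Nothing in Rust or Wasm sets this flag automatically.
function classifyFailure(error) {
  // A Wasm trap (for example `unreachable` after an abort) surfaces in JS
  // as WebAssembly.RuntimeError, so the instance can no longer be trusted.
  if (error instanceof WebAssembly.RuntimeError) {
    return { kind: "abort", action: Action.REPLACE_INSTANCE };
  }
  if (error && error.recoverable === true) {
    return { kind: "panic", action: Action.FAIL_CALL };
  }
  // Unknown failures are handled conservatively: do not reuse the instance.
  return { kind: "unknown", action: Action.REPLACE_INSTANCE };
}
```

Treating unknown failures like aborts trades some throughput for safety; a host that can prove more about its binding layer may choose a narrower rule.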

If the binding layer does not provide recovery semantics, the host layer will keep accepting work on top of bad state.

The most dangerous spot for this kind of problem is not the business code but the binding layer, which looks as if it has “already been taken care of”. The host layer often sees only a thrown error object and records it as an ordinary call failure. The log entry exists and the page does not crash right away, but the bad state may still be sitting inside the instance.

What needs fixing is not just the try/catch but what happens after a failure. Reliability design starts with logic along these lines:

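// `pool` and `isAbort` are assumed to be defined by the host; neither is a built-in API.
async function runWithRecovery(instance, input) {
  try {
    return await instance.exports.handle(input)
  } catch (error) {
    // An abort means the instance state can no longer be trusted,
    // so swap it out of the pool before reporting the failure.
    if (isAbort(error)) {
      pool.replace(instance.id)
    }
    // Rethrow so the caller still sees that this call failed.
    throw error
  }
}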

The point of this code is not the syntax but one simple decision: has this failure marked the instance as untrusted? If so, recovery should not stop at throwing the error. It should continue through evicting the instance, rebuilding its resources, and cutting off the request flow bound to it.

Until this layer is clearly defined, the system appears to be handling errors while it is actually putting a potentially corrupted runtime back on the production path.
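
One way to keep an untrusted instance off the production path is to make the "untrusted" mark part of the instance wrapper itself. The sketch below assumes that any trap poisons the instance; the `GuardedInstance` class and its `poisoned` flag are hypothetical names, and the export names in the usage note are placeholders.

```javascript
// Wraps a set of Wasm exports so that, once marked untrusted,
// the instance refuses all further work instead of silently serving it.
class GuardedInstance {
  constructor(id, exports) {
    this.id = id;
    this.exports = exports;
    this.poisoned = false;
  }

  call(name, ...args) {
    if (this.poisoned) {
      // Cut the request flow: the caller must obtain a fresh instance.
      throw new Error(`instance ${this.id} is poisoned and must be replaced`);
    }
    try {
      return this.exports[name](...args);
    } catch (error) {
      // Assumption: any Wasm trap means the instance state is untrusted.
      if (error instanceof WebAssembly.RuntimeError) {
        this.poisoned = true;
      }
      throw error; // the current call still fails either way
    }
  }
}
```

With this wrapper, a call such as `guarded.call("handle", input)` fails loudly on a poisoned instance, so a forgotten cleanup path shows up as an explicit error rather than as corrupted output several calls later.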

Shared instances will amplify the recovery problem into a pooling strategy problem

Once Wasm ships in a real product, “one instance until the page closes” is rare. Instance pools, worker pools, or foreground documents and background tasks sharing one set of runtime resources are more common. At that stage, the recovery cost of panic and abort directly reshapes the pooling strategy.

If instance initialization is expensive, the system naturally leans toward reusing instances as much as possible. But once reuse is in place, fault isolation has to be upgraded along with it:

  • Which state may live only within a single call and is discarded with that call when it fails?
  • Which caches may be kept across calls, and which must be fully invalidated once an abort occurs?
  • After an instance is replaced, how are queued tasks migrated, and will retrying them apply side effects twice?

The language layer does not answer these questions automatically. They are runtime design decisions.
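
To make the three questions concrete, here is a deliberately small sketch of one possible set of answers. Everything in it is an assumption for illustration: a single-slot `InstancePool`, a hypothetical `createInstance` factory, tasks shaped as `{ id, retrySafe, fn }`, and the rule that non-trap errors are panics the binding layer has already contained.

```javascript
class InstancePool {
  constructor(createInstance) {
    this.createInstance = createInstance; // hypothetical factory, may be costly
    this.instance = createInstance();
    this.cache = new Map();     // cross-call cache, invalidated on abort
    this.completed = new Set(); // ids of tasks that already finished
  }

  run(task) {
    // A task that already completed is skipped, so a retry cannot repeat it.
    if (this.completed.has(task.id)) return { status: "skipped" };
    const scratch = {}; // per-call state, dropped when this call returns or fails
    try {
      const value = task.fn(this.instance, scratch, this.cache);
      this.completed.add(task.id);
      return { status: "ok", value };
    } catch (error) {
      if (error instanceof WebAssembly.RuntimeError) {
        // Trap: the instance and every cache derived from it are untrusted.
        this.cache.clear();
        this.instance = this.createInstance();
        return { status: "instance-replaced", error };
      }
      // Assumed to be a contained panic: the instance stays in service.
      return { status: "failed", error };
    }
  }

  // Queued tasks continue on whichever instance is current. A task is retried
  // once on the fresh instance only if it declares itself retry-safe.
  drain(queue) {
    return queue.map((task) => {
      const result = this.run(task);
      if (result.status === "instance-replaced" && task.retrySafe) {
        return this.run(task);
      }
      return result;
    });
  }
}
```

The `completed` set only prevents a finished task from running twice. It does not undo side effects from a task that aborted halfway, which is why the retry is opt-in through `retrySafe` rather than automatic.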

For this reason, a discussion of Rust/Wasm reliability that stops at “can a panic be caught?” underestimates the problem. What really widens the gap in maintenance cost is whether the instance pool can keep a clear trust boundary after a failure.

Not every Wasm project needs this recovery design.

If the module is a one-off offline tool, or the whole instance is reclaimed when the page is destroyed, the difference between panic and abort still exists, but the payoff from recovery is much smaller. Refreshing the page and rerunning the task is often enough.

Once a system has the following characteristics, recovery semantics quickly shift from an “optimization item” to an “infrastructure item”:

  • The instance is long-lived and is not destroyed along with a single page life cycle
  • The same instance serves many rounds of calls in succession
  • The host layer relies on pooling to save startup time and gain throughput
  • Session state, cache state, and queued tasks must be protected after a failure

Mobile teams moving native capabilities to the Web are the most likely to run into this boundary. The isolation that the app process provided by default often has to be rebuilt at the JS/Wasm host boundary.

Wasm makes it easier for native code to reach the browser, but it does not bring runtime recovery semantics with it. As soon as a system starts to share instances, reuse state, and serve long-running traffic, panic and abort must be treated as two different runtime events. The former is about how to end the current call; the latter is about whether the instance may stay in the pool. If that decision is not made up front, then the more successful the port, the harder the later failures are to handle.