Asynchronous startup optimization and intermittent initialization failures
It’s usually not worth trading a 200ms gain for unreproducible race conditions and the cost of troubleshooting them.
The first-screen metric dropped, but one of the most annoying kinds of glitch started appearing in production: it showed up occasionally, was hard to reproduce, and seemed almost supernatural.
The crash stacks were unstable, the logs all looked “normal”, and the problem occasionally healed itself. Looking back at the change history, everyone had been doing the same thing: splitting up startup-phase initialization, delaying it, making it asynchronous, and making it concurrent so that cold start would be faster.
The problem is not that “the slowness is gone”; it is that “the dependencies are gone”, or more precisely, that the dependencies are hidden.
In this article, I want to explain the most critical judgment from a real investigation: the pitfall of startup optimization is that it often drops the first business interaction into a semi-initialized state. **The 200ms you save may end up being spent on occasional crashes, wrong states, fallbacks that mask each other, and the team’s troubleshooting time.**
Problem background: The first screen is faster, but the first click occasionally crashes
The fault description is very typical:
- Android cold start is faster, and the first screen white screen time is reduced
- A small proportion of online users occasionally experience crashes or errors on the “first click after the first screen”
- The crash stack is sometimes in the business module, sometimes in the network layer, and sometimes in the SDK.
- It is almost impossible to reproduce locally or in test environments, and reproduction during gray release is unstable too
This type of problem is most easily misdiagnosed as “differences in the production environment”, “device compatibility”, or “a third-party SDK acting up”. But when it correlates strongly with a startup-optimization change, I first treat it as something simpler: **race conditions.**
Core judgment: Asynchronization is not just an optimization technique; it changes the system’s readiness semantics.
Many intuitions for startup optimization are:
- Move heavy IO work to a background thread
- Run CPU-heavy work in parallel
- Defer initialization that the first screen does not need until after the first screen
These are almost always “valid” on metrics.
But they also do something more dangerous: **they erase the dependencies that were implicit in sequential execution.**
Previously, Application#onCreate() initialized everything sequentially: A -> B -> C. Even if nobody wrote it down, the whole system relied on one fact:
- When onCreate() ends, A, B, and C have all run at least once
Later they were broken down into:
- A executes immediately
- B is handed to an asynchronous task
- C is handed to another asynchronous task
At this time, the end of onCreate() no longer means “the system is ready”, it only means “I threw the task out”.
In production, the first click often arrives at an unexpected time: the first screen finishes rendering and the user clicks immediately, or some automatic behavior triggers navigation.
So the first business interaction fell into an awkward range:
- Some dependencies have been initialized
- Some are still running
- Some failed, but the failure was silently swallowed
- Some have not started yet because they are delayed
That is not “slow”; that is an incomplete state.
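To make the race window concrete, here is a minimal, hedged sketch (plain threads, hypothetical names) of why returning from onCreate() no longer implies readiness once one dependency is made asynchronous:

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

// Sketch: B's initialization is moved off the main thread, so returning
// from onCreate() only means "the task was submitted", not "B is ready".
class App {
    val ioGate = CountDownLatch(1)           // stands in for slow config IO
    private val bReady = AtomicBoolean(false)
    private val bDone = CountDownLatch(1)

    fun onCreate() {
        // A: still synchronous (omitted)
        // B: fired off asynchronously; onCreate() returns before it finishes
        Thread {
            ioGate.await()                   // "IO" completes whenever it completes
            bReady.set(true)
            bDone.countDown()
        }.start()
    }

    // A first click that lands before B finishes is inside the race window.
    fun firstClick(): String = if (bReady.get()) "ok" else "semi-initialized"

    fun awaitB() = bDone.await()
}
```

The point of the sketch: the same `firstClick()` call returns different results depending only on timing, which is exactly the “occasional” behavior seen online.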
Walkthrough: how the problem converges, step by step, to “semi-initialization”
For occasional problems like this, I do not start with the crash stack. I do three things first to turn “unreproducible” into “explainable”.
1) Draw the startup dependency diagram first, not the module diagram
The module diagram answers “who depends on whom”; the startup question is:
- What initialization must be completed before the first interaction
- Which initialization failures will affect business semantics
- Which initialization is just the icing on the cake
I will divide the startup dependencies into three categories according to the boundary of “first interaction”:
- Hard Ready: if it is not ready, the critical path must not be entered, e.g. login state, auth token, routing table, the core threading model (main-thread/coroutine-dispatcher constraints), and the minimal crash-reporting set.
- Soft Ready: business can proceed while it is not ready, but must degrade in a controlled way, e.g. recommendation caches, AB experiments, and supplementary analytics fields.
- Deferred: can be done later without affecting first-interaction semantics, e.g. warm-ups, image-decoder initialization, and non-critical SDKs.
The value of this step is to change the argument from “async or not async” to “by what boundary must this dependency be complete”.
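As a sketch of this classification (dependency names here are hypothetical, not from any real codebase), the three categories can be made executable so the boundary is checkable rather than tribal knowledge:

```kotlin
enum class Readiness { HARD_READY, SOFT_READY, DEFERRED }

// Hypothetical dependency names; the classification is the point, not the names.
val startupDeps = mapOf(
    "authToken"      to Readiness.HARD_READY,  // first click may hit a gated page
    "routerTable"    to Readiness.HARD_READY,
    "abExperiment"   to Readiness.SOFT_READY,  // degrade to control if missing
    "recommendCache" to Readiness.SOFT_READY,
    "imageDecoder"   to Readiness.DEFERRED     // pure warm-up
)

// Everything HARD_READY must complete before the first interaction.
fun mustCompleteBeforeFirstInteraction(): Set<String> =
    startupDeps.filterValues { it == Readiness.HARD_READY }.keys
```

Keeping this table in code (or config) means a review of any “move X to async” change can mechanically answer whether X crosses the first-interaction boundary.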
2) Give each dependency a “readiness contract”; otherwise asynchronization is gambling
The so-called readiness contract is to clarify two things:
- Who will judge whether it is ready
- How to proceed when business is not ready
Asynchronization without a readiness contract commonly shows up as:
- The caller assumes initialization has completed and uses the dependency directly
- The initializer assumed the caller would never use it that early
- Both sides are “right”; the production bug lives in the timing
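One way to make such a contract concrete is a small interface that forces both questions to be answered in code. This is a hedged sketch; `AbExperiment` and its fallback are illustrative, matching the Soft Ready example above:

```kotlin
// Hypothetical contract: who decides readiness, and what the caller gets
// when the dependency is not ready -- an explicit answer, never a silent guess.
interface ReadinessContract<T> {
    fun isReady(): Boolean
    fun whenNotReady(): T
}

class AbExperiment : ReadinessContract<String> {
    @Volatile var assignment: String? = null   // set by async init when it lands

    override fun isReady() = assignment != null
    override fun whenNotReady() = "control"    // documented fallback, not an accident

    fun current(): String = assignment ?: whenNotReady()
}
```

The value is not the interface itself but the review question it forces: for every asynchronized dependency, someone must write down `whenNotReady()`.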
One of the most typical crashes I have seen comes from changing a singleton’s initialization to lazy + async.
The pseudocode looks like this:
```kotlin
object Foo {
    @Volatile private var inited = false

    fun initAsync() {
        GlobalScope.launch(Dispatchers.IO) {
            // read config / decrypt / fetch remote
            inited = true
        }
    }

    fun doWork() {
        check(inited) { "Foo not initialized" }
        // ...
    }
}
```
The first-screen metric improves, but as soon as a doWork() call lands before initialization finishes, the failure becomes “occasional”.
What is worse, much real code does not even check(inited); it keeps running, produces a wrong state, and does not blow up until much later.
3) Measure the race “window” instead of relying on gut feeling
The necessary condition for asynchronization to cause problems is:
- The first interaction occurs before some initialization has completed
So I add two kinds of log points (important: they must be alignable timestamps):
- t0: process start / Application.onCreate start
- t1: the first screen is interactive (truly clickable)
- t_ready(X): the moment each key dependency becomes ready
Then look at the distribution:
- What proportion of sessions has t1 < t_ready(Auth)
- What proportion has t1 < t_ready(Router)
- Whether these correlate with device model, network, cold vs. hot start, and OS version
Once this window can be quantified, many “occasional” issues suddenly stop being mysterious: they are just probability events.
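Once t1 and t_ready(X) are logged, the window can be computed offline. A minimal sketch, assuming hypothetical per-session timestamps in milliseconds since process start:

```kotlin
// One startup session: when the first screen became interactive,
// and when each named dependency became ready (absent = never ready).
data class StartupTrace(val t1Interactive: Long, val tReady: Map<String, Long>)

// Fraction of sessions where the first interaction beat dependency `dep`:
// this is exactly the race window the text describes.
fun raceWindowRatio(traces: List<StartupTrace>, dep: String): Double {
    val racy = traces.count { t -> (t.tReady[dep] ?: Long.MAX_VALUE) > t.t1Interactive }
    return racy.toDouble() / traces.size
}
```

Slicing this ratio by device model, network, and cold/hot start is then ordinary log analysis, not guesswork.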
Misunderstandings and failure cases: the more fallbacks you pile on, the harder the resulting problems are to troubleshoot
After introducing asynchronization, the team naturally becomes cautious:
- If a dependency is not ready, use a default value
- If the configuration has not been fetched, use the last cached copy
- If the AB assignment has not arrived, fall back to the control group
Each of these makes sense on its own, but they have two side effects.
Misunderstanding 1: Turning “a missing dependency” into “semantic drift”
A crash is actually easy to troubleshoot; a wrong state is the hardest thing to troubleshoot.
For example, when the login state is not ready, it gets treated as “not logged in”. The first click then navigates the user to the wrong page. Later, when the real login state becomes ready, the page state is reset again, producing “flashes”, “bounce-backs”, and “occasionally being logged out”.
In the logs you will see a pile of “normal” branches: every one of them is covered by the design. But the user experience is bad, and it is hard to associate any of it with the startup optimization.
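One controllable alternative is to model “not yet known” explicitly instead of silently defaulting the missing login state to “logged out”. A minimal sketch (page names are hypothetical):

```kotlin
// Three states instead of two: "Unknown" is a real, representable state,
// so the first click cannot silently misread "init not finished" as "logged out".
sealed class LoginState {
    object Unknown : LoginState()            // auth initialization not finished yet
    object LoggedOut : LoginState()
    data class LoggedIn(val userId: String) : LoginState()
}

// Routing the first click; Unknown holds navigation instead of guessing.
fun routeFirstClick(state: LoginState): String = when (state) {
    LoginState.Unknown     -> "wait"         // show loading / queue the action
    LoginState.LoggedOut   -> "loginPage"
    is LoginState.LoggedIn -> "homePage"
}
```

This is the semantic-drift fix in miniature: the degraded path is a distinct, observable state, not a wrong answer.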
Misunderstanding 2: Fallbacks that silently mask each other break the evidence chain
Dependency A is not ready, so the code takes a fallback path.
Meanwhile, dependency B is not ready and has workarounds of its own.
In the end the business misbehaves in a way that looks like B’s problem, but the root cause is A.
What is even more realistic: to “avoid crashing”, exceptions are swallowed and failures are logged at debug level, leaving only a “wrong result” in production.
This is one of the sources of “irreproducibility”: the key failure signals have been erased.
How to fix it: Turn “asynchronization” into a “verifiable readiness boundary”
Solving this class of problem usually means tightening the system’s startup semantics again.
I do three steps, from lowest cost to highest.
1) Define an executable Ready Gate
Give all Hard Ready dependencies one unified gate:
- The gate must be passed before the first interaction
- If it cannot be passed, block the key operation or provide an explicit degradation path
For example, add a small check at the first click entry (navigation/routing/key button):
- Continue when ready
- Display loading if not ready, or queue first
The key point of this step is to turn “dependency not ready” from an implicit race into an explicit state.
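A minimal sketch of such a gate (names are illustrative; a production version would also need timeouts, failure paths, and tighter synchronization between opening and queuing):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Ready gate at the first-click entry: an action either runs or is
// explicitly queued -- "not ready" becomes a visible state, not a race.
class ReadyGate {
    @Volatile private var ready = false
    private val pending = ConcurrentLinkedQueue<() -> Unit>()

    fun navigate(action: () -> Unit): String =
        if (ready) { action(); "navigated" }
        else { pending.add(action); "queued" }

    // Called once when all Hard Ready dependencies have completed.
    fun open() {
        ready = true
        while (true) {
            val action = pending.poll() ?: break
            action()
        }
    }
}
```

In a real app the “queued” branch is where you show a brief loading state instead of letting the click fall into a half-initialized code path.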
2) Make initialization a “stateful task” instead of fire-and-forget
Many initializations are fired off with GlobalScope.launch or a thread pool; if they fail, they just fail.
A more controllable approach:
- Each initialization has a state: NotStarted / Running / Ready / Failed
- The caller gets a handle it can await (even if it never does)
Pseudo code:
```kotlin
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

sealed class InitState {
    data object NotStarted : InitState()
    data object Running : InitState()
    data object Ready : InitState()
    data class Failed(val error: Throwable) : InitState()
}

class Initializer(
    // The original pseudocode left `scope` undefined; an app-scoped scope works.
    private val scope: CoroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
) {
    @Volatile private var state: InitState = InitState.NotStarted
    private val deferred = CompletableDeferred<Unit>()

    fun start() {
        if (state != InitState.NotStarted) return
        state = InitState.Running
        scope.launch(Dispatchers.IO) {
            runCatching {
                // do init
            }.onSuccess {
                state = InitState.Ready
                deferred.complete(Unit)
            }.onFailure {
                state = InitState.Failed(it)
                deferred.completeExceptionally(it)
            }
        }
    }

    suspend fun awaitReady() = deferred.await()
}
```
This makes two things true:
- You can choose where to await
- There are no more silent assumptions that initialization “probably finished”
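For contexts without kotlinx.coroutines, the same stateful-task idea can be sketched with java.util.concurrent.CompletableFuture as the awaitable handle; all names here are illustrative:

```kotlin
import java.util.concurrent.CompletableFuture

// Coroutine-free variant: explicit state plus an awaitable handle that
// propagates failure instead of swallowing it.
sealed class State {
    object NotStarted : State()
    object Running : State()
    object Ready : State()
    class Failed(val error: Throwable) : State()
}

class FutureInitializer(private val init: () -> Unit) {
    @Volatile var state: State = State.NotStarted
        private set
    val done = CompletableFuture<Unit>()     // await with done.get() or compose on it

    fun start() {
        if (state != State.NotStarted) return
        state = State.Running
        Thread {
            try {
                init()
                state = State.Ready
                done.complete(Unit)
            } catch (e: Throwable) {
                state = State.Failed(e)
                done.completeExceptionally(e)  // failure stays visible to callers
            }
        }.start()
    }
}
```

The design point is the same in both variants: failure becomes data (`Failed` plus a failed handle) instead of a log line nobody reads.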
3) Set boundaries and rollback switches for delayed initialization
Lazy initialization is not forbidden, but it needs boundary conditions:
- Which users/scenarios may be delayed (e.g. only on cold start, or on hot start too)
- What happens on failure (retry, disable, roll back)
- How to observe it during gray release (ready-window distribution, failure rate, degradation ratio)
I would rather make “asynchronous startup” a policy switch that can be rolled back than a one-off code change.
Because once an occasional problem shows up in production, the fastest way to stop the bleeding is usually to roll the asynchronization back.
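A minimal sketch of such a switch (the flag names and config source are hypothetical), defaulting to the safe synchronous path whenever the flag is missing:

```kotlin
// Remote-config-style switch: async startup is a policy, not a one-way change.
class StartupPolicy(private val flags: Map<String, Boolean>) {
    // Missing flag -> safe default: keep the dependency on the blocking path.
    fun asyncInitEnabled(dep: String): Boolean =
        flags["async_init_$dep"] ?: false
}

fun initMode(policy: StartupPolicy, dep: String): String =
    if (policy.asyncInitEnabled(dep)) "background" else "blocking"
```

With this shape, “roll back the asynchronization” is a config flip rather than an emergency release.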
Applicability boundaries: When does asynchronization pay off, and when does it cost you?
Asynchronization pays off when:
- Dependency is Soft Ready or Deferred
- The readiness contract is clear, and there is an evidence chain for failure
- The ready window is small and stable, and does not span the first interaction
Typical scenarios where asynchronization is a loss:
- The dependency is Hard Ready, but it was moved anyway for the sake of the metric
- Failures are papered over with fallbacks, leading to semantic drift
- There is no ready gate, so the race condition becomes a probability event
To summarize in one sentence: **if you can explain when each dependency is not ready, and the business keeps consistent semantics while it is not ready, then asynchronization counts as optimization.**
Summary
Cold-start optimization is easily driven by KPIs into a single-objective problem: make the first screen faster.
But what really needs to be kept in mind during the startup phase is “when will the system be considered available?” The more the initialization is broken down into pieces, the more clearly the readiness semantics need to be written into the code and into the observations.
Otherwise, what we are really doing is trading deterministic slowness for probabilistic errors.