Asynchronous startup optimization and intermittent initialization failures
It’s usually not worth trading a 200ms gain for unreproducible race conditions and the cost of troubleshooting them.
The first-screen metric dropped, but one of the most annoying kinds of glitch started appearing in production: it showed up occasionally, was hard to reproduce, and seemed almost supernatural.
The crash stacks were unstable, the logs all looked “normal”, and the problem occasionally healed itself. Looking back at the change history, everyone had been doing the same thing: splitting up startup-phase initialization, delaying it, making it asynchronous, and making it concurrent so that cold start would be faster.
The problem is not that “the slowness is gone”; it is that “the dependencies are gone”, or more precisely, that the dependencies are hidden.
In this article, I want to explain the most critical judgment from a real investigation: the pitfall of startup optimization is that it often drops the first business interaction into a semi-initialized state. **The 200ms you save may end up being spent on occasional crashes, wrong states, fallbacks that mask each other, and the team’s troubleshooting time.**
Problem background: The first screen is faster, but the first click occasionally crashes
The fault description is very typical:
- Android cold start is faster, and the first screen white screen time is reduced
- A small proportion of online users occasionally experience crashes or errors on the “first click after the first screen”
- The crash stack is sometimes in the business module, sometimes in the network layer, and sometimes in the SDK.
- It is almost impossible to reproduce locally or in test environments, and reproduction during gray release is unstable too
This type of problem is most easily misdiagnosed as “differences in the production environment”, “device compatibility”, or “a third-party SDK acting up”. But when it correlates strongly with a startup-optimization change, I first treat it as something simpler: **race conditions.**
Core judgment: Asynchronization is not just an optimization technique; it changes the system’s readiness semantics.
Many intuitions for startup optimization are:
- Move heavy IO work to a background thread
- Run CPU-heavy work in parallel
- Defer initialization that the first screen does not need until after the first screen
These are almost always “valid” on metrics.
But they also do something more dangerous: **they erase the dependencies that were implicit in sequential execution.**
Previously, Application#onCreate() initialized everything sequentially: A -> B -> C. Even if nobody wrote it down, the whole system relied on one fact:
- When onCreate() ends, A, B, and C have all run at least once
Later they were broken down into:
- A executes immediately
- B is handed to an asynchronous task
- C is handed to another asynchronous task
At this time, the end of onCreate() no longer means “the system is ready”, it only means “I threw the task out”.
In production, the first click often arrives at an unexpected time: the first screen finishes rendering and the user clicks immediately, or some automatic behavior triggers navigation.
So the first business interaction fell into an awkward range:
- Some dependencies have been initialized
- Some are still running
- Some failed, but the failure was silently swallowed
- Some have not started yet because they are delayed
That is not “slow”; that is an incomplete state.
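To make the race window concrete, here is a minimal, hedged sketch (plain threads, hypothetical names) of why returning from onCreate() no longer implies readiness once one dependency is made asynchronous:

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

// Sketch: B's initialization is moved off the main thread, so returning
// from onCreate() only means "the task was submitted", not "B is ready".
class App {
    val ioGate = CountDownLatch(1)           // stands in for slow config IO
    private val bReady = AtomicBoolean(false)
    private val bDone = CountDownLatch(1)

    fun onCreate() {
        // A: still synchronous (omitted)
        // B: fired off asynchronously; onCreate() returns before it finishes
        Thread {
            ioGate.await()                   // "IO" completes whenever it completes
            bReady.set(true)
            bDone.countDown()
        }.start()
    }

    // A first click that lands before B finishes is inside the race window.
    fun firstClick(): String = if (bReady.get()) "ok" else "semi-initialized"

    fun awaitB() = bDone.await()
}
```

The point of the sketch: the same `firstClick()` call returns different results depending only on timing, which is exactly the “occasional” behavior seen online.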
Walkthrough: how the problem converges, step by step, to “semi-initialization”
For occasional problems like this, I do not start with the crash stack. I do three things first to turn “unreproducible” into “explainable”.
1) Draw the startup dependency diagram first, not the module diagram
The module diagram answers “who depends on whom”; the startup question is:
- What initialization must be completed before the first interaction
- Which initialization failures will affect business semantics
- Which initialization is just the icing on the cake
I will divide the startup dependencies into three categories according to the boundary of “first interaction”:
- Hard Ready: if it is not ready, the critical path must not be entered, e.g. login state, auth token, routing table, the core threading model (main-thread/coroutine-dispatcher constraints), and the minimal crash-reporting set.
- Soft Ready: business can proceed while it is not ready, but must degrade in a controlled way, e.g. recommendation caches, AB experiments, and supplementary analytics fields.
- Deferred: can be done later without affecting first-interaction semantics, e.g. warm-ups, image-decoder initialization, and non-critical SDKs.
The value of this step is to change the argument from “async or not async” to “by what boundary must this dependency be complete”.
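As a sketch of this classification (dependency names here are hypothetical, not from any real codebase), the three categories can be made executable so the boundary is checkable rather than tribal knowledge:

```kotlin
enum class Readiness { HARD_READY, SOFT_READY, DEFERRED }

// Hypothetical dependency names; the classification is the point, not the names.
val startupDeps = mapOf(
    "authToken"      to Readiness.HARD_READY,  // first click may hit a gated page
    "routerTable"    to Readiness.HARD_READY,
    "abExperiment"   to Readiness.SOFT_READY,  // degrade to control if missing
    "recommendCache" to Readiness.SOFT_READY,
    "imageDecoder"   to Readiness.DEFERRED     // pure warm-up
)

// Everything HARD_READY must complete before the first interaction.
fun mustCompleteBeforeFirstInteraction(): Set<String> =
    startupDeps.filterValues { it == Readiness.HARD_READY }.keys
```

Keeping this table in code (or config) means a review of any “move X to async” change can mechanically answer whether X crosses the first-interaction boundary.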
2) Give each dependency a “readiness contract”; otherwise asynchronization is gambling
The so-called readiness contract is to clarify two things:
- Who will judge whether it is ready
- How to proceed when business is not ready
Asynchronization without a readiness contract commonly shows up as:
- The caller assumes initialization has completed and uses the dependency directly
- The initializer assumed the caller would never use it that early
- Both sides are “right”; the production bug lives in the timing
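One way to make such a contract concrete is a small interface that forces both questions to be answered in code. This is a hedged sketch; `AbExperiment` and its fallback are illustrative, matching the Soft Ready example above:

```kotlin
// Hypothetical contract: who decides readiness, and what the caller gets
// when the dependency is not ready -- an explicit answer, never a silent guess.
interface ReadinessContract<T> {
    fun isReady(): Boolean
    fun whenNotReady(): T
}

class AbExperiment : ReadinessContract<String> {
    @Volatile var assignment: String? = null   // set by async init when it lands

    override fun isReady() = assignment != null
    override fun whenNotReady() = "control"    // documented fallback, not an accident

    fun current(): String = assignment ?: whenNotReady()
}
```

The value is not the interface itself but the review question it forces: for every asynchronized dependency, someone must write down `whenNotReady()`.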
One of the most typical crashes I have seen comes from changing a singleton’s initialization to lazy + async.
The pseudocode looks like this:
```kotlin
object Foo {
    @Volatile private var inited = false

    fun initAsync() {
        GlobalScope.launch(Dispatchers.IO) {
            // read config / decrypt / fetch remote
            inited = true
        }
    }

    fun doWork() {
        check(inited) { "Foo not initialized" }
        // ...
    }
}
```
The first-screen metric improves, but as soon as a doWork() call lands before initialization finishes, the failure becomes “occasional”.
What is worse, much real code does not even check(inited); it keeps running, produces a wrong state, and does not blow up until much later.
3) Measure the race “window” instead of relying on gut feeling
The necessary condition for asynchronization to cause problems is:
- The first interaction occurs before some initialization has completed
So I add two kinds of log points (important: they must be alignable timestamps):
- t0: process start / Application.onCreate start
- t1: the first screen is interactive (truly clickable)
- t_ready(X): the moment each key dependency becomes ready
Then look at the distribution:
- What proportion of sessions has t1 < t_ready(Auth)
- What proportion has t1 < t_ready(Router)
- Whether these correlate with device model, network, cold vs. hot start, and OS version
Once this window can be quantified, many “occasional” issues suddenly stop being mysterious: they are just probability events.
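Once t1 and t_ready(X) are logged, the window can be computed offline. A minimal sketch, assuming hypothetical per-session timestamps in milliseconds since process start:

```kotlin
// One startup session: when the first screen became interactive,
// and when each named dependency became ready (absent = never ready).
data class StartupTrace(val t1Interactive: Long, val tReady: Map<String, Long>)

// Fraction of sessions where the first interaction beat dependency `dep`:
// this is exactly the race window the text describes.
fun raceWindowRatio(traces: List<StartupTrace>, dep: String): Double {
    val racy = traces.count { t -> (t.tReady[dep] ?: Long.MAX_VALUE) > t.t1Interactive }
    return racy.toDouble() / traces.size
}
```

Slicing this ratio by device model, network, and cold/hot start is then ordinary log analysis, not guesswork.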
Misunderstandings and failure cases: the more fallbacks you pile on, the harder the resulting problems are to troubleshoot
After introducing asynchronization, the team naturally becomes cautious:
- If a dependency is not ready, use a default value
- If the configuration has not been fetched, use the last cached copy
- If the AB assignment has not arrived, fall back to the control group
Each of these makes sense on its own, but they have two side effects.
Misunderstanding 1: Turning “a missing dependency” into “semantic drift”
A crash is actually easy to troubleshoot; a wrong state is the hardest thing to troubleshoot.
For example, when the login state is not ready, it gets treated as “not logged in”. The first click then navigates the user to the wrong page. Later, when the real login state becomes ready, the page state is reset again, producing “flashes”, “bounce-backs”, and “occasionally being logged out”.
In the logs you will see a pile of “normal” branches: every one of them is covered by the design. But the user experience is bad, and it is hard to associate any of it with the startup optimization.
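One controllable alternative is to model “not yet known” explicitly instead of silently defaulting the missing login state to “logged out”. A minimal sketch (page names are hypothetical):

```kotlin
// Three states instead of two: "Unknown" is a real, representable state,
// so the first click cannot silently misread "init not finished" as "logged out".
sealed class LoginState {
    object Unknown : LoginState()            // auth initialization not finished yet
    object LoggedOut : LoginState()
    data class LoggedIn(val userId: String) : LoginState()
}

// Routing the first click; Unknown holds navigation instead of guessing.
fun routeFirstClick(state: LoginState): String = when (state) {
    LoginState.Unknown     -> "wait"         // show loading / queue the action
    LoginState.LoggedOut   -> "loginPage"
    is LoginState.LoggedIn -> "homePage"
}
```

This is the semantic-drift fix in miniature: the degraded path is a distinct, observable state, not a wrong answer.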
Misunderstanding 2: Fallbacks that silently mask each other break the evidence chain
Dependency A is not ready, so the code takes a fallback path.
Meanwhile, dependency B is not ready and has workarounds of its own.
In the end the business misbehaves in a way that looks like B’s problem, but the root cause is A.
What is even more realistic: to “avoid crashing”, exceptions are swallowed and failures are logged at debug level, leaving only a “wrong result” in production.
This is one of the sources of “irreproducibility”: the key failure signals have been erased.
How to fix it: Turn “asynchronization” into a “verifiable readiness boundary”
Solving this class of problem usually means tightening the system’s startup semantics again.
I do three steps, from lowest cost to highest.
1) Define an executable Ready Gate
Give all Hard Ready dependencies one unified gate:
- The gate must be passed before the first interaction
- If it cannot be passed, block the key operation or provide an explicit degradation path
For example, add a small check at the first click entry (navigation/routing/key button):
- Continue when ready
- Display loading if not ready, or queue first
The key point of this step is to turn “dependency not ready” from an implicit race into an explicit state.
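A minimal sketch of such a gate (names are illustrative; a production version would also need timeouts, failure paths, and tighter synchronization between opening and queuing):

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Ready gate at the first-click entry: an action either runs or is
// explicitly queued -- "not ready" becomes a visible state, not a race.
class ReadyGate {
    @Volatile private var ready = false
    private val pending = ConcurrentLinkedQueue<() -> Unit>()

    fun navigate(action: () -> Unit): String =
        if (ready) { action(); "navigated" }
        else { pending.add(action); "queued" }

    // Called once when all Hard Ready dependencies have completed.
    fun open() {
        ready = true
        while (true) {
            val action = pending.poll() ?: break
            action()
        }
    }
}
```

In a real app the “queued” branch is where you show a brief loading state instead of letting the click fall into a half-initialized code path.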
2) Make initialization a “stateful task” instead of fire-and-forget
Many initializations are fired off with GlobalScope.launch or a thread pool; if they fail, they just fail.
A more controllable approach:
- Each initialization has a state: NotStarted / Running / Ready / Failed
- The caller gets a handle it can await (even if it never does)
Pseudo code:
```kotlin
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

sealed class InitState {
    data object NotStarted : InitState()
    data object Running : InitState()
    data object Ready : InitState()
    data class Failed(val error: Throwable) : InitState()
}

class Initializer(
    // The original pseudocode left `scope` undefined; an app-scoped scope works.
    private val scope: CoroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
) {
    @Volatile private var state: InitState = InitState.NotStarted
    private val deferred = CompletableDeferred<Unit>()

    fun start() {
        if (state != InitState.NotStarted) return
        state = InitState.Running
        scope.launch(Dispatchers.IO) {
            runCatching {
                // do init
            }.onSuccess {
                state = InitState.Ready
                deferred.complete(Unit)
            }.onFailure {
                state = InitState.Failed(it)
                deferred.completeExceptionally(it)
            }
        }
    }

    suspend fun awaitReady() = deferred.await()
}
```
This makes two things true:
- You can choose where to await
- There are no more silent assumptions that initialization “probably finished”
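For contexts without kotlinx.coroutines, the same stateful-task idea can be sketched with java.util.concurrent.CompletableFuture as the awaitable handle; all names here are illustrative:

```kotlin
import java.util.concurrent.CompletableFuture

// Coroutine-free variant: explicit state plus an awaitable handle that
// propagates failure instead of swallowing it.
sealed class State {
    object NotStarted : State()
    object Running : State()
    object Ready : State()
    class Failed(val error: Throwable) : State()
}

class FutureInitializer(private val init: () -> Unit) {
    @Volatile var state: State = State.NotStarted
        private set
    val done = CompletableFuture<Unit>()     // await with done.get() or compose on it

    fun start() {
        if (state != State.NotStarted) return
        state = State.Running
        Thread {
            try {
                init()
                state = State.Ready
                done.complete(Unit)
            } catch (e: Throwable) {
                state = State.Failed(e)
                done.completeExceptionally(e)  // failure stays visible to callers
            }
        }.start()
    }
}
```

The design point is the same in both variants: failure becomes data (`Failed` plus a failed handle) instead of a log line nobody reads.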
3) Set boundaries and rollback switches for delayed initialization
Lazy initialization is not forbidden, but it needs boundary conditions:
- Which users/scenarios may be delayed (e.g. only on cold start, or on hot start too)
- What happens on failure (retry, disable, roll back)
- How to observe it during gray release (ready-window distribution, failure rate, degradation ratio)
I would rather make “asynchronous startup” a policy switch that can be rolled back than a one-off code change.
Because once an occasional problem shows up in production, the fastest way to stop the bleeding is usually to roll the asynchronization back.
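A minimal sketch of such a switch (the flag names and config source are hypothetical), defaulting to the safe synchronous path whenever the flag is missing:

```kotlin
// Remote-config-style switch: async startup is a policy, not a one-way change.
class StartupPolicy(private val flags: Map<String, Boolean>) {
    // Missing flag -> safe default: keep the dependency on the blocking path.
    fun asyncInitEnabled(dep: String): Boolean =
        flags["async_init_$dep"] ?: false
}

fun initMode(policy: StartupPolicy, dep: String): String =
    if (policy.asyncInitEnabled(dep)) "background" else "blocking"
```

With this shape, “roll back the asynchronization” is a config flip rather than an emergency release.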
Applicability boundaries: When does asynchronization pay off, and when does it cost you?
Asynchronization pays off when:
- Dependency is Soft Ready or Deferred
- The readiness contract is clear, and there is an evidence chain for failure
- The ready window is small and stable, and does not span the first interaction
Typical scenarios where asynchronization is a loss:
- The dependency is Hard Ready, but it was moved anyway for the sake of the metric
- Failures are papered over with fallbacks, leading to semantic drift
- There is no ready gate, so the race condition becomes a probability event
To summarize in one sentence: **if you can explain when each dependency is not ready, and the business keeps consistent semantics while it is not ready, then asynchronization counts as optimization.**
Summary
Cold-start optimization is easily driven by KPIs into a single-objective problem: make the first screen faster.
But what really needs to be kept in mind during the startup phase is “when will the system be considered available?” The more the initialization is broken down into pieces, the more clearly the readiness semantics need to be written into the code and into the observations.
Otherwise, what we are really doing is trading deterministic slowness for probabilistic errors.