
iOS Performance Optimization Series 07 | Converging on the Root Cause: A Real iOS Performance Investigation

The truly hard part is first determining whether what everyone sees is even the same problem, and then stripping away a pile of false leads layer by layer.

When troubleshooting iOS performance, the most frustrating part is that the scene is usually a mess from the start.

Product says the homepage is janky, QA says the list is dropping frames, a developer suspects the images, and the backend suspects a slow API. Everyone sounds like they are describing the same issue, but once you start digging, you often find they are not describing the same thing at all.

After doing this kind of work a few times, my strongest takeaway is: **performance troubleshooting starts with converging the problem; analyzing data comes second.**

If you do not converge the problem first, the more tools you open, the easier it is to lead yourself astray: every metric seems to carry information, yet no single piece of it is enough to form a conclusion.

This article is not about how to troubleshoot "in theory". It follows the rhythm of a real project: how a frame-drop problem in the homepage feed went, step by step, from "it feels a bit janky" to something concrete enough to fix.

Pin down the scenario first, otherwise all subsequent analysis is lost.

The initial description of the problem was utterly ordinary:

  • The homepage does not scroll smoothly
  • It is more obvious on some models
  • The latest version feels even worse

This sounds like information, but it is almost unusable for troubleshooting as-is, because it mixes at least three layers of ambiguity:

  • The page scope is unclear
  • The type of symptom is unclear
  • The reproduction conditions are unclear

The description that actually got things moving was eventually narrowed down to this:

iPhone 13 Pro, iOS 18, logged in as account A, cold start into the homepage recommendation feed, switch to the "Follow" tab after the first screen finishes loading, switch back to the recommendation feed, swipe up 4 screens in a row, and frames drop continuously starting from the 3rd screen. The problem mainly occurs in cards that mix images and text and in mixed video cards.

What matters about this description is not that it reads like a test case, but that it eliminates a pile of ambiguity at once:

  • It is the recommendation-feed cards on the homepage
  • It is more obvious on the second entry
  • It is frame dropping while scrolling
  • It is more obvious in areas that mix images and text

A real investigation often starts with this act of pinning down the scope. Without this step, all subsequent "analysis" may just be chasing your own imagination.

Don't rush to the tools; first determine whether this is an occasional hitch or a sustained low frame rate.

"Doesn't scroll smoothly" can correspond to completely different problems.

The first thing I did was record the screen and scroll through several times to characterize the symptom.

It turned out to be a very typical sustained low frame rate:

  • Once a specific content area is entered, several consecutive screens are not smooth
  • The entire scrolling phase is heavy
  • The user's subjective feeling is "this section cannot be scrolled"

This distinction is very important.

Because occasional hitches usually look like:

  • An image decode suddenly landing on the main thread
  • A large object being initialized at the wrong moment
  • A layout pass occasionally becoming heavy

While a sustained low frame rate looks more like:

  • Repetitive work being done every frame
  • The cell structure itself being too heavy
  • Asynchronous results continuously backfilling during scrolling
  • The list's state refreshes covering too wide a range

If these two classes of problem are not separated at the start, any data collected later is easy to misinterpret.
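The distinction can even be made mechanical. Below is a minimal sketch (thresholds and type names are mine, not from the original investigation) that classifies a run of recorded frame durations; in practice the durations would come from diffing CADisplayLink timestamps:

```swift
import Foundation

// Classify a run of frame durations (in ms) as an occasional hitch
// vs. a sustained low frame rate. Thresholds are illustrative, not
// canonical: 16.7 ms is roughly one 60 Hz frame.
enum FramePattern: Equatable {
    case smooth
    case occasionalHitch   // a few isolated long frames
    case sustainedLowFPS   // most frames over budget
}

func classify(frameDurationsMs: [Double], budgetMs: Double = 16.7) -> FramePattern {
    guard !frameDurationsMs.isEmpty else { return .smooth }
    let over = frameDurationsMs.filter { $0 > budgetMs * 1.5 }
    let ratio = Double(over.count) / Double(frameDurationsMs.count)
    if ratio > 0.5 { return .sustainedLowFPS }   // most frames are late
    if !over.isEmpty { return .occasionalHitch } // isolated spikes only
    return .smooth
}
```

An occasional hitch shows up as isolated spikes; a sustained low frame rate shows up as most frames over budget, which matches the two symptom lists above.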

Write the reproduction path down, otherwise "it got better" means nothing.

One of the most common traits of performance problems is: "it was obvious a moment ago, and now it's gone."

So the first move I made in that investigation was dumb but very valuable: writing the reproduction path down as fixed steps.

What we settled on was:

  1. Kill the app and cold start again
  2. Log in to account A
  3. Enter the homepage recommendation flow and wait for the first screen to be completely stable.
  4. Switch to the “Follow” tab
  5. Switch back to the recommendation flow
  6. Swipe up 4 screens in quick succession
  7. Observe the frame rate performance and main thread occupancy starting from the third screen

Also record the environment:

  • Device model
  • System version
  • Network conditions
  • Data volume
  • Whether debug switches are on

The point is not just to help others reproduce the problem; more importantly, it makes later verification possible.

Because the most worthless sentence in performance optimization is:

"It feels better than before."

Without a fixed path, the word "smooth" has no analytical value. It may just be that the data is different, the network is different, the page did not scroll to the same section, or your finger simply did not swipe as fast this time.

When I first started dealing with this type of problem, I wanted to ask:

  • Which line of code is slow?
  • Which function is the most time-consuming?
  • Which library is the problem?

But in a real investigation, these questions usually come too early.

What I did first was a coarse segmentation of the pipeline:

Is it a data problem or a rendering problem?

This step is critical, because a slow homepage makes people instinctively blame the API.

First, a very direct verification: pin the network responses as much as possible, so that the data on the second entry into the recommendation feed comes from existing results rather than from network variance.

The result was clear: with the data already returned, scrolling was still heavy, and reliably so.

This step basically eliminates "scrolling jank caused by a slow API" as the main problem.

A slow API affects first-screen time, but it does not explain "sustained low frames starting at a particular piece of content on the second entry".
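One cheap way to "pin the network" is to put the feed's data source behind a protocol and swap in a fixture that replays a captured payload. This is a hypothetical sketch; the type and method names are illustrative, not the project's actual API:

```swift
import Foundation

// To rule out "slow API" as the cause, pin the feed to a fixed local
// fixture so the second entry into the recommendation feed renders
// exactly the same data every time.
struct FeedItem: Decodable, Equatable {
    let id: String
    let title: String
}

protocol FeedSource {
    func loadFirstPage() -> [FeedItem]
}

// Production would hit the network; this stub decodes a captured payload.
struct FixtureFeedSource: FeedSource {
    let json: Data
    func loadFirstPage() -> [FeedItem] {
        (try? JSONDecoder().decode([FeedItem].self, from: json)) ?? []
    }
}
```

With the fixture in place, any remaining jank cannot be blamed on response timing or changing data.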

Is the main thread itself busy, or is background work constantly bouncing back onto the main thread?

The next thing to look at is what the main thread is doing while the jank occurs.

The goal of this step is, again, to determine the shape of the problem first:

  • The main thread is continuously busy
  • Or background callbacks are too dense and keep interrupting the main thread

What I saw was mostly the former: the main thread's busyness was concentrated in the scrolling phase.

The direction was starting to become clear:

  • The problem is not the network
  • It is not a single large task either
  • It looks more like the list's display pipeline itself being too heavy

Is it a single-point anomaly or systemic overload?

This is a judgment I weigh every time.

If one function occasionally blows up for 200 ms, it is easy to handle: just catch it. That was not the case this time.

The real trouble was that no single point was outrageous enough to be "obviously the killer at a glance". Instead, a pile of individually modest costs were stacked together, and all of them fell on the scrolling critical path.

This kind of problem is the most annoying, because unlike a crash it has no clear stack. It is more like the system billing you, all at once, for every past "let's just write it this way" decision.

Only at this point is it worth opening Instruments, and it must be done with suspicions in hand.

I have never liked the "open all the tools first, then talk" approach to troubleshooting. It makes it too easy to substitute tools for thinking.

Before opening Instruments that time, I already had several suspects in mind:

Suspicion 1: The image pipeline is too heavy

This is the most common suspect in feed scenarios.

Especially:

  • Cover image sizes are inconsistent
  • Cards mix images, text, and video
  • Image backfill timing is unstable
  • Rounded corners, shadows, and scaling are applied before display

If all of this work actually happens during scrolling, the frame rate will struggle to stabilize.

Suspicion 2: Cell display preparation happens too late

Many lists look "logically clean" at first, but once you actually scroll on a device, an old problem surfaces:

A lot of display preparation work is done only when the cell is about to appear.

For example:

  • Rich text assembly
  • Text truncation
  • Time formatting
  • Height calculation
  • UI state evaluation
  • Analytics payload preparation

Each item is small on its own, but once they are all squeezed onto the scrolling critical path, the list goes from "acceptable" to "heavy everywhere."
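The usual fix for this suspicion is to move that preparation out of the scroll path: build a display-ready view model once, when the data arrives, on a background queue. A minimal sketch with hypothetical field names and limits:

```swift
import Foundation

// Move display preparation (text truncation, date formatting, etc.)
// out of cellForItemAt: build a display-ready view model once, off
// the main thread, when data arrives. Field names are illustrative.
struct CardModel {
    let title: String
    let timestamp: Date
}

struct CardViewModel {
    let displayTitle: String
    let displayTime: String
}

func makeViewModel(_ model: CardModel, formatter: DateFormatter) -> CardViewModel {
    // Truncation and formatting happen here, once per item,
    // not once per scroll-time cell bind.
    let title = model.title.count > 50
        ? String(model.title.prefix(50)) + "…"
        : model.title
    return CardViewModel(displayTitle: title,
                         displayTime: formatter.string(from: model.timestamp))
}
```

The cell then only assigns precomputed strings, which is cheap enough for the scrolling critical path.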

Suspicion 3: State refreshes cover too wide a range

This situation is particularly common in reactive architectures.

On the surface it is just one card's state changing, but what actually gets triggered is:

  • Section-level rebinding
  • Excessive local diffing
  • Exposure reporting coupled to UI refresh
  • Preload callbacks frequently returning to the main thread

This type of problem is nasty because it may never produce a tall peak on any single function, yet the overall experience stays poor.
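The mitigation direction is to shrink each refresh to exactly the items that changed. A minimal, framework-free sketch of the diffing idea (with a diffable data source, the resulting indices would map to reconfiguring just those items rather than rebinding the section):

```swift
import Foundation

// When one card's state changes, refresh only that card instead of
// rebinding the whole section. Diff old vs. new item state to get
// the exact indices to reload. Assumes equal-length snapshots.
func indicesToReload<T: Equatable>(old: [T], new: [T]) -> [Int] {
    precondition(old.count == new.count, "sketch assumes same-length snapshots")
    return zip(old, new).enumerated()
        .filter { $0.element.0 != $0.element.1 }  // keep changed positions
        .map { $0.offset }
}
```

Insertion and removal need a real diffing algorithm, but for the common "one card's like count changed" case, this already keeps the refresh scope honest.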

What really narrows the direction is a few rounds of cheap experiments.

The problem was ultimately narrowed down through a few rounds of very simple experiments.

Experiment 1: Replace all images with placeholder images

This experiment is very crude, but extremely useful.

Once the images were replaced, scrolling was visibly smooth again. This step does not prove that the root cause is "too many images", but it is enough to show that the image pipeline must be one of the main directions.

If you keep asking at this point, the questions are no longer a general "yes" but become specific:

  • Does decoding land on the main thread?
  • Is the sizing strategy inconsistent?
  • Does image backfill trigger re-layout?
  • Is too much extra processing done before display?
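If decoding and sizing turn out to be the issue, the usual mitigation is to downsample each image to its cell's display size on a background queue (for example via ImageIO's thumbnail API) before it ever touches the main thread. A small helper for the target-size arithmetic; the names here are illustrative:

```swift
import Foundation

// Downsampling images to the cell's display size before they reach
// the main thread avoids per-frame decode and scaling costs during
// scroll. This computes the pixel dimension that a thumbnail decode
// (e.g. ImageIO's kCGImageSourceThumbnailMaxPixelSize) would want;
// the actual decode would run on a background queue on-device.
func thumbnailMaxPixelSize(viewSize: (width: Double, height: Double),
                           screenScale: Double) -> Int {
    let maxPoints = max(viewSize.width, viewSize.height)
    return Int((maxPoints * screenScale).rounded(.up))
}
```

Keeping this value tied to the actual cell size is also what makes the sizing strategy consistent across card types.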

Experiment 2: Preprocess part of the cell's dynamic text

With this change, scrolling also improved noticeably.

This confirmed the other direction: display preparation really was happening too late, and there was more than one or two such tasks.

In other words, the problem is the image pipeline and the display-preparation pipeline stacking on top of each other.

Experiment 3: Temporarily disable edge behaviors such as exposure reporting, preloading, and autoplay

After this round of experiments, the frame rate stabilized further.

At this point the problem was basically clear: the homepage feed carries too much "might as well do it now" work during the scrolling phase.

These tasks usually look reasonable on their own:

  • Images need backfilling
  • Exposure needs reporting
  • Preloading needs to happen
  • Video cards need warming up

But when they happen together on the scrolling critical path, they bring the whole list down.
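One common way to pull this work off the critical path is to coalesce it while the user is scrolling and flush it once the scroll settles (driven, on-device, by scroll-view delegate callbacks such as scrollViewDidEndDecelerating). A minimal sketch with hypothetical names:

```swift
import Foundation

// Exposure reporting, preloading, and video warm-up don't need to
// run per frame. While scrolling, enqueue the work; when the scroll
// settles, flush it once. (Not thread-safe: assumes main-thread use,
// matching scroll-view delegate callbacks.)
final class DeferredWorkQueue {
    private var pending: [() -> Void] = []
    private(set) var isScrolling = false

    // Call with true from scrollViewWillBeginDragging,
    // false from scrollViewDidEndDecelerating.
    func scrollingChanged(_ scrolling: Bool) {
        isScrolling = scrolling
        if !scrolling { flush() }
    }

    // Runs immediately when idle, defers while scrolling.
    func enqueue(_ work: @escaping () -> Void) {
        if isScrolling { pending.append(work) } else { work() }
    }

    private func flush() {
        let work = pending
        pending = []
        work.forEach { $0() }
    }
}
```

The trade-off named later in the article applies here: exposure events now arrive slightly later, so the analytics side has to tolerate that delay.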

What the real root cause looked like in the end: a set of bad decisions

When it finally settled, the real problem was this combination of blows:

  • An inconsistent image sizing strategy, incurring processing costs before display
  • Some images decoded too late
  • Cell display preparation done on the spot on the main thread
  • Exposure and preload callbacks too dense during scrolling
  • Local list state changes triggering a wider refresh than expected

This is what many performance issues really look like.

It does not end with "found it: line 357". Instead, you eventually find that several layers of design were not restrained enough, and the bill came due all at once in the scrolling phase.

That is why I increasingly dislike the claim that "every performance problem has one key function". That does happen in real projects, but more often the answer is "a set of bad patterns piled up over a long time."

The verification phase's worst enemy is subjective impressions; you must go back over the same path and compare.

After the changes, we did not simply declare the problem solved; we re-ran the original reproduction path.

Same device, same account, same tab-switching sequence, same scroll into the content area, then check:

  • Do the frame drops still reproduce reliably?
  • Is the main thread still saturated over the same window?
  • Are image backfill and edge behaviors still competing with scrolling for the main thread?

At this step, I also took a look at the side effects:

  • After the image pipeline changes, did memory rise noticeably?
  • With more preprocessing, did the first screen get slower?
  • With exposure reporting delayed, are the analytics affected?

A common mistake in performance optimization is focusing only on whether one metric looks good, without checking whether it was traded for other problems.

In real work, optimization is always a trade within the overall experience. An acceptable trade is called optimization; trading one problem for another is not.

This is where articles like this most easily turn into empty talk, and it is also where performance troubleshooting is most real.

Many summary articles end with:

  • Define the problem first
  • Then form hypotheses
  • Then verify the results

These words are certainly true, but they can all too easily turn into correct nonsense.

What separates real experience is not whether you can recite those words, but whether you have lived through moments like these:

  • At first everyone said it looked like the same problem, when it was not the same problem at all
  • The direction that looked most like the root cause turned out to be a false lead
  • Every metric in the tools looked abnormal, yet there was only one real main line
  • The final answer was a set of small overheads accumulated over time

Once you have done these things a few times, you will naturally understand one thing:

**The real value of performance troubleshooting lies in whether you can turn a chaotic scene into a workable pipeline.** If you cannot, more tools only mean more mess. If you can, you get closer and closer to the root cause of many problems without ever going through all the charts.

When I look at performance issues now, what I care most about is whether we can get the team to agree on the following as soon as possible:

We are all investigating the same problem, and we already know which part of the pipeline it mainly falls on.

Once that is established, the subsequent fix is usually not that confusing. The hard part is always converging the problem into a shape you can act on.
