
iOS Performance Optimization Series 07 | Converging on the Root Cause: A Real iOS Performance Investigation

The truly hard part is first determining whether what everyone sees is even the same problem, and then stripping away a pile of false leads layer by layer.

When troubleshooting iOS performance, the most frustrating part is that the scene is usually a mess from the start.

Product says the homepage is janky, QA says the list is dropping frames, a developer suspects the images, and the backend suspects a slow API. Everyone sounds like they are describing the same issue, but once you start digging, you often find they are not describing the same thing at all.

After doing this kind of work a few times, my strongest takeaway is: **performance troubleshooting starts with converging the problem; analyzing data comes second.**

If you do not converge the problem first, the more tools you open, the easier it is to lead yourself astray: every metric seems to carry information, yet no single piece of it is enough to form a conclusion.

This article is not about how to troubleshoot "in theory". It follows the rhythm of a real project: how a frame-drop problem in the homepage feed went, step by step, from "it feels a bit janky" to something concrete enough to fix.

Pin down the scenario first, otherwise all subsequent analysis is lost.

The initial description of the problem was utterly ordinary:

  • The homepage does not scroll smoothly
  • It is more obvious on some models
  • The latest version feels even worse

This sounds like information, but it is almost unusable for troubleshooting as-is, because it mixes at least three layers of ambiguity:

  • The page scope is unclear
  • The type of symptom is unclear
  • The reproduction conditions are unclear

The description that actually got things moving was eventually narrowed down to this:

iPhone 13 Pro, iOS 18, logged in as account A, cold start into the homepage recommendation feed, switch to the "Follow" tab after the first screen finishes loading, switch back to the recommendation feed, swipe up 4 screens in a row, and frames drop continuously starting from the 3rd screen. The problem mainly occurs in cards that mix images and text and in mixed video cards.

What matters about this description is not that it reads like a test case, but that it eliminates a pile of ambiguity at once:

  • It is the recommendation-feed cards on the homepage
  • It is more obvious on the second entry
  • It is frame dropping while scrolling
  • It is more obvious in areas that mix images and text

A real investigation often starts with this act of pinning down the scope. Without this step, all subsequent "analysis" may just be chasing your own imagination.

Don't rush to the tools; first determine whether this is an occasional hitch or a sustained low frame rate.

"Doesn't scroll smoothly" can correspond to completely different problems.

The first thing I did was record the screen and scroll through several times to characterize the symptom.

It turned out to be a very typical sustained low frame rate:

  • Once a specific content area is entered, several consecutive screens are not smooth
  • The entire scrolling phase is heavy
  • The user's subjective feeling is "this section cannot be scrolled"

This distinction is very important.

Because occasional hitches usually look like:

  • An image decode suddenly landing on the main thread
  • A large object being initialized at the wrong moment
  • A layout pass occasionally becoming heavy

While a sustained low frame rate looks more like:

  • Repetitive work being done every frame
  • The cell structure itself being too heavy
  • Asynchronous results continuously backfilling during scrolling
  • The list's state refreshes covering too wide a range

If these two classes of problem are not separated at the start, any data collected later is easy to misinterpret.
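The distinction can even be made mechanical. Below is a minimal sketch (thresholds and type names are mine, not from the original investigation) that classifies a run of recorded frame durations; in practice the durations would come from diffing CADisplayLink timestamps:

```swift
import Foundation

// Classify a run of frame durations (in ms) as an occasional hitch
// vs. a sustained low frame rate. Thresholds are illustrative, not
// canonical: 16.7 ms is roughly one 60 Hz frame.
enum FramePattern: Equatable {
    case smooth
    case occasionalHitch   // a few isolated long frames
    case sustainedLowFPS   // most frames over budget
}

func classify(frameDurationsMs: [Double], budgetMs: Double = 16.7) -> FramePattern {
    guard !frameDurationsMs.isEmpty else { return .smooth }
    let over = frameDurationsMs.filter { $0 > budgetMs * 1.5 }
    let ratio = Double(over.count) / Double(frameDurationsMs.count)
    if ratio > 0.5 { return .sustainedLowFPS }   // most frames are late
    if !over.isEmpty { return .occasionalHitch } // isolated spikes only
    return .smooth
}
```

An occasional hitch shows up as isolated spikes; a sustained low frame rate shows up as most frames over budget, which matches the two symptom lists above.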

Write the reproduction path down, otherwise "it got better" means nothing.

One of the most common traits of performance problems is: "it was obvious a moment ago, and now it's gone."

So the first move I made in that investigation was dumb but very valuable: writing the reproduction path down as fixed steps.

What we settled on was:

  1. Kill the app and cold start again
  2. Log in to account A
  3. Enter the homepage recommendation flow and wait for the first screen to be completely stable.
  4. Switch to the “Follow” tab
  5. Switch back to the recommendation flow
  6. Swipe up 4 screens in quick succession
  7. Observe the frame rate performance and main thread occupancy starting from the third screen

Also record the environment:

  • Device model
  • System version
  • Network conditions
  • Data volume
  • Whether debug switches are on

The point is not just to help others reproduce the problem; more importantly, it makes later verification possible.

Because the most worthless sentence in performance optimization is:

"It feels better than before."

Without a fixed path, the word "smooth" has no analytical value. It may just be that the data is different, the network is different, the page did not scroll to the same section, or your finger simply did not swipe as fast this time.

When I first started dealing with this type of problem, I wanted to ask:

  • Which line of code is slow?
  • Which function is the most time-consuming?
  • Which library is the problem?

But in a real investigation, these questions usually come too early.

What I did first was a coarse segmentation of the pipeline:

Is it a data problem or a rendering problem?

This step is critical, because a slow homepage makes people instinctively blame the API.

First, a very direct verification: pin the network responses as much as possible, so that the data on the second entry into the recommendation feed comes from existing results rather than from network variance.

The result was clear: with the data already returned, scrolling was still heavy, and reliably so.

This step basically eliminates "scrolling jank caused by a slow API" as the main problem.

A slow API affects first-screen time, but it does not explain "sustained low frames starting at a particular piece of content on the second entry".
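One cheap way to "pin the network" is to put the feed's data source behind a protocol and swap in a fixture that replays a captured payload. This is a hypothetical sketch; the type and method names are illustrative, not the project's actual API:

```swift
import Foundation

// To rule out "slow API" as the cause, pin the feed to a fixed local
// fixture so the second entry into the recommendation feed renders
// exactly the same data every time.
struct FeedItem: Decodable, Equatable {
    let id: String
    let title: String
}

protocol FeedSource {
    func loadFirstPage() -> [FeedItem]
}

// Production would hit the network; this stub decodes a captured payload.
struct FixtureFeedSource: FeedSource {
    let json: Data
    func loadFirstPage() -> [FeedItem] {
        (try? JSONDecoder().decode([FeedItem].self, from: json)) ?? []
    }
}
```

With the fixture in place, any remaining jank cannot be blamed on response timing or changing data.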

Is the main thread itself busy, or is background work constantly bouncing back onto the main thread?

The next thing to look at is what the main thread is doing while the jank occurs.

The goal of this step is, again, to determine the shape of the problem first:

  • The main thread is continuously busy
  • Or background callbacks are too dense and keep interrupting the main thread

What I saw was mostly the former: the main thread's busyness was concentrated in the scrolling phase.

The direction was starting to become clear:

  • The problem is not the network
  • It is not a single large task either
  • It looks more like the list's display pipeline itself being too heavy

Is it a single-point anomaly or systemic overload?

This is a judgment I weigh every time.

If one function occasionally blows up for 200 ms, it is easy to handle: just catch it. That was not the case this time.

The real trouble was that no single point was outrageous enough to be "obviously the killer at a glance". Instead, a pile of individually modest costs were stacked together, and all of them fell on the scrolling critical path.

This kind of problem is the most annoying, because unlike a crash it has no clear stack. It is more like the system billing you, all at once, for every past "let's just write it this way" decision.

Only at this point is it worth opening Instruments, and it must be done with suspicions in hand.

I have never liked the "open all the tools first, then talk" approach to troubleshooting. It makes it too easy to substitute tools for thinking.

Before opening Instruments that time, I already had several suspects in mind:

Suspicion 1: The image pipeline is too heavy

This is the most common suspect in feed scenarios.

Especially:

  • Cover image sizes are inconsistent
  • Cards mix images, text, and video
  • Image backfill timing is unstable
  • Rounded corners, shadows, and scaling are applied before display

If all of this work actually happens during scrolling, the frame rate will struggle to stabilize.

Suspicion 2: Cell display preparation happens too late

Many lists look "logically clean" at first, but once you actually scroll on a device, an old problem surfaces:

A lot of display preparation work is done only when the cell is about to appear.

For example:

  • Rich text assembly
  • Text truncation
  • Time formatting
  • Height calculation
  • UI state evaluation
  • Analytics payload preparation

Each item is small on its own, but once they are all squeezed onto the scrolling critical path, the list goes from "acceptable" to "heavy everywhere."
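The usual fix for this suspicion is to move that preparation out of the scroll path: build a display-ready view model once, when the data arrives, on a background queue. A minimal sketch with hypothetical field names and limits:

```swift
import Foundation

// Move display preparation (text truncation, date formatting, etc.)
// out of cellForItemAt: build a display-ready view model once, off
// the main thread, when data arrives. Field names are illustrative.
struct CardModel {
    let title: String
    let timestamp: Date
}

struct CardViewModel {
    let displayTitle: String
    let displayTime: String
}

func makeViewModel(_ model: CardModel, formatter: DateFormatter) -> CardViewModel {
    // Truncation and formatting happen here, once per item,
    // not once per scroll-time cell bind.
    let title = model.title.count > 50
        ? String(model.title.prefix(50)) + "…"
        : model.title
    return CardViewModel(displayTitle: title,
                         displayTime: formatter.string(from: model.timestamp))
}
```

The cell then only assigns precomputed strings, which is cheap enough for the scrolling critical path.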

Suspicion 3: State refreshes cover too wide a range

This situation is particularly common in reactive architectures.

On the surface it is just one card's state changing, but what actually gets triggered is:

  • Section-level rebinding
  • Excessive local diffing
  • Exposure reporting coupled to UI refresh
  • Preload callbacks frequently returning to the main thread

This type of problem is nasty because it may never produce a tall peak on any single function, yet the overall experience stays poor.
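The mitigation direction is to shrink each refresh to exactly the items that changed. A minimal, framework-free sketch of the diffing idea (with a diffable data source, the resulting indices would map to reconfiguring just those items rather than rebinding the section):

```swift
import Foundation

// When one card's state changes, refresh only that card instead of
// rebinding the whole section. Diff old vs. new item state to get
// the exact indices to reload. Assumes equal-length snapshots.
func indicesToReload<T: Equatable>(old: [T], new: [T]) -> [Int] {
    precondition(old.count == new.count, "sketch assumes same-length snapshots")
    return zip(old, new).enumerated()
        .filter { $0.element.0 != $0.element.1 }  // keep changed positions
        .map { $0.offset }
}
```

Insertion and removal need a real diffing algorithm, but for the common "one card's like count changed" case, this already keeps the refresh scope honest.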

What really narrows the direction is a few rounds of cheap experiments.

The problem was ultimately narrowed down through a few rounds of very simple experiments.

Experiment 1: Replace all images with placeholder images

This experiment is very crude, but extremely useful.

Once the images were replaced, scrolling was visibly smooth again. This step does not prove that the root cause is "too many images", but it is enough to show that the image pipeline must be one of the main directions.

If you keep asking at this point, the questions are no longer a general "yes" but become specific:

  • Does decoding land on the main thread?
  • Is the sizing strategy inconsistent?
  • Does image backfill trigger re-layout?
  • Is too much extra processing done before display?
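If decoding and sizing turn out to be the issue, the usual mitigation is to downsample each image to its cell's display size on a background queue (for example via ImageIO's thumbnail API) before it ever touches the main thread. A small helper for the target-size arithmetic; the names here are illustrative:

```swift
import Foundation

// Downsampling images to the cell's display size before they reach
// the main thread avoids per-frame decode and scaling costs during
// scroll. This computes the pixel dimension that a thumbnail decode
// (e.g. ImageIO's kCGImageSourceThumbnailMaxPixelSize) would want;
// the actual decode would run on a background queue on-device.
func thumbnailMaxPixelSize(viewSize: (width: Double, height: Double),
                           screenScale: Double) -> Int {
    let maxPoints = max(viewSize.width, viewSize.height)
    return Int((maxPoints * screenScale).rounded(.up))
}
```

Keeping this value tied to the actual cell size is also what makes the sizing strategy consistent across card types.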

Experiment 2: Preprocess part of the cell's dynamic text

With this change, scrolling also improved noticeably.

This confirmed the other direction: display preparation really was happening too late, and there was more than one or two such tasks.

In other words, the problem is the image pipeline and the display-preparation pipeline stacking on top of each other.

Experiment 3: Temporarily disable edge behaviors such as exposure reporting, preloading, and autoplay

After this round of experiments, the frame rate stabilized further.

At this point the problem was basically clear: the homepage feed carries too much "might as well do it now" work during the scrolling phase.

These tasks usually look reasonable on their own:

  • Images need backfilling
  • Exposure needs reporting
  • Preloading needs to happen
  • Video cards need warming up

But when they happen together on the scrolling critical path, they bring the whole list down.
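One common way to pull this work off the critical path is to coalesce it while the user is scrolling and flush it once the scroll settles (driven, on-device, by scroll-view delegate callbacks such as scrollViewDidEndDecelerating). A minimal sketch with hypothetical names:

```swift
import Foundation

// Exposure reporting, preloading, and video warm-up don't need to
// run per frame. While scrolling, enqueue the work; when the scroll
// settles, flush it once. (Not thread-safe: assumes main-thread use,
// matching scroll-view delegate callbacks.)
final class DeferredWorkQueue {
    private var pending: [() -> Void] = []
    private(set) var isScrolling = false

    // Call with true from scrollViewWillBeginDragging,
    // false from scrollViewDidEndDecelerating.
    func scrollingChanged(_ scrolling: Bool) {
        isScrolling = scrolling
        if !scrolling { flush() }
    }

    // Runs immediately when idle, defers while scrolling.
    func enqueue(_ work: @escaping () -> Void) {
        if isScrolling { pending.append(work) } else { work() }
    }

    private func flush() {
        let work = pending
        pending = []
        work.forEach { $0() }
    }
}
```

The trade-off named later in the article applies here: exposure events now arrive slightly later, so the analytics side has to tolerate that delay.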

What the real root cause looked like in the end: a set of bad decisions

When it finally settled, the real problem was this combination of blows:

  • An inconsistent image sizing strategy, incurring processing costs before display
  • Some images decoded too late
  • Cell display preparation done on the spot on the main thread
  • Exposure and preload callbacks too dense during scrolling
  • Local list state changes triggering a wider refresh than expected

This is what many performance issues really look like.

It does not end with "found it: line 357". Instead, you eventually find that several layers of design were not restrained enough, and the bill came due all at once in the scrolling phase.

That is why I increasingly dislike the claim that "every performance problem has one key function". That does happen in real projects, but more often the answer is "a set of bad patterns piled up over a long time."

The verification phase's worst enemy is subjective impressions; you must go back over the same path and compare.

After the changes, we did not simply declare the problem solved; we re-ran the original reproduction path.

Same device, same account, same tab-switching sequence, same scroll into the content area, then check:

  • Do the frame drops still reproduce reliably?
  • Is the main thread still saturated over the same window?
  • Are image backfill and edge behaviors still competing with scrolling for the main thread?

At this step, I also took a look at the side effects:

  • After the image pipeline changes, did memory rise noticeably?
  • With more preprocessing, did the first screen get slower?
  • With exposure reporting delayed, are the analytics affected?

A common mistake in performance optimization is focusing only on whether one metric looks good, without checking whether it was traded for other problems.

In real work, optimization is always a trade within the overall experience. An acceptable trade is called optimization; trading one problem for another is not.

This is where articles like this most easily turn into empty talk, and it is also where performance troubleshooting is most real.

Many summary articles end with:

  • Define the problem first
  • Then form hypotheses
  • Then verify the results

These words are certainly true, but they can all too easily turn into correct nonsense.

What separates real experience is not whether you can recite those words, but whether you have lived through moments like these:

  • At first everyone said it looked like the same problem, when it was not the same problem at all
  • The direction that looked most like the root cause turned out to be a false lead
  • Every metric in the tools looked abnormal, yet there was only one real main line
  • The final answer was a set of small overheads accumulated over time

Once you have done these things a few times, you will naturally understand one thing:

**The real value of performance troubleshooting lies in whether you can turn a chaotic scene into a workable pipeline.** If you cannot, more tools only mean more mess. If you can, you get closer and closer to the root cause of many problems without ever going through all the charts.

When I look at performance issues now, what I care most about is whether we can get the team to agree on the following as soon as possible:

We are all investigating the same problem, and we already know which part of the pipeline it mainly falls on.

Once that is established, the subsequent fix is usually not that confusing. The hard part is always converging the problem into a shape you can act on.
