diff --git a/lectures/459.bib b/lectures/459.bib index face0ae6..166ccb69 100644 --- a/lectures/459.bib +++ b/lectures/459.bib @@ -1385,7 +1385,7 @@ @misc{xzlib @INPROCEEDINGS{bottlenecks-android, author={Linares-Vásquez, Mario and Vendome, Christopher and Luo, Qi and Poshyvanyk, Denys}, booktitle={2015 IEEE International Conference on Software Maintenance and Evolution (ICSME)}, - title={How developers detect and fix performance bottlenecks in Android apps}, + title={How developers detect and fix performance bottlenecks in {Android} apps}, year={2015}, volume={}, number={}, diff --git a/lectures/L26-slides.tex b/lectures/L26-slides.tex index 218b58a8..3c8e53a4 100644 --- a/lectures/L26-slides.tex +++ b/lectures/L26-slides.tex @@ -149,7 +149,7 @@ Still, that tells you about right now; what about the long term average? -Checking with my machine ``Loki'', that has since ascended to Valhalla:\\[1em] +Checking with my machine ``Loki'', which has since ascended to Valhalla:\\[1em] {\scriptsize \begin{verbatim} @@ -561,4 +561,4 @@ \end{frame} -\end{document} \ No newline at end of file +\end{document} diff --git a/lectures/L26.tex b/lectures/L26.tex index 19b7b328..2ae7f6b2 100644 --- a/lectures/L26.tex +++ b/lectures/L26.tex @@ -6,9 +6,9 @@ \section*{Characterizing Performance \& Scalability Problems} -Studies show that poor mobile app performance, whether that's by high usage of resources (e.g., battery, memory) or just being slow, is a major complaint that users write about in app stores~\cite{free-apps}. The paper is somewhat older at this point, and while we might like to think that developers are doing a better job of proactively identifying issues, chances are it's still the case that a distressing number of issues are first discovered when someone posts a negative review. \textit{``One star, because it's not possible to give zero!''}. +Studies show that poor mobile app performance, whether due to high usage of resources (e.g., battery, memory) or to just being slow, is a major complaint that users write about in app stores~\cite{free-apps}. The paper is somewhat older at this point, and while we might like to think that developers are doing a better job of proactively identifying issues, chances are it's still the case that a distressing number of issues are first discovered when someone posts a negative review. \textit{``One star, because it's not possible to give zero!''}. -Given that, a user might be able to give a vague idea of what's wrong, or point out a workflow where they're dissatisfied with the performance. Then it's up to the development team to figure out what's the cause of the problem. Most of our next topics in profiling will be around the idea of CPU profiling, which is to say, examining where CPU time is going. That's generally predicated on the assumption that CPU time is limiting factor. But maybe it isn't, and before we go about focusing on how to examine CPU usage, let's check that assumption -- because it could be something else. +A user might be able to give a vague idea of what's wrong, or point out a workflow where they're dissatisfied with the performance. Then it's up to the development team to figure out the cause of the problem. Most of our next topics in profiling will be around the idea of CPU profiling, which is to say, examining where CPU time is going. That's generally predicated on the assumption that CPU time is the limiting factor.
But maybe it isn't, and before we go about focusing on how to examine CPU usage, let's check that assumption---because it could be something else. \begin{quote} \textit{It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.} @@ -16,7 +16,7 @@ \section*{Characterizing Performance \& Scalability Problems} \hfill - Sherlock Holmes (\textit{A Scandal in Bohemia}; Sir Arthur Conan Doyle) Keeping the wisdom of Mr. Holmes in mind, we need to collect evidence before reaching conclusions. At a high level we probably have the following potential culprits to start with: -\begin{enumerate} +\begin{enumerate}[noitemsep] \item CPU \item Memory \item Disk @@ -25,14 +25,14 @@ \section*{Characterizing Performance \& Scalability Problems} \end{enumerate} \paragraph{Caveats.} -The list above is, obviously, categories, but they are starting points for further investigation. They are listed in some numerical order, but there is no reason why one would have to investigate them in the order defined there. We'll go in that order in this topic, just for convenience. +The list above is, obviously, just categories, but they are starting points for further investigation. They are listed in some numerical order, which we'll follow for convenience, but there is no reason why one would have to investigate them in that order. -If we get to the end of a particular investigative avenue, fixing it is a separate issue involving the application of the techniques covered elsewhere in the course. That might be really difficult, and sometimes the cause of the problem is simply that the user's device/hardware is too old (or lacking important hardware for acceleration). Games are likely the most obvious example of situations where the ``minimum'' and ``recommended'' hardware configurations are published and if you don't meet those requirements, you're going to have a bad time (if the game runs at all). That applies to apps too, though dropping older devices from the supported set is more likely a result of the device no longer getting OS/API upgrades rather than being ``too slow''. All of which is to say, at the end of the analysis, sometimes the correct solution is to advise the user to upgrade their hardware. +Even if we identify a culprit, fixing it is a separate issue involving the application of the techniques covered elsewhere in the course. That might be really difficult---sometimes the cause of the problem is simply that the user's device/hardware is too old (or lacking important hardware for acceleration). Games are likely the most obvious example of situations where the ``minimum'' and ``recommended'' hardware configurations are published and if you don't meet those requirements, you're going to have a bad time (if the game runs at all). That applies to apps too, though dropping older devices from the supported set is more likely a result of the device no longer getting OS/API upgrades rather than being ``too slow''. All of which is to say, at the end of the analysis, sometimes the correct solution is to advise the user to upgrade their hardware. Another possible outcome of finding the bottleneck when reported by the user is that it actually isn't a performance problem as much as a programming error. 
For example, the user may report the GUI lagging or the application not responding (up to and including the actual system not-responding dialog), but these scenarios are not the result of low computational power on the device; they are caused instead by doing some slow or computationally-intensive task in the UI thread rather than in the background~\cite{bottlenecks-android}. The fix there is just to, well, fix the bug. -\paragraph{CPU.} CPU is probably the easiest of these to diagnose. Something like \texttt{top} or Task Manager will tell you pretty quickly if the CPU is busy. You can look at the \%CPU columns and see where all your CPU is going. Still, that tells you about right now; what about the long term average? Checking with my machine ``Loki'', that used to donate its free CPU cycles to world community grid (I was singlehandedly saving the world, you see. I mean. I did stop in 2016, and look at what's happened since then.): +\paragraph{CPU.} CPU is probably the easiest of these to diagnose. Something like \texttt{top} or Task Manager will tell you pretty quickly if the CPU is busy. You can look at the \%CPU columns and see where all your CPU is going. Still, that tells you about right now; what about the long term average? Checking with my machine ``Loki'', which used to donate its free CPU cycles to World Community Grid (I was singlehandedly saving the world, you see. I mean. I did stop in 2016, and look at what's happened since then): \begin{verbatim} top - 07:28:19 up 151 days, 23:38, 8 users, load average: 0.87, 0.92, 0.91 @@ -43,7 +43,7 @@ \section*{Characterizing Performance \& Scalability Problems} Picture a single core of a CPU as a lane of traffic. You are a bridge operator and so you need to monitor how many cars are waiting to cross that bridge. If no cars are waiting, traffic is good and drivers are happy. If there is a backup of cars, then there will be delays. Our numbering scheme corresponds to this: \begin{enumerate} - \item 0.00 means no traffic (and in fact anything between 0.00 and 0.99) means we're under capacity and there will be no delay. + \item 0.00 means no traffic; in fact, anything between 0.00 and 0.99 means we're under capacity and there will be no delay. \item 1.00 means we are exactly at capacity. Everything is okay, but if one more car shows up, there will be a delay. \item Anything above 1.00 means there's a backup (delay). If we have 2.00 load, then the bridge is full and there's an equal number of cars waiting to get on the bridge. \end{enumerate} @@ -54,9 +54,9 @@ \section*{Characterizing Performance \& Scalability Problems} \includegraphics[width=0.55\textwidth]{images/car-analogy.png} \end{center} -Being at or above 1.00 isn't necessarily bad, but you should be concerned if there is consistent load of 1.00 or above. And if you are below 1.00 but getting close to it, you know how much room you have to scale things up -- if load is 0.4 you can increase handily. If load is 0.9 you're pushing the limit already. If load is above 0.70 then it's probably time to investigate. If it's at 1.00 consistently we have a serious problem. If it's up to 5.00 then this is a red alert situation. +Being at or above 1.00 isn't necessarily bad, but you should be concerned if there is consistent load of 1.00 or above. And if you are below 1.00 but getting close to it, you know how much room you have to scale things up---if load is 0.4 you can increase handily. If load is 0.9 you're pushing the limit already.
If load is above 0.70 then it's probably time to investigate. If it's at 1.00 consistently we have a serious problem. If it's up to 5.00 then this is a red alert situation. -Now this is for a single CPU -- if you have a load of 3.00 and a quad core CPU, this is okay. You have, in the traffic analogy, four lanes of traffic, of which 3 are being used to capacity. So we have a fourth lane free and it's as if we're at 75\% utilization on a single CPU. +Now this is for a single CPU---if you have a load of 3.00 and a quad core CPU, this is okay. You have, in the traffic analogy, four lanes of traffic, of which 3 are being used to capacity. So we have a fourth lane free and it's as if we're at 75\% utilization on a single CPU. \paragraph{Memory and Disk.} Next on the list is memory. If you are using some garbage-collected language or framework, you will find lots of runs of the garbage collector, or at least very long ones when it does run; in the worst case scenario you'll see your application run out of memory and either crash or recover from it~\cite{bottlenecks-android}. @@ -68,7 +68,7 @@ \section*{Characterizing Performance \& Scalability Problems} KiB Swap: 8378364 total, 1313972 used, 7064392 free. 2084336 cached Mem \end{verbatim} -This can be misleading though, because memory being ``full'' does not necessarily mean anything bad. It means the resource is being used to its maximum potential, yes, but there is no benefit to keeping a block of memory open for no reason. Things will move into and out of memory as they need to, and nobody hands out medals to indicate that you did an awesome job of keeping free memory. It's not like going under budget in your department for the year. Also, memory is not like the CPU; if there's nothing for the CPU to do, it will just idle (or go to a low power state, which is nice for saving the planet). But memory won't ``forget'' data if it doesn't happen to be needed right now - data will hang around in memory until there is a reason to move or change it. So freaking out about memory appearing as full is kind of like getting all in a knot about how ``System Idle Process'' is hammering the CPU\footnote{Yes, a tech journalist named John Dvorak really wrote an article about this, and here I am roasting him about it decades later because it's just so ridiculous a theory that I can't help it.}. +This can be misleading, because memory being ``full'' does not necessarily mean anything bad. It means the resource is being used to its maximum potential, yes, but there is no benefit to keeping a block of memory open for no reason. Things will move into and out of memory as they need to, and nobody hands out medals to indicate that you did an awesome job of keeping free memory. It's not like going under budget in your department for the year. Also, memory is not like the CPU; if there's nothing for the CPU to do, it will just idle (or go to a low power state, which is nice for saving the planet). But memory won't ``forget'' data if it doesn't happen to be needed right now - data will hang around in memory until there is a reason to move or change it. So freaking out about memory appearing as full is kind of like getting all in a knot about how ``System Idle Process'' is hammering the CPU\footnote{Yes, a tech journalist named John Dvorak really wrote an article about this, and here I am roasting him about it decades later because it's just so ridiculous a theory that I can't help it.}. 
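If you want a quick way to see how much of that ``full'' memory is really just cache, something like \texttt{free} separates the reclaimable part out for you. A minimal sketch (the exact column names vary a bit between versions of the tool):

\begin{verbatim}
# "buff/cache" is memory the kernel will hand back when programs need it;
# "available" estimates how much can be allocated before swapping starts
free -h

# refresh the same numbers every 5 seconds to watch the trend
watch -n 5 free -h
\end{verbatim}

If ``available'' stays healthy while ``free'' looks tiny, memory is just being used as cache and there's nothing to panic about.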
You can also ask about page faults, with the command \texttt{ps -eo min\_flt,maj\_flt,cmd} which will give you the major page faults (had to fetch from disk) and minor page faults (had to copy a page from another process). The output of this is too big even for the notes, but try it yourself (or I might be able to do a demo of it in class). But this is lifetime and you could have a trillion page faults at the beginning of your program and then after that everything is fine. What you really want is to ask Linux for a report on swapping: @@ -85,6 +85,8 @@ \section*{Characterizing Performance \& Scalability Problems} In particular, the columns ``si'' (swap in) and ``so'' (swap out) are the ones to pay attention to. In the above example, they are all zero. That is excellent and tends to indicate that we are not swapping to disk and that's not the performance limiting factor. Sometimes we don't get that situation. A little bit of swapping may be inevitable, but if we have lots of swapping, we have a very big problem. Here's a not-so-nice example, from~\cite{vmstat}: +\vspace*{-1em} +{\scriptsize \begin{verbatim} procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id @@ -93,11 +95,14 @@ \section*{Characterizing Performance \& Scalability Problems} 1 0 0 13856 1640 1308 18524 64 516 379 129 4341 646 24 34 42 3 0 0 13856 1084 1308 18316 56 64 14 0 320 1022 84 9 8 \end{verbatim} +} +\vspace*{-1em} If we're not doing significant swapping, then memory isn't holding us back, so we can conclude it is not the limiting factor in scaling the application up. On to disk. Looking at disk might seem slightly redundant if memory is not the limiting factor. After all, if the data were in memory it would be unnecessary to go to disk in the first place. Still, sometimes we can take a look at the disk and see if that is our bottleneck. +\vspace*{-1em} {\scriptsize \begin{verbatim} jz@Loki:~$ iostat -dx /dev/sda 5 @@ -107,24 +112,30 @@ \section*{Characterizing Performance \& Scalability Problems} sda 0.24 2.78 0.45 2.40 11.60 154.98 116.91 0.17 61.07 11.57 70.27 4.70 1.34 \end{verbatim} } +\vspace*{-1em} -It's that last column, \%util that tells us what we want to know. The device bandwidth here is barely being used at all. If you saw it up at 100\% then you would know that the disk was being maxed out and that would be a pretty obvious indicator that it is the limiting factor. This does not tell you much about what is using the CPU, of course, and you can look at what processes are using the I/O subsystems with \texttt{iotop} which requires root privileges\footnote{https://xkcd.com/149/}. +It's that last column, \%util, that tells us what we want to know. The device bandwidth here is barely being used at all. If you saw it up at 100\% then you would know that the disk was being maxed out and that would be a pretty obvious indicator that it is the limiting factor. This does not tell you much about what is using the CPU, of course, and you can look at what processes are using the I/O subsystems with \texttt{iotop} (requires root privileges\footnote{\url{https://xkcd.com/149/}}). -\paragraph{Network.} That leaves us with networks. We can ask about the network with \texttt{nload}: which gives the current, average, min, max, and total values. And you get a nice little graph if there is anything to see. It's not so much fun if nothing is happening. But you'll get the summary, at least: +\paragraph{Network.} That leaves us with networks. 
We can ask about the network with \texttt{nload}, which gives the current, average, min, max, and total values. And you get a nice little graph if there is anything to see. It's not so much fun if nothing is happening. But you'll get the summary, at least: +\vspace*{-1em} +{\scriptsize \begin{verbatim} Curr: 3.32 kBit/s Avg: 2.95 kBit/s Min: 1.02 kBit/s Max: 12.60 kBit/s -Ttl: 39.76 GByte \end{verbatim} +Ttl: 39.76 GByte +\end{verbatim} +} \vspace*{-1em} -So if you saw here, for example, that data was leaving at 100 Megabits per second you'd have a pretty good idea that was the limitation, but you may still be network limited at lower speeds. Intermediary devices or other non-optimal hardware can get in the way. For example, it's possible to run a wired network over power lines -- this is not optimal, but it's effective in older buildings where it would be difficult or expensive to run network cables through the walls. This will limit your speed based on the condition of the wires in the wall, alongside any other devices on the same circuit adding noise to the signal. So what should be 1000 or 100 MBit might actually only be more like 32 Mbit. Wireless networks have the same problem, being affected by walls, floors, electromagnetic interference, humidity... +So if you saw here, for example, that data was leaving at 100 Megabits per second you'd have a pretty good idea that was the limitation, but you may still be network limited at lower speeds. Intermediary devices or other non-optimal hardware can get in the way. For example, it's possible to run a wired network over power lines---this is not optimal, but it's effective in older buildings where it would be difficult or expensive to run network cables through the walls. This will limit your speed based on the condition of the wires in the wall, alongside any other devices on the same circuit adding noise to the signal. So what should be 1000 or 100 MBit might actually only be more like 32 Mbit. Wireless networks have the same problem, being affected by walls, floors, electromagnetic interference, humidity... Testing network speed can be done using tools like speedtest.net, which gives some indication of what the network connection speed is like from a given device (upload and download). You may need to test multiple times to get a realistic picture of speed. The test validates the speed of connection to the upstream system (e.g., some server operated by the speed test service), but not necessarily to your own data centre. Both locations can have good network connections to the outside world, but might not be easily able to talk to each other well. In my [JZ] experience, I've seen network speed issues for people working in Hong Kong using an application that has the backend deployed in a Frankfurt (Germany) data centre. The problem here is not bandwidth but latency. Tools like speedtest can tell you the latency of communication alongside the bandwidth. If you want to get an idea of the path and the latency to a particular remote system, you can use the \texttt{traceroute} tool. 
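A minimal invocation, assuming a Linux machine and with the destination host chosen purely as an illustration, looks like this (the Catchpoint example below is from Windows, where the equivalent command is \texttt{tracert}):

\begin{verbatim}
# -n skips the reverse-DNS lookup for each hop, which speeds things up;
# the destination is just a placeholder; use the host you care about
traceroute -n example.com
\end{verbatim}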
Here is an example from Catchpoint \url{https://www.catchpoint.com/network-admin-guide/how-to-read-a-traceroute}, which also provides some guidance on how to interpret a traceroute result: +\vspace*{-1em} {\scriptsize \begin{verbatim} Microsoft Windows [Version 10.0.19043.1288] @@ -152,8 +163,9 @@ \section*{Characterizing Performance \& Scalability Problems} Trace complete \end{verbatim} } +\vspace*{-1em} -Remember that communication latency can never be truly eliminated, because it takes non-zero time for packets to get processed through network hardware (e.g., switches), and, more importantly, the speed of light. I looked up some data on \url{https://wondernetwork.com/pings/New\%20York} -- use with a grain of salt -- that says the ping from New York to Lyon is 73.21ms which is something like 83.79\% of the speed of light in fibre-optic cable (as of August 2024). Even if we got it up to 100\% of the speed of light in fibre-optic cable, or used some other material that had a higher speed of light in it, it can't ever get down to nothing. +Remember that communication latency can never be truly eliminated, because it takes non-zero time for packets to get processed through network hardware (e.g., switches), and, more importantly, the speed of light. I looked up some data on \url{https://wondernetwork.com/pings/New\%20York}---use with a grain of salt---that says the ping from New York to Lyon is 73.21ms which is something like 83.79\% of the speed of light in fibre-optic cable (as of August 2024). Even if we got it up to 100\% of the speed of light in fibre-optic cable, or used some other material that had a higher speed of light in it, it can't ever get down to nothing. One more thing that can cut into your effective bandwidth is packet loss: data getting dropped or corrupted en route. That requires some device along the line to identify that the packets are not as they should be, re-request the needed packets, and wait for them to arrive. Packet loss may be environmental, but it also might mean a device needs replacing. @@ -162,11 +174,11 @@ \section*{Characterizing Performance \& Scalability Problems} We'll exclude the discussion of detecting deadlock, because we'll say that deadlock is a correctness problem more than a performance problem. In any case, a previous course (ECE 252, SE 350, MTE 241) very likely covered the idea of deadlock and how to avoid it. The Helgrind tool (Valgrind suite) is a good way to identify things like lock ordering problems that cause a deadlock. Onwards then. -Unexpectedly low CPU usage, that's not explained by I/O-waits, may be a good indicator of lock contention. If that's the case, when CPU usage is low we would see many threads are blocked. +Unexpectedly low CPU usage, not explained by I/O-waits, may be a good indicator of lock contention. If that's the case, when CPU usage is low we would see many threads are blocked. Unlike some of the other things we've discussed, there's no magical \texttt{locktrace} tool that would tell us about critical section locks, and the POSIX pthread library does not have any locks tracing in its specification~\cite{usd}. One possibility is to introduce some logging or tracing ourselves, e.g., recording in the log that we want to enter a critical section $A$ and then another entry once we're in it and a third entry when we leave $A$. That's not great, but it is something! -I did some reading about \texttt{perf lock} but the problem is, as above, that it doesn't really find user-space lock contention. 
You can ask tools to tell you about thread switches but that's not quite the same. Other commercial tools like the Intel VTune claim that they can find these sorts of problems. But those cost money and may be CPU-vendor-specific. +I did some reading about \texttt{perf lock} but the problem is, as above, that it doesn't really find user-space lock contention. Tools can tell you about thread switches but that's not quite the same. Other commercial tools like Intel VTune claim that they can find these sorts of problems. But those cost money and may be CPU-vendor-specific.
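A rough, hedged way to at least get a hint from outside the program: on Linux, an uncontended pthread mutex is taken entirely in user space, so a process whose threads spend a lot of time in \texttt{futex} system calls is often one whose threads are fighting over locks. The syscall summary from \texttt{strace} can show that; this is only a sketch (the PID is a placeholder, attaching needs appropriate privileges, and \texttt{strace} adds enough overhead that it's for diagnosis rather than production use):

\begin{verbatim}
# -c summarizes syscall counts and total time, -f follows all threads;
# a large share of time spent in futex hints at contended locks
# (12345 is a placeholder PID; press Ctrl-C to get the summary)
strace -c -f -p 12345
\end{verbatim}

\subsection*{But it's probably CPU...}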