From 979dba6d7f4f8a9f1381a75e6cf6b033a363fe91 Mon Sep 17 00:00:00 2001 From: Patrick Lam Date: Mon, 16 Sep 2024 17:55:48 +1200 Subject: [PATCH] L25 --- lectures/L25-slides.tex | 20 ++++++++++---------- lectures/L25.tex | 40 ++++++++++++++++++++-------------------- 2 files changed, 30 insertions(+), 30 deletions(-) diff --git a/lectures/L25-slides.tex b/lectures/L25-slides.tex index 8f62f8f9..49597666 100644 --- a/lectures/L25-slides.tex +++ b/lectures/L25-slides.tex @@ -51,7 +51,7 @@ Why are we doing this? \begin{itemize} - \item A new system is being developed -- limits \& risks + \item A new system is being developed---limits \& risks \item Workload increase; can we handle it? \item A sudden spike in workload is expected \item Uptime requirements: 99.99\% @@ -81,7 +81,7 @@ \begin{frame} \frametitle{Stress Test?} -Load testing is not the same as \alert{stress testing}. +Load testing is not the same as \alert{stress testing} (to find the ultimate breaking strain\ldots). \begin{center} \includegraphics[width=0.5\textwidth]{images/barrescue.jpg} @@ -94,7 +94,7 @@ \begin{frame} \frametitle{Making the Plan} -Let's make a test plan! Need answers to textit{who, what, where, when, \& why}. +Let's make a test plan! Need answers to \textit{who, what, where, when, \& why}. We already covered why and that tells us about what. @@ -203,7 +203,7 @@ Scalability testing $\neq$ QA testing. -Dev + QA on local computer.\\ +Dev + QA on local computer (or CI infrastructure).\\ \quad Your concerns: Is it right? Is it fast enough? Fine, but it's no way to test if it scales. @@ -288,7 +288,7 @@ \begin{frame} -\frametitle{Endurance Tests -- How Long?} +\frametitle{Endurance Tests---How Long?} You're familiar with the idea that endurance is different from peak workload. @@ -309,7 +309,7 @@ \item 4 hours (40 km)? No. \end{itemize} -But if we just observed my running for 15 minutes, we might conclude I could run at 10 km/h indefinitely, but that's not true. 
+If we just observed my running for 15 minutes, we might conclude I could run at 10 km/h indefinitely. But that's not true. \end{frame} @@ -344,7 +344,7 @@ Same for filling up disk, exhausting file handles, log length. -Another example: holiday code freeze as unintentional endurance test. +Another example: holiday code freeze as unplanned surprise endurance test. \end{frame} @@ -427,7 +427,7 @@ There is a point at which I can make no more improvements in my running. -Eliud Kipchoge can run a marathon in 2:01:09.\\ +Eliud Kipchoge ran a marathon in 2:01:09.\\ \begin{center} \includegraphics[width=0.5\textwidth]{images/kipchoge.jpg} \end{center} @@ -444,7 +444,7 @@ Alternatively: change expectations!\\ \quad Example: What if we don't bill all customers on the 1st of the month? -Think outside the constraints as given in the initial problem statement. +Think outside the constraints---i.e. those given in the initial problem statement. \end{frame} @@ -465,4 +465,4 @@ \end{frame} -\end{document} \ No newline at end of file +\end{document} diff --git a/lectures/L25.tex b/lectures/L25.tex index 60f0beb3..aea0cf0b 100644 --- a/lectures/L25.tex +++ b/lectures/L25.tex @@ -6,12 +6,12 @@ \section*{Load Testing} -We've had a long look at the subject of observation -- identifying areas in the execution of our program that are a potential issue. Now, we will also take some time to relate performance to scalability. Very early on in the course we mentioned the idea that what we want in scalability is to take our software from 1 user to 100 to 10 million. To scale up, we probably need to do some profiling to find out what's slow and make a decision about what to change. It's also useful in terms of making an estimate of our maximum number of users or transactions could be. +We've had a long look at the subject of observation---identifying areas in the execution of our program that are a potential issue. Now, we will also take some time to relate performance to scalability. 
Very early on in the course we mentioned the idea that what we want in scalability is to take our software from 1 user to 100 to 10 million. To scale up, we probably need to do some profiling to find out what's slow and make a decision about what to change. It's also useful for estimating what our maximum number of users or transactions could be.

-That's not a hypothetical scenario either; I [JZ] was asked by a C-Level (company executive) whether a particular system could handle 10$\times$ as many users -- and that answer had to be supported with some numbers. How would I make that determination? Analysis and testing, of course.
+That's not a hypothetical scenario either; I [JZ] was asked by a C-Level (company executive) whether a particular system could handle 10$\times$ as many users---and that answer had to be supported with some numbers. How would I make that determination? Analysis and testing, of course.

\paragraph{Start with why.} The most important question when we want to start doing some load testing is: why are we doing this? A few possible answers~\cite{hitchhiking}:
-\begin{itemize}
+\begin{itemize}[noitemsep]
\item A new system is being developed, and management wants to understand the limits and the risks.
\item The workload is expected to increase, and we (the CFO) want(s) to be sure the system can handle it.
\item A sudden spike in workload is expected, like tax season, and we want to know we can handle it.
@@ -19,11 +19,11 @@ \section*{Load Testing}
\item The project plan has a checkbox ``performance testing'', even though nobody has a real idea of what it means.
\end{itemize}

-Leaving aside the last ``reason'', each of the actual reasons why implies a different kind of testing is required. If it's a new system, maybe we just need to understand the average workload (plus, perhaps, a buffer for safety) and make a plan for that. 
If the workload is expected to increase, like having ten times the number of users, we just need to establish our bottlenecks as in the next topic to identify what we think the limit is. If it's the spike situation, we need to find a way to generate that spike behaviour to see how the system responds, and the hardest part might actually be in how to create the test. If our reason is uptime requirements, then in addition to putting a lot of load on the system, we have to maintain it for an extended period to make sure there's no performance degradation -- a test of endurance. +Leaving aside the last ``reason'', each of the actual reasons why implies a different kind of testing is required. If it's a new system, maybe we just need to understand the average workload (plus, perhaps, a buffer for safety) and make a plan for that. If the workload is expected to increase, like having ten times the number of users, we just need to establish our bottlenecks as in the next topic to identify what we think the limit is. If it's the spike situation, we need to find a way to generate that spike behaviour to see how the system responds, and the hardest part might actually be in how to create the test. If our reason is uptime requirements, then in addition to putting a lot of load on the system, we have to maintain it for an extended period to make sure there's no performance degradation---a test of endurance. -\paragraph{Stress Tests?} Load testing is not the same thing as \textit{stress testing}. Load testing involves having a specific target and the goal is to demonstrate that the application can handle that amount. Stress testing is about turning up the pressure (load) until things break, or at least stop working well. We are not going to go into stress testing in this topic, though it should be possible to take the load testing lessons and repurpose them by simply turning the setting up to even-bigger numbers. 
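To make the spike case concrete, here is a minimal sketch (Python; illustrative only, every name here is invented rather than taken from the notes) of the kind of load schedule a spike test might drive: a steady baseline with a sudden burst, the shape a tax-season scenario would need.

```python
def spike_profile(duration_s, base_rps, spike_rps, spike_start, spike_len):
    """Return a requests-per-second target for each second of the test:
    steady baseline traffic with a sudden spike in the middle."""
    return [
        spike_rps if spike_start <= t < spike_start + spike_len else base_rps
        for t in range(duration_s)
    ]

# One minute of load: 100 req/s baseline, with a 1000 req/s spike for 10 s.
profile = spike_profile(60, base_rps=100, spike_rps=1000, spike_start=20, spike_len=10)
```

A load generator would then replay this schedule against the system under test and record per-second response times; the hard part, as noted above, is making the generated spike realistic.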
+\paragraph{Stress Tests?} Load testing is not the same thing as \textit{stress testing}. Load testing involves having a specific target and the goal is to demonstrate that the application can handle that amount. Stress testing is about turning up the pressure (load) until things break\footnote{The ``ultimate breaking strain'', as some say, or the ultimate tensile strength, in more modern language.}, or at least stop working well. We are not going to go into stress testing in this topic, though it should be possible to take the load testing lessons and repurpose them by simply turning the setting up to even-bigger numbers. -\subsection*{Plans Are Useless, Planning Is Essential} The previously-cited book suggests making a plan that answers the questions of \textit{who, what, where, when, \& why}. We just covered the ``why'' question, and understood how it points us towards the answer of ``what'' -- what kind of load testing are we intending to do here. That leaves ``who'' and ``when''. In a company (or FYDP) situation, there might be reason for debate or discussion around who should do the tests and when they should take place. For our purposes the answers are: we are going to do them, and we're going to do them now. +\subsection*{Plans Are Useless, Planning Is Essential} The previously-cited book suggests making a plan that answers the questions of \textit{who, what, where, when, \& why}. We just covered the ``why'' question, and understood how it points us towards the answer of ``what''---what kind of load testing are we intending to do here. That leaves ``who'' and ``when''. In a company (or FYDP) situation, there might be reason for debate or discussion around who should do the tests and when they should take place. For our purposes the answers are: we are going to do them, and we're going to do them now. How detailed the plan needs to be is going to depend on your organization, and the same applies for how many sign-offs (if any!) you need on the plan. 
Some companies have a high need for justification of the time invested in this, after all, because time is money (developer salary) and there are opportunity costs (what work are we not doing in favour of this?). There's lots of literature out there about how to justify technical investments to senior management, but let's not get sidetracked here.

@@ -31,7 +31,7 @@ \subsection*{Plans Are Useless, Planning Is Essential} The previously-cited book

\subsection*{What Workflows to Test?}

-While we might have a clear direction about the kind of test we want from why, the question still remains about what workflows are going to be tested. We cannot test everything -- aiming for 100\% coverage is unnecessary, but load testing should be reserved for only where it's truly needed because of its high cost and effort requirements.
+While we might have a clear direction about the kind of test we want from why, the question still remains about what workflows are going to be tested. We cannot test everything---aiming for 100\% coverage is unnecessary, and load testing should be reserved for where it's truly needed because of its high cost and effort requirements.

If we already know which ones are slow (rate-limiting) or on the critical path, then we can certainly start there. Ideally, the observability/monitoring (you have it, right?) gives us the guidance needed here. If monitoring doesn't exist, that might need to get addressed first.

@@ -39,7 +39,7 @@ \subsection*{What Workflows to Test?}

If your current utilization is low, you might not know what the rate-limiting steps are at a glance. You can take a guess, but be prepared to revise those guesses partway through the process as you ramp up the load. Actually, you might need to do that even if utilization isn't low; you may find new things along the way that turn out to be the real limiting factors. 
-In the event of the uptime requirements, the tests likely look the same as the increased-workload situation -- we just run them for longer. Endurance tests have significant overlap with load tests, but are not exactly the same. We'll come back to figuring out how long an endurance test should be shortly. +In the event of the uptime requirements, the tests likely look the same as the increased-workload situation---we just run them for longer. Endurance tests have significant overlap with load tests, but are not exactly the same. We'll come back to figuring out how long an endurance test should be shortly. \subsection*{How to Test Them} @@ -50,10 +50,10 @@ \subsection*{How to Test Them} \paragraph{Hardware Principle.} -Scalability testing is very different from QA testing (you test your code, right?) in that you will do development and QA on your local computer and all you really care about is whether the program produces the correct output ``fast enough''. That's fine, but it's no way to test if it scales. If you actually want to test for scalability and real world performance, you should be doing it on the machines that are going to run the program in the live/production environment. Why? Well, low-end systems have very different limiting factors. You might be limited by the 16GB of RAM in your laptop and that would go away in the 64GB of RAM server you're using. So you might spend a great deal of time worrying about RAM usage when it turns out it doesn't matter. +Scalability testing is very different from QA testing (you test your code, right?) in that you will do development and QA on your local computer (or CI infrastructure) and all you really care about is whether the program produces the correct output ``fast enough''. That's fine, but it's no way to test scalability. If you actually want to test for scalability and real world performance, you have to do it on the machines that will run the program in the live/production environment. Why? 
Well, low-end systems have very different limiting factors. You might be limited by the 16GB of RAM in your laptop---but the server might have 128GB of RAM. So you might spend a great deal of time worrying about RAM usage when it turns out it doesn't matter. \paragraph{Reality Principle.} -It would be a good idea to use a ``real'' workload, as much as one can possibly simulate. Legal reasons might prevent you from using actual customer data, but you should be trying to use the best approximation of it that you have. +It would be a good idea to use a ``real'' workload, as much as one can simulate. Legal reasons might prevent you from using actual customer data, but you should use the best approximation of it that you have. If you only generate some test data, it might not be representative of the actual data: you might say 1\% of your customers are on the annual subscription but if it's really 10\% that might make a difference in how long it takes to run an analysis. On the other hand, your test data might be much larger than in production because your tests create (and don't delete) entities in the test system database every time you run. @@ -73,24 +73,24 @@ \subsection*{How to Test Them} \subsection*{Endurance Tests: How Long?} -Chances are you're familiar with the idea of endurance being different from peak workload. Can I [JZ] run at 10 km/h for 1 minute? Sure -- I can run much faster than that for 1 minute! Can I run 30 minutes (5 km) at 10 km/h? Yes. 60 minutes (10 km)? Yes, but with difficulty. Four hours (40 km)? No, I'll get tired and slow down, stop, or maybe hurt myself. Okay, so we've found a limit here -- I can endure an hour at this workload but things break down after that. 
If we only studied my running for a time period that was less than half an hour, we might conclude that I could run at 10 km/h forever (or at least indefinitely), even though that's absolutely not true\footnote{That's a common fallacy in the media, too -- you may notice, if you're looking for it, a plethora of journalists writing opinion pieces that say ``the current situation is the new normal and will go on forever'' -- even though that's almost certainly not true. Things change all the time -- political parties that win an election are often voted out again after a few years; rough job markets improve and good ones get tighter; high interest rates come down or low rates increase, etc. New and unprecedented things happen a lot, and change is constant.}. The problem is that if our sample period is 15 minutes, that is not long enough to reflect the cumulative negative effects that contribute to my eventual slowing down and being forced to stop. +Chances are you're familiar with the idea of endurance being different from peak workload. Can I [JZ] run at 10 km/h for 1 minute? Sure---I can run much faster than that for 1 minute! Can I run 30 minutes (5 km) at 10 km/h? Yes. 60 minutes (10 km)? Yes, but with difficulty. Four hours (40 km)? No, I'll get tired and slow down, stop, or maybe hurt myself. Okay, so we've found a limit here---I can endure an hour at this workload but things break down after that. If we only studied my running for less than half an hour, we might conclude that I could run at 10 km/h forever (or at least indefinitely), even though that's absolutely not true\footnote{That's a common fallacy in the media, too---you may notice, if you're looking for it, a plethora of journalists writing opinion pieces that say ``the current situation is the new normal and will go on forever''---even though that's almost certainly not true. 
Things change all the time---political parties that win an election are often voted out again after a few years; rough job markets improve and good ones get tighter; high interest rates come down or low rates increase, etc. New and unprecedented things happen a lot, and change is constant.}. But if our sample period is 15 minutes, that is not long enough to reflect the cumulative negative effects that slow and stop me eventually. Is this analogy suitable for software, though? CPUs don't get ``tired'', nor do their parts accumulate fatigue at anywhere near the same rate as a runner's muscles. Yes, it's still valid! A process that has a data hoarding problem is slowly accumulating memory usage and when its memory space is exhausted, we might encounter a hard stop (e.g., encountering the error \texttt{java.lang.OutOfMemoryError: Java heap space}) or just a degradation of performance based on the increasing amount of swapping memory to disk that is required. Same for filling up disk, exhausting file handles, log length, accumulated errors, whatever it is. That's a little bit like fatigue for the executing application, whether it's building in the program or its environment: it builds over time, and eventually the effects of it force a slowdown or stoppage of execution. -Accumulated ``fatigue'' for an application is not just a hypothetical, either. I [JZ] have personally seen services that encountered a problem due to running out of some internal resources as a result of a code freeze over the holiday season. That wasn't a load test in the sense of applying a higher load to validate execution rate -- it was just an endurance test, even though it was not planned as one. The solution to getting the error rate down involved restarting the instances one-by-one and things got back on track. This example calls back what I said about how endurance tests are not exactly the same as load tests, in that we can have an endurance test with low load and it's still valid. 
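The resource-accumulation idea is easy to see in a toy model (illustrative only, not drawn from any real service): a cache that grows on every request and is never evicted, so a short test window looks healthy while a long, endurance-length one reveals the trend.

```python
def run_service(num_requests, leaked_per_request=1):
    """Simulate a service with a data-hoarding bug: each request leaves
    `leaked_per_request` objects behind in a cache that is never cleared.
    Returns memory-usage samples (cache size) taken every 100 requests."""
    cache = []
    samples = []
    for i in range(num_requests):
        cache.extend([object()] * leaked_per_request)  # never evicted
        if i % 100 == 0:
            samples.append(len(cache))
    return samples

short_run = run_service(500)     # a "15-minute" observation window
long_run = run_service(50_000)   # an endurance-length observation window
```

In the short run the growth is small enough to dismiss as noise; the long run makes the monotonic climb (and the eventual out-of-memory stop) obvious.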
+Accumulated ``fatigue'' for an application is not just a hypothetical, either. I [JZ] have personally seen services that encountered a problem due to running out of internal resources as a result of a holiday-season code freeze. That wasn't a load test in the sense of applying a higher load to validate execution rate---it was just an (unplanned surprise) endurance test. The solution to getting the error rate down involved restarting the instances one-by-one and things got back on track. This example calls back to what I said about how endurance tests are not exactly the same as load tests, in that we can have an endurance test with low load, yet it may still be valid.

-With that said, how do we identify what is the relevant period for the endurance test -- or in the running analogy, how do I know if 30 minutes is the right running length to get a real idea? Should it be 3 hours? Once again, our first guide might be the product requirements for what we're building (testing) might give an idea. If it's an e-commerce platform then the endurance test might be something like five days to cover the period of US Thanksgiving, Black Friday, the weekend, and then Cyber Monday.
+With that said, how do we identify what is the relevant period for the endurance test---or in the running analogy, how do I know if 30 minutes is the right running length to get a real idea? Should it be 3 hours? Once again, the product requirements for what we're building (testing) might give an idea. If it's an e-commerce platform then the endurance test might be something like five days to cover the period of US Thanksgiving, Black Friday, the weekend, and then Cyber Monday.

-Other ideas for choosing endurance targets might consider the maintenance windows for the platform. 
Suppose you have a scheduled maintenance window that takes place on Sundays from 02:00 -- 03:00 and there can be downtime during that period, according to the contracts (service level agreements) the company has signed. In that case, you want to validate that the system will work correctly and consistently long enough that it can be restarted (or updated) only during that maintenance window. +Other ideas for choosing endurance targets might consider the maintenance windows for the platform. Suppose you have a scheduled maintenance window that takes place on Sundays from 02:00--03:00 and there can be downtime during that period, according to the contracts (service level agreements) the company has signed. In that case, you want to validate that the system will work correctly and consistently long enough that it can be restarted (or updated) only during that maintenance window. Unfortunately, there are no universal rules we can offer that give you the exact length of time to evaluate. You'll have to consider the requirements, the likely scenarios, and use your judgement. \subsection*{How to Evaluate Success} There are two kinds of answers that load testing can give you (and they're related). The first is "Yes or no, can the system handle a load of $X$?"; the second is "What is the maximum load $Y$ our current system can handle?". If we know $Y$ then we can easily answer whether $X$ is higher or lower. Between the two of them, this suggests we might prefer to find $Y$ rather than answer the first question. But it might be hard to find the maximum limit. The difficulty of generating test data or load might increase much faster than the rate of load added to the system, and we might be crossing over into stress testing. Sometimes answering the first question is all that is necessary. -The value of $Y$ may have some nuance above. 
The maximum rate we can handle may imply a hard stop -- if the load exceeds $Y$ then we crash, run out of memory, reject requests, or something else. It may also be a degradation of service: this is the point at which performance degrades below our target or minimum. +The value of $Y$ may have some nuance above. The maximum rate we can handle may imply a hard stop---if the load exceeds $Y$ then we crash, run out of memory, reject requests, or something else. It may also be a degradation of service: this is the point at which performance degrades below our target or minimum. -Observability has come up previously and it might have been present to help decide what's important. We might also need to add some monitoring or logging that tracks when events start and end, so we can gather that data needed to make the overall evaluation. Examples that we are looking for are things like the following. Is the total work completed within the total time limit? Did individual items get completed within the item time limit 99\% of the time or more? +Observability has come up previously and it might have helped decide what's important. We might also need to add some monitoring or logging that tracks when events start and end, to gather data needed to make the overall evaluation. Examples that we are looking for are things like the following. Is the total work completed within the total time limit? Did individual items get completed within the item time limit 99\% of the time or more? As you might expect, the raw load test results are not always sufficient to make the call as to whether a test has passed or succeeded. Then there is post-processing to aggregate and analyze the data. Some manual work might also be necessary to correlate the data with other known factors, particularly if it had not been possible to test on separate hardware or separate instances~\cite{hitchhiking}. 
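Checking a criterion like ``99\% of items within the per-item time limit'' is straightforward once the per-item timings have been collected; a minimal sketch (the numbers and the function name are made up for illustration):

```python
def passes_item_slo(latencies_ms, limit_ms, required_fraction=0.99):
    """Did at least `required_fraction` of items finish within the
    per-item time limit?"""
    within = sum(1 for t in latencies_ms if t <= limit_ms)
    return within / len(latencies_ms) >= required_fraction

# 1000 items: 995 finish in 50 ms, 5 stragglers take 5 s.
ok = passes_item_slo([50] * 995 + [5000] * 5, limit_ms=200)
```

The same per-item data also answers the total-time question (sum the durations), which is why collecting start/end events for each item is worth the instrumentation effort.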
At the end of this process, hopefully you can look at the outcomes and conclude whether a given scenario has passed or failed.

@@ -98,18 +98,18 @@ \subsection*{How to Evaluate Success}

\subsection*{So You Failed the Load Test...}

-Like a software program, when it comes to running, I [JZ] can get better -- I just need to make some changes. To get better at endurance running, chances are I mostly just need to increase (slowly) the running distances and practice more and I'll get better. Or as the internet might say, git gud (get good).
+Like a software program, when it comes to running, I [JZ] can get better---I just need to make some changes. To get better at endurance running, chances are I mostly just need to (slowly) increase the running distances and practice more. Or as the internet might say, git gud (get good).

-The idea works for the programs too -- if the load test has failed, we'll have one or more specific scenarios to improve and we can apply techniques from this course (or elsewhere) to make that part better. Then re-run the test and re-evaluate. If all is well, done; otherwise, repeat as necessary until we've passed the scenario.
+The idea works for programs too---if the load test has failed, we'll have one or more specific scenarios to improve and we can apply techniques from this course (or elsewhere) to make that part better. Then re-run the test and re-evaluate. If all is well, done; otherwise, repeat as necessary until we've passed the scenario.

And also like a software program, there is a point at which I can make no more improvements. At the time of writing, Google says that the world record for marathon running is 2:01:09, held by Eliud Kipchoge, or an average speed of 21.02 km/h. There is absolutely no chance I can get to that level, no matter how hard I train. So I am not getting selected for the Canadian Olympic team. 
-If we've reached the limits of what we can do in the software, it might not be the right tool for your needs and a major redesign or replacement is needed. Designing the system with higher load in mind is sometimes possible, though even in situations where it's possible the cost might be prohibitive. Are we stuck? +If we've reached the limits of what we can do in the software, it might not be the right tool for your needs and a major redesign or replacement is needed. Designing the system with higher load in mind is sometimes possible, though even in situations where it's possible, the cost might be prohibitive. Are we stuck? No! You can have different expectations. A four-hour marathon seems achievable for me if I worked on it. If it's unrealistic to bill all your customers on the same day, why not convince the company to let billing be spread across the whole month? Think about the problem outside the constraints that are currently given. \subsection*{Constant Vigilance} -You may have heard the saying ``constant vigilance'' around the topic of defending against the dark arts. Load testing at a given moment in time captures the state of things at that time only. The load testing procedure needs to be repeated regularly to catch performance degradations that would make your software fail to meet its targets or design load. These are rarely intentional, but the tendency of software is to grow in complexity and in functionality, both of which are likely to make it slower. Improved hardware does, over time, offset some of the slowdown of added complexity. However, at the time of writing, this is an era of small, incremental improvements year-to-year, not big leaps and bounds forward \footnote{Considering the previous footnote, it would be thoroughly foolish to now fall into the trap of saying ``the current situation will go on forever'' and say that there will never be revolutionary change in execution hardware that offsets the increases in complexity. 
But that's not where we are right now.}. So, for now, it is still a priority to repeat the load tests often enough to catch major regressions before they become a problem.
+You may have heard the saying ``constant vigilance'' around the topic of defending against the dark arts. Load testing at a given time captures the state of things at that time only. It needs to be repeated regularly to catch performance degradations that would make your software fail to meet its targets or design load. These are rarely intentional, but software grows in complexity and in functionality, both of which are likely to make it slower. Improved hardware does, over time, offset some of the slowdown of added complexity. However, at the time of writing, this is an era of small, incremental improvements year-to-year, not big leaps and bounds forward\footnote{Considering the previous footnote, it would be thoroughly foolish to now fall into the trap of saying ``the current situation will go on forever'' and say that there will never be revolutionary change in execution hardware that offsets the increases in complexity. But that's not where we are right now.}. So, for now, we must still repeat the load tests often enough to catch major regressions before they become a problem.

Real-life example: \url{https://arewefastyet.com} tracks regressions/improvements in Firefox performance.
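A sketch of the kind of check a regularly-scheduled load-test job might run to flag regressions against a stored baseline (the 10\% budget is an invented threshold, not something from the notes or from arewefastyet):

```python
def regressed(baseline_ms, current_ms, tolerance=0.10):
    """Compare mean latencies of two load-test runs; flag a regression if
    the current run is more than `tolerance` slower than the baseline."""
    baseline_mean = sum(baseline_ms) / len(baseline_ms)
    current_mean = sum(current_ms) / len(current_ms)
    return current_mean > baseline_mean * (1 + tolerance)

# Last month's accepted run vs. today's run (made-up numbers).
alert = regressed([100, 102, 98, 101], [125, 130, 128, 127])
```

A real tracker would compare percentiles rather than means and account for run-to-run noise, but the principle is the same: keep the old results around so each new run has something to be judged against.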