Commit eb57a18

emdashes in L24

patricklam committed Sep 16, 2024
1 parent 8ecd212
Showing 2 changed files with 12 additions and 12 deletions.
12 changes: 6 additions & 6 deletions lectures/L24-slides.tex
@@ -122,7 +122,7 @@
\begin{frame}
\frametitle{Logging}

-If we get only one tool -- logging!
+If we get only one tool---logging!

How often to log, and what content?

@@ -214,7 +214,7 @@

CPU trace tool: $20\times$ slowdown.

-Timestamp at function call: $1.2-1.5\times$ slowdown.
+Timestamp at function call: $1.2$--$1.5\times$ slowdown.

Timestamp of system call entry/exit: $< 1\%$

@@ -286,20 +286,20 @@
\begin{frame}
\frametitle{Context to Aggregation}

-After my PR, login time $0.75s$ -- Good or bad?
+After my PR, login time $0.75s$---Good or bad?

Depends on the baseline. Maybe it was $0.5s$?


-Okay, increased -- is that bad?
+Okay, increased---is that bad?

\end{frame}


\begin{frame}
\frametitle{Another Example}

-Request takes on average $1.27s$ -- Good or Bad?
+Request takes on average $1.27s$---Good or Bad?

What if the time limit is 1 second? 10 seconds?

@@ -312,7 +312,7 @@
\frametitle{Averages Misleading}

All of these are 7 requests per second on average:

\vspace*{-10em}
\begin{center}
\includegraphics[width=\textwidth]{images/burst1}\\
\includegraphics[width=\textwidth]{images/burst2}\\
12 changes: 6 additions & 6 deletions lectures/L24.tex
@@ -19,7 +19,7 @@ \section*{Observations Precede Conclusions}

The general idea is, collect some data on what parts of the code are taking up the majority of the time. This can be broken down into looking at what functions get called, or how long functions take, or what's using memory\ldots

-\paragraph{Why Observation?} We're talking here about the idea of observation of our program, which is a little bit more inclusive than just measuring things, because we may observe things that are hard or impossible to quantify. Observing the behaviour of the program will obviously be super helpful in terms of figuring out what -- if anything -- to change. We have several different ways of looking at the situation, including logs, counters, profiling, and traces. Differences between them, briefly~\cite{usd}:
+\paragraph{Why Observation?} We're talking here about the idea of observation of our program, which is a little bit more inclusive than just measuring things, because we may observe things that are hard or impossible to quantify. Observing the behaviour of the program will obviously be super helpful in terms of figuring out what---if anything---to change. We have several different ways of looking at the situation, including logs, counters, profiling, and traces. Differences between them, briefly~\cite{usd}:

\begin{itemize}
\item \textbf{Counters} are, well, a stored count of the occurrences of something: how many times we had a cache miss, how many times we called \texttt{foo()}, how many times a user logged in...
@@ -52,7 +52,7 @@ \subsection*{Tracing: Logging}

As I already said, logging every request might be too noisy to actually find anything in the resulting data. The ability to search and sort helps, but it can still be going by too fast to realistically assess in real-time. Log aggregation services exist and they can help, especially when trying to do a retrospective.

-Typically, adding any relevant attributes is very helpful to identify what has happened or to correlate knowledge. If we can see in the attributes that updates for plans of companies always take ten times longer if the company's address is in Portugal, that's actually a useful observation. We can take that information and put it in the bug ticket and ideally help the developer -- even if it's our future selves -- to find the issue and resolve it.
+Typically, adding any relevant attributes is very helpful to identify what has happened or to correlate knowledge. If we can see in the attributes that updates for plans of companies always take ten times longer if the company's address is in Portugal, that's actually a useful observation. We can take that information and put it in the bug ticket and ideally help the developer---even if it's our future selves---to find the issue and resolve it.
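The attribute idea described in the notes can be sketched with structured (JSON) logging; the event name, attribute keys, and values below are invented for illustration and are not part of the lecture materials:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("plans")

def log_event(logger, event, **attributes):
    """Emit one JSON log line carrying arbitrary searchable attributes."""
    line = json.dumps({"event": event, **attributes})
    logger.info(line)
    return line

# Aggregating later by company_country would surface a pattern like the
# hypothetical "Portugal is ten times slower" observation above.
log_event(log, "plan_update", company_country="PT", duration_ms=4200)
log_event(log, "plan_update", company_country="CA", duration_ms=410)
```

Because each line is self-describing JSON, a log-aggregation service can filter and group on any attribute without needing a custom parser.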

With that said: there's such a thing as too much detail in logging. Logs should always redact personally-identifiable information such as people's names, contact information, medical records, etc. Unless the logs are really kept in secure storage with appropriate access controls, they can be seen by people who shouldn't have that information. Why should, for example, a developer on the website team have access to the banking details of a customer? My opinion: they shouldn't. And it's not just my opinion, but also the opinion of people like the Privacy Commissioner of Canada, whom you probably do not want to anger.

@@ -65,7 +65,7 @@ \subsection*{Tracing: Logging}
Some quick, off-the-cuff figures from~\cite{usd}:
\begin{itemize}
\item If we ask the CPU trace tool to track every conditional branch taken, that would be about a $20\times$ slowdown. This amount of overhead would be acceptable if we're debugging the program on a development machine (own laptop or testing deployment), but certainly not in a production environment.
\item If we ask for a timestamp at every function call and return in our program the slowdown is around $1.2-1.5\times$. This may (emphasis on may) be acceptable in a production environment if the task is not time-critical, but only temporarily while experimenting or observing.
\item If we ask for a timestamp at every function call and return in our program the slowdown is around $1.2$--$1.5\times$. This may (emphasis on may) be acceptable in a production environment if the task is not time-critical, but only temporarily while experimenting or observing.
\item If we ask for a timestamp of every system call entry and exit (user-space to kernel transition and back), that might be much less than $1\%$ overhead. This would likely be acceptable in a production environment for all but the most time-critical of operations and could remain in place at all times.
\end{itemize}
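As a rough illustration of the middle option, timestamping every call and return, here is a hypothetical Python decorator. Real tracing tools hook compiled code rather than wrapping functions, so treat this purely as a sketch of the idea:

```python
import functools
import time

trace = []  # (event, function name, monotonic timestamp in ns)

def traced(fn):
    """Record a timestamp on every call to and return from fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace.append(("call", fn.__name__, time.monotonic_ns()))
        try:
            return fn(*args, **kwargs)
        finally:
            trace.append(("return", fn.__name__, time.monotonic_ns()))
    return wrapper

@traced
def handle_request(n):
    return sum(range(n))

handle_request(1000)
```

Even this tiny wrapper adds two list appends and two clock reads per call, which hints at why per-call tracing costs a measurable constant factor.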

@@ -75,14 +75,14 @@ \subsection*{Tracing: Logging}

Tracing also can play a role in identifying deadlocks and delays in threads caused by waiting for locks: simply log every lock and unlock. But we might really be only interested in the locks where there is contention, i.e., threads are sometimes or often failing to acquire the lock because it's held by another thread~\cite{usd}. After all, if our intention is to observe the behaviour of the program with the intention of improving performance, the part where a thread isn't getting what it wants immediately and must wait is much more interesting than the part where everything is fine and there are no delays.
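One way to log only contended acquisitions, as the paragraph above suggests, is to try a non-blocking acquire first and record an event only when that fails. This wrapper class and the lock name are hypothetical, not from any real tracing library:

```python
import threading
import time

contention_log = []  # (lock name, seconds spent waiting)

class LoggedLock:
    """Lock wrapper that records only acquisitions that had to wait."""
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def acquire(self):
        if not self._lock.acquire(blocking=False):  # contended: someone holds it
            start = time.monotonic()
            self._lock.acquire()                    # now block for real
            contention_log.append((self.name, time.monotonic() - start))

    def release(self):
        self._lock.release()

lock = LoggedLock("accounts")

def worker():
    lock.acquire()
    time.sleep(0.01)  # hold long enough that the other threads must wait
    lock.release()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Uncontended acquires cost only one extra non-blocking attempt and generate no log traffic, which is exactly the property we want: the boring fast path stays quiet.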

-\paragraph{Space, the Final Frontier.} Whatever trace strategy we choose, the trace itself takes up space. If we are producing a very large amount of data, it won't all fit in memory and has to go on disk. It's possible that the amount of data that we produce will fill up the disk very quickly, or arrive so fast that the disk cannot keep up. In~\cite{usd} there's a quick calculation that says a main memory system with 20 GB/s bandwidth and 64-byte cache lines that records up to 8 bytes of data per trace entry could result in producing data at a rate of 2.4 GB per second! That's a rather overwhelming amount of data -- even if you could write it all out to disk fast enough, it would fill up an 8 TB hard drive in just under an hour. Either we need to capture less data by recording fewer things or by recording for a shorter time. Just add this to the reasons why we need to be judicious about how much trace data to capture.
+\paragraph{Space, the Final Frontier.} Whatever trace strategy we choose, the trace itself takes up space. If we are producing a very large amount of data, it won't all fit in memory and has to go on disk. It's possible that the amount of data that we produce will fill up the disk very quickly, or arrive so fast that the disk cannot keep up. In~\cite{usd} there's a quick calculation that says a main memory system with 20 GB/s bandwidth and 64-byte cache lines that records up to 8 bytes of data per trace entry could result in producing data at a rate of 2.4 GB per second! That's a rather overwhelming amount of data---even if you could write it all out to disk fast enough, it would fill up an 8 TB hard drive in just under an hour. Either we need to capture less data by recording fewer things or by recording for a shorter time. Just add this to the reasons why we need to be judicious about how much trace data to capture.
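The fill-time figure in that paragraph is easy to sanity-check from the quoted rate:

```latex
\[
\frac{8\,\mathrm{TB}}{2.4\,\mathrm{GB/s}}
  = \frac{8000\,\mathrm{GB}}{2.4\,\mathrm{GB/s}}
  \approx 3333\,\mathrm{s}
  \approx 56\,\mathrm{min},
\]
```

which is indeed just under an hour, as claimed.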

\subsection*{Aggregate Measures (Counters)}
-Many profiling tools rely heavily on counters. Counters, as the name suggests, keep a count of events: interrupts, cache misses, data bytes written, calls to function \texttt{do\_magic()}. Keeping track of these numbers is relatively inexpensive (especially when compared to other approaches) so counting every occurrence of an event is actually plausible. Counters are a form of aggregation, because we're summing the number of occurrences and at the end of the program we have a total number. The counter and any data derived from it certainly take much less space than a trace.
+Many profiling tools rely heavily on counters. As the name suggests, they keep a count of events: interrupts, cache misses, data bytes written, calls to function \texttt{do\_magic()}. Keeping track of these numbers is relatively inexpensive (especially when compared to other approaches) so counting every occurrence of an event is plausible. Counters are a form of aggregation, because we're summing the number of occurrences and at the end of the program we have a total number. The counter and any data derived from it certainly take much less space than a trace.
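A counter in this sense is nothing more than a cheap increment per event; a toy sketch, with invented event names:

```python
from collections import Counter

counters = Counter()

def record(event):
    counters[event] += 1  # one integer add per occurrence; no payload, no timestamp

for _ in range(3):
    record("cache_miss")
record("do_magic_call")
# At program end, only the totals remain: a tiny, fixed-size summary
# rather than a per-event trace.
```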

The sum is the simplest kind, but other aggregate measures are exactly what they sound like: summaries of data. Calculating the average response time of a request is an aggregate measure: we've summed up the total time per request and divided it by the number of requests in the given period and hopefully the resulting value is a useful one. Asking the computer to calculate the summary is sensible, of course, because it's not realistic to ask a human to look at 50~000 requests and calculate their average time. Some obvious aggregate measures are things like: number of requests, requests broken down by type, average time to respond to a request, percentage of error responses...
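Each of the aggregate measures named above is a short computation over the raw records; the record fields and values here are hypothetical:

```python
requests = [
    {"type": "login",  "duration_s": 0.4, "status": 200},
    {"type": "login",  "duration_s": 0.9, "status": 200},
    {"type": "search", "duration_s": 1.7, "status": 500},
    {"type": "search", "duration_s": 0.6, "status": 200},
]

total = len(requests)                                      # number of requests
avg_time = sum(r["duration_s"] for r in requests) / total  # average response time
by_type = {}
for r in requests:                                         # requests broken down by type
    by_type[r["type"]] = by_type.get(r["type"], 0) + 1
error_pct = 100 * sum(r["status"] >= 400 for r in requests) / total  # % errors
```

The point of the surrounding paragraph stands: each result is one number (or a small table) summarizing many records, and it carries no explanation of why the numbers are what they are.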

-Whatever aggregate measures we use, they are useful only with context. Suppose that after my pull request is merged, the average login time for a user is 0.75 seconds; is that a problem? Without a baseline to compare against, I'm not sure. If before my PR it was 0.5 seconds, I made performance much worse and that doesn't sound good; should I revert it and re-work it? Maybe, unless I am intentionally making login slower to make a brute-force password attack more expensive for an attacker. Context is key: the summary tells you some data, but not the reasons.
+Whatever aggregate measures we use, they are useful only with context. Suppose that after my pull request is merged, the average login time for a user is 0.75 s; is that a problem? Without a baseline to compare against, I'm not sure. If before my PR it was 0.5 s, I made performance much worse and that doesn't sound good; should I revert it and re-work it? Maybe, unless I am intentionally making login slower to make a brute-force password attack more expensive for an attacker. Context is key: the summary tells you some data, but not the reasons.

Another example: if I tell you that a request takes, on average, 1.27 seconds to get a response, is that good or bad? There's, again, no way to say anything about it without a point of reference. Are we being asked to approve or deny a transaction and there's a time limit of 1 second to give our answer (or else a default answer is assumed)? We're missing the target and that's a problem. If instead I said the time limit is 10 seconds, then we have plenty of room. Or do we?

