lectures/L30.tex: 9 additions & 9 deletions
@@ -13,7 +13,7 @@ \section*{Clusters and Cloud Computing}
multiple threads or multiple processes, you can do the same with multiple
computers. We'll survey techniques for programming for
performance using multiple computers; although there's overlap with
- distributed systems, we're looking more at calculations here.
+ distributed systems, we're looking more at calculations here rather than coordination mechanisms.
\paragraph*{Message Passing.} Rust encourages message-passing, but
a lot of your previous experience when working with C may have centred around
@@ -30,12 +30,12 @@ \section*{Clusters and Cloud Computing}
Interface}, a de facto standard for programming message-passing multi-
computer systems. This is, unfortunately, no longer the way.
MPI sounds good, but in practice people tend to use other things.
- Here's a detailed piece about the relevance of MPI today:~\cite{hpcmpi}, if
+ Here's a detailed piece about the relevance of MPI as of 10 years ago:~\cite{hpcmpi}, if
you are curious.
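As a reminder of what the message-passing model the notes refer to looks like within a single program, here is a minimal sketch using Rust's standard {\tt std::sync::mpsc} channel (the worker count and the toy computation are arbitrary choices for illustration); MPI and the tools below apply the same send/receive idea across machines rather than threads.

\begin{verbatim}
use std::sync::mpsc;
use std::thread;

fn main() {
    // Channel: the sending half goes to the workers, the receiving
    // half stays with the coordinator. No shared mutable state.
    let (tx, rx) = mpsc::channel();

    for id in 0..4u64 {
        let tx = tx.clone();
        thread::spawn(move || {
            // Each worker computes a partial result and sends it as a message.
            let partial_sum: u64 = (0..1_000u64).map(|x| x + id).sum();
            tx.send((id, partial_sum)).expect("receiver hung up");
        });
    }
    drop(tx); // drop the original sender so the receive loop can terminate

    for (id, partial) in rx {
        println!("worker {id} sent {partial}");
    }
}
\end{verbatim}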
\paragraph{REST}
We've already seen asynchronous I/O using HTTP (curl) which we could use to
- consume a REST API as one mechanism for multi-computer communication. You
+ interact with a REST API as one mechanism for multi-computer communication. You
may have also learned about sockets and know how to use those, which would
underlie a lot of the mechanisms we're discussing. The socket approach is too
low-level for what we want to discuss, while the REST API approach is at a
@@ -54,11 +54,12 @@ \section*{Clusters and Cloud Computing}
Communication is based around the idea of producers writing a record (some data element, like an invoice) into a topic (categorizing messages) and consumers taking the item from the topic and doing something useful with it. A message remains available for a fixed period of time and can be replayed if needed. I think at this point you have enough familiarity with the concept of the producer-consumer problem and channels/topics/subscriptions that we don't need to spend a lot of time on it.
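A producer along these lines might look roughly like the following sketch. It assumes the third-party {\tt rdkafka} crate (bindings to librdkafka) and a broker reachable at {\tt localhost:9092}; the topic name {\tt invoices} and the key/payload are made up for illustration.

\begin{verbatim}
// Sketch only: assumes the `rdkafka` crate as a dependency and a
// Kafka broker running on localhost:9092.
use rdkafka::config::ClientConfig;
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};
use std::time::Duration;

fn main() {
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()
        .expect("producer creation error");

    // Write one record (an "invoice") into the `invoices` topic; the key
    // influences which partition the record lands in.
    producer
        .send(
            BaseRecord::to("invoices")
                .key("customer-42")
                .payload("{\"total\": 18.99}"),
        )
        .unwrap_or_else(|(e, _)| panic!("enqueue failed: {:?}", e));

    // Make sure the record actually goes out before the program exits.
    let _ = producer.flush(Duration::from_secs(5));
}
\end{verbatim}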
- Kafka's basic strategy is to write things into an immutable log. The log is split into different partitions; you choose how many when creating the topic, where more partitions equals higher parallelism. The producer writes something and it goes into one of the partitions. Consumers read from each one of the partitions and writes down its progress (``commit its offset'') to keep track of how much of the topic it has consumed. See this image from \url{kafka.apache.org}:
+ Kafka's basic strategy is to write things into an immutable log. The log is split into different partitions; you choose how many when creating the topic, where more partitions equals higher parallelism. The producer writes something and it goes into one of the partitions. Consumers read from each one of the partitions and writes down its progress (``commits its offset'') to keep track of how much of the topic it has consumed. See this image from \url{kafka.apache.org}:
The nice part about such an architecture is that we can provision the parallelism that we want, and the logic for the broker (the system between the producer and the consumer, that is, Kafka) is simple. Also, consumers can take items and deal with them at their own speed and there's no need for consumers to coordinate; they manage their own offsets. Messages are removed from the topic based on their expiry, so it's not important for consumers to get them out of the queue as quickly as possible.
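To make the partition-and-offset bookkeeping concrete, here is a toy in-memory model in plain Rust (this is not Kafka's actual API): a topic is a set of append-only partition logs, the producer hashes the key to choose a partition, and each consumer commits its own offset per partition.

\begin{verbatim}
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// A topic: several append-only logs ("partitions").
struct Topic {
    partitions: Vec<Vec<String>>,
}

impl Topic {
    fn new(num_partitions: usize) -> Self {
        Topic { partitions: vec![Vec::new(); num_partitions] }
    }

    /// Producer side: the record's key decides which partition it goes to.
    fn produce(&mut self, key: &str, record: &str) {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let p = (h.finish() as usize) % self.partitions.len();
        self.partitions[p].push(record.to_string());
    }
}

/// Consumer side: remembers how far it has read ("committed offset")
/// in each partition, independently of any other consumer.
struct Consumer {
    offsets: HashMap<usize, usize>,
}

impl Consumer {
    fn new() -> Self {
        Consumer { offsets: HashMap::new() }
    }

    /// Read everything new in one partition, then commit the offset.
    fn poll(&mut self, topic: &Topic, partition: usize) -> Vec<String> {
        let start = *self.offsets.get(&partition).unwrap_or(&0);
        let log = &topic.partitions[partition];
        let new = log[start..].to_vec();
        self.offsets.insert(partition, log.len()); // "commit its offset"
        new
    }
}

fn main() {
    let mut topic = Topic::new(3);
    topic.produce("customer-1", "invoice A");
    topic.produce("customer-2", "invoice B");
    topic.produce("customer-1", "invoice C");

    let mut consumer = Consumer::new();
    for p in 0..3 {
        for record in consumer.poll(&topic, p) {
            println!("partition {p}: {record}");
        }
    }
}
\end{verbatim}

A real deployment stores committed offsets in Kafka itself rather than in consumer memory, but the point is the same: each consumer's progress is its own bookkeeping, so consumers need not coordinate with each other.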
@@ -104,8 +105,7 @@ \subsection*{Cloud Computing}
instances, that you've started up. Providers offer different instance
sizes, where the sizes vary according to the number of cores, local
storage, and memory. Some instances even have GPUs, but it seemed
- uneconomic to use this for Assignment 3, at least in previous years (I
- have not done the calculation this year).
+ uneconomic to use this for Assignment 3.
Instead we have the {\tt ecetesla} machines.
\paragraph{Launching Instances.} When you need more compute power,
@@ -152,7 +152,7 @@ \section*{Clusters versus Laptops}
\paragraph{Results.} 128 cores don't consistently beat a laptop at PageRank: e.g. 249--857s on the twitter\_rv dataset for the big data system vs 300s for the laptop, and they are 2$\times$ slower for label
propagation, at 251--1784s for the big data system vs 153s on