Commit ce5d39e

L30 minor
1 parent 9e73e8e commit ce5d39e

File tree: 2 files changed, +12 −13 lines

lectures/L30-slides.tex: 3 additions & 4 deletions

@@ -223,8 +223,8 @@
 \begin{itemize}
 \item
 Once upon a time: physical machine; shared hosting.
-\item Virtualization:
-\item Clouds
+\item Virtualization.
+\item Clouds.
 \end{itemize}

 Servers typically share persistent storage, also in
@@ -384,8 +384,7 @@
 big data systems.

 Domain: graph processing
-algorithms---
-PageRank and graph connectivity \\
+algorithms---PageRank and graph connectivity \\
 (bottleneck is label propagation).

 Subjects: graphs with billions of edges\\

lectures/L30.tex: 9 additions & 9 deletions
@@ -13,7 +13,7 @@ \section*{Clusters and Cloud Computing}
 multiple threads or multiple processes, you can do the same with multiple
 computers. We'll survey techniques for programming for
 performance using multiple computers; although there's overlap with
-distributed systems, we're looking more at calculations here.
+distributed systems, we're looking more at calculations here rather than coordination mechanisms.

 \paragraph*{Message Passing.} Rust encourages message-passing, but
 a lot of your previous experience when working with C may have centred around
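As an aside, the \paragraph*{Message Passing.} text this hunk touches contrasts Rust's channel style with C-style shared memory. A minimal sketch of that style using the standard library's `std::sync::mpsc` channel (my illustration, not part of the lecture notes):

```rust
// Illustration only (not from the lecture notes): message passing with
// std::sync::mpsc, the style the changed paragraph says Rust encourages.
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    // Four workers each compute a partial sum and send it back over
    // the channel; no memory is shared between the threads.
    let handles: Vec<_> = (0..4u64)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                let partial: u64 = (id * 100..(id + 1) * 100).sum();
                tx.send(partial).expect("receiver should be alive");
            })
        })
        .collect();
    drop(tx); // close the original sender so rx.iter() terminates

    let total: u64 = rx.iter().sum(); // 0 + 1 + ... + 399 = 79800
    for h in handles {
        h.join().unwrap();
    }
    println!("total = {total}");
}
```

The receiving end simply drains the channel; the senders never touch each other's data, which is the contrast with the shared-memory C style the paragraph mentions.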
@@ -30,12 +30,12 @@ \section*{Clusters and Cloud Computing}
 Interface}, a de facto standard for programming message-passing multi-
 computer systems. This is, unfortunately, no longer the way.
 MPI sounds good, but in practice people tend to use other things.
-Here's a detailed piece about the relevance of MPI today:~\cite{hpcmpi}, if
+Here's a detailed piece about the relevance of MPI as of 10 years ago:~\cite{hpcmpi}, if
 you are curious.

 \paragraph{REST}
 We've already seen asynchronous I/O using HTTP (curl) which we could use to
-consume a REST API as one mechanism for multi-computer communication. You
+interact with a REST API as one mechanism for multi-computer communication. You
 may have also learned about sockets and know how to use those, which would
 underlie a lot of the mechanisms we're discussing. The socket approach is too
 low-level for what we want to discuss, while the REST API approach is at a
@@ -54,11 +54,12 @@ \section*{Clusters and Cloud Computing}
 Communication is based around the idea of producers writing a record (some data element, like an invoice) into a topic (categorizing messages) and consumers taking the item from the topic and doing something useful with it. A message remains available for a fixed period of time and can be replayed if needed. I think at this point you have enough familiarity with the concept of the producer-consumer problem and channels/topics/subscriptions that we don't need to spend a lot of time on it.


-Kafka's basic strategy is to write things into an immutable log. The log is split into different partitions; you choose how many when creating the topic, where more partitions equals higher parallelism. The producer writes something and it goes into one of the partitions. Consumers read from each one of the partitions and writes down its progress (``commit its offset'') to keep track of how much of the topic it has consumed. See this image from \url{kafka.apache.org}:
+Kafka's basic strategy is to write things into an immutable log. The log is split into different partitions; you choose how many when creating the topic, where more partitions equals higher parallelism. The producer writes something and it goes into one of the partitions. Consumers read from each one of the partitions and writes down its progress (``commits its offset'') to keep track of how much of the topic it has consumed. See this image from \url{kafka.apache.org}:

 \begin{center}
 \includegraphics[width=0.4\textwidth]{images/kafka-partition.png}
 \end{center}
+\vspace*{-1.5em}

 The nice part about such an architecture is that we can provision the parallelism that we want, and the logic for the broker (the system between the producer and the consumer, that is, Kafka) is simple. Also, consumers can take items and deal with them at their own speed and there's no need for consumers to coordinate; they manage their own offsets. Messages are removed from the topic based on their expiry, so it's not important for consumers to get them out of the queue as quickly as possible.
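As a side note, the partition-and-offset strategy the rewritten paragraph describes can be sketched with a toy model (my own illustration in plain Rust; this is not the real Kafka client API):

```rust
// Toy model (not the real Kafka API) of the strategy described in the
// changed paragraph: an immutable log split into partitions, a producer
// routing records by key, and a consumer committing its own offset.
struct Topic {
    partitions: Vec<Vec<String>>, // append-only logs, one per partition
}

impl Topic {
    fn new(num_partitions: usize) -> Self {
        Topic { partitions: vec![Vec::new(); num_partitions] }
    }

    // Producer side: a record's key decides which partition it lands in.
    fn produce(&mut self, key: u64, record: &str) {
        let p = (key as usize) % self.partitions.len();
        self.partitions[p].push(record.to_string());
    }

    // Consumer side: read a partition from a committed offset onward;
    // returns the new records plus the next offset to commit.
    fn consume(&self, partition: usize, offset: usize) -> (&[String], usize) {
        let log = &self.partitions[partition];
        (&log[offset..], log.len())
    }
}

fn main() {
    let mut topic = Topic::new(2);
    topic.produce(0, "invoice-1"); // key 0 -> partition 0
    topic.produce(1, "invoice-2"); // key 1 -> partition 1
    topic.produce(2, "invoice-3"); // key 2 -> partition 0

    // A consumer of partition 0 starting from offset 0 sees both of its
    // records and would then commit offset 2.
    let (records, next_offset) = topic.consume(0, 0);
    println!("{records:?}, commit offset {next_offset}");
}
```

Because each consumer only advances its own offset, consumers never need to coordinate with one another, which is the property the paragraph highlights.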

@@ -104,8 +105,7 @@ \subsection*{Cloud Computing}
 instances, that you've started up. Providers offer different instance
 sizes, where the sizes vary according to the number of cores, local
 storage, and memory. Some instances even have GPUs, but it seemed
-uneconomic to use this for Assignment 3, at least in previous years (I
-have not done the calculation this year).
+uneconomic to use this for Assignment 3.
 Instead we have the {\tt ecetesla} machines.

 \paragraph{Launching Instances.} When you need more compute power,
@@ -152,7 +152,7 @@ \section*{Clusters versus Laptops}
 \paragraph{Results.} 128 cores don't consistently beat a laptop at PageRank: e.g. 249--857s on the twitter\_rv dataset for the big data system vs 300s for the laptop, and they are 2$\times$ slower for label
 propagation, at 251--1784s for the big data system vs 153s on
 twitter\_rv. From the blogpost:
-
+\vspace*{-1.5em}
 \begin{center}
 \includegraphics[width=0.60\textwidth]{images/pagerank.png}
 \end{center}
@@ -164,7 +164,7 @@ \section*{Clusters versus Laptops}
 $2\times$ speedup for PageRank and $10\times$ speedup for label propagation.

 \paragraph{Takeaways.} Some thoughts to keep in mind, from the authors:
-\begin{itemize}
+\begin{itemize}[noitemsep]
 \item ``If you are going to use a big data system for yourself, see if it is faster than your laptop.''
 \item ``If you are going to build a big data system for others, see that it is faster than my laptop.''
 \end{itemize}
@@ -173,7 +173,7 @@ \section*{Clusters versus Laptops}

 \section*{Movie Hour}
 Let's take a humorous look at cloud computing: James Mickens' session from Monitorama PDX 2014.
-
+\vspace*{-1.5em}
 \begin{center}
 \url{https://vimeo.com/95066828}
 \end{center}
