
Commit 0579024

docs: refactor design into its own doc
1 parent b6161ec commit 0579024

File tree

2 files changed

+182
-176
lines changed


README.md

Lines changed: 3 additions & 176 deletions
@@ -1,4 +1,4 @@
-# peerd
+# Peerd

[![Build Status]][build-status]
[![Kind CI Status]][kind-ci-status]
@@ -226,179 +226,7 @@ The APIs are described in the [swagger.yaml].

## Design and Architecture

![cluster-arch]

The design is inspired by the [Spegel] project, a peer-to-peer proxy for container images built on libp2p.
This section describes the design and architecture of `peerd`.
### Background

An OCI image is composed of multiple layers, where each layer is stored as a blob in the registry. When a container
application is deployed to a cluster such as AKS or ACI, the container image must first be downloaded to each node
where it is scheduled to run. If the image is large, downloading it often becomes the most time-consuming step of
starting the application. This download step most impacts two scenarios:

a) where the application needs to scale immediately to handle a burst of requests, such as an e-commerce application
dealing with a sudden increase in shoppers; or

b) where the application must be deployed on each node of a large cluster (say, 1000+ nodes) and the container image
itself is very large (multiple GBs), such as training a large language model.

ACR Teleport addresses scenario (a) by allowing a container to start up quickly, using the registry as a remote
filesystem and downloading only the specific parts of files needed to serve requests. Scenario (b), however, remains
affected by increased latency, because entire layers must be downloaded from the registry to all nodes before the
application can run. Here, the registry can become a bottleneck for the downloads.

To minimize network I/O to the remote registry and improve speed, once an image (or part of it) has been downloaded by
a node, other nodes in the cluster can leverage this peer and download from it rather than from the remote ACR. This
reduces network traffic to the registry and improves the average download speed per node. Peers must be able to
discover content already downloaded into the network and share it with others. Such p2p distribution benefits both
scenarios above: (a) Teleport and (b) regular, non-Teleport pulls.
### Design

There are four main components to the design that together make up the `peerd` binary:

1. Peer to Peer Router
2. File Cache
3. Containerd Content Store Subscriber
4. P2P Proxy Server

#### Peer to Peer Router

The p2p router is the core component responsible for discovering peers in the local network and maintaining a
distributed hash table (DHT) for content lookup. It provides the ability to advertise local content to the network, as
well as to discover peers that have specific content. The DHT protocol is Kademlia, which provides provable consistency
and performance; see the [white paper] for details.

##### Bootstrap

When a node is created, it must obtain some basic information to join the p2p network, such as the addresses and public
keys of nodes already in the network, in order to initialize its DHT. One way to do this is to connect to an existing
node in the network and ask it for this information. So, which node should it connect to? To make this process
completely automatic, we leverage leader election in k8s and connect to the leader to bootstrap.

This is the current approach, although it introduces a dependency on the k8s runtime APIs and kubelet credentials for
leader election; an alternative would be to use a statically assigned node as the bootstrapper.

##### Configuration

The router uses the following configuration to connect to peers:

| Name           | Value | Description                                   |
| -------------- | ----- | --------------------------------------------- |
| ResolveTimeout | 20ms  | The time to wait for a peer to resolve        |
| ResolveRetries | 3     | The number of times to retry resolving a peer |
##### Advertisements

Once the node has completed bootstrapping, it is ready to advertise its content to the network. There are two sources
for this content:

1. Containerd content store: this is where images pulled to the node are available; see section
   [Containerd Content Store Subscriber].

2. File cache: this is where files pulled to the node are available; see section [File Cache].

Advertising means adding the content's key to the node's DHT and, optionally, announcing the available content on the
network. The key used is the sha256 digest of the content.
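The key derivation can be sketched with the standard library; `contentKey` is a hypothetical helper, shown only to make the "key = sha256 digest" statement concrete:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentKey returns the DHT key for a blob: its sha256 digest, rendered in
// the familiar OCI "sha256:<hex>" form. A stand-in for whatever key
// derivation peerd uses internally.
func contentKey(blob []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(blob))
}

func main() {
	fmt.Println(contentKey([]byte("layer bytes")))
}
```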
##### Resolution

A key is resolved to a node based on the closeness metric discussed in the Kademlia paper. With advertisements,
resolution is very fast (an overhead of ~1ms in AKS).
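To illustrate the closeness metric: Kademlia defines the distance between two 256-bit IDs as their bitwise XOR, so peers sharing a longer ID prefix with the key are closer. The sketch below ranks candidate peers this way using only the standard library; it is an illustration of the metric, not peerd's implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// sharedPrefixLen returns the number of leading bits two IDs have in
// common. A longer shared prefix means a smaller XOR distance, i.e. a
// "closer" peer in Kademlia terms.
func sharedPrefixLen(a, b [32]byte) int {
	n := 0
	for i := 0; i < 32; i++ {
		x := a[i] ^ b[i]
		if x == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(x)
		break
	}
	return n
}

func main() {
	key := sha256.Sum256([]byte("some-layer"))
	peers := [][32]byte{sha256.Sum256([]byte("node-a")), sha256.Sum256([]byte("node-b"))}
	best, bestLen := 0, -1
	for i, p := range peers {
		if l := sharedPrefixLen(key, p); l > bestLen {
			best, bestLen = i, l
		}
	}
	fmt.Println("closest peer index:", best)
}
```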
#### File Cache

The file cache is a cache of files on the local file system. These files correspond to layers of a teleported image.
##### Prefetching

The first time a request for a file is made, the range of requested bytes is served from the remote source (either a
peer or upstream). At the same time, multiple prefetch tasks are kicked off, which download fixed-size chunks of the
file in parallel (from peers or upstream) and store them in the cache. The default configuration is as follows:

| Name            | Value | Description                                                                             |
| --------------- | ----- | --------------------------------------------------------------------------------------- |
| ChunkSize       | 1 MiB | The size of a single chunk of a file that is downloaded from remote and cached locally. |
| PrefetchWorkers | 50    | The total number of workers available for downloading file chunks.                      |
##### File System Layout

Below is an example of what the file cache looks like. Here, five files are cached (the folder name of each is its
digest, shortened in the example below), and for each file, some chunks have been downloaded. For example, for the file
095e6bc048, four chunks are available in the cache. The name of each chunk corresponds to an offset in the file: chunk
0 is the portion of 095e6bc048 starting at offset 0, of size ChunkSize; chunk 1048576 is the portion starting at offset
1048576, of size ChunkSize; and so on.

![file-system-layout]

#### Containerd Content Store Subscriber

This component is responsible for discovering layers in the local containerd content store and advertising them to the
p2p network using the p2p router component, enabling p2p distribution for regular image pulls.

#### P2P Proxy Server

The p2p proxy server (a.k.a. the p2p mirror) serves the node's content from the file cache or the containerd content
store. There are two scenarios for accessing the proxy:

1. Overlaybd TCMU driver: this is the Teleport scenario.

The driver makes requests like the following to the p2p proxy:

```bash
GET http://localhost:5000/blobs/https://westus2.data.mcr.microsoft.com/01031d61e1024861afee5d512651eb9f36fskt2ei//docker/registry/v2/blobs/sha256/1b/1b930d010525941c1d56ec53b97bd057a67ae1865eebf042686d2a2d18271ced/data?se=20230920T01%3A14%3A49Z&sig=m4Cr%2BYTZHZQlN5LznY7nrTQ4LCIx2OqnDDM3Dpedbhs%3D&sp=r&spr=https&sr=b&sv=2018-03-28&regid=01031d61e1024861afee5d512651eb9f

Range: bytes=456-990
```

Here, the p2p proxy is listening at `localhost:5000` and is passed the full SAS URL of the layer, which the driver
previously obtained from the ACR. The proxy first attempts to locate this content in the p2p network using the router.
If a peer is found, the request is reverse proxied to it. Otherwise, after the configured resolution timeout, the
request is proxied to the upstream storage account.
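The `Range` header in the example above selects an inclusive byte range within the blob. Parsing it can be sketched as follows; this is a simplified illustration (a real proxy must also handle open-ended and multi-part ranges):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRange extracts the inclusive byte range from a header such as
// "bytes=456-990". Simplified: open-ended ("bytes=456-") and multi-part
// ranges are rejected here.
func parseRange(h string) (start, end int64, err error) {
	spec, ok := strings.CutPrefix(h, "bytes=")
	if !ok {
		return 0, 0, fmt.Errorf("unsupported range %q", h)
	}
	lo, hi, ok := strings.Cut(spec, "-")
	if !ok {
		return 0, 0, fmt.Errorf("malformed range %q", h)
	}
	if start, err = strconv.ParseInt(lo, 10, 64); err != nil {
		return 0, 0, err
	}
	if end, err = strconv.ParseInt(hi, 10, 64); err != nil {
		return 0, 0, err
	}
	return start, end, nil
}

func main() {
	s, e, _ := parseRange("bytes=456-990")
	fmt.Println(s, e, e-s+1) // 456 990 535
}
```

Note the range is inclusive, so `bytes=456-990` covers 535 bytes, which the proxy can serve from cached chunks or fetch from a peer.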
2. Containerd hosts: this is the non-Teleport scenario.

Here, containerd is configured to use the p2p mirror via its hosts configuration. The p2p mirror receives registry
requests to the `/v2` API, following the OCI distribution spec. The mirror supports GET and HEAD requests on `/v2/`
routes. When a request is received, the digest is first looked up in the p2p network; if a peer has the layer, that
peer is used to serve the request. Otherwise, the mirror returns a 404, and the containerd client falls back to the ACR
directly (or to the next configured mirror).
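For reference, a containerd hosts configuration pointing a registry at such a mirror might look like the following. The registry host, mirror address, and file path here are illustrative placeholders, not peerd's documented defaults:

```toml
# /etc/containerd/certs.d/example.azurecr.io/hosts.toml (placeholder registry)
server = "https://example.azurecr.io"

# Try the local p2p mirror first; containerd falls back to the server above
# when the mirror returns 404. The port is an assumption for illustration.
[host."http://localhost:30000"]
  capabilities = ["pull", "resolve"]
```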
### Performance

The following numbers were gathered from a 3-node AKS cluster.

#### Peer Discovery

In broadcast mode, any locally available content is broadcast to the k closest peers of the content ID. As seen below,
performance improves significantly, with the tradeoff that network traffic also increases.
**Broadcast off**

| Operation | Samples | Min (s) | Mean (s) | Max (s) | Std. Deviation |
| --------- | ------- | ------- | -------- | ------- | -------------- |
| Discovery | 30      | 0.006   | 0.021    | 0.039   | 0.009          |

**Broadcast on**

| Operation | Samples | Min (s) | Mean (s) | Max (s) | Std. Deviation |
| --------- | ------- | ------- | -------- | ------- | -------------- |
| Discovery | 416     | 0       | 0.001    | 0.023   | 0.003          |

#### File Scanner Application Container

An Overlaybd image was created for a simple application that reads an entire file. The performance is compared between
running this container in p2p vs non-p2p mode on a 3-node AKS cluster with Artifact Streaming.

| Mode                              | File Size Read (Mb) | Speed (3 nodes) (Mbps) |
| --------------------------------- | ------------------- | ---------------------- |
| Teleport without p2p              | 200                 | 3.5, 3.8, 3.9          |
| Teleport with p2p, no prefetching | 600                 | 3.8, 3.9, 4.9          |
| Teleport with p2p and prefetching | 200                 | 6.5, **11, 13**        |
| Teleport with p2p and prefetching | 600                 | 5.5, 6.1, 6.8          |
See [design][design-doc].

## Contributing

@@ -438,7 +266,6 @@ A hat tip to:
[Docker Release CI]: https://github.com/azure/peerd/actions/workflows/release.yml/badge.svg
[release-ci]: https://github.com/azure/peerd/actions/workflows/release.yml
[Code Coverage]: https://img.shields.io/badge/coverage-54.9%25-orange
[cluster-arch]: ./assets/images/cluster.png
[node-arch]: ./assets/images/http-flow.png
[Overlaybd]: https://github.com/containerd/overlaybd
[scanner]: ./tests/scanner/scanner.go
@@ -458,4 +285,4 @@ A hat tip to:
[p2p proxy]: https://github.com/containerd/overlaybd/blob/main/src/example_config/overlaybd.json#L27C5-L30C7
[peerd.service]: ./init/systemd/peerd.service
[white paper]: https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf
[file-system-layout]: ./assets/images/file-system-layout.png
[design-doc]: ./docs/design.md
