SIGCOMM'18 artifact for "Homa: a receiver-driven low-latency transport protocol using network priorities"
With this artifact, we provide the reviewers with the ability to run workloads W3-W5 using the RAMCloud implementation of Homa transport and reproduce its performance numbers. For the simulation code, please check out the HomaSimulation repository. The CDF files of the workloads can be found here; they are copied and renamed to W1-W5 in the RAMCloud repository.
The files included in this repository are:
$ tree
.
├── getRamcloud.sh
├── localconfigGen.py
├── profile.py
├── README.adoc
├── setup-45XGc-QoS.py
└── startup.sh
- getRamcloud.sh - Script for building RAMCloud
- localconfigGen.py - Generates the RAMCloud cluster config file
- profile.py - CloudLab profile used to instantiate new experiments
- README.adoc - This file
- setup-45XGc-QoS.py - Script generating commands used to configure the switch
- startup.sh - Startup service that installs dependencies required for the experiment
We conduct our experiment using the m510 machines available at CloudLab.
To start a new experiment, follow the instructions at CloudLab’s getting-started page and use the public profile named HomaArtifactEvaluation. This profile will help you reserve a full chassis of 45 nodes interconnected by a Moonshot 45XGc switch.
It could take 10-15 minutes to instantiate a new experiment and complete our custom startup service. Make sure the file /local/startup_service_done is present on all nodes of the experiment before proceeding to the next step.
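The check for the marker file can be scripted. The following is a minimal sketch, assuming passwordless ssh from the machine where it runs to every node; the function name `nodes_missing_marker` and any host names other than rcmaster are hypothetical and not part of this artifact.

```python
import subprocess

# Marker path taken from the text above.
MARKER = "/local/startup_service_done"


def nodes_missing_marker(hosts, run=None):
    """Return the hosts where the startup marker is absent or unreachable.

    `run` maps a host name to an exit status; it defaults to an ssh call
    and can be replaced with a stub for dry runs.
    """
    if run is None:
        def run(host):
            # `test -e` exits with 0 only if the file exists on the remote node.
            return subprocess.call(["ssh", host, "test", "-e", MARKER])
    return [host for host in hosts if run(host) != 0]


# Example (host names other than rcmaster are placeholders):
#   missing = nodes_missing_marker(["rcmaster", "rc01", "rc02"])
#   print("still waiting on: %s" % missing if missing else "all nodes ready")
```

An empty result means every listed node has finished the startup service.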
Homa transport relies on network priorities to achieve low tail latency. Therefore, we need to enable the QoS setting of the 45XGc switch so that it recognizes the packet priorities. Note that the current policy of CloudLab is to grant full switch access only to people who have reserved a full chassis, which can sometimes be difficult depending on resource availability. As of 12/2018, the best way to go about this is to use the reservation mechanism of CloudLab. Once you have successfully reserved a full chassis, you can contact the CloudLab support team (support@cloudlab.us) to request access to the switch. You should receive further instructions shortly. Extend the experiment to avoid expiration if necessary.
The commands used to configure the switch can be generated by running setup-45XGc-QoS.py. Once you log in to the switch console, the best way to configure the switch is to use its built-in Python interpreter:
No directory, logging in with HOME=/
Trying 127.0.0.1...
Connected to localhost.
Escape character is 'off'.
<ms-chassis13-sb>python
Python 2.7.3 (default, Apr 10 2014, 16:32:11)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import comware
>>> cmds = '<COPY-THE-COMMANDS-GENERATED-ABOVE-HERE>'
>>> comware.CLI(cmds)
If the commands are working, you should see something like the following:
<ms-chassis13-sb>system-view
System View: return to User View with Ctrl+Z.
[ms-chassis13-sb]qos map-table dot1p-lp
[ms-chassis13-sb-maptbl-dot1p-lp]import 0 export 1
[ms-chassis13-sb-maptbl-dot1p-lp]import 1 export 0
[ms-chassis13-sb-maptbl-dot1p-lp]import 2 export 2
[ms-chassis13-sb-maptbl-dot1p-lp]interface Ten-GigabitEthernet1/0/1
[ms-chassis13-sb-Ten-GigabitEthernet1/0/1]qos trust dot1p
[ms-chassis13-sb-Ten-GigabitEthernet1/0/1]qos sp
[ms-chassis13-sb-Ten-GigabitEthernet1/0/1]quit
[ms-chassis13-sb]interface Ten-GigabitEthernet1/0/2
[ms-chassis13-sb-Ten-GigabitEthernet1/0/2]qos trust dot1p
[ms-chassis13-sb-Ten-GigabitEthernet1/0/2]qos sp
[ms-chassis13-sb-Ten-GigabitEthernet1/0/2]quit
... more output omitted...
[ms-chassis13-sb]interface Ten-GigabitEthernet1/0/45
[ms-chassis13-sb-Ten-GigabitEthernet1/0/45]qos trust dot1p
[ms-chassis13-sb-Ten-GigabitEthernet1/0/45]qos sp
[ms-chassis13-sb-Ten-GigabitEthernet1/0/45]quit
[ms-chassis13-sb]quit
<comware.CLI object at 0x181f1090>
>>>
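The transcript above shows the shape of the command sequence: one priority map table followed by the same three commands on each of the 45 ports. The sketch below rebuilds that sequence as a list. It is not the code of setup-45XGc-QoS.py; the function name `qos_commands` and its parameters are illustrative, and the map-table entries are copied from the transcript. How the commands are joined into the single string passed to comware.CLI is not shown in this README, so the sketch stops at the list.

```python
def qos_commands(num_ports=45, dot1p_to_lp=((0, 1), (1, 0), (2, 2))):
    """Build the switch command sequence seen in the transcript above.

    dot1p_to_lp holds (802.1p priority, local precedence) pairs; the
    defaults are the three entries that appear in the transcript.
    """
    cmds = ["system-view", "qos map-table dot1p-lp"]
    for dot1p, lp in dot1p_to_lp:
        cmds.append("import %d export %d" % (dot1p, lp))
    for port in range(1, num_ports + 1):
        cmds.append("interface Ten-GigabitEthernet1/0/%d" % port)
        cmds.append("qos trust dot1p")  # classify packets by their 802.1p tag
        cmds.append("qos sp")           # strict-priority queue scheduling
        cmds.append("quit")             # leave the interface view
    cmds.append("quit")                 # leave system view
    return cmds


# Print one command per line to compare against the generated commands.
for line in qos_commands(num_ports=2):
    print(line)
```

The string formatting is kept compatible with the Python 2.7 interpreter on the switch.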
To fetch the source code of RAMCloud and build the executables, run the following on node rcmaster:
$ cd /shome
$ /local/repository/getRamcloud.sh
RAMCloud will be available at /shome/RAMCloud when the script completes.
All commands in this section are assumed to run from the RAMCloud top directory at /shome/RAMCloud on node rcmaster.
To make sure RAMCloud and DPDK are built correctly, run a basic performance test:
$ scripts/clusterperf.py --superuser --replicas 0 --transport homa+dpdk --dpdkPort 1 --verbose echo_basic
If everything works as expected, you should see performance numbers similar to the following output (note: make sure the CPU governor is set to performance and idle=poll is provided as a kernel boot parameter):
echo0 4.4 us send 0B message, receive 0B message median
echo0.min 4.2 us send 0B message, receive 0B message minimum
echo0.9 4.8 us send 0B message, receive 0B message 90%
echo0.99 5.4 us send 0B message, receive 0B message 99%
echo0.999 18.2 us send 0B message, receive 0B message 99.9%
echoBw0 0.0 B/s bandwidth sending 0B messages
echo100 4.9 us send 100B message, receive 100B message median
echo100.min 4.8 us send 100B message, receive 100B message minimum
echo100.9 5.2 us send 100B message, receive 100B message 90%
echo100.99 5.5 us send 100B message, receive 100B message 99%
echo100.999 7.3 us send 100B message, receive 100B message 99.9%
echoBw100 18.7 MB/s bandwidth sending 100B messages
echo1K 8.7 us send 1000B message, receive 1KB message median
echo1K.min 8.5 us send 1000B message, receive 1KB message minimum
echo1K.9 9.0 us send 1000B message, receive 1KB message 90%
echo1K.99 9.3 us send 1000B message, receive 1KB message 99%
echo1K.999 11.5 us send 1000B message, receive 1KB message 99.9%
echoBw1K 107.7 MB/s bandwidth sending 1KB messages
echo10K 25.0 us send 10000B message, receive 10KB message median
echo10K.min 24.9 us send 10000B message, receive 10KB message minimum
echo10K.9 25.1 us send 10000B message, receive 10KB message 90%
echo10K.99 25.5 us send 10000B message, receive 10KB message 99%
echo10K.999 73.9 us send 10000B message, receive 10KB message 99.9%
echoBw10K 376.1 MB/s bandwidth sending 10KB messages
echo100K 178.0 us send 100000B message, receive 100KB message median
echo100K.min 177.7 us send 100000B message, receive 100KB message minimum
echo100K.9 178.5 us send 100000B message, receive 100KB message 90%
echo100K.99 181.8 us send 100000B message, receive 100KB message 99%
echo100K.999 357.7 us send 100000B message, receive 100KB message 99.9%
echoBw100K 532.6 MB/s bandwidth sending 100KB messages
echo1M 1.72 ms send 1000000B message, receive 1MB message median
echo1M.min 1.71 ms send 1000000B message, receive 1MB message minimum
echo1M.9 1.72 ms send 1000000B message, receive 1MB message 90%
echo1M.99 1.89 ms send 1000000B message, receive 1MB message 99%
echo1M.999 2.04 ms send 1000000B message, receive 1MB message 99.9%
echoBw1M 553.8 MB/s bandwidth sending 1MB messages
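If the numbers are noticeably worse, the two prerequisites mentioned above are worth checking first. The following is a minimal sketch, assuming a Linux node that exposes /proc/cmdline and the standard cpufreq sysfs files; the function names are hypothetical and not part of this artifact.

```python
import glob


def has_idle_poll(cmdline):
    """True if the kernel command line contains the idle=poll parameter."""
    return "idle=poll" in cmdline.split()


def non_performance_cpus(governors):
    """Given {cpu_name: governor}, return the CPUs not set to 'performance'."""
    return sorted(cpu for cpu, gov in governors.items()
                  if gov.strip() != "performance")


def read_governors(
        pattern="/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
    """Read the current governor of every CPU that exposes cpufreq.

    Returns an empty dict if no cpufreq driver is loaded.
    """
    governors = {}
    for path in glob.glob(pattern):
        cpu = path.split("/")[5]  # e.g. "cpu0"
        with open(path) as f:
            governors[cpu] = f.read()
    return governors


# Example, to be run on each node:
#   with open("/proc/cmdline") as f:
#       print("idle=poll present: %s" % has_idle_poll(f.read()))
#   print("CPUs not in performance mode: %s"
#         % non_performance_cpus(read_governors()))
```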
Before we can run the workloads and generate the slowdown numbers reported in the paper, we first need to obtain the baseline latency numbers (i.e., when the network is empty) for all message sizes in workloads W3-W5. This can be done by running:
$ benchmarks/homa/scripts/compute_baseline.sh basic+dpdk W3
$ benchmarks/homa/scripts/compute_baseline.sh basic+dpdk W4
$ benchmarks/homa/scripts/compute_baseline.sh basic+dpdk W5
$ benchmarks/homa/scripts/compute_baseline.sh homa+dpdk W3
$ benchmarks/homa/scripts/compute_baseline.sh homa+dpdk W4
$ benchmarks/homa/scripts/compute_baseline.sh homa+dpdk W5
This step could take a while for workloads with many different message sizes. You can monitor the progress with:
$ watch "tail logs/latest/client*.log"
The results will be written to benchmarks/homa/{basic,homa}_{W3,W4,W5}_baseline.txt.
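The six baseline runs and their result files follow a regular pattern, so they can be enumerated rather than typed by hand. The sketch below only builds the argument lists and the expected file names; the helper names are hypothetical, and it assumes the file prefix is the transport name without the +dpdk suffix, as the path pattern above suggests.

```python
import itertools

SCRIPT = "benchmarks/homa/scripts/compute_baseline.sh"
TRANSPORTS = ("basic+dpdk", "homa+dpdk")
WORKLOADS = ("W3", "W4", "W5")


def baseline_commands(transports=TRANSPORTS, workloads=WORKLOADS):
    """Return one argument list per (transport, workload) pair."""
    return [[SCRIPT, t, w]
            for t, w in itertools.product(transports, workloads)]


def baseline_file(transport, workload):
    """Expected result file for one run, e.g. homa+dpdk, W3."""
    prefix = transport.split("+")[0]  # "homa+dpdk" -> "homa"
    return "benchmarks/homa/%s_%s_baseline.txt" % (prefix, workload)


# Example, run from /shome/RAMCloud on rcmaster:
#   import os, subprocess
#   for cmd in baseline_commands():
#       if not os.path.exists(baseline_file(cmd[1], cmd[2])):
#           subprocess.check_call(cmd)
```

Skipping runs whose result file already exists makes it cheap to resume after an interruption.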
To run a particular workload with various configurations (e.g., homa vs. basic, load factor, number of priorities available, etc.), use the run_workload.sh script. This script will run the same workload using different configurations and compute the corresponding message slowdown numbers at the end. For example, the following command will run workload W3 with 16 nodes using different configurations, with each configuration run taking 100 seconds:
$ benchmarks/homa/scripts/run_workload.sh W3 16 100
Each configuration run must be long enough to collect enough samples to compute the 99th-percentile tail latency for each message size. For W3 and W5, we recommend allocating at least one hour to each configuration run; for W4, 10 minutes should be enough.
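To see why the run length matters, consider how a tail slowdown is computed. The sketch below is illustrative only and is not the code behind run_workload.sh: it assumes slowdown means measured latency divided by the baseline latency for the same message size, uses a nearest-rank percentile, and takes its input as in-memory values rather than the artifact's file formats, which this README does not describe.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def tail_slowdown(samples, baseline, pct=99):
    """Per-size tail slowdown.

    samples:  iterable of (message_size, measured_latency)
    baseline: {message_size: latency on an empty network}
    """
    by_size = defaultdict(list)
    for size, latency in samples:
        by_size[size].append(latency / float(baseline[size]))
    return dict((size, percentile(s, pct)) for size, s in by_size.items())


samples = [(100, 10.0), (100, 20.0), (1000, 30.0)]
print(tail_slowdown(samples, {100: 5.0, 1000: 10.0}))
```

With fewer than 100 samples for a given message size, the nearest-rank 99th percentile is simply the largest sample, so rarely occurring sizes need long runs before their tail is meaningful.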
Each invocation of the run_workload.sh script will create a unique directory that looks something like homa_experiment_YYYYMMDDHHMMSS. You can find the computed slowdown numbers (in slowdownImpl.txt), the raw message round-trip latency numbers (in *_experiment.txt), and some RAMCloud log files inside that directory.
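After several invocations it helps to pick out the most recent result directory. The following is a minimal sketch, assuming the directory names match homa_experiment_ followed by exactly 14 digits as described above; the helper names are hypothetical, and where the directories are created is not stated in this README.

```python
import re
from datetime import datetime

# Assumed naming scheme: homa_experiment_YYYYMMDDHHMMSS
DIR_RE = re.compile(r"^homa_experiment_(\d{14})$")


def experiment_time(dirname):
    """Parse the timestamp in a directory name, or return None."""
    match = DIR_RE.match(dirname)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def latest_experiment(dirnames):
    """Return the most recent experiment directory name, or None."""
    stamped = [(experiment_time(d), d) for d in dirnames]
    stamped = [(t, d) for t, d in stamped if t is not None]
    return max(stamped)[1] if stamped else None


# Example (the parent directory is an assumption):
#   import os
#   print(latest_experiment(os.listdir(".")))
```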