
Releases: futurewei-cloud/merak

v0.2-alpha

14 Oct 19:10
4e5d817
Pre-release

Release Summary:

This release focuses on performance and scalability of large-scale EPM (Emulated Physical Machine) and EVM (Emulated Virtual Machine) deployments on a Kubernetes cluster, as well as integration testing with Alcor and ACA. The highlights of the v0.2 release are as follows:

  • EPM deployment:
    • Scenario Manager can specify the virtual network controller's URL (e.g., the Alcor API's URL) and the image of the virtual network agent (e.g., ACA) to be installed in each EPM (see the configuration sketch after this list).
    • Merak Topology can deploy up to 20K EPMs (Pods) in a Kubernetes cluster with 500 worker nodes (m5.4xlarge instances with 16 vCPUs and 64 GB of memory).
    • The average time to create an EPM is around 0.27~0.3 s.
  • EVM deployment:
    • Scenario Manager can specify the VPC, subnet, and number of EVMs per EPM to be deployed on the overlay virtual network.
    • Merak Compute and Merak Agent can deploy up to 30K EVMs across 100 EPMs on a single Kubernetes worker node (an r6i.32xlarge with 128 vCPUs and 1 TB of memory). The average CPU usage time is around 2 s, and each EVM uses about 1.1 MB of memory.
    • Merak can deploy up to 1M EVMs in a Kubernetes environment with 40 worker nodes (m5.4xlarge instances with 16 vCPUs and 64 GB of memory) and 500 EPMs. The density of EVMs per EPM can reach 2,000.
  • Integration test with Alcor and Alcor-Control-Agent (ACA), deploying 20K EVMs in 60 minutes.
  • Code improvements & release with 14 PRs.
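
As a rough illustration of the first highlight, the per-EPM settings the Scenario Manager can now carry might look like the Go sketch below. The struct and field names are hypothetical, not Merak's actual schema:

```go
// Hypothetical sketch of the per-EPM settings the Scenario Manager can now
// specify; names are illustrative and do not match Merak's real schema.
type EPMSpec struct {
	ControllerURL string // virtual network controller endpoint, e.g., the Alcor API URL
	AgentImage    string // virtual network agent container image, e.g., an ACA image
	EVMsPerEPM    int    // number of EVMs to deploy on this EPM
}
```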

Performance Test Results

  • EPM deployment

    • This performance test focuses on large-scale EPM (Pod) deployment and collects performance metrics (such as the elapsed time of each stage of creating an EPM) at different deployment scales.

    • The following table shows the elapsed time each stage took in a deployment.

      Deployment (# of pods) | Total Time (mins) | Generate Pod Config (mins) | Write Pod Config to Meshnet CRD (mins) | Creating Pods (mins) | Rate (pods/sec) | Rate (sec/pod)
      ---------------------- | ----------------- | -------------------------- | -------------------------------------- | -------------------- | --------------- | --------------
      20k                    | 90                | 2                          | 18                                     | 70                   | 3.7             | 0.27
      10k                    | 45                | 2                          | 8                                      | 35                   | 3.8             | 0.27
      5k                     | 23                | 1                          | 5                                      | 17                   | 3.6             | 0.276
      1k                     | 5                 | 0.5                        | 1                                      | 3.5                  | 3.3             | 0.3
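      For reference, the rate columns follow directly from the totals: e.g., for the 20k deployment, 20,000 pods / (90 min × 60 s/min) ≈ 3.7 pods/s, or equivalently ≈ 0.27 s/pod.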
    • The following figure shows a stacked column chart of the elapsed time for EPM deployment.

  • EVM deployment in a single worker node

    • This performance test pushes the limits of EVM deployment on a single worker node, without the virtual network controller (Alcor) or the virtual network agent (ACA) involved.
    • We conducted the test in a 2-worker-node Kubernetes cluster. A dedicated worker node runs the vhost (EPM) instances, while the other worker node hosts the control services (such as the Merak services and Alcor services). The worker node dedicated to the EPMs is an AWS r6i.32xlarge instance with 128 vCPUs and 1 TB of memory.
    • Each test has 100 EPMs and uses the Merak Compute workflows and workers with the default settings of 10k concurrent requests and 100k RPS for EVM creation requests to the Merak Agents.
    • The following figure shows the memory usage and elapsed time for the 10k, 20k, and 30k EVM deployments.
    • Spikes in the maximum CPU usage (measured as time in seconds) occur when the virtual network devices are brought up. An initial spike always occurs because each EVM creation workflow synchronizes on bringing all of its devices up at the same time. The average CPU usage time across each entire test is about 2 seconds.
    • Each EVM (network namespace, tap device, veth pair, and bridge) uses about 1.1 MB once created; extra memory is needed transiently while the virtual network devices and network namespaces are being created (a sketch of this plumbing follows below).
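
To make the per-EVM plumbing concrete, the sketch below uses the vishvananda/netlink and vishvananda/netns Go libraries to create the network namespace, bridge, veth pair, and tap device named above. It is a minimal illustration (the createEVM function and the device-name suffixes are assumptions), not the Merak Agent's actual implementation:

```go
package main

import (
	"runtime"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// createEVM builds the per-EVM devices described above: a network namespace,
// a bridge, a veth pair, and a tap device. Names are illustrative.
func createEVM(name string) error {
	// Namespace switches apply per OS thread, so pin this goroutine first.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	host, err := netns.Get() // remember the host namespace
	if err != nil {
		return err
	}
	defer host.Close()

	evmNS, err := netns.NewNamed(name) // create (and enter) the EVM's namespace
	if err != nil {
		return err
	}
	defer evmNS.Close()
	if err := netns.Set(host); err != nil { // hop back to the host side
		return err
	}

	br := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: name + "-br"}}
	veth := &netlink.Veth{LinkAttrs: netlink.LinkAttrs{Name: name + "-h"}, PeerName: name + "-e"}
	tap := &netlink.Tuntap{LinkAttrs: netlink.LinkAttrs{Name: name + "-tap"}, Mode: netlink.TUNTAP_MODE_TAP}
	for _, l := range []netlink.Link{br, veth, tap} {
		if err := netlink.LinkAdd(l); err != nil {
			return err
		}
	}

	// Attach the host end of the veth pair to the bridge, then push the
	// peer end into the EVM's namespace.
	if err := netlink.LinkSetMaster(veth, br); err != nil {
		return err
	}
	peer, err := netlink.LinkByName(name + "-e")
	if err != nil {
		return err
	}
	if err := netlink.LinkSetNsFd(peer, int(evmNS)); err != nil {
		return err
	}
	return netlink.LinkSetUp(veth) // device bring-up is where CPU spikes at scale
}
```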
  • EVM deployment without Alcor

    • This performance test attempts to deploy as many EVMs as possible in an environment with limited resources, without the overlay virtual network controller (Alcor) or the virtual network agent (ACA) involved.
    • The environment uses AWS m5.4xlarge instances (16 vCPUs and 64 GB of memory) to create a 40-worker-node Kubernetes cluster.
    • Each test has 500 EPMs and uses the Merak Compute workflow with 10k concurrent requests and 100k RPS for EVM creation requests to the Merak Agents (a sketch of this throttling pattern follows this list).
    • The following figure shows the memory usage, elapsed time, and density of EVMs per EPM for the 100k, 200k, 400k, 500k, 600k, 800k, and 1M EVM deployments. The density of EVMs per EPM can reach 2,000.
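
Both the single-node and multi-node tests above mention "10k concurrent" and "100k RPS" client settings. Below is a minimal Go sketch of that throttling pattern, capping in-flight requests with a semaphore and the request rate with golang.org/x/time/rate; the fire function and its parameters are illustrative, not Merak Compute's actual workflow code:

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// fire issues the given requests while capping both the number of in-flight
// requests (concurrency) and the overall request rate (rps).
func fire(ctx context.Context, reqs []func(context.Context) error, concurrency int, rps float64) {
	limiter := rate.NewLimiter(rate.Limit(rps), 1) // global request-rate cap
	sem := make(chan struct{}, concurrency)        // in-flight request cap
	for _, req := range reqs {
		if err := limiter.Wait(ctx); err != nil { // blocks until a rate token is free
			return
		}
		sem <- struct{}{} // blocks while `concurrency` requests are in flight
		go func(r func(context.Context) error) {
			defer func() { <-sem }()
			_ = r(ctx) // e.g., an EVM-creation RPC to a Merak Agent
		}(req)
	}
}
```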
  • EVM deployment with Alcor/ACA integration test

    • This performance test is a real end-to-end EVM test that involves the overlay virtual network control plane (Alcor) and the virtual network agent (ACA) in each EPM, which receives the goal state from the controller. If the created EVMs belong to the same subnet, they should be able to ping each other (see the reachability sketch at the end of this list).

    • The test is conducted on a 5-worker-node Kubernetes cluster. Each worker node has 16 CPUs and 125 GB of memory. The Alcor virtual network controller is deployed on the same cluster with 7 instances of each microservice; the Merak services are also deployed on the same cluster. Merak Compute sends port-creation and port-update requests to Alcor's port manager at 150 concurrent requests and 150 RPS.

    • The test deploys 1k, 5k, 10k, and 20k EVMs across 10 VPCs and 50 EPMs. The following figure shows the elapsed time and density of EVMs per EPM for each test.

    • Deployments beyond 20k EVMs hit errors in the port updates and neighbor state updates from Alcor's NCM to the ACA in each EPM. The major issues stem from OVS and the NCM's neighbor state propagation; see the pending items for future improvement below.
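
The same-subnet reachability check described above can be pictured with a small sketch: ping one EVM's address from inside another EVM's network namespace via `ip netns exec`. The namespace name and target address are hypothetical:

```go
package main

import (
	"fmt"
	"os/exec"
)

// pingFromEVM runs: ip netns exec <ns> ping -c 1 -W 2 <targetIP>
// i.e., a single ping from inside one EVM's namespace to a peer EVM's address.
func pingFromEVM(ns, targetIP string) error {
	out, err := exec.Command("ip", "netns", "exec", ns,
		"ping", "-c", "1", "-W", "2", targetIP).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ping from %s to %s failed: %v\n%s", ns, targetIP, err, out)
	}
	return nil
}
```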

Feature Improvements

Pending Items for Future Improvement

  • Emulation-specific issues for Alcor
    • The port update (neighbor state) between Alcor and ACA is still the bottleneck for creating more than 50K EVMs. Three major issues need to be fixed in the next release:
      • Issue 1: Too many OVS command-line invocations (ovs-vsctl) from many ACAs hit a single OVS kernel module
      • Issue 2: A large volume of OVS user-space processes need to access one single OVS DB
      • Issue 3: Alcor and Merak run in the same cluster and compete for the same resources
    • Low-priority improvement: modify the NCM and Merak Agent to support a batch mode for sending neighbor state down from the NCM to the ACA (a batching sketch follows below)
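
One possible direction for Issue 1 and the batch-mode item, sketched below: ovs-vsctl accepts many sub-commands chained with "--" in a single invocation, so N per-port calls can collapse into one process and one OVSDB transaction. This is only an illustration of the batching idea, not the planned NCM/ACA batch protocol:

```go
package main

import "os/exec"

// addPortsBatched builds a single ovs-vsctl invocation like:
//   ovs-vsctl add-port br0 p0 -- add-port br0 p1 -- add-port br0 p2
// so N port additions become one process and one OVSDB transaction.
func addPortsBatched(bridge string, ports []string) error {
	var args []string
	for i, p := range ports {
		if i > 0 {
			args = append(args, "--")
		}
		args = append(args, "add-port", bridge, p)
	}
	return exec.Command("ovs-vsctl", args...).Run()
}
```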

v0.1.1-alpha

16 Sep 20:18
82e613d
Pre-release

Release Summary:

This release focuses on improvements to all fundamental functions and on integration testing with Alcor and ACA, as well as performance and scalability testing on AWS. Some highlights of the v0.1.1 release:

  • Scalability
    • Deploy 10K pods on AWS EC2 with 200+ VM nodes at a < 3% failure rate
    • Deploy 500K VMs on 10K pods in a couple of minutes
  • Merak Architecture
    • Remove gateways between Alcor and Meshnet
    • Resolve several issues in Alcor and containerized ACA
  • Merak Topology
    • Create multiple layers of vswitches to reduce the load on the core and aggregate vswitches (a toy sketch of layered link generation follows this list)
    • Rewrite the link creation algorithm to improve the performance of creating topologies
  • Merak Compute and Agent
    • Deploy VMs (namespaces) concurrently to improve the performance of VM deployment
    • Return VM status to the Scenario Manager immediately after deploy/check/delete compute operations
  • Merak Network
    • Fix issues in the DB and deletion functions
    • Add a feature to remove registered compute nodes
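
To make the "multiple layers of vswitches" idea concrete, here is a toy Go sketch that generates (parent, child) vswitch link pairs layer by layer. The genLinks function and naming scheme are hypothetical, not Merak Topology's actual link creation algorithm:

```go
package main

import "fmt"

// genLinks returns (parent, child) vswitch link pairs for a tree with the
// given fan-out per layer, e.g., fanout = []int{2, 3} builds 1 core switch,
// 2 aggregate switches, and 6 edge switches.
func genLinks(fanout []int) [][2]string {
	var links [][2]string
	parents := []string{"core-0"}
	for layer, k := range fanout {
		var children []string
		for _, p := range parents {
			for i := 0; i < k; i++ {
				c := fmt.Sprintf("sw%d-%s-%d", layer+1, p, i)
				links = append(links, [2]string{p, c})
				children = append(children, c)
			}
		}
		parents = children
	}
	return links
}
```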

Features Released

Designed feature development

  • Merak Architecture
    • Separate the control plane IP and data path IP in each compute node (PR#754)
    • Fix the issue where ACA didn't respond to the NCM (PR#283)
    • Fix the neighbor info update in the DPM (PR#754)
    • Fix the Alcor DPM Ignite DB cache lock issue (PR#754)
  • Scenario Manager
  • Merak Compute and Agent
    • Improve concurrent processing for VM deployment and deletion (PR#95, Issue#82)
    • Return VM status (PR#92) and refactor the agent (PR#99)
    • Fix the delete-compute issue (PR#89)
  • Merak Topology
    • Remove gateways between the container network and the meshnet network (PR#67)(Issue#67)
    • Create multiple layers of vswitches for topology deployment (PR#104)
    • Rewrite the link generation algorithm (Issue#50)(PR#104)
    • Enable image information retrieval from the user's input (Issue#54)(PR#104)
  • Merak Network
  • Merak Test
    • Draft design for the network test across subnets/VMs (e.g., pingall) (PR#106)
  • Merak CI/CD
    • Established Merak CI with GitHub Actions on multiple AWS VMs (PR#60)
  • Common
    • Protobuf refactor (PR#66)

Merak Integration Test & Scalability

  • Scalability
    • Deploy 10K pods on AWS EC2 with 200 VMs
      • Best case: deployed 10K vhosts + 262 vswitches in 60 mins with 2 vhosts failing => 0.02% failure rate
      • Normal case: 10K pods with 300 vhosts failing => 3% failure rate
    • Deploy 500k VMs (namespaces) on 10K pods in a couple of minutes
      • However, the VM status return messages were too large for gRPC's default maximum message size (a sketch of the usual fix follows below)
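
The usual fix for this is to raise the gRPC maximum message sizes on both ends (Go's default receive limit is 4 MB). A minimal sketch with an illustrative 64 MB cap; the endpoint and size are assumptions, not Merak's actual settings:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const maxMsgBytes = 64 * 1024 * 1024 // illustrative 64 MB cap; default receive limit is 4 MB

// Client side: allow large VM-status replies.
func dialAgent(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgBytes),
			grpc.MaxCallSendMsgSize(maxMsgBytes),
		),
	)
}

// Server side: accept and send large messages as well.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgBytes),
		grpc.MaxSendMsgSize(maxMsgBytes),
	)
}
```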

v0.1-alpha

02 Aug 18:10
369d7a2
Pre-release

Release Summary

This release focuses on enabling the system architecture design, fundamental functional tests, and code development. Some highlights of the v0.1 release:

  • Design overall architecture and deliver five key components:
    • Scenario Manager - collect the user's input for deployment and test scenarios and issue commands to the other components
    • Merak Topology - deploy/destroy pods with a given topology
    • Merak Network - create/delete network infrastructure resources (e.g., VPC, subnet, security group)
    • Merak Compute - create/delete virtual machines and collect test results from agents
    • Merak Agent - create virtual network devices (bridges, tap devices, veth pairs, and namespaces) for VMs and collect test results
  • Enable three critical workflows:
    • Flexible underlay network construction (e.g., deploy & destroy a fat-tree topology)
    • Realistic overlay network provisioning by the Alcor cloud CP (e.g., creating VPCs and ports)
    • Scalable emulation of virtual machines in the emulated compute nodes
  • Integration test of Merak with Alcor and Alcor-Control-Agent (ACA) for large-scale compute node and virtual machine deployment
    • Ability to emulate 10k compute nodes in a k8s cluster
    • Ability to emulate the creation of 200k virtual machines
  • Code development & release with 46 PRs.

Features Released

Designed feature development

  • Component design and development
    • Merak Topology: Deploy a fat-tree topology with VMs (vswitches and vhosts) to construct the underlay network in a K8s cluster (PRs #13, #26, #36, #38)
    • Merak Network: Configure the underlay and overlay networks with Alcor (PRs #24, #25, #37, #43)
    • Merak Compute: Create VMs in the emulated CNs with the Merak Agent and Alcor-Control-Agent (ACA) in each CN (PRs #11, #22, #27, #33, #34)
    • Scenario Manager: Issue deploy/destroy/check commands to each component (PRs #15, #16, #19, #32, #39, #40, #42, #45, #46)

Merak Integration Test & Scalability

  • Integration test

    • Perform an end-to-end test of Merak with Alcor and ACA for large VPC deployments
      • L2 neighbor state - all virtual machines in the same subnet are able to ping each other
      • L3 neighbor state - virtual machines belonging to different subnets can ping each other through a router
  • Scalability design

    • Ability to emulate > 10K compute nodes on an AWS K8s cluster (200 AWS VMs)
    • Ability to emulate large-scale data center topologies (like fat-tree) with a large volume of switches and routers
    • Ability to emulate > 200K VMs within the emulated compute nodes (20 VMs per compute node)

Merak Fundamental

Document