Single Root I/O Virtualization (SR-IOV) enables splitting a physical device into virtual functions (VFs). Virtual functions enable direct passthrough to virtual machines or containers. For Kata Containers, we enabled a Container Network Model (CNM) plugin. Additionally, we made the necessary changes in the runtime to detect virtual functions in a container's network namespace so that SR-IOV can be used for network-based devices.
To create a network with associated VFs, which can be passed to Kata Containers, you must install an SR-IOV Docker plugin. The created network is based on a physical function (PF) device. The network can serve n containers, where n is the number of VFs associated with the PF.
To install the plugin, follow the plugin installation instructions.
To set up your host for SR-IOV, the following must be true:
- The host system must support Intel VT-d.
- Your device (NIC) must support SR-IOV.
- The host kernel must have Input-Output Memory Management Unit (IOMMU) and Virtual Function I/O (VFIO) support.
- CONFIG_VFIO_NOIOMMU must be disabled in the host kernel configuration. Disabling it requires rebuilding your host system's kernel.
- Optionally, you might need to add a PCI override for your Network Interface Controller (NIC). The section Checking your NIC for SR-IOV describes how to assess if you need to make NIC changes and how to make the necessary changes.
In addition, you need to enable the NIC driver in your guest kernel configuration (e.g. mlx5 for Mellanox NICs). All the modules need to be compiled as built-in instead of loadable.
The following is an example of how to use lspci to check if your NIC supports SR-IOV.
$ lspci | fgrep -i ethernet
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 03)
...
$ #sudo required below to read the card capabilities
$ sudo lspci -s 01:00.0 -v | grep SR-IOV
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
If your card does not report this capability, then it does not support SR-IOV.
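The capability check above can be wrapped in a small helper for scripting. This is a minimal sketch: the function name has_sriov_capability is hypothetical, and the only thing it relies on is the capability string shown in the lspci output above.

```shell
#!/bin/sh
# has_sriov_capability (hypothetical helper name)
# Reads `lspci -v` output on stdin and exits with status 0 if the
# SR-IOV capability line is present, non-zero otherwise.
has_sriov_capability() {
    # -F: match the text literally, -q: no output, only an exit status
    grep -qF 'Single Root I/O Virtualization (SR-IOV)'
}

# Example use on a real host (root is needed to read the capability list):
#   sudo lspci -s 01:00.0 -v | has_sriov_capability && echo "SR-IOV supported"
```

Reading from stdin keeps the helper independent of the PCI address, so the same function works for any device you pipe into it.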
Run the following command to see how the IOMMU groups are set up on your host system:
$ find /sys/kernel/iommu_groups/ -type l
The command's output shows whether or not your NIC is set up appropriately with respect to PCIe Access Control Services (ACS). If the IOMMU groups are set up properly, the PCI device for each ACS-enabled NIC port should be in its own IOMMU group. If the PCI bridge is within the same IOMMU group as your NIC, it indicates that either your device does not support ACS or your device does not appropriately share this default capability.
If you do not see any output when running the previous command, then you likely need to update your host's kernel configuration.
For more details, see the blog post "IOMMU Groups, inside and out".
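Scanning the find output by eye is error-prone on hosts with many devices. The following sketch lists the devices that share an IOMMU group with a given PCI address. The function name iommu_group_peers and its optional second argument are assumptions made for illustration; the directory layout (one numbered directory per group, with a devices subdirectory of links) is the one the find command above walks.

```shell
#!/bin/sh
# iommu_group_peers PCI_ADDR [ROOT]   (hypothetical helper name)
# Prints every other device that shares an IOMMU group with PCI_ADDR,
# one per line. Prints nothing when the device is alone in its group.
# PCI_ADDR must include the domain, for example 0000:01:00.0.
# ROOT defaults to /sys/kernel/iommu_groups.
iommu_group_peers() (
    addr=$1
    root=${2:-/sys/kernel/iommu_groups}
    for link in "$root"/*/devices/"$addr"; do
        # Skip the unexpanded glob pattern when nothing matched.
        [ -e "$link" ] || [ -L "$link" ] || continue
        for peer in "${link%/*}"/*; do
            name=${peer##*/}
            [ "$name" = "$addr" ] || printf '%s\n' "$name"
        done
    done
)

# Example: iommu_group_peers 0000:01:00.0
```

An empty result means the NIC port is isolated in its own group. Any output lists the devices that would have to be handled together with it.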
Depending on your host kernel configuration, you might have to rebuild the kernel. If the following conditions are true, you do not need to rebuild your kernel:
- CONFIG_VFIO_IOMMU_TYPE1, CONFIG_VFIO, and CONFIG_VFIO_PCI are set in the kernel configuration. Your kernel is built with VFIO support when these options are set.
- CONFIG_VFIO_NOIOMMU is disabled in the host kernel configuration.
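The two conditions above can be checked against a kernel configuration file in one pass. This is a sketch under stated assumptions: the function name vfio_config_ok is hypothetical, and it treats an option as set when it is either built in (=y) or a module (=m).

```shell
#!/bin/sh
# vfio_config_ok CONFIG_FILE   (hypothetical helper name)
# Succeeds if CONFIG_FILE sets CONFIG_VFIO, CONFIG_VFIO_PCI and
# CONFIG_VFIO_IOMMU_TYPE1 to y or m, and does not enable
# CONFIG_VFIO_NOIOMMU. Reports the first failing option on stderr.
vfio_config_ok() (
    cfg=$1
    for opt in CONFIG_VFIO CONFIG_VFIO_PCI CONFIG_VFIO_IOMMU_TYPE1; do
        if ! grep -Eq "^${opt}=(y|m)\$" "$cfg"; then
            echo "missing: $opt" >&2
            exit 1
        fi
    done
    if grep -q '^CONFIG_VFIO_NOIOMMU=y' "$cfg"; then
        echo "must be disabled: CONFIG_VFIO_NOIOMMU" >&2
        exit 1
    fi
)

# Example: vfio_config_ok "/boot/config-$(uname -r)" && echo "no rebuild needed"
```

A zero exit status means the running kernel's configuration already meets both conditions and the rebuild steps can be skipped.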
See the following steps one through three if you need to rebuild the kernel.
The following steps, which are based on the Ubuntu 16.04 distribution, update the SR-IOV host system. If you use a different distribution, make appropriate adjustments to the commands.
Before building a new kernel, keep in mind:
- You need to be very clear about the security and maintenance implications of creating a new host kernel.
- Mistakes in installing new kernels and updating the bootloader could make your system unbootable.
- We advise you to ensure you have a recent (and tested) full system backup before proceeding.
- Grab kernel sources:
  $ sudo apt-get install linux-source-<linux-version>
  $ sudo apt-get install linux-headers-<linux-version>
  $ cd /usr/src/linux-source-<linux-version>/
  $ sudo tar -xvf linux-source-<linux-version>.tar.bz2
  $ cd linux-source-<linux-version>
  $ sudo apt-get install libssl-dev
- Examine and update the config file if necessary:
  $ sudo cp /boot/config-4.8.0-36-generic .config
  $ # verify resulting .config does not have NOIOMMU set; ie: `CONFIG_VFIO_NOIOMMU` is not set
  $ grep -q "^CONFIG_VFIO_NOIOMMU" .config || echo ok
  $ # verify `CONFIG_VFIO_IOMMU_TYPE1`, `CONFIG_VFIO=m` and `CONFIG_VFIO_PCI=m` are set as well.
  $ for opt in CONFIG_VFIO_IOMMU_TYPE1 CONFIG_VFIO CONFIG_VFIO_PCI; do grep "^${opt}=" .config; done
  $ sudo make olddefconfig
  You might want to modify the kernel Makefile to add a unique identifier to the EXTRAVERSION variable prior to running make. Setting the EXTRAVERSION variable causes the uname -r command to indicate that a customized kernel is installed and running.
- Build and install the kernel:
  $ make -j <number_of_cpus>
  $ make modules
  $ sudo make modules_install
  $ sudo make install
- Edit grub to enable intel_iommu:
  Edit /etc/default/grub and add intel_iommu=on to the kernel command line:
  $ sudo sed -i -e 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on"/g' /etc/default/grub
  $ sudo update-grub
- Reboot the system and verify:
  The host system should be ready now. Reboot the system.
  $ sudo reboot
  To verify the kernel version and the kernel command line, take a look at /proc/version and /proc/cmdline.
- Verify Intel VT-d is initialized:
  To check if Intel VT-d initialized correctly, look for the following line in the dmesg output:
  DMAR: Intel(R) Virtualization Technology for Directed I/O
  Older kernels use a different prefix (e.g. PCI-DMA):
  PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
- Add the vfio-pci module:
  $ sudo modprobe vfio-pci
- Add PCI quirk for SR-IOV NIC if necessary:
  $ find /sys/kernel/iommu_groups/ -type l
  Use the output of the previous command to verify that your NIC appears in its own IOMMU group and no other devices appear in the same group. In the rare case where your PCI NIC does not appear in its own group, it is likely that the NIC does not support ACS or you built and ran an old kernel. Depending on your NIC and if it enforces isolation, you might resolve this by adding a pcie_acs_override= option to your kernel command line and rebooting. See PCIE-ACS-override-option for detailed information about this option.
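Running a sed substitution on /etc/default/grub more than once appends the option each time. The sketch below makes the grub edit from the steps above idempotent. The function name add_intel_iommu is hypothetical, it assumes GNU sed (for -i) and a GRUB_CMDLINE_LINUX="..." line in the file, and it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# add_intel_iommu GRUB_FILE   (hypothetical helper name)
# Appends intel_iommu=on to the GRUB_CMDLINE_LINUX="..." line of
# GRUB_FILE unless that line already contains it. Other lines, such as
# GRUB_CMDLINE_LINUX_DEFAULT, are left untouched.
add_intel_iommu() (
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX=".*intel_iommu=on' "$file"; then
        exit 0   # already present, nothing to do
    fi
    sed -i -e 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on"/' "$file"
)

# Example (from a root shell, since sudo cannot call shell functions):
#   add_intel_iommu /etc/default/grub && update-grub
```

Trying the function on a copy of the file and diffing the result is a cheap way to confirm the change before touching the real bootloader configuration.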
All the steps in prior sections need to be performed just once to prepare the SR-IOV host systems. The following is needed per system boot in order to facilitate setting up a physical device's virtual functions.
The following procedure sets up your SR-IOV device and needs to be done per system boot. Setup includes loading a device driver, finding out how many virtual functions (VFs) you can create, and creating those virtual functions. Once you create VFs, you cannot increase or decrease the number of VFs without first setting the number back to zero. Based on this, it is expected that you set the number of VFs for a physical device just once.
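The reset-to-zero rule can be captured in a small helper. This is a sketch, not part of the original procedure: the function name set_numvfs is hypothetical, and it relies on the sriov_totalvfs and sriov_numvfs files that each SR-IOV capable PCI device exposes in its sysfs directory.

```shell
#!/bin/sh
# set_numvfs PF_DIR COUNT   (hypothetical helper name)
# PF_DIR is the sysfs directory of the physical function, for example
# /sys/bus/pci/devices/0000:01:00.0.
# Refuses a COUNT above sriov_totalvfs, resets sriov_numvfs to 0,
# then writes COUNT.
set_numvfs() (
    dir=$1
    want=$2
    total=$(cat "$dir/sriov_totalvfs") || exit 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, device supports $total" >&2
        exit 1
    fi
    # The count can only be changed after it has been set back to zero.
    echo 0 > "$dir/sriov_numvfs" || exit 1
    echo "$want" > "$dir/sriov_numvfs"
)

# Example (from a root shell): set_numvfs /sys/bus/pci/devices/0000:01:00.0 4
```

Checking the limit before the reset means a request that is too large leaves the existing VFs in place instead of tearing them down.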
- Add vfio-pci device driver:
  $ sudo modprobe vfio-pci
  vfio-pci is a driver used to reserve a VF PCI device.
- Find the NICs of interest:
  $ lspci | grep Ethernet
  00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
  01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
  01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
  The previous example finds the PCI details for the NICs in question. In our case, 01:00.0 and 01:00.1 are the two ports on our X540-AT2 card that we will use. You can use the lshw command to get further details on the controller and verify it supports SR-IOV.
- Check how many VFs you can create:
  $ cat /sys/bus/pci/devices/0000\:01\:00.0/sriov_totalvfs
  63
  $ cat /sys/bus/pci/devices/0000\:01\:00.1/sriov_totalvfs
  63
  The previous commands show how many VFs you can create. The sriov_totalvfs file under sysfs for a PCI device specifies the total number of VFs that you can create.
- Create the VFs:
  $ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  $ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.1/sriov_numvfs
  Create virtual functions by editing sriov_numvfs. This example creates one VF per physical device. Note that creating only one VF eliminates the usefulness of SR-IOV and is done for simplicity in this example.
- Verify the VFs were added to the host:
  $ sudo lspci | grep Ethernet | grep Virtual
  02:10.0 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
  02:10.1 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
- Assign a MAC address to each VF:
  $ sudo ip link set <pf> vf <vfidx> mac <fake MAC address>
  Depending on the NIC being used, you might need to explicitly set the MAC address for the VF device. Setting the MAC address guarantees that the address is consistent on the host and when passed to the guest. Verify a MAC address is assigned to the VF using the command ip link show dev <vf>.
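The MAC assignment step above needs an address to assign. One way to produce it, shown here as a sketch, is to generate a random address with the locally administered bit set and the multicast bit cleared, so it is a valid unicast address outside the vendor-assigned range. The function name random_vf_mac is hypothetical, and this generation scheme is an assumption rather than something the procedure prescribes.

```shell
#!/bin/sh
# random_vf_mac   (hypothetical helper name)
# Prints a random MAC address in aa:bb:cc:dd:ee:ff form.
# In the first octet, bit 1 (locally administered) is set and
# bit 0 (multicast) is cleared.
random_vf_mac() (
    # Six random bytes as unsigned decimals, split into $1..$6.
    set -- $(od -An -N6 -tu1 /dev/urandom)
    first=$(( ($1 | 2) & 254 ))
    printf '%02x:%02x:%02x:%02x:%02x:%02x\n' "$first" "$2" "$3" "$4" "$5" "$6"
)

# Example: sudo ip link set <pf> vf <vfidx> mac "$(random_vf_mac)"
```

If the address must stay the same across reboots, generate it once and store it, since each call returns a different value.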
The following example launches a Kata Containers container using SR-IOV:
- Build and start SR-IOV plugin:
  To install the SR-IOV plugin, follow the SR-IOV plugin installation instructions.
- Create the docker network:
  $ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet
  E0505 09:35:40.550129 2541 plugin.go:297] Numvfs and Totalvfs are not same on the PF - Initialize numvfs to totalvfs
  ee2e5a594f9e4d3796eda972f3b46e52342aea04cbae8e5eac9b2dd6ff37b067
  The previous command creates the SR-IOV docker network and sets its subnet, vlanid, and physical interface.
- Start containers and test their connectivity:
  $ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it alpine
  The previous example starts a container making use of SR-IOV. If two machines with SR-IOV enabled NICs are connected back-to-back and each has a network with a matching vlanid, use the following two commands to test the connectivity:
  Machine 1:
  sriov-1:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -s"
  Machine 2:
  sriov-2:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.11 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -c 192.168.0.10"