diff --git a/source/access_to_services.rst b/source/access_to_services.rst
new file mode 100644
index 0000000..aa66ebc
--- /dev/null
+++ b/source/access_to_services.rst
@@ -0,0 +1,134 @@
+.. include:: vars.rst
+
+==================
+Access to Services
+==================
+
+OpenStack Services
+==================
+
+Accessing Horizon
+-----------------
+
+The OpenStack web UI is available at: |horizon_url|
+
+This site is accessible |horizon_access|.
+
+Accessing the OpenStack CLI
+---------------------------
+
+A simple way to get started with the OpenStack command-line interface is to
+install it in a Python virtual environment.
+
+This can be done from |public_api_access_host| (for example), or from any
+machine that has access to |public_vip|:
+
+.. code-block:: console
+
+   openstack# python3 -m venv openstack-venv
+   openstack# source openstack-venv/bin/activate
+   openstack# pip install -U pip
+   openstack# pip install python-openstackclient
+   openstack# source -openrc.sh
+
+The ``-openrc.sh`` file can be downloaded from the OpenStack Dashboard
+(Horizon):
+
+.. image:: _static/openrc.png
+   :alt: Downloading an openrc file from Horizon
+   :class: no-scaled-link
+   :width: 200
+
+Now it should be possible to run OpenStack commands:
+
+.. code-block:: console
+
+   openstack# openstack server list
+
+Accessing Deployed Instances
+----------------------------
+
+The external network of OpenStack, called |public_network|, connects to the
+subnet |public_subnet|. This network is accessible |floating_ip_access|.
+
+Any OpenStack instance can make outgoing connections to this network, via a
+router that connects the internal network of the project to the
+|public_network| network.
+
+To enable incoming connections (e.g. SSH), a floating IP is required. A
+floating IP is allocated and associated with an instance via OpenStack.
+Security groups must be set to permit the kind of connectivity required
+(i.e. to define the ports that must be opened).
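+
+For example, a floating IP can be allocated from |public_network| and
+associated with an instance, and SSH access permitted, using the OpenStack
+CLI. This is a minimal sketch: the server name and floating IP address are
+placeholders, and the security group rule opens port 22 to all source
+addresses:
+
+.. code-block:: console
+   :substitutions:
+
+   openstack# openstack floating ip create |public_network|
+   openstack# openstack server add floating ip my-server 203.0.113.10
+   openstack# openstack security group rule create --protocol tcp --dst-port 22 --remote-ip 0.0.0.0/0 default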
+
+Monitoring Services
+===================
+
+Access to OpenSearch Dashboards
+-------------------------------
+
+OpenStack control plane logs are aggregated from all servers by Fluentd and
+stored in OpenSearch. The control plane logs can be accessed using OpenSearch
+Dashboards, which is available at the following URL: |opensearch_dashboard_url|
+
+To log in, use the ``opensearch`` user. The password is auto-generated by
+Kolla-Ansible and can be extracted from the encrypted passwords file
+(|kolla_passwords|):
+
+.. code-block:: console
+   :substitutions:
+
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^opensearch
+
+Access to Grafana
+-----------------
+
+Control plane metrics can be visualised in Grafana dashboards. Grafana can be
+found at the following address: |grafana_url|
+
+To log in, use the |grafana_username| user. The password is auto-generated by
+Kolla-Ansible and can be extracted from the encrypted passwords file
+(|kolla_passwords|):
+
+.. code-block:: console
+   :substitutions:
+
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^grafana_admin_password
+
+Access to Prometheus Alertmanager
+---------------------------------
+
+Control plane alerts can be visualised and managed in Alertmanager, which can
+be found at the following address: |alertmanager_url|
+
+To log in, use the ``admin`` user. The password is auto-generated by
+Kolla-Ansible and can be extracted from the encrypted passwords file
+(|kolla_passwords|):
+
+.. code-block:: console
+   :substitutions:
+
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^prometheus_alertmanager_password
+
+
+.. ifconfig:: deployment['wazuh']
+
+   Access to Wazuh Manager
+   -----------------------
+
+   To access the Wazuh Manager dashboard, navigate to the IP address
+   of |wazuh_manager_name| (|wazuh_manager_url|).
+
+   You can log in to the dashboard with the username ``admin``. The
+   password for ``admin`` is defined in the secret
+   ``opendistro_admin_password``, which can be found within
+   ``etc/kayobe/inventory/group_vars/wazuh-manager/wazuh-secrets.yml``.
+
+   .. note::
+
+      Use ``ansible-vault`` to view Wazuh secrets:
+
+      .. code-block:: console
+         :substitutions:
+
+         kayobe# ansible-vault view --vault-password-file |vault_password_file_path| $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml
diff --git a/source/baremetal_management.rst b/source/baremetal_management.rst
index 3df9f1b..25a1276 100644
--- a/source/baremetal_management.rst
+++ b/source/baremetal_management.rst
@@ -1,18 +1,275 @@
-.. include:: vars.rst
-
 ======================================
 Bare Metal Compute Hardware Management
 ======================================
 
-.. ifconfig:: deployment['ironic']
+Bare metal compute nodes are managed by the Ironic services.
+This section describes elements of the configuration of this service.
+
+.. _ironic-node-lifecycle:
+
+Ironic node life cycle
+----------------------
+
+The deployment process is documented in the `Ironic User Guide `__.
+This OpenStack deployment uses the
+`direct deploy method `__.
+
+The Ironic state machine can be found `here `__. The rest of
+this documentation refers to these states and assumes familiarity with them.
+
+High level overview of state transitions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following sections describe the state transitions for various Ironic
+operations at a high level, focusing on the steps where dynamic switch
+reconfiguration is triggered. For a more detailed overview, refer to the
+:ref:`ironic-node-lifecycle` section.
+
+Provisioning
+~~~~~~~~~~~~
+
+Provisioning starts when an instance is created in Nova using a bare metal
+flavor (an example is shown after the list below).
+
+- Node starts in the available state (available)
+- User provisions an instance (deploying)
+- Ironic will switch the node onto the provisioning network (deploying)
+- Ironic will power on the node and will await a callback (wait-callback)
+- Ironic will image the node with an operating system using the image provided at creation (deploying)
+- Ironic switches the node onto the tenant network(s) via neutron (deploying)
+- Transition node to active state (active)
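+
+The example below is a minimal sketch of provisioning a bare metal instance
+and following its progress through these states. The flavor, image, network,
+key and instance names are placeholders, and the bare metal CLI commands
+require the ``python-ironicclient`` plugin to be installed alongside
+``python-openstackclient``:
+
+.. code-block:: console
+
+   openstack# openstack server create --flavor <baremetal-flavor> --image <image> --network <network> --key-name <keypair> bm-test-1
+   openstack# openstack baremetal node list --fields uuid name provision_state power_state
+   openstack# openstack baremetal node show <node> --fields provision_state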
+
+.. _baremetal-management-deprovisioning:
+
+Deprovisioning
+~~~~~~~~~~~~~~
+
+Deprovisioning starts when an instance created in Nova using a bare metal flavor is destroyed.
+
+If automated cleaning is enabled, it runs when nodes are deprovisioned:
+
+- Node starts in active state (active)
+- User deletes instance (deleting)
+- Ironic will remove the node from any tenant network(s) (deleting)
+- Ironic will switch the node onto the cleaning network (deleting)
+- Ironic will power on the node and will await a callback (clean-wait)
+- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning)
+- Ironic removes node from cleaning network (cleaning)
+- Node transitions to available (available)
+
+If automated cleaning is disabled:
+
+- Node starts in active state (active)
+- User deletes instance (deleting)
+- Ironic will remove the node from any tenant network(s) (deleting)
+- Node transitions to available (available)
+
+Cleaning
+~~~~~~~~
+
+Manual cleaning is not part of the regular state transitions when using Nova; however, nodes can be manually cleaned by administrators.
+
+- Node starts in the manageable state (manageable)
+- User triggers cleaning with API (cleaning)
+- Ironic will switch the node onto the cleaning network (cleaning)
+- Ironic will power on the node and will await a callback (clean-wait)
+- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning)
+- Ironic removes node from cleaning network (cleaning)
+- Node transitions back to the manageable state (manageable)
+
+Rescuing
+~~~~~~~~
+
+This feature is not used. The required rescue network is not currently configured.
+
+Baremetal networking
+--------------------
+
+Baremetal networking with the Neutron Networking Generic Switch ML2 driver requires a combination of static and dynamic switch configuration.
+
+.. _static-switch-config:
+
+Static switch configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Static physical network configuration is managed via Kayobe.
+
+.. TODO: Fill in the switch configuration
+
+- Some initial switch configuration is required before Networking Generic Switch can take over the management of an interface.
+  First, LACP must be configured on the switch ports attached to the baremetal node, e.g.:
+
+  .. code-block:: shell
+
+  The interface is then partially configured:
+
+  .. code-block:: shell
+
+  For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network:
+
+  **NOTE**: You only need to do this if Ironic isn't aware of the node.
+
+Configuration with Kayobe
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Kayobe can be used to apply the :ref:`static-switch-config`.
+
+- Upstream documentation can be found `here `__.
+- Kayobe does all the switch configuration that isn't :ref:`dynamically updated using Ironic `.
+- Optionally switches the node onto the provisioning network (when using ``--enable-discovery``)
+
+  + NOTE: This is a dangerous operation, as it can wipe out the dynamic VLAN configuration applied by Neutron/Ironic.
+    You should only run this when initially enrolling a node, and should always use the ``--interface-description-limit`` option. For example:
+
+    .. code-block::
+
+       kayobe physical network configure --interface-description-limit --group switches --display --enable-discovery
+
+    In this example, ``--display`` is used to preview the switch configuration without applying it.
+
+.. TODO: Fill in information about how switches are configured in kayobe-config, with links
+
+- Configuration is done using a combination of ``group_vars`` and ``host_vars``.
+
+.. _dynamic-switch-configuration:
+
+Dynamic switch configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ironic dynamically configures the switches using the Neutron `Networking Generic Switch `_ ML2 driver.
+
+- Used to toggle the baremetal nodes onto different networks
+
+  + Can use any VLAN network defined in OpenStack, provided that the VLAN has been trunked to the controllers,
+    as this is required for DHCP to function (an example of creating such a network is shown after this list).
+  + See :ref:`ironic-node-lifecycle`, which illustrates when switch reconfigurations happen.
+
+- Only configures VLAN membership of the switch interfaces or port groups. To prevent conflicts with the static switch configuration,
+  the convention used is: after the node is in service in Ironic, VLAN membership should not be manually adjusted and
+  should be left under Ironic's control, i.e. *don't* use ``--enable-discovery`` without an interface limit when configuring the
+  switches with Kayobe.
+- Ironic is configured to use the Neutron networking driver.
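+
+As an illustration, a VLAN network that can be attached to bare metal
+instances might be created as follows. This is a sketch only: the physical
+network name ``physnet1``, the segmentation ID and the subnet range are
+assumptions and must match the VLANs trunked to the controllers and switches
+at your site:
+
+.. code-block:: console
+
+   openstack# openstack network create --provider-network-type vlan --provider-physical-network physnet1 --provider-segment 120 baremetal-tenant-net
+   openstack# openstack subnet create --network baremetal-tenant-net --subnet-range 192.168.120.0/24 baremetal-tenant-subnet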
+
+.. _ngs-commands:
+
+Commands that NGS will execute
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Networking Generic Switch is mainly concerned with toggling the ports onto different VLANs. It
+cannot fully configure the switch.
+
+.. TODO: Fill in the switch configuration
+
+- Switching the port onto the provisioning network
+
+  .. code-block:: shell
+
+- Switching the port onto the tenant network
+
+  .. code-block:: shell
+
+- When deleting the instance, the VLANs are removed from the port, using:
+
+  .. code-block:: shell
+
+NGS will save the configuration after each reconfiguration (by default).
+
+Ports managed by NGS
+^^^^^^^^^^^^^^^^^^^^
+
+The command below extracts a list of port UUIDs, node UUIDs and switch port information:
+
+.. code-block:: bash
+
+   openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value
+
+NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``.
+The rest of the switch configuration is static.
+The switch configuration that NGS will apply to these ports is detailed in :ref:`dynamic-switch-configuration`.
+
+.. _ironic-node-discovery:
+
+Ironic node discovery
+---------------------
+
+Discovery is a process used to automatically enrol new nodes in Ironic.
+It works by PXE booting the nodes into the Ironic Python Agent (IPA) ramdisk.
+This ramdisk will collect hardware and networking configuration from the node in a process known as introspection.
+This data is used to populate the baremetal node object in Ironic.
+The series of steps needed to enrol a new node is as follows:
+
+- Configure credentials on the BMC. These are needed for Ironic to be able to perform power control actions.
+
+- Ensure the controllers have network connectivity with the target BMC.
+
+- (If kayobe manages physical network) Add any additional switch configuration to kayobe config.
+  The minimal switch configuration that kayobe needs to know about is described in :ref:`tor-switch-configuration`.
+
+- Apply any :ref:`static switch configuration `. This performs the initial
+  setup of the switchports that is needed before Ironic can take over. The static configuration
+  will not be modified by Ironic, so it should be safe to reapply at any point. See :ref:`ngs-commands`
+  for details about the switch configuration that Networking Generic Switch will apply.
+
+- (If kayobe manages physical network) Put the node onto the provisioning network by using the
+  ``--enable-discovery`` flag and either ``--interface-description-limit`` or ``--interface-limit``
+  (do not run this command without one of these limits). See :ref:`static-switch-config`.
+
+  * This is only necessary to initially discover the node. Once the node is registered in Ironic,
+    it will take over control of the VLAN membership. See :ref:`dynamic-switch-configuration`.
+
+  * This provides ethernet connectivity with the controllers over the ``workload provisioning`` network.
+
+- (If kayobe doesn't manage physical network) Put the node onto the provisioning network.
+
+.. 
TODO: link to the relevant file in kayobe config + +- Add node to the kayobe inventory. + +.. TODO: Fill in details about necessary BIOS & RAID config + +- Apply any necesary BIOS & RAID configuration. + +.. TODO: Fill in details about how to trigger a PXE boot + +- PXE boot the node. + +- If the discovery process is successful, the node will appear in Ironic and will get populated with the necessary information from the hardware inspection process. + +.. TODO: Link to the Kayobe inventory in the repo + +- Add node to the Kayobe inventory in the ``baremetal-compute`` group. + +- The node will begin in the ``enroll`` state, and must be moved first to ``manageable``, then ``available`` before it can be used. + + If Ironic automated cleaning is enabled, the node must complete a cleaning process before it can reach the available state. + + * Use Kayobe to attempt to move the node to the ``available`` state. + + .. code-block:: console + + source etc/kolla/public-openrc.sh + kayobe baremetal compute provide --limit + +- Once the node is in the ``available`` state, Nova will make the node available for scheduling. This happens periodically, and typically takes around a minute. + +.. _tor-switch-configuration: + +Top of Rack (ToR) switch configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Networking Generic Switch must be aware of the Top-of-Rack switch connected to the new node. +Switches managed by NGS are configured in ``ml2_conf.ini``. + +.. TODO: Fill in details about how switches are added to NGS config in kayobe-config + +After adding switches to the NGS configuration, Neutron must be redeployed. - The |project_name| cloud includes bare metal compute nodes managed by the - Ironic services. This section describes elements of the configuration of - this service. +Considerations when booting baremetal compared to VMs +------------------------------------------------------ - .. include:: include/baremetal_management.rst +- You can only use networks of type: vlan +- Without using trunk ports, it is only possible to directly attach one network to each port or port group of an instance. -.. ifconfig:: not deployment['ironic'] + * To access other networks you can use routers + * You can still attach floating IPs - The |project_name| cloud does not include bare metal compute nodes managed - by the Ironic services. +- Instances take much longer to provision (expect at least 15 mins) +- When booting an instance use one of the flavors that maps to a baremetal node via the RESOURCE_CLASS configured on the flavor. diff --git a/source/ceph_storage.rst b/source/ceph_storage.rst deleted file mode 100644 index e45e914..0000000 --- a/source/ceph_storage.rst +++ /dev/null @@ -1,38 +0,0 @@ -.. include:: vars.rst - -============ -Ceph Storage -============ - -.. ifconfig:: deployment['ceph'] - - The |project_name| deployment uses Ceph as a storage backend. - -.. ifconfig:: deployment['ceph_managed'] - - The Ceph deployment is managed by StackHPC Ltd. - -.. ifconfig:: not deployment['ceph_managed'] - - The Ceph deployment is not managed by StackHPC Ltd. - -Working with Ceph deployment tool -================================= - -.. ifconfig:: deployment['ceph_ansible'] - - .. include:: include/ceph_ansible.rst - -.. ifconfig:: deployment['cephadm'] - - .. include:: include/cephadm.rst - -Operations -========== - -.. include:: include/ceph_operations.rst - -Troubleshooting -=============== - -.. 
include:: include/ceph_troubleshooting.rst diff --git a/source/conf.py b/source/conf.py index 08efee8..3bae514 100644 --- a/source/conf.py +++ b/source/conf.py @@ -18,7 +18,7 @@ # -- Project information ----------------------------------------------------- project = 'OpenStack Administration Guide' -copyright = '2020-2023, StackHPC Ltd' +copyright = '2020-2024, StackHPC Ltd' author = 'StackHPC Ltd' diff --git a/source/customising_deployment.rst b/source/customising_deployment.rst deleted file mode 100644 index 2f1dc5c..0000000 --- a/source/customising_deployment.rst +++ /dev/null @@ -1,172 +0,0 @@ -.. include:: vars.rst - -==================================== -Customising the OpenStack Deployment -==================================== - -Horizon customisations -====================== - -Horizon is the most frequent site-specific container customisation required: -other customisations tend to be common across deployments, but personalisation -of Horizon is unique to each institution. - -This describes a simple process for customising the Horizon theme. - -Creating a custom Horizon theme -------------------------------- - -A simple custom theme for Horizon can be implemented as small modifications of -an existing theme, such as the `Default -`__ -one. - -A theme contains at least two files: ``static/_styles.scss``, which can be empty, and -``static/_variables.scss``, which can reference another theme like this: - -.. code-block:: scss - - @import "/themes/default/variables"; - @import "/themes/default/styles"; - -Some resources such as logos can be overridden by dropping SVG image files into -``static/img`` (since the Ocata release, files must be SVG instead of PNG). See -`the Horizon documentation -`__ -for more details. - -Content on some pages such as the splash (login) screen can be updated using -templates. - -See `our example horizon-theme `__ -which inherits from the default theme and includes: - -* a custom splash screen logo -* a custom top-left logo -* a custom message on the splash screen - -Further reading: - -* https://docs.openstack.org/horizon/latest/configuration/customizing.html -* https://docs.openstack.org/horizon/latest/configuration/themes.html -* https://docs.openstack.org/horizon/latest/configuration/branding.html - -Building a custom Horizon container image ------------------------------------------ - -Building a custom container image for Horizon can be done by modifying -``kolla.yml`` to fetch the custom theme and include it in the image: - -.. code-block:: yaml - :substitutions: - - kolla_sources: - horizon-additions-theme-|horizon_theme_name|: - type: "git" - location: |horizon_theme_clone_url| - reference: master - - kolla_build_blocks: - horizon_footer: | - # Binary images cannot use the additions mechanism. - {% raw %} - {% if install_type == 'source' %} - ADD additions-archive / - RUN mkdir -p /etc/openstack-dashboard/themes/|horizon_theme_name| \ - && cp -R /additions/horizon-additions-theme-|horizon_theme_name|-archive-master/* /etc/openstack-dashboard/themes/|horizon_theme_name|/ \ - && chown -R horizon: /etc/openstack-dashboard/themes - {% endif %} - {% endraw %} - -If using a specific container image tag, don't forget to set: - -.. code-block:: yaml - - kolla_tag: mytag - -Build the image with: - -.. code-block:: console - - kayobe overcloud container image build horizon -e kolla_install_type=source --push - -Pull the new Horizon container to the controller: - -.. 
code-block:: console - - kayobe overcloud container image pull --kolla-tags horizon - -Deploy and use the custom theme -------------------------------- - -Switch to source image type in ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: - -.. code-block:: yaml - - horizon_install_type: source - -You may also need to update the container image tag: - -.. code-block:: yaml - - horizon_tag: mytag - -Configure Horizon to include the custom theme and use it by default: - -.. code-block:: console - - mkdir -p ${KAYOBE_CONFIG_PATH}/kolla/config/horizon/ - -Add to ``${KAYOBE_CONFIG_PATH}/kolla/config/horizon/custom_local_settings``: - -.. code-block:: console - :substitutions: - - AVAILABLE_THEMES = [ - ('default', 'Default', 'themes/default'), - ('material', 'Material', 'themes/material'), - ('|horizon_theme_name|', '|project_name|', '/etc/openstack-dashboard/themes/|horizon_theme_name|'), - ] - DEFAULT_THEME = '|horizon_theme_name|' - -You can also set other customisations in this file, such as the HTML title of the page: - -.. code-block:: console - :substitutions: - - SITE_BRANDING = "|project_name| OpenStack" - -Deploy with: - -.. code-block:: console - - kayobe overcloud service reconfigure --kolla-tags horizon - -Troubleshooting ---------------- - -Make sure you build source images, as binary images cannot use the addition -mechanism used here. - -If the theme is selected but the logo doesn’t load, try running these commands -inside the ``horizon`` container: - -.. code-block:: console - - /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py collectstatic --noinput --clear - /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force - settings_bundle | md5sum > /var/lib/kolla/.settings.md5sum.txt - -Alternatively, try changing anything in ``custom_local_settings`` and restarting -the ``horizon`` container. - -If the ``horizon`` container is restarting with the following error: - -.. code-block:: console - - /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force - CommandError: An error occurred during rendering /var/lib/kolla/venv/lib/python3.6/site-packages/openstack_dashboard/templates/horizon/_scripts.html: Couldn't find any precompiler in COMPRESS_PRECOMPILERS setting for mimetype '\'text/javascript\''. - -It can be resolved by dropping cached content with ``docker restart -memcached``. Note this will log out users from Horizon, as Django sessions are -stored in Memcached. diff --git a/source/gpus_in_openstack.rst b/source/gpus_in_openstack.rst deleted file mode 100644 index e904706..0000000 --- a/source/gpus_in_openstack.rst +++ /dev/null @@ -1,1127 +0,0 @@ -.. include:: vars.rst - -============================= -Support for GPUs in OpenStack -============================= - -NVIDIA Virtual GPU -################## - -BIOS configuration ------------------- - -Intel -^^^^^ - -* Enable `VT-x` in the BIOS for virtualisation support. -* Enable `VT-d` in the BIOS for IOMMU support. - -Dell -^^^^ - -Enabling SR-IOV with `racadm`: - -.. code:: shell - - /opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled - /opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1 - - - -Obtain driver from NVIDIA licensing portal -------------------------------------------- - -Download Nvidia GRID driver from `here `__ -(This requires a login). The file can either be placed on the :ref:`ansible control host` or :ref:`uploaded to pulp`. - -.. 
_NVIDIA Pulp: - -Uploading the GRID driver to pulp ---------------------------------- - -Uploading the driver to pulp will make it possible to run kayobe from any host. This can be useful when -running in a CI environment. - -.. code:: shell - - pulp artifact upload --file ~/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip - pulp file content create --relative-path "NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" --sha256 c8e12c15b881df35e618bdee1f141cbfcc7e112358f0139ceaa95b48e20761e0 - pulp file repository create --name nvidia - pulp file repository content add --repository nvidia --sha256 c8e12c15b881df35e618bdee1f141cbfcc7e112358f0139ceaa95b48e20761e0 --relative-path "NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" - pulp file publication create --repository nvidia - pulp file distribution create --name nvidia --base-path nvidia --repository nvidia - -The file will then be available at ``/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip``. You -will need to set the ``vgpu_driver_url`` configuration option to this value: - -.. code:: yaml - - # URL of GRID driver in pulp - vgpu_driver_url: "{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" - -See :ref:`NVIDIA Role Configuration`. - -.. _NVIDIA control host: - -Placing the GRID driver on the ansible control host ---------------------------------------------------- - -Copy the driver bundle to a known location on the ansible control host. Set the ``vgpu_driver_url`` configuration variable to reference this -path using ``file`` as the url scheme e.g: - -.. code:: yaml - - # Location of NVIDIA GRID driver on localhost - vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" - -See :ref:`NVIDIA Role Configuration`. - -.. _NVIDIA OS Configuration: - -OS Configuration ----------------- - -Host OS configuration is done by using roles in the `stackhpc.linux `_ ansible collection. - -Add the following to your ansible ``requirements.yml``: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/ansible/requirements.yml - - #FIXME: Update to known release When VGPU and IOMMU roles have landed - collections: - - name: stackhpc.linux - source: git+https://github.com/stackhpc/ansible-collection-linux.git,preemptive/vgpu-iommu - type: git - -Create a new playbook or update an existing on to apply the roles: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/ansible/host-configure.yml - - --- - - - hosts: iommu - tags: - - iommu - tasks: - - import_role: - name: stackhpc.linux.iommu - handlers: - - name: reboot - set_fact: - kayobe_needs_reboot: true - - - hosts: vgpu - tags: - - vgpu - tasks: - - import_role: - name: stackhpc.linux.vgpu - handlers: - - name: reboot - set_fact: - kayobe_needs_reboot: true - - - name: Reboot when required - hosts: iommu:vgpu - tags: - - reboot - tasks: - - name: Reboot - reboot: - reboot_timeout: 3600 - become: true - when: kayobe_needs_reboot | default(false) | bool - -Ansible Inventory Configuration -------------------------------- - -Add some hosts into the ``vgpu`` group. The example below maps two custom -compute groups, ``compute_multi_instance_gpu`` and ``compute_vgpu``, -into the ``vgpu`` group: - -.. 
code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/inventory/custom - - [compute] - [compute_multi_instance_gpu] - [compute_vgpu] - - [vgpu:children] - compute_multi_instance_gpu - compute_vgpu - - [iommu:children] - vgpu - -Having multiple groups is useful if you want to be able to do conditional -templating in ``nova.conf`` (see :ref:`NVIDIA Kolla Ansible -Configuration`). Since the vgpu role requires iommu to be enabled, all of the -hosts in the ``vgpu`` group are also added to the ``iommu`` group. - -If using bifrost and the ``kayobe overcloud inventory discover`` mechanism, -hosts can automatically be mapped to these groups by configuring -``overcloud_group_hosts_map``: - -.. code-block:: yaml - :caption: ``$KAYOBE_CONFIG_PATH/overcloud.yml`` - - overcloud_group_hosts_map: - compute_vgpu: - - "computegpu000" - compute_mutli_instance_gpu: - - "computegpu001" - -.. _NVIDIA Role Configuration: - -Role Configuration -^^^^^^^^^^^^^^^^^^ - -Configure the location of the NVIDIA driver: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/vgpu.yml - - --- - - vgpu_driver_url: "http://{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" - -Configure the VGPU devices: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu - - #nvidia-692 GRID A100D-4C - #nvidia-693 GRID A100D-8C - #nvidia-694 GRID A100D-10C - #nvidia-695 GRID A100D-16C - #nvidia-696 GRID A100D-20C - #nvidia-697 GRID A100D-40C - #nvidia-698 GRID A100D-80C - #nvidia-699 GRID A100D-1-10C - #nvidia-700 GRID A100D-2-20C - #nvidia-701 GRID A100D-3-40C - #nvidia-702 GRID A100D-4-40C - #nvidia-703 GRID A100D-7-80C - #nvidia-707 GRID A100D-1-10CME - vgpu_definitions: - # Configuring a MIG backed VGPU - - pci_address: "0000:17:00.0" - virtual_functions: - - mdev_type: nvidia-700 - index: 0 - - mdev_type: nvidia-700 - index: 1 - - mdev_type: nvidia-700 - index: 2 - - mdev_type: nvidia-699 - index: 3 - mig_devices: - "1g.10gb": 1 - "2g.20gb": 3 - # Configuring a card in a time-sliced configuration (non-MIG backed) - - pci_address: "0000:65:00.0" - virtual_functions: - - mdev_type: nvidia-697 - index: 0 - - mdev_type: nvidia-697 - index: 1 - -Running the playbook -^^^^^^^^^^^^^^^^^^^^ - -The playbook defined in the :ref:`previous step` -should be run after `kayobe overcloud host configure` has completed. This will -ensure the host has been fully bootstrapped. With default settings, internet -connectivity is required to download `MIG Partition Editor for NVIDIA GPUs`. If -this is not desirable, you can override the one of the following variables -(depending on host OS): - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu - - vgpu_nvidia_mig_manager_rpm_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager-0.5.1-1.x86_64.rpm" - vgpu_nvidia_mig_manager_deb_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager_0.5.1-1_amd64.deb" - -For example, you may wish to upload these artifacts to the local pulp. - -Run the playbook that you defined earlier: - -.. code-block:: shell - - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml - -Note: This will reboot the hosts on first run. - -The playbook may be added as a hook in ``$KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d``; this will -ensure you do not forget to run it when hosts are enrolled in the future. - -.. 
_NVIDIA Kolla Ansible Configuration: - -Kolla-Ansible configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -To use the mdev devices that were created, modify nova.conf to add a list of mdev devices that -can be passed through to guests: - -.. code-block:: - :caption: $KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf - - {% if inventory_hostname in groups['compute_multi_instance_gpu'] %} - [devices] - enabled_mdev_types = nvidia-700, nvidia-699 - - [mdev_nvidia-700] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6 - mdev_class = CUSTOM_NVIDIA_700 - - [mdev_nvidia-699] - device_addresses = 0000:21:00.7,0000:81:00.7 - mdev_class = CUSTOM_NVIDIA_699 - - {% elif inventory_hostname in groups['compute_vgpu'] %} - [devices] - enabled_mdev_types = nvidia-697 - - [mdev_nvidia-697] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 - # Custom resource classes don't work when you only have single resource type. - mdev_class = VGPU - - {% endif %} - -You will need to adjust the PCI addresses to match the virtual function -addresses. These can be obtained by checking the mdevctl configuration after -running the role: - -.. code-block:: shell - - # mdevctl list - - 73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined) - dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined) - a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined) - f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined) - 330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-700 manual (defined) - 1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-700 manual (defined) - f6868020-eb3a-49c6-9701-6c93e4e3fa9c 0000:65:00.6 nvidia-700 manual (defined) - 00501f37-c468-5ba4-8be2-8d653c4604ed 0000:65:00.7 nvidia-699 manual (defined) - -The mdev_class maps to a resource class that you can set in your flavor definition. -Note that if you only define a single mdev type on a given hypervisor, then the -mdev_class configuration option is silently ignored and it will use the ``VGPU`` -resource class (bug?). - -Map through the kayobe inventory groups into kolla: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/kolla.yml - - kolla_overcloud_inventory_top_level_group_map: - control: - groups: - - controllers - network: - groups: - - network - compute_cpu: - groups: - - compute_cpu - compute_gpu: - groups: - - compute_gpu - compute_multi_instance_gpu: - groups: - - compute_multi_instance_gpu - compute_vgpu: - groups: - - compute_vgpu - compute: - groups: - - compute - monitoring: - groups: - - monitoring - storage: - groups: - "{{ kolla_overcloud_inventory_storage_groups }}" - -Where the ``compute_`` groups have been added to the kayobe defaults. - -You will need to reconfigure nova for this change to be applied: - -.. code-block:: shell - - kayobe overcloud service deploy -kt nova --kolla-limit compute_vgpu - -Openstack flavors -^^^^^^^^^^^^^^^^^ - -Define some flavors that request the resource class that was configured in nova.conf. -An example definition, that can be used with ``openstack.cloud.compute_flavor`` Ansible module, -is shown below: - -.. 
code-block:: yaml - - vgpu_a100_2g_20gb: - name: "vgpu.a100.2g.20gb" - ram: 65536 - disk: 30 - vcpus: 8 - is_public: false - extra_specs: - hw:cpu_policy: "dedicated" - hw:cpu_thread_policy: "prefer" - hw:mem_page_size: "1GB" - hw:cpu_sockets: 2 - hw:numa_nodes: 8 - hw_rng:allowed: "True" - resources:CUSTOM_NVIDIA_700: "1" - -You now should be able to launch a VM with this flavor. - -NVIDIA License Server -^^^^^^^^^^^^^^^^^^^^^ - -The Nvidia delegated license server is a virtual machine based appliance. You simply need to boot an instance -using the image supplied on the NVIDIA Licensing portal. This can be done on the OpenStack cloud itself. The -requirements are: - -* All tenants wishing to use GPU based instances must have network connectivity to this machine. (network licensing) - - It is possible to configure node locked licensing where tenants do not need access to the license server -* Satisfy minimum requirements detailed `here `__. - -The official documentation for configuring the instance -can be found `here `__. - -Below is a snippet of openstack-config for defining a project, and a security group that can be used for a non-HA deployment: - -.. code-block:: yaml - - secgroup_rules_nvidia_dls: - # Allow ICMP (for ping, etc.). - - ethertype: IPv4 - protocol: icmp - # Allow SSH. - - ethertype: IPv4 - protocol: tcp - port_range_min: 22 - port_range_max: 22 - # https://docs.nvidia.com/license-system/latest/nvidia-license-system-user-guide/index.html - - ethertype: IPv4 - protocol: tcp - port_range_min: 443 - port_range_max: 443 - - ethertype: IPv4 - protocol: tcp - port_range_min: 80 - port_range_max: 80 - - ethertype: IPv4 - protocol: tcp - port_range_min: 7070 - port_range_max: 7070 - - secgroup_nvidia_dls: - name: nvidia-dls - project: "{{ project_cloud_services.name }}" - rules: "{{ secgroup_rules_nvidia_dls }}" - - openstack_security_groups: - - "{{ secgroup_nvidia_dls }}" - - project_cloud_services: - name: "cloud-services" - description: "Internal Cloud services" - project_domain: default - user_domain: default - users: [] - quotas: "{{ quotas_project }}" - -Booting the VM: - -.. code-block:: shell - - # Uploading the image and making it available in the cloud services project - $ openstack image create --file nls-3.0.0-bios.qcow2 nls-3.0.0-bios --disk-format qcow2 - $ openstack image add project nls-3.0.0-bios cloud-services - $ openstack image set --accept nls-3.0.0-bios --project cloud-services - $ openstack image member list nls-3.0.0-bios - - # Booting a server as the admin user in the cloud-services project. We pre-create the port so that - # we can recreate it without changing the MAC address. - $ openstack port create --mac-address fa:16:3e:a3:fd:19 --network external nvidia-dls-1 --project cloud-services - $ openstack role add member --project cloud-services --user admin - $ export OS_PROJECT_NAME=cloud-services - $ openstack server group create nvidia-dls --policy anti-affinity - $ openstack server create --flavor 8cpu-8gbmem-30gbdisk --image nls-3.0.0-bios --port nvidia-dls-1 --hint group=179dfa59-0947-4925-a0ff-b803bc0e58b2 nvidia-dls-cci1-1 --security-group nvidia-dls - $ openstack server add security group nvidia-dls-1 nvidia-dls - - -Manual VM driver and licence configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads. -The host drivers should already be applied to the hypervisor. - -GCP hosts compatible client drivers `here -`__. 
- -Find the correct version (when in doubt, use the same version as the host) and -download it to the VM. The exact dependencies will depend on the base image you -are using but at a minimum, you will need GCC installed. - -Ubuntu Jammy example: - -.. code-block:: bash - - sudo apt update - sudo apt install -y make gcc wget - wget https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU17.1/NVIDIA-Linux-x86_64-550.54.15-grid.run - sudo sh NVIDIA-Linux-x86_64-550.54.15-grid.run - -Check the ``nvidia-smi`` client is available: - -.. code-block:: bash - - nvidia-smi - -Generate a token from the licence server, and copy the token file to the client -VM. - -On the client, create an Nvidia grid config file from the template: - -.. code-block:: bash - - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf - -Edit it to set ``FeatureType=1`` and leave the rest of the settings as default. - -Copy the client configuration token into the ``/etc/nvidia/ClientConfigToken`` -directory. - -Ensure the correct permissions are set: - -.. code-block:: bash - - sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_.tok - -Restart the ``nvidia-gridd`` service: - -.. code-block:: bash - - sudo systemctl restart nvidia-gridd - -Check that the token has been recognised: - -.. code-block:: bash - - nvidia-smi -q | grep 'License Status' - -If not, an error should appear in the journal: - -.. code-block:: bash - - sudo journalctl -xeu nvidia-gridd - -A successfully licenced VM can be snapshotted to create an image in Glance that -includes the drivers and licencing token. Alternatively, an image can be -created using Diskimage Builder. - -Disk image builder recipe to automatically license VGPU on boot -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -`stackhpc-image-elements `__ provides a ``nvidia-vgpu`` -element to configure the nvidia-gridd service in VGPU mode. This allows you to boot VMs that automatically license themselves. -Snippets of ``openstack-config`` that allow you to do this are shown below: - -.. code-block:: shell - - image_rocky9_nvidia: - name: "Rocky9-NVIDIA" - type: raw - elements: - - "rocky-container" - - "rpm" - - "nvidia-vgpu" - - "cloud-init" - - "epel" - - "cloud-init-growpart" - - "selinux-permissive" - - "dhcp-all-interfaces" - - "vm" - - "extra-repos" - - "grub2" - - "stable-interface-names" - - "openssh-server" - is_public: True - packages: - - "dkms" - - "git" - - "tmux" - - "cuda-minimal-build-12-1" - - "cuda-demo-suite-12-1" - - "cuda-libraries-12-1" - - "cuda-toolkit" - - "vim-enhanced" - env: - DIB_CONTAINERFILE_NETWORK_DRIVER: host - DIB_CONTAINERFILE_RUNTIME: docker - DIB_RPMS: "http://192.168.1.2:80/pulp/content/nvidia/nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" - YUM: dnf - DIB_EXTRA_REPOS: "https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo" - DIB_NVIDIA_VGPU_CLIENT_TOKEN: "{{ lookup('file' , 'secrets/client_configuration_token_05-30-2023-12-41-40.tok') }}" - DIB_CLOUD_INIT_GROWPART_DEVICES: - - "/" - DIB_RELEASE: "9" - properties: - os_type: "linux" - os_distro: "rocky" - os_version: "9" - - openstack_images: - - "{{ image_rocky9_nvidia }}" - - openstack_image_git_elements: - - repo: "https://github.com/stackhpc/stackhpc-image-elements" - local: "{{ playbook_dir }}/stackhpc-image-elements" - version: master - elements_path: elements - -The gridd driver was uploaded pulp using the following procedure: - -.. 
code-block:: shell - - $ unzip NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip - $ pulp artifact upload --file ~/nvidia-linux-grid-525-525.105.17-1.x86_64.rpm - $ pulp file content create --relative-path "nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" --sha256 58fda68d01f00ea76586c9fd5f161c9fbb907f627b7e4f4059a309d8112ec5f5 - $ pulp file repository add --name nvidia --sha256 58fda68d01f00ea76586c9fd5f161c9fbb907f627b7e4f4059a309d8112ec5f5 --relative-path "nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" - $ pulp file publication create --repository nvidia - $ pulp file distribution update --name nvidia --base-path nvidia --repository nvidia - -This is the file we reference in ``DIB_RPMS``. It is important to keep the driver versions aligned between hypervisor and guest VM. - -The client token can be downloaded from the web interface of the licensing portal. Care should be taken -when copying the contents as it can contain invisible characters. It is best to copy the file directly -into your openstack-config repository and vault encrypt it. The ``file`` lookup plugin can be used to decrypt -the file (as shown in the example above). - -Testing vGPU VMs -^^^^^^^^^^^^^^^^ - -vGPU VMs can be validated using the following test workload. The test should -succeed if the VM is correctly licenced and drivers are correctly installed for -both the host and client VM. - -Install ``cuda-toolkit`` using the instructions `here -`__. - -Ubuntu Jammy example: - -.. code-block:: bash - - wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb - sudo dpkg -i cuda-keyring_1.1-1_all.deb - sudo apt update -y - sudo apt install -y cuda-toolkit make - -The VM may require a reboot at this point. - -Clone the ``cuda-samples`` repo: - -.. code-block:: bash - - git clone https://github.com/NVIDIA/cuda-samples.git - -Build and run a test workload: - -.. code-block:: bash - - cd cuda-samples/Samples/6_Performance/transpose - make - ./transpose - -Example output: - -.. code-block:: - - Transpose Starting... 
- - GPU Device 0: "Ampere" with compute capability 8.0 - - > Device 0: "GRID A100D-1-10C MIG 1g.10gb" - > SM Capability 8.0 detected: - > [GRID A100D-1-10C MIG 1g.10gb] has 14 MP(s) x 64 (Cores/MP) = 896 (Cores) - > Compute performance scaling factor = 1.00 - - Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16 - - transpose simple copy , Throughput = 159.1779 GB/s, Time = 0.04908 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose shared memory copy, Throughput = 152.1922 GB/s, Time = 0.05133 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose naive , Throughput = 117.2670 GB/s, Time = 0.06662 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose coalesced , Throughput = 135.0813 GB/s, Time = 0.05784 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose optimized , Throughput = 145.4326 GB/s, Time = 0.05372 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose coarse-grained , Throughput = 145.2941 GB/s, Time = 0.05377 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose fine-grained , Throughput = 150.5703 GB/s, Time = 0.05189 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - transpose diagonal , Throughput = 117.6831 GB/s, Time = 0.06639 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 - Test passed - -Changing VGPU device types -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Converting the second card to an NVIDIA-698 (whole card). The hypervisor -is empty so we can freely delete mdevs. First clean up the mdev -definition: - -.. code:: shell - - [stack@computegpu007 ~]$ sudo mdevctl list - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (defined) - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (defined) - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual (defined) - 0a47ffd1-392e-5373-8428-707a4e0ce31a 0000:81:00.5 nvidia-697 manual (defined) - - [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 72291b01-689b-5b7a-9171-6b3480deabf4 - [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a - - [stack@computegpu007 ~]$ sudo mdevctl undefine --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a - - [stack@computegpu007 ~]$ sudo mdevctl list --defined - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (active) - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (active) - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual - - # We can re-use the first virtual function - -Secondly remove the systemd unit that starts the mdev device: - -.. code:: shell - - [stack@computegpu007 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@0a47ffd1-392e-5373-8428-707a4e0ce31a.service - -Example config change: - -.. 
code:: shell - - diff --git a/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu - new file mode 100644 - index 0000000..6cea9bf - --- /dev/null - +++ b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu - @@ -0,0 +1,12 @@ - +--- - +vgpu_definitions: - + - pci_address: "0000:21:00.0" - + virtual_functions: - + - mdev_type: nvidia-697 - + index: 0 - + - mdev_type: nvidia-697 - + index: 1 - + - pci_address: "0000:81:00.0" - + virtual_functions: - + - mdev_type: nvidia-698 - + index: 0 - diff --git a/etc/kayobe/kolla/config/nova/nova-compute.conf b/etc/kayobe/kolla/config/nova/nova-compute.conf - index 6f680cb..e663ec4 100644 - --- a/etc/kayobe/kolla/config/nova/nova-compute.conf - +++ b/etc/kayobe/kolla/config/nova/nova-compute.conf - @@ -39,7 +39,19 @@ cpu_mode = host-model - {% endraw %} - - {% raw %} - -{% if inventory_hostname in groups['compute_multi_instance_gpu'] %} - +{% if inventory_hostname == "computegpu007" %} - +[devices] - +enabled_mdev_types = nvidia-697, nvidia-698 - + - +[mdev_nvidia-697] - +device_addresses = 0000:21:00.4,0000:21:00.5 - +mdev_class = VGPU - + - +[mdev_nvidia-698] - +device_addresses = 0000:81:00.4 - +mdev_class = CUSTOM_NVIDIA_698 - + - +{% elif inventory_hostname in groups['compute_multi_instance_gpu'] %} - [devices] - enabled_mdev_types = nvidia-700, nvidia-699 - - @@ -50,15 +62,14 @@ mdev_class = CUSTOM_NVIDIA_700 - [mdev_nvidia-699] - device_addresses = 0000:21:00.7,0000:81:00.7 - mdev_class = CUSTOM_NVIDIA_699 - -{% endif %} - - -{% if inventory_hostname in groups['compute_vgpu'] %} - +{% elif inventory_hostname in groups['compute_vgpu'] %} - [devices] - enabled_mdev_types = nvidia-697 - - [mdev_nvidia-697] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 - -# Custom resource classes don't seem to work for this card. - +# Custom resource classes don't work when you only have single resource type. - mdev_class = VGPU - - {% endif %} - -Re-run the configure playbook: - -.. code:: shell - - (kayobe) [stack@ansiblenode1 kayobe]$ kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --tags vgpu --limit computegpu007 - -Check the result: - -.. code:: shell - - [stack@computegpu007 ~]$ mdevctl list - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-698 manual - -Reconfigure nova to match the change: - -.. code:: shell - - kayobe overcloud service reconfigure -kt nova --kolla-limit computegpu007 --skip-prechecks - - -PCI Passthrough -############### - -This guide has been developed for Nvidia GPUs and CentOS 8. - -See `Kayobe Ops `_ for -a playbook implementation of host setup for GPU. - -BIOS Configuration Requirements -------------------------------- - -On an Intel system: - -* Enable `VT-x` in the BIOS for virtualisation support. -* Enable `VT-d` in the BIOS for IOMMU support. - -Hypervisor Configuration Requirements -------------------------------------- - -Find the GPU device IDs -^^^^^^^^^^^^^^^^^^^^^^^ - -From the host OS, use ``lspci -nn`` to find the PCI vendor ID and -device ID for the GPU device and supporting components. These are -4-digit hex numbers. - -For example: - -.. 
code-block:: text - - 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204M [GeForce GTX 980M] [10de:13d7] (rev a1) (prog-if 00 [VGA controller]) - 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1) - -In this case the vendor ID is ``10de``, display ID is ``13d7`` and audio ID is ``0fbb``. - -Alternatively, for an Nvidia Quadro RTX 6000: - -.. code-block:: yaml - - # NVIDIA Quadro RTX 6000/8000 PCI device IDs - vendor_id: "10de" - display_id: "1e30" - audio_id: "10f7" - usba_id: "1ad6" - usba_class: "0c0330" - usbc_id: "1ad7" - usbc_class: "0c8000" - -These parameters will be used for device-specific configuration. - -Kernel Ramdisk Reconfiguration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The ramdisk loaded during kernel boot can be extended to include the -vfio PCI drivers and ensure they are loaded early in system boot. - -.. code-block:: yaml - - - name: Template dracut config - blockinfile: - path: /etc/dracut.conf.d/gpu-vfio.conf - block: | - add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd" - owner: root - group: root - mode: 0660 - create: true - become: true - notify: - - Regenerate initramfs - - reboot - -The handler for regenerating the Dracut initramfs is: - -.. code-block:: yaml - - - name: Regenerate initramfs - shell: |- - #!/bin/bash - set -eux - dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r) - become: true - -Kernel Boot Parameters -^^^^^^^^^^^^^^^^^^^^^^ - -Set the following kernel parameters by adding to -``GRUB_CMDLINE_LINUX_DEFAULT`` or ``GRUB_CMDLINE_LINUX`` in -``/etc/default/grub.conf``. We can use the -`stackhpc.grubcmdline `_ -role from Ansible Galaxy: - -.. code-block:: yaml - - - name: Add vfio-pci.ids kernel args - include_role: - name: stackhpc.grubcmdline - vars: - kernel_cmdline: - - intel_iommu=on - - iommu=pt - - "vfio-pci.ids={{ vendor_id }}:{{ display_id }},{{ vendor_id }}:{{ audio_id }}" - kernel_cmdline_remove: - - iommu - - intel_iommu - - vfio-pci.ids - -Kernel Device Management -^^^^^^^^^^^^^^^^^^^^^^^^ - -In the hypervisor, we must prevent kernel device initialisation of -the GPU and prevent drivers from loading for binding the GPU in the -host OS. We do this using ``udev`` rules: - -.. code-block:: yaml - - - name: Template udev rules to blacklist GPU usb controllers - blockinfile: - # We want this to execute as soon as possible - path: /etc/udev/rules.d/99-gpu.rules - block: | - #Remove NVIDIA USB xHCI Host Controller Devices, if present - ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usba_class }}", ATTR{remove}="1" - #Remove NVIDIA USB Type-C UCSI devices, if present - ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usbc_class }}", ATTR{remove}="1" - owner: root - group: root - mode: 0644 - create: true - become: true - -Kernel Drivers -^^^^^^^^^^^^^^ - -Prevent the ``nouveau`` kernel driver from loading by -blacklisting the module: - -.. code-block:: yaml - - - name: Blacklist nouveau - blockinfile: - path: /etc/modprobe.d/blacklist-nouveau.conf - block: | - blacklist nouveau - options nouveau modeset=0 - mode: 0664 - owner: root - group: root - create: true - become: true - notify: - - reboot - - Regenerate initramfs - -Ensure that the ``vfio`` drivers are loaded into the kernel on boot: - -.. 
code-block:: yaml - - - name: Add vfio to modules-load.d - blockinfile: - path: /etc/modules-load.d/vfio.conf - block: | - vfio - vfio_iommu_type1 - vfio_pci - vfio_virqfd - owner: root - group: root - mode: 0664 - create: true - become: true - notify: reboot - -Once this code has taken effect (after a reboot), the VFIO kernel drivers should be loaded on boot: - -.. code-block:: text - - # lsmod | grep vfio - vfio_pci 49152 0 - vfio_virqfd 16384 1 vfio_pci - vfio_iommu_type1 28672 0 - vfio 32768 2 vfio_iommu_type1,vfio_pci - irqbypass 16384 5 vfio_pci,kvm - - # lspci -nnk -s 3d:00.0 - 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) - Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] - Kernel driver in use: vfio-pci - Kernel modules: nouveau - -IOMMU should be enabled at kernel level as well - we can verify that on the compute host: - -.. code-block:: text - - # docker exec -it nova_libvirt virt-host-validate | grep IOMMU - QEMU: Checking for device assignment IOMMU support : PASS - QEMU: Checking if IOMMU is enabled by kernel : PASS - -OpenStack Nova configuration ----------------------------- - -Configure nova-scheduler -^^^^^^^^^^^^^^^^^^^^^^^^ - -The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` -To enable it add it to the list of filters to Kolla-Ansible configuration file: -``etc/kayobe/kolla/config/nova.conf``, for instance: - -.. code-block:: yaml - - [filter_scheduler] - available_filters = nova.scheduler.filters.all_filters - enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter - -Configure nova-compute -^^^^^^^^^^^^^^^^^^^^^^ - -Configuration can be applied in flexible ways using Kolla-Ansible's -methods for `inventory-driven customisation of configuration -`_. -The following configuration could be added to -``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI -passthrough of GPU devices for hosts in a group named ``compute_gpu``. -Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci --nn`` can be used here to specify the GPU device(s). - -.. code-block:: jinja - - [pci] - {% raw %} - {% if inventory_hostname in groups['compute_gpu'] %} - # We could support multiple models of GPU. - # This can be done more selectively using different inventory groups. - # GPU models defined here: - # NVidia Tesla V100 16GB - # NVidia Tesla V100 32GB - # NVidia Tesla P100 16GB - passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" }, - { "vendor_id":"10de", "product_id":"1db5" }, - { "vendor_id":"10de", "product_id":"15f8" }] - alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } - alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } - alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } - {% endif %} - {% endraw %} - -Configure nova-api -^^^^^^^^^^^^^^^^^^ - -pci.alias also needs to be configured on the controller. -This configuration should match the configuration found on the compute nodes. -Add it to Kolla-Ansible configuration file: -``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance: - -.. 
code-block:: yaml - - [pci] - alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } - alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } - alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } - -Reconfigure nova service -^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: text - - kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks - -Configure a flavor -^^^^^^^^^^^^^^^^^^ - -For example, to request two of the GPUs with alias gpu-p100 - -.. code-block:: text - - openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2" - - -This can be also defined in the |project_config| repository: -|project_config_source_url| - -add extra_specs to flavor in etc/|project_config|/|project_config|.yml: - -.. code-block:: console - :substitutions: - - admin# cd |base_path|/src/|project_config| - admin# vim etc/|project_config|/|project_config|.yml - - name: "m1.medium" - ram: 4096 - disk: 40 - vcpus: 2 - extra_specs: - "pci_passthrough:alias": "gpu-p100:2" - -Invoke configuration playbooks afterwards: - -.. code-block:: console - :substitutions: - - admin# source |base_path|/src/|kayobe_config|/etc/kolla/public-openrc.sh - admin# source |base_path|/venvs/|project_config|/bin/activate - admin# tools/|project_config| --vault-password-file |vault_password_file_path| - -Create instance with GPU passthrough -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: text - - openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci - -Testing GPU in a Guest VM -------------------------- - -The Nvidia drivers must be installed first. For example, on an Ubuntu guest: - -.. code-block:: text - - sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440 - -The ``nvidia-smi`` command will generate detailed output if the driver has loaded -successfully. - -Further Reference ------------------ - -For PCI Passthrough and GPUs in OpenStack: - -* Consumer-grade GPUs: https://gist.github.com/claudiok/890ab6dfe76fa45b30081e58038a9215 -* https://www.jimmdenton.com/gpu-offloading-openstack/ -* https://docs.openstack.org/nova/latest/admin/pci-passthrough.html -* https://docs.openstack.org/nova/latest/admin/virtual-gpu.html (vGPU only) -* Tesla models in OpenStack: https://egallen.com/openstack-nvidia-tesla-gpu-passthrough/ -* https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF -* https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt -* https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough -* https://www.gresearch.co.uk/article/utilising-the-openstack-placement-service-to-schedule-gpu-and-nvme-workloads-alongside-general-purpose-instances/ diff --git a/source/hardware_inventory_management.rst b/source/hardware_inventory_management.rst deleted file mode 100644 index 6e5e5df..0000000 --- a/source/hardware_inventory_management.rst +++ /dev/null @@ -1,302 +0,0 @@ -.. include:: vars.rst - -============================= -Hardware Inventory Management -============================= - -At its lowest level, hardware inventory is managed in the Bifrost service (see :ref:`accessing-the-bifrost-service`). 
Reconfiguring Control Plane Hardware
------------------------------------

If a server's hardware or firmware configuration is changed, it should be
re-inspected in Bifrost before it is redeployed into service. A single server
can be reinspected like this (for a host named |hypervisor_hostname|):

.. code-block:: console
   :substitutions:

   kayobe# kayobe overcloud hardware inspect --limit |hypervisor_hostname|

.. _enrolling-new-hypervisors:

Enrolling New Hypervisors
-------------------------

New hypervisors can be added to the Bifrost inventory by using its discovery
capabilities. Assuming that new hypervisors have IPMI enabled and are
configured to network boot on the provisioning network, the following commands
will instruct them to PXE boot. The nodes will boot on the Ironic Python Agent
kernel and ramdisk, which is configured to extract hardware information and
send it to Bifrost. Note that IPMI credentials can be found in the encrypted
file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``.

.. code-block:: console
   :substitutions:

   bifrost# ipmitool -I lanplus -U |ipmi_username| -H |hypervisor_hostname|-ipmi chassis bootdev pxe

If nodes are off, power them on:

.. code-block:: console
   :substitutions:

   bifrost# ipmitool -I lanplus -U |ipmi_username| -H |hypervisor_hostname|-ipmi power on

If nodes are on, reset them:

.. code-block:: console
   :substitutions:

   bifrost# ipmitool -I lanplus -U |ipmi_username| -H |hypervisor_hostname|-ipmi power reset

Once nodes have booted and completed introspection, they should be visible in
Bifrost:

.. code-block:: console
   :substitutions:

   bifrost# baremetal node list --provision-state enroll
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
   | UUID                                 | Name                  | Instance UUID | Power State | Provisioning State | Maintenance |
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+
   | da0c61af-b411-41b9-8909-df2509f2059b | |hypervisor_hostname| | None          | power off   | enroll             | False       |
   +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+

After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to
the correct groups, import them in Kayobe's inventory with:

.. code-block:: console

   kayobe# kayobe overcloud inventory discover

We can then provision and configure them:

.. code-block:: console
   :substitutions:

   kayobe# kayobe overcloud provision --limit |hypervisor_hostname|
   kayobe# kayobe overcloud host configure --limit |hypervisor_hostname|
   kayobe# kayobe overcloud service deploy --limit |hypervisor_hostname| --kolla-limit |hypervisor_hostname|

Replacing a Failing Hypervisor
------------------------------

To replace a failing hypervisor, proceed as follows:

* :ref:`Disable the hypervisor to avoid scheduling any new instance on it <taking-a-hypervisor-out-of-service>`
* :ref:`Evacuate all instances <evacuating-all-instances>`
* :ref:`Set the node to maintenance mode in Bifrost <set-bifrost-maintenance-mode>`
* Physically fix or replace the node
* It may be necessary to reinspect the node if hardware was changed (this will require deprovisioning and reprovisioning)
* If the node was replaced or reprovisioned, follow :ref:`enrolling-new-hypervisors`

To deprovision an existing hypervisor, run:

..
code-block:: console - :substitutions: - - kayobe# kayobe overcloud deprovision --limit |hypervisor_hostname| - -.. warning:: - - Always use ``--limit`` with ``kayobe overcloud deprovision`` on a production - system. Running this command without a limit will deprovision all overcloud - hosts. - -.. _evacuating-all-instances: - -Evacuating all instances ------------------------- - -.. code-block:: console - :substitutions: - - admin# nova host-evacuate-live |hypervisor_hostname| - -You should now check the status of all the instances that were running on that -hypervisor. They should all show the status ACTIVE. This can be verified with: - -.. code-block:: console - - admin# openstack server show - -Troubleshooting -+++++++++++++++ - -Servers that have been shut down -******************************** - -If there are any instances that are SHUTOFF they won’t be migrated, but you can -use ``nova host-servers-migrate`` for them once the live migration is finished. - -Also if a VM does heavy memory access, it may take ages to migrate (Nova tries -to incrementally increase the expected downtime, but is quite conservative). -You can use ``nova live-migration-force-complete -`` to trigger the final move. - -You get the migration ID via ``nova server-migration-list ``. - -For more details see: -http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/ - -Flavors have changed -******************** - -If the size of the flavors has changed, some instances will also fail to -migrate as the process needs manual confirmation. You can do this with: - -.. code-block:: console - - openstack # openstack server resize confirm - -The symptom to look out for is that the server is showing a status of ``VERIFY -RESIZE`` as shown in this snippet of ``openstack server show ``: - -.. code-block:: console - - | status | VERIFY_RESIZE | - -.. _set-bifrost-maintenance-mode: - -Set maintenance mode on a node in Bifrost -+++++++++++++++++++++++++++++++++++++++++ - -For example, to put |hypervisor_hostname| into maintenance: - -.. code-block:: console - :substitutions: - - seed# docker exec -it bifrost_deploy /bin/bash - (bifrost-deploy)[root@seed bifrost-base]# OS_CLOUD=bifrost baremetal node maintenance set |hypervisor_hostname| - -.. _unset-bifrost-maintenance-mode: - -Unset maintenance mode on a node in Bifrost -+++++++++++++++++++++++++++++++++++++++++++ - -For example, to take |hypervisor_hostname| out of maintenance: - -.. code-block:: console - :substitutions: - - seed# docker exec -it bifrost_deploy /bin/bash - (bifrost-deploy)[root@seed bifrost-base]# OS_CLOUD=bifrost baremetal node maintenance unset |hypervisor_hostname| - -Detect hardware differences with cardiff ----------------------------------------- - -Hardware information captured during the Ironic introspection process can be -analysed to detect hardware differences, such as mismatches in firmware -versions or missing storage devices. The cardiff tool can be used for this -purpose. It was developed as part of the `Python hardware package -`__, but was removed from release 0.25. The -`mungetout utility `__ can be used to -convert Ironic introspection data into a format that can be fed to cardiff. - -The following steps are used to install cardiff and mungetout: - -.. 
code-block:: console - :substitutions: - - kayobe# virtualenv |base_path|/venvs/cardiff - kayobe# source |base_path|/venvs/cardiff/bin/activate - kayobe# pip install -U pip - kayobe# pip install git+https://github.com/stackhpc/mungetout.git@feature/kayobe-introspection-save - kayobe# pip install 'hardware==0.24' - -Extract introspection data from Bifrost with Kayobe. JSON files will be created -into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: - -.. code-block:: console - :substitutions: - - kayobe# source |base_path|/venvs/kayobe/bin/activate - kayobe# source |base_path|/src/kayobe-config/kayobe-env - kayobe# kayobe overcloud introspection data save - -The cardiff utility can only work if the ``extra-hardware`` collector was used, -which populates a ``data`` key in each node JSON file. Remove any that are -missing this key: - -.. code-block:: console - :substitutions: - - kayobe# for file in |base_path|/src/kayobe-config/overcloud-introspection-data/*; do if [[ $(jq .data $file) == 'null' ]]; then rm $file; fi; done - -Cardiff identifies each unique system by its serial number. However, some -high-density multi-node systems may report the same serial number for multiple -systems (this has been seen on Supermicro hardware). The following script will -replace the serial number used by Cardiff by the node name captured by LLDP on -the first network interface. If this node name is missing, it will append a -short UUID string to the end of the serial number. - -.. code-block:: python - - import json - import sys - import uuid - - with open(sys.argv[1], "r+") as f: - node = json.loads(f.read()) - - serial = node["inventory"]["system_vendor"]["serial_number"] - try: - new_serial = node["all_interfaces"]["eth0"]["lldp_processed"]["switch_port_description"] - except KeyError: - new_serial = serial + "-" + str(uuid.uuid4())[:8] - - new_data = [] - for e in node["data"]: - if e[0] == "system" and e[1] == "product" and e[2] == "serial": - new_data.append(["system", "product", "serial", new_serial]) - else: - new_data.append(e) - node["data"] = new_data - - f.seek(0) - f.write(json.dumps(node)) - f.truncate() - -Apply this Python script on all generated JSON files: - -.. code-block:: console - :substitutions: - - kayobe# for file in ~/src/kayobe-config/overcloud-introspection-data/*; do python update-serial.py $file; done - -Convert files into the format supported by cardiff: - -.. code-block:: console - :substitutions: - - source |base_path|/venvs/cardiff/bin/activate - mkdir -p |base_path|/cardiff-workspace - rm -rf |base_path|/cardiff-workspace/extra* - cd |base_path|/cardiff-workspace/ - m2-extract |base_path|/src/kayobe-config/overcloud-introspection-data/*.json - -.. note:: - - The ``m2-extract`` utility needs to work in an empty folder. Delete the - ``extra-hardware``, ``extra-hardware-filtered`` and ``extra-hardware-json`` - folders before executing it again. - -We are now ready to compare node hardware. The following command will compare -all known nodes, which may include multiple generations of hardware. Replace -``*.eval`` by a stricter globbing expression or by a list of files to compare a -smaller group. - -.. code-block:: console - - hardware-cardiff -I ipmi -p 'extra-hardware/*.eval' - -Since the output can be verbose, it is recommended to pipe it to a terminal -pager or redirect it to a file. 
Cardiff will display groups of identical nodes -based on various hardware characteristics, such as system model, BIOS version, -CPU or network interface information, or benchmark results gathered by the -``extra-hardware`` collector during the initial introspection process. - -.. ifconfig:: deployment['ceph_managed'] - - .. include:: hardware_inventory_management_ceph.rst diff --git a/source/hardware_inventory_management_ceph.rst b/source/hardware_inventory_management_ceph.rst deleted file mode 100644 index 0e8aa6a..0000000 --- a/source/hardware_inventory_management_ceph.rst +++ /dev/null @@ -1,9 +0,0 @@ -=========================== -Management of Ceph hardware -=========================== - -Extending the Ceph Cluster -========================== - -Replacing Failing Ceph Hardware -=============================== diff --git a/source/include/baremetal_management.rst b/source/include/baremetal_management.rst deleted file mode 100644 index 5447d71..0000000 --- a/source/include/baremetal_management.rst +++ /dev/null @@ -1,289 +0,0 @@ -.. _ironic-node-lifecycle: - -Ironic node life cycle ----------------------- - -The deployment process is documented in the `Ironic User Guide `__. -The |project_name| OpenStack deployment uses the -`direct deploy method `__. - -The Ironic state machine can be found `here `__. The rest of -this documentation refers to these states and assumes that you have familiarity. - -High level overview of state transitions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The following section attempts to describe the state transitions for various Ironic operations at a high level. -It focuses on trying to describe the steps where dynamic switch reconfiguration is triggered. -For a more detailed overview, refer to the :ref:`ironic-node-lifecycle` section. - -Provisioning -~~~~~~~~~~~~ - -Provisioning starts when an instance is created in Nova using a bare metal flavor. - -- Node starts in the available state (available) -- User provisions an instance (deploying) -- Ironic will switch the node onto the provisioning network (deploying) -- Ironic will power on the node and will await a callback (wait-callback) -- Ironic will image the node with an operating system using the image provided at creation (deploying) -- Ironic switches the node onto the tenant network(s) via neutron (deploying) -- Transition node to active state (active) - -.. _baremetal-management-deprovisioning: - -Deprovisioning -~~~~~~~~~~~~~~ - -Deprovisioning starts when an instance created in Nova using a bare metal flavor is destroyed. - -.. ifconfig:: deployment['ironic_automated_cleaning'] - - Automated cleaning is enabled, and occurs when nodes are deprovisioned. - - - Node starts in active state (active) - - User deletes instance (deleting) - - Ironic will remove the node from any tenant network(s) (deleting) - - Ironic will switch the node onto the cleaning network (deleting) - - Ironic will power on the node and will await a callback (clean-wait) - - Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) - - Ironic removes node from cleaning network (cleaning) - - Node transitions to available (available) - -.. ifconfig:: not deployment['ironic_automated_cleaning'] - - Automated cleaning is currently disabled. 
- - - Node starts in active state (active) - - User deletes instance (deleting) - - Ironic will remove the node from any tenant network(s) (deleting) - - Node transitions to available (available) - -Cleaning -~~~~~~~~ - -Manual cleaning is not part of the regular state transitions when using Nova, however nodes can be manually cleaned by administrators. - -- Node starts in the manageable state (manageable) -- User triggers cleaning with API (cleaning) -- Ironic will switch the node onto the cleaning network (cleaning) -- Ironic will power on the node and will await a callback (clean-wait) -- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) -- Ironic removes node from cleaning network (cleaning) -- Node transitions back to the manageable state (manageable) - -.. ifconfig:: deployment['ironic_automated_cleaning'] - - See :ref:`baremetal-management-deprovisioning` for information about - automated cleaning. - -Rescuing -~~~~~~~~ - -Feature not used. The required rescue network is not currently configured. - -Baremetal networking --------------------- - -Baremetal networking with the Neutron Networking Generic Switch ML2 driver requires a combination of static and dynamic switch configuration. - -.. _static-switch-config: - -Static switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. ifconfig:: deployment['kayobe_manages_physical_network'] - - Static physical network configuration is managed via Kayobe. - - .. TODO: Fill in the switch configuration - - - Some initial switch configuration is required before networking generic switch can take over the management of an interface. - First, LACP must be configured on the switch ports attached to the baremetal node, e.g: - - .. code-block:: shell - - The interface is then partially configured: - - .. code-block:: shell - - For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: - - .. code-block:: shell - - **NOTE**: You only need to do this if Ironic isn't aware of the node. - - Configuration with kayobe - ^^^^^^^^^^^^^^^^^^^^^^^^^ - - Kayobe can be used to apply the :ref:`static-switch-config`. - - - Upstream documentation can be found `here `__. - - Kayobe does all the switch configuration that isn't :ref:`dynamically updated using Ironic `. - - Optionally switches the node onto the provisioning network (when using ``--enable-discovery``) - - + NOTE: This is a dangerous operation as it can wipe out the dynamic VLAN configuration applied by neutron/ironic. - You should only run this when initially enrolling a node, and should always use the ``interface-description-limit`` option. For example: - - .. code-block:: - - kayobe physical network configure --interface-description-limit --group switches --display --enable-discovery - - In this example, ``--display`` is used to preview the switch configuration without applying it. - - .. TODO: Fill in information about how switches are configured in kayobe-config, with links - - - Configuration is done using a combination of ``group_vars`` and ``host_vars`` - -.. ifconfig:: not deployment['kayobe_manages_physical_network'] - - .. TODO: Fill in details about how physical network configuration is managed. - - Static physical network configuration is not managed via Kayobe. - -.. _dynamic-switch-configuration: - -Dynamic switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ironic dynamically configures the switches using the Neutron `Networking Generic Switch `_ ML2 driver. 
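For reference, each switch that NGS manages is defined by a ``genericswitch``
section in Neutron's ``ml2_conf.ini`` (with Kayobe this is typically
overridden via ``etc/kayobe/kolla/config/neutron/ml2_conf.ini``). The snippet
below is only an illustrative sketch - the switch name, device type, address
and credentials are placeholders and must match the actual hardware:

.. code-block:: ini

   [genericswitch:example-leaf-switch]
   device_type = netmiko_dell_os10
   ngs_mac_address = 00:53:00:0a:0b:0c
   ip = 192.0.2.10
   username = admin
   password = example-password

The driver is then responsible for the following:
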
- -- Used to toggle the baremetal nodes onto different networks - - + Can use any VLAN network defined in OpenStack, providing that the VLAN has been trunked to the controllers - as this is required for DHCP to function. - + See :ref:`ironic-node-lifecycle`. This attempts to illustrate when any switch reconfigurations happen. - -- Only configures VLAN membership of the switch interfaces or port groups. To prevent conflicts with the static switch configuration, - the convention used is: after the node is in service in Ironic, VLAN membership should not be manually adjusted and - should be left to be controlled by ironic i.e *don't* use ``--enable-discovery`` without an interface limit when configuring the - switches with kayobe. -- Ironic is configured to use the neutron networking driver. - -.. _ngs-commands: - -Commands that NGS will execute -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Networking Generic Switch is mainly concerned with toggling the ports onto different VLANs. It -cannot fully configure the switch. - -.. TODO: Fill in the switch configuration - -- Switching the port onto the provisioning network - - .. code-block:: shell - -- Switching the port onto the tenant network. - - .. code-block:: shell - -- When deleting the instance, the VLANs are removed from the port. Using: - - .. code-block:: shell - -NGS will save the configuration after each reconfiguration (by default). - -Ports managed by NGS -^^^^^^^^^^^^^^^^^^^^ - -The command below extracts a list of port UUID, node UUID and switch port information. - -.. code-block:: bash - - admin# openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value - -NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. -The rest of the switch configuration is static. -The switch configuration that NGS will apply to these ports is detailed in :ref:`dynamic-switch-configuration`. - -.. _ironic-node-discovery: - -Ironic node discovery ---------------------- - -Discovery is a process used to automatically enrol new nodes in Ironic. It works by PXE booting the nodes into the Ironic Python Agent (IPA) ramdisk. This ramdisk will collect hardware and networking configuration from the node in a process known as introspection. This data is used to populate the baremetal node object in Ironic. The series of steps you need to take to enrol a new node is as follows: - -- Configure credentials on the |bmc|. These are needed for Ironic to be able to perform power control actions. - -- Controllers should have network connectivity with the target |bmc|. - -.. ifconfig:: deployment['kayobe_manages_physical_network'] - - - Add any additional switch configuration to kayobe config. - The minimal switch configuration that kayobe needs to know about is described in :ref:`tor-switch-configuration`. - -- Apply any :ref:`static switch configration `. This performs the initial - setup of the switchports that is needed before Ironic can take over. The static configuration - will not be modified by Ironic, so it should be safe to reapply at any point. See :ref:`ngs-commands` - for details about the switch configuation that Networking Generic Switch will apply. - -.. ifconfig:: deployment['kayobe_manages_physical_network'] - - - Put the node onto the provisioning network by using the ``--enable-discovery`` flag and either ``--interface-description-limit`` or ``--interface-limit`` (do not run this command without one of these limits). 
See :ref:`static-switch-config`. - - * This is only necessary to initially discover the node. Once the node is in registered in Ironic, - it will take over control of the the VLAN membership. See :ref:`dynamic-switch-configuration`. - - * This provides ethernet connectivity with the controllers over the `workload provisioning` network - -.. ifconfig:: not deployment['kayobe_manages_physical_network'] - - - Put the node onto the provisioning network. - -.. TODO: link to the relevant file in kayobe config - -- Add node to the kayobe inventory. - -.. TODO: Fill in details about necessary BIOS & RAID config - -- Apply any necesary BIOS & RAID configuration. - -.. TODO: Fill in details about how to trigger a PXE boot - -- PXE boot the node. - -- If the discovery process is successful, the node will appear in Ironic and will get populated with the necessary information from the hardware inspection process. - -.. TODO: Link to the Kayobe inventory in the repo - -- Add node to the Kayobe inventory in the ``baremetal-compute`` group. - -- The node will begin in the ``enroll`` state, and must be moved first to ``manageable``, then ``available`` before it can be used. - - .. ifconfig:: deployment['ironic_automated_cleaning'] - - The node must complete a cleaning process before it can reach the available state. - - * Use Kayobe to attempt to move the node to the ``available`` state. - - .. code-block:: console - - source etc/kolla/public-openrc.sh - kayobe baremetal compute provide --limit - -- Once the node is in the ``available`` state, Nova will make the node available for scheduling. This happens periodically, and typically takes around a minute. - -.. _tor-switch-configuration: - -Top of Rack (ToR) switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Networking Generic Switch must be aware of the Top-of-Rack switch connected to the new node. -Switches managed by NGS are configured in ``ml2_conf.ini``. - -.. TODO: Fill in details about how switches are added to NGS config in kayobe-config - -After adding switches to the NGS configuration, Neutron must be redeployed. - -Considerations when booting baremetal compared to VMs ------------------------------------------------------- - -- You can only use networks of type: vlan -- Without using trunk ports, it is only possible to directly attach one network to each port or port group of an instance. - - * To access other networks you can use routers - * You can still attach floating IPs - -- Instances take much longer to provision (expect at least 15 mins) -- When booting an instance use one of the flavors that maps to a baremetal node via the RESOURCE_CLASS configured on the flavor. diff --git a/source/include/ceph_ansible.rst b/source/include/ceph_ansible.rst deleted file mode 100644 index 46e9baf..0000000 --- a/source/include/ceph_ansible.rst +++ /dev/null @@ -1,80 +0,0 @@ -Making a Ceph-Ansible Checkout -============================== - -Invoking Ceph-Ansible -===================== - -Removing a Failed Ceph Drive -============================ - -If a drive is verified dead, stop and eject the osd (eg. `osd.4`) -from the cluster: - -.. code-block:: console - - storage-0# systemctl stop ceph-osd@4.service - storage-0# systemctl disable ceph-osd@4.service - ceph# ceph osd out osd.4 - -.. ifconfig:: deployment['ceph_ansible'] - - Before running Ceph-Ansible, also remove vestigial state directory - from `/var/lib/ceph/osd` for the purged OSD, for example for OSD ID 4: - - .. 
code-block:: console

      storage-0# rm -rf /var/lib/ceph/osd/ceph-4

Remove the Ceph OSD state for the old OSD, here OSD ID `4` (we will backfill
all the data when we reintroduce the drive):

.. code-block:: console

   ceph# ceph osd purge --yes-i-really-mean-it 4

Unset ``noout`` for OSDs when hardware maintenance has concluded - e.g. while
waiting for the replacement disk:

.. code-block:: console

   ceph# ceph osd unset noout

Replacing a Failed Ceph Drive
=============================

Once an OSD has been identified as having a hardware failure, the affected
drive will need to be replaced.

.. note::

   Hot-swapping a failed device will change the device enumeration and this
   could confuse the device addressing in Kayobe LVM configuration.

   In kayobe-config, use ``/dev/disk/by-path`` device references to avoid
   this issue.

   Alternatively, always reboot a server when swapping drives.

If rebooting a Ceph node, first set ``noout`` to prevent excess data movement:

.. code-block:: console

   ceph# ceph osd set noout

Apply LVM configuration using Kayobe for the replaced device (here on
``storage-0``):

.. code-block:: console

   kayobe$ kayobe overcloud host configure -t lvm -l storage-0

Before running Ceph-Ansible, also remove the vestigial state directory from
``/var/lib/ceph/osd`` for the purged OSD.

Reapply Ceph-Ansible in the usual manner.

.. note::

   Ceph-Ansible runs can fail to complete if there are background activities
   such as backfilling underway when the Ceph-Ansible playbook is invoked.

diff --git a/source/include/ceph_operations.rst b/source/include/ceph_operations.rst deleted file mode 100644 index 74bc542..0000000 --- a/source/include/ceph_operations.rst +++ /dev/null @@ -1,26 +0,0 @@

Replacing drive
---------------

See the upstream documentation:
https://docs.ceph.com/en/quincy/cephadm/services/osd/#replacing-an-osd

In the case where a disk holding the DB and/or WAL fails, it is necessary to
recreate (using the replacement procedure above) all OSDs that are associated
with this disk - usually an NVMe drive. The following single command is
sufficient to identify which OSDs are tied to which physical disks:

.. code-block:: console

   ceph# ceph device ls

Host maintenance
----------------

https://docs.ceph.com/en/quincy/cephadm/host-management/#maintenance-mode

Upgrading
---------

https://docs.ceph.com/en/quincy/cephadm/upgrade/

diff --git a/source/include/ceph_troubleshooting.rst b/source/include/ceph_troubleshooting.rst deleted file mode 100644 index d353390..0000000 --- a/source/include/ceph_troubleshooting.rst +++ /dev/null @@ -1,121 +0,0 @@

Investigating a Failed Ceph Drive
---------------------------------

A failing drive in a Ceph cluster will cause the OSD daemon to crash. In this
case Ceph will go into the `HEALTH_WARN` state. Ceph can report details about
failed OSDs by running:

.. code-block:: console

   ceph# ceph health detail

.. ifconfig:: deployment['cephadm']

   .. note::

      Remember to run ceph/rbd commands from within ``cephadm shell``
      (preferred method) or after installing the Ceph client. Details are in
      the official `documentation `__. It is also required that the host
      where commands are executed has an admin Ceph keyring present - easiest
      to achieve by applying the `_admin `__ label (Ceph MON servers have it
      by default when using the `StackHPC Cephadm collection `__).

A failed OSD will also be reported as down by running:

..
code-block:: console - - ceph# ceph osd tree - -Note the ID of the failed OSD. - -The failed disk is usually logged by the Linux kernel too: - -.. code-block:: console - - storage-0# dmesg -T - -Cross-reference the hardware device and OSD ID to ensure they match. -(Using `pvs` and `lvs` may help make this connection). - -Inspecting a Ceph Block Device for a VM ---------------------------------------- - -To find out what block devices are attached to a VM, go to the hypervisor that -it is running on (an admin-level user can see this from ``openstack server -show``). - -On this hypervisor, enter the libvirt container: - -.. code-block:: console - :substitutions: - - |hypervisor_hostname|# docker exec -it nova_libvirt /bin/bash - -Find the VM name using libvirt: - -.. code-block:: console - :substitutions: - - (nova-libvirt)[root@|hypervisor_hostname| /]# virsh list - Id Name State - ------------------------------------ - 1 instance-00000001 running - -Now inspect the properties of the VM using ``virsh dumpxml``: - -.. code-block:: console - :substitutions: - - (nova-libvirt)[root@|hypervisor_hostname| /]# virsh dumpxml instance-00000001 | grep rbd - - -On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW -block image: - -.. code-block:: console - :substitutions: - - ceph# rbd ls |nova_rbd_pool| - ceph# rbd export |nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk blob.raw - -The raw block device (blob.raw above) can be mounted using the loopback device. - -Inspecting a QCOW Image using LibGuestFS ----------------------------------------- - -The virtual machine's root image can be inspected by installing -libguestfs-tools and using the guestfish command: - -.. code-block:: console - - ceph# export LIBGUESTFS_BACKEND=direct - ceph# guestfish -a blob.qcow - > run - 100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00 - > list-filesystems - /dev/sda1: ext4 - > mount /dev/sda1 / - > ls / - bin - boot - dev - etc - home - lib - lib64 - lost+found - media - mnt - opt - proc - root - run - sbin - srv - sys - tmp - usr - var - > quit diff --git a/source/include/cephadm.rst b/source/include/cephadm.rst deleted file mode 100644 index 5130a6b..0000000 --- a/source/include/cephadm.rst +++ /dev/null @@ -1,118 +0,0 @@ -cephadm configuration location -============================== - -In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific -Kayobe environment when using multiple environment, e.g. -``etc/kayobe/environments/production/cephadm.yml``) - -StackHPC's cephadm Ansible collection relies on multiple inventory groups: - -- ``mons`` -- ``mgrs`` -- ``osds`` -- ``rgws`` (optional) - -Those groups are usually defined in ``etc/kayobe/inventory/groups``. - -Running cephadm playbooks -========================= - -In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of -cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. 
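These playbooks are run with ``kayobe playbook run``, in the same way as other
custom playbooks in this configuration. For example, to run the end-to-end
deployment playbook (a sketch - activate the Kayobe virtualenv and environment
first):

.. code-block:: console

   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm.yml
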
The following playbooks are provided:

- ``cephadm.yml`` - runs the end-to-end process, starting with deployment and
  then defining EC profiles, crush rules, pools and users
- ``cephadm-crush-rules.yml`` - defines Ceph crush rules
- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the
  additional playbooks
- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles
- ``cephadm-gather-keys.yml`` - gathers Ceph configuration and keys and
  populates kayobe-config
- ``cephadm-keys.yml`` - defines Ceph users/keys
- ``cephadm-pools.yml`` - defines Ceph pools

Running Ceph commands
=====================

Ceph commands are usually run inside a ``cephadm shell`` utility container:

.. code-block:: console

   ceph# cephadm shell

Operating a cluster requires a keyring with admin access to be available for
Ceph commands. Cephadm will copy such a keyring to the nodes carrying the
`_admin `__ label - present on MON servers by default when using the
`StackHPC Cephadm collection `__.

Adding a new storage node
=========================

Add the node to the respective group (e.g. ``osds``) and run the
``cephadm-deploy.yml`` playbook.

.. note::
   To add node types other than OSDs (mons, mgrs, etc.) you need to specify
   ``-e cephadm_bootstrap=True`` when running the playbook.

Removing a storage node
=======================

First, drain the node:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch host drain <host>

Once all daemons are removed, you can remove the host:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch host rm <host>

Then remove the host from the inventory (usually in
``etc/kayobe/inventory/overcloud``).

Additional options/commands may be found in `Host management `_.

Replacing a Failed Ceph Drive
=============================

Once an OSD has been identified as having a hardware failure, the affected
drive will need to be replaced.

If rebooting a Ceph node, first set ``noout`` to prevent excess data movement:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph osd set noout

Reboot the node and replace the drive.

Unset ``noout`` after the node is back online:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph osd unset noout

Remove the OSD using the Ceph orchestrator command:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch osd rm <osd-id> --replace

After removing OSDs, if the drives the OSDs were deployed on once again become
available, cephadm may automatically try to deploy more OSDs on these drives
if they match an existing drivegroup spec. If this is not the desired
behaviour, it is best to modify the drivegroup spec beforehand (the
``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). Either set
``unmanaged: true`` to stop cephadm from picking up new disks, or modify the
spec in some way that it no longer matches the drives you want to remove.

diff --git a/source/include/wazuh_ansible.rst b/source/include/wazuh_ansible.rst deleted file mode 100644 index a71abcc..0000000 --- a/source/include/wazuh_ansible.rst +++ /dev/null @@ -1,94 +0,0 @@

One method for deploying and maintaining Wazuh is the `official Ansible
playbooks `_. These can be integrated into |kayobe_config| as a custom
playbook.

Configuring Wazuh Manager
-------------------------

Wazuh Manager is configured by editing the ``wazuh-manager.yml`` group vars
file found at ``etc/kayobe/inventory/group_vars/wazuh-manager/``. This file
controls various aspects of Wazuh Manager configuration.
Most notably:

*domain_name*:
    The domain used by Search Guard CE when generating certificates.

*wazuh_manager_ip*:
    The IP address that the Wazuh Manager resides on for communicating with
    the agents.

*wazuh_manager_connection*:
    Defines the port and protocol that the manager listens on.

*wazuh_manager_authd*:
    Connection settings for the daemon responsible for registering new agents.

Running ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml``
will deploy these changes.

Secrets
-------

Wazuh requires that secrets or passwords are set for itself and the services
with which it communicates. The playbook
``etc/kayobe/ansible/wazuh-secrets.yml`` automates the creation of these
secrets, which should then be encrypted with Ansible Vault.

To update the secrets, you can execute the following two commands:

.. code-block:: shell
   :substitutions:

   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml \
       -e wazuh_user_pass=$(uuidgen) \
       -e wazuh_admin_pass=$(uuidgen)
   kayobe# ansible-vault encrypt --vault-password-file |vault_password_file_path| \
       $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml

Once generated, run ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml``, which copies the secrets into place.

.. note:: Use ``ansible-vault`` to view the secrets:

   ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml``

Adding a New Agent
------------------

The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` inventory
group, comprising the ``seed`` group (containing |seed_name|) plus the
``overcloud`` group (containing all hosts in the OpenStack control plane).

.. code-block:: ini

   [wazuh-agent:children]
   seed
   overcloud

The following playbook deploys the Wazuh Agent to all hosts in the
``wazuh-agent`` group:

.. code-block:: shell

   kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml

The hosts running the Wazuh Agent should automatically be registered and
visible within the Wazuh Manager dashboard.

.. note:: It is good practice to use a `Kayobe deploy hook `_ to automate
   deployment and configuration of the Wazuh Agent following a run of
   ``kayobe overcloud host configure``.

Accessing Wazuh Manager
-----------------------

To access the Wazuh Manager dashboard, navigate to the IP address of
|wazuh_manager_name| (|wazuh_manager_url|).

You can log in to the dashboard with the username ``admin``. The password for
``admin`` is defined in the secret ``opendistro_admin_password`` which can be
found within
``etc/kayobe/inventory/group_vars/wazuh-manager/wazuh-secrets.yml``.

..
note:: Use ``ansible-vault`` to view Wazuh secrets: - - ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` diff --git a/source/index.rst b/source/index.rst index f36690c..09ada1d 100644 --- a/source/index.rst +++ b/source/index.rst @@ -18,18 +18,11 @@ Contents :maxdepth: 2 introduction - working_with_openstack + overview_of_system working_with_kayobe - physical_network - hardware_inventory_management - ceph_storage - managing_users_and_projects - operations_and_monitoring - wazuh - customising_deployment - gpus_in_openstack + access_to_services baremetal_management - rally_and_tempest + physical_network Indices and search ================== diff --git a/source/managing_users_and_projects.rst b/source/managing_users_and_projects.rst deleted file mode 100644 index 75f5b4f..0000000 --- a/source/managing_users_and_projects.rst +++ /dev/null @@ -1,52 +0,0 @@ -.. include:: vars.rst - -=========================== -Managing Users and Projects -=========================== - -Projects (in OpenStack) can be defined in the |project_config| repository: -|project_config_source_url| - -To initialise the working environment for |project_config|: - -.. code-block:: console - :substitutions: - - admin# cd |base_path|/src - admin# git clone |project_config_source_url| - admin# cd |project_config| - admin# virtualenv |base_path|/venvs/|project_config| - admin# source |base_path|/venvs/|project_config|/bin/activate - admin# pip install -U pip - admin# pip install -r requirements.txt - admin# ansible-galaxy role install \ - -p ansible/roles \ - -r requirements.yml - admin# ansible-galaxy collection install \ - -p ansible/collections \ - -r requirements.yml - -To define a new project, add a new project to -etc/|project_config|/|project_config|.yml: - -.. code-block:: console - :substitutions: - - admin# cd |base_path|/src/|project_config| - admin# vim etc/|project_config|/|project_config|.yml - -Example invocation: - -.. code-block:: console - :substitutions: - - admin# source |base_path|/src/|kayobe_config|/etc/kolla/public-openrc.sh - admin# source |base_path|/venvs/|project_config|/bin/activate - admin# tools/|project_config| -- --vault-password-file |vault_password_file_path| - -Deleting Users and Projects ---------------------------- - -Ansible is designed for adding configuration that is not present; removing -state is less easy. To remove a project or user, the configuration should be -manually removed. diff --git a/source/operations_and_monitoring.rst b/source/operations_and_monitoring.rst deleted file mode 100644 index 209cf3c..0000000 --- a/source/operations_and_monitoring.rst +++ /dev/null @@ -1,672 +0,0 @@ -.. include:: vars.rst - -========================= -Operations and Monitoring -========================= - -Access to OpenSearch Dashboards -=============================== - -OpenStack control plane logs are aggregated from all servers by Fluentd and -stored in OpenSearch. The control plane logs can be accessed from -OpenSearch using OpenSearch Dashboards, which is available at the following URL: -|opensearch_dashboards_url| - -To log in, use the ``opensearch`` user. The password is auto-generated by -Kolla-Ansible and can be extracted from the encrypted passwords file -(|kolla_passwords|): - -.. 
code-block:: console - :substitutions: - - kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^opensearch_dashboards - -Access to Grafana -================= - -Control plane metrics can be visualised in Grafana dashboards. Grafana can be -found at the following address: |grafana_url| - -To log in, use the |grafana_username| user. The password is auto-generated by -Kolla-Ansible and can be extracted from the encrypted passwords file -(|kolla_passwords|): - -.. code-block:: console - :substitutions: - - kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^grafana_admin_password - -.. _prometheus-alertmanager: - -Access to Prometheus Alertmanager -================================= - -Control plane alerts can be visualised and managed in Alertmanager, which can -be found at the following address: |alertmanager_url| - -To log in, use the ``admin`` user. The password is auto-generated by -Kolla-Ansible and can be extracted from the encrypted passwords file -(|kolla_passwords|): - -.. code-block:: console - :substitutions: - - kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^prometheus_alertmanager_password - -Migrating virtual machines -========================== - -To see where all virtual machines are running on the hypervisors: - -.. code-block:: console - - admin# openstack server list --all-projects --long - -To move a virtual machine with shared storage or booted from volume from one hypervisor to another, for example to -|hypervisor_hostname|: - -.. code-block:: console - :substitutions: - - admin# openstack --os-compute-api-version 2.30 server migrate --live-migration --host |hypervisor_hostname| 6a35592c-5a7e-4da3-9ab9-6765345641cb - -To move a virtual machine with local disks: - -.. code-block:: console - :substitutions: - - admin# openstack --os-compute-api-version 2.30 server migrate --live-migration --block-migration --host |hypervisor_hostname| 6a35592c-5a7e-4da3-9ab9-6765345641cb - -OpenStack Reconfiguration -========================= - -Disabling a Service -------------------- - -Ansible is oriented towards adding or reconfiguring services, but removing a -service is handled less well, because of Ansible's imperative style. - -To remove a service, it is disabled in Kayobe's Kolla config, which prevents -other services from communicating with it. For example, to disable -``cinder-backup``, edit ``${KAYOBE_CONFIG_PATH}/kolla.yml``: - -.. code-block:: diff - - -enable_cinder_backup: true - +enable_cinder_backup: false - -Then, reconfigure Cinder services with Kayobe: - -.. code-block:: console - - kayobe# kayobe overcloud service reconfigure --kolla-tags cinder - -However, the service itself, no longer in Ansible's manifest of managed state, -must be manually stopped and prevented from restarting. - -On each controller: - -.. code-block:: console - - kayobe# docker rm -f cinder_backup - -Some services may store data in a dedicated Docker volume, which can be removed -with ``docker volume rm``. - -Installing TLS Certificates ---------------------------- - -|tls_setup| - -To configure TLS for the first time, we write the contents of a PEM -file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. -Use a command of this form: - -.. 
code-block:: console - :substitutions: - - kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file=|vault_password_file_path| - -Concatenate the contents of the certificate and key files to create -``secrets_kolla_external_tls_cert``. The certificates should be installed in -this order: - -* TLS certificate for the |project_name| OpenStack endpoint |public_endpoint_fqdn| -* Any intermediate certificates -* The TLS certificate private key - -In ``${KAYOBE_CONFIG_PATH}/kolla.yml``, set the following: - -.. code-block:: yaml - - kolla_enable_tls_external: True - kolla_external_tls_cert: "{{ secrets_kolla_external_tls_cert }}" - -To apply TLS configuration, we need to reconfigure all services, as endpoint URLs need to -be updated in Keystone: - -.. code-block:: console - - kayobe# kayobe overcloud service reconfigure - -Alternative Configuration -+++++++++++++++++++++++++ - -As an alternative to writing the certificates as a variable to -``secrets.yml``, it is also possible to write the same data to a file, -``etc/kayobe/kolla/certificates/haproxy.pem``. The file should be -vault-encrypted in the same manner as secrets.yml. In this instance, -variable ``kolla_external_tls_cert`` does not need to be defined. - -See `Kolla-Ansible TLS guide -`__ for -further details. - -Updating TLS Certificates -------------------------- - -Check the expiry date on an installed TLS certificate from a host that can -reach the |project_name| OpenStack APIs: - -.. code-block:: console - :substitutions: - - openstack# openssl s_client -connect |public_endpoint_fqdn|:443 2> /dev/null | openssl x509 -noout -dates - -*NOTE*: Prometheus Blackbox monitoring can check certificates automatically -and alert when expiry is approaching. - -To update an existing certificate, for example when it has reached expiration, -change the value of ``secrets_kolla_external_tls_cert``, in the same order as -above. Run the following command: - -.. code-block:: console - - kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy - -.. _taking-a-hypervisor-out-of-service: - -Taking a Hypervisor out of Service ----------------------------------- - -To take a hypervisor out of Nova scheduling, for example |hypervisor_hostname|: - -.. code-block:: console - :substitutions: - - admin# openstack compute service set --disable \ - |hypervisor_hostname| nova-compute - -Running instances on the hypervisor will not be affected, but new instances -will not be deployed on it. - -A reason for disabling a hypervisor can be documented with the -``--disable-reason`` flag: - -.. code-block:: console - :substitutions: - - admin# openstack compute service set --disable \ - --disable-reason "Broken drive" |hypervisor_hostname| nova-compute - -Details about all hypervisors and the reasons they are disabled can be -displayed with: - -.. code-block:: console - - admin# openstack compute service list --long - -And then to enable a hypervisor again: - -.. code-block:: console - :substitutions: - - admin# openstack compute service set --enable \ - |hypervisor_hostname| nova-compute - -Managing Space in the Docker Registry -------------------------------------- - -If the Docker registry becomes full, this can prevent container updates and -(depending on the storage configuration of the seed host) could lead to other -problems with services provided by the seed host. - -To remove container images from the Docker Registry, follow this process: - -* Reconfigure the registry container to allow deleting containers. 
This can be - done in ``docker-registry.yml`` with Kayobe: - -.. code-block:: yaml - - docker_registry_env: - REGISTRY_STORAGE_DELETE_ENABLED: "true" - -* For the change to take effect, run: - -.. code-block:: console - - kayobe seed host configure - -* A helper script is useful, such as https://github.com/byrnedo/docker-reg-tool - (this requires ``jq``). To delete all images with a specific tag, use: - -.. code-block:: console - - for repo in `./docker_reg_tool http://registry-ip:4000 list`; do - ./docker_reg_tool http://registry-ip:4000 delete $repo $tag - done - -* Deleting the tag does not actually release the space. To actually free up - space, run garbage collection: - -.. code-block:: console - - seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml - -The seed host can also accrue a lot of data from building container images. -The images stored locally in the seed host can be seen using ``docker image ls``. - -Old and redundant images can be identified from their names and tags, and -removed using ``docker image rm``. - -Backup of the OpenStack Control Plane -===================================== - -As the backup procedure is constantly changing, it is normally best to check -the upstream documentation for an up to date procedure. Here is a high level -overview of the key things you need to backup: - -Controllers ------------ - -* `Back up SQL databases `__ -* `Back up configuration in /etc/kolla `__ - -Compute -------- - -The compute nodes can largely be thought of as ephemeral, but you do need to -make sure you have migrated any instances and disabled the hypervisor before -decommissioning or making any disruptive configuration change. - -Monitoring ----------- - -* `Back up InfluxDB `__ -* `Back up OpenSearch `__ -* `Back up Prometheus `__ - -Seed ----- - -* `Back up bifrost `__ - -Ansible control host --------------------- - -* Back up service VMs such as the seed VM - -Control Plane Monitoring -======================== - -The control plane has been configured to collect logs centrally using the FOOD -stack (Fluentd, OpenSearch and OpenSearch Dashboards). - -Telemetry monitoring of the control plane is performed by Prometheus. Metrics -are collected by Prometheus exporters, which are either running on all hosts -(e.g. node exporter), on specific hosts (e.g. controllers for the memcached -exporter or monitoring hosts for the OpenStack exporter). These exporters are -scraped by the Prometheus server. - -Configuring Prometheus Alerts ------------------------------ - -Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add -custom rules. - -Silencing Prometheus Alerts ---------------------------- - -Sometimes alerts must be silenced because the root cause cannot be resolved -right away, such as when hardware is faulty. For example, an unreachable -hypervisor will produce several alerts: - -* ``InstanceDown`` from Node Exporter -* ``OpenStackServiceDown`` from the OpenStack exporter, which reports status of - the ``nova-compute`` agent on the host -* ``PrometheusTargetMissing`` from several Prometheus exporters - -Rather than silencing each alert one by one for a specific host, a silence can -apply to multiple alerts using a reduced list of labels. :ref:`Log into -Alertmanager `, click on the ``Silence`` button next -to an alert and adjust the matcher list to keep only ``instance=`` -label. 
Then, create another silence to match ``hostname=`` (this is -required because, for the OpenStack exporter, the instance is the host running -the monitoring service rather than the host being monitored). - -.. note:: - - After creating the silence, you may get redirected to a 404 page. This is a - `known issue `__ - when running several Alertmanager instances behind HAProxy. - -Generating Alerts from Metrics -++++++++++++++++++++++++++++++ - -Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add -custom rules. - -Control Plane Shutdown Procedure -================================ - -Overview --------- - -* Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They - should all report a healthy status. -* Put node into maintenance mode in bifrost to prevent it from automatically - powering back on -* Shutdown down nodes one at a time gracefully using systemctl poweroff - -Controllers ------------ - -If you are restarting the controllers, it is best to do this one controller at -a time to avoid the clustered components losing quorum. - -Checking Galera state -+++++++++++++++++++++ - -On each controller perform the following: - -.. code-block:: console - :substitutions: - - [stack@|controller0_hostname| ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" - Variable_name Value - wsrep_local_state_comment Synced - -The password can be found using: - -.. code-block:: console - :substitutions: - - kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \ - --vault-password-file |vault_password_file_path| | grep ^database - -Checking RabbitMQ -+++++++++++++++++ - -RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``: - -.. code-block:: console - :substitutions: - - [stack@|controller0_hostname| ~]$ docker exec rabbitmq rabbitmqctl cluster_status - Cluster status of node rabbit@|controller0_hostname| ... - [{nodes,[{disc,['rabbit@|controller0_hostname|','rabbit@|controller1_hostname|', - 'rabbit@|controller2_hostname|']}]}, - {running_nodes,['rabbit@|controller1_hostname|','rabbit@|controller2_hostname|', - 'rabbit@|controller0_hostname|']}, - {cluster_name,<<"rabbit@|controller2_hostname|">>}, - {partitions,[]}, - {alarms,[{'rabbit@|controller1_hostname|',[]}, - {'rabbit@|controller2_hostname|',[]}, - {'rabbit@|controller0_hostname|',[]}]}] - -Checking Keepalived -+++++++++++++++++++ - -On (for example) three controllers: - -.. code-block:: console - :substitutions: - - [stack@|controller0_hostname| ~]$ docker logs keepalived - -Two instances should show: - -.. code-block:: console - - VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE - -and the other: - -.. code-block:: console - - VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE - -Ansible Control Host --------------------- - -The Ansible control host is not enrolled in bifrost. This node may run services -such as the seed virtual machine which will need to be gracefully powered down. - -Compute -------- - -If you are shutting down a single hypervisor, to avoid down time to tenants it -is advisable to migrate all of the instances to another machine. See -:ref:`evacuating-all-instances`. - -.. 
ifconfig:: deployment['ceph_managed'] - - Ceph - ---- - - The following guide provides a good overview: - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph - -Shutting down the seed VM -------------------------- - -.. code-block:: console - :substitutions: - - kayobe# virsh shutdown |seed_name| - -.. _full-shutdown: - -Full shutdown -------------- - -In case a full shutdown of the system is required, we advise to use the -following order: - -* Perform a graceful shutdown of all virtual machine instances -* Shut down compute nodes -* Shut down monitoring node -* Shut down network nodes (if separate from controllers) -* Shut down controllers -* Shut down Ceph nodes (if applicable) -* Shut down seed VM -* Shut down Ansible control host - -Rebooting a node ----------------- - -Example: Reboot all compute hosts apart from |hypervisor_hostname|: - -.. code-block:: console - :substitutions: - - kayobe# kayobe overcloud host command run --limit 'compute:!|hypervisor_hostname|' -b --command "shutdown -r" - -References ----------- - -* https://galeracluster.com/library/training/tutorials/restarting-cluster.html - -Control Plane Power on Procedure -================================ - -Overview --------- - -* Remove the node from maintenance mode in bifrost -* Bifrost should automatically power on the node via IPMI -* Check that all docker containers are running -* Check OpenSearch Dashboards for any messages with log level ERROR or - equivalent - -Controllers ------------ - -If all of the servers were shut down at the same time, it is necessary to run a -script to recover the database once they have all started up. This can be done -with the following command: - -.. code-block:: console - - kayobe# kayobe overcloud database recover - -Ansible Control Host --------------------- - -The Ansible control host is not enrolled in Bifrost and will have to be powered -on manually. - -Seed VM -------- - -The seed VM (and any other service VM) should start automatically when the seed -hypervisor is powered on. If it does not, it can be started with: - -.. code-block:: console - - kayobe# virsh start seed-0 - -Full power on -------------- - -Follow the order in :ref:`full-shutdown`, but in reverse order. - -Shutting Down / Restarting Monitoring Services ----------------------------------------------- - -Shutting down -+++++++++++++ - -Log into the monitoring host(s): - -.. code-block:: console - :substitutions: - - kayobe# ssh stack@|monitoring_host| - -Stop all Docker containers: - -.. code-block:: console - :substitutions: - - |monitoring_host|# for i in `docker ps -q`; do docker stop $i; done - -Shut down the node: - -.. code-block:: console - :substitutions: - - |monitoring_host|# sudo shutdown -h - -Starting up -+++++++++++ - -The monitoring services containers will automatically start when the monitoring -node is powered back on. - -Software Updates -================ - -Update Packages on Control Plane --------------------------------- - -OS packages can be updated with: - -.. 
code-block:: console - :substitutions: - - kayobe # kayobe overcloud host package update --limit |hypervisor_hostname| --packages '*' - kayobe # kayobe overcloud seed package update --packages '*' - -See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages - -Minor Upgrades to OpenStack Services ------------------------------------- - -* Pull latest changes from upstream stable branch to your own ``kolla`` fork (if applicable) -* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using default) -* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release`` -* Rebuild container images -* Pull container images to overcloud hosts -* Run kayobe overcloud service upgrade - -For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html - -Troubleshooting -=============== - -Deploying to a Specific Hypervisor ----------------------------------- - -To test creating an instance on a specific hypervisor, *as an admin-level user* -you can specify the hypervisor name as part of an extended availability zone -description. - -To see the list of hypervisor names: - -.. code-block:: console - - admin# openstack hypervisor list - -To boot an instance on a specific hypervisor, for example on -|hypervisor_hostname|: - -.. code-block:: console - :substitutions: - - admin# openstack server create --flavor |flavor_name| --network |network_name| --key-name --image CentOS8.2 --availability-zone nova::|hypervisor_hostname| vm-name - -Cleanup Procedures -================== - -OpenStack services can sometimes fail to remove all resources correctly. This -is the case with Magnum, which fails to clean up users in its domain after -clusters are deleted. `A patch has been submitted to stable branches -`__. -Until this fix becomes available, if Magnum is in use, administrators can -perform the following cleanup procedure regularly: - -.. code-block:: console - - admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do - if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then - echo "$user still in use, not deleting" - else - openstack user delete --domain magnum $user - fi - done - -OpenSearch indexes retention -=============================== - -To alter default rotation values for OpenSearch, edit -``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: - -.. code-block:: console - - # Duration after which index is closed (default 30) - opensearch_soft_retention_period_days: 90 - - # Duration after which index is deleted (default 60) - opensearch_hard_retention_period_days: 180 - -Reconfigure Opensearch with new values: - -.. code-block:: console - - kayobe overcloud service reconfigure --kolla-tags opensearch - -For more information see the `upstream documentation -`__. diff --git a/source/overview_of_system.rst b/source/overview_of_system.rst new file mode 100644 index 0000000..280fb2e --- /dev/null +++ b/source/overview_of_system.rst @@ -0,0 +1,28 @@ +.. include:: vars.rst + +================== +Overview of System +================== + +.. Overview of the client's System should be included here. 
+ +Ansible Control Host +==================== + +Seed +==== + +Overcloud +========= + +Controllers +----------- + +Hypervisors +----------- + +Storages +-------- + +Monitors +-------- diff --git a/source/rally_and_tempest.rst b/source/rally_and_tempest.rst deleted file mode 100644 index 4d643a8..0000000 --- a/source/rally_and_tempest.rst +++ /dev/null @@ -1,106 +0,0 @@ -.. include:: vars.rst - -========================================== -Verifying the Cloud with Rally and Tempest -========================================== - -`Rally `_ is a test framework, -and `Tempest `_ is the -OpenStack API test suite. In this guide, Rally is used to run Tempest tests. - -Requirements ------------- - -OpenStack tests are run from a host with access to the OpenStack APIs -and external network for instances. - -The following software environment is needed: - -* OpenStack admin credentials, eg public-openrc.sh from kayobe-config/etc/kolla -* A virtualenv setup with python-openstackclient installed - -.. code-block:: shell - :substitutions: - - source venv/bin/activate - pip install python-openstackclient - source |base_path|/src/|kayobe_config|/etc/kolla/public-openrc.sh - -Setup Rally for a new user --------------------------- - -A good directory hierarchy would be `~/rally/shakespeare/tempest-recipes`. -This is assumed in this guide. - -Install Rally into the virtualenv: - -.. code-block:: shell - - pip install rally-openstack \ - --constraint https://releases.openstack.org/constraints/upper/master - -Create the Rally test database and configuration file. For this you will -need the virtualenv and public-openrc as described above: - -.. code-block:: shell - - mkdir -p ~/.rally ~/rally/data - echo "[database]" | tee ~/.rally/rally.conf - echo "connection=sqlite:///${HOME}/rally/data/rally.db" | tee -a ~/.rally/rally.conf - rally db recreate - rally verify create-verifier --name default --type tempest - rally deployment create --fromenv --name production - -Check: - -.. code-block:: shell - - rally deployment show - -Install Shakespeare -------------------- - -Shakespeare is used for writing Tempest test configuration. - -.. code-block:: shell - - cd ~/rally - git clone https://github.com/stackhpc/shakespeare.git - cd shakespeare - pip install -r requirements.txt - -Install Tempest Recipe ----------------------- - -A custom set of Tempest recipes should be maintained for |project_name|, -defining key parameters needed and fine-grained control on the test cases -to run (and which to skip). - -In your `shakespeare` directory, -install the Tempest recipes in your Tempest configuration. - -.. code-block:: shell - :substitutions: - - git clone |tempest_recipes| - ansible-playbook template.yml -e @tempest-recipes/production.yml - mkdir -p ../config/production - rally verify configure-verifier --reconfigure --extend ../config/production/production.conf - -Rally invocation ----------------- - -Invoke the tests (this will take several hours to complete): - -.. code-block:: shell - - rally --debug verify start --concurrency 1 --skip-list tempest-recipes/production-skiplist.yml - -Report generation ------------------ - -Generate an HTML report of the results: - -.. code-block:: shell - - rally verify report --type html --to ~/rally/report-$(date -d "today" +"%Y%m%d%H%M").html diff --git a/source/vars.rst b/source/vars.rst index 96bec49..ba73f75 100644 --- a/source/vars.rst +++ b/source/vars.rst @@ -25,7 +25,7 @@ .. |kayobe_source_url| replace:: https://github.com/acme-openstack/kayobe.git .. 
|kayobe_source_version| replace:: ``acme/yoga`` .. |keystone_public_url| replace:: https://openstack.acme.example:5000 -.. |opensearch_dashboards_url| replace:: https://openstack.acme.example:5601 +.. |opensearch_dashboard_url| replace:: https://openstack.acme.example:5601 .. |kolla_passwords| replace:: https://github.com/acme-openstack/kayobe-config/blob/acme/yoga/etc/kayobe/kolla/passwords.yml .. |monitoring_host| replace:: ``mon0`` .. |network_name| replace:: admin-vxlan diff --git a/source/wazuh.rst b/source/wazuh.rst deleted file mode 100644 index 316b97b..0000000 --- a/source/wazuh.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. include:: vars.rst - -======================= -Wazuh Security Platform -======================= - -.. ifconfig:: deployment['wazuh'] - - The |project_name| deployment uses `Wazuh `_ as security monitoring platform. Among other things, Wazuh monitors for: - -* Security-related system events. -* Known vulnerabilities (CVEs) in versions of installed software. -* Misconfigurations in system security. - -.. ifconfig:: deployment['wazuh_managed'] - - The Wazuh deployment is managed by StackHPC Ltd. - -.. ifconfig:: not deployment['wazuh_managed'] - - The Wazuh deployment is not managed by StackHPC Ltd. - -.. ifconfig:: deployment ['wazuh_ansible'] - - Wazuh deployment via Ansible - ============================ - - .. include:: include/wazuh_ansible.rst diff --git a/source/working_with_kayobe.rst b/source/working_with_kayobe.rst index 01b2e3e..11b51fb 100644 --- a/source/working_with_kayobe.rst +++ b/source/working_with_kayobe.rst @@ -31,15 +31,15 @@ and control plane hosts through the provisioning network |control_host_access| +.. _Making a Kayobe Checkout: + Making a Kayobe Checkout ------------------------ A Kayobe checkout is made on the Ansible control host. A Kayobe development environment can easily be set up using a script called -``beokay``, for example. This command will need the ``KAYOBE_VAULT_PASSWORD`` -environment variable to be set when secrets are encrypted with Ansible Vault. -See the next section for details. +``beokay``, for example. .. code-block:: console :substitutions: @@ -51,26 +51,22 @@ See the next section for details. --kayobe-repo |kayobe_source_url| \ --kayobe-branch |kayobe_source_version| \ --kayobe-config-repo |kayobe_config_source_url| \ - --kayobe-config-branch |kayobe_config_source_version| - -After making the checkout, source the virtualenv and Kayobe config environment variables: + --kayobe-config-branch |kayobe_config_source_version| \ + --kayobe-config-env-name \ + --vault-password-file |vault_password_file_path| -.. code-block:: console - :substitutions: +If the system does not use a Kayobe environment, you can omit ``--kayobe-config-env-name``. +See the section :ref:`Kayobe Environments` for more details. - kayobe# cd |base_path| - kayobe# source venvs/kayobe/bin/activate - kayobe# source src/kayobe-config/kayobe-env - -If you are using a Kayobe environment, you will instead need to specify which -environment to source. See the section :ref:`Kayobe Environments` for more details. +After making the checkout, source ``env-vars.sh``. .. code-block:: console :substitutions: - kayobe# source src/kayobe-config/kayobe-env --environment + kayobe# cd |base_path| + kayobe# source env-vars.sh -Set up any dependencies needed on the control host: +Then, set up any dependencies needed on the control host: .. code-block:: console @@ -85,13 +81,13 @@ such as IPMI credentials, Ceph account keys and OpenStack service credentials.
The vault of deployment secrets is protected by a password, which conventionally is stored in a (mode 0400) file in the user home directory. -An easy way to manage the vault password is to update ``.bash_profile`` to add -a command such as: +An easy way to manage the vault password is to source the ``env-vars.sh`` file created by ``beokay``. +See the section :ref:`Making a Kayobe Checkout` for details. .. code-block:: console :substitutions: - kayobe# export KAYOBE_VAULT_PASSWORD=$(cat |vault_password_file_path|) + kayobe# source |base_path|/env-vars.sh Verifying Changes Before Applying --------------------------------- @@ -182,7 +178,6 @@ To use a specific environment with Kayobe, make sure to source its environment variables: .. code-block:: console - :substitutions: kayobe# source src/kayobe-config/kayobe-env --environment diff --git a/source/working_with_openstack.rst b/source/working_with_openstack.rst deleted file mode 100644 index 06dbe97..0000000 --- a/source/working_with_openstack.rst +++ /dev/null @@ -1,58 +0,0 @@ -.. include:: vars.rst - -====================== -Working with OpenStack -====================== - -Accessing the Dashboard (Horizon) ---------------------------------- - -The OpenStack web UI is available at: |horizon_url| - -This site is accessible |horizon_access|. - -Accessing the OpenStack CLI ---------------------------- - -A simple way to get started with accessing the OpenStack command-line -interface. - -This can be done from |public_api_access_host| (for example), or any machine -that has access to |public_vip|: - -.. code-block:: console - - openstack# virtualenv openstack-venv - openstack# source openstack-venv/bin/activate - openstack# pip install -U pip - openstack# pip install python-openstackclient - openstack# source -openrc.sh - -The `-openrc.sh` file can be downloaded from the OpenStack Dashboard -(Horizon): - -.. image:: _static/openrc.png - :alt: Downloading an openrc file from Horizon - :class: no-scaled-link - :width: 200 - -Now it should be possible to run OpenStack commands: - -.. code-block:: console - - openstack# openstack server list - -Accessing Deployed Instances ----------------------------- - -The external network of OpenStack, called |public_network|, connects to the -subnet |public_subnet|. This network is accessible |floating_ip_access|. - -Any OpenStack instance can make outgoing connections to this network, via a -router that connects the internal network of the project to the -|public_network| network. - -To enable incoming connections (e.g. SSH), a floating IP is required. A -floating IP is allocated and associated via OpenStack. Security groups must be -set to permit the kind of connectivity required (i.e. to define the ports that -must be opened).
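For example, a minimal sketch of this floating IP workflow using the OpenStack CLI is shown below. The instance name ``vm-name``, the ``<floating-ip>`` address, the ``default`` security group and the SSH rule are illustrative placeholders to adapt to the project; |public_network| is the external network described above.

.. code-block:: console
   :substitutions:

   openstack# openstack floating ip create |public_network|
   openstack# openstack server add floating ip vm-name <floating-ip>
   openstack# openstack security group rule create --protocol tcp --dst-port 22 --remote-ip 0.0.0.0/0 default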