Supported configuration parameters

The playbook in the provision-cassandra.yml file in this repository pulls in default values for many of the configuration parameters that are needed to deploy Cassandra. These defaults come from the vars/cassandra.yml file and the default configuration file (the config.yml file). Together, they define a reasonable set of defaults for a fairly generic Cassandra deployment, either to a single node or a cluster, including defaults for the URL that the Cassandra distribution should be downloaded from, the directory the distribution should be unpacked into, and the packages that must be installed on the node before the cassandra service can be started.

While you may never need to change most of these values from their defaults, there are a fairly large number of these parameters, so a brief summary of what each is and how it is used could be helpful. In this section, we summarize all of these options, breaking them out into:

  • parameters used to control the Ansible playbook run
  • parameters used to configure new nodes that are created in a cloud (AWS or OpenStack) environment
  • parameters used during the deployment process itself, and
  • parameters used to configure our Cassandra nodes once Cassandra has been installed locally.

Each of these sets of parameters is described in its own section below.

Parameters used to control the playbook run

The following parameters can be used to control the ansible-playbook run itself, defining things like how Ansible should connect to the nodes involved in the playbook run, which nodes should be targeted, where the Cassandra distribution should be downloaded from, which packages must be installed during the deployment process, and where those packages should be obtained from:

  • cloud: this parameter is used to indicate the target cloud for the deployment (either aws or osp); this controls both the role that is used to create new nodes (when a matching set of nodes does not exist in the target environment) and how the build-app-host-groups role retrieves the list of target nodes for the deployment; if unspecified this parameter defaults to the aws value specified in the default configuration file
  • region: this parameter is used to indicate the region that should be searched for matching nodes (and, if no matching nodes are found, the region in which a set of nodes should be created for use as a Cassandra cluster); if unspecified the default value of us-west-2 specified in the vars/zookeeper.yml file is used
  • zone: this parameter is used to indicate the availability zone that should be used when creating new nodes in an OpenStack environment; since this parameter is not needed for AWS deployments, there is no default value for this parameter (and any value provided during an AWS deployment will be silently ignored)
  • tenant: this parameter is used to indicate the tenant name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified this parameter defaults to the datanexus value specified in the default configuration file
  • project: this parameter is used to indicate the project name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified this parameter defaults to the demo value specified in the default configuration file
  • dataflow: this parameter is used to indicate the dataflow name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; the dataflow tag is used to link together the clusters/ensembles (Cassandra, Zookeeper, Kafka, Solr, etc.) that are involved in a given dataflow; if this value is not specified, it defaults to a value of none during the playbook run
  • domain: this parameter is used to indicate the domain name to use (e.g. test, production, preprod), either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified this parameter defaults to the production value specified in the default configuration file
  • cluster: this parameter is used to indicate the cluster name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; this value is used to differentiate clusters of the same type from each other when multiple clusters are deployed for a given application for the same tenant, project, dataflow, and domain; if this value is not specified it defaults to a value of a during the playbook run
  • user: the username that should be used when connecting to the target nodes via SSH; the value for this parameter will likely change from one target environment to the next; if unspecified a value of centos will be used
  • cassandra_url: the URL that the Apache Cassandra distribution should be downloaded from
  • cassandra_version: the version of Cassandra that should be downloaded; used to switch versions when the distribution is downloaded using the default cassandra_url, which is defined in the vars/cassandra.yml file
  • config_file: used to define the location of a configuration file (see the discussion of this topic, below); this file is a YAML file containing definitions for any of the configuration parameters that are described in this section and is more than likely a file that will be created to manage the process of creating a specific ensemble. Storing the settings for a given ensemble in such a file makes it easy to guarantee that all of the nodes in that ensemble are configured consistently. If a value is not specified for this parameter then the default configuration file (the config.yml file) will be used; to override this behavior (and not load a configuration file of any kind), one can simply set the value of this parameter to /dev/null and specify all of the other, non-default parameters that are needed as extra variables during the playbook run
  • private_key_path: used to define the directory where the private keys are maintained when the inventory for the playbook run is being managed dynamically; in these cases, the scripts used to retrieve the dynamic inventory information will return the names of the keys that should be used to access each node, and the playbook will search the directory specified by this parameter to find the corresponding key files. If this value is not specified then the current working directory will be searched for those keys by default
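
As a rough sketch, a custom configuration file that sets the parameters above might look like the following. The file name and the key directory are hypothetical; the remaining values simply restate the defaults quoted in the list above, so in practice you would only include the entries you want to change.

```yaml
# my-cassandra-config.yml (hypothetical file name); its path would be
# passed to the playbook run as the value of the config_file parameter
cloud: aws            # or osp for an OpenStack deployment
region: us-west-2
tenant: datanexus
project: demo
domain: production
cluster: a
user: centos          # SSH username; likely differs between environments
# hypothetical directory holding the private keys named by the
# dynamic inventory scripts
private_key_path: /home/admin/keys
```

Keeping these values in a single file, rather than passing each one as an extra variable, is what makes it easy to configure every node in a cluster consistently.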

Parameters used to configure nodes created in a cloud environment

When the inventory for the playbook run is being controlled dynamically (i.e. when the deployment is targeting nodes in an AWS or OpenStack environment) and no matching nodes are found, the playbook will actually create a new set of nodes (using the tags that were passed into the playbook run) and configure those nodes as a Cassandra cluster. In that case, there are a number of parameters that must be provided to control the process of node creation:

  • type: the type of node that should be created; if this value is unspecified then a default value of t2.large (suitable for use in the default, AWS deployment) specified in the config.yml file is used
  • image: the image (AMI ID in the case of an AWS deployment or image UUID in the case of an OpenStack deployment) that should be used when creating new nodes; if this parameter is unspecified in an AWS deployment, then the playbook will search for a suitable image to use for the deployment; this parameter must be specified for an OpenStack deployment (and its value must be the UUID of a pre-existing image that is suitable for use in the playbook run)
  • cidr_block: the CIDR block of the VPC where the nodes should be created in an AWS deployment (or the equivalent in an OpenStack deployment); it is assumed that this VPC (or OpenStack equivalent) already exists; if it is not specified, then the default value of 10.10.0.0/16 from the config.yml file is used
  • node_map: a list of dictionary entries where each entry specifies the number of nodes to create (the count) for that application (or for each role in a given application deployment, if deployment of the cluster involves nodes with different roles, like the seed and non-seed nodes in a Cassandra cluster); for the playbook in this repository the default value for this parameter (which appears in the vars/cassandra.yml file) will result in the creation of a three-node Cassandra cluster if no matching nodes were found based on the tags that were passed into the playbook run
  • root_volume: the size (in GB) of the root volume that should be created when building new nodes in an AWS or OpenStack environment; this parameter has a default value that depends on whether or not there is a corresponding definition for the data_volume parameter (see below):
    • if there is no defined value for the data_volume parameter, then a root volume that is 40GB in size will be created if this parameter is not defined
    • if there is a defined value for the data_volume parameter, then a root volume that is 11GB in size will be created if this parameter is not defined
  • data_volume: the size (in GB) of the data volume that will be created when building new nodes in an AWS or OpenStack environment; if a value is defined for this parameter, a data volume with the corresponding size will be created for each of the instances that are created by the playbook run and those data volumes will then be mounted under the /data directory for each of those instances; if a value is not defined for this parameter then no corresponding data volume will be created (and the nodes that are created by the playbook run will only have a single root volume)
  • application_sg_rules: a list of rules used to configure the firewall associated with the internal and external subnets; for the playbook in this repository the default rules (which should not need to be changed) will result in a few ports being open on the internal subnet to support internode communications between members of the Cassandra cluster and a few more being open on the external subnet to support client connections to the cluster
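
A minimal sketch of how the node-creation parameters above could be set for an AWS deployment is shown below. The data_volume size is an illustrative value only; the type and cidr_block values are the documented defaults. The image parameter is deliberately left out so that the playbook searches for a suitable AMI, and node_map and application_sg_rules are left at their defaults because their exact schema is not described in this section.

```yaml
type: t2.large             # documented default for the AWS deployment
cidr_block: 10.10.0.0/16   # the VPC with this CIDR block must already exist
data_volume: 200           # illustrative size (GB); mounted under /data
root_volume: 11            # matches the default used when data_volume is set
```

For an OpenStack deployment the same file would also need an image entry containing the UUID of a pre-existing image, along with a zone value.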

Parameters used during the deployment process

These parameters are used to control the deployment process itself, defining things like where to unpack the distribution, whether or not Cassandra should be started when the deployment process is complete, and what user/group should be used when running Cassandra locally.

  • cassandra_dir: the directory that the Cassandra distribution should be unpacked into; defaults to the /opt/apache-cassandra directory. If necessary, this directory will be created as part of the playbook run
  • cassandra_package_list: the list of packages that should be installed on the Cassandra nodes; typically this parameter is left unchanged from the default (which installs the OpenJDK packages needed to run Cassandra), but if it is modified, the default OpenJDK packages must be included as part of this list or an error will result when attempting to start the cassandra service
  • cassandra_group: the name of the user group under which Cassandra should be installed and run; defaults to cassandra
  • cassandra_user: the username under which Cassandra should be installed and run; defaults to cassandra
  • start_cassandra: this parameter is used to control whether or not the cassandra service should be started when the playbook is complete; defaults to true
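
To illustrate, the fragment below restates the documented defaults for the deployment parameters and overrides only start_cassandra, so that the service is installed but left stopped at the end of the playbook run. This is a sketch of one possible configuration, not a recommended setting.

```yaml
cassandra_dir: /opt/apache-cassandra   # created by the playbook if missing
cassandra_user: cassandra
cassandra_group: cassandra
start_cassandra: false                 # default is true
```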

Parameters used to configure the Cassandra nodes

These parameters are used to configure the Cassandra nodes themselves during a playbook run, defining things like the interfaces that Cassandra should be listening on for requests, the directory where Cassandra should store its data, and the list of seed nodes for the cluster.

  • internal_subnet: the CIDR block describing the subnet that any nodes being created by the playbook run should attach as a private network (eth0); this network is used for internode communications between the nodes of the cluster being deployed (and between nodes of the clusters/ensembles that make up the dataflow that this cluster is a member of); if it is not specified, then the default value of 10.10.1.0/24 from the config.yml file is used; if the deployment is an OpenStack deployment then a value for the associated internal_uuid parameter must also be provided, and that value must be the UUID for an existing internal network in the targeted OpenStack environment
  • external_subnet: the CIDR block describing the subnet that any nodes being created by the playbook run should attach as a "public" network (eth1); this network is used to support client connections to the various services that make up the cluster being deployed (and between nodes of the clusters/ensembles that make up the dataflow that this cluster is a member of); if it is not specified, then the default value of 10.10.2.0/24 from the config.yml file is used; if the deployment is an OpenStack deployment then a value for the associated external_uuid parameter must also be provided, and that value must be the UUID for an existing external network in the targeted OpenStack environment
  • cassandra_data_dir: the name of the directory that Cassandra should use to store its data; defaults to /data if unspecified. If necessary, this directory will be created as part of the playbook run
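
Written out as a configuration fragment, the three parameters above and their documented defaults look like this. The comments about OpenStack only restate the requirement described in the list; no UUID values are shown because they are specific to each OpenStack environment.

```yaml
internal_subnet: 10.10.1.0/24   # private network (eth0), internode traffic
external_subnet: 10.10.2.0/24   # "public" network (eth1), client connections
cassandra_data_dir: /data       # created by the playbook if missing
# OpenStack deployments must also define internal_uuid and external_uuid,
# each set to the UUID of an existing network in the target environment
```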

In addition to these parameters, Cassandra supports an extremely large set of configuration parameters that may be set by the user if they wish (reasonable defaults are assumed for all of these parameters in the playbook run if they are not specified). Rather than listing all of these parameters here, we will briefly describe the strategy that the playbook uses when assigning values to these variables.

Briefly, for any parameter defined in the cassandra.yaml configuration file that is included in the Cassandra distribution (the rpc_port for example), the user can define a value for that parameter using a variable of the same name but with a cassandra_ prefix. So, to change the rpc_port from the default value (9160), you would simply assign your desired value to the cassandra_rpc_port variable (passing that variable into the ansible-playbook run using one of the strategies outlined in the Controlling the configuration discussion of the README.md file), and that value would override the default value in the cassandra.yaml file for all of the nodes targeted by your playbook run.

In the current (v3.11.0) Cassandra release there are more than 130 of these configuration parameters that users may wish to customize in the cassandra.yaml file, most of which will never be set to anything but their default values. That said, the strategy followed here does give users the ability to set any of these parameters that they wish to set in their own deployments without requiring either an excessive amount of work on our part or an excessive amount of work on the part of users who do not wish to set values for those parameters.
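
The prefix convention described above can be sketched as follows. The rpc_port example comes from the text; concurrent_reads is another parameter found in the stock cassandra.yaml file, chosen here only for illustration, and both values are arbitrary.

```yaml
# each variable maps to the cassandra.yaml parameter of the same name
# with the cassandra_ prefix removed
cassandra_rpc_port: 9161           # overrides rpc_port (default 9160)
cassandra_concurrent_reads: 64     # overrides concurrent_reads
```

Any cassandra.yaml parameter that has no matching cassandra_ variable keeps its default value.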

Determining interface names

The playbook in this repository will dynamically determine the names of the interfaces that correspond to the defined internal_subnet and external_subnet CIDR block values and configure the members of the cluster being deployed to listen on those interfaces, either for communication between the nodes that make up the cluster or for client requests. This is accomplished by dynamically constructing an iface_description_array parameter within the playbook, then using that parameter to determine the names of the corresponding interfaces and their IP addresses.

Put simply, the iface_description_array lets you specify a description for each of the networks that you are interested in, then retrieve the names of those networks on each machine in a variable that can be used elsewhere in the playbook. To accomplish this, the iface_description_array is defined as an array of hashes (one per interface), each of which includes the following fields:

  • type: the type of description being provided, currently only the cidr type is supported
  • val: a value describing the network in question; since only cidr descriptions are currently supported, a CIDR value that looks something like 192.168.34.0/24 should be used for this field
  • as_var: the name of the variable that you would like the interface name returned as

With these values in hand, the playbook will search the available networks on each machine and, for each network described in the iface_description_array, return the matching interface name as the value of the fact named in that entry's as_var field. For example, consider this description:

    iface_description_array: [
        { as_var: 'data_iface', type: 'cidr', val: '192.168.34.0/24' },
        { as_var: 'api_iface', type: 'cidr', val: '192.168.44.0/24' },
    ]

In this example, the playbook will determine the names of the interfaces that match the CIDR blocks 192.168.34.0/24 and 192.168.44.0/24, returning those interface names as the values of the data_iface and api_iface facts, respectively (e.g. eth0 and eth1). These two facts are then used later in the playbook to correctly configure the nodes to talk to each other (over the data_iface network) and listen on the proper interfaces for user requests (on the api_iface network).
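
Since the playbook builds this array from the internal_subnet and external_subnet values, the constructed array presumably resembles the sketch below. This is an assumption for illustration: the use of the data_iface and api_iface fact names for the internal and external subnets, and the Jinja2 templating shown, are not confirmed by this document, and the actual construction inside the playbook may differ.

```yaml
# hypothetical reconstruction of the dynamically built array
iface_description_array:
  - { as_var: data_iface, type: cidr, val: "{{ internal_subnet }}" }
  - { as_var: api_iface, type: cidr, val: "{{ external_subnet }}" }
# with the default subnets (10.10.1.0/24 and 10.10.2.0/24) and the
# interface layout described earlier, the resulting facts would be
# something like data_iface = eth0 and api_iface = eth1
```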