The playbook in the provision-spark.yml file in this repository pulls in a set of default values for many of the configuration parameters needed to deploy Spark from the vars/spark.yml file and the default configuration file (the config.yml file). The parameters defined in these files provide a reasonable set of defaults for a fairly generic Spark deployment, either to a single node or a cluster, including defaults for the URL that the Spark distribution should be downloaded from, the directory the distribution should be unpacked into, and the packages that must be installed on the node before the `spark` services can be started.
While you may never need to change most of these values from their defaults, there is a fairly large number of these parameters, so a brief summary of what each one is and how it is used may be helpful. In this section, we summarize all of these options, breaking them out into:
- parameters used to control the Ansible playbook run
- parameters used to configure new nodes that are created in a cloud (AWS or OpenStack) environment
- parameters used during the deployment process itself, and
- parameters used to configure our Spark nodes once Spark has been installed locally.
Each of these sets of parameters is described in its own section, below.
The following parameters can be used to control the `ansible-playbook` run itself, defining things like how Ansible should connect to the nodes involved in the playbook run, which nodes should be targeted, where the Spark distribution should be downloaded from, which packages must be installed during the deployment process, and where those packages should be obtained from:
- `cloud`: this parameter is used to indicate the target cloud for the deployment (either `aws` or `osp`); this controls both the role that is used to create new nodes (when a matching set of nodes does not exist in the target environment) and how the build-app-host-groups role retrieves the list of target nodes for the deployment; if unspecified, this parameter defaults to the `aws` value specified in the default configuration file
- `region`: this parameter is used to indicate the region that should be searched for matching nodes (and, if no matching nodes are found, the region in which a set of nodes should be created for use as a Spark cluster); if unspecified, the default value of `us-west-2` specified in the vars/spark.yml file is used
- `zone`: this parameter is used to indicate the availability zone that should be used when creating new nodes in an OpenStack environment; since this parameter is not needed for AWS deployments, there is no default value for this parameter (and any value provided during an AWS deployment will be silently ignored)
- `tenant`: this parameter is used to indicate the tenant name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified, this parameter defaults to the `datanexus` value specified in the default configuration file
- `project`: this parameter is used to indicate the project name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified, this parameter defaults to the `demo` value specified in the default configuration file
- `dataflow`: this parameter is used to indicate the dataflow name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; the dataflow tag is used to link together the clusters/ensembles (Cassandra, Zookeeper, Kafka, Solr, etc.) that are involved in a given dataflow; if this value is not specified, it defaults to a value of `none` during the playbook run
- `domain`: this parameter is used to indicate the domain name to use (e.g. test, production, preprod), either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified, this parameter defaults to the `production` value specified in the default configuration file
- `cluster`: this parameter is used to indicate the cluster name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; this value is used to differentiate clusters of the same type from each other when multiple clusters are deployed for a given application for the same tenant, project, dataflow, and domain; if this value is not specified, it defaults to a value of `a` during the playbook run
- `user`: the username that should be used when connecting to the target nodes via SSH; the value for this parameter will likely change from one target environment to the next; if unspecified, a value of `centos` will be used
- `spark_url`: the URL that the Apache Spark distribution should be downloaded from
- `spark_version`: the version of Spark that should be downloaded; used to switch versions when the distribution is downloaded using the default `spark_url`, which is defined in the vars/spark.yml file
- `local_spark_file`: used to pass in the local path (on the Ansible host) to a directory containing the Apache Spark distribution file (a gzipped tarfile); the distribution file will be uploaded from this directory to the target hosts and unpacked into the `spark_dir` directory
- `config_file`: used to define the location of a configuration file (see the discussion of this topic, below); this file is a YAML file containing definitions for any of the configuration parameters that are described in this section and is more than likely a file that will be created to manage the process of creating a specific ensemble (a hypothetical sketch of such a file appears after this list). Storing the settings for a given ensemble in such a file makes it easy to guarantee that all of the nodes in that ensemble are configured consistently. If a value is not specified for this parameter, then the default configuration file (the config.yml file) will be used; to override this behavior (and not load a configuration file of any kind), one can simply set the value of this parameter to `/dev/null` and specify all of the other, non-default parameters that are needed as extra variables during the playbook run
- `private_key_path`: used to define the directory where the private keys are maintained when the inventory for the playbook run is being managed dynamically; in these cases, the scripts used to retrieve the dynamic inventory information will return the names of the keys that should be used to access each node, and the playbook will search the directory specified by this parameter to find the corresponding key files. If this value is not specified, then the current working directory will be searched for those keys by default
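As a concrete (and purely hypothetical) illustration of how these parameters fit together, a minimal configuration file might look something like the sketch below; the values shown simply restate the documented defaults, and a real file would typically be passed into the playbook run via the `config_file` parameter (for example, as an extra variable on the `ansible-playbook` command line):

```yaml
# example-config.yml -- a hypothetical configuration file; the values shown
# here restate the documented defaults and are placeholders for a real run
cloud: aws
region: us-west-2
tenant: datanexus
project: demo
dataflow: none
domain: production
cluster: a
user: centos
private_key_path: './keys'   # illustrative; the current working directory is searched by default
```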
When the inventory for the playbook run is being controlled dynamically (i.e. when the deployment is targeting nodes in an AWS or OpenStack environment) and no matching nodes are found, the playbook will actually create a new set of nodes (using the tags that were passed into the playbook run) and configure those nodes as a Spark cluster. In that case, there are a number of parameters that must be provided to control the process of node creation:
- `type`: the type of node that should be created; if this value is unspecified, then a default value of `t2.large` (suitable for use in the default AWS deployment) specified in the config.yml file is used
- `image`: the image (AMI ID in the case of an AWS deployment or image UUID in the case of an OpenStack deployment) that should be used when creating new nodes; if this parameter is unspecified in an AWS deployment, then the playbook will search for a suitable image to use for the deployment; this parameter must be specified for an OpenStack deployment (and its value must be the UUID of a pre-existing image that is suitable for use in the playbook run)
- `cidr_block`: the CIDR block of the VPC where the nodes should be created in an AWS deployment (or the equivalent in an OpenStack deployment); it is assumed that this VPC (or OpenStack equivalent) already exists; if it is not specified, then the default value of `10.10.0.0/16` from the config.yml file is used
- `node_map`: a list of dictionary entries where each entry specifies the number of nodes to create (the `count`) for that application (or for each role in a given application deployment if deployment of the cluster involves the deployment of nodes with different roles, like the seed and non-seed nodes in a Cassandra cluster); for the playbook in this repository, the default value for this parameter (which appears in the vars/spark.yml file) will result in the creation of a three-node Spark cluster if no matching nodes were found based on the tags that were passed into the playbook run (a sketch of what an override might look like appears after this list)
- `root_volume`: the size (in GB) of the root volume that should be created when building new nodes in an AWS or OpenStack environment; this parameter has a default value that depends on whether or not there is a corresponding definition for the `data_volume` parameter (see below):
    - if there is no defined value for the `data_volume` parameter, then a root volume that is 40GB in size will be created if this parameter is not defined
    - if there is a defined value for the `data_volume` parameter, then a root volume that is 11GB in size will be created if this parameter is not defined
- `data_volume`: the size (in GB) of the data volume that will be created when building new nodes in an AWS or OpenStack environment; if a value is defined for this parameter, a data volume with the corresponding size will be created for each of the instances that are created by the playbook run, and those data volumes will then be mounted under the `/data` directory for each of those instances; if a value is not defined for this parameter, then no corresponding data volume will be created (and the nodes created by the playbook run will only have a single, root volume)
- `application_sg_rules`: a list of rules used to configure the firewall associated with the internal and external subnets; for the playbook in this repository, the default rules (which should not need to be changed) will result in a number of ports being open on the internal subnet to support internode communications between members of the Spark cluster and client connections to the Web UI interfaces provided by the cluster's master and worker nodes on that same subnet
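To make the `node_map` and volume-related parameters a bit more concrete, the hedged sketch below shows what an override in a configuration file might look like. Only the `count` field is documented above; the `application` key and the instance type shown are illustrative assumptions, so check the default definition in the vars/spark.yml file for the exact form this playbook expects:

```yaml
# Hypothetical node-creation overrides; only the `count` field is documented
# above -- the `application` key and the values shown are illustrative
type: m5.xlarge
root_volume: 11
data_volume: 100
node_map:
  - { application: spark, count: 3 }   # assumed keying; see vars/spark.yml for the default
```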
These parameters are used to control the deployment process itself, defining things like where the distribution should be unpacked, the user/group that should be used when unpacking the distribution and starting the `spark` service, and which packages to install.
- `spark_dir`: the directory that the Spark distribution should be unpacked into; defaults to the `/opt/apache-spark` directory. If necessary, this directory will be created as part of the playbook run
- `spark_package_list`: the list of packages that should be installed on the Spark nodes; typically this parameter is left unchanged from the default (which installs the OpenJDK packages needed to run Spark), but if it is modified, the default OpenJDK packages must be included as part of this list or an error will result when attempting to start the `spark-master` and `spark-worker` services
- `spark_group`: the name of the user group under which Spark should be installed and run; defaults to `spark`
- `spark_user`: the username under which Spark should be installed and run; defaults to `spark`
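A hedged sketch of how these deployment parameters might be overridden in a configuration file is shown below; the package names are illustrative placeholders (as noted above, whatever OpenJDK packages the default `spark_package_list` installs must remain in any modified list):

```yaml
# Hypothetical deployment-parameter overrides; the package names shown are
# illustrative placeholders -- the OpenJDK packages installed by the default
# spark_package_list must remain in any modified list
spark_dir: /opt/apache-spark
spark_user: spark
spark_group: spark
spark_package_list:
  - java-1.8.0-openjdk-headless   # assumed default OpenJDK package (illustrative)
  - wget                          # an illustrative extra package
```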
These parameters are used to configure the Spark nodes themselves during a playbook run, defining things like the interfaces that Spark should be listening on for requests and the directory where Spark should store its data.
- `internal_subnet`: the CIDR block describing the subnet that any nodes created by the playbook run should attach to as a private network (`eth0`); this network is used for internode communications between the nodes of the cluster being deployed (and between nodes of the clusters/ensembles that make up the dataflow that this cluster is a member of); if it is not specified, then the default value of `10.10.1.0/24` from the config.yml file is used; if the deployment is an OpenStack deployment, then a value for the associated `internal_uuid` parameter must also be provided, and that value must be the UUID for an existing internal network in the targeted OpenStack environment
- `external_subnet`: the CIDR block describing the subnet that any nodes created by the playbook run should attach to as a "public" network (`eth1`); this network is used to support client connections to the various services that make up the cluster being deployed (and between nodes of the clusters/ensembles that make up the dataflow that this cluster is a member of); if it is not specified, then the default value of `10.10.2.0/24` from the config.yml file is used; if the deployment is an OpenStack deployment, then a value for the associated `external_uuid` parameter must also be provided, and that value must be the UUID for an existing external network in the targeted OpenStack environment
- `spark_data_dir`: the name of the directory that Spark should use to store its data; defaults to `/data` if unspecified. If necessary, this directory will be created as part of the playbook run
- `spark_master_port`: the port that the Spark master nodes should listen on for internode communications with other members of the cluster; defaults to 7077 if unspecified
- `spark_master_webui_port`: the port that the Spark master node's Web UI should listen on for client connections; defaults to 8080 if unspecified
- `spark_worker_port`: the port that the Spark worker nodes should listen on for internode communications with other members of the cluster; defaults to 7078 if unspecified
- `spark_worker_webui_port`: the port that the Spark worker node's Web UI should listen on for client connections; defaults to 8181 if unspecified
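As an illustration, a configuration file targeting an OpenStack environment might include a block along the following lines (a hedged sketch: the UUIDs are placeholders for networks that would already exist in the target environment, and the remaining values simply restate the documented defaults):

```yaml
# Hypothetical Spark-node configuration overrides for an OpenStack deployment;
# the UUIDs are placeholders and the other values restate the documented defaults
internal_subnet: '10.10.1.0/24'
internal_uuid: '00000000-0000-0000-0000-000000000000'   # placeholder UUID
external_subnet: '10.10.2.0/24'
external_uuid: '11111111-1111-1111-1111-111111111111'   # placeholder UUID
spark_data_dir: /data
spark_master_port: 7077
spark_master_webui_port: 8080
spark_worker_port: 7078
spark_worker_webui_port: 8181
```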
The playbook in this repository will dynamically determine the names of the interfaces that correspond to the defined `internal_subnet` and `external_subnet` CIDR block values and configure the members of the cluster being deployed to listen on those interfaces, either for communication between the nodes that make up the cluster or for client requests. This is accomplished by dynamically constructing an `iface_description_array` parameter within the playbook, then using that parameter to determine the names of the corresponding interfaces and their IP addresses.
Put quite simply, the `iface_description_array` lets you specify a description for each of the networks that you are interested in, then retrieve the names of those networks on each machine in a variable that can be used elsewhere in the playbook. To accomplish this, the `iface_description_array` is defined as an array of hashes (one per interface), each of which includes the following fields:
- `type`: the type of description being provided; currently only the `cidr` type is supported
- `val`: a value describing the network in question; since only `cidr` descriptions are currently supported, a CIDR value that looks something like `192.168.34.0/24` should be used for this field
- `as_var`: the name of the variable that you would like the interface name returned as
With these values in hand, the playbook will search the available networks on each machine and return a list of the interface names for each network that was described in the `iface_description_array` as the value of the fact named in the `as_var` field for that network's entry. For example, given this description:
```yaml
iface_description_array: [
  { as_var: 'data_iface', type: 'cidr', val: '192.168.34.0/24' },
  { as_var: 'api_iface', type: 'cidr', val: '192.168.44.0/24' },
]
```
In this example, the playbook will determine the names of the interfaces that match the CIDR blocks `192.168.34.0/24` and `192.168.44.0/24`, returning those interface names as the values of the `data_iface` and `api_iface` facts, respectively (e.g. `eth0` and `eth1`). These two facts are then used later in the playbook to correctly configure the nodes to talk to each other (over the `data_iface` network) and listen on the proper interfaces for user requests (on the `api_iface` network).
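As a rough, hypothetical sketch of how such a fact might be consumed later in a play (the task and variable names below are illustrative and are not taken from this repository's playbook), the interface name stored in `data_iface` can be combined with Ansible's per-interface facts to recover the matching IP address:

```yaml
# Hypothetical follow-on task: look up the IPv4 address bound to the interface
# whose name was stored in the data_iface fact, so it can be used when
# rendering Spark's configuration (task and variable names are illustrative)
- name: Determine the address Spark should bind to on the data network
  set_fact:
    spark_bind_address: "{{ hostvars[inventory_hostname]['ansible_' + data_iface].ipv4.address }}"
```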