The playbook in the provision-spark.yml file in this repository pulls in a set of default values for many of the configuration parameters needed to deploy Spark from the vars/spark.yml file and the default configuration file (the config.yml file). The parameters defined in these files provide a reasonable set of defaults for a fairly generic Spark deployment, either to a single node or a cluster, including defaults for the URL that the Spark distribution should be downloaded from, the directory the distribution should be unpacked into, and the packages that must be installed on the node before the `spark` services can be started.
While you may never need to change most of these values from their defaults, there is a fairly large number of these parameters, so a brief summary of what each one is and how it is used may be helpful. In this section, we summarize all of these options, breaking them out into:
- parameters used to control the Ansible playbook run
- parameters used to configure new nodes that are created in a cloud (AWS or OpenStack) environment
- parameters used during the deployment process itself, and
- parameters used to configure our Spark nodes once Spark has been installed locally.
Each of these sets of parameters is described in its own section, below.
The following parameters can be used to control the `ansible-playbook` run itself, defining things like how Ansible should connect to the nodes involved in the playbook run, which nodes should be targeted, where the Spark distribution should be downloaded from, which packages must be installed during the deployment process, and where those packages should be obtained from:
- `cloud`: this parameter is used to indicate the target cloud for the deployment (either `aws` or `osp`); this controls both the role that is used to create new nodes (when a matching set of nodes does not exist in the target environment) and how the build-app-host-groups role retrieves the list of target nodes for the deployment; if unspecified, this parameter defaults to the `aws` value specified in the default configuration file
- `region`: this parameter is used to indicate the region that should be searched for matching nodes (and, if no matching nodes are found, the region in which a set of nodes should be created for use as a Spark cluster); if unspecified, the default value of `us-west-2` specified in the vars/spark.yml file is used
- `zone`: this parameter is used to indicate the availability zone that should be used when creating new nodes in an OpenStack environment; since this parameter is not needed for AWS deployments, there is no default value for this parameter (and any value provided during an AWS deployment will be silently ignored)
- `tenant`: this parameter is used to indicate the tenant name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified, this parameter defaults to the `datanexus` value specified in the default configuration file
- `project`: this parameter is used to indicate the project name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified, this parameter defaults to the `demo` value specified in the default configuration file
- `dataflow`: this parameter is used to indicate the dataflow name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; the dataflow tag is used to link together the clusters/ensembles (Cassandra, Zookeeper, Kafka, Solr, etc.) that are involved in a given dataflow; if this value is not specified, it defaults to a value of `none` during the playbook run
- `domain`: this parameter is used to indicate the domain name to use (e.g. test, production, preprod), either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; if unspecified, this parameter defaults to the `production` value specified in the default configuration file
- `cluster`: this parameter is used to indicate the cluster name to use, either when creating new nodes (when a matching set of nodes does not exist in the target environment) or when searching for a matching set of nodes in the build-app-host-groups role; this value is used to differentiate clusters of the same type from each other when multiple clusters are deployed for a given application for the same tenant, project, dataflow, and domain; if this value is not specified, it defaults to a value of `a` during the playbook run
- `user`: the username that should be used when connecting to the target nodes via SSH; the value for this parameter will likely change from one target environment to the next; if unspecified, a value of `centos` will be used
- `spark_url`: the URL that the Apache Spark distribution should be downloaded from
- `spark_version`: the version of Spark that should be downloaded; used to switch versions when the distribution is downloaded using the default `spark_url`, which is defined in the vars/spark.yml file
- `local_spark_file`: used to pass in the local path (on the Ansible host) to a directory containing the Apache Spark distribution file (a gzipped tarfile); the distribution file will be uploaded from this directory to the target hosts and unpacked into the `spark_dir` directory
- `config_file`: used to define the location of a configuration file (see the discussion of this topic, below); this file is a YAML file containing definitions for any of the configuration parameters that are described in this section and is more than likely a file that will be created to manage the process of creating a specific ensemble (a hypothetical sketch of such a file appears after this list). Storing the settings for a given ensemble in such a file makes it easy to guarantee that all of the nodes in that ensemble are configured consistently. If a value is not specified for this parameter, then the default configuration file (the config.yml file) will be used; to override this behavior (and not load a configuration file of any kind), one can simply set the value of this parameter to `/dev/null` and specify all of the other, non-default parameters that are needed as extra variables during the playbook run
- `private_key_path`: used to define the directory where the private keys are maintained when the inventory for the playbook run is being managed dynamically; in these cases, the scripts used to retrieve the dynamic inventory information will return the names of the keys that should be used to access each node, and the playbook will search the directory specified by this parameter to find the corresponding key files. If this value is not specified, then the current working directory will be searched for those keys by default
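As a concrete (and purely hypothetical) illustration of how these parameters fit together, a minimal configuration file might look something like the sketch below; the values shown simply restate the documented defaults, and a real file would typically be passed into the playbook run via the `config_file` parameter (for example, as an extra variable on the `ansible-playbook` command line):

```yaml
# example-config.yml -- a hypothetical configuration file; the values shown
# here restate the documented defaults and are placeholders for a real run
cloud: aws
region: us-west-2
tenant: datanexus
project: demo
dataflow: none
domain: production
cluster: a
user: centos
private_key_path: './keys'   # illustrative; the current working directory is searched by default
```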
When the inventory for the playbook run is being controlled dynamically (i.e. when the deployment is targeting nodes in an AWS or OpenStack environment) and no matching nodes are found, the playbook will actually create a new set of nodes (using the tags that were passed into the playbook run) and configure those nodes as a Spark cluster. In that case, there are a number of parameters that must be provided to control the process of node creation:
- `type`: the type of node that should be created; if this value is unspecified, then a default value of `t2.large` (suitable for use in the default AWS deployment) specified in the config.yml file is used
- `image`: the image (AMI ID in the case of an AWS deployment or image UUID in the case of an OpenStack deployment) that should be used when creating new nodes; if this parameter is unspecified in an AWS deployment, then the playbook will search for a suitable image to use for the deployment; this parameter must be specified for an OpenStack deployment (and its value must be the UUID of a pre-existing image that is suitable for use in the playbook run)
- `cidr_block`: the CIDR block of the VPC where the nodes should be created in an AWS deployment (or the equivalent in an OpenStack deployment); it is assumed that this VPC (or OpenStack equivalent) already exists; if it is not specified, then the default value of `10.10.0.0/16` from the config.yml file is used
- `node_map`: a list of dictionary entries where each entry specifies the number of nodes to create (the `count`) for that application (or for each role in a given application deployment if deployment of the cluster involves the deployment of nodes with different roles, like the seed and non-seed nodes in a Cassandra cluster); for the playbook in this repository, the default value for this parameter (which appears in the vars/spark.yml file) will result in the creation of a three-node Spark cluster if no matching nodes were found based on the tags that were passed into the playbook run (a sketch of what an override might look like appears after this list)
- `root_volume`: the size (in GB) of the root volume that should be created when building new nodes in an AWS or OpenStack environment; this parameter has a default value that depends on whether or not there is a corresponding definition for the `data_volume` parameter (see below):
    - if there is no defined value for the `data_volume` parameter, then a root volume that is 40GB in size will be created if this parameter is not defined
    - if there is a defined value for the `data_volume` parameter, then a root volume that is 11GB in size will be created if this parameter is not defined
- `data_volume`: the size (in GB) of the data volume that will be created when building new nodes in an AWS or OpenStack environment; if a value is defined for this parameter, a data volume with the corresponding size will be created for each of the instances that are created by the playbook run, and those data volumes will then be mounted under the `/data` directory for each of those instances; if a value is not defined for this parameter, then no corresponding data volume will be created (and the nodes created by the playbook run will only have a single, root volume)
- `application_sg_rules`: a list of rules used to configure the firewall associated with the internal and external subnets; for the playbook in this repository, the default rules (which should not need to be changed) will result in a number of ports being open on the internal subnet to support internode communications between members of the Spark cluster and client connections to the Web UI interfaces provided by the cluster's master and worker nodes on that same subnet
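To make the `node_map` and volume-related parameters a bit more concrete, the hedged sketch below shows what an override in a configuration file might look like. Only the `count` field is documented above; the `application` key and the instance type shown are illustrative assumptions, so check the default definition in the vars/spark.yml file for the exact form this playbook expects:

```yaml
# Hypothetical node-creation overrides; only the `count` field is documented
# above -- the `application` key and the values shown are illustrative
type: m5.xlarge
root_volume: 11
data_volume: 100
node_map:
  - { application: spark, count: 3 }   # assumed keying; see vars/spark.yml for the default
```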
These parameters are used to control the deployment process itself, defining things like where the distribution should be unpacked, the user/group that should be used when unpacking the distribution and starting the `spark` service, and which packages to install.
- `spark_dir`: the directory that the Spark distribution should be unpacked into; defaults to the `/opt/apache-spark` directory. If necessary, this directory will be created as part of the playbook run
- `spark_package_list`: the list of packages that should be installed on the Spark nodes; typically this parameter is left unchanged from the default (which installs the OpenJDK packages needed to run Spark), but if it is modified, the default OpenJDK packages must be included as part of this list or an error will result when attempting to start the `spark-master` and `spark-worker` services
- `spark_group`: the name of the user group under which Spark should be installed and run; defaults to `spark`
- `spark_user`: the username under which Spark should be installed and run; defaults to `spark`
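A hedged sketch of how these deployment parameters might be overridden in a configuration file is shown below; the package names are illustrative placeholders (as noted above, whatever OpenJDK packages the default `spark_package_list` installs must remain in any modified list):

```yaml
# Hypothetical deployment-parameter overrides; the package names shown are
# illustrative placeholders -- the OpenJDK packages installed by the default
# spark_package_list must remain in any modified list
spark_dir: /opt/apache-spark
spark_user: spark
spark_group: spark
spark_package_list:
  - java-1.8.0-openjdk-headless   # assumed default OpenJDK package (illustrative)
  - wget                          # an illustrative extra package
```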
These parameters are used to configure the Spark nodes themselves during a playbook run, defining things like the interfaces that Spark should be listening on for requests and the directory where Spark should store its data.
- `internal_subnet`: the CIDR block describing the subnet that any nodes created by the playbook run should attach to as a private network (`eth0`); this network is used for internode communications between the nodes of the cluster being deployed (and between nodes of the clusters/ensembles that make up the dataflow that this cluster is a member of); if it is not specified, then the default value of `10.10.1.0/24` from the config.yml file is used; if the deployment is an OpenStack deployment, then a value for the associated `internal_uuid` parameter must also be provided, and that value must be the UUID for an existing internal network in the targeted OpenStack environment
- `external_subnet`: the CIDR block describing the subnet that any nodes created by the playbook run should attach to as a "public" network (`eth1`); this network is used to support client connections to the various services that make up the cluster being deployed (and between nodes of the clusters/ensembles that make up the dataflow that this cluster is a member of); if it is not specified, then the default value of `10.10.2.0/24` from the config.yml file is used; if the deployment is an OpenStack deployment, then a value for the associated `external_uuid` parameter must also be provided, and that value must be the UUID for an existing external network in the targeted OpenStack environment
- `spark_data_dir`: the name of the directory that Spark should use to store its data; defaults to `/data` if unspecified. If necessary, this directory will be created as part of the playbook run
- `spark_master_port`: the port that the Spark master nodes should listen on for internode communications with other members of the cluster; defaults to 7077 if unspecified
- `spark_master_webui_port`: the port that the Spark master node's Web UI should listen on for client connections; defaults to 8080 if unspecified
- `spark_worker_port`: the port that the Spark worker nodes should listen on for internode communications with other members of the cluster; defaults to 7078 if unspecified
- `spark_worker_webui_port`: the port that the Spark worker node's Web UI should listen on for client connections; defaults to 8181 if unspecified
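As an illustration, a configuration file targeting an OpenStack environment might include a block along the following lines (a hedged sketch: the UUIDs are placeholders for networks that would already exist in the target environment, and the remaining values simply restate the documented defaults):

```yaml
# Hypothetical Spark-node configuration overrides for an OpenStack deployment;
# the UUIDs are placeholders and the other values restate the documented defaults
internal_subnet: '10.10.1.0/24'
internal_uuid: '00000000-0000-0000-0000-000000000000'   # placeholder UUID
external_subnet: '10.10.2.0/24'
external_uuid: '11111111-1111-1111-1111-111111111111'   # placeholder UUID
spark_data_dir: /data
spark_master_port: 7077
spark_master_webui_port: 8080
spark_worker_port: 7078
spark_worker_webui_port: 8181
```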
The playbook in this repository will dynamically determine the names of the interfaces that correspond to the defined `internal_subnet` and `external_subnet` CIDR block values and configure the members of the cluster being deployed to listen on those interfaces, either for communication between the nodes that make up the cluster or for client requests. This is accomplished by dynamically constructing an `iface_description_array` parameter within the playbook, then using that parameter to determine the names of the corresponding interfaces and their IP addresses.
Put quite simply, the `iface_description_array` lets you specify a description for each of the networks that you are interested in, then retrieve the names of those networks on each machine in a variable that can be used elsewhere in the playbook. To accomplish this, the `iface_description_array` is defined as an array of hashes (one per interface), each of which includes the following fields:
- `type`: the type of description being provided; currently only the `cidr` type is supported
- `val`: a value describing the network in question; since only `cidr` descriptions are currently supported, a CIDR value that looks something like `192.168.34.0/24` should be used for this field
- `as_var`: the name of the variable that you would like the interface name returned as
With these values in hand, the playbook will search the available networks on each machine and return a list of the interface names for each network that was described in the `iface_description_array` as the value of the fact named in the `as_var` field for that network's entry. For example, given this description:
```yaml
iface_description_array: [
  { as_var: 'data_iface', type: 'cidr', val: '192.168.34.0/24' },
  { as_var: 'api_iface', type: 'cidr', val: '192.168.44.0/24' },
]
```
In this example, the playbook will determine the names of the interfaces that match the CIDR blocks `192.168.34.0/24` and `192.168.44.0/24`, returning those interface names as the values of the `data_iface` and `api_iface` facts, respectively (e.g. `eth0` and `eth1`). These two facts are then used later in the playbook to correctly configure the nodes to talk to each other (over the `data_iface` network) and listen on the proper interfaces for user requests (on the `api_iface` network).
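As a rough, hypothetical sketch of how such a fact might be consumed later in a play (the task and variable names below are illustrative and are not taken from this repository's playbook), the interface name stored in `data_iface` can be combined with Ansible's per-interface facts to recover the matching IP address:

```yaml
# Hypothetical follow-on task: look up the IPv4 address bound to the interface
# whose name was stored in the data_iface fact, so it can be used when
# rendering Spark's configuration (task and variable names are illustrative)
- name: Determine the address Spark should bind to on the data network
  set_fact:
    spark_bind_address: "{{ hostvars[inventory_hostname]['ansible_' + data_iface].ipv4.address }}"
```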