How to manage a running TDP cluster ? #71

PACordonnier · 2023-03-29T10:03:53Z

Collaborator

Questions have been raised about how a running TDP cluster is managed in day-to-day operations.

TOSIT is currently developing TDP-UI, a web GUI that sits on top of tdp-server. Comparisons have been made with Ambari, which operators are used to.

The goal of this discussion is to establish which features such a product needs and what software could be bundled in TDP to provide them.
If this product is NOT TDP-UI, we should make sure to specify what TDP-UI does and does not do.

Features required for the operators:

- Have an overview of what's currently deployed
- See the status of deployed services in the cluster
- Restart failed services
- Rolling restart of components (e.g. one HBase region server at a time, every 10 minutes)
- See the status of a machine, and what's deployed on it
- Alerts in case of problems
- See the configuration currently deployed
- ACLs for the UI

This list is not exhaustive; add more if needed.
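
To make the rolling-restart requirement concrete, here is a minimal Python sketch. It is not an existing TDP API: the `restart` callable, the host names, and the injectable `sleep` parameter are all hypothetical. It only shows the intended shape, which is to restart one host at a time, stop at the first failure, and wait a fixed interval between hosts.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], bool],
    interval_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart a component on one host at a time.

    `restart` is a hypothetical callable that restarts the component on the
    given host and returns True on success. The function returns the hosts
    that were restarted successfully and stops at the first failure, so the
    remaining hosts are left untouched.
    """
    host_list = list(hosts)
    done: List[str] = []
    for index, host in enumerate(host_list):
        if not restart(host):
            break
        done.append(host)
        # Wait between hosts, but not after the last one.
        if index < len(host_list) - 1:
            sleep(interval_seconds)
    return done
```

With the default `interval_seconds=600.0`, this matches the "every 10 minutes" example. Stopping at the first failure is one possible policy; an operator might prefer to retry or to continue instead.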

Some of these features may be implemented by TDP-UI; others may not.

The main drawback of TDP-UI is that it relies exclusively on tdp-lib. Since tdp-lib relies on Ansible, getting the status of a running machine is not a lightweight operation. It is probably not a good idea to rely on tdp-lib to pull status or other information regularly. For comparison, Ambari runs an Ambari-agent on every host, which gathers this information locally and sends it to the Ambari server.
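
To illustrate the agent approach, here is a rough Python sketch of what a local status collector could look like. It is not how Ambari-agent or any TDP component is implemented. It assumes the services run as systemd units; the unit names passed in, and the function names `systemd_state`, `collect_status` and `to_payload`, are hypothetical. How the payload would be sent to a server is left out.

```python
import json
import socket
import subprocess
from typing import Callable, Dict, List


def systemd_state(unit: str) -> str:
    """Return the state printed by `systemctl is-active` for a unit."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code only means "not active"
    )
    return result.stdout.strip() or "unknown"


def collect_status(
    units: List[str],
    get_state: Callable[[str], str] = systemd_state,
) -> Dict[str, object]:
    """Build a status report for this host using local checks only."""
    return {
        "host": socket.gethostname(),
        "units": {unit: get_state(unit) for unit in units},
    }


def to_payload(report: Dict[str, object]) -> str:
    """Serialize the report so it can be pushed to a central server."""
    return json.dumps(report, sort_keys=True)
```

The point of the sketch is that each check is a cheap local call, with no SSH connection or Ansible run per status request.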

Some products already exist that do this kind of job (e.g. Nagios, Shinken). They map the services in a cluster, try to restart services that crash, send alerts, manage escalation, produce service availability metrics, and so on. It may be a good idea to rely on that kind of product.

Feel free to share any thoughts on the subject.

rpignolet · 2023-03-29T10:46:29Z

Maintainer

TDP Manager (tdp-lib, tdp-server, tdp-ui) should focus on configuring a Hadoop cluster and restarting the services/components impacted by changes to that configuration.

More concretely, TDP Manager is a tool for editing YAML files stored in local git repositories (tdp_vars). Then this tool launches operations (Ansible playbooks) on the cluster to reconfigure it (tdp deploy/tdp reconfigure).

TDP Manager does not manage hosts directly. It manages services and components, and it is up to the operations (i.e. playbooks) to run on the correct hosts. TDP Manager also does not allow changing the cluster topology, because that would mean changing the Ansible configuration, and I don't think coding an Ansible configuration tool is a good idea.

TDP Manager also does not manage the state of the services/components it starts. It initiates an operation (i.e. a playbook) and relies on the return code to determine whether the operation succeeded. Concretely, the start is done via systemctl start ..., so if this command returns an error, the operation is marked as failed. On the other hand, if it returns without error and the service/component crashes just after starting, there is no way to know.
For this case, we plan to write check playbooks that wait for the service/component to be operational, but they will not be in the first version.
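
As a rough idea of what "wait for the service/component to be operational" could mean, here is a small Python sketch that polls a TCP port until it accepts connections or a timeout expires. This is only an assumption about one possible check: the planned check playbooks may work differently, and an open port is just a proxy for a healthy service. The function name `wait_for_port` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_port(
    host: str,
    port: int,
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed immediately; we only test reachability.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A check like this, run after the start command returns, would catch the case where the start succeeds but the process dies before it begins listening. In a playbook, Ansible's `wait_for` module plays a similar role.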

In my opinion, the dashboard of a TDP cluster is a Grafana dashboard. TDP Manager is not the entry point for the cluster operator.

PACordonnier · 2023-04-18T10:11:06Z

Collaborator Author

Features required in TDP-lib are tracked in TOSIT-IO/tdp-lib#308
