How to manage a running TDP cluster? #71
Replies: 2 comments
TDP Manager ( More concretely, TDP Manager is a tool for editing YAML files stored in local git repositories (

TDP Manager does not directly manage the hosts. It manages services and components, and it is up to the operations (i.e. playbooks) to run on the correct hosts. TDP Manager also doesn't allow changing the cluster topology, because that would mean changing the Ansible configuration, and I don't think coding an Ansible configuration tool is a good idea.

TDP Manager also does not manage the state of the services/components it starts. It initiates an operation (i.e. a playbook) and relies on the return code to determine whether the operation was successful. Concretely the start is done via

In my opinion, the dashboard of a TDP cluster is a Grafana dashboard. TDP Manager is not the entry point for the cluster operator.
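To illustrate the "initiate an operation and rely on the return code" model described above, here is a minimal Python sketch. It is not TDP Manager's actual code: `OperationResult` and `run_operation` are hypothetical names, and the command passed in is a stand-in for whatever would actually launch the playbook.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class OperationResult:
    """Outcome of one operation, judged only by the process return code."""

    name: str
    return_code: int

    @property
    def success(self) -> bool:
        # No service state is inspected: exit code 0 is the only success signal.
        return self.return_code == 0


def run_operation(name: str, command: list[str]) -> OperationResult:
    """Run the command behind an operation and record its return code.

    `command` is a placeholder for the process that would launch the
    playbook; nothing here is specific to TDP or Ansible.
    """
    completed = subprocess.run(command, check=False)
    return OperationResult(name=name, return_code=completed.returncode)
```

For example, `run_operation("demo_start", [sys.executable, "-c", "raise SystemExit(0)"]).success` evaluates to `True`, while a command that exits with a non-zero code is reported as failed. Whether the service stays up afterwards is outside the scope of this model, which is why monitoring is left to another tool.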
Features required in TDP-lib are tracked in TOSIT-IO/tdp-lib#308
Questions have been raised about how a running TDP cluster is managed for day-to-day operations.
TOSIT is currently developing TDP-UI, a web GUI that sits on top of tdp-server. Comparisons have been made with Ambari, which operators are used to.
The goal of this discussion is to establish the features needed for such a product and what software could be bundled in TDP to provide them.
If this product is NOT TDP-UI, we should make sure to specify what TDP-UI does and does not do.
Features required for the operators:
Some of these features may be implemented by TDP-UI, some may not.
The main drawback of TDP-UI is that it uses tdp-lib exclusively. Since tdp-lib relies on Ansible, getting a status from a running machine is not a lightweight operation. It is probably not a good idea to rely on tdp-lib to pull status or other information regularly. For comparison, Ambari runs an Ambari agent on every host, which collects the information locally and sends it to the Ambari server.
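To make the agent model concrete, here is a minimal Python sketch of what a per-host agent could do: probe local services cheaply and build a report that would then be sent to a central server. This is not how Ambari's agent is implemented. The function names, the port-based probe, and the report fields are all assumptions made for illustration.

```python
import json
import socket
import time
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Cheap local probe: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_report(
    hostname: str,
    services: dict[str, int],
    probe: Callable[[str, int], bool] = port_is_open,
) -> str:
    """Probe each local service and return a JSON report for the server.

    `services` maps a (hypothetical) service name to the local port it
    is expected to listen on.
    """
    statuses = {
        name: "up" if probe("127.0.0.1", port) else "down"
        for name, port in services.items()
    }
    return json.dumps(
        {"host": hostname, "timestamp": time.time(), "services": statuses}
    )
```

An agent would call `build_report` on a timer and push the result to the server. Each probe is a local socket connection rather than an Ansible run, which is what keeps regular polling cheap.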
Some products already exist that do this kind of job (e.g. Nagios / Shinken). Their job is to map the services in a cluster, try to restart services that crash, send alerts, manage escalation, and produce service availability metrics. It may be a good idea to rely on that kind of product.
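As an illustration of how such products collect status, Nagios-style check plugins report their result through the process exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The Python sketch below follows that convention for a disk-usage check. The check itself, its thresholds, and the function names are hypothetical examples, not something TDP ships.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent: float, warn: float, crit: float) -> int:
    """Map a usage percentage to a plugin status code."""
    if not 0.0 <= used_percent <= 100.0:
        return UNKNOWN
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0) -> tuple[int, str]:
    """Return (status code, one-line message) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    code = evaluate(used_percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {used_percent:.1f}% used on {path}"


# A real plugin would print the message and exit with the status code, e.g.:
#     code, message = check_disk()
#     print(message)
#     sys.exit(code)
```

The monitoring server runs checks like this on a schedule and handles alerting and escalation itself, so each check can stay a small standalone script.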
Feel free to share any thoughts on the subject.