A fogØ5-based orchestrator built to efficiently and effectively deploy and maintain massive distributed architectures

Davide-DD/fog05-orchestrator

fogØ5 orchestrator

fogØ5 is a platform that aims to provide a decentralized infrastructure for provisioning and managing compute, storage, communication and I/O resources available anywhere across the network. It lets users deploy various entities (VMs, containers and so on) anywhere in the network, and it lets them specify coarse-grained dependencies between entities so that a complex deployment phase is carried out correctly. However, finer-grained dependencies and large deployment scenarios still need to be managed effectively and efficiently: writing a long and complex Python script against the fogØ5 APIs results in code that is unreadable and hard to maintain and extend. This orchestrator aims to solve these problems by enriching the functionality already offered by fogØ5 with an extended management paradigm that makes it possible to:

  • Specify fine-grained dependencies of the entities, so that they receive all the information they need to execute correctly
  • Define combinations of entities (called architectures here) that offer the same business service, and employ them one at a time to provide that service: the QoS of the active architecture is continuously monitored, and the architecture is replaced with one of the others if the quality drops below a predefined threshold

What follows is a brief description of the orchestrator and the files in this repository; if you are interested in finer details, you can check out my master’s thesis here.

Motivations

For my master’s thesis, I had to deploy two distributed machine learning architectures to provide a prediction service, so we chose to use fogØ5 and to enrich it with the proposed orchestrator. We created two architectures (one based on gossip learning, the other on federated learning) and mapped the entities composing them to entities manageable by fogØ5. Once we had the entities, we created a class for each architecture defining how it needed to be deployed and providing the fine-grained dependencies before instantiating the entities. What seemed clear to me was the opportunity to offer the service in a much more resilient and efficient way by using both architectures, one at a time: we started by deploying the federated one and then, in case of a failure, the orchestrator automatically reacted by providing the gossip one, so that the business service kept being served without manual intervention. This was possible because all the logic for correctly deploying these architectures lived inside Python classes (as we are going to see in a moment).

Structure

To achieve these purposes, the proposed orchestrator is composed of four main classes: ArchitectureRepository, ArchitectureDataRepository, StrategyRepository and ActiveStrategyManager.

ArchitectureRepository

This class simply stores the architectures that the user creates. It uses a dictionary where the keys are the names of the classes implementing the architectures and the values are instances of those classes. To register an architecture, the user calls the add_architecture method of ArchitectureRepository, passing an instance of a class that extends the Architecture interface (explored in a moment). By architecture, we mean a complex composition of entities or atomic entities (in fogØ5 jargon) with dependencies between them that need to be resolved before deploying them. These dependencies can be of many types; in our case, they were mainly nodes needing the IP addresses of certain other entities, or parameters adjusting an entity’s behavior as requested by the user. The Architecture interface defines a lifecycle for any class extending it, exposing 6 methods (plus 3 utility methods) through which the entities composing the architecture are instantiated and terminated on the basis of the current state. These methods are (presented in the order of the lifecycle that the architecture must follow):

  1. Configure: each architecture offers its own configuration parameters to control its deployment. This method sets them: it takes a dictionary whose keys are the properties to set and whose values are the desired settings for those properties. After this method executes successfully, the architecture reaches the configured state;
  2. Associate: each architecture can be used in different environments (which, in fogØ5, means it can be used with different YAKS instances). We therefore need an operation that ties an architecture instance to a specific environment, so that the architecture can be used there. This method does so, taking as arguments an instance of the FIMAPI class from the fogØ5 module (a class providing APIs to interact with a given YAKS instance), a dictionary containing other useful information about the environment (which can help map the entities to the most appropriate nodes) and another dictionary with more specific directives about the deployment of the architecture (it differs from the dictionary of the previous point because it should contain constraints imposed by the environment we are associating with). If this method executes successfully, the architecture reaches the associated state;
  3. Deploy: this is the main method of the lifecycle. It contains the deployment logic, instantiating all the associated entities and providing them with the information and additional data they need to work properly. It takes a single argument: the path of the directory containing the files to be loaded on one or more nodes of the architecture. This operation relies on the fundamental utility method get_mapping, which uses the FIMAPI instance, the environment dictionary, the current state of the architecture and the requirements of the architecture (defined in the “private” utility method __get_requirements) to return the node assigned to each entity; the current state is passed on to __get_requirements so that it knows which entities need a mapping (and therefore a corresponding node). If this method executes successfully, the architecture reaches the deployed state. At this point, we can check whether the architecture is actually working by calling the utility method check_status, which contains the control logic for the architecture and returns a boolean indicating whether it is online or offline;
  4. Serve: this method is strictly related to the previous one. It deploys the entities that will use the service offered by the architecture (activated through deploy), again using get_mapping as explained above and providing them with all the settings and additional data they need. This instantiation is separate from the previous one because the architecture may need some time to get ready to provide the service; with this separation, we are free to start serving when we consider it ready. If this method executes successfully, the architecture reaches the served state;
  5. Undeploy: we call this method when we need to stop providing the service or to replace the current architecture with another one. It terminates all the running entities (both those started by deploy and those started by serve) and offloads all the descriptors loaded by this architecture into the environment. After this method executes successfully, the architecture reaches the undeployed state; from there, it can only return to the deployed state by executing the deploy method.

When managing an architecture, we have to execute the above methods one after the other. As stated before, only from the undeployed state can we go back to the deployed state, while from any other state we can only proceed forward; if we need a similar architecture with different configurations and parameters, we have to restart from the configure method using a new instance of the same architecture.
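
The lifecycle rules above can be pictured as a small state machine. The following is only a minimal sketch under stated assumptions, not the code in this repository: the initial state name "created", the TRANSITIONS table, the _run helper, the underscore-prefixed hooks and the EchoArchitecture class are all hypothetical, and allowing undeploy from both the deployed and the served state is an assumption as well.

```python
from abc import ABC, abstractmethod

# Each operation maps to (states it may start from, state it leads to).
# "created" is a hypothetical name for the state before configure.
TRANSITIONS = {
    "configure": ({"created"}, "configured"),
    "associate": ({"configured"}, "associated"),
    "deploy": ({"associated", "undeployed"}, "deployed"),
    "serve": ({"deployed"}, "served"),
    "undeploy": ({"deployed", "served"}, "undeployed"),  # assumption
}


class Architecture(ABC):
    """Hypothetical base class enforcing the lifecycle order."""

    def __init__(self):
        self.state = "created"

    def _run(self, operation, hook, *args):
        sources, target = TRANSITIONS[operation]
        if self.state not in sources:
            raise RuntimeError(f"cannot {operation} from state '{self.state}'")
        hook(*args)          # architecture-specific logic
        self.state = target  # reached only if the hook did not raise

    def configure(self, configuration):
        self._run("configure", self._configure, configuration)

    def associate(self, fim_api, environment, directives):
        self._run("associate", self._associate, fim_api, environment, directives)

    def deploy(self, data_path):
        self._run("deploy", self._deploy, data_path)

    def serve(self, data_path):
        self._run("serve", self._serve, data_path)

    def undeploy(self):
        self._run("undeploy", self._undeploy)

    @abstractmethod
    def _configure(self, configuration): ...

    @abstractmethod
    def _associate(self, fim_api, environment, directives): ...

    @abstractmethod
    def _deploy(self, data_path): ...

    @abstractmethod
    def _serve(self, data_path): ...

    @abstractmethod
    def _undeploy(self): ...

    @abstractmethod
    def check_status(self): ...


class EchoArchitecture(Architecture):
    """Toy architecture that only records which hooks ran."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def _configure(self, configuration):
        self.calls.append("configure")

    def _associate(self, fim_api, environment, directives):
        self.calls.append("associate")

    def _deploy(self, data_path):
        self.calls.append("deploy")

    def _serve(self, data_path):
        self.calls.append("serve")

    def _undeploy(self):
        self.calls.append("undeploy")

    def check_status(self):
        # Toy rule: online whenever the entities are deployed or serving.
        return self.state in ("deployed", "served")
```

With this sketch, calling configure on an instance that has already been undeployed raises a RuntimeError, which mirrors the rule that a different configuration requires a new instance.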

ArchitectureDataRepository

This repository keeps in memory the associations between an architecture and the data it needs for deploy or serve operations. It offers the add_architecture_data method, which takes the name of the architecture, the identifier of the data being loaded and the paths to the files needed at deployment time and at serve time. After a successful call, the data are stored and can be fetched later, from any environment, at deploy or serve time of our architectures.
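
The two in-memory repositories described so far can be sketched as thin wrappers around dictionaries. This is an illustration under assumptions rather than the repository's actual code: add_architecture and add_architecture_data are named in the text, while the getters, the parameter names and the (architecture name, data identifier) key layout are hypothetical.

```python
class ArchitectureRepository:
    """Stores architecture instances keyed by the name of their class."""

    def __init__(self):
        self._architectures = {}

    def add_architecture(self, architecture):
        # Key: name of the class implementing the architecture.
        self._architectures[type(architecture).__name__] = architecture

    def get_architecture(self, name):  # hypothetical accessor
        return self._architectures[name]


class ArchitectureDataRepository:
    """Associates an architecture name and a data identifier with file paths."""

    def __init__(self):
        self._data = {}

    def add_architecture_data(self, architecture_name, data_id,
                              deploy_path, serve_path):
        # Hypothetical layout: one entry per (architecture, data id) pair.
        self._data[(architecture_name, data_id)] = {
            "deploy": deploy_path,
            "serve": serve_path,
        }

    def get_architecture_data(self, architecture_name, data_id):  # hypothetical
        return self._data[(architecture_name, data_id)]


class FederatedLearning:
    """Placeholder standing in for a real Architecture subclass."""


architectures = ArchitectureRepository()
architectures.add_architecture(FederatedLearning())

data = ArchitectureDataRepository()
data.add_architecture_data("FederatedLearning", "run-1",
                           "/tmp/fl/deploy", "/tmp/fl/serve")
```

Keying the data by both architecture name and data identifier is one way to let the same architecture be paired with several data sets.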

StrategyRepository

This repository stores strategies, which are instances of the class Strategy: an abstraction that records a combination of architectures providing the same service as efficiently and resiliently as possible. This combination is defined by two elements: one main architecture and zero or more replacement architectures. Every architecture in a strategy provides the same service, but with different performance: the main architecture is the preferred choice and should therefore offer the best performance, while the replacement architectures are activated when the main one does not work, and should therefore be less expensive (in terms of nodes needed for deployment) at the cost of lower quality. Ultimately, the purpose of this abstraction is to ensure continuity and efficiency in the provisioning of a business service by defining a strategy (hence the name of the class) composed of different architectures ready to be activated when needed. Strategies are stored by calling the define_strategy method with a dictionary containing the name of the main architecture and its configuration, and a list of dictionaries each containing the name of a replacement architecture and its configuration: it is in the StrategyRepository that the configuration of the architectures takes place.
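
As an illustration of how define_strategy might configure architectures while leaving the stored ones untouched, here is a minimal sketch. Only the names Strategy, StrategyRepository and define_strategy come from the text; the dictionary keys "name" and "configuration", the integer identifier returned, the get_strategy accessor, the use of copy.deepcopy and the StubArchitecture class are assumptions made for the example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Strategy:
    main: object                                       # preferred architecture
    replacements: list = field(default_factory=list)   # fallbacks, in order


class StrategyRepository:
    def __init__(self, architectures):
        # Mapping: architecture name -> unconfigured architecture instance.
        self._architectures = architectures
        self._strategies = []

    def _configured(self, entry):
        # Work on a copy so the stored architecture stays unconfigured
        # and can be reused in other strategies.
        architecture = copy.deepcopy(self._architectures[entry["name"]])
        architecture.configure(entry["configuration"])
        return architecture

    def define_strategy(self, main, replacements):
        strategy = Strategy(
            main=self._configured(main),
            replacements=[self._configured(r) for r in replacements],
        )
        self._strategies.append(strategy)
        return len(self._strategies) - 1  # hypothetical strategy identifier

    def get_strategy(self, strategy_id):  # hypothetical accessor
        return self._strategies[strategy_id]


class StubArchitecture:
    """Placeholder that only remembers its configuration."""

    def __init__(self):
        self.configuration = None

    def configure(self, configuration):
        self.configuration = dict(configuration)


stored = {"Federated": StubArchitecture(), "Gossip": StubArchitecture()}
repository = StrategyRepository(stored)
strategy_id = repository.define_strategy(
    {"name": "Federated", "configuration": {"rounds": 10}},
    [{"name": "Gossip", "configuration": {"peers": 4}}],
)
```

Copying before configuring follows the rule that a differently configured architecture requires a new instance.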

ActiveStrategyManager

This class manages active strategies, which are strategies whose main architecture (or one of its replacement architectures) has been deployed. The class ActiveStrategy stores the main architecture, the replacement architectures, the arguments needed for association operations, the identifier used to access additional data for deploy or serve operations, the strategy this active strategy refers to and a method (more on this later). Four operations are offered, which also define a lifecycle for an ActiveStrategy:

  1. Start: activates a strategy by associating and then deploying its main architecture. We first carry out the association as described before, passing the FIMAPI instance (created by this class from the address of the YAKS instance provided by the user), the environment dictionary and the additional association arguments. Then we deploy the architecture with the deploy method described above, passing the path from which the architecture fetches the additional files for its deployment. Once the main architecture is deployed, this method stores the information about the activated strategy in a list (recording all the parameters described before) and, at the same time, uses the class Watcher to start a thread that checks the status of the deployed architecture (either the main one or one of the replacements) at a user-defined frequency, to see whether it is still working or needs to be replaced. If a replacement architecture is active, the thread also checks whether the main architecture can take over again, in order to return to the best possible performance. When a change in status is detected, the thread triggers the manage method (described below);
  2. Serve: once a strategy is active, this operation makes it provide its service by calling the serve method of the deployed architecture, passing it the path containing the needed data;
  3. Stop: undeploys the currently active architecture of the specified strategy (either the main architecture or one of the replacements). In the current implementation, a strategy cannot be reactivated after a stop, since it is also deleted from the local list of active strategies: to reactivate it, we need to call start again and therefore allocate a new ActiveStrategy;
  4. Manage: as anticipated, this is the method the Watcher calls when it identifies a status change in one of the architectures. Two kinds of change can be detected, main-ready and needs-replacement: the former means that a replacement architecture is providing the service but the main one is ready to come back online, while the latter means that the main architecture has a problem and needs to be substituted. In the first case, the active replacement architecture is undeployed and the main architecture is deployed again and served immediately: the serving phase cannot be delayed here, because a replacement usually happens while the architecture is already fulfilling the service, so continuing to provide it is the priority. In the second case, the main architecture is undeployed, and the algorithm then looks for a replacement architecture that can be deployed and, if it finds one, deploys it.
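
The reaction logic of the Manage operation can be sketched without threads, as two plain functions that a Watcher-like loop would call periodically. This is a rough illustration under assumptions, not the repository's implementation: detect_change, the can_deploy check, the data_path attribute and StubArchitecture are hypothetical, replacement architectures are assumed to be already associated, and the case of a failing replacement is not handled.

```python
MAIN_READY = "main-ready"
NEEDS_REPLACEMENT = "needs-replacement"


class ActiveStrategy:
    def __init__(self, main, replacements, data_path):
        self.main = main
        self.replacements = list(replacements)
        self.data_path = data_path
        self.active = main  # architecture currently deployed, or None


def detect_change(strategy):
    """One polling step: return a change name, or None if nothing changed."""
    if strategy.active is strategy.main:
        return None if strategy.main.check_status() else NEEDS_REPLACEMENT
    # A replacement (or nothing) is active: can the main one come back?
    return MAIN_READY if strategy.main.can_deploy() else None


def manage(strategy, change):
    if change == MAIN_READY:
        if strategy.active is not None:
            strategy.active.undeploy()
        strategy.main.deploy(strategy.data_path)
        strategy.main.serve(strategy.data_path)  # served right away
        strategy.active = strategy.main
    elif change == NEEDS_REPLACEMENT:
        strategy.main.undeploy()
        strategy.active = None
        for candidate in strategy.replacements:
            if candidate.can_deploy():
                # The text only mentions deploying the replacement here.
                candidate.deploy(strategy.data_path)
                strategy.active = candidate
                break


class StubArchitecture:
    """Placeholder architecture with switchable health flags."""

    def __init__(self, name, healthy=True, deployable=True):
        self.name = name
        self.healthy = healthy
        self.deployable = deployable
        self.log = []

    def check_status(self):
        return self.healthy

    def can_deploy(self):  # hypothetical resource/compatibility check
        return self.deployable

    def deploy(self, data_path):
        self.log.append("deploy")

    def serve(self, data_path):
        self.log.append("serve")

    def undeploy(self):
        self.log.append("undeploy")
```

In a threaded version, a loop would call detect_change at the user-defined frequency and pass any non-None result to manage.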

Conclusions

  • By combining architectures that supply the same functionality, the proposed orchestrator lets a service be provided resiliently and efficiently. Once we start a strategy, we also begin continuously monitoring the status of the active architecture, and the orchestrator automatically reacts to status changes by deploying one of the replacement architectures in place of the main one: there is no need for manual intervention. This is a gain not only in reaction time to the problems the architecture encounters, but also in automating and speeding up the operations a user would otherwise perform to deploy and undeploy the architecture. An architecture also has storage and computation requirements that must be met when deploying it: a user would first have to check manually for compatibility with the current environment (available nodes versus required ones) and then deploy the entities, possibly choosing the best node for each one, whereas this orchestrator automates this aspect as well.
  • Another important aspect is the flexibility this orchestrator offers: we defined a general Architecture interface that every architecture managed by this tool needs to extend. The only other requirement is to map the nodes composing the architecture to fogØ5 atomic entities but, given the many management possibilities offered by fogØ5, this requirement is not very restrictive, as we demonstrated in the previous sections with the architectures we created. Moreover, the division between ArchitectureRepository (containing the “pure” form of the architecture), StrategyRepository (containing the configured state of the architecture) and ActiveStrategyManager (containing the associated state of the architecture) allows the same architecture to be easily reused in many different strategies and active strategies. ArchitectureDataRepository complements this by letting us associate many different additional sets of data with the same architecture, to further customize its behavior.
  • Finally, one last perk is ease of use: each of the orchestrator’s classes is very easy to instantiate and use (apart from ActiveStrategyManager, which is a little more complex), making the core logic easy to understand and control. All the complexity is relegated to the actual architectures, where the end user needs to define the logic for the operations presented before. This should not be a problem, since the APIs and the descriptor schemas offered by fogØ5 are simple enough to permit a smooth mapping of the needed architectures even without a deep understanding of fogØ5 itself.

Test use case

In the example folder, you can try out the test execution we used for this master’s thesis. First, copy data to the architecture_data_repository folder inside the src folder. Then, copy the contents of architectures to the architectures folder (src -> architecture_repository). You will need images for the entities of these architectures: you can get them from this link, or just use random LXC images (but you will not be able to see the prediction service in action). Finally, copy run.py to the src folder, modify it to provide the correct paths and execute it, providing the address of the YAKS instance.

License and more

fog05 is published under the Apache License 2.0.

Copyright (©) Davide Di Donato 2019