
Roadmap


(Work in progress) List of features to be completed for CLARA 4.4/5.x

Plugins/services

Plugins should provide an info file about their services

The services.yaml file currently requires the user to know the actual classpath/library of the service in order to use it as part of the application.

# services.yaml
io-services:
  reader:
    class: org.jlab.clara.demo.services.ImageReaderService
    name: ImageReaderService
  writer:
    class: org.jlab.clara.demo.services.ImageWriterService
    name: ImageWriterService

services:
  - class: org.jlab.clara.demo.services.FaceDetectorService
    name: FaceDetectorService
  - class: pupil_detector_service
    name: PupilDetectorService
    lang: cpp

The user should not need to be aware of the implementation details of the services nor their actual location on disk.

At the root of the plugin directory there should be an info.yaml file (TODO: decide the right name), which lists all services provided by the plugin:

# info.yaml
plugin:
  services:
    - class: org.jlab.clara.demo.services.ImageReaderService
      name: ImageReaderService
      lang: java
    - class: org.jlab.clara.demo.services.ImageWriterService
      name: ImageWriterService
      lang: java
    - class: org.jlab.clara.demo.services.FaceDetectorService
      name: FaceDetectorService
      lang: java
    - class: pupil_detector_service
      name: PupilDetectorService
      lang: cpp

TODO: Should this list of services be separated into sections?

TODO: If <engine>.yaml specification files are used, then what should this list contain? Paths to the spec files of each service?

TODO: If we enforce a layout with a single service per folder, <plugin>/services/<service>, then the spec files can be obtained from <plugin>/services/<service>/info.yaml and the plugin info file would not be necessary?

This plugin info file frees the user from knowing the details about the services and allows them to use just their registration names to compose an application:

# services.yaml
io-services:
  reader:
    name: ImageReaderService
  writer:
    name: ImageWriterService

services:
  - name: FaceDetectorService

  - name: PupilDetectorService
    lang: cpp

CLARA can then map each service name to the proper class to be loaded by using the information provided by the plugin itself.
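As a rough illustration, the mapping can be as simple as scanning each plugin's info.yaml for a matching name entry. The sketch below assumes the info.yaml layout above and uses SnakeYAML for parsing; the ServiceResolver class and its method are hypothetical, not part of the current API:

// Illustrative only: resolve a registration name (from services.yaml) to the
// engine class declared in the plugin's info.yaml.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public final class ServiceResolver {

    // Returns the "class" entry of the service with the given registration name,
    // or null if this plugin does not provide such a service.
    @SuppressWarnings("unchecked")
    public static String resolve(Path pluginInfoFile, String serviceName) throws Exception {
        try (InputStream in = Files.newInputStream(pluginInfoFile)) {
            Map<String, Object> root = new Yaml().load(in);
            Map<String, Object> plugin = (Map<String, Object>) root.get("plugin");
            List<Map<String, Object>> services = (List<Map<String, Object>>) plugin.get("services");
            for (Map<String, Object> service : services) {
                if (serviceName.equals(service.get("name"))) {
                    return (String) service.get("class");
                }
            }
            return null;
        }
    }
}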

TODO: Decide if lang should be a required key or not. The language information should be provided by the info.yaml file, but it is also useful to state the language clearly in services.yaml.

TODO: How to differentiate between services of the same name that are provided by different plugins? By the classpath? By the plugin name?

TODO: Services installed individually (i.e. not part of a plugin) should also provide some kind of info.yaml file, but currently there is no specific layout for single services.

TODO: Should info.yaml also list plugin-provided orchestrators?

Services should provide an info file with their metadata

Currently, the service metadata is provided programmatically through the Engine interface. But this limits access to that information to deployed services only. There is no way to show the metadata of installed-but-not-deployed (i.e. available) services.

Services should provide a YAML specification file with their metadata. This file can then be parsed when the service is created, in order to get the return data for each of the Engine interface methods.

The EngineSpecification class is available to service developers for using a specification file. The AbstractService base class in the CLARA standard library already provides this functionality. For ServiceA, a ServiceA.yaml resource must exist, which will then be loaded and parsed automatically.

The specification file should at least contain the following fields:

---
name: EvioToEvioReader
engine: org.jlab.clas.std.services.convertors.EvioToEvioReader
type: java

author: Sebastián Mancilla
email: smancill@jlab.org

version: 1.4
description:
  Reads EVIO events from a file.

  Returns a new event on each request or an error if there was some problem.

The clasrec-io services already use an experimental version of this spec file.
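For reference, a service relying on this mechanism could look roughly like the sketch below. The package and class names follow the description above, but the exact shape of AbstractService is an assumption, not the definitive API, and the other required Engine methods are omitted:

package org.jlab.clas.std.services.convertors;

import org.jlab.clara.engine.EngineData;
import org.jlab.clara.std.services.AbstractService;

// Sketch: the metadata methods of the Engine interface (description, version,
// author, ...) are provided by AbstractService, which loads and parses the
// EvioToEvioReader.yaml resource through EngineSpecification.
public class EvioToEvioReader extends AbstractService {

    @Override
    public EngineData execute(EngineData input) {
        // Read the next EVIO event and return it (or an error) -- omitted here.
        return input;
    }

    // configure(), executeGroup(), the data-type getters, etc. are omitted.
}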

TODO: Complete the specification format.

TODO: Supported input and output mime-types should be listed explicitly.

TODO: The EngineSpecification class requires the spec file to be located as a resource. If we enforce a layout of one service per folder, <plugin>/services/<service>, then it should first try $CLARA_PLUGINS/**/services/<service>/info.yaml

TODO: The description may contain extended markup, like Markdown. It could then be shown as HTML in some graphical interface instead of plain text.

CLARA Shell

Finish API to add custom Java-based commands

The main idea of implementing the shell in Java was to support extensibility. Plugins should be able to add custom Java-based commands and custom variables. Most of the pieces are already present in the code and the feature just needs to be finished.

The plugin should provide a factory class that will register commands and variables within the shell. With support for info.yaml in the root of the plugin, the factory can be defined as:

# info.yaml
plugin:
  shell:
    factories:
      - class: org.jlab.clara.demo.shell.DemoFactory

Multiple factories should be supported. On startup, the shell should load all available factories.
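Purely as an illustration of the intended shape, a factory could look like the sketch below; the ShellExtensionFactory interface and the register* methods are placeholders, since the actual API is not finished:

package org.jlab.clara.demo.shell;

// Hypothetical sketch of a plugin-provided factory. Interface and method names
// are placeholders for whatever the finished shell API exposes.
public class DemoFactory implements ShellExtensionFactory {

    @Override
    public void extend(ShellContext shell) {
        // register a custom command
        shell.registerCommand(new RunAnalysisCommand());
        // register a custom variable with a default value
        shell.registerVariable("analysisConfig", "/path/to/default.yaml");
    }
}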

This feature would enable running CLAS12 reconstruction and analysis directly from the shell, if analysis commands are provided.

TODO: Make loading optional? Register the command but actually load the class the first time the command is used?

TODO: Register commands and variables directly from the YAML.

TODO: Test launching GUI apps.

Provide information about available services

If service spec files are supported, then all the information about services is accessible offline (instead of only for deployed services).

The shell should contain commands to query and show available services. This information can be parsed from the different plugin and services info YAML files.

TODO: Design these commands.

Improve displaying shell variables

Shell variables are defined with a default value, or unset by default. Then the user can change their values, and the current values of all variables can be obtained with show config.

But the current presentation is not optimal, and it should be improved.

  • Group variables by section. Currently they are listed together by the order they were defined, and it's hard to make sense of their organization or purpose.

  • The run local and run farm commands require different default values for the same core variables, which adds a lot of confusion (it is the reason the threads variable, for example, is unset by default). We need to find a solution that keeps things as simple as possible. The main issue is that both commands overlap over the same variables. Maybe run farm should ignore threads and javaMemory.

  • List unset variables separately. Showing NO VALUE is just confusing. Try to minimize the number of variables that are unset by default.

  • Add a show variable <variable> (or show config <variable>?) to display a single variable (or maybe a regex matching a group of variables).

  • Show next to the variable whether its value has been modified from the default. It could be a different color (as in the IntelliJ IDEA settings) or some kind of text mark like (*).

  • Add commands to unset a variable or reset a variable to its default value.

(Optionally) Mark shell variables as advanced

There should be a flag to mark a variable as advanced. Then the variable would not appear when the show command is used normally. This would help reduce noise when too many variables are defined.

(Optionally) Allow modifying the environment inside the shell

Provide a new command to set environment variables from the shell:

clara> setenv MY_VAR 1

This environment variable should be available to all commands that run from the shell.

(Optionally) Add echo command to the shell

Useful to print the value of environment variables.

clara> echo $PATH

TODO: Should shell variables also be expanded by the shell when used as arguments in the form $<variable>?

(Optionally) Support running scripts/binaries from the shell

It should be possible to run binaries directly from the shell. Proposed syntax:

clara> !custom-script

clara> !some-binary --args

TODO: Add each <plugin>/bin directory to the PATH inside the shell?

(Optionally) Support more complete help output for commands

The current help text is too simplistic: just a "summary" of what the command does. This could be improved with full-text help like a "man page", with an extended description of what each command does, its arguments, and usage examples. It could be read from Markdown files.

A "workflow" main help page could also be useful as an overview that describes how to use the shell.

(Optionally) Support pipe (|) operator to combine commands

For example:

clara> show dpeLog | grep Reader

No idea if this is feasible.

Standard Orchestrator

Finish error handling

WIP.

VERY IMPORTANT.

Most errors just make the orchestrator exit instead of continuing with the next file in the queue.

Improve support for declaring multi-DPE applications

The standard orchestrator only supports multi-DPE applications in the form of one DPE per required language per node, which are detected automatically.

That is, if the data-processing service list contains Java and C++ services, the orchestrator will wait until both the Java and the C++ DPE are alive on some node (the DPEs should have been started with the proper session) and then it will deploy the application on both DPEs.

There should be a key to specify on which DPE a service should run, instead of deducing it from its language. Something like this, probably:

# services.yaml
services:
  - name: ServiceA
    dpe: 10.0.1.1%8888_java

This could be used for:

  • Mainly, pass an external DPE name for single services that live in their own DPE, independently of the worker-node DPEs.

    This could also be used to distribute an application across multiple worker nodes, instead of replicating the same application on every worker node.

    TODO: Should the session be considered? Or is the full DPE name enough to use the DPE even if the session does not match?

    TODO: Should the orchestrator deploy this service, or is it expected to be deployed already as a long-lived external service?

    TODO: Could DPEs automatically deploy long-lived services on startup?

  • Use several DPEs of the same language and distribute services among them, instead of deploying all services of a given language in a single DPE.

    It is not clear how to declare this (see the sketch after this list). For example, deploy ServiceA in Java DPE 1 and ServiceB in Java DPE 2. How should the DPEs be identified? The orchestrator will be listening for their alive messages: should it just wait until two Java DPEs on a given node send their reports, or match some kind of suffix in the session? The shell should start both DPEs; should it add a specific suffix to the session, or just use random ports?

  • Support multiple worker DPEs but on the same node. This is equivalent to multiple worker DPEs, one per node, where each worker DPE contains an independent application in charge of processing its own file.

    It can be useful for NUMA-based machines, where having a single application across all NUMA nodes in the machine will just slow down everything due to constant switching between non-local nodes. Instead, each NUMA node would contain its own worker DPE, avoiding constant access across nodes.

    Currently the clara-shell starts an orchestrator+DPE pair for each NUMA node when running on the JLab farm.
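One purely illustrative way to declare the second case (services split among several same-language DPEs), assuming the DPEs are identified by some session suffix; none of this syntax is decided:

# services.yaml (illustrative only)
services:
  - name: ServiceA
    dpe: worker-1    # matched against a suffix of the DPE session
  - name: ServiceB
    dpe: worker-2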

Separate declaration of services and declaration of composition

The standard orchestrator provides no flexibility to define the application composition. It takes the list of declared services from services.yaml and forms the composition as a chain of services, in the order of declaration. There is no support for conditions, forking points, or complex compositions.

In order to support defining complex applications, provide a new section in services.yaml where the composition can be defined explicitly:

# services.yaml
application: >
  S1 + S2 + S3;
  if (S3 == "fast") {
    S3 + S4 + S6;
  } else {
    S3 + S5 + S6;
  }

This string should be a valid CLARA composition.

  • If a composition block is defined, then the orchestrator should not generate any kind of automatic composition from the services section.

    For example, the monitoring chain should now be set explicitly in the composition, along with the data-processing chain.

  • The string is not the final composition. Service placeholders should be replaced with full canonical service names (see the example after this list).

  • TODO: Should the service placeholders be just the literal names, or actual template placeholders like {{ S1 }}?
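For example, after replacement the first chain could expand to something like the following (host, port and container name are illustrative):

10.0.1.1%8888_java:MyContainer:S1 + 10.0.1.1%8888_java:MyContainer:S2 + 10.0.1.1%8888_java:MyContainer:S3;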

Finish specification of services.yaml and support a version field

The current format for services.yaml supports multiple variations for declaring services, and it lacks support for defining complex applications.

The specification should be formally defined, and in order to use the final specification a version field should be added to the file, to declare which format is used (just like the Docker Compose YAML file).

# services.yaml
version: 1
...

If no version is declared, then the current in-development format should be used to parse the file and no new features should be allowed unless the version is present.

Multi-Loop reconstruction with DPE-side orchestration

WIP

(Optionally) Support customization/extension of the standard orchestrator

The standard orchestrator is based on CLAS reconstruction: loop over files, loop over events, use local or multiple nodes.

Another application may want to reuse most of the orchestration, but configure specific steps (?).

If so, the places where the orchestration can be customized should be specified, and the API to pass custom handlers that run on those steps should be defined.

This would be a new user-defined orchestrator class, so there must be an easy way to run it instead of the developer figuring out the proper classpath and initialization steps.

(Optionally) Stop requiring a list of data-types in the YAML file

It forces the user to know more details about the services, and adds more complexity when creating the services.yaml file.

This information could be obtained automatically if each service provides its own <engine>.yaml info file with the list of supported data-types.
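For example, each spec file could list the supported types explicitly (the mime-type strings below are only illustrative):

# EvioToEvioReader.yaml (illustrative addition)
input-types:
  - binary/data-evio
output-types:
  - binary/data-evio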

User-defined orchestrators

The orchestrator API is too low-level to quickly design the orchestration of services.

  • There must be a new layer of high-level patterns to reduce the amount of boilerplate code required for common tasks, like defining and deploying an application, or retrying failed requests over the network.

  • It's not simple to create the script that starts the orchestrator from the command line. Some extra environment variables must be considered, the proper classpath as expected by CLARA should be set, the right JRE should be selected (when distributed with CLARA), etc.

    A better option is for CLARA to provide a wrapper script that sets up everything correctly and only requires specifying which orchestrator to start.

    Potentially:

    $ run-orchestrator <orchestrator>
    $ run-orchestrator <orchestrator> <args>
    $ run-orchestrator -p <plugin> <orchestrator>

    This command could be run directly from the command line, or the orchestrator developer could add that command to a simple script with a descriptive name.

    TODO: List the orchestrators in the plugin info.yaml file, so the wrapper could receive just the name? Or receive the full classpath (in that case the command should be inside a custom script)?

    TODO: What about non-plugin single orchestrators? Where should they be installed? Should they have their own info.yaml file?

    TODO: Should orchestrators be allowed to install custom scripts into $CLARA_HOME/bin?

Internal

Improve execution scheduler

WIP

VERY IMPORTANT.

Improve service logging

WIP

Use isolated classloaders to load plugin/services

Currently the entire class loading is handled by using the same classpath for CLARA and all plugins/services, which is roughly like this:

<clara>:<plugin1>:<plugin2>:...:<services>

This creates all sorts of problems when multiple plugins are involved and different versions of dependencies are required.

A new classloader must be created each time a service is deployed. This classloader should only use paths related to the service/plugin:

  1. $CLARA_HOME/plugins/<plugin>/services/<service>/classes
  2. $CLARA_HOME/plugins/<plugin>/services/<service>/lib
  3. $CLARA_HOME/plugins/<plugin>/services (legacy directory)
  4. $CLARA_HOME/plugins/<plugin>/lib
  5. Plugin-defined custom directories inside $CLARA_HOME/plugins/<plugin>.
  6. $CLARA_HOME/lib (for Engine and EngineData) (*)

In the case of individual services (installed into the $CLARA_HOME/services directory) it could be:

  1. $CLARA_HOME/services/<service>/classes
  2. $CLARA_HOME/services/<service>/lib
  3. $CLARA_HOME/lib (for Engine and EngineData) (*)
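A minimal sketch of building such an isolated classloader for a plugin service, assuming the plugin layout above (the parent/delegation policy is a design decision that still has to be made):

import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ServiceClassLoaders {

    // Collect the service/plugin-specific directories and jars, and create a
    // classloader that only sees those paths plus the public CLARA classes.
    static ClassLoader forService(Path pluginDir, String service, ClassLoader claraApiLoader)
            throws Exception {
        List<URL> urls = new ArrayList<>();
        Path serviceDir = pluginDir.resolve("services").resolve(service);
        urls.add(serviceDir.resolve("classes").toUri().toURL());
        for (Path dir : List.of(serviceDir.resolve("lib"),
                                pluginDir.resolve("services"),  // legacy directory
                                pluginDir.resolve("lib"))) {
            if (Files.isDirectory(dir)) {
                try (Stream<Path> files = Files.list(dir)) {
                    for (Path jar : files.filter(p -> p.toString().endsWith(".jar"))
                                         .collect(Collectors.toList())) {
                        urls.add(jar.toUri().toURL());
                    }
                }
            }
        }
        // The parent should expose only the public CLARA classes (Engine, EngineData),
        // not the whole $CLARA_HOME/lib -- see the TODO at the end of this section.
        return new URLClassLoader(urls.toArray(new URL[0]), claraApiLoader);
    }
}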

Since CLAS loves to have its own directory layout (which requires us to support those paths directly when launching the DPE), there should be a configuration key inside info.yaml that lists the directories inside the plugin that must be added to the classpath (and clean j_dpe of CLAS-specific paths):

# info.yaml
plugin:
  classpath:
    - core
    - clas

But in general, a common directory layout should be recommended.

TODO (*): CLARA jars and dependencies are all together inside $CLARA_HOME/lib. Maybe separate them into public and private?