Roadmap
(Work in progress) List of features to be completed for CLARA 4.4/5.x
The `services.yaml` file currently requires the user to know the actual classpath/library of the service in order to use it as part of the application.
```yaml
# services.yaml
io-services:
  reader:
    class: org.jlab.clara.demo.services.ImageReaderService
    name: ImageReaderService
  writer:
    class: org.jlab.clara.demo.services.ImageWriterService
    name: ImageWriterService
services:
  - class: org.jlab.clara.demo.services.FaceDetectorService
    name: FaceDetectorService
  - class: pupil_detector_service
    name: PupilDetectorService
    lang: cpp
```
The user should not need to be aware of the implementation details of the services or their actual location on disk.
At the root of the plugin directory there should be an `info.yaml` file (TODO: decide the right name), which lists all services provided by the plugin:
```yaml
# info.yaml
plugin:
  services:
    - class: org.jlab.clara.demo.services.ImageReaderService
      name: ImageReaderService
      lang: java
    - class: org.jlab.clara.demo.services.ImageWriterService
      name: ImageWriterService
      lang: java
    - class: org.jlab.clara.demo.services.FaceDetectorService
      name: FaceDetectorService
      lang: java
    - class: pupil_detector_service
      name: PupilDetectorService
      lang: cpp
```
TODO: Should this list of services be separated into sections?
TODO: If `<engine>.yaml` specification files are used, then what should this list contain? Paths to the spec files of each service?
TODO: If we enforce a layout with a single service per folder, `<plugin>/services/<service>`, then the spec files can be obtained from `<plugin>/services/<service>/info.yaml` and the plugin info file is not necessary?
This plugin info file frees the user from knowing the details about the services and allows them to use just their registration names to compose an application:
```yaml
# services.yaml
io-services:
  reader:
    name: ImageReaderService
  writer:
    name: ImageWriterService
services:
  - name: FaceDetectorService
  - name: PupilDetectorService
    lang: cpp
```
CLARA can then map each service name to the proper class to be loaded by using the information provided by the plugin itself.
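As an illustration of that mapping (a sketch only: the class, the method names, and the use of SnakeYAML for parsing are assumptions, not the actual CLARA implementation), the front end could build a name-to-class registry from every plugin's `info.yaml`:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

// Sketch: build a "registration name -> engine class" map from a plugin's
// info.yaml, so services.yaml only needs the short service names.
final class PluginServiceRegistry {

    private final Map<String, String> classes = new HashMap<>();

    @SuppressWarnings("unchecked")
    void loadPluginInfo(Path infoFile) throws IOException {
        try (InputStream in = Files.newInputStream(infoFile)) {
            Map<String, Object> root = (Map<String, Object>) new Yaml().load(in);
            Map<String, Object> plugin = (Map<String, Object>) root.get("plugin");
            List<Map<String, Object>> services = (List<Map<String, Object>>) plugin.get("services");
            for (Map<String, Object> service : services) {
                classes.put((String) service.get("name"), (String) service.get("class"));
            }
        }
    }

    // e.g. "ImageReaderService" -> "org.jlab.clara.demo.services.ImageReaderService"
    String classFor(String serviceName) {
        return classes.get(serviceName);
    }
}
```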
TODO: Decide if `lang` should be a required key or not. The language information should be provided by the `info.yaml` file, but it is also useful that the language is stated clearly in `services.yaml`.
TODO: How to differentiate between services of the same name that are provided by different plugins? By the classpath? By the plugin name?
TODO: Services installed individually (i.e. not part of a plugin) should also provide some kind of `info.yaml` file, but currently there is no specific layout for single services.
TODO: Should `info.yaml` also list plugin-provided orchestrators?
Currently, the service metadata is provided programmatically through the `Engine` interface. But this limits access to that information to deployed services only. There is no way to show the metadata of installed-but-not-deployed (i.e. available) services.

Services should provide a YAML specification file with their metadata. This file can then be parsed when the service is created, in order to get the return data for each of the `Engine` interface methods.
The `EngineSpecification` class is available to service developers to use a specification file. The `AbstractService` base class in the CLARA standard library already provides this functionality. For `ServiceA`, a `ServiceA.yaml` resource must exist, which will then be loaded and parsed automatically.
The specification file should at least contain the following fields:
```yaml
---
name: EvioToEvioReader
engine: org.jlab.clas.std.services.convertors.EvioToEvioReader
type: java
author: Sebastián Mancilla
email: smancill@jlab.org
version: 1.4
description:
  Reads EVIO events from a file.
  Returns a new event on each request or an error if there was some problem.
```
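To illustrate how such a file could back the `Engine` metadata methods, here is a minimal sketch that loads the spec as a classpath resource and exposes a few fields. It assumes SnakeYAML for parsing and uses an illustrative class name; it is not the actual `EngineSpecification` implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

// Sketch: load "<EngineClass simple name>.yaml" from the classpath and keep the
// fields needed to answer the Engine metadata queries. Names are illustrative.
final class SpecMetadata {

    private final Map<String, Object> fields;

    @SuppressWarnings("unchecked")
    SpecMetadata(Class<?> engineClass) {
        String resource = engineClass.getSimpleName() + ".yaml";
        try (InputStream in = engineClass.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IllegalStateException("missing spec resource: " + resource);
            }
            fields = (Map<String, Object>) new Yaml().load(in);
        } catch (IOException e) {
            throw new IllegalStateException("could not read " + resource, e);
        }
    }

    String description() {
        return (String) fields.get("description");
    }

    String author() {
        return (String) fields.get("author");
    }

    String version() {
        // the example above uses an unquoted scalar, so YAML may parse it as a number
        return String.valueOf(fields.get("version"));
    }
}
```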
The clasrec-io services already use an experimental version of this spec file.
TODO: Complete the specification format.
TODO: Supported input and output mime-types should be listed explicitly.
TODO: The `EngineSpecification` class requires the spec file to be located as a resource. If we enforce a per-service folder layout, `<plugin>/services/<service>`, then it should first try `$CLARA_PLUGINS/**/services/<service>/info.yaml`.
TODO: The description may contain extended markup format, like Markdown. It can then be shown as HTML in some graphical interface instead of plain text.
The main idea of implementing the shell in Java was to support extensibility. Plugins should be able to add custom Java-based commands and custom variables. Most of the pieces are already present in the code and the feature just needs to be finished.
The plugin should provide a factory class that will register commands and variables within the shell. With support for `info.yaml` in the root of the plugin, the factory can be defined as:
```yaml
# info.yaml
plugin:
  shell:
    factories:
      - class: org.jlab.clara.demo.shell.DemoFactory
```
Multiple factories should be supported. On startup, the shell should load all available factories.
This feature would enable running CLAS12 reconstruction and analysis directly from the shell, if analysis commands are provided.
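For illustration, a factory along these lines could look like the sketch below. The `ShellContext` interface is defined here only for the example, since the actual shell extension API is still to be designed; every name in it is a placeholder.

```java
import java.util.function.Function;

// Placeholder interface, defined only for this sketch: the real shell
// extension API does not exist yet.
interface ShellContext {
    void addVariable(String name, String defaultValue, String description);
    void addCommand(String name, String description, Function<String[], Integer> action);
}

// Hypothetical plugin factory, registered through info.yaml as shown above.
public final class DemoFactory {

    // called by the shell on startup for every factory listed by the plugin
    public void register(ShellContext shell) {
        // a custom variable with a default value
        shell.addVariable("demoInputDir", "/data/demo", "Directory with demo input files");

        // a custom command implemented by the plugin
        shell.addCommand("run-demo", "Run the demo application", args -> {
            System.out.println("running demo with " + args.length + " argument(s)");
            return 0;  // exit status
        });
    }
}
```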
TODO: Make loading optional? Register the command but actually load the class the first time the command is used?
TODO: Register commands and variables directly from the YAML.
TODO: Test launching GUI apps.
If service spec files are supported, then all the information about services is accessible offline (instead of only for deployed services).
The shell should contain commands to query and show available services. This information can be parsed from the different plugin and services info YAML files.
TODO: Design these commands.
Shell variables are defined with a default value, or unset by default. Then the user can change their values, and the current values of all variables can be obtained with `show config`. But the current presentation is not optimal, and it should be improved.
- Group variables by section. Currently they are listed together in the order they were defined, and it's hard to make sense of their organization or purpose.
- The `run local` and `run farm` commands require different default values for the same core variables, which adds a lot of confusion (and it's the reason that the `threads` variable, for example, is unset by default). We really need to find a solution to this that keeps things as simple as possible. The main issue is the overlap of commands over the same variables. Maybe `run farm` should ignore `threads` and `javaMemory`.
- List unset variables apart. Showing `NO VALUE` is just confusing. Try to minimize the number of unset variables by default.
- Add a `show variable <variable>` (or `show config <variable>`?) command to display a single variable (or maybe a regex matching a group of variables).
- Show next to the variable if its value has been modified from the default value. It can be a different color (like the IntelliJ IDEA configuration does) or some kind of text mark like `(*)`.
- Add commands to unset a variable or reset a variable to its default value.
There should be a flag to mark a variable as advanced. Then the variable would not appear when the `show` command is used normally. This would help reduce noise when too many variables are defined.
Provide a new command to set environment variables from the shell:

```
clara> setenv MY_VAR 1
```
This environment variable should be available to all commands that run from the shell.
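One possible implementation (a sketch only, assuming external commands are launched with `ProcessBuilder`; the class and method names are illustrative) keeps the `setenv` variables in a map and exports them to every spawned process:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: remember the variables defined with "setenv" and export them to
// every external process started from the shell.
final class ShellEnvironment {

    private final Map<String, String> variables = new HashMap<>();

    // clara> setenv MY_VAR 1
    void setenv(String name, String value) {
        variables.put(name, value);
    }

    int run(String... command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command).inheritIO();
        builder.environment().putAll(variables);  // user-defined variables
        return builder.start().waitFor();
    }
}
```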
Useful to print the value of environment variables:

```
clara> echo $PATH
```
TODO: Should shell variables also be expanded by the shell when used as an argument in the form `$<variable>`?
It should be possible to run binaries directly from the shell. Proposed syntax:

```
clara> !custom-script
clara> !some-binary --args
```
TODO: Add each `<plugin>/bin` directory to the PATH inside the shell?
The current help text is too simplistic: just a "summary" of what the command does. It could be improved with full-text help like a "man page", with an extended description of what each command does, its arguments, and usage examples. The help could be read from Markdown files.
A "workflow" main help page could also be useful as an overview that describes how to use the shell.
For example:

```
clara> show dpeLog | grep Reader
```
No idea if this is feasible.
WIP.
VERY IMPORTANT.
Most errors just make the orchestrator exit instead of continuing with the next file in the queue.
The standard orchestrator only supports multi-DPE applications in the form of one DPE per required language per node, which are automatically detected. That is, if the data-processing service list contains Java and C++ services, the orchestrator will wait until both the Java and the C++ DPE are alive on some node (the DPEs should have been started with a proper session), and then it will deploy the application on both DPEs.
There should be a key to specify at which DPE a service should be running, instead of deducing it from its language. Something like this, probably:
```yaml
# services.yaml
services:
  - name: ServiceA
    dpe: 10.0.1.1%8888_java
```
This could be used for:
- Mainly, pass an external DPE name for single services that live in their own DPE, independently of the worker node DPEs. This could also be used to distribute an application through multiple worker nodes, instead of replicating the same application on every worker node.

  TODO: Should the session be considered? Or is the full DPE name enough to use the DPE even if the session does not match?

  TODO: Should the orchestrator deploy this service or is it expected to be already deployed as a long-lived external service?

  TODO: Could DPEs automatically deploy long-lived services on startup?

- Use several DPEs of the same language to deploy services between them, instead of running all services of a given language in a single DPE. It is not clear how to declare this. For example, deploy ServiceA in Java DPE 1 and ServiceB in Java DPE 2. How to identify the DPEs? The orchestrator will be listening for their alive messages. Should it just wait until two Java DPEs in a given node send their reports, or match some kind of suffix in the session? The shell should start both DPEs. Should it add some specific suffix to the session or just set random ports?

- Support multiple worker DPEs in the same node. This is equivalent to multiple worker DPEs, one per node, where each worker DPE contains an independent application in charge of processing its own file. It can be useful for NUMA-based machines, where having a single application across all NUMA nodes in the machine will just slow everything down due to constant switching between non-local nodes. Instead, each NUMA node would contain its own worker DPE, avoiding constant access across nodes. Currently, the clara-shell starts an orchestrator+DPE pair for each NUMA node when running on the JLab farm.
The standard orchestrator provides no flexibility to define the application composition. It takes the list of declared services from `services.yaml` and forms the composition as a chain of services, in the order of declaration. There is no support for conditions, forking points, or complex compositions.

In order to support defining complex applications, provide a new section in `services.yaml` where the composition can be defined explicitly:
```yaml
# services.yaml
application: >
  S1 + S2 + S3;
  if (S3 == "fast") {
    S3 + S4 + S6;
  } else {
    S3 + S5 + S6;
  }
```
This string should be a valid CLARA composition.
- If a composition block is defined, then the orchestrator should not generate any kind of automatic composition from the `services` section. For example, the monitoring chain should now be set explicitly in the composition, along with the data-processing chain.
- The string is not the final composition. Service placeholders should be replaced with full service canonical names.
- TODO: Should the service placeholders be just the literal names, or actual template placeholders like `{{ S1 }}`?
The current format for `services.yaml` supports multiple variations to declare services, and it lacks support for defining complex applications. The specification should be formally defined, and in order to use the final specification a `version` field should be added to the file to declare which format is used (just like the Docker Compose YAML file).
```yaml
# services.yaml
version: 1
...
```
If no version is declared, then the current in-development format should be used to parse the file and no new features should be allowed unless the version is present.
WIP
The standard orchestrator is based on CLAS reconstruction: loop over files, loop over events, use local or multiple nodes.
Another application may want to reuse most of the orchestration, but configure specific steps (?).
If so, the places where the orchestration can be customized should be specified, and the API to pass custom handlers that run on those steps should be defined.
This would be a new user-defined orchestrator class, so there must be an easy way to run it instead of the developer figuring out the proper classpath and initialization steps.
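Purely as a sketch of what such an API could look like (every name here is hypothetical, since the customization points have not been defined), the orchestrator could accept a handler with callbacks for the main steps of the file/event loop:

```java
import java.nio.file.Path;

// Hypothetical customization hooks for the standard orchestration loop
// (files -> events). The interface and its methods are placeholders, not an
// existing CLARA API.
interface ProcessingHooks {

    // called before the events of an input file start being processed
    default void onFileStart(Path inputFile) { }

    // called periodically while events are processed, e.g. to report progress
    default void onEventsProcessed(long totalEvents) { }

    // called when a file is finished, successfully or not
    default void onFileEnd(Path inputFile, boolean success) { }
}
```

A user-defined orchestrator would then pass an implementation of these hooks instead of rewriting the whole loop.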
It forces the user to know more details about the services, and adds more complexity when creating the `services.yaml` file. This information could be obtained automatically if each service provides its own `<engine>.yaml` info file with the list of supported data-types.
The orchestrator API is too low-level to quickly design the orchestration of services.

- There must be a new layer of high-level patterns to reduce the amount of boilerplate code required to do common tasks, like defining and deploying an application, or retrying failed requests through the network.
- It's not simple to create the script that starts the orchestrator from the command line. Some extra environment variables must be considered, the proper classpath expected by CLARA should be set, the right JRE should be selected (when distributed with CLARA), etc.

A better option is for CLARA to provide a wrapper script that sets up everything correctly and only needs to be told which orchestrator to start.
Potentially:

```
$ run-orchestrator <orchestrator>
$ run-orchestrator <orchestrator> <args>
$ run-orchestrator -p <plugin> <orchestrator>
```
This command could be run directly from the command line, or the orchestrator developer could add that command to a simple script with a descriptive name.
TODO: List the orchestrators in the plugin `info.yaml` file, so the wrapper could receive just the name? Or receive the full classpath (in that case the command should be inside a custom script)?

TODO: What about non-plugin single orchestrators? Where should they be installed? Should they have their own `info.yaml` file?

TODO: Should orchestrators be allowed to install custom scripts into `$CLARA_HOME/bin`?
WIP
VERY IMPORTANT.
WIP
Currently the entire class loading is handled by using the same classpath for CLARA and all plugins/services, which is roughly like this:

```
<clara>:<plugin1>:<plugin2>:...:<services>
```
This creates all sorts of problems when multiple plugins are involved and different versions of dependencies are required.
A new classloader must be created each time a service is deployed. This classloader should only use paths related to the service/plugin:
- `$CLARA_HOME/plugins/<plugin>/services/<service>/classes`
- `$CLARA_HOME/plugins/<plugin>/services/<service>/lib`
- `$CLARA_HOME/plugins/<plugin>/services` (legacy directory)
- `$CLARA_HOME/plugins/<plugin>/lib`
- Plugin-defined custom directories inside `$CLARA_HOME/plugins/<plugin>`.
- `$CLARA_HOME/lib` (for `Engine` and `EngineData`) (*)
In the case of individual services (installed into the `$CLARA_HOME/services` directory) it could be:
- `$CLARA_HOME/services/<service>/classes`
- `$CLARA_HOME/services/<service>/lib`
- `$CLARA_HOME/lib` (for `Engine` and `EngineData`) (*)
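A minimal sketch of the idea, building a per-service `URLClassLoader` from the directories listed above, with a parent loader that only exposes the public CLARA API (the helper class itself is illustrative):

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: build a dedicated classloader for a deployed service, using only the
// directories of its plugin plus the public CLARA API from the parent loader.
final class ServiceClassLoaderFactory {

    static ClassLoader forService(Path pluginDir, String service, ClassLoader claraApiLoader)
            throws IOException {
        List<URL> urls = new ArrayList<>();
        addClasses(urls, pluginDir.resolve("services").resolve(service).resolve("classes"));
        addJars(urls, pluginDir.resolve("services").resolve(service).resolve("lib"));
        addJars(urls, pluginDir.resolve("services"));  // legacy directory
        addJars(urls, pluginDir.resolve("lib"));
        // the parent loader should expose only Engine/EngineData and their dependencies
        return new URLClassLoader(urls.toArray(new URL[0]), claraApiLoader);
    }

    private static void addClasses(List<URL> urls, Path dir) throws IOException {
        if (Files.isDirectory(dir)) {
            urls.add(dir.toUri().toURL());
        }
    }

    private static void addJars(List<URL> urls, Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return;
        }
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
    }
}
```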
Since CLAS loves to have its own directory layout (which requires us to support those paths directly when launching the DPE), there should be a configuration key inside `info.yaml` that lists the directories inside the plugin that must be added to the classpath (and clean `j_dpe` from CLAS-specific paths):
```yaml
# info.yaml
plugin:
  classpath:
    - core
    - clas
```
But in general, a common directory layout should be recommended.
TODO (*): CLARA jars and dependencies are all together inside `$CLARA_HOME/lib`. Maybe separate them into public and private?