Eoulsan : how to add a module

Jaze8 edited this page Jun 7, 2018 · 6 revisions

Introduction

There are two ways of adding a module to Eoulsan: creating a new Java module, or wrapping your tool in a Galaxy tool XML file. Because your tool is meant to be part of a workflow, you need a way to place it in the flow of your analysis. This is achieved by creating a DataFormat specific to your module, which defines the inputs and outputs of your module.

The next sections therefore cover the creation of a DataFormat, then of a Java module, a Docker image, or a Galaxy tool.

Add a new DataFormat

DataFormat

DataFormats are file formats recognized by Eoulsan. A DataFormat is mainly characterized by its extension (.txt, .tsv, ...) and a prefix. The prefix is the part of the file name, written just before the data name, that indicates which step produced the file. Example: expression_sample1.tsv
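
As a minimal sketch of this naming scheme, the hypothetical helper below (not part of Eoulsan) joins a prefix, a data name and an extension. It assumes that the prefix and the data name are separated by an underscore, as in the example above; the rule Eoulsan actually applies may differ.

```java
/** Hypothetical helper, not an Eoulsan class: illustrates the file naming scheme. */
final class DataFormatFileName {

  private DataFormatFileName() {}

  // Assumption: prefix and data name are joined with "_", then the extension is appended.
  static String build(String prefix, String dataName, String extension) {
    return prefix + "_" + dataName + extension;
  }

  public static void main(String[] args) {
    // prints expression_sample1.tsv
    System.out.println(build("expression", "sample1", ".tsv"));
  }
}
```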

Add

There are two ways to add a DataFormat to Eoulsan. The easiest is to write a Galaxy DataFormat XML file, as described in : https://github.com/GenomicParisCentre/eoulsan/wiki/Implementing-a-Galaxy-format

The alternative is to integrate the XML file in your package, in the java/META-INF/services/xmldataformats directory, and to add the file name to a META-INF/services/fr.ens.biologie.genomique.eoulsan.data.XMLDataFormat file.

Write a new Java Class

Coding Policy

Your code should respect Java coding policy and Eoulsan coding policy (see https://github.com/GenomicParisCentre/eoulsan/wiki/Coding-Policy)

Required Elements

A few required elements are explained here :

Annotation preceding the module class (@LocalOnly, @HadoopOnly, etc.) : Defines where the module can be used. If it is absent, the module won't be recognized.

getInputPorts() function : Defines the input format of the data. The easiest way to create the ports is to create a new InputPortsBuilder, then call its addPort() function. This function takes a name for the port, a Boolean value (i.e. true or false) and a DataFormat. The port name can be used in a workflow file to identify which file should be passed to a module. The Boolean value is false when you only need one file for the given DataFormat. If it is set to true, Eoulsan passes a list containing all files with the corresponding format. addPort() can be called several times to declare inputs of different formats. Building is completed by calling the create() function.

Example :

new InputPortsBuilder()
.addPort(InputPortsBuilder.DEFAULT_SINGLE_INPUT_PORT_NAME,
	true, DataFormats.EXPRESSION_RESULTS_TSV)
.create()

getOutputPorts() function : Defines the output format of the data. It is built the same way as getInputPorts(), but with a new OutputPortsBuilder.

MODULE_NAME : a String defining the name of the module (the one used when Eoulsan calls the module).

fr.ens.biologie.genomique.eoulsan.core.Module : This file lists the modules that Eoulsan can load. Add this file to your package in the java/META-INF/services/ directory, and list your modules in it.

Input files

When importing a file, you can use either a File object (from java.io) or a DataFile object (from Eoulsan). The difference is that if the file behind a DataFile is compressed, Eoulsan deals with the compression when opening the file.
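
To illustrate what "dealing with compression" saves you from, here is a minimal sketch in plain Java of the work needed when reading a possibly gzipped file through java.io/java.nio alone. The class CompressionAwareReader is hypothetical and is not how Eoulsan's DataFile is implemented. It assumes that compression can be detected from a ".gz" file name suffix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper, not an Eoulsan class. */
final class CompressionAwareReader {

  private CompressionAwareReader() {}

  // Open the file, wrapping it in a gzip decoder when the name ends with ".gz"
  // (assumption: the suffix reliably indicates the compression).
  static InputStream open(Path path) throws IOException {
    InputStream raw = Files.newInputStream(path);
    try {
      boolean gzipped = path.getFileName().toString().endsWith(".gz");
      return gzipped ? new GZIPInputStream(raw) : raw;
    } catch (IOException e) {
      // Do not leak the underlying stream if the gzip header is invalid
      raw.close();
      throw e;
    }
  }

  // Read the first line of the file, compressed or not.
  static String firstLine(Path path) throws IOException {
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(open(path), StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }
}
```

With a DataFile, this branching stays inside Eoulsan and the module code only has to open the file.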

Interacting with Docker images

Communicate with a Docker image

Communication between your code and the Docker image is handled by a DockerExecutorInterpreter. The easiest way is to define an ExecutorInterpreter field in your module, and to create a DockerExecutorInterpreter once the reference Docker image is set (during module configuration). Then, in the execute function of your module, you can define the important directories and files :

// Define execution directory, where files are loaded
File execDir = context.getStepOutputDirectory().toFile();
// Define output directory, where results are saved
File OutputDir = context.getOutputDirectory().toFile();
// Define temporary directory
File tmpDir = context.getLocalTempDirectory();
// Define log directory for log files of your module
File logDirectory = ((TaskContextImpl) context)
	.getTaskOutputDirectory().toFile();
// Define logFiles in log Directory, stdout and errors
File stdoutFile = new File(logDirectory, 
	MODULE_NAME + ".model.out");
File stderrFile = new File(logDirectory, 
	MODULE_NAME + ".model.err");

Build a String containing the command line to use in the docker image. Then pass it to DockerExecutorInterpreter.createCommandLine() to obtain a List of String. Finally launch the command :

try{
	String commandLine = "Rscript" + " " + myScript + " " + 
		Param1 + " " + Param2 + " " + Param3 +" " + Param4 ;
	List<String> command =
		this.executor.createCommandLine(commandLine);
	executor.execute(command, execDir, tmpDir, 
		stdoutFile, stderrFile, new File[] {OutputDir} );
} catch(Exception e) {
	return status.createTaskResult(e, 
		"Error while running module : " + e.getMessage());
}
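
Concatenating parameters by hand becomes fragile as their number grows. The sketch below shows one possible way to assemble the command line string from a list of arguments. CommandLineSketch is a hypothetical helper, not part of Eoulsan, and the file names are placeholders. It rejects arguments containing whitespace on the assumption that the string is later split on whitespace, which this page does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not an Eoulsan class. */
final class CommandLineSketch {

  private CommandLineSketch() {}

  // Join the executable, the script and its parameters with single spaces.
  static String build(String executable, String script, List<String> params) {
    List<String> parts = new ArrayList<>();
    parts.add(executable);
    parts.add(script);
    parts.addAll(params);

    for (String part : parts) {
      // Assumption: an empty or whitespace-containing argument would not
      // survive a later whitespace split, so it is refused here.
      if (part.isEmpty() || part.chars().anyMatch(Character::isWhitespace)) {
        throw new IllegalArgumentException("Unsafe argument: '" + part + "'");
      }
    }
    return String.join(" ", parts);
  }

  public static void main(String[] args) {
    // prints Rscript analysis.R counts.tsv design.txt
    System.out.println(
        build("Rscript", "analysis.R", Arrays.asList("counts.tsv", "design.txt")));
  }
}
```

The resulting string can then be passed to createCommandLine() as described above.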

Write a script for the image

This strategy was used for the diffana module. This module uses a DockerRExecutor, which basically writes the R script into the image. See the module for details.

Create a new Docker Image

Fields

A Dockerfile is mainly built from three instructions :

  • FROM : the base image from which yours is built
  • MAINTAINER : the author of the file
  • RUN : commands to run while building the image

Repository and automated build

All images are gathered in https://github.com/GenomicParisCentre/dockerfiles . The easiest way to contribute is by using git.

First, make a local copy of the repository with the git clone command. Then add your file to the local copy, stage it with git add, and commit the change ( git commit -m "Message for the commit" ). Tag your commit with git tag. Finally, push ( git push ) to apply your modification to the online repository.

This repository is linked to Docker Hub (https://hub.docker.com/) for automated builds. You have to set up an automated build manually from the create menu of the interface, linking it to genomicpariscentre/dockerfiles.

Tricks

  • You can use several RUN instructions in your Dockerfile. However, each RUN adds a new layer to your image, and the number of layers is limited.
  • Your commands generally form a bash script, so you can chain several of them in a single RUN by ending each command with "; " and continuing on the next line with a backslash.
  • You can also use conditions in the form :
wget http://cran.r-project.org/src/contrib/scales_0.4.1.tar.gz; \
if [ ! -f scales_0.4.1.tar.gz ]; \
then wget http://cran.r-project.org/src/contrib/Archive/scales/scales_0.4.1.tar.gz; \
fi; \

Add a new GalaxyTool XML file

Galaxy tools are described here : https://github.com/GenomicParisCentre/eoulsan/wiki/Implementing-Galaxy-tool.

They can be added by simply putting your XML file in the galaxytools directory, which is itself in the main Eoulsan repository.

This is by far the easiest way to add a module to Eoulsan.

Choose between a Galaxy tool and a Java module

The choice is mainly guided by whether you need to access Eoulsan-specific files, that is, files that are not produced by a module (workflow, design files, annotation files). If you need to access such files, you have to write a Java module. Otherwise, you can make a Galaxy tool.