MARS is a data brokering initiative for submitting multi-omics life sciences studies to multiple specialized repositories. It is set up to be modular and enables data producers and multiple data repositories to exchange information seamlessly using the same standardized ISA-JSON format. Unlike a centralized platform, MARS operates as a common framework, allowing for decentralized data submissions while ensuring that the metadata contained in the ISA-JSON is interpreted and validated consistently across repositories. This preserves the important links between multi-omics data generated from the same biological source.
MARS involves multiple stakeholders: the end-user, the platform that generates the ISA-JSON, the target repositories and the data broker. Each plays a key role in the data submission process. Read more on our stakeholders page.
We use ISA-JSON to store and interchange metadata between the end-user and the target repositories because:
- Standardization: ISA-JSON follows the ISA structure (Investigation - Study - Assay), ensuring structured metadata descriptions.
- Versatility: It is not bound to any domain and can represent multi-omics experimental metadata.
- Interoperability: Since ISA-JSON follows a standard format, it facilitates interoperability between different software tools and platforms that support the ISA standard.
- Community adoption: It is widely adopted within the life sciences research community for metadata standardization.
The ISA-JSON is produced by the ISA-JSON producing platforms and serves as the metadata input of the Data broker platform, described below.
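To make the Investigation - Study - Assay nesting concrete, here is a minimal sketch of an ISA-JSON fragment held as a Python dictionary, with a small helper that reads the `target repository` comment from each assay. The identifiers, titles, filenames and the repository values (`ena`, `metabolights`) are made up for illustration. The exact fields MARS requires are defined by MARS-CLI and are not reproduced here.

```python
# Illustrative ISA-JSON fragment. All identifiers, titles, filenames and
# repository values below are hypothetical placeholders.
investigation = {
    "identifier": "INV-EXAMPLE",
    "title": "Example multi-omics investigation",
    "studies": [
        {
            "identifier": "STUDY-EXAMPLE",
            "title": "Example study",
            "assays": [
                {
                    "filename": "a_transcriptomics.txt",
                    "comments": [{"name": "target repository", "value": "ena"}],
                },
                {
                    "filename": "a_metabolomics.txt",
                    "comments": [
                        {"name": "target repository", "value": "metabolights"}
                    ],
                },
            ],
        }
    ],
}


def list_assay_targets(investigation):
    """Return (study identifier, assay filename, target repository) triples."""
    triples = []
    for study in investigation.get("studies", []):
        for assay in study.get("assays", []):
            for comment in assay.get("comments", []):
                if comment.get("name") == "target repository":
                    triples.append(
                        (
                            study.get("identifier"),
                            assay.get("filename"),
                            comment.get("value"),
                        )
                    )
    return triples
```

Because both assays hang off the same study, the link between the two omics data sets is kept even though they are destined for different repositories.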
A platform operated by the Data broker should:
- Accept an ISA-JSON as input and submit it to the repositories without any loss of information.
- Extend the ISA-JSON with additional information provided by the target repositories, for example the accessions assigned to the submitted objects.
- Process the errors reported by the repositories.
- Enable secure credential management and the possibility to set brokering accounts.
- Support data transfer through various protocols (e.g. FTP). This would include verifying the checksums associated with the data files.
- Allow the Data broker to set up a brokering account, or the end-user a personal account.
- Ensure that the brokering account is not used beyond the purposes defined by the producer. In other words, it must not modify or submit anything in the name of the producer without their consent.
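The checksum verification mentioned in the list above could look like the following sketch. It assumes MD5 checksums, which this document does not specify, and the function names are hypothetical. Swap in whatever algorithm the target repository actually expects.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_md5):
    """Return True if the file's MD5 matches the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A broker platform would typically run such a check both before the transfer and against the checksum reported by the repository afterwards.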
To be discussed: Handling brokering accounts: who creates them? Are they the same for all repositories? Who handles requests to broker data? Can it be done automatically? Would all namespaces for submissions be shared? Check: https://ena-docs.readthedocs.io/en/latest/faq/data_brokering.html
Examples of Data broker platforms are ARC, Galaxy, ...
MARS-CLI is a command line tool (CLI) that will be used by the Data broker platform and will perform the actual submission of the ISA-JSON to the repositories. Based on the receipts the repositories send back in response, the ISA-JSON will be updated with accession numbers. The application is built as a Python library that can be integrated into a web application, ARC, Galaxy and others. Source code and documentation can be found in the mars-cli folder in this repo.
The main steps of MARS-CLI are:
- Ingesting and validating the ISA-JSON: Compared to the vanilla ISA specification, MARS-CLI requires certain additional fields (for example `target repository` as a comment) in order to function properly. Upon ingestion, the ISA-JSON is loaded into memory and validated at the same time using Pydantic.
- Identifying the target repositories: The order of submission can depend on the target repositories specified in the ISA-JSON.
- Registering samples in BioSamples: The ISA-JSON is submitted to a newly developed API at BioSamples. The BioSamples accessions are reused by the other repositories, so this step has to be done first. After a successful submission, BioSamples sends back a receipt containing BioSamples accession numbers for `Source` and `Sample` as `Source characteristics` and `Sample characteristics`, respectively. The source code for the ISA-JSON API for BioSamples can be found in the repository-services repo and can be used for testing.
- Filtering the ISA-JSON: The ISA-JSON (updated with BioSamples IDs) has to be filtered for every target repository so that it only contains information relevant for that repository. This is facilitated by the `target repository` attribute present in the ISA-JSON assays.
- Submitting data to target repositories: Since some repositories require the actual data to be present in their upload space already, MARS-CLI could optionally take care of the data submission. This would guarantee that the data is present when the metadata (ISA-JSON) is submitted, and would include a checksum check.
- Registering ISA-JSON at the target repositories: The filtered ISA-JSON is sent to the endpoints of the repositories that accept ISA-JSON, for example ENA. The source code for the ISA-JSON API for ENA can be found in the repository-services repo and can be used for testing.
- Processing the receipts and errors from the repositories: After a successful submission, each repository sends back a receipt in a standard format defined for MARS (see repository-api info). The receipt contains the path of the objects in the ISA-JSON for which an accession number has been generated, and the related accession number.
- Updating the BioSamples External References: The Data broker uses the BioSamples accession numbers to download the submitted BioSamples JSON and extends the `External References` schema by adding the accession numbers provided by the other target archives.
- Dumping back an updated ISA-JSON with repositories' information: Based on the information in the receipts, the ISA-JSON is populated with accession numbers linked to the submitted metadata objects and can be given back as output.
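The ordering and filtering steps above can be sketched as follows. This is not the MARS-CLI implementation: the function names, the repository identifiers (`biosamples`, `ena`, `metabolights`) and the choice to drop studies left without assays are assumptions made for illustration. Only the `target repository` comment on assays and the rule that BioSamples goes first come from the description above.

```python
import copy

TARGET_COMMENT = "target repository"  # comment name used on ISA-JSON assays


def assay_target(assay):
    """Return the target repository named in an assay's comments, or None."""
    for comment in assay.get("comments", []):
        if comment.get("name") == TARGET_COMMENT:
            return comment.get("value")
    return None


def submission_order(repositories):
    """Order repositories so that BioSamples (hypothetical id) comes first."""
    return sorted(set(repositories), key=lambda name: (name != "biosamples", name))


def filter_for_repository(investigation, repository):
    """Return a copy of the investigation keeping only assays for one repository."""
    filtered = copy.deepcopy(investigation)  # never mutate the caller's document
    kept_studies = []
    for study in filtered.get("studies", []):
        study["assays"] = [
            assay
            for assay in study.get("assays", [])
            if assay_target(assay) == repository
        ]
        # Assumption: a study without any matching assay is not submitted.
        if study["assays"]:
            kept_studies.append(study)
    filtered["studies"] = kept_studies
    return filtered


# Hypothetical input with two assays going to different repositories.
example = {
    "identifier": "INV-EXAMPLE",
    "studies": [
        {
            "identifier": "STUDY-EXAMPLE",
            "assays": [
                {
                    "filename": "a_transcriptomics.txt",
                    "comments": [{"name": TARGET_COMMENT, "value": "ena"}],
                },
                {
                    "filename": "a_metabolomics.txt",
                    "comments": [{"name": TARGET_COMMENT, "value": "metabolights"}],
                },
            ],
        }
    ],
}

ena_only = filter_for_repository(example, "ena")
```

Working on a deep copy keeps the full ISA-JSON intact, so it can still be updated with accession numbers from every receipt and returned as a single document at the end.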
MARS-CLI is not responsible for storing and managing the credentials used for submission to the target repositories. Therefore, credentials should be managed by the Data broker platform.
MARS-CLI is not to be used as a platform to host data and will not store the data after submission to the target repository. This should be handled by the Data broker platform. The ISA-JSON provided to the application will be updated and stored in the BioSamples repository as an External Reference, but is otherwise considered ephemeral.
=> Data submission could be added to MARS-CLI?
ISA-JSON API services are being developed and deployed by the repositories that are part of the MARS initiative. This includes programmatic submission, the ingestion of ISA-JSON in order to register the metadata objects and the creation of a receipt according to the MARS repository-api standard.
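As a rough illustration of how a receipt links accessions back to objects in the ISA-JSON, consider the sketch below. The receipt layout shown here (keys `targetRepository`, `accessions`, `path`, `value`, `errors`), the list-based path, the placeholder accession and the choice to store the accession as a comment are all assumptions for illustration. The actual receipt standard is defined in repository-api.md and is not reproduced here.

```python
import copy


def resolve(document, path):
    """Walk a list of dict keys / list indices and return the addressed object."""
    node = document
    for step in path:
        node = node[step]
    return node


def apply_receipt(investigation, receipt):
    """Return a copy of the investigation annotated with the receipt's accessions."""
    if receipt.get("errors"):
        # Surface repository errors instead of silently ignoring them.
        raise ValueError(f"Submission failed: {receipt['errors']}")
    updated = copy.deepcopy(investigation)
    for entry in receipt.get("accessions", []):
        target = resolve(updated, entry["path"])
        # Assumption: the accession is recorded as a comment on the object.
        target.setdefault("comments", []).append(
            {"name": "accession", "value": entry["value"]}
        )
    return updated


# Hypothetical document and receipt; "ACC-PLACEHOLDER-1" is not a real accession.
document = {"studies": [{"identifier": "STUDY-EXAMPLE", "assays": [{"filename": "a.txt"}]}]}
receipt = {
    "targetRepository": "ena",
    "accessions": [
        {"path": ["studies", 0, "assays", 0], "value": "ACC-PLACEHOLDER-1"}
    ],
    "errors": [],
}

updated_document = apply_receipt(document, receipt)
```

Raising on a non-empty error list is one possible policy. A broker platform might prefer to collect errors from all repositories and report them together.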
Track the status of each repository here:
Repository | Programmatic submission | Development status | Deployed? | Source code |
---|---|---|---|---|
BioSamples | yes | Ready to be tested | no | GitHub |
ENA | yes | Ready to be tested | no | GitHub |
MetaboLights | NA | Not started | no | |
BioStudies/ArrayExpress | yes, in dev | Not started | no | |
e!DAL-PGP | NA | Not started | no | |
Your repository here? Join MARS! |
├── mars-cli
│ ├── mars_lib/
│ ├── mars_cli.py
│ ├── README.md
│ └── ...
├── repository-services
│ ├── isajson-biosamples/
│ ├── isajson-ena/
│ ├── repository-api.md
│ ├── README.md
│ └── ...
├── test-data
│ └── ...
└── README.md
- mars-cli: Source code of the Python library to submit (meta)data to the repositories. See README to read more on how to use the command line tool.
- repository-services: Code to deploy repository API endpoints that can accept ISA-JSON. See README for deployment instructions.
- repository-api.md: Describes the receipt standard that repository APIs should follow.
- test-data: Test data to be used in a submission.
- README.md: This file