Commit cd3e8a9

Merge pull request #88 from ELIXIR-Belgium/dev
Adding ISA-JSON support as input file.
2 parents d540934 + abf7b73 commit cd3e8a9

42 files changed: 16576 additions, 66 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 .secret.yml
 build/
 ena_upload_cli.egg-info/
-ena_upload/__pycache__/
+__pycache__/

README.md

Lines changed: 22 additions & 55 deletions
@@ -7,18 +7,18 @@

 # ENA upload tool

-This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
+This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programmatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to your own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).

 ## Overview

-The metadata should be provided in separate tables corresponding to the following ENA objects:
+The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:

 * STUDY
 * SAMPLE
 * EXPERIMENT
 * RUN

-The program to perform the following actions:
+You can set the tool to perform the following actions:

 * add: add an object to the archive
 * modify: modify an object in the archive
@@ -29,11 +29,15 @@ After a successful submission, new tsv tables will be generated with the ENA acc

 ## Tool dependencies

-* python 3.5+ including following packages:
+* python 3.7+ including following packages:
 * Genshi
 * lxml
 * pandas
 * requests
+* pyyaml
+* openpyxl
+* jsonschema
+

 ## Installation

@@ -60,12 +64,14 @@ All supported arguments:
 --experiment EXPERIMENT
 table of EXPERIMENT object
 --run RUN table of RUN object
---data [FILE [FILE ...]]
-data for submission
+--data [FILE ...] data for submission
 --center CENTER_NAME specific to your Webin account
 --checklist CHECKLIST
 specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
 --xlsx XLSX filled in excel template with metadata
+--isa_json ISA_JSON BETA: ISA json describing the ENA objects
+--isa_assay_stream ISA_ASSAY_STREAM
+BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
 --auto_action BETA: detect automatically which action (add or modify) to apply when the action column is not given
 --tool TOOL_NAME specify the name of the tool this submission is done with. Default: ena-upload-cli
 --tool_version TOOL_VERSION
@@ -88,7 +94,7 @@ To avoid exposing your credentials through the terminal history, it is recommend

 ### ENA sample checklists

-You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on the [ENA website](https://www.ebi.ac.uk/ena/browser/checklists). This website will also describe which Field Names you have to use in the header of your sample tsv table. The Field Names will be automatically mapped in the outputted xml if the correct `--checklist` parameter is given.
+You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on our [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates).

 #### Fixed sample columns

@@ -104,55 +110,11 @@ The command line tool will automatically fetch the correct scientific name based

 #### Viral submissions

-If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist on the website of ENA to know which values are allowed/possible in the `restricted text` and `text choice` fields.
+If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://github.com/ELIXIR-Belgium/ENA-metadata-templates/tree/main/templates/ERC000033) checklist in our template repo to know what is allowed/possible in the `Controlled vocabulary` fields.

 ### ENA study, experiment and run tables

-Here we list all the possible columns one can have in its study, experiment or run table along with its cardinality and controlled vocabulary (CV).
-Currently we refer to the [ENA Webin](https://wwwdev.ebi.ac.uk/ena/submit/webin/) to discover which values are allowed when a controlled vocabulary is used, but this will change in the future.
-
-#### Study tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
-| title | mandatory | Title of the study as would be used in a publication. | |
-| study_type | mandatory | The STUDY_TYPE presents a controlled vocabulary for expressing the overall purpose of the study. | yes |
-| study_abstract | mandatory | Briefly describes the goals, purpose, and scope of the Study. This need not be listed if it can be inherited from a referenced publication. | |
-| center_project_name | optional | Submitter defined project name. This field is intended for backward tracking of the study record to the submitter's LIMS. | |
-| study_description | optional | More extensive free-form description of the study. | |
-| pubmed_id | optional | Link to publication related to this study. | |
-
-#### Experiment tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
-| title | mandatory | Short text that can be used to call out experiment records in searches or in displays. | |
-| study_alias | mandatory | Identifies the parent study. | |
-| sample_alias | mandatory | Pick a sample to associate this experiment with. The sample may be an individual or a pool, depending on how it is specified. | |
-| design_description | mandatory | Goal and setup of the individual library including library was constructed. | |
-| spot_descriptor | optional | The SPOT_DESCRIPTOR specifies how to decode the individual reads of interest from the monolithic spot sequence. The spot descriptor contains aspects of the experimental design, platform, and processing information. There will be two methods of specification: one will be an index into a table of typical decodings, the other being an exact specification. This construct is needed for loading data and for interpreting the loaded runs. It can be omitted if the loader can infer read layout (from multiple input files or from one input files). | |
-| library_name | optional | The submitter's name for this library. | |
-| library_layout | mandatory | LIBRARY_LAYOUT specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified. | yes |
-| insert_size | mandatory | Relative distance. | |
-| library_strategy | mandatory | Sequencing technique intended for this library | yes |
-| library_source | mandatory | The LIBRARY_SOURCE specifies the type of source material that is being sequenced. | yes |
-| library_selection | mandatory | Method used to enrich the target in the sequence library preparation | yes |
-| platform | mandatory | The PLATFORM record selects which sequencing platform and platform-specific runtime parameters. This will be determined by the Center. | yes |
-| instrument_model | mandatory | Model of the sequencing instrument. | yes |
-| library_construction_protocol | optional | Free form text describing the protocol by which the sequencing library was constructed. | |
-
-
-#### Run tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
-| experiment_alias | mandatory | Identifies the parent experiment. | |
-| file_name | mandatory | The name or relative pathname of a run data file. | |
-| file_type | mandatory | The run data file model. | yes |
-| file_checksum | optional | Checksum of uncompressed file. If not given, the checksum will be calculated based on the data files specified in the --data option | |
+Please check out the [template](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) of your checklist to discover which attributes are mandatory for the study, experiment and run ENA object.


 ### Dev instance
@@ -176,7 +138,7 @@ There are two ways of submitting only a selection of objects to ENA. This is han
 | sample_alias_5 | | sample_title_2 | 2697049 | sample_description_2 |


-> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, not rows will be submitted! Either leave out the column or add to every row the corect action.
+> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, no rows will be submitted! Either leave out the column or add to every row you want to submit the correct action.


 ### Using Excel templates
@@ -215,7 +177,7 @@ By default the updated tables after submission will have the action `added` in t
 ## Tool overview

 **inputs**:
-* metadata tables/excelsheet
+* metadata tables/excelsheet/isa_json
 * examples in `example_table` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
 * (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
 * to perform bulk submission of all objects, the `aliases ids` in different ENA objects should be in the association where alias ids in experiment object link all objects together
@@ -262,6 +224,11 @@ By default the updated tables after submission will have the action `added` in t
 ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
 ```

+* **Using an ISA JSON**
+```
+ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --secret .secret.yml --isa_json tests/test_data/simple_test_case_v2.json --isa_assay_stream "Ena stream 1"
+```
+
 * **Release submission**
 ```
 ena-upload-cli --action release --center 'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml
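
Review note on the README's IMPORTANT rule about the status column: the rule is easy to misread, so here is a minimal sketch of how it plays out. It assumes rows are plain dicts, and `select_rows` is a hypothetical helper written for illustration only, not the tool's actual implementation.

```python
def select_rows(rows, action):
    """Return the rows that would be submitted for `action`.

    Hypothetical helper that mirrors the README rule; it is not
    taken from ena-upload-cli itself.
    """
    # No status column anywhere: the whole table is submitted.
    if not any("status" in row for row in rows):
        return list(rows)
    # Status column present: keep only rows whose status equals the action.
    # Empty or different values are skipped, so a blank column selects nothing.
    return [row for row in rows if row.get("status") == action]


if __name__ == "__main__":
    table = [
        {"alias": "sample_alias_4", "status": "add"},
        {"alias": "sample_alias_5", "status": ""},
    ]
    print(select_rows(table, "add"))
```

Under this reading, a table whose status column is present but blank yields an empty selection, which matches the README's warning that no rows are submitted in that case.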

ena_upload/_version.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
-__version__ = "0.6.4"
+__version__ = "0.7.0"

ena_upload/ena_upload.py

Lines changed: 45 additions & 7 deletions
@@ -12,6 +12,7 @@
 import hashlib
 import ftplib
 import requests
+import json
 import uuid
 import numpy as np
 import re
@@ -21,6 +22,8 @@
 import tempfile
 from ena_upload._version import __version__
 from ena_upload.check_remote import remote_check
+from ena_upload.json_parsing.ena_submission import EnaSubmission
+

 SCHEMA_TYPES = ['study', 'experiment', 'run', 'sample']

@@ -371,7 +374,7 @@ def get_taxon_id(scientific_name):
         taxon_id = r.json()[0]['taxId']
         return taxon_id
     except ValueError:
-        msg = f'Oops, no taxon ID avaible for {scientific_name}. Is it a valid scientific name?'
+        msg = f'Oops, no taxon ID available for {scientific_name}. Is it a valid scientific name?'
         sys.exit(msg)


@@ -390,7 +393,7 @@ def get_scientific_name(taxon_id):
         taxon_id = r.json()['scientificName']
         return taxon_id
     except ValueError:
-        msg = f'Oops, no scientific name avaible for {taxon_id}. Is it a valid taxon_id?'
+        msg = f'Oops, no scientific name available for {taxon_id}. Is it a valid taxon_id?'
         sys.exit(msg)


@@ -413,16 +416,15 @@ def submit_data(file_paths, password, webin_id):

     except IOError as ioe:
         print(ioe)
-        print("ERROR: could not connect to the ftp server.\
+        sys.exit("ERROR: could not connect to the ftp server.\
             Please check your login details.")
-        sys.exit()
     for filename, path in file_paths.items():
         print(f'uploading {path}')
         try:
             print(ftps.storbinary(f'STOR {filename}', open(path, 'rb')))
         except BaseException as err:
             print(f"ERROR: {err}")
-            print("ERROR: If your connection times out at this stage, it propably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
+            print("ERROR: If your connection times out at this stage, it probably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
             raise
     print(ftps.quit())

@@ -699,7 +701,7 @@ def process_args():

     parser.add_argument('--data',
                         nargs='*',
-                        help='data for submission',
+                        help='data for submission, this can be a list of files',
                         metavar='FILE')

     parser.add_argument('--center',
parser.add_argument('--center',
@@ -712,6 +714,13 @@ def process_args():

     parser.add_argument('--xlsx',
                         help='filled in excel template with metadata')
+
+    parser.add_argument('--isa_json',
+                        help='BETA: ISA json describing the ENA objects')
+
+    parser.add_argument('--isa_assay_stream',
+                        nargs='*',
+                        help='BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams')

     parser.add_argument('--auto_action',
                         action="store_true",
@@ -749,7 +758,7 @@ def process_args():

     # check if any table is given
     tables = set([args.study, args.sample, args.experiment, args.run])
-    if tables == {None} and not args.xlsx:
+    if tables == {None} and not args.xlsx and not args.isa_json:
         parser.error('Requires at least one table for submission')

     # check if .secret file exists
# check if .secret file exists
@@ -764,6 +773,14 @@ def process_args():
             msg = f"Oops, the file {args.xlsx} does not exist"
             parser.error(msg)

+    # check if ISA json file exists
+    if args.isa_json:
+        if not os.path.isfile(args.isa_json):
+            msg = f"Oops, the file {args.isa_json} does not exist"
+            parser.error(msg)
+        if args.isa_assay_stream is None:
+            parser.error("--isa_json requires --isa_assay_stream")
+
     # check if data is given when adding a 'run' table
     if (not args.no_data_upload and args.run and args.action.upper() not in ['RELEASE', 'CANCEL']) or (not args.no_data_upload and args.xlsx and args.action.upper() not in ['RELEASE', 'CANCEL']):
         if args.data is None:
@@ -816,6 +833,8 @@ def main():
     secret = args.secret
     draft = args.draft
     xlsx = args.xlsx
+    isa_json_file = args.isa_json
+    isa_assay_stream = args.isa_assay_stream
     auto_action = args.auto_action

     with open(secret, 'r') as secret_file:
@@ -857,6 +876,25 @@ def main():
             schema_dataframe[schema] = xl_sheet
             path = os.path.dirname(os.path.abspath(xlsx))
             schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
+    elif isa_json_file:
+        # Read json file
+        with open(isa_json_file, 'r') as json_file:
+            isa_json = json.load(json_file)
+
+        schema_tables = {}
+        schema_dataframe = {}
+        required_assays = []
+        for stream in isa_assay_stream:
+            required_assays.append({"assay_stream": stream})
+        submission = EnaSubmission.from_isa_json(isa_json, required_assays)
+        submission_dataframes = submission.generate_dataframes()
+        for schema, df in submission_dataframes.items():
+            schema_dataframe[schema] = check_columns(
+                df, schema, action, dev, auto_action)
+            path = os.path.dirname(os.path.abspath(isa_json_file))
+            schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
+
+
     else:
         # collect the schema with table input from command-line
         schema_tables = collect_tables(args)
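
Review note on the ISA JSON path added in `process_args()` and `main()`: the sketch below isolates the three pieces of logic this commit introduces (argument validation, building `required_assays`, and deriving the output table path) so they can be exercised without ENA credentials or the `EnaSubmission` class. The function names `parse_isa_args`, `build_required_assays` and `output_table_path` are hypothetical and exist only for this illustration; the argparse options and error messages are copied from the diff.

```python
import argparse
import os


def parse_isa_args(argv):
    """Parse and validate only the two ISA-related options from this commit."""
    parser = argparse.ArgumentParser(prog="isa-args-sketch")
    parser.add_argument("--isa_json")
    parser.add_argument("--isa_assay_stream", nargs="*")
    args = parser.parse_args(argv)
    if args.isa_json:
        if not os.path.isfile(args.isa_json):
            parser.error(f"Oops, the file {args.isa_json} does not exist")
        # None means the flag was not passed at all.
        if args.isa_assay_stream is None:
            parser.error("--isa_json requires --isa_assay_stream")
    return args


def build_required_assays(streams):
    """Same structure as the loop in main(): one dict per assay stream."""
    return [{"assay_stream": stream} for stream in streams]


def output_table_path(isa_json_file, schema):
    """Output tsv lands next to the ISA JSON file, as in the xlsx branch."""
    path = os.path.dirname(os.path.abspath(isa_json_file))
    return f"{path}/ENA_template_{schema}.tsv"


if __name__ == "__main__":
    print(build_required_assays(["Ena stream 1"]))
    print(output_table_path("tests/test_data/simple_test_case_v2.json", "study"))
```

One thing this makes visible: because `--isa_assay_stream` uses `nargs='*'`, passing the flag with no values produces an empty list rather than `None`, so it passes the `is None` check and `required_assays` ends up empty. Whether that case should also be rejected is a question for the authors.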

ena_upload/json_parsing/__init__.py

Whitespace-only changes.
