README.md: 22 additions & 55 deletions
@@ -7,18 +7,18 @@

 # ENA upload tool

-This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
+This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programmatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).

 ## Overview

-The metadata should be provided in separate tables corresponding to the following ENA objects:
+The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:

 * STUDY
 * SAMPLE
 * EXPERIMENT
 * RUN

-The program to perform the following actions:
+You can set the tool to perform the following actions:

 * add: add an object to the archive
 * modify: modify an object in the archive
@@ -29,11 +29,15 @@ After a successful submission, new tsv tables will be generated with the ENA acc

 ## Tool dependencies

-* python 3.5+ including following packages:
+* python 3.7+ including following packages:
 * Genshi
 * lxml
 * pandas
 * requests
+* pyyaml
+* openpyxl
+* jsonschema
+

 ## Installation

@@ -60,12 +64,14 @@ All supported arguments:
   --experiment EXPERIMENT
                         table of EXPERIMENT object
   --run RUN             table of RUN object
-  --data [FILE [FILE ...]]
-                        data for submission
+  --data [FILE ...]     data for submission
   --center CENTER_NAME  specific to your Webin account
   --checklist CHECKLIST
                         specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
   --xlsx XLSX           filled in excel template with metadata
+  --isa_json ISA_JSON   BETA: ISA json describing describing the ENA objects
+  --isa_assay_stream ISA_ASSAY_STREAM
+                        BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
   --auto_action         BETA: detect automatically which action (add or modify) to apply when the action column is not given
   --tool TOOL_NAME      specify the name of the tool this submission is done with. Default: ena-upload-cli
   --tool_version TOOL_VERSION
@@ -88,7 +94,7 @@ To avoid exposing your credentials through the terminal history, it is recommend

 ### ENA sample checklists

-You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on the [ENA website](https://www.ebi.ac.uk/ena/browser/checklists). This website will also describe which Field Names you have to use in the header of your sample tsv table. The Field Names will be automatically mapped in the outputted xml if the correct `--checklist` parameter is given.
+You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on our [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates).

 #### Fixed sample columns

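The `--checklist` value in the hunk above follows the pattern `ERC0000XX`, with `ERC000011` as the default. A minimal sketch of a client-side format check, assuming `XX` stands for exactly two digits; the helper `normalize_checklist` is hypothetical and not part of the tool:

```python
import re

# Assumption: "XX" in the documented pattern ERC0000XX means exactly two digits.
CHECKLIST_PATTERN = re.compile(r"ERC0000\d{2}")
DEFAULT_CHECKLIST = "ERC000011"  # default named in the README


def normalize_checklist(value=None):
    """Return a checklist ID in the documented format, or raise ValueError.

    Hypothetical helper: it only checks the format, not whether ENA
    actually defines a checklist with that ID.
    """
    if value is None:
        return DEFAULT_CHECKLIST
    candidate = value.strip().upper()
    if not CHECKLIST_PATTERN.fullmatch(candidate):
        raise ValueError(f"checklist must look like ERC0000XX, got {value!r}")
    return candidate
```

A format check like this only catches typos early; the list of checklists that actually exist is in the template repo linked above.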
@@ -104,55 +110,11 @@ The command line tool will automatically fetch the correct scientific name based

 #### Viral submissions

-If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist on the website of ENA to know which values are allowed/possible in the `restricted text` and `text choice`fields.
+If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://github.com/ELIXIR-Belgium/ENA-metadata-templates/tree/main/templates/ERC000033) checklist in our template repo to know what is allowed/possible in the `Controlled vocabulary`fields.

 ### ENA study, experiment and run tables

-Here we list all the possible columns one can have in its study, experiment or run table along with its cardinality and controlled vocabulary (CV).
-Currently we refer to the [ENA Webin](https://wwwdev.ebi.ac.uk/ena/submit/webin/) to discover which values are allowed when a controlled vocabulary is used, but this will change in the future.
-
-#### Study tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. ||
-| title | mandatory | Title of the study as would be used in a publication. ||
-| study_type | mandatory | The STUDY_TYPE presents a controlled vocabulary for expressing the overall purpose of the study. | yes |
-| study_abstract | mandatory | Briefly describes the goals, purpose, and scope of the Study. This need not be listed if it can be inherited from a referenced publication. ||
-| center_project_name | optional | Submitter defined project name. This field is intended for backward tracking of the study record to the submitter's LIMS. ||
-| study_description | optional | More extensive free-form description of the study. ||
-| pubmed_id | optional | Link to publication related to this study. ||
-
-#### Experiment tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. ||
-| title | mandatory | Short text that can be used to call out experiment records in searches or in displays. ||
-| study_alias | mandatory | Identifies the parent study. ||
-| sample_alias | mandatory | Pick a sample to associate this experiment with. The sample may be an individual or a pool, depending on how it is specified. ||
-| design_description | mandatory | Goal and setup of the individual library including library was constructed. ||
-| spot_descriptor | optional | The SPOT_DESCRIPTOR specifies how to decode the individual reads of interest from the monolithic spot sequence. The spot descriptor contains aspects of the experimental design, platform, and processing information. There will be two methods of specification: one will be an index into a table of typical decodings, the other being an exact specification. This construct is needed for loading data and for interpreting the loaded runs. It can be omitted if the loader can infer read layout (from multiple input files or from one input files). ||
-| library_name | optional | The submitter's name for this library. ||
-| library_layout | mandatory | LIBRARY_LAYOUT specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified. | yes |
-| insert_size | mandatory | Relative distance. ||
-| library_strategy | mandatory | Sequencing technique intended for this library | yes |
-| library_source | mandatory | The LIBRARY_SOURCE specifies the type of source material that is being sequenced. | yes |
-| library_selection | mandatory | Method used to enrich the target in the sequence library preparation | yes |
-| platform | mandatory | The PLATFORM record selects which sequencing platform and platform-specific runtime parameters. This will be determined by the Center. | yes |
-| instrument_model | mandatory | Model of the sequencing instrument. | yes |
-| library_construction_protocol | optional | Free form text describing the protocol by which the sequencing library was constructed. ||
-
-
-#### Run tsv table
-
-| Name of column | Cardinality | Documentation | CV |
-|---|---|---|---|
-| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. ||
-| experiment_alias | mandatory | Identifies the parent experiment. ||
-| file_name | mandatory | The name or relative pathname of a run data file. ||
-| file_type | mandatory | The run data file model. | yes |
-| file_checksum | optional | Checksum of uncompressed file. If not given, the checksum will be calculated based on the data files specified in the --data option ||
+Please check out the [template](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) of your checklist to discover which attributes are mandatory for the study, experiment and run ENA object.


 ### Dev instance
@@ -176,7 +138,7 @@ There are two ways of submitting only a selection of objects to ENA. This is han
-> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, not rows will be submitted! Either leave out the column or add to every row the corect action.
+> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, no rows will be submitted! Either leave out the column or add to every row you want to submit the correct action.


 ### Using Excel templates
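The IMPORTANT note in the hunk above describes how the status column interacts with `--action`. The rule can be sketched as follows, assuming each table row is represented as a dict; the function `select_rows` and the case-insensitive comparison are illustrative assumptions, not the tool's actual implementation:

```python
def select_rows(rows, action):
    """Pick the rows to submit for `action`.

    rows: list of dicts, one per table row (hypothetical representation).
    - No 'status' key in any row: the whole table is submitted.
    - 'status' present: only rows whose status equals `action` are kept,
      so empty or mismatching cells drop the row.
    """
    has_status_column = any("status" in row for row in rows)
    if not has_status_column:
        return list(rows)
    # Assumption: statuses are compared case-insensitively.
    wanted = action.strip().lower()
    return [
        row for row in rows
        if str(row.get("status") or "").strip().lower() == wanted
    ]


samples = [{"alias": "s1", "status": "add"}, {"alias": "s2", "status": ""}]
kept_for_add = select_rows(samples, "add")        # only s1 matches
kept_for_modify = select_rows(samples, "modify")  # nothing matches
```

With the status column present, the row with the empty cell is never submitted, which is the situation the note warns about.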
@@ -215,7 +177,7 @@ By default the updated tables after submission will have the action `added` in t
 ## Tool overview

 **inputs**:
-* metadata tables/excelsheet
+* metadata tables/excelsheet/isa_json
 * examples in `example_table` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
 * (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
 * to perform bulk submission of all objects, the `aliases ids` in different ENA objects should be in the association where alias ids in experiment object link all objects together
@@ -262,6 +224,11 @@ By default the updated tables after submission will have the action `added` in t
-print("ERROR: If your connection times out at this stage, it propably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
+print("ERROR: If your connection times out at this stage, it probably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
 raise
 print(ftps.quit())

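The error message above says FTP is used in passive mode. As a rough illustration only, this is how passive mode is selected with Python's standard `ftplib`; the helper `make_passive_ftps` is hypothetical, no host or credentials are involved, and the hunk does not show how the tool itself sets up its `ftps` object:

```python
import ftplib


def make_passive_ftps(timeout=30.0):
    """Create an unconnected FTP_TLS client with passive mode enabled.

    Hypothetical helper: no host is passed, so no connection is opened here.
    """
    ftps = ftplib.FTP_TLS(timeout=timeout)
    # Passive mode is ftplib's default; setting it explicitly documents the intent.
    ftps.set_pasv(True)
    return ftps
```

In passive mode the client opens the data connection to a port chosen by the server, which is why a firewall blocking outbound connections to those ports produces the timeout described in the message.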
@@ -699,7 +701,7 @@ def process_args():

     parser.add_argument('--data',
                         nargs='*',
-                        help='data for submission',
+                        help='data for submission, this can be a list of files',
                         metavar='FILE')

     parser.add_argument('--center',
@@ -712,6 +714,13 @@ def process_args():

     parser.add_argument('--xlsx',
                         help='filled in excel template with metadata')
+
+    parser.add_argument('--isa_json',
+                        help='BETA: ISA json describing describing the ENA objects')
+
+    parser.add_argument('--isa_assay_stream',
+                        nargs='*',
+                        help='BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams')
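Both `--data` and the new `--isa_assay_stream` use `nargs='*'`, so each accepts zero or more values and yields a list. A small standalone sketch of that behaviour, reduced to the three arguments touched by these hunks; the file and stream names are made up for illustration:

```python
import argparse


def build_parser():
    """Reduced parser with only the arguments touched by this change."""
    parser = argparse.ArgumentParser(prog='ena-upload-cli')
    parser.add_argument('--data', nargs='*', metavar='FILE',
                        help='data for submission, this can be a list of files')
    parser.add_argument('--isa_json',
                        help='BETA: ISA json describing the ENA objects')
    parser.add_argument('--isa_assay_stream', nargs='*',
                        help='BETA: assay stream(s) that hold the ENA information')
    return parser


# Hypothetical inputs: 'isa.json', 'stream_a' and 'stream_b' are placeholders.
args = build_parser().parse_args(
    ['--isa_json', 'isa.json', '--isa_assay_stream', 'stream_a', 'stream_b'])
print(args.isa_assay_stream)  # a list with both stream names
print(args.data)              # None, because --data was not given
```

Note the difference between an omitted option (`None`) and an option given without values (an empty list); code that checks whether data was supplied has to treat both cases.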
 # check if data is given when adding a 'run' table
 if (not args.no_data_upload and args.run and args.action.upper() not in ['RELEASE', 'CANCEL']) or (not args.no_data_upload and args.xlsx and args.action.upper() not in ['RELEASE', 'CANCEL']):
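The condition above repeats the same two checks for `args.run` and `args.xlsx`. A logically equivalent form, shown as a standalone sketch; the function `data_required` and the constant `NO_DATA_ACTIONS` are hypothetical names, and what the tool does when the condition holds is not part of this hunk:

```python
NO_DATA_ACTIONS = ('RELEASE', 'CANCEL')


def data_required(no_data_upload, run, xlsx, action):
    """Return True when the condition in the hunk above would be true.

    Equivalent by distributivity: (A and R and C) or (A and X and C)
    is the same as A and (R or X) and C.
    """
    return (not no_data_upload
            and bool(run or xlsx)
            and action.upper() not in NO_DATA_ACTIONS)
```

Factoring out the shared terms keeps the list of actions that need no data in one place.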