Commit 57229fe

Merge pull request #56 from usegalaxy-eu/xlsx-support
XLSX support
2 parents a1b1685 + 69ddfa2 commit 57229fe

5 files changed: +108 -45 lines changed

README.md

Lines changed: 17 additions & 6 deletions
@@ -7,7 +7,7 @@

 # ENA upload tool

-This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files. The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/4aab5ae907b6) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
+This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files of one of the excel spreadsheet that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/4aab5ae907b6) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).

 ## Overview


@@ -60,15 +60,16 @@ All supported arguments:
   --experiment EXPERIMENT
                         table of EXPERIMENT object
   --run RUN             table of RUN object
-  --data [FILE [FILE ...]]
-                        data for submission
+  --data [FILE ...]     data for submission
   --center CENTER_NAME  specific to your Webin account
   --checklist CHECKLIST
                         specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
+  --xlsx XLSX           Excel table with metadata
   --tool TOOL_NAME      specify the name of the tool this submission is done with. Default: ena-upload-cli
   --tool_version TOOL_VERSION
                         specify the version of the tool this submission is done with
-  --no_data_upload      indicate if no upload should be performed and you like to submit a RUN object (e.g. if uploaded was done separately).
+  --no_data_upload      indicate if no upload should be performed and you like to submit a RUN object (e.g. if uploaded
+                        was done separately).
   --draft               indicate if no submission should be performed
   --secret SECRET       .secret.yml file containing the password and Webin ID of your ENA account
   -d, --dev             flag to use the dev/sandbox endpoint of ENA
@@ -172,6 +173,11 @@ Optionally you can add a status column to every table that contains the action y

 > IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, not rows will be submitted! Either leave out the column or add to every row the corect action.

+
+### Using Excel templates
+
+We also support the use of specific excel templates, designed for each sample checklist. Use the `--xlsx` command to add the path to an excel template file filled in from this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates).
+
 ### The data files

 **Supported data**
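
The IMPORTANT note on the status column in the hunk above can be illustrated with a small sketch. It assumes that rows are selected by comparing each row's `status` value with the `--action` value, ignoring case; `select_rows` is a hypothetical helper written for this illustration and is not a function of ena-upload-cli.

```python
def select_rows(rows, action):
    """Return the rows whose status matches the requested action.

    Hypothetical helper: rows without a 'status' key are all selected,
    mirroring "when not given the whole table is submitted".
    """
    wanted = action.strip().upper()
    selected = []
    for row in rows:
        if "status" not in row:
            selected.append(row)  # no status column: submit everything
            continue
        status = (row["status"] or "").strip().upper()
        if status == wanted:
            selected.append(row)
    return selected


rows = [
    {"alias": "sample_1", "status": "add"},
    {"alias": "sample_2", "status": ""},        # given but not filled in
    {"alias": "sample_3", "status": "modify"},  # different action
]
print([row["alias"] for row in select_rows(rows, "add")])
```

Under these assumptions only `sample_1` is selected, which is why the README advises either leaving the column out or filling in the correct action on every row.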
@@ -206,8 +212,8 @@ By default the updated tables after submission will have the action `added` in t
 ## Tool overview

 **inputs**:
-* metadata tables
-  * examples in `example_table`
+* metadata tables/excelsheet
+  * examples in `example_table` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
   * (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
   * to perform bulk submission of all objects, the `aliases ids` in different ENA objects should be in the association where alias ids in experiment object link all objects together
 * experimental data
@@ -248,6 +254,11 @@ By default the updated tables after submission will have the action `added` in t
 ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples_vir.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml
 ```

+* **Using an Excel template**
+```
+ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
+```
+
 * **release submission**
 ```
 ena-upload-cli --action release --center'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml

ena_upload/_version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.4.5"
+__version__ = "0.5.0"

ena_upload/ena_upload.py

Lines changed: 87 additions & 36 deletions
@@ -21,6 +21,8 @@
 import tempfile
 from ena_upload._version import __version__

+SCHEMA_TYPES = ['study', 'experiment', 'run', 'sample']
+

 class MyFTP_TLS(ftplib.FTP_TLS):
     """Explicit FTPS, with shared TLS session"""
@@ -47,32 +49,12 @@ def create_dataframe(schema_tables, action):
         dataframe -- pandas dataframe
     '''

-    # ? would it be good to use alias to index rows?
-
     schema_dataframe = {}

     for schema, table in schema_tables.items():
         df = pd.read_csv(table, sep='\t', comment='#', dtype=str)
         df = df.dropna(how='all')
-        # checking for optional columns and if not present, adding them
-        if schema == 'sample':
-            optional_columns = ['accession', 'submission_date',
-                                'status', 'scientific_name', 'taxon_id']
-        elif schema == 'run':
-            optional_columns = ['accession',
-                                'submission_date', 'status', 'file_checksum']
-        else:
-            optional_columns = ['accession', 'submission_date', 'status']
-        for header in optional_columns:
-            if not header in df.columns:
-                if header == 'status':
-                    df[header] = action.lower()
-                else:
-                    df[header] = np.nan
-        # status column contain action keywords
-        # for xml rendering, keywords require uppercase
-        # according to scheme definition of submission
-        df['status'] = df['status'].str.upper()
+        df = check_columns(df, schema, action)
         schema_dataframe[schema] = df

     return schema_dataframe
@@ -98,6 +80,29 @@ def extract_targets(action, schema_dataframe):
     return schema_targets


+def check_columns(df, schema, action):
+    # checking for optional columns and if not present, adding them
+    if schema == 'sample':
+        optional_columns = ['accession', 'submission_date',
+                            'status', 'scientific_name', 'taxon_id']
+    elif schema == 'run':
+        optional_columns = ['accession',
+                            'submission_date', 'status', 'file_checksum']
+    else:
+        optional_columns = ['accession', 'submission_date', 'status']
+
+    for header in optional_columns:
+        if not header in df.columns:
+            if header == 'status':
+                # status column contain action keywords
+                # for xml rendering, keywords require uppercase
+                # according to scheme definition of submission
+                df[header] = str(action).upper()
+            else:
+                df[header] = np.nan
+
+    return df
+
 def check_filenames(file_paths, run_df):
     """Compare data filenames from command line and from RUN table.

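
The effect of the new `check_columns` helper can be tried on small in-memory tables. The sketch below restates the function from the hunk above with its indentation restored; the table contents and column names such as `file_name` are made up for illustration.

```python
import numpy as np
import pandas as pd


def check_columns(df, schema, action):
    # Restated from the diff: add optional columns that are missing.
    if schema == 'sample':
        optional_columns = ['accession', 'submission_date',
                            'status', 'scientific_name', 'taxon_id']
    elif schema == 'run':
        optional_columns = ['accession',
                            'submission_date', 'status', 'file_checksum']
    else:
        optional_columns = ['accession', 'submission_date', 'status']

    for header in optional_columns:
        if header not in df.columns:
            if header == 'status':
                # A missing status column is filled with the uppercased action.
                df[header] = str(action).upper()
            else:
                df[header] = np.nan

    return df


# A RUN table without any optional column (illustrative data).
run_df = pd.DataFrame({"alias": ["run_1"], "file_name": ["reads.fastq.gz"]})
run_df = check_columns(run_df, "run", "add")
print(run_df[["status", "file_checksum"]])

# A table that already has a status column is returned with it unchanged.
study_df = pd.DataFrame({"alias": ["study_1"], "status": ["add"]})
study_df = check_columns(study_df, "study", "add")
print(study_df["status"].tolist())
```

One difference from the removed code is visible here: the old block in `create_dataframe` uppercased the whole `status` column on every call, whereas the helper, as far as this hunk shows, only writes an uppercase value when it creates the column itself.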
@@ -679,7 +684,10 @@ def process_args():

     parser.add_argument('--checklist', help="specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011", dest='checklist',
                         default='ERC000011')
-
+
+    parser.add_argument('--xlsx',
+                        help='excel table with metadata')
+
     parser.add_argument('--tool',
                         dest='tool_name',
                         default='ena-upload-cli',
@@ -711,17 +719,23 @@ def process_args():

     # check if any table is given
     tables = set([args.study, args.sample, args.experiment, args.run])
-    if tables == {None}:
+    if tables == {None} and not args.xlsx:
         parser.error('Requires at least one table for submission')

     # check if .secret file exists
     if args.secret:
         if not os.path.isfile(args.secret):
             msg = f"Oops, the file {args.secret} does not exist"
             parser.error(msg)
+
+    # check if xlsx file exists
+    if args.xlsx:
+        if not os.path.isfile(args.xlsx):
+            msg = f"Oops, the file {args.xlsx} does not exist"
+            parser.error(msg)

     # check if data is given when adding a 'run' table
-    if args.action == 'add' and args.run is not None:
+    if (not args.no_data_upload and args.run) or (not args.no_data_upload and args.xlsx):
         if args.data is None:
             parser.error('Oops, requires data for submitting RUN object')

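
The argument checks changed in this hunk can be exercised in isolation. The following is a reduced sketch, not the tool's `process_args`: the parser only defines the options involved, the file-existence checks are left out so that it runs without real files, and `build_parser` and `validate` are names invented for this example.

```python
import argparse


def build_parser():
    # Reduced stand-in for the real parser: only the options used below.
    parser = argparse.ArgumentParser(prog="ena-upload-sketch")
    for table in ("study", "sample", "experiment", "run"):
        parser.add_argument(f"--{table}")
    parser.add_argument("--xlsx")
    parser.add_argument("--data", nargs="*")
    parser.add_argument("--no_data_upload", action="store_true")
    return parser


def validate(parser, argv):
    args = parser.parse_args(argv)

    # At least one table or an Excel workbook must be given.
    tables = {args.study, args.sample, args.experiment, args.run}
    if tables == {None} and not args.xlsx:
        parser.error("Requires at least one table for submission")

    # Logically equivalent to the two-clause condition in the diff:
    # (not no_data_upload and run) or (not no_data_upload and xlsx)
    if not args.no_data_upload and (args.run or args.xlsx):
        if args.data is None:
            parser.error("Oops, requires data for submitting RUN object")
    return args


parser = build_parser()
args = validate(parser, ["--xlsx", "metadata.xlsx", "--no_data_upload"])
print(args.xlsx, args.data)
```

Calling `validate(parser, ["--xlsx", "metadata.xlsx"])` without `--data` ends in `parser.error`, which exits with status 2. Note that the new condition no longer looks at `args.action`, so the data requirement applies to every action unless `--no_data_upload` is set.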

@@ -750,6 +764,16 @@ def collect_tables(args):

     return schema_tables

+def update_date(date):
+    if pd.isnull(date) or isinstance(date, str):
+        return date
+    try:
+        return date.strftime('%Y-%m-%d')
+    except AttributeError:
+        return date
+    except Exception:
+        raise
+

 def main():
     args = process_args()
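
A short sketch of what `update_date` does to a mixed column, assuming that date cells parsed from a workbook arrive as `pandas.Timestamp` or `datetime` objects while other cells arrive as strings or missing values. The function is restated from the hunk above without the final `except Exception: raise` clause, which only re-raises and so does not change behavior; the sample values are invented.

```python
import datetime

import pandas as pd


def update_date(date):
    # Missing values and strings pass through untouched.
    if pd.isnull(date) or isinstance(date, str):
        return date
    try:
        # Timestamp, datetime and date objects are rendered as YYYY-MM-DD.
        return date.strftime('%Y-%m-%d')
    except AttributeError:
        # Anything without strftime (e.g. a number) is left as is.
        return date


column = pd.Series(
    [pd.Timestamp("2021-03-05 00:00:00"), "2020", None, datetime.date(2019, 12, 31)],
    dtype=object,
)
converted = column.apply(update_date)
print(converted.tolist())
```

The string `"2020"` is kept as entered, so partial dates typed as text in the spreadsheet are not forced into a full date.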
@@ -760,6 +784,7 @@ def main():
     checklist = args.checklist
     secret = args.secret
     draft = args.draft
+    xlsx = args.xlsx

     with open(secret, 'r') as secret_file:
         credentials = yaml.load(secret_file, Loader=yaml.FullLoader)
@@ -772,11 +797,32 @@ def main():
                 f"Oops, file {args.secret} does not contain a password or username")
         secret_file.close()

-    # collect the schema with table input from command-line
-    schema_tables = collect_tables(args)
+    if xlsx:
+        # create dataframe from xlsx table
+        xl_workbook = pd.ExcelFile(xlsx)
+        schema_dataframe = {}  # load the parsed data in a dict: sheet_name -> pandas_frame
+        schema_tables = {}
+
+        for schema in SCHEMA_TYPES:
+            xl_sheet = xl_workbook.parse(schema, header=0)
+            xl_sheet = xl_sheet.drop(0).dropna(how='all')
+            for column_name in list(xl_sheet.columns.values):
+                if 'date' in column_name:
+                    xl_sheet[column_name] = xl_sheet[column_name].apply(update_date)
+
+            if True in xl_sheet.columns.duplicated():
+                sys.exit("Duplicated columns found")
+
+            xl_sheet = check_columns(xl_sheet, schema, action)
+            schema_dataframe[schema] = xl_sheet
+            path = os.path.dirname(os.path.abspath(xlsx))
+            schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
+    else:
+        # collect the schema with table input from command-line
+        schema_tables = collect_tables(args)

-    # create dataframe from table
-    schema_dataframe = create_dataframe(schema_tables, action)
+        # create dataframe from table
+        schema_dataframe = create_dataframe(schema_tables, action)

     # ? add a function to sanitize characters
     # ? print 'validate table for specific action'
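
The per-sheet steps in the `if xlsx:` branch can be followed on an in-memory frame that stands in for the result of `xl_workbook.parse(schema, header=0)`. The column names and the idea that the first row under the header is a description row are assumptions about the template layout, made only for this sketch; no workbook is read.

```python
import os

import pandas as pd

# Stand-in for one parsed sheet (assumed layout: a description row first,
# then the data rows, possibly followed by empty rows).
xl_sheet = pd.DataFrame(
    {
        "alias": ["Unique name of the study", "study_1", None],
        "title": ["Title of the study", "Example study", None],
    }
)

# drop(0) removes the row labelled 0, i.e. the first row under the header;
# dropna(how='all') removes rows in which every cell is empty.
xl_sheet = xl_sheet.drop(0).dropna(how="all")
print(xl_sheet)

# The duplicate check from the diff flags repeated column labels.
has_duplicates = bool(pd.Index(["alias", "title", "alias"]).duplicated().any())
print(has_duplicates)

# The TSV path for each schema is placed in the workbook's directory.
xlsx = os.path.join("example_tables", "ENA_excel_example_ERC000033.xlsx")
path = os.path.dirname(os.path.abspath(xlsx))
tsv_path = f"{path}/ENA_template_study.tsv"
print(tsv_path)
```

Because `drop(0)` works on the index label rather than on the cell content, a sheet whose first row under the header already holds real data would lose that row.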
@@ -797,15 +843,16 @@ def main():
     if 'run' in schema_targets:
         # a dictionary of filename:file_path
         df = schema_targets['run']
-
-        file_paths = {os.path.basename(path): os.path.abspath(path)
-                      for path in args.data}
-        # check if file names identical between command line and table
-        # if not, system exits
-        check_filenames(file_paths, df)
-
+        file_paths = {}
+        if args.data:
+            for path in args.data:
+                file_paths[os.path.basename(path)] = os.path.abspath(path)
+            # check if file names identical between command line and table
+            # if not, system exits
+            check_filenames(file_paths, df)
+
         # generate MD5 sum if not supplied in table
-        if not check_file_checksum(df):
+        if file_paths and not check_file_checksum(df):
             print("No valid checksums found, generate now...", end=" ")
             file_md5 = {filename: get_md5(path) for filename, path
                         in file_paths.items()}
@@ -817,6 +864,10 @@ def main():
             pd.options.mode.chained_assignment = None
             df.loc[:, 'file_checksum'] = md5
             print("done.")
+        elif check_file_checksum(df):
+            print("Valid checksums found", end=" ")
+        else:
+            sys.exit("No valid checksums found and no files given to generate checksum from. Please list the files using the --data option or specify the checksums in the run-table when the data is uploaded separately.")

         schema_targets['run'] = df

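
The three-way checksum decision introduced in the last two hunks can be summarised as a small standalone function. This is a sketch under assumptions: `resolve_checksums` is a hypothetical helper, `get_md5` here is a plain `hashlib` stand-in for the tool's function of the same name, and "valid" is simplified to "every file has a non-empty checksum", which may differ from what `check_file_checksum` actually tests.

```python
import hashlib
import os
import tempfile


def get_md5(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks (stand-in implementation)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_checksums(file_paths, table_checksums):
    """Mirror the if / elif / else branches of the diff.

    file_paths: {file name: path} built from --data (may be empty)
    table_checksums: {file name: checksum or None} taken from the RUN table
    """
    table_is_valid = bool(table_checksums) and all(table_checksums.values())
    if file_paths and not table_is_valid:
        # Files given, no usable checksums in the table: compute them.
        return {name: get_md5(path) for name, path in file_paths.items()}
    if table_is_valid:
        # Checksums already supplied, e.g. data uploaded separately.
        return dict(table_checksums)
    raise SystemExit("No valid checksums found and no files given to generate checksum from.")


with tempfile.TemporaryDirectory() as tmp:
    example = os.path.join(tmp, "reads.fastq.gz")
    with open(example, "wb") as handle:
        handle.write(b"ACGT\n")
    print(resolve_checksums({"reads.fastq.gz": example}, {"reads.fastq.gz": None}))
```

The last branch corresponds to running with `--no_data_upload` and a RUN table that lacks checksums: there is nothing to hash and nothing to trust, so the submission stops.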

Binary file not shown.

requirements.txt

Lines changed: 3 additions & 2 deletions
@@ -1,5 +1,6 @@
 genshi
 lxml
-pandas
+pandas>=1.2
 pyyaml
-requests
+requests
+openpyxl
