A tool for parsing GFF and TSV files produced by mettannotator from MGnify. It parses the mettannotator combined GFF file looking for CDS records and looks up the complementary AMRFinderPlus TSV file for more in-depth results not transferred to the combined GFF file.
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

```shell
python3 parse_amr.py --files path/to/gff.gff
```

Running with `--help` will print all available commands.
For more information about this step and subsequent ones, consult [PROCESSING.md](PROCESSING.md).
The program will create four output files:

- `amr_genotype.csv`: a CSV of all records found from parsing the specified GFF files
- `amr_genotype.parquet`: Parquet representation of the same data
- `assembly.csv`: a CSV of all assemblies found and processed from parsing the specified GFF files
- `assembly.parquet`: Parquet representation of the same assembly data
The library creates JSON-formatted schemas which, when processed by `src.schema.load_schema_from_config()`, are converted into a pyarrow schema. A schema record has the following fields:

- `name`: name of the column
- `type`: data type of the column. We support `string`, `int8`, `int16`, `int32`, `int64`, `float16`, `float32`, `float64`, `bool`, `timestamp[ns]` (supports `s`, `ms`, `us` and `ns`), `duration` (same units as before), `time32[s]` (`s` and `ms`), `time64` (`us` and `ns`), `uuid` and `binary`
- `nullable`: whether the column can be null
- `description`: description of the field. Will be used to create the markdown table
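As a concrete illustration, a schema config might look like the following. The column names, types and descriptions here are invented for the example, and the exact top-level layout of the file is an assumption rather than taken from the project:

```json
[
  {
    "name": "assembly",
    "type": "string",
    "nullable": false,
    "description": "ENA assembly accession"
  },
  {
    "name": "coverage",
    "type": "float64",
    "nullable": true,
    "description": "Percent coverage of the reference gene"
  }
]
```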
The `scripts` directory contains a series of additional tools which can be used to further process the data:
- `add_country_from_country_code.py`: for a given parquet file, add country information derived from an ISO 3-letter country code
- `apply_new_schema_to_parquet.py`: apply a schema to an existing parquet file
- `convert_and_merge_csv_to_parquet.py`: take a set of CSV files, convert them to parquet and merge
- `generate_sbatch.py`: generate an sbatch script which will split a list of files into job arrays of a specified length for Slurm
- `generate_schema_from_parquet.py`: take a parquet file and generate a schema JSON file
- `join_parquet.py`: perform a SQL-like join between parquet data sets
- `lookup_quickly.py`: look up antibiotics quickly using the lookup library
- `parquet_to_csv_gz.py`: convert parquet files to CSV. If you use a `.gz` extension the output will be compressed
- `post_fixes.py`: run a series of fixes on CABBAGE data. Should always be run until the fixes are incorporated upstream
- `schema_to_markdown_table.py`: take a schema JSON file and create a markdown-formatted table
- `stream_merge_parquet.py`: take multiple parquet files and merge them together
This tool requires an internet connection to contact the following APIs:
- ENA: https://www.ebi.ac.uk/ena
- BioSamples: https://www.ebi.ac.uk/biosamples
- OLS (Ontology Lookup Service): https://www.ebi.ac.uk/ols4
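For reference, lookups against these services are plain HTTPS requests. The sketch below builds example URLs for the ENA and BioSamples record endpoints; the endpoint paths shown are illustrative public endpoints of those services, not necessarily the exact calls this tool makes:

```python
from urllib.parse import quote


def ena_browser_url(accession: str) -> str:
    # ENA browser API: XML record for an accession (illustrative endpoint)
    return f"https://www.ebi.ac.uk/ena/browser/api/xml/{quote(accession)}"


def biosamples_url(accession: str) -> str:
    # BioSamples: JSON record for a sample accession (illustrative endpoint)
    return f"https://www.ebi.ac.uk/biosamples/samples/{quote(accession)}"


print(ena_browser_url("SAMEA000000"))
print(biosamples_url("SAMEA000000"))
```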
There is a minimal test suite available for ensuring that some of the required parsing is correct.
Example output files can be found in the example_data directory.
The original version of this code is held in the old_code directory. It is no longer supported.