DuckDB Schema Initialization for antiSMASH Database

This guide explains how to initialize a local antiSMASH database schema using DuckDB.

Quick Start

Set up the database and the importer requirements with the following steps:

# Part 1: Build database from schema
git clone git@github.com:NBChub/antismash_db-schema_duckdb.git
cd antismash_db-schema_duckdb
python -m venv antismash_db_duckb
source ./antismash_db_duckb/bin/activate
pip install -r requirements.txt
git clone https://github.com/antismash/db-schema.git
python init_duckdb.py db-schema duckdb-schema
deactivate

# Part 2: Setup importer requirements
mamba env create -f env.yaml
conda run -n antismash_db_env bash env.post-deploy.sh
conda activate antismash_db_env
# 1. Download NCBI taxdump:
wget -P ncbi-taxdump https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz -nc
(cd ncbi-taxdump && tar -xvf new_taxdump.tar.gz)
# 2. Install NCBI taxonomy handler to create the JSON taxdump (requires Rust):
cargo install asdb-taxa
# Add Cargo's bin directory to your PATH:
export PATH="$HOME/.cargo/bin:$PATH"
# 3. Clone the JSON importer:
git clone git@github.com:matinnuhamunada/db-import.git
(cd db-import && git checkout -b v4.0.0-duckdb v4.0.0-duckdb)

Set the environment variable for your Entrez API key (generated from your NCBI account):
```
export ASDBI_ENTREZ_API_KEY=<your_entrez_api_key>
```

Populate database with antiSMASH results

bash full_workflow.sh <your antiSMASH output directory>

Usage

Step 1: Building the Database from Schema

Clone this repository

First, clone this repository to your local machine using the following Git command:
```
git clone git@github.com:NBChub/antismash_db-schema_duckdb.git
cd antismash_db-schema_duckdb
```
Create a Virtual Environment & Install Dependencies

Create a virtual environment using venv. This helps to isolate your project's dependencies from your system-wide Python packages.

Open your terminal and navigate to your project directory, then run:
```
python -m venv antismash_db_duckb
source ./antismash_db_duckb/bin/activate
pip install -r requirements.txt
deactivate
```
Initialize the DuckDB Schema

Use the init_duckdb.py script to initialize the DuckDB schema. You need to specify the directory of the cloned SQL files and the output directory for the DuckDB schema files.
- db-schema: The directory containing the SQL files cloned from the antiSMASH DB schema repository.
- duckdb-schema: The directory where the DuckDB schema files will be stored.
```
source ./antismash_db_duckb/bin/activate
git clone https://github.com/antismash/db-schema.git
python init_duckdb.py db-schema duckdb-schema
deactivate
```

Step 2: Installing prerequisites for importing JSONs

Before you start, make sure you have the following:

Conda/Mamba: Install it by following the official installation instructions.

Then follow these steps to install the required packages and repositories:

Create the Conda environment:

mamba env create -f env.yaml
conda run -n antismash_db_env bash env.post-deploy.sh

Install the necessary components for the importer:

# Activate the Conda environment:
conda activate antismash_db_env

# 1. Download NCBI taxdump:
wget -P ncbi-taxdump https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz -nc
(cd ncbi-taxdump && tar -xvf new_taxdump.tar.gz)

# 2. Install NCBI taxonomy handler to create the JSON taxdump (requires Rust):
cargo install asdb-taxa
# Add Cargo's bin directory to your PATH:
export PATH="$HOME/.cargo/bin:$PATH"

# 3. Clone the JSON importer:
git clone git@github.com:matinnuhamunada/db-import.git
(cd db-import && git checkout -b v4.0.0-duckdb v4.0.0-duckdb)
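Before moving on, it can help to confirm that the tools used in the steps above are actually on your PATH. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper, and the binary name `asdb-taxa` is assumed to match the crate name.

```shell
# Hypothetical helper: print each command that is not on PATH.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call (assumed tool list, adjust to your setup):
#   missing_tools git wget tar mamba conda cargo asdb-taxa
```

Any name it prints needs to be installed, or its directory added to PATH, before you run the importer.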

Importing antiSMASH JSONs to the database

The import writes into the database built in Step 1. As a reminder, the init_duckdb.py script processes the antiSMASH schema SQL files in the specified input directory (db-schema), converts them to be compatible with DuckDB, and then initializes the schema in the specified output directory (duckdb-schema). The output folder contains the DuckDB database file (duckdb-schema/antismash_db.duckdb) and the converted SQL schema files.
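A quick way to confirm that the schema step produced the database file before importing anything is sketched below. `check_schema_output` is a hypothetical helper, not part of this repository; it only checks for the file name given above.

```shell
# Hypothetical helper: check that the schema output directory
# contains the DuckDB database file described above.
check_schema_output() {
    local out_dir="${1:-duckdb-schema}"
    local db_file="$out_dir/antismash_db.duckdb"
    if [ -f "$db_file" ]; then
        echo "found: $db_file"
    else
        echo "missing: $db_file" >&2
        return 1
    fi
}

# Example call from the repository root:
#   check_schema_output duckdb-schema
```

If the file is reported missing, rerun the schema initialization from Step 1 before importing.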

Setting Up Environment Variables

Before you start, you need to generate an Entrez API Key. The Entrez API Key is used to access NCBI's suite of interconnected databases (including PubMed, GenBank, and more) through their E-utilities API. You can generate one from the settings page of your NCBI account.

Once you have your Entrez API Key, set the following environment variable:

ASDBI_ENTREZ_API_KEY: Your Entrez API key.

You can set this variable in your current shell and save it to a .env file in the root directory of the project by running:

export ASDBI_ENTREZ_API_KEY=<your_entrez_api_key>
echo "export ASDBI_ENTREZ_API_KEY=$ASDBI_ENTREZ_API_KEY" > .env

Replace <your_entrez_api_key> with your actual Entrez API key.

Once the variable has been written to the .env file, you can load it into a new shell session by running:

source .env

This command reads the .env file and exports the variable so that scripts and applications running in your shell can access it.
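To confirm the variable is visible in the current shell without printing the key itself, a small guard like the one below can be used. This is a sketch; `require_entrez_key` is a hypothetical helper and not part of the repository.

```shell
# Hypothetical helper: fail early if the Entrez API key is not exported.
# The key value itself is never printed.
require_entrez_key() {
    if [ -z "${ASDBI_ENTREZ_API_KEY:-}" ]; then
        echo "ASDBI_ENTREZ_API_KEY is not set; run 'source .env' first" >&2
        return 1
    fi
    echo "ASDBI_ENTREZ_API_KEY is set"
}

# Example:
#   source .env && require_entrez_key
```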

Importing JSON to the database

Run full_workflow.sh to import your antiSMASH results into the database:

bash full_workflow.sh <your antiSMASH output directory>

For example, you can fetch the S. coelicolor example and add it to the database:

 wget https://antismash-db.secondarymetabolites.org/output/GCF_008931305.1/GCF_008931305.1.json -nc -P input_files/
 bash full_workflow.sh input_files/
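Before starting an import, you may want to check how many JSON files the input directory contains. The sketch below uses a hypothetical helper, `count_json_inputs`; it searches recursively, which is an assumption and may not match how full_workflow.sh discovers its inputs.

```shell
# Hypothetical helper: count *.json files under a directory (recursive).
count_json_inputs() {
    local input_dir="$1"
    find "$input_dir" -type f -name '*.json' | wc -l | tr -d ' '
}

# Example:
#   count_json_inputs input_files/
```

A count of 0 usually means the download failed or the wrong directory was passed.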

Exploring and visualizing the database

There are multiple ways to interact with the DuckDB database. We recommend DBeaver as an easy starting point; for other options, refer to the DuckDB documentation.
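For a quick look from the terminal, the DuckDB command-line client can list the tables in the database. The sketch below assumes the `duckdb` CLI is installed separately (it is not known to be part of the environments created above) and that the database sits at the default path from Step 1; `list_tables` is a hypothetical helper.

```shell
# Hypothetical helper: list all tables (across schemas) in the database.
# Opens the file read-only so that exploring cannot modify it.
list_tables() {
    local db_file="${1:-duckdb-schema/antismash_db.duckdb}"
    if [ ! -f "$db_file" ]; then
        echo "database not found: $db_file" >&2
        return 1
    fi
    if ! command -v duckdb >/dev/null 2>&1; then
        echo "duckdb CLI not found on PATH" >&2
        return 127
    fi
    duckdb -readonly "$db_file" "SHOW ALL TABLES;"
}

# Example:
#   list_tables duckdb-schema/antismash_db.duckdb
```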
