add READMEs, fix setup.py and pre-commit-config.yml
marcellszi committed May 16, 2024
1 parent a4a2b16 commit b89c86e
Showing 4 changed files with 83 additions and 19 deletions.
36 changes: 18 additions & 18 deletions .pre-commit-config.yaml
@@ -1,19 +1,19 @@
repos:
- repo: local
hooks:
- id: unittests
name: run unit tests
entry: python -m unittest
language: system
pass_filenames: false
args: ["discover"]
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black
- repo: local
hooks:
- id: unittests
name: run unit tests
entry: python -m unittest
language: system
pass_filenames: false
args: ["discover"]
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black
12 changes: 12 additions & 0 deletions scripts/README.md
@@ -0,0 +1,12 @@
# RNA3DB scripts
Below are brief descriptions of the scripts in this folder.

- `scripts/slurm` is a directory containing useful SLURM scripts.
- `scripts/build_incremental_release_fasta.py` can be used to extract the chains that differ between two `parse.json` files. Useful for incremental releases.
- `scripts/download_pdb_mmcif.sh` is a script for downloading the latest version of the PDB.
- `scripts/fasta_to_json.py` takes a [FASTA format](https://en.wikipedia.org/wiki/FASTA_format) file and creates a [JSON](https://en.wikipedia.org/wiki/JSON) file usable by RNA3DB.
  - **Note:** since FASTA files don't contain this information, `release_date` is set to 1970-01-01, `structure_method` to "", and `resolution` to 0.0.
- `scripts/generate_modifications_cache.py` is used to generate a modifications cache. See [Downloading required data](https://github.com/marcellszi/rna3db/wiki/Building-RNA3DB-from-scratch#downloading-required-data) on the RNA3DB Wiki.
- `scripts/get_nohits.py` looks at a FASTA file and `.tbl` file(s) and identifies entries in the FASTA file that get no hits in any of the `.tbl` file(s). Useful for the second `cmscan`. See [Homology Search](https://github.com/marcellszi/rna3db/wiki/Building-RNA3DB-from-scratch#homology-search) on the RNA3DB Wiki.
- `scripts/json_to_fasta.py` converts an RNA3DB [JSON](https://en.wikipedia.org/wiki/JSON) to a [FASTA file](https://en.wikipedia.org/wiki/FASTA_format).
- `scripts/json_to_mmcif.py` is used to build the single-chain [mmCIFs](https://en.wikipedia.org/wiki/Macromolecular_Crystallographic_Information_File). This script re-reads the chains from a `split.json` and writes them to a hierarchical folder, with each [mmCIF](https://en.wikipedia.org/wiki/Macromolecular_Crystallographic_Information_File) file containing a single chain.
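
The placeholder values mentioned in the note on `scripts/fasta_to_json.py` can be illustrated with a minimal sketch. This is not the script's actual implementation: the function name `fasta_to_records`, the `sequence` field, and the overall JSON layout (one object per FASTA header) are assumptions made for illustration. Only the three default values come from the note.

```python
import json


def fasta_to_records(fasta_text):
    """Parse FASTA text into a dict keyed by header line (hypothetical layout)."""
    records = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New entry: FASTA carries no metadata, so fill in placeholders.
            header = line[1:].strip()
            records[header] = {
                "sequence": "",              # assumed field name
                "release_date": "1970-01-01",
                "structure_method": "",
                "resolution": 0.0,
            }
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header]["sequence"] += line
    return records


# Made-up identifiers and sequences, for illustration only.
example = ">example_chain_1\nGGCU\nAACC\n>example_chain_2\nAUGC\n"
print(json.dumps(fasta_to_records(example), indent=2))
```

Anything consuming such a file should treat these three fields as "unknown" rather than as real metadata, for example when filtering by date or resolution.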
52 changes: 52 additions & 0 deletions scripts/slurm/README.md
@@ -0,0 +1,52 @@
# RNA3DB SLURM scripts

These [SLURM](https://slurm.schedmd.com/documentation.html) scripts will eventually be used to build releases automatically.

> **Note:** The scripts are experimental as they haven't been rigorously tested.

## Getting started
The first of these scripts, `build_full_release.slurm`, builds an entire release from scratch. This script runs a homology search on all chains found in the PDB, so it takes a long time to run.

The second script, `build_incremental_release.slurm`, adds new chains (added to the PDB since the last release) to an existing release.

Both files start with a number of [sbatch](https://slurm.schedmd.com/sbatch.html) SLURM commands:
```sh
#SBATCH -c 64
#SBATCH -t 0
#SBATCH -p <insert partition here>
#SBATCH --mem=64000
#SBATCH -o logs/rna3db_full_release_%j.out
#SBATCH -e logs/rna3db_full_release_%j.err
#SBATCH --mail-user=<insert email address here>
#SBATCH --mail-type=ALL
```
You will likely need to edit some of these options before using these scripts. See the [SLURM documentation for sbatch](https://slurm.schedmd.com/sbatch.html) for what each option means. At a minimum, you will need to either enter a partition or remove the `-p` option. Similarly, you will need to edit the `--mail-user` option.
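
As a rough aid, the following Python sketch lists the `#SBATCH` lines that still contain an `<insert ...>` placeholder. It is not part of the repository; the helper name `find_unedited_sbatch_lines` is hypothetical, and it assumes placeholders always look like the ones in the header shown here.

```python
import re

# Matches placeholders such as "<insert partition here>".
PLACEHOLDER = re.compile(r"<insert [^>]*>")


def find_unedited_sbatch_lines(script_text):
    """Return the #SBATCH lines that still contain an '<insert ...>' placeholder."""
    return [
        line.strip()
        for line in script_text.splitlines()
        if line.strip().startswith("#SBATCH") and PLACEHOLDER.search(line)
    ]


header = """#SBATCH -c 64
#SBATCH -t 0
#SBATCH -p <insert partition here>
#SBATCH --mem=64000
#SBATCH --mail-user=<insert email address here>
#SBATCH --mail-type=ALL
"""

for line in find_unedited_sbatch_lines(header):
    print("still needs editing:", line)
```

In practice you would read the text from `build_full_release.slurm` or `build_incremental_release.slurm` instead of the inline string.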

Next, there are a number of paths you need to set in both files:
```sh
# where you want the release to be output to
OUTPUT_DIR=""
# where the latest release is located
OLD_RELEASE=""

# you set these once and forget
RNA3DB_ROOT_DIR=""
PDB_MMCIF_DIR=""
CMSCAN=""
CMDB=""
```
- `OUTPUT_DIR` is the root directory where the release will be placed.
- `OLD_RELEASE` is the path to the directory of the release you want to add the new PDB chains to. It is only needed when building an incremental release.
- `RNA3DB_ROOT_DIR` is the path to the rna3db repository. Scripts are called from `$RNA3DB_ROOT_DIR/scripts/`.
- `CMSCAN` is the path to the `cmscan` executable.
- `CMDB` is the path to the covariance models you want to use for the homology search (`cmscan`). Usually this would come from [Rfam](https://rfam.org/) in the form of `Rfam.cm`.
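
Before submitting a job, it can help to confirm that none of these variables was left empty. The sketch below is a hypothetical helper, not part of the repository: the name `unset_variables` is an assumption, the example paths are made up, and it only understands simple `NAME="value"` lines like those in the block above.

```python
import re

# Variables used by both scripts; OLD_RELEASE is only needed for incremental builds.
REQUIRED = ["OUTPUT_DIR", "RNA3DB_ROOT_DIR", "PDB_MMCIF_DIR", "CMSCAN", "CMDB"]

# Matches simple shell assignments such as NAME="value" or NAME=value.
ASSIGNMENT = re.compile(r'^\s*([A-Z0-9_]+)="?([^"\n]*)"?\s*$')


def unset_variables(script_text, incremental=False):
    """Return the names of required variables that are missing or empty."""
    required = list(REQUIRED)
    if incremental:
        required.append("OLD_RELEASE")
    values = {}
    for line in script_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = match.group(2).strip()
    return [name for name in required if not values.get(name)]


# Made-up paths, for illustration only.
script = '''OUTPUT_DIR="/data/rna3db/release"
OLD_RELEASE=""
RNA3DB_ROOT_DIR="/opt/rna3db"
PDB_MMCIF_DIR=""
CMSCAN="/usr/bin/cmscan"
CMDB="/data/Rfam.cm"
'''

print(unset_variables(script))
print(unset_variables(script, incremental=True))
```

The check only tells you that a value was entered, not that the path exists on the cluster.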

Once you have set the required paths and edited the sbatch commands as required, you can simply run the jobs via:
```sh
$ sbatch build_full_release.slurm
```
Or:
```sh
$ sbatch build_incremental_release.slurm
```
2 changes: 1 addition & 1 deletion setup.py
@@ -6,5 +6,5 @@
description="A dataset for training and benchmarking deep learning models for RNA structure prediction",
author="Marcell Szikszai",
packages=find_packages(exclude=["tests", "scripts", "data"]),
install_requires=["biopython", "tqdm", "pre-commit"],
install_requires=["biopython", "tqdm", "black", "pre-commit"],
)