add READMEs, fix setup.py and pre-commit-config.yml

marcellszi · May 16, 2024 · b89c86e
1 parent a4a2b16

Showing 4 changed files with 83 additions and 19 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,19 +1,19 @@
 repos:
-- repo: local
-  hooks:
-    - id: unittests
-    name: run unit tests
-    entry: python -m unittest
-    language: system
-    pass_filenames: false
-    args: ["discover"]
-- repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: v2.3.0
-  hooks:
-    - id: check-yaml
-    - id: end-of-file-fixer
-    - id: trailing-whitespace
-- repo: https://github.com/psf/black
-  rev: 24.3.0
-  hooks:
-    - id: black
+  - repo: local
+    hooks:
+      - id: unittests
+        name: run unit tests
+        entry: python -m unittest
+        language: system
+        pass_filenames: false
+        args: ["discover"]
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v2.3.0
+    hooks:
+      - id: check-yaml
+      - id: end-of-file-fixer
+      - id: trailing-whitespace
+  - repo: https://github.com/psf/black
+    rev: 24.3.0
+    hooks:
+      - id: black
diff --git a/scripts/README.md b/scripts/README.md
@@ -0,0 +1,12 @@
+# RNA3DB scripts
+Below are brief descriptions of the scripts in this folder.
+
+- `scripts/slurm` is a directory containing useful SLURM scripts.
+- `scripts/build_incremental_release_fasta.py` can be used to extract the chains that differ between two `parse.json` files. Useful for incremental releases.
+- `scripts/download_pdb_mmcif.sh` is a script for downloading the latest version of the PDB.
+- `scripts/fasta_to_json.py` takes a [FASTA format](https://en.wikipedia.org/wiki/FASTA_format) file and creates a [JSON](https://en.wikipedia.org/wiki/JSON) file usable by RNA3DB.
+    - **Note:** Since FASTA files don't contain this information, the `release_date` is set to 1970-01-01, `structure_method` to "", and `resolution` to 0.0.
+- `scripts/generate_modifications_cache.py` is used to generate a modifications cache. See [Downloading required data](https://github.com/marcellszi/rna3db/wiki/Building-RNA3DB-from-scratch#downloading-required-data) on the RNA3DB Wiki.
+- `scripts/get_nohits.py` looks at a FASTA file and `.tbl` file(s) and identifies entries in the FASTA file that get no hits in any of the `.tbl` file(s). Useful for the second `cmscan`. See [Homology Search](https://github.com/marcellszi/rna3db/wiki/Building-RNA3DB-from-scratch#homology-search) on the RNA3DB Wiki.
+- `scripts/json_to_fasta.py` converts an RNA3DB [JSON](https://en.wikipedia.org/wiki/JSON) to a [FASTA file](https://en.wikipedia.org/wiki/FASTA_format).
+- `scripts/json_to_mmcif.py` is used to build the single-chain [mmCIFs](https://en.wikipedia.org/wiki/Macromolecular_Crystallographic_Information_File). This script re-reads the chains from a `split.json` and writes them to a hierarchical folder, with each [mmCIF](https://en.wikipedia.org/wiki/Macromolecular_Crystallographic_Information_File) file containing a single chain.
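
The no-hits filtering described for `scripts/get_nohits.py` can be illustrated with a minimal Python sketch. This is not the script's actual implementation. The function names are hypothetical, and the sketch assumes the `.tbl` files use Infernal's default `--tblout` layout, where lines starting with `#` are comments and the third whitespace-delimited column holds the query sequence name.

```python
from typing import Iterable, List, Set


def fasta_ids(fasta_text: str) -> List[str]:
    """Return the first whitespace-delimited token of every '>' header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def tbl_query_names(tbl_text: str, query_col: int = 2) -> Set[str]:
    """Collect query names from a tabular hit file.

    Assumption: comment lines start with '#', and the query name sits in
    column `query_col` (0-based) of each whitespace-delimited hit line.
    """
    hits = set()
    for line in tbl_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) > query_col:
            hits.add(fields[query_col])
    return hits


def no_hits(fasta_text: str, tbl_texts: Iterable[str]) -> List[str]:
    """Return FASTA IDs (in file order) that appear in none of the hit tables."""
    hit_ids: Set[str] = set()
    for tbl in tbl_texts:
        hit_ids |= tbl_query_names(tbl)
    return [seq_id for seq_id in fasta_ids(fasta_text) if seq_id not in hit_ids]
```

In practice the inputs would be read from disk (for example with `pathlib.Path.read_text`), and the IDs returned would be the entries passed on to the second `cmscan` run.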
diff --git a/scripts/slurm/README.md b/scripts/slurm/README.md
@@ -0,0 +1,52 @@
+# RNA3DB SLURM scripts
+
+These [SLURM](https://slurm.schedmd.com/documentation.html) scripts will eventually be used to build releases automatically.
+
+> **Note:** The scripts are experimental as they haven't been rigorously tested.
+
+
+## Getting started
+The first of these scripts, `build_full_release.slurm`, builds an entire release from the start. This script does a homology search on all chains found in the PDB, so it takes a long time to run.
+
+The second script, `build_incremental_release.slurm`, adds new chains (those added to the PDB since the last release) to an existing release.
+
+Both files start with a number of [sbatch](https://slurm.schedmd.com/sbatch.html) SLURM commands:
+```sh
+#SBATCH -c 64
+#SBATCH -t 0
+#SBATCH -p <insert partition here>
+#SBATCH --mem=64000
+#SBATCH -o logs/rna3db_full_release_%j.out
+#SBATCH -e logs/rna3db_full_release_%j.err
+#SBATCH --mail-user=<insert email address here>
+#SBATCH --mail-type=ALL
+```
+You will likely need to edit some of these options if you want to use these scripts. Please see the [SLURM documentation for sbatch](https://slurm.schedmd.com/sbatch.html) for what each line means. At a minimum, you will need to either enter a partition or remove the `-p` option. Similarly, you will need to edit the `--mail-user` option.
+
+Next, there are a number of paths you need to set in both files:
+```sh
+# where you want the release to be output to
+OUTPUT_DIR=""
+# where the latest release is located
+OLD_RELEASE=""
+
+# you set these once and forget
+RNA3DB_ROOT_DIR=""
+PDB_MMCIF_DIR=""
+CMSCAN=""
+CMDB=""
+```
+- `OUTPUT_DIR` specifies the root directory where the release will be placed.
+- `OLD_RELEASE` is the path to the directory of the release you want to add the new PDB chains to. It is only needed when building an incremental release.
+- `RNA3DB_ROOT_DIR` is the path to the rna3db repository. Scripts are called from `$RNA3DB_ROOT_DIR/scripts/`.
+- `CMSCAN` is the path to the `cmscan` executable.
+- `CMDB` is the path to the covariance models you want to use for the homology search (`cmscan`). Usually this would come from [Rfam](https://rfam.org/) in the form of `Rfam.cm`.
+
+Once you have set the required paths and edited the sbatch options as needed, you can submit the jobs via:
+```sh
+$ sbatch build_full_release.slurm
+```
+Or:
+```sh
+$ sbatch build_incremental_release.slurm
+```
diff --git a/setup.py b/setup.py
@@ -6,5 +6,5 @@
     description="A dataset for training and benchmarking deep learning models for RNA structure prediction",
     author="Marcell Szikszai",
     packages=find_packages(exclude=["tests", "scripts", "data"]),
-    install_requires=["biopython", "tqdm", "pre-commit"],
+    install_requires=["biopython", "tqdm", "black", "pre-commit"],
 )