ChemBAI

This repository contains utilities for training models on the ToxCast v4.1 data set and using them to predict toxicological activity.

Features

Molecular fingerprints: MACCS, Morgan, RDKit, Pattern, Layered
Algorithms: Decision Tree, Logistic Regression, Gradient Boost Tree, XGBoost, Random Forest
Centralised configuration via ToxCast_model/config.py
Single entry script run_pipeline.sh to generate fingerprints, train models and run predictions
Prediction runs save results in timestamped folders and write a metadata.json summary in experiments/<PROJECT_NAME>/results/
Configurable model version via VERSION in config.py
Fingerprint generation skips automatically if files already exist
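
The skip-if-exists behaviour for fingerprints can be pictured roughly as follows. This is a minimal sketch and not the repository's code: the function name `ensure_fingerprints`, the `.csv` file naming and the `compute` callback are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable


def ensure_fingerprints(
    out_dir: Path,
    fingerprint_types: Iterable[str],
    compute: Callable[[str], str],
) -> list[str]:
    """Write one file per fingerprint type, skipping files that already exist.

    Returns the names of the fingerprint types that were actually computed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    generated = []
    for name in fingerprint_types:
        target = out_dir / f"{name}.csv"  # hypothetical naming scheme
        if target.exists():
            continue  # produced by an earlier run, so nothing to do
        target.write_text(compute(name))
        generated.append(name)
    return generated
```

Calling the function a second time with the same fingerprint types returns an empty list, because every target file is already present.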

Project layout

Final_model_save/           # Pretrained models
ToxCast_model/              # Main source code
├─ experiments/             # Input and output of each experiment
├─ prediction/              # Prediction utilities
├─ run/                     # Training scripts (dt.py, rf.py, ...)
└─ toxcast_pkg/             # Helper modules

Quick start

1. Prepare an Excel file containing the data and assay_list sheets.
2. Copy Template/template_for_predict(PROJECT_NAME) into ToxCast_model/experiments/ and rename the folder to your project name.
3. Place the Excel file from step 1 inside this new folder.
4. Edit ToxCast_model/config.py and set PROJECT_NAME to the folder name from step 2.
5. Optionally set VERSION = 2 in config.py to use the ToxCast_model_v.2 models (default is 1).
6. Set OBJECT = 0 in config.py for prediction mode.
7. Set MODEL_SELECTION = 0 in config.py for best F1 mode.
8. Run the pipeline from the project root:

bash launcher.sh

The launcher and Slurm scripts automatically load the cuda/12.2.1 and python/3.11.2 modules to enable GPU execution. Adjust the module names if your environment differs.
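
For orientation, the settings named in the steps above might look like this inside ToxCast_model/config.py. Only the variable names and values come from this README. The project name is a placeholder, and the real file probably contains further settings laid out differently.

```python
# Sketch of the prediction-related settings in ToxCast_model/config.py.
PROJECT_NAME = "my_project"  # placeholder: your folder under ToxCast_model/experiments/
VERSION = 1                  # default; 2 selects the ToxCast_model_v.2 models
OBJECT = 0                   # 0 = prediction mode
MODEL_SELECTION = 0          # 0 = best F1 mode
```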

Local usage

If Bash is not available, run the same steps using the Python helper:

python run_local.py download-template --out .
python run_local.py predict

For an interactive option, launch the simple GUI:

python run_local_gui.py

Use the buttons to download the template, select your filled Excel file and run the prediction pipeline on your local machine.

Fingerprints are generated only once and stored under experiments/<PROJECT_NAME>/fingerprints/. Prediction results are saved under experiments/<PROJECT_NAME>/results/<timestamp>/, and a cumulative metadata.json is written to experiments/<PROJECT_NAME>/results/.
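
The results layout described above (one timestamped folder per run plus a cumulative metadata.json) could be produced along these lines. This is a sketch under assumptions: the function name `save_run`, the timestamp format and the shape of the metadata entries are invented for illustration and may not match what the pipeline writes.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_run(results_root: Path, summary: dict, now: Optional[datetime] = None) -> Path:
    """Create a timestamped run folder and append the run to metadata.json."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    run_dir = results_root / stamp
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an earlier run

    meta_path = results_root / "metadata.json"
    # Load earlier entries if the cumulative file already exists.
    runs = json.loads(meta_path.read_text()) if meta_path.exists() else []
    runs.append({"run": stamp, **summary})
    meta_path.write_text(json.dumps(runs, indent=2))
    return run_dir
```

Because metadata.json sits next to the run folders instead of inside them, each new run extends the same file instead of starting a fresh one.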

Building standalone binaries

Install pyinstaller and run the helper script to create an executable under the Release directory. The script bundles the Template and ToxCast_model folders so the program can be distributed without the rest of the repository. Build the executable on each platform you want to support:

pip install pyinstaller
python build_release.py

On macOS, running python build_release.py creates ChemBAI_Predictor (or a .app bundle, depending on your PyInstaller version) inside Release/. Double-click this file to launch the GUI, then use it to download the template, select your input file and run predictions locally.

You must build the binary on each target platform (macOS or Windows) because the executables are platform specific.

Environment setup

Install dependencies with conda using environment.yml or via pip install -r requirements.txt.

Training scripts

The ToxCast_model/run directory contains standalone scripts for each algorithm. They perform cross-validation to select the best model and save it as a joblib file for later prediction.
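
The idea of selecting a model by cross-validation can be sketched without any third-party library. The code below is illustrative only and makes several assumptions: it scores candidates by mean F1 across folds, uses contiguous unshuffled folds, and represents each candidate as a plain `fit` function. The actual scripts may use a different metric, splitter and library, and the joblib saving step is not shown.

```python
from statistics import mean
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def kfold(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n samples."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        test = list(range(lo, hi))
        train = [i for i in range(n) if i < lo or i >= hi]
        yield train, test


def select_best(candidates: Dict[str, Callable], X: Sequence, y: Sequence[int], k: int = 5):
    """Return (name, mean_f1) of the candidate with the highest mean CV F1.

    Each candidate is a function fit(X_train, y_train) that returns a
    predict(X_test) function.
    """
    scores = {}
    for name, fit in candidates.items():
        fold_scores = []
        for train, test in kfold(len(X), k):
            predict = fit([X[i] for i in train], [y[i] for i in train])
            y_pred = predict([X[i] for i in test])
            fold_scores.append(f1_score([y[i] for i in test], y_pred))
        scores[name] = mean(fold_scores)
    best = max(scores, key=scores.get)
    return best, scores[best]
```

In practice the winning estimator would then be refitted and written to disk, which is the step the README describes as saving a joblib file.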

Slurm-based VERSION=3 training

For large VERSION=3 experiments you can orchestrate GPU training directly on Slurm without exceeding the submission quota:

1. Configure your experiment in ToxCast_model/config.py and ensure OBJECTS = ["prediction", "training"], OBJECT = 1, and VERSION = 3.
2. Review slurm/training_config.yaml. The file controls project paths, available seeds, assay/model/fingerprint combinations, resource requests, queue throttling and environment setup (modules and conda).
3. Submit the jobs from the repository root:
```
python slurm/submit_v3_training.py
```
Use --dry-run to inspect the planned submissions without calling sbatch.

Each job trains a single assay/model/fingerprint combination sequentially over all configured seeds, writes the logs under experiments/<PROJECT>/logs, and saves models under experiments/<PROJECT>/results. Queue length is monitored to avoid overwhelming the scheduler, and a summary job is scheduled automatically after the training jobs finish.
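
The job layout and queue throttling described above can be pictured with the following sketch. It is not the code in slurm/submit_v3_training.py: the function names are hypothetical, and the queue check, submission and waiting are passed in as plain callables so the logic can be read without a Slurm cluster. A real script would presumably call squeue and sbatch at those points.

```python
from itertools import product
from typing import Callable, Iterable, List


def plan_jobs(assays: Iterable[str], models: Iterable[str],
              fingerprints: Iterable[str], seeds: Iterable[int]) -> List[dict]:
    """One job per assay/model/fingerprint combination; all seeds run inside the job."""
    seed_list = list(seeds)
    return [
        {"assay": a, "model": m, "fingerprint": f, "seeds": list(seed_list)}
        for a, m, f in product(assays, models, fingerprints)
    ]


def submit_throttled(jobs: Iterable[dict], queue_length: Callable[[], int],
                     submit: Callable[[dict], None], max_queued: int,
                     wait: Callable[[], None]) -> None:
    """Submit jobs one at a time, pausing while the queue is at its limit."""
    for job in jobs:
        while queue_length() >= max_queued:
            wait()  # a real script would sleep here before polling again
        submit(job)
```

Keeping seeds inside a job, instead of making each seed its own job, keeps the number of submissions equal to the number of combinations, which is what makes it practical to stay under a submission quota.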

Prediction

ToxCast_model/prediction/Predict_data.py loads trained models specified in config.py and generates predictions for each assay. The script appends the original SMILES strings and writes an Excel file with the predictions.

License

This project is licensed under the terms of the MIT License.
