This repository contains utilities for training and predicting toxicological activity using the ToxCast v4.1 data set.
- Molecular fingerprints: MACCS, Morgan, RDKit, Pattern, Layered
- Algorithms: Decision Tree, Logistic Regression, Gradient Boost Tree, XGBoost, Random Forest
- Centralised configuration via ToxCast_model/config.py
- Single entry script run_pipeline.sh to generate fingerprints, train models and run predictions
- Prediction runs save results in timestamped folders and write a metadata.json summary in experiments/<PROJECT_NAME>/results/
- Configurable model version via VERSION in config.py
- Fingerprint generation skips automatically if files already exist
Final_model_save/ # Pretrained models
ToxCast_model/ # Main source code
├─ experiments/ # Input and output of each experiment
├─ prediction/ # Prediction utilities
├─ run/ # Training scripts (dt.py, rf.py, ...)
└─ toxcast_pkg/ # Helper modules
1. Prepare an Excel file containing the data and assay_list sheets.
2. Copy Template/template_for_predict(PROJECT_NAME) into ToxCast_model/experiments/ and rename the folder to your project name.
3. Place the Excel file from step 1 inside this new folder.
4. Edit ToxCast_model/config.py and set PROJECT_NAME to the folder name from step 2.
5. Optionally set VERSION = 2 in config.py to use the ToxCast_model_v.2 models (default is 1).
6. Set OBJECT = 0 in config.py for prediction mode.
7. Set MODEL_SELECTION = 0 in config.py for best F1 mode.
8. Run the pipeline from the project root:
bash launcher.sh
The launcher and Slurm scripts automatically load the cuda/12.2.1 and
python/3.11.2 modules to enable GPU execution. Adjust the module names if your
environment differs.
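For a prediction run, the settings named in the steps above might look like the following fragment of ToxCast_model/config.py. This is only a sketch: the variable names come from the steps, the project name is a placeholder, and the real file may contain further settings.

```python
# Hypothetical excerpt of ToxCast_model/config.py for a prediction run.
# "my_project" is a placeholder; use the name of the folder created under
# ToxCast_model/experiments/.
PROJECT_NAME = "my_project"

VERSION = 1          # 1 is the default; 2 selects the ToxCast_model_v.2 models
OBJECT = 0           # 0 = prediction mode
MODEL_SELECTION = 0  # 0 = best F1 mode
```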
If Bash is not available, run the same steps using the Python helper:
python run_local.py download-template --out .
python run_local.py predict
For an interactive option, launch the simple GUI:
python run_local_gui.py
Use the buttons to download the template, select your filled Excel file and run the prediction pipeline on your local machine.
Fingerprints are generated only once and stored under experiments/<PROJECT_NAME>/fingerprints/. Prediction results are saved under experiments/<PROJECT_NAME>/results/<timestamp>/, and a cumulative metadata.json is written to experiments/<PROJECT_NAME>/results/.
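The caching and bookkeeping described above can be sketched with the standard library alone. The function names, the .csv extension for cached fingerprints, the timestamp format and the list layout of metadata.json are all assumptions made for illustration; the repository's own implementation may differ.

```python
import json
from datetime import datetime
from pathlib import Path


def fingerprints_missing(project_dir: Path, names: list[str]) -> list[str]:
    """Return the fingerprint names that have no cached file yet."""
    fp_dir = project_dir / "fingerprints"
    # The ".csv" extension is an assumption made for this sketch.
    return [n for n in names if not (fp_dir / f"{n}.csv").exists()]


def record_run(project_dir: Path, summary: dict) -> Path:
    """Create results/<timestamp>/ and append the run to results/metadata.json."""
    results_dir = project_dir / "results"
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    run_dir = results_dir / stamp
    run_dir.mkdir(parents=True, exist_ok=True)

    # Keep one cumulative list of runs; start a new list on the first run.
    meta_path = results_dir / "metadata.json"
    runs = json.loads(meta_path.read_text()) if meta_path.exists() else []
    runs.append({"timestamp": stamp, **summary})
    meta_path.write_text(json.dumps(runs, indent=2))
    return run_dir
```

Only the names returned by fingerprints_missing would need to be generated, which is what makes repeated prediction runs on the same project cheap.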
Install pyinstaller and run the helper script to create an executable under the
Release directory. The script bundles the Template and ToxCast_model
folders so the program can be distributed without the rest of the repository.
Build the executable on each platform you want to support:
pip install pyinstaller
python build_release.py
On macOS, running python build_release.py creates ChemBAI_Predictor (or a
.app bundle depending on your PyInstaller version) inside Release/. Double-click
this file to launch the GUI. Use it to download the template, select your
input file and run predictions locally.
You must build the binary on each target platform (macOS or Windows) because the executables are platform-specific.
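To make the bundling step concrete, here is a rough sketch of the kind of PyInstaller invocation such a helper could assemble. It is not the contents of build_release.py: the choice of run_local_gui.py as the entry script and the exact set of flags are assumptions, and only standard PyInstaller options are used.

```python
import os
import sys


def pyinstaller_args(entry_script: str = "run_local_gui.py") -> list[str]:
    """Build a PyInstaller command line that bundles the two data folders."""
    sep = os.pathsep  # ';' on Windows, ':' on macOS and Linux
    return [
        sys.executable, "-m", "PyInstaller",
        "--onefile",
        "--windowed",                   # no console window for the GUI
        "--name", "ChemBAI_Predictor",
        "--distpath", "Release",        # write the executable under Release/
        "--add-data", f"Template{sep}Template",
        "--add-data", f"ToxCast_model{sep}ToxCast_model",
        entry_script,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(..., check=True) to perform an actual build.
    print(" ".join(pyinstaller_args()))
```

The separator in --add-data depends on the operating system here, which is one concrete reason the build has to be repeated on each target platform.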
Install dependencies with conda using environment.yml or via pip install -r requirements.txt.
The ToxCast_model/run directory contains standalone scripts for each algorithm. They perform cross-validation to select the best model and save it as a joblib file for later prediction.
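As an illustration of that pattern, the sketch below uses scikit-learn's GridSearchCV to pick a decision tree by cross-validated F1 and stores the winner with joblib. The synthetic data, the hyperparameter grid and the file name are placeholders, not the values used by the scripts in ToxCast_model/run.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a fingerprint matrix (X) and binary assay labels (y).
X, y = make_classification(n_samples=200, n_features=32, random_state=0)

# Hypothetical grid; the grids used by the repository are not shown here.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # rank candidates by cross-validated F1
    cv=5,
)
search.fit(X, y)

# Persist the refitted best estimator, then reload it as a prediction run would.
model_path = Path(tempfile.mkdtemp()) / "dt_best.joblib"
joblib.dump(search.best_estimator_, model_path)
model = joblib.load(model_path)
print(search.best_params_, model.predict(X[:5]))
```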
For large VERSION=3 experiments you can orchestrate GPU training directly on Slurm without exceeding the submission quota:
1. Configure your experiment in ToxCast_model/config.py and ensure OBJECTS = ["prediction", "training"], OBJECT = 1, and VERSION = 3.
2. Review slurm/training_config.yaml. The file controls project paths, available seeds, assay/model/fingerprint combinations, resource requests, queue throttling and environment setup (modules and conda).
3. Submit the jobs from the repository root:
python slurm/submit_v3_training.py
Use --dry-run to inspect the planned submissions without calling sbatch.
Each job trains a single assay/model/fingerprint combination sequentially over
all configured seeds, writes the logs under experiments/<PROJECT>/logs, and
saves models under experiments/<PROJECT>/results. Queue length is monitored to
avoid overwhelming the scheduler, and a summary job is scheduled automatically
after the training jobs finish.
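The submission logic described above could look roughly like the following. This is a simplified sketch, not the code of slurm/submit_v3_training.py: the job script name train_job.sh, the function names and the queue limit are hypothetical, and the summary job is left out.

```python
import getpass
import itertools
import subprocess
import time


def plan_jobs(assays, models, fingerprints, script="train_job.sh"):
    """Return one sbatch command per assay/model/fingerprint combination."""
    return [
        ["sbatch", f"--job-name={a}_{m}_{f}", script, a, m, f]
        for a, m, f in itertools.product(assays, models, fingerprints)
    ]


def queued_jobs(user):
    """Count the user's jobs currently listed by squeue (header suppressed)."""
    out = subprocess.run(
        ["squeue", "-h", "-u", user], capture_output=True, text=True, check=True
    ).stdout
    return len(out.splitlines())


def submit(commands, dry_run=True, max_queued=100, poll_seconds=60):
    """Submit each command, waiting whenever the queue is at the limit."""
    user = None if dry_run else getpass.getuser()
    for cmd in commands:
        if dry_run:
            # Show what would be submitted without calling sbatch.
            print("DRY RUN:", " ".join(cmd))
            continue
        while queued_jobs(user) >= max_queued:
            time.sleep(poll_seconds)
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    jobs = plan_jobs(["assay_A", "assay_B"], ["rf", "xgb"], ["MACCS", "Morgan"])
    submit(jobs, dry_run=True)
```

Polling squeue before each submission keeps the number of pending jobs below the limit, at the cost of a slower submission loop on a busy cluster.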
ToxCast_model/prediction/Predict_data.py loads trained models specified in config.py and generates predictions for each assay. The script appends the original SMILES strings and writes an Excel file with the predictions.
This project is licensed under the terms of the MIT License.