comsia.py is a Python implementation of Comparative Molecular Similarity Indices Analysis (CoMSIA), a 3D Quantitative Structure-Activity Relationship (QSAR) method. This tool allows you to analyze molecular fields and predict biological activities based on molecular structures.
If you use Py-CoMSIA in your work, please cite our publication:
Haga, C. L., Le, C. N., Yang, X. D., & Phinney, D. G. (2025). Py-CoMSIA: An Open-Source Implementation of Comparative Molecular Similarity Indices Analysis in Python. Pharmaceuticals, 18(3), 440. https://doi.org/10.3390/ph18030440
https://www.mdpi.com/1424-8247/18/3/440
- CoMSIA Field Calculation: Calculates steric, electrostatic, hydrophobic, hydrogen bond donor, and hydrogen bond acceptor fields.
- PLS Regression: Utilizes Partial Least Squares (PLS) regression for building QSAR models.
- Flexible Input: Supports both CSV files (SMILES and activity data) and pre-aligned SDF files.
- Grid-Based Analysis: Configurable grid resolution and padding for field calculations.
- Field Selection: Allows you to select specific fields for analysis.
- Visualization: (Optional) Visualization of the CoMSIA fields and PLS results.
- Prediction: Predict activities for new compounds based on the trained model.
- Column Filtering: Optionally filters out columns with low variance.
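The grid-based field calculation listed above can be illustrated with a small, dependency-free sketch. It is not Py-CoMSIA's internal code: the function names `build_grid` and `similarity_field`, the two-atom coordinates, and the partial charges are all made up for illustration. The field uses the Gaussian distance weighting usually given for CoMSIA similarity indices, with an attenuation factor of 0.3 and a probe property of +1 as assumed defaults.

```python
import math
from itertools import product

def build_grid(coords, resolution=1.0, padding=3.0):
    """Return grid points covering the atoms' bounding box plus padding."""
    axes = []
    for dim in range(3):
        lo = min(c[dim] for c in coords) - padding
        hi = max(c[dim] for c in coords) + padding
        # Number of points that fit between lo and hi at the given spacing.
        n = int(math.floor((hi - lo) / resolution + 1e-9)) + 1
        axes.append([lo + i * resolution for i in range(n)])
    return list(product(*axes))

def similarity_field(coords, weights, grid, probe=1.0, alpha=0.3):
    """Gaussian-weighted similarity index at every grid point.

    weights holds one per-atom property value (for example a partial
    charge for the electrostatic field).
    """
    field = []
    for q in grid:
        total = 0.0
        for atom, w in zip(coords, weights):
            r2 = sum((a - b) ** 2 for a, b in zip(atom, q))
            total += probe * w * math.exp(-alpha * r2)
        field.append(-total)  # sign convention of the usual CoMSIA formula
    return field

# Hypothetical two-atom "molecule" with made-up partial charges.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
charges = [0.4, -0.4]

grid = build_grid(coords, resolution=1.0, padding=3.0)
field = similarity_field(coords, charges, grid)
print(len(grid), "grid points")
```

Because the Gaussian term stays finite at zero distance, grid points that coincide with atoms need no cutoff, which is why the loop has no special case for them.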
- Clone the repository:
  git clone https://github.com/clhaga/pycomsia
  cd pycomsia
- Install dependencies:
  pip install -r requirements.txt
python comsia.py --train_file <train_file> [options]
--train_file (required): Path to the training data. Can be a CSV file with SMILES and activity data or an SDF file containing pre-aligned molecules and activity data.
--predict_file: Path to the input CSV or SDF file for prediction.
--sdf_activity: Name of the SDF property that holds the activity data. Required when using an SDF file.
--grid_resolution: Resolution of the grid used for field calculation. (default: 1.0)
--grid_padding: Padding of the grid used for field calculation. (default: 3.0)
--fields: Fields to use for analysis. Options: steric, electrostatic, hydrophobic, donor, acceptor, all. (default: all)
--num_components: Number of components for PLS analysis. (default: 12)
--column_filter: Column filtering value used to filter out columns with low variance. (default: 0.0)
--disable_visualization: Disable visualization. (default: False)
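For scripted runs, the options above can be assembled programmatically. The following is a minimal sketch, not part of Py-CoMSIA: `build_comsia_command` is a hypothetical helper, and the file name `training_set.sdf` and property name `pIC50` are placeholders. It only uses options documented above that take a single value; `--fields` and `--disable_visualization` are left out because this README does not specify how their values are passed.

```python
import shlex
import sys

def build_comsia_command(train_file, **options):
    """Assemble an argv list for comsia.py from value-taking options."""
    cmd = [sys.executable, "comsia.py", "--train_file", train_file]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd

cmd = build_comsia_command(
    "training_set.sdf",        # hypothetical file name
    sdf_activity="pIC50",      # hypothetical SDF property name
    grid_resolution=1.0,
    grid_padding=3.0,
    num_components=6,
)
print(shlex.join(cmd))

# To launch the run from the repository directory:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces.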
CSV: One molecule per row, with one column for SMILES strings and one column for the activity data.
SDF: Molecules should be pre-aligned. The SDF file must contain a property field corresponding to the activity data. Use --sdf_activity to specify the property name.
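Before passing a property name to --sdf_activity, it can help to confirm that every record in the SDF file actually carries that property. The sketch below is an illustrative, standard-library check and not part of Py-CoMSIA: the helper name `records_missing_property`, the property name `pIC50`, and the abbreviated sample records are assumptions. It relies only on the SDF conventions that data items start with a `> <Name>` header line and that records end with `$$$$`.

```python
def records_missing_property(sdf_text, prop):
    """Return 0-based indices of SDF records lacking a non-empty `prop` value."""
    missing = []
    index = 0
    found = False
    expecting_value = False
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":  # record terminator
            if not found:
                missing.append(index)
            index += 1
            found = expecting_value = False
            continue
        if expecting_value:
            # The line right after the header holds the value.
            found = bool(line.strip())
            expecting_value = False
        elif line.startswith(">") and f"<{prop}>" in line:
            expecting_value = True
    return missing

# Molblocks are abbreviated; only the data items matter for this check.
sample = """\
mol_1
(molblock omitted)
M  END
> <pIC50>
6.2

$$$$
mol_2
(molblock omitted)
M  END
> <logP>
1.8

$$$$
"""

print(records_missing_property(sample, "pIC50"))
```

An empty result means every record has the property; any listed index points to a molecule that needs attention before training.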
To run the examples from the publication, simply execute the following:
python comsiatest.py
moleculeimager.py creates a PNG file of the molecules in an SDF file, with IUPAC names (if available):
python moleculeimager.py SDF_file_name.sdf