This repository contains molecular structures and descriptors that I (team name: filipsPL) prepared for the Tox24 challenge. The goal of the challenge was to predict the in vitro activity of compounds against transthyretin (TTR) from chemical structure data.
This repository includes:
- The chemical structures in SMILES format, provided by the organizers and curated by me using my RDKit pipeline (`data/smiles_org+fixed.csv`)
- Training set: a diversified set of 1000 compounds, used for training models (`data/train.csv.xz`)
- Validation set: a diversified set of 100 compounds, used for final validation of models (`data/validation.csv.xz`)
- Test set: 500 compounds used to make predictions (`data/test.csv.xz`). It contains a leaderboard set (200 compounds) and a blind set (300 compounds).

  💡 This set contains compounds with known and unknown activity. Compounds with known activity are also members of the training/validation sets.
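
The `.csv.xz` files are ordinary CSV files compressed with xz, so they can be read without unpacking them first. The sketch below uses only the Python standard library and runs on a tiny stand-in file that it creates itself; the helper name `read_xz_csv` and the column names `SMILES` and `activity` are assumptions for illustration, not the actual header of the files in `data/`.

```python
import csv
import lzma
import tempfile
from pathlib import Path


def read_xz_csv(path):
    """Read an xz-compressed CSV file into a list of dicts, one per row."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Self-contained demo on a small stand-in file.
# The column names below are hypothetical, not taken from the repository.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv.xz"
    with lzma.open(demo_path, mode="wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["SMILES", "activity"])
        writer.writerow(["CCO", "12.5"])
        writer.writerow(["c1ccccc1", "40.0"])
    rows = read_xz_csv(demo_path)

print(len(rows), rows[0]["SMILES"])
```

For the real data, the same helper would be called as `read_xz_csv("data/train.csv.xz")`. If pandas is available, `pandas.read_csv("data/train.csv.xz")` should also work, since pandas infers xz compression from the file extension by default.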
The CSV files contain 2D molecular descriptors, including:
- RDKitDescriptors (2D)
- molecular fingerprints:
  - CDK:
    - CDKECFP4
    - CDKEState
    - CDKFCFP4
    - CDKmolprop
    - CDKpubchem
    - CDKstandard
  - Indigo fingerprints:
    - IndigoResonanceSubstructure
    - IndigoSimilarity
  - RDKit fingerprints:
    - RDkitFP-AtomPair
    - RDkitFP-Avalon
    - RDkitFP-FeatMorgan4
    - RDkitFP-Layered
    - RDkitFP-MACCS
    - RDkitFP-Morgan2
    - RDkitFP-Morgan3
    - RDkitFP-Morgan4
    - RDkitFP-Pattern
    - RDkitFP-RDKit
    - RDkitFP-Torsion
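
With this many descriptor blocks in one table, it can help to split the columns by block before modelling, for example to train on one fingerprint type at a time. The sketch below groups column names by a shared prefix. It assumes, purely for illustration, that each column is named `<block>_<index>` (such as `RDkitFP-Morgan2_0`); the actual column naming in the CSV files may differ, and `group_columns_by_block` is a hypothetical helper.

```python
from collections import defaultdict


def group_columns_by_block(columns, separator="_"):
    """Group column names by the text before the first separator."""
    groups = defaultdict(list)
    for name in columns:
        block, sep, _rest = name.partition(separator)
        # A column without the separator forms its own single-column group.
        groups[block if sep else name].append(name)
    return dict(groups)


# Hypothetical header; the real files may name their columns differently.
example_columns = [
    "ID",
    "RDkitFP-Morgan2_0",
    "RDkitFP-Morgan2_1",
    "CDKECFP4_0",
    "RDkitFP-MACCS_0",
]
blocks = group_columns_by_block(example_columns)

for block, names in blocks.items():
    print(block, len(names))
```

If the real header uses a different separator, passing it as `separator` is enough to adapt the grouping.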
Feature importances according to the final CatBoost model
Bar plot showing RMSE of submitted predictions (by me, based on the official results). Congratulations to the winning team Amidoff 🎉!
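
For reference, RMSE (root mean squared error) is the square root of the mean squared difference between observed and predicted values, so lower is better. The sketch below is a minimal standard-library implementation; the numbers are made up for illustration and are not challenge results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, not taken from the challenge data.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
score = rmse(observed, predicted)
print(round(score, 3))
```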