If you are doing ML in this group, or are simply interested in ML topics that align with the boss's research, you may want to check out this folder. It is cool stuff anyway.
For a comprehensive overview of Quantum Machine Learning (QML) in line with Prof. Panida's research interests, I recommend reading this MUST-READ paper by Prof. Kjell at ETHZ.
- Data science
  - statistics
  - data visualization
  - machine learning
  - deep learning
- Chemoinformatics
  - text representation of molecules: SMILES, SELFIES, InChI
  - molecular descriptors
  - molecular fingerprints
  - molecular similarity
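To make the link between fingerprints and molecular similarity concrete, here is a minimal sketch of the Tanimoto (Jaccard) coefficient, a common way to compare two binary fingerprints. The on-bit indices below are made up for illustration and are not real fingerprints; in practice you would generate them with a toolkit such as RDKit. The function name `tanimoto` is also just a name chosen for this example.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit indices (NOT real fingerprints, for illustration only).
fp_ethanol = {1, 5, 9, 12}
fp_propanol = {1, 5, 9, 12, 20}
fp_benzene = {3, 7, 30}

# 4 shared bits out of 5 distinct bits
sim_alcohols = tanimoto(fp_ethanol, fp_propanol)
# no shared bits at all
sim_mixed = tanimoto(fp_ethanol, fp_benzene)

print(f"ethanol vs propanol: {sim_alcohols:.2f}")
print(f"ethanol vs benzene:  {sim_mixed:.2f}")
```

A value of 1 means the two fingerprints have exactly the same bits set, and 0 means they share none.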
- Quantum machine learning
  - representing molecules
    - properties for molecular representation: unique, invariant, computationally efficient, and differentiable
    - text-based: SMILES, SELFIES, InChI
    - molecular graph
    - molecular descriptors
    - molecular fingerprints
    - electronic-structure-based (using 3D information)
  - machine learning
  - deep learning
- representing molecules
- Coding skills (refer to above)
- Basic ML
  - Supervised learning / Unsupervised learning / Reinforcement learning
  - Regression / Classification
  - Overfitting / Underfitting
  - Feature engineering (feature selection, feature extraction): see Molecular representation below
  - Data preprocessing (data cleaning, data normalization, data augmentation)
  - Data visualization (PCA, t-SNE, UMAP)
- Algorithms
  - Linear regression
  - Logistic regression
  - Ensemble models: Random forest, XGBoost
  - Kernel methods: Support vector machine (SVM), Gaussian process, Kernel ridge regression (KRR)
  - Neural networks: MLP, CNN, RNN, LSTM, Transformer
  - Clustering: K-means, DBSCAN
  - Dimensionality reduction: PCA, t-SNE, UMAP
- Model evaluation
  - Cross-validation, including leave-one-out cross-validation
  - Metrics: R², RMSE, MAE, accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, etc.
  - Hyperparameter tuning (model selection)
(P.S. Most of the time, if you ever get to do QML, you will find yourself using the molecular representations below together with ensemble models and KRR.)
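Since KRR comes up so often, here is a minimal NumPy sketch of kernel ridge regression with a Gaussian kernel, plus a small k-fold cross-validation loop for picking hyperparameters. Everything here is an assumption made for illustration: the data are synthetic (a noisy sine curve standing in for "representation to property"), the function names are invented for this example, and the hyperparameter grid is arbitrary. In a real project you would more likely use scikit-learn or the QML package.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, sigma):
    """Predict targets for X_new from the training set and fitted alpha."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha


def cv_mae(X, y, sigma, lam, n_folds=5, seed=0):
    """Mean absolute error averaged over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        alpha = krr_fit(X[train_idx], y[train_idx], sigma, lam)
        pred = krr_predict(X[train_idx], alpha, X[test_idx], sigma)
        errors.append(np.mean(np.abs(pred - y[test_idx])))
    return float(np.mean(errors))


# Synthetic data (an assumption for this sketch, not a chemistry dataset).
rng = np.random.default_rng(42)
X = rng.uniform(-3.0, 3.0, size=(120, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=120)

# Small, arbitrary grid over kernel width (sigma) and regularization (lam).
grid = [(s, l) for s in (0.1, 1.0, 10.0) for l in (1e-6, 1e-3, 1e-1)]
scores = {params: cv_mae(X, y, *params) for params in grid}
best_sigma, best_lam = min(scores, key=scores.get)
best_mae = scores[(best_sigma, best_lam)]

print(f"best sigma={best_sigma}, lambda={best_lam}, CV MAE={best_mae:.3f}")
```

The same loop works unchanged if `X` holds molecular representations (one row per molecule) and `y` holds the property you want to predict.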
The core idea is to translate raw chemical information into a vector of numbers that a computer can process and that can then be used as input for machine learning models.
- Molecular descriptors (e.g., molecular weight, buried volume, Sterimol parameters)
- Structure-based: SMILES, SELFIES, one-hot encoding, 1D/2D fingerprints
- Electronic structure-based (in QML/DScribe):
  - Coulomb matrix (CM)
  - Bag of Bonds (BoB)
  - Smooth Overlap of Atomic Positions (SOAP)
  - Spectrum of London and Axilrod-Teller-Muto potential (SLATM) and its local (atomic) version (aSLATM)
  - FCHL
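As a rough illustration of the simplest representation listed above, here is a minimal NumPy sketch of the Coulomb matrix: diagonal elements are 0.5 Z_i^2.4 and off-diagonal elements are Z_i Z_j / |R_i - R_j|. It assumes nuclear charges and Cartesian coordinates in atomic units (Bohr). The water geometry is approximate and only there for demonstration, and the function names are invented for this example. For real work, use the implementations in QML or DScribe.

```python
import numpy as np


def coulomb_matrix(charges, coords):
    """Coulomb matrix from nuclear charges and Cartesian coordinates (Bohr)."""
    Z = np.asarray(charges, dtype=float)
    R = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid dividing by zero
    M = np.outer(Z, Z) / dist
    np.fill_diagonal(M, 0.5 * Z**2.4)  # overwrite the diagonal with 0.5 * Z^2.4
    return M


def cm_eigenspectrum(charges, coords):
    """Eigenvalues of the Coulomb matrix, sorted in descending order.

    The eigenvalues do not depend on the order in which atoms are listed.
    """
    eigvals = np.linalg.eigvalsh(coulomb_matrix(charges, coords))
    return np.sort(eigvals)[::-1]


# Approximate water geometry in Bohr (illustrative values only).
Z = np.array([8, 1, 1])
R = np.array([
    [0.00, 0.00, 0.00],   # O
    [1.43, 1.11, 0.00],   # H
    [-1.43, 1.11, 0.00],  # H
])

M = coulomb_matrix(Z, R)
spectrum = cm_eigenspectrum(Z, R)

# Listing the atoms in a different order changes M but not its eigenvalues.
perm = [2, 0, 1]
spectrum_permuted = cm_eigenspectrum(Z[perm], R[perm])

print(M)
print(spectrum)
print(np.allclose(spectrum, spectrum_permuted))
```

This ties back to the "unique, invariant" wish list for representations: the raw matrix depends on atom ordering, while the sorted eigenvalues are invariant to permutation, translation, and rotation, at the cost of losing some information.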
- Molecular Representation Learning
To read more about molecular representations:
- Molecular representations for machine learning applications in chemistry
- A review of molecular representation in the age of machine learning
- Physics-Inspired Structural Representations for Molecules and Materials
- Quantum machine learning using atom-in-molecule-based fragments selected on the fly (SLATM)
- SPAHM: the spectrum of approximated Hamiltonian matrices representations
You can reduce the dimensionality of your molecular representations (above) to 2D or 3D for visualizing your chemical space using various algorithms. Some common methods include:
- Principal Component Analysis (PCA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Uniform Manifold Approximation and Projection (UMAP)
2D t-SNE map depicting the chemical diversity of IFLP catalysts [taken from 10.26434/chemrxiv-2023-09md (will probably change later)]
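As a minimal sketch of the first method in the list above, the snippet below projects a feature matrix onto its first two principal components using only NumPy. The data are random numbers standing in for molecular representations (200 hypothetical "molecules" with 50 features each, drawn as two clusters), and `pca_project` is a name invented for this example. For t-SNE and UMAP you would normally rely on scikit-learn and the UMAP package instead.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components.

    Returns the projected coordinates and the explained variance ratio
    of each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)
    return coords, explained[:n_components]


# Hypothetical data: two clusters of 100 "molecules" with 50 features each.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 1.0, size=(100, 50))
cluster_b = rng.normal(3.0, 1.0, size=(100, 50))
X = np.vstack([cluster_a, cluster_b])

coords_2d, explained_ratio = pca_project(X, n_components=2)

print(coords_2d.shape)
print(explained_ratio)
```

Plotting `coords_2d[:, 0]` against `coords_2d[:, 1]`, colored by a property of interest, gives a simple 2D map of the chemical space.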
Check out these works:
- Accelerated dinuclear palladium catalyst identification through unsupervised machine learning
- OSCAR: an extensive repository of chemically and functionally diverse organocatalysts
- Exploring Chemical Reaction Space with Reaction Difference Fingerprints and Parametric t-SNE
- Check out Coursera (particularly Andrew Ng), YouTube, and so on yourselves.
- A very comprehensive ML resource (no need to learn it all, just pick what you need for your project) here
- By P'Rangsiman (or rather Dr. Rangsiman now), in Thai and also very detailed, here
- Deep learning here
LCMD at EPFL has a good QML tutorial for beginners here
You can also learn from the QML tutorial, which covers both practice (code) and theory, here
- Learning the Exciton Properties of Azo-dyes
- Data-Driven Advancement of Homogeneous Nickel Catalyst Activity for Aryl Ether Cleavage
- Selected machine learning of HOMO–LUMO gaps with improved data-efficiency
- Reaction-based machine learning representations for predicting the enantioselectivity of organocatalysts
- Electronic spectra from TDDFT and machine learning in chemical space
- SPAHM: the spectrum of approximated Hamiltonian matrices representations
- Chemoinformatics tutorial
- AI4Chem, a course by Prof. Philippe Schwaller, in which you can find cool AI stuff in chemistry.
- RDKit: for chemoinformatics and machine learning on molecules, a powerhouse
- OpenBabel: for dealing with chemical formats
- morfeus: molecular features from 3D structures with a focus on steric descriptors.
- kallisto: molecular featurizer and modeller
- QML: toolkit for representation learning of properties of molecules and solids.
- QStack: Stack of codes for dedicated pre- and post-processing tasks for Quantum Machine Learning (QML)
- DScribe: library for various descriptors for machine learning of materials and molecules
- UMAP
- Chemiscope: interactive structure-property relationship explorer
- DeepChem: for machine learning and deep learning on molecular and quantum datasets.
- torchdrug: machine learning platform designed for drug discovery
- DGL-LifeSci: for applying graph neural networks to various tasks in chemistry and biology
Benchmark datasets
- QM9/QM7/QM7b
- GDB
- ChEMBL
- PubChem
- ZINC