
Aims | Panel Topics | Course Schedule | Internal Links | External Links |

Welcome to Machine Learning in the Molecular Sciences



Aims

The NYU-ECNU Center for Computational Chemistry at New York University Shanghai (a.k.a. NYU Shanghai) announced a summer school dedicated to machine learning and its applications in the molecular sciences, held in June 2017 at the NYU Shanghai Pudong Campus. Through a combination of technical lectures and hands-on exercises, the school aimed both to instruct attendees in the fundamentals of modern machine-learning techniques and to demonstrate how these approaches can be applied to solve complex computational problems in chemistry, biology, and materials science. To promote the idea of free and open code, this project was built to help you understand the basic machine-learning models mentioned below.

Panel-Topics

Fundamental topics covered include basic machine-learning models such as kernel methods and neural networks, optimization schemes, parameter-learning and delta-learning paradigms, clustering, and decision trees. Application areas feature machine-learning models for representing and predicting properties of individual molecules and condensed phases, learning algorithms for bypassing explicit quantum-chemical and statistical-mechanical calculations, and techniques applicable to biomolecular structure prediction, bioinformatics, protein-ligand binding, and materials and molecular design, among others.
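As a small, self-contained taste of the kernel methods mentioned above, here is a minimal sketch (not part of the official course materials) of kernel ridge regression fit to a toy one-dimensional dataset. It assumes the scikit-learn and NumPy Python libraries, both introduced under Open Software below; the data and hyperparameter values are illustrative only.

    # Kernel ridge regression on a toy dataset: a minimal sketch, assuming
    # scikit-learn and NumPy are installed. All values are illustrative.
    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(200, 1))            # toy one-dimensional "descriptors"
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)   # noisy "property" to learn

    # Gaussian (RBF) kernel; alpha is the regularization strength
    model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-2)
    model.fit(X[:150], y[:150])                          # train on the first 150 points

    mae = np.mean(np.abs(model.predict(X[150:]) - y[150:]))
    print(f"held-out MAE: {mae:.3f}")                    # error on the remaining 50 points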

Course-Schedule

  • Monday, June 12

    8:45 - 9:00: Welcome and Introduction

    9:00 - 10:00: Introduction to Machine Learning (presented by Matthias Rupp)

    10:00 - 10:20: Coffee Break

    10:20 - 11:20: Kernel-based Regression (presented by Matthias Rupp)

    11:20 - 12:30: Dimensional Reduction, Feature Selection, and Clustering techniques (presented by Alex Rodriguez)

    12:30 - 14:00: Lunch Break

    14:00 - 15:00: Introduction to Neural Networks (presented by Mark Tuckerman)

    15:00 - 15:30: Coffee Break

    15:30 - 17:30: Practical Session: Clustering with Feature Selection and Validation (presented by Alex Rodriguez)

  • Tuesday, June 13

    9:00 - 10:00: Random Forests (presented by Yingkai Zhang)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Learning Curves, Representations, and Training Sets I (presented by Anatole von Lilienfeld)

    11:30 - 12:30: Learning Curves, Representations, and Training Sets II (presented by Anatole von Lilienfeld)

    12:30 - 14:00: Lunch Break

    14:00 - 15:00: Review of Electronic Structure, Atomic, Molecular, and Crystal Representations (presented by Mark Tuckerman)

    15:00 - 15:30: Coffee Break

    15:30 - 17:30: Practical Session: Learning Curves (presented by Anatole von Lilienfeld)

  • Wednesday, June 14

    9:00 - 10:00: Predicting Properties of Molecules and Materials (presented by Matthias Rupp)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Parameter Learning and Delta Learning (presented by Anatole von Lilienfeld)

    11:30 - 12:30: Learning Electronic Densities (presented by Mark Tuckerman); ML Models of Crystal Properties (presented by Anatole von Lilienfeld)

    12:30 - 14:00: Lunch Break

    14:00 - 15:30: Practical Session: Machine Learning and Property Prediction I (presented by Matthias Rupp)

    15:30 - 16:00: Coffee Break

    16:00 - 17:30: Practical Session: Machine Learning and Property Prediction II (presented by Matthias Rupp)

  • Thursday, June 15

    9:00 - 10:00: Machine Learning of Potential Energy Surfaces (presented by Ming Chen, California Institute of Technology)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Machine Learning Based Enhanced Sampling (Ming Chen)

    11:30 - 12:30: Machine Learning of Free Energy Surfaces (presented by Mark Tuckerman)

    12:30 - 14:00: Lunch Break

    14:00 - 15:00: Cluster-based Analysis of Molecular Simulations (presented by Alex Rodriguez)

    15:00 - 15:30: Coffee Break

    15:30 - 17:30: Practical Session: Neural Network Learning of Free Energy Surface (presented by Mark Tuckerman)

  • Friday, June 16

    9:00 - 10:00: Development of Protein-ligand Scoring Functions (presented by Yingkai Zhang)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Machine Learning in Structural Biology I (presented by Yang Zhang)

    11:30 - 12:30: Machine Learning in Structural Biology II (presented by Yang Zhang)

    12:30 - 14:00: Lunch Break

    14:00 - 15:30: Practical Session: Random Forests and Scoring Functions (presented by Yingkai Zhang)

    15:30 - 16:00: Coffee Break

    16:00 - 17:30: Practical Session: Machine Learning for Structural Bioinformatics (presented by Yang Zhang)

Codes

  • Tuesday-June-13

    For Practical Session: Learning Curves, please run these commands in JupyterLab via Huawei Cloud:

    !pip install qml                                    # install the QML package
    !git clone https://github.com/qmlcode/tutorial.git  # fetch the tutorial exercises
    %cd tutorial
    %ls
    %load exercise_2_1.py   # %load pulls the exercise source into the cell for inspection
    %run exercise_2_1.py    # %run then executes it
    %load exercise_2_2.py
    %run exercise_2_2.py
    %load exercise_2_3.py
    %run exercise_2_3.py
    %load exercise_2_4.py
    %run exercise_2_4.py
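    For orientation, the sketch below reproduces the idea behind the learning-curve exercises without depending on the qml package: it measures the test error of a kernel ridge regression model as the training set grows, on synthetic data. It assumes scikit-learn and NumPy; it is not the official exercise solution.

        # Learning curve for kernel ridge regression on synthetic data:
        # a minimal sketch, not the official exercise solution.
        import numpy as np
        from sklearn.kernel_ridge import KernelRidge

        rng = np.random.default_rng(1)
        X = rng.uniform(-3.0, 3.0, size=(2200, 1))
        y = np.sin(X).ravel() + 0.05 * rng.normal(size=2200)
        X_train, y_train = X[:2000], y[:2000]      # pool from which training sets are drawn
        X_test, y_test = X[2000:], y[2000:]        # fixed held-out test set

        for n in (25, 50, 100, 200, 400, 800, 1600):
            model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-3)
            model.fit(X_train[:n], y_train[:n])
            mae = np.mean(np.abs(model.predict(X_test) - y_test))
            print(f"N = {n:4d}   MAE = {mae:.4f}")
        # Plotted on log-log axes, MAE versus N should fall roughly on a
        # straight line, the behaviour discussed in the learning-curve lectures.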
  • Wednesday-June-14

    For Practical Session: Machine Learning and Property Prediction, please run these commands on Wolfram Cloud:

    (* Please adjust the following path to where you unpacked the reference implementation code from the supplementary material. *)
    
    AppendTo[$Path, FileNameJoin[{"Path", "to", "library"}]]; (* parent directory containing the QMMLPack directory *)
    
    Needs["QMMLPack`"]; (* load the package; the context name is assumed here, adjust it if your QMMLPack version differs *)
    
    
  • Thursday-June-15

    For Practical Session: Machine Learning of Free Energy Surfaces, please run these commands on a Linux system (to compile the code, a C++ compiler and the MKL library are needed):

    1. Unpack the tar file:

        tar -xzvf Neural_network_practical_software.tar.gz
    

    2. Change directory into the one created by unpacking and compile the source code. First, edit 'Makefile' and change the C and C++ compilers to the ones available on your system, e.g., 'gcc' and 'g++', or 'icc' if necessary. Then compile the code by typing

        make
    

    3. Create a training data set from the full dataset. Either of the following two commands can be used:

        head -n n ala-dip-data_all.txt > ala-dip-data.txt
        tail -n n ala-dip-data_all.txt > ala-dip-data.txt
    

    Here n is the number of training points you wish to extract from the full dataset; for example, 'head -n 5000 ala-dip-data_all.txt > ala-dip-data.txt' selects the first 5000 points.

    4. Edit the 2nd, 3rd, 4th, and 5th lines of the file "INPUT.txt" to change, respectively, the calculation type, the number of conjugate gradient steps, the checkpointing frequency of the weights, and the number of conjugate gradient line-minimization steps. For the calculation type, '1' means calculating the neural-network parameters from scratch, '-1' means starting from an old parameter set contained in the file "weight.txt", and '0' means a validation calculation of the neural network. A hypothetical layout is sketched below.
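    The following sketch of "INPUT.txt" is purely illustrative: the numeric values are made up, line 1 is left unspecified, and the trailing annotations are explanations, not part of the file. Consult the copy shipped in the tarball for the authoritative format.

        line 1:  (as shipped in the tarball; leave unchanged)
        line 2:  1      calculation type (1 = from scratch, -1 = restart from "weight.txt", 0 = validation)
        line 3:  1000   number of conjugate gradient steps
        line 4:  100    write the current weights to disk every 100 steps
        line 5:  20     line-minimization steps per conjugate gradient step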

Deployment

Machine Learning in Molecular Sciences

Internal-Links

  • Annual Conference on Neural Information Processing Systems (NIPS)

  • International Conference on Machine Learning (ICML)

  • Conference on Learning Theory (COLT)

External-Links

One of the exciting aspects of machine-learning (ML) techniques is their potential to democratize molecular and materials modelling through relatively inexpensive computations and a low barrier to entry, much as Pople's Gaussian software made quantum-chemistry calculations widely approachable.

The success of machine-learning technology relies on three contributing factors: open data, open software and open education.

Open data:

Publicly accessible structure and property databases for molecules and solid materials.
Computed structures and properties:

AFLOWLIB (Structure and property repository from high-throughput ab initio calculations of inorganic materials)

Computational Materials Repository (Infrastructure to enable collection, storage, retrieval and analysis of data from electronic-structure codes)

GDB (Databases of hypothetical small organic molecules)

Harvard Clean Energy Project (Computed properties of candidate organic solar absorber materials)

Materials Project (Computed properties of known and hypothetical materials carried out using a standard calculation scheme)

NOMAD (Input and output files from calculations using a wide variety of electronic-structure codes)

Open Quantum Materials Database (Computed properties of mostly hypothetical structures carried out using a standard calculation scheme)

NREL Materials Database (Computed properties of materials for renewable-energy applications)

TEDesignLab (Experimental and computed properties to aid the design of new thermoelectric materials)

ZINC (Commercially available organic molecules in 2D and 3D formats)

Experimental structures and properties:

ChEMBL (Bioactive molecules with drug-like properties)

ChemSpider (Royal Society of Chemistry’s structure database, featuring calculated and experimental properties from a range of sources)

Citrination (Computed and experimental properties of materials)

Crystallography Open Database (Structures of organic, inorganic, metal–organic compounds and minerals)

CSD (Repository for small-molecule organic and metal–organic crystal structures)

ICSD (Inorganic Crystal Structure Database)

MatNavi (Multiple databases targeting properties such as superconductivity and thermal conductance)

MatWeb (Datasheets for various engineering materials, including thermoplastics, semiconductors and fibres)

NIST Chemistry WebBook (High-accuracy gas-phase thermochemistry and spectroscopic data)

NIST Materials Data Repository (Repository to upload materials data associated with specific publications)

PubChem (Biological activities of small molecules)

Open Software:

Publicly accessible learning resources and tools related to machine learning
General-purpose machine-learning frameworks:

Caret (Package for machine learning in R)

Deeplearning4j (Distributed deep learning for Java)

H2O.ai (Machine-learning platform written in Java that can be imported as a Python or R library)

Keras (High-level neural-network API written in Python)

Mlpack (Scalable machine-learning library written in C++)

Scikit-learn (Machine-learning and data-mining member of the scikit family of toolboxes built around the SciPy Python library)

Weka (Collection of machine-learning algorithms and tasks written in Java)

Machine-learning tools for molecules and materials:

Amp (Package to facilitate machine learning for atomistic calculations)

ANI (Neural-network potentials for organic molecules with Python interface)

COMBO (Python library with emphasis on scalability and efficiency)

DeepChem (Python library for deep learning of chemical systems)

GAP (Gaussian approximation potentials)

MatMiner (Python library for assisting machine learning in materials science)

NOMAD (Collection of tools to explore correlations in materials datasets)

PROPhet (Code to integrate machine-learning techniques with quantum-chemistry approaches)

TensorMol (Neural-network chemistry package)

Open education:

  • fast.ai is a course that is “making neural nets uncool again” by making them accessible to a wider community of researchers. One of the advantages of this course is that users start to build working machine-learning models almost immediately. However, it is not for absolute beginners, requiring a working knowledge of computer programming and high-school-level mathematics.

  • DataCamp offers an excellent introduction to coding for data-driven science and covers many practical analysis tools relevant to chemical datasets. This course features interactive environments for developing and testing code and is suitable for non-coders because it teaches Python at the same time as machine learning.

  • Academic MOOCs are useful courses for those wishing to get more involved with the theory and principles of artificial intelligence and machine learning, as well as the practice. The Stanford MOOC is popular, with excellent alternatives available from sources such as edX (see, for example, ‘Learning from data (introductory machine learning)’) and Udemy (search for ‘Machine learning A–Z’). The underlying mathematics is the topic of a course from Imperial College London on Coursera.

  • Many machine-learning professionals run informative blogs and podcasts that deal with specific aspects of machine-learning practice. These are useful resources for general interest as well as for broadening and deepening knowledge. There are too many to provide an exhaustive list here, but we recommend machinelearningmastery and lineardigression as a starting point.

About | Committee | Speakers | Schedule | Location | Sponsor |