Examples of using the Novel Materials Discovery (NOMAD) database, especially downloading all chemical formulas.
Clone or download the repository. To clone:
git clone https://github.com/sparks-baird/nomad-examples.git
cd nomad-examples
Install the dependencies, e.g. via:
pip install -r requirements.txt
Use all_formula_basic_metadata.py to download the data from NOMAD and to do some basic processing. This might take somewhere around an hour.
python all_formula_basic_metadata.py
Use remove_duplicate_compositions.py to process the chemical formulas down to a list of unique chemical compositions (represented as reduced formulas). This also might take around an hour.
python remove_duplicate_compositions.py
The data is available via figshare (DOI: 10.6084/m9.figshare.19319783.v3) and was downloaded on 2022-03-07. There are four files available: all-formula.csv, unique-formula.csv, unique-reduced-formula.csv, and bad-formula.csv. There are 11680557, 764431, 695612, and 15 rows in these files, respectively. Descriptions are given below.
all-formula.csv contains two columns: calc_id (Calculation ID) and formula (Chemical Formula). The entries were restricted to VASP DFT calculations and exclude noble gases and radioactive elements. Some calculation IDs have missing chemical formulas.
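Because some calculation IDs have no formula, it can help to separate those rows before any further processing. The following is a minimal sketch using only the standard library. The sample rows are hypothetical, and it assumes that a missing formula appears as an empty field and that the header row uses the column names calc_id and formula.

```python
import csv
import io

# Hypothetical sample standing in for all-formula.csv.
# Assumption: a missing formula shows up as an empty field.
SAMPLE = """calc_id,formula
id-001,Si2O4
id-002,
id-003,NaCl
"""


def split_missing(rows):
    """Return (calc_ids with a formula, calc_ids without one)."""
    present, missing = [], []
    for row in rows:
        formula = (row.get("formula") or "").strip()
        (present if formula else missing).append(row["calc_id"])
    return present, missing


reader = csv.DictReader(io.StringIO(SAMPLE))
present, missing = split_missing(reader)
print(len(present), "with formula;", len(missing), "missing")
```

For the real file, replace io.StringIO(SAMPLE) with an open file handle, e.g. open("all-formula.csv", newline="").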
The list has also been filtered down to unique (non-reduced) chemical formulas in unique-formula.csv, along with the calc_id for each unique formula. No structural information is included directly in this data.
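As an illustration of what filtering down to unique formulas involves, here is a small sketch that keeps one calc_id per distinct formula string. Keeping the first calc_id encountered is an assumption made for this example; the repository scripts may select a different one. The function name unique_formulas and the sample rows are hypothetical.

```python
def unique_formulas(rows):
    """Keep the first (calc_id, formula) pair seen for each distinct formula string."""
    first_seen = {}
    for calc_id, formula in rows:
        # setdefault only stores the calc_id the first time a formula appears.
        first_seen.setdefault(formula, calc_id)
    return [(calc_id, formula) for formula, calc_id in first_seen.items()]


# Hypothetical rows: Si2O4 appears twice, and SiO2 is a different (non-reduced) string.
rows = [("id-001", "Si2O4"), ("id-002", "SiO2"), ("id-003", "Si2O4")]
unique = unique_formulas(rows)
print(unique)
```

Note that Si2O4 and SiO2 remain separate entries at this stage, since the comparison is on the non-reduced formula string.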
What you're probably most interested in is unique-reduced-formula.csv, because it is the most curated and is directly usable with e.g. pymatgen. It contains three columns: calc_id, reduced_formula, and factor, which correspond to the Calculation ID, the reduced formula (e.g. Si2O4 --> SiO2), and the factor (e.g. for Si2O4 --> SiO2 the factor is 2). The formulas were first parsed via the pymatgen.core.Composition class.
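To make the relationship between a formula, its reduced formula, and the factor concrete, here is a simplified standard-library sketch. It is not the pymatgen implementation used to build the file: it only handles element symbols with integer amounts, keeps elements in input order, and pymatgen applies its own conventions that may give different strings in some cases. The function name reduce_formula is hypothetical.

```python
import re
from functools import reduce
from math import gcd

# Simplified patterns: an element symbol followed by an optional positive integer.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
VALID = re.compile(r"(?:[A-Z][a-z]?(?:[1-9]\d*)?)+")


def reduce_formula(formula):
    """Return (reduced_formula, factor) for a simple integer formula such as 'Si2O4'."""
    if not VALID.fullmatch(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    counts = {}
    for element, amount in TOKEN.findall(formula):
        counts[element] = counts.get(element, 0) + int(amount or 1)
    # The factor is the greatest common divisor of all element amounts.
    factor = reduce(gcd, counts.values())
    reduced = "".join(
        element + (str(n // factor) if n // factor > 1 else "")
        for element, n in counts.items()
    )
    return reduced, factor


print(reduce_formula("Si2O4"))  # ('SiO2', 2)
```

Multiplying each amount in the reduced formula by the factor recovers the original amounts, which is why the factor column is kept alongside reduced_formula.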
Finally, bad-formula.csv contains the 15 formulas that were skipped during processing (i.e. those that could not be parsed with pymatgen.core.Composition, for various reasons).
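The pattern of setting aside formulas that fail to parse can be sketched as follows. The parser here, strict_parse, is a hypothetical stand-in for pymatgen.core.Composition, and catching ValueError is an assumption for this example; in practice, catch whichever exception your parser raises.

```python
import re

# Hypothetical stand-in for a real formula parser.
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def strict_parse(formula):
    """Accept only element symbols followed by optional integer amounts."""
    if not FORMULA.fullmatch(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    return formula


def partition(formulas, parse):
    """Split formulas into (parsed results, inputs that failed to parse)."""
    good, bad = [], []
    for formula in formulas:
        try:
            good.append(parse(formula))
        except ValueError:
            # Failed inputs are recorded rather than silently dropped.
            bad.append(formula)
    return good, bad


good, bad = partition(["SiO2", "Si2O4", "??"], strict_parse)
print(len(good), "parsed;", len(bad), "skipped")
```

Recording the skipped inputs in their own list mirrors the idea behind keeping a separate bad-formula.csv file.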
Downloading all of the crystal structures and reducing this to a list of unique phases each with a CIF file.
See something missing? Please don't hesitate to drop me a note in issues.