Merge pull request #5 from Olivierj23/version2
Update to Version 2 of the dataset
jonwzheng committed May 20, 2024
2 parents 6a8c356 + db05381 commit 94c871c
Showing 13 changed files with 25,046 additions and 21,832 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -156,4 +156,5 @@ fabric.properties
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser

sjc/
sjc/
.DS_Store
8 changes: 8 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 8 additions & 0 deletions .idea/Dissociation-Constants.iml

13 changes: 13 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml

6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml

8 changes: 8 additions & 0 deletions .idea/modules.xml

6 changes: 6 additions & 0 deletions .idea/vcs.xml

Binary file modified IUPAC_pK_DataDigitizationReport.pdf
Binary file not shown.
53 changes: 32 additions & 21 deletions README.md
@@ -4,7 +4,7 @@

**Important note: This is the first release of this dataset. The detailed process of digitization and curation of this dataset can be found in the *IUPAC pKa Data Digitization Report* attached in this repository. Our validation process is ongoing and will continue. Please be advised that a few errors and inconsistencies may still exist.**

### Dataset version: v1.0
### Dataset version: v2.0

## Description

@@ -13,7 +13,7 @@ This repository includes "high-confidence" pKa data digitized from three referen
- **Perrin**: International Union of Pure and Applied Chemistry, DD Perrin. *Dissociation Constants of Organic Bases in Aqueous Solution*; Butterworths, 1965
- **Perrin Supplement**: International Union of Pure and Applied Chemistry, DD Perrin. *Dissociation Constants of Organic Bases in Aqueous Solution, Supplement*; Butterworths, 1972

With permission from the copyright holder, the International Union of Pure and Applied Chemistry (IUPAC), the reference books were scanned, converted to digital data, checked for accuracy, and curated for accessibility, interoperability, and reusability between the dates of Friday, Sept. 10, 2021 and Thursday, Sept. 15, 2022 by Jonathan Zheng (jonzheng@mit.edu).
With permission from the copyright holder, the International Union of Pure and Applied Chemistry (IUPAC), the reference books were scanned, converted to digital data, checked for accuracy, and curated for accessibility, interoperability, and reusability between the dates of Friday, Sept. 10, 2021 and Thursday, Sept. 15, 2022 by Jonathan Zheng (jonzheng@mit.edu). Further processing and curation were performed between Jan. 8, 2024 and Feb. 22, 2024 by Jonathan Zheng and Olivier Lafontant-Joseph (olivj23@mit.edu).

## Contributors:
Jonathan Zheng
@@ -26,21 +26,30 @@ Massachusetts Institute of Technology

jonzheng@mit.edu


Olivier Lafontant-Joseph

Green Group, Department of Chemical Engineering

Massachusetts Institute of Technology

olivj23@mit.edu

## License
Copyright © 2022 International Union of Pure and Applied Chemistry.
Copyright © 2024 International Union of Pure and Applied Chemistry.

This material is available for reuse under a [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license with the following attribution: Reproduced by permission of International Union of Pure and Applied Chemistry.

## Recommended Citation
Zheng, Jonathan W. (2022) IUPAC Digitized pKa Dataset, v1.0. Copyright © 2022 International Union of Pure and Applied Chemistry (IUPAC). The dataset is reproduced by permission of IUPAC and is licensed under a [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license. Access at https://doi.org/10.5281/zenodo.7236453.
Zheng, Jonathan W. and Lafontant-Joseph, Olivier. (2024) IUPAC Digitized pKa Dataset, v2.0. Copyright © 2024 International Union of Pure and Applied Chemistry (IUPAC). The dataset is reproduced by permission of IUPAC and is licensed under a [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license. Access at https://doi.org/10.5281/zenodo.7236453.

This GitHub repository serves as a working copy for the dataset. Please refer to the Zenodo DOI badge linked at the top of this README.


## Data & File Overview

**File List**
* `iupac_high-confidence_v1_0.csv` : pKa dataset.
* `iupac_high-confidence_v2_0.csv` : pKa dataset.
* `IUPAC_pK_DataDigitizationReport.pdf` : Report describing methods of creating this dataset.
* `reference_code_translation.csv`: Spreadsheet containing reference codes.
* `method_translation.csv`: Spreadsheet containing method codes.
@@ -51,19 +60,18 @@ This GitHub repository serves as a working copy for the dataset. Please refer to
* `jupyter_notebooks/data_sample/data.csv`: Result of inputting the image scan pdf into Amazon Textract; used for Jupyter notebook demos.
* `jupyter_notebooks/names/sample_names_OUT.csv`: Sample of IUPAC name translations following typical workflow; used for Jupyter notebook demos.

Further data were collected that are not included in this dataset. "Low-confidence" data (i.e. only one source translating IUPAC name to SMILES; or different name translators gave different results) were excluded, as well as entries whose names could not be programmatically translated. Raw scans of the reference books are also excluded from this dataset.
Further data were collected that are not included in this dataset. "Low-confidence" data (i.e. only one source translating IUPAC name to SMILES; or different name translators gave different results) were excluded, as well as entries whose names could not be programmatically translated, and dissociation types that could not be unambiguously assigned (i.e. pKaH versus pKa). Raw scans of the reference books are also excluded from this dataset.

This is the first release of the dataset. Please email jonzheng@mit.edu if any errors are discovered.
Please email jonzheng@mit.edu if any errors are discovered.

## Methodological Information

This dataset was obtained by scanning the reference books listed above, and then using OCR (Amazon Textract) to convert the scanned images into text. After some light processing of the intermediate data files, scripts were written to organize the text into a tabular format that resembles the layout of the reference books, to help members of the Green Group at MIT manually check the digital data for fidelity to the original reference data. The digital tables were then processed again to extract the information into a form convenient for computational use, and to attempt to translate the IUPAC names to SMILES strings. This final format was then checked manually and programmatically. *More information can be found in the report attached in the repository.*

Proprietary software Amazon Textract was used to perform the OCR, and Chemaxon Molconvert was used to aid in translating IUPAC names to structures.

Open-source software used to aid in the OCR process: `unpaper`, `OCRmyPDF`, `camelot`, `tabula`. IUPAC to SMILES translation was aided by `OPSIN`, PubChem, and the Chemical Identifier Resolver.

Sample Jupyter notebooks showing the data extraction process are included in this repository to demonstrate the processing methods. Packages `rdkit` and `pandas` are required to run the demos at minimal capacity. For extended usability, install `pubchempy` and `cirpy` as well.
Sample Jupyter notebooks showing the data extraction process are included in this repository to demonstrate the processing methods. Packages `rdkit` and `pandas` are required to run the demos at minimal capacity. For extended usability, install `pubchempy`, `py2opsin`, and `cirpy` as well.

**To run the Jupyter notebooks:** Install the required prerequisite packages. Run the Jupyter notebooks in order from 1 to 3. You should not need to alter any of the scripts. If running the minimal example without `pubchempy` and `cirpy`, either comment out or skip the cells that involve those packages.
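
The package requirements above can be checked before opening the notebooks. The following is a minimal sketch and is not part of the repository: the helper name `missing_packages` and the two tuples are hypothetical, and it assumes that each package's import name matches the name given in this README.

```python
import importlib.util

# Package names taken from this README; treating them as import names is an assumption.
REQUIRED = ("rdkit", "pandas")
OPTIONAL = ("pubchempy", "py2opsin", "cirpy")


def missing_packages(names):
    """Return the names in `names` that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_packages(REQUIRED)` should return an empty list before running the demos. Any names returned by `missing_packages(OPTIONAL)` indicate which notebook cells may need to be skipped or commented out.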

@@ -78,14 +86,16 @@ Before publication, several programmed checks were performed on the dataset.
* Checking range and distribution of pKa values.
* Checking for common typos in Remarks, pKa types, temperatures, and chemical names.
* Manual review of all entries that failed to produce a SMILES (in case the translation failure was caused by a name typo).
* Standardizing formats (e.g. formatting "pk" as "pK", "pKa" as "pK1", standardizing "V. Uncert." versus "Very Uncert.", etc.).
* Manual review of amphoteric molecules to check the validity of the dissociation type (example: pKaH vs. pKa vs. pKb).
* Standardizing formats (e.g. formatting "pk" as "pKa", "pKa" as "pKa1", standardizing "V. Uncert." versus "Very Uncert.", etc.).

## Data-specific information

### **Columns**:
* `entry_#`: Order of the chemical species in the original reference work.
* `unique_ID#`: A unique code for each distinct molecule in the dataset, composed of the reference work code plus the entry number of the chemical species in the original reference work.
* `SMILES`: SMILES string translated from the IUPAC names provided in the original reference work.
* `pka_type`: Type of dissociation constant. Examples: pKAH1 = conjugate acid's first dissociation; pKb = basic dissociation constant; The vast majority of entries will be of the form pKAH, pKA, or pKB, but there are some exceptions in the reference works that include parentheses to identify an unusual protonation site or structure, e.g. pK(indole-ring) is pK for protonation on indole ring; pK(cis) to refer to protonation for the cis structure. (This convention may be changed in a later version).
* `InChI`: InChI string derived from the SMILES strings
* `pka_type`: Type of dissociation constant. Examples: pKaH1 = conjugate acid's first dissociation; pKb = basic dissociation constant. The vast majority of entries are of the form pKaH, pKa, or pKb, but there are some exceptions in the reference works that include parentheses to identify an unusual protonation site or structure, e.g. pK(indole-ring) is the pK for protonation on the indole ring. (This convention may be changed in a later version.) Many amphoteric species are also present in this dataset, for which several pK values are reported. The original reference works often do not distinguish which values are acidic and which are basic. This distinction can be made automatically if only two pK values are reported, in which case the lower value is assumed to be basic and the higher to be acidic. For all potentially amphoteric entries with more than two pK values, we manually examined the chemical structures to determine the labels. Out of the original data corpus, 1,275 entries had ambiguous labels that we could not manually assign, so we excluded them from the high-confidence dataset. Future work may include resolving the labels for that missing set.
* `pka_value`: the pK value
* `T`: temperature (deg. C)
* `remarks`: Comments for this specific datapoint.
@@ -99,19 +109,20 @@ Before publication, several programmed checks were performed on the dataset.
* `num_name_contributors`: The number of methods that yielded a successful SMILES translation.
* `original_IUPAC_nicknames`: Secondary names identified from the IUPAC identifier for the chemical species originally presented in the reference works
* `source`: Name of the reference book: perrin, perrin_supp, or serjeant.
* `unique_ID`: Unique identifier combining the `source` and the `entry_#` to create a code that matches to the chemical species.
* `pressure`: Pressure (a handful of entries were reported at very high pressures, which might yield unexpected results if not filtered, so this column is added to help filter them out).
* `acidity_label`: Descriptor indicating whether the pK is an acidic (A), conjugate acid (AH), or basic (B) dissociation constant.
* `acidity_label`: Descriptor indicating whether the pK is an acidic (A), conjugate acid (AH), basic (B), or "other" dissociation constant.
* `original_T`: Displays the original temperature if it was corrected for purposes of standardization. (In the T column, room temperature was converted to 25 degrees Celsius, and any approximate temperatures were reported without their approximation sign.)
* `solvent`: Solvent information, if parsable from the remarks column.
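
The two-value rule described under `pka_type` (for an amphoteric entry with exactly two pK values, the lower is assumed to be basic and the higher acidic) can be sketched as follows. This is an illustration only, not the script used to build the dataset: the function name `label_two_pk_values` and the plain `"basic"`/`"acidic"` labels are hypothetical, and the mapping from these labels to the dataset's `pka_type` or `acidity_label` strings is left open.

```python
def label_two_pk_values(pk_values):
    """Label a pair of pK values: lower is assumed basic, higher is assumed acidic.

    Only the two-value case is handled; entries with more values were
    reviewed manually according to the README, so they are rejected here.
    """
    if len(pk_values) != 2:
        raise ValueError("rule applies only when exactly two pK values are reported")
    low, high = sorted(pk_values)
    return [(low, "basic"), (high, "acidic")]
```

For example, `label_two_pk_values([9.6, 2.3])` returns `[(2.3, "basic"), (9.6, "acidic")]`; the numbers are illustrative and are not taken from the dataset.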

### **Rows**:
There are 21,147 rows corresponding to 8,843 unique molecules in the dataset.
There are 24,222 rows corresponding to 10,624 unique molecules in the dataset.
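
The row and molecule counts can be checked against a local copy of the CSV. The sketch below uses only the Python standard library. It assumes that the file is UTF-8 encoded and that distinct molecules can be counted through an identifier column; the default column name `unique_ID` and the function name `count_rows_and_molecules` are assumptions, so adjust the column if the header in the file differs.

```python
import csv


def count_rows_and_molecules(csv_path, id_column="unique_ID"):
    """Return (number of data rows, number of distinct values in `id_column`)."""
    n_rows = 0
    ids = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            n_rows += 1
            ids.add(row[id_column])
    return n_rows, len(ids)
```

A call such as `count_rows_and_molecules("iupac_high-confidence_v2_0.csv")` returns the two counts as a tuple. Counting by `SMILES` instead of the identifier column may give a different number of distinct molecules.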

Specialized abbreviations used:
* `pK`: dissociation constant of any type, e.g. a pKa, pKAH, pKB, etc.
* `pK`: dissociation constant of any type, e.g. a pKa, pKaH, pKb, etc.
* `pKa`: acid dissociation constant
* `pk1, pk2, pk3, ...`: first, second, third (etc) acid dissociation constants
* `pKAH`: acid dissociation constant of a base's conjugate acid
* `pKB`: base dissociation constant, or 14-pKAH
* `pKa1, pKa2, pKa3, ...`: first, second, third (etc) acid dissociation constants
* `pKaH`: acid dissociation constant of a base's conjugate acid
* `pKb`: base dissociation constant, or 14 - pKaH
* `I`: ionic strength, equal to 1/2 * Sum(ci * zi^2)
* `m`: concentration in mole/1000g of water
* `c`: concentration in mole/L
@@ -182,7 +193,7 @@ The following table is necessary for this dataset to be indexed by
- International Union of Pure and Applied Chemistry, E. P. Serjeant and Boyd Dempsey. 'Ionisation Constants of Organic Acids in Aqueous Solution'; Oxford/Pergamon, 1979 (Oxford IUPAC chemical data series)<br/>
- International Union of Pure and Applied Chemistry, DD Perrin. *Dissociation Constants of Organic Bases in Aqueous Solution*; Butterworths, 1965<br/>
- International Union of Pure and Applied Chemistry, DD Perrin. *Dissociation Constants of Organic Bases in Aqueous Solution, Supplement*; Butterworths, 1972<br/>
With permission from the copyright holder, the International Union of Pure and Applied Chemistry (IUPAC), the reference books were scanned, converted to digital data, checked for accuracy, and curated for accessibility, interoperability, and reusability between the dates of Friday, Sept. 10, 2021 and Thursday, Sept. 15, 2022 by Jonathan Zheng (jonzheng@mit.edu).
With permission from the copyright holder, the International Union of Pure and Applied Chemistry (IUPAC), the reference books were scanned, converted to digital data, checked for accuracy, and curated for accessibility, interoperability, and reusability between the dates of Friday, Sept. 10, 2021 and Thursday, Sept. 15, 2022 by Jonathan Zheng (jonzheng@mit.edu). Further processing and curation were performed between Jan. 8, 2024 and Feb. 22, 2024 by Jonathan Zheng and Olivier Lafontant-Joseph (olivj23@mit.edu).
</code></td>
</tr>
<tr>
@@ -231,4 +242,4 @@ With permission from the copyright holder, the International Union of Pure and A
<td><code itemprop="license">https://creativecommons.org/licenses/by-nc/4.0/</code></td>
</tr>
</table>
</div>
</div>