This Python tool is designed to automate the extraction of quantum chemical descriptors from Gaussian calculation output files (.log) and perform subsequent multivariate statistical analysis.
It is specifically tailored for analyzing DFT calculations of small organic molecules, extracting physical organic descriptors, and performing stepwise linear regression to model experimental properties.
It includes a backward stepwise regression module that builds predictive models from the extracted descriptors.
- Batch Extraction: Automatically processes multiple molecules based on numerical indices found in filenames.
- Robust Parsing: Extracts SCF energies, orbital energies (HOMO/LUMO), Dipole moments, and Frequency data using regex.
- Descriptor Calculation: Computes DFT-based reactivity indices (Hardness, Softness, Electrophilicity, etc.).
- Steric Analysis: Calculates Sterimol parameters ($L, B_1, B_5$) using `morfeus` based on molecule geometries.
- Data Preprocessing: Automatically handles missing values, removes constant descriptor columns, and applies Z-score standardization to all features before regression to ensure scale invariance.
- Statistical Modeling: Performs automatic Backward Elimination OLS Regression. It iteratively removes the least significant descriptors (p-value > 0.05) until all remaining variables are statistically significant. It also performs an initial Spearman correlation screening if the number of descriptors exceeds the sample size.
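
The regex-based parsing mentioned above can be illustrated with a minimal sketch. It assumes the usual Gaussian log layout (`SCF Done:` lines and `Alpha occ./virt. eigenvalues --` lines). The function name `parse_log_text` and the returned keys are hypothetical and are not taken from the script itself.

```python
import re

# Patterns for the standard Gaussian log layout (an assumption about the input).
SCF_RE = re.compile(r"SCF Done:\s+E\(\S+\)\s*=\s*(-?\d+\.\d+)")
OCC_RE = re.compile(r"Alpha\s+occ\. eigenvalues --(.*)")
VIRT_RE = re.compile(r"Alpha\s+virt\. eigenvalues --(.*)")
NUM_RE = re.compile(r"-?\d+\.\d+")  # also splits fused values like -101.12345-10.12345


def parse_log_text(text):
    """Return the last SCF energy and the HOMO/LUMO of the last orbital block (Hartree)."""
    scf_energy = None
    occ, virt = [], []
    for line in text.splitlines():
        m = SCF_RE.search(line)
        if m:
            scf_energy = float(m.group(1))  # later SCF cycles overwrite earlier ones
            continue
        m = OCC_RE.search(line)
        if m:
            if virt:  # an occupied line after virtual lines starts a new block
                occ, virt = [], []
            occ.extend(float(x) for x in NUM_RE.findall(m.group(1)))
            continue
        m = VIRT_RE.search(line)
        if m:
            virt.extend(float(x) for x in NUM_RE.findall(m.group(1)))
    return {
        "scf": scf_energy,
        "homo": occ[-1] if occ else None,   # highest occupied orbital
        "lumo": virt[0] if virt else None,  # lowest virtual orbital
    }


SAMPLE = """ SCF Done:  E(RB3LYP) =  -40.5183892     A.U. after    8 cycles
 Alpha  occ. eigenvalues --  -10.16707  -0.69041  -0.38831  -0.38831  -0.38831
 Alpha virt. eigenvalues --    0.11824   0.17676   0.17676
"""
parsed = parse_log_text(SAMPLE)
```

Keeping only the last orbital block matters for optimization jobs, where the log may print orbital energies more than once.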

Ensure you have Python installed along with the following libraries:

    pip install pandas statsmodels morfeus-py openpyxl matplotlib

Note: The script also uses the standard-library modules os, re, and csv.
The script relies on a strict file naming convention to associate files with a specific sample ID (integer n).
Place all files in a single directory. Replace n with the sample number (e.g., 1, 2, 10...):
| File Type | Naming Pattern | Description |
|---|---|---|
| Gaussian Output | `n-**.log` | Output log for the isolated cation. |
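
As a rough illustration of how filenames can be associated with a sample ID, the sketch below groups `.log` files by their leading integer. The helper name `group_logs_by_sample` is hypothetical, and the `cation` suffix is only an example taken from the troubleshooting notes. The script's actual matching logic may differ.

```python
import re
from collections import defaultdict

# "<integer>-<label>.log", e.g. "12-cation.log" (pattern assumed for illustration).
LOG_NAME_RE = re.compile(r"^(\d+)-(.+)\.log$", re.IGNORECASE)


def group_logs_by_sample(filenames):
    """Map each sample ID to a {label: filename} dict, sorted by numeric ID."""
    groups = defaultdict(dict)
    for name in filenames:
        m = LOG_NAME_RE.match(name)
        if m:  # silently skip anything that does not follow the convention
            groups[int(m.group(1))][m.group(2).lower()] = name
    return dict(sorted(groups.items()))


files = ["1-cation.log", "10-cation.log", "2-cation.log", "notes.txt", "data.xlsx"]
groups = group_logs_by_sample(files)
```

In practice the list would come from something like `os.listdir(data_folder)`. Converting the prefix to `int` keeps sample 10 after sample 2 instead of sorting it alphabetically.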
You also need to provide an Excel file named data.xlsx in the same directory. This file serves two purposes: providing the experimental target variable (for regression) and configuration for Sterimol calculations.
The Excel file must contain the following columns (headers are case-sensitive):
| Column Name | Description | Example |
|---|---|---|
| `number` | The sample ID corresponding to n in filenames. | 1 |
| `dependent variable` | The experimental value (target Y) to predict. | 5.4 |
| `sterimol axis atoms` | Atom indices to define an axis for Sterimol calculations, separated by a comma. | 1,6 |
Example data.xlsx content:
| number | dependent variable | sterimol axis atoms |
|---|---|---|
| 1 | 8.23 | 1,6 |
| 2 | 7.45 | 1,5 |
| 3 | 9.10 | 2,7 |
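
A cell such as `1,6` has to be turned into two atom indices before it can define a Sterimol axis. The following is a minimal validation sketch. The function name `parse_axis_atoms` is hypothetical, and the assumption that indices are 1-based follows the convention used by `morfeus`. Which atom the script treats as the start of the axis is not stated in this document.

```python
def parse_axis_atoms(cell):
    """Convert a cell like '1,6' into a pair of 1-based atom indices."""
    parts = [p.strip() for p in str(cell).split(",")]
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected two comma-separated integers, got {cell!r}")
    first, second = int(parts[0]), int(parts[1])
    if first < 1 or second < 1 or first == second:
        raise ValueError(f"atom indices must be distinct and >= 1, got {cell!r}")
    return first, second


axis = parse_axis_atoms("1,6")
```

Failing early on a malformed cell gives a clearer message than an error raised later inside the steric calculation.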
The script extracts and calculates the following descriptors:
- Energies: HOMO, LUMO, HOMO-LUMO Gap.
- DFT Indices:
  - Chemical Hardness ($\eta$)
  - Chemical Softness ($\sigma$)
  - Chemical Potential ($\mu$)
  - Electronegativity ($\chi$)
  - Electrophilicity Index ($\omega$)
- Dipole Moment: Field-independent basis (Debye).
- Energies: Total SCF Energy, Kinetic Energy (KE), Nuclear Repulsion (N-N), Electron-Nuclear (E-N).
- Corrections: ZPE, Thermal Corrections to Energy, Enthalpy (H), and Gibbs Free Energy (G).
- Thermochemistry: Entropy ($S$), Heat Capacity ($C_v$).
- Binding Energies: $\Delta E$ for Salt formation and Solvent-Cation interaction.
- Sterimol Parameters: $L$ (Length), $B_1$ (Min width), $B_5$ (Max width).
- Frequencies: Lowest vibrational frequency.
- Mass: Molecular mass.
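
The DFT indices listed above are commonly derived from the frontier orbital energies under the Koopmans-type approximation. The sketch below uses one widespread convention, with $\sigma = 1/\eta$; some authors use $1/(2\eta)$ instead. This convention is an assumption and may not match the definitions in the script. The function name `reactivity_indices` is hypothetical.

```python
def reactivity_indices(homo, lumo):
    """Conceptual-DFT indices from HOMO/LUMO energies (same unit in, same unit out).

    Assumed convention:
      mu    = (E_HOMO + E_LUMO) / 2   chemical potential
      chi   = -mu                     electronegativity
      eta   = (E_LUMO - E_HOMO) / 2   chemical hardness
      sigma = 1 / eta                 chemical softness
      omega = mu**2 / (2 * eta)       electrophilicity index
    """
    gap = lumo - homo
    if gap <= 0:
        raise ValueError("LUMO must lie above HOMO")
    mu = (homo + lumo) / 2
    eta = gap / 2
    return {
        "gap": gap,
        "mu": mu,
        "chi": -mu,
        "eta": eta,
        "sigma": 1 / eta,
        "omega": mu ** 2 / (2 * eta),
    }


# Illustrative orbital energies in Hartree (made-up values, not real results).
indices = reactivity_indices(homo=-0.25, lumo=0.05)
```

Because every index is a deterministic function of the same two energies, these descriptors are strongly correlated with each other, which is one reason an elimination step is useful before interpreting coefficients.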
Follow these steps to run the analysis:
Create a folder (e.g., D:\Research\GaussianData) and ensure it contains:
- All your `.log` files named correctly (see Naming Convention).
- The `data.xlsx` file containing your experimental data and Sterimol configs.

Open `extract_gaussian_data.py` in a text editor or IDE. Locate the `main()` function and update the `data_folder` variable to point to your directory:

    def main():
        # ...
        data_folder = r"D:\Research\GaussianData"  # <--- Update this path
        output_file = 'results.csv'
        # ...

Open your terminal or command prompt, navigate to the folder containing the Python script, and run:

    python extract_gaussian_data.py

The script will provide real-time feedback in the console:
- Loading: It will confirm that `data.xlsx` was loaded and how many Sterimol configs were found.
- Processing: It will iterate through every group number found: `Processing group 1...`
- Fitting: Once extraction is done, it begins the Multivariate Linear Regression (Backward Elimination):
  - `Starting Multivariate Linear Fitting (Mode: backward)...`
  - `--- Round 1 Fitting ---`
  - `Descriptor Contribution...`
  - `Decision: Removing descriptor 'SM-LUMO' (Low contribution, P=0.85...)`
After execution, three new files will be generated in your working directory:
- `results.csv`: A comprehensive dataset containing every extracted descriptor for every molecule. This is your raw data for further analysis.
- `fitting_report.txt`: The final statistical summary of the best regression model found, including R-squared, F-statistic, and coefficients.
- `prediction_vs_actual.png`: A scatter plot visualizing the correlation between Experimental (X-axis) and Predicted (Y-axis) values, displaying the final $R^2$ of the model.
- `Warning: File not found ...`: The script cannot find a specific log file. Double-check that your files are named exactly `n-cation.log`, `n-salt.log`, etc.
- `Warning: Config file ... missing columns`: Your `data.xlsx` headers are likely incorrect. They must exactly match `number`, `dependent variable`, and `sterimol axis atoms`.
- `Error extracting HOMO/LUMO`: The script failed to parse the orbital energies. Ensure your Gaussian jobs included orbital printing (standard in optimization jobs) and finished successfully (SCF Done).
- Empty `fitting_report.txt`: If the regression fails, check if `results.csv` contains `NaN` values (blank cells). The regression tool removes columns containing any missing data.
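
To see why missing values can leave nothing to fit, it helps to walk through the preprocessing the regression relies on: columns with any missing value are dropped, constant columns are dropped, and the rest are Z-scored. The sketch below reproduces those three steps with the standard library only. The function name `preprocess`, the sample-standard-deviation choice (ddof = 1), and the example values are assumptions, not the script's implementation.

```python
import math
import statistics


def preprocess(columns):
    """columns: dict of name -> list of floats (None or NaN marks a missing value)."""
    cleaned = {}
    for name, values in columns.items():
        if any(v is None or math.isnan(v) for v in values):
            continue  # drop descriptors with any missing value
        sd = statistics.stdev(values)  # sample standard deviation (ddof = 1)
        if sd == 0:
            continue  # drop constant descriptors: they carry no information
        mean = statistics.fmean(values)
        cleaned[name] = [(v - mean) / sd for v in values]  # Z-score
    return cleaned


# Made-up descriptor table: one usable column, one constant, one with a gap.
table = {
    "HOMO": [-0.30, -0.25, -0.20],
    "Mass": [100.0, 100.0, 100.0],
    "ZPE": [0.10, float("nan"), 0.12],
}
features = preprocess(table)
```

If a single failed job leaves blanks in many descriptor columns, most of them are discarded by the first rule, so fixing or excluding that one sample is often the quickest remedy.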