This repository provides the evaluation code for the GPT-OSS model in the Tox21 Leaderboard.
Here a GPT-OSS model is evaluated on the Tox21 dataset.
full_evaluation_run.py- descriptioneval_results.ipynb- description
To run the GPT-OSS classifier, clone the repository and install dependencies:
git clone https://github.com/ml-jku/tox21_gpt-oss_classifier.git
cd tox21_gpt-oss_classifier
pip install -r requirements.txtThe pipeline consists of two components:
- full_evaluation_run.py — generates predictions from the model.
- eval_result.ipynb — inspects and analyzes the generated prediction file.
The evaluator loads the Tox21 test set, builds prompts for every target, runs the model via vLLM, and writes all predictions to a CSV file.
Example usage:
python full_evaluation_run.py \
--test_csv path/to/tox21_test.csv \
--model_name openai/gpt-oss-120b \
--train_csv path/to/tox21_train.csv \
--n_shots 0 \
--output_dir results \
--n_rollouts 1What the script does
- Load data
- Loads the test set (
test_csv). - Loads the training set if few-shot examples are enabled.
- Loads the test set (
- Construct prompts
For each molecule–target pair, a chat prompt is created containing:
- a fixed system instruction,
- the SMILES string,
- a description of the target,
- optionally few-shot examples (sampled from the training data).
Prompts are formatted through the model’s chat template (
tokenizer.apply_chat_template).
- Run vLLM inference
- Generates model outputs for every prompt.
- Supports multiple rollouts per sample (
--n_rollouts). - Configurable temperature, reasoning effort, and max tokens.
- Extract answer and type conversion
- direct float parsing when possible,
- otherwise the last valid number between 0 and 1 is used,
- percentage formats are handled,
- invalid outputs default to 0.
- Save predictions
The final prediction file is written to:
results/<model_name>/<shot_setting>_<reasoning_effort>/predictions.csv
The notebook eval_result.ipynb is intended for:
- loading and inspecting the generated predictions.csv,
- computing performance metrics.