Tox21 GPT-OSS Classifier

This repository provides the code used to evaluate a GPT-OSS model on the Tox21 dataset for the Tox21 Leaderboard.

Repository Structure

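full_evaluation_run.py - evaluation script that builds prompts, runs the model via vLLM, and writes predictions to a CSV file
eval_result.ipynb - notebook for inspecting the predictions and computing performance metrics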

Installation

To run the GPT-OSS classifier, clone the repository and install dependencies:

git clone https://github.com/ml-jku/tox21_gpt-oss_classifier.git
cd tox21_gpt-oss_classifier
pip install -r requirements.txt

Evaluation procedure

The pipeline consists of two components:

full_evaluation_run.py — generates predictions from the model.
eval_result.ipynb — inspects and analyzes the generated prediction file.

Running the evaluator (`full_evaluation_run.py`)

The evaluator loads the Tox21 test set, builds prompts for every target, runs the model via vLLM, and writes all predictions to a CSV file.

Example usage:

python full_evaluation_run.py \
    --test_csv path/to/tox21_test.csv \
    --model_name openai/gpt-oss-120b \
    --train_csv path/to/tox21_train.csv \
    --n_shots 0 \
    --output_dir results \
    --n_rollouts 1

What the script does

Load data
- Loads the test set (test_csv).
- Loads the training set (train_csv) if few-shot examples are enabled.

Construct prompts
For each molecule–target pair, a chat prompt is created containing:
- a fixed system instruction,
- the SMILES string,
- a description of the target,
- optionally, few-shot examples sampled from the training data.
Prompts are formatted through the model’s chat template (tokenizer.apply_chat_template).

Run vLLM inference
- Generates model outputs for every prompt.
- Supports multiple rollouts per sample (--n_rollouts).
- Temperature, reasoning effort, and max tokens are configurable.

Extract answer and type conversion
Each model output is converted to a numeric prediction:
- direct float parsing is used when possible,
- otherwise the last valid number between 0 and 1 is used,
- percentage formats are handled,
- invalid outputs default to 0.

Save predictions

The final prediction file is written to:

results/<model_name>/<shot_setting>_<reasoning_effort>/predictions.csv
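
The answer-extraction rules listed above can be illustrated with a short sketch. This is not the code from full_evaluation_run.py; the function name `extract_probability` and the regular expression are assumptions chosen only to mirror the four rules (direct float parsing, last valid number in [0, 1], percentage handling, default of 0).

```python
import re

# A number such as "0.73", ".5" or "85", optionally followed by "%".
# The lookbehind skips digits that belong to a longer or negative number.
_NUMBER = re.compile(r"(?<![\d.\-])(\d+(?:\.\d+)?|\.\d+)\s*(%?)")


def extract_probability(text: str, default: float = 0.0) -> float:
    """Convert raw model output to a prediction in [0, 1] (illustrative only)."""
    text = text.strip()

    # Rule 1: the whole output is already a float in range.
    try:
        value = float(text)
        if 0.0 <= value <= 1.0:
            return value
    except ValueError:
        pass

    # Rules 2 and 3: collect every number, rescale percentages,
    # and keep only values that fall between 0 and 1.
    candidates = []
    for number, percent in _NUMBER.findall(text):
        value = float(number)
        if percent:
            value /= 100.0
        if 0.0 <= value <= 1.0:
            candidates.append(value)

    # Rule 4: fall back to the default when nothing valid was found.
    return candidates[-1] if candidates else default


if __name__ == "__main__":
    for raw in ["0.73", "Maybe 0.2, final answer: 0.85", "85%", "no idea"]:
        print(repr(raw), "->", extract_probability(raw))
```

Taking the last valid number rather than the first favors the final answer over intermediate values that a reasoning model may mention earlier in its output.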

Inspecting results (`eval_result.ipynb`)

The notebook eval_result.ipynb is intended for:

- loading and inspecting the generated predictions.csv,
- computing performance metrics.
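
As a rough sketch of what such a metric computation could look like outside the notebook, the example below averages the rollouts for each molecule–target pair and computes a per-target ROC-AUC using only the standard library. Everything here is an assumption made for illustration: the column names (`smiles`, `target`, `label`, `prediction`), the choice of ROC-AUC as the metric, and the helper names are hypothetical, and the actual layout of predictions.csv may differ.

```python
from collections import defaultdict
from statistics import mean


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None when only one class is
    present, because the metric is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_target_auc(rows):
    """Average rollouts per (target, molecule), then score each target.

    `rows` is an iterable of dicts with the hypothetical keys 'smiles',
    'target', 'label' and 'prediction'. Rows with missing labels are
    assumed to have been filtered out beforehand.
    """
    rollouts = defaultdict(list)
    labels = {}
    for row in rows:
        key = (row["target"], row["smiles"])
        rollouts[key].append(float(row["prediction"]))
        labels[key] = int(row["label"])

    by_target = defaultdict(lambda: ([], []))
    for key, preds in rollouts.items():
        ys, scores = by_target[key[0]]
        ys.append(labels[key])
        scores.append(mean(preds))

    return {t: roc_auc(ys, scores) for t, (ys, scores) in by_target.items()}


if __name__ == "__main__":
    demo = [
        {"smiles": "CCO", "target": "NR-AR", "label": 0, "prediction": 0.1},
        {"smiles": "CCO", "target": "NR-AR", "label": 0, "prediction": 0.3},
        {"smiles": "c1ccccc1", "target": "NR-AR", "label": 1, "prediction": 0.9},
    ]
    print(per_target_auc(demo))
```

With a real predictions file, the rows could be read with `csv.DictReader` and passed to `per_target_auc` once the column names have been adapted to the actual header.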
