aristizabal95/unexploitable-search

A link to the paper is available here.

Running experiments

Training

Model organisms

Train a single organism:

uv run scripts/train/train_model_organism.py setting=apps side_quest=tmp_variable it_model=gemma3_12b train.train_set_wandb_path=<train-path> train.validation_set_wandb_path=<validation-path>

Train a sweep:

uv run scripts/train/train_model_organism.py --multirun setting=apps side_quest=tmp_variable,logic_bomb it_model=gemma3_12b,llama3_8b train.train_set_wandb_path=<train-path> train.validation_set_wandb_path=<validation-path>
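
The `key=value` overrides and the `--multirun` flag resemble Hydra's command-line syntax. Assuming that is what the scripts use, and that the default sweeper is active, comma-separated values expand into the Cartesian product of all options, so the sweep above would launch four runs (two side quests times two models). A minimal sketch of that expansion follows; the `expand_sweep` helper is hypothetical and not part of this repository.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand comma-separated override values into one override list per run.

    Hypothetical helper: mimics how a Cartesian-product sweeper would
    enumerate runs. It does not call into the repository or Hydra.
    """
    keys = []
    choices = []
    for item in overrides:
        key, _, value = item.partition("=")
        keys.append(key)
        choices.append(value.split(","))
    return [
        [f"{k}={v}" for k, v in zip(keys, combo)]
        for combo in product(*choices)
    ]


sweep = expand_sweep([
    "setting=apps",
    "side_quest=tmp_variable,logic_bomb",
    "it_model=gemma3_12b,llama3_8b",
])
for run in sweep:
    print(" ".join(run))
```

Each printed line corresponds to the overrides one run in the sweep would receive under these assumptions.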

RL (with mitigation strategy)

Single run example:

uv run scripts/train/train_with_mitigation_strat.py setting=apps side_quest=tmp_variable model_organism=prompted it_model=llama3_8b strat=kl_from_base_penalty

Multirun example:

uv run scripts/train/train_with_mitigation_strat.py --multirun setting=apps side_quest=tmp_variable model_organism=prompted it_model=llama3_8b strat=entropy_bonus,length_penalty,kl_from_base_penalty

Evaluation

Example usage:

uv run scripts/eval/run_benign_eval.py setting=apps
uv run scripts/eval/run_organism_eval.py setting=apps model_organism=prompted it_model=gemma3_12b side_quest=tmp_variable
uv run scripts/eval/run_organism_eval.py --multirun setting=apps model_organism=prompted it_model=gemma3_12b side_quest=tmp_variable,logic_bomb
uv run scripts/eval/run_organism_eval.py setting=apps model_organism=trained it_model=gemma3_12b model_organism.wandb_artifact_path=<path> side_quest=tmp_variable
uv run scripts/eval/run_mitigation_eval.py setting=apps side_quest=logic_bomb wandb_artifact_path=<path>
uv run scripts/eval/run_mitigation_eval.py --multirun setting=apps side_quest=logic_bomb wandb_artifact_path=<path1>,<path2>
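
When many evaluations need to be launched, the commands above can be assembled programmatically instead of typed by hand. The sketch below is one way to do that; `build_eval_command` is a hypothetical helper, not something the repository provides, and it only builds the argument list without running anything.

```python
import shlex


def build_eval_command(script, multirun=False, **overrides):
    """Build a `uv run` argument list from keyword overrides.

    Hypothetical helper: list or tuple values are joined with commas,
    which is the sweep syntax used in the examples above.
    """
    cmd = ["uv", "run", script]
    if multirun:
        cmd.append("--multirun")
    for key, value in overrides.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_eval_command(
    "scripts/eval/run_organism_eval.py",
    multirun=True,
    setting="apps",
    model_organism="prompted",
    it_model="gemma3_12b",
    side_quest=["tmp_variable", "logic_bomb"],
)
print(shlex.join(cmd))
```

To actually launch the evaluation, the resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting issues with artifact paths.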
