See the paper here.
Train a single organism:
uv run scripts/train/train_model_organism.py setting=apps side_quest=tmp_variable it_model=gemma3_12b train.train_set_wandb_path=<train-path> train.validation_set_wandb_path=<validation-path>

Train a sweep:
uv run scripts/train/train_model_organism.py --multirun setting=apps side_quest=tmp_variable,logic_bomb it_model=gemma3_12b,llama3_8b train.train_set_wandb_path=<train-path> train.validation_set_wandb_path=<validation-path>

Single run example:
uv run scripts/train/train_with_mitigation_strat.py setting=apps side_quest=tmp_variable model_organism=prompted it_model=llama3_8b strat=kl_from_base_penalty

Multirun example:
uv run scripts/train/train_with_mitigation_strat.py --multirun setting=apps side_quest=tmp_variable model_organism=prompted it_model=llama3_8b strat=entropy_bonus,length_penalty,kl_from_base_penalty

Example usage:
uv run scripts/eval/run_benign_eval.py setting=apps
uv run scripts/eval/run_organism_eval.py setting=apps model_organism=prompted it_model=gemma3_12b side_quest=tmp_variable
uv run scripts/eval/run_organism_eval.py --multirun setting=apps model_organism=prompted it_model=gemma3_12b side_quest=tmp_variable,logic_bomb
uv run scripts/eval/run_organism_eval.py setting=apps model_organism=trained it_model=gemma3_12b model_organism.wandb_artifact_path=<path> side_quest=tmp_variable
uv run scripts/eval/run_mitigation_eval.py setting=apps side_quest=logic_bomb wandb_artifact_path=<path>
uv run scripts/eval/run_mitigation_eval.py --multirun setting=apps side_quest=logic_bomb wandb_artifact_path=<path1>,<path2>
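
The `--multirun` commands above pass comma-separated values for some options. This README does not spell out how those lists are combined; the sketch below assumes Hydra-style sweep semantics, where every combination of the listed values becomes its own run. Under that assumption, the training sweep with two `side_quest` values and two `it_model` values would correspond to four single runs. The helper name `expand_sweep` is hypothetical and only prints the commands; it does not launch anything.

```shell
#!/usr/bin/env sh
# Sketch: print the single runs that the training sweep is ASSUMED to expand to.
# Assumption (not confirmed by this README): comma-separated values under
# --multirun form a Cartesian product, as in Hydra-style sweeps.

expand_sweep() {
  for side_quest in tmp_variable logic_bomb; do
    for it_model in gemma3_12b llama3_8b; do
      # <train-path> and <validation-path> are placeholders, as in the commands above.
      echo "uv run scripts/train/train_model_organism.py" \
        "setting=apps side_quest=${side_quest} it_model=${it_model}" \
        "train.train_set_wandb_path=<train-path>" \
        "train.validation_set_wandb_path=<validation-path>"
    done
  done
}

# Print one command per line for inspection before running anything.
expand_sweep
```

Printing the expanded commands first is a cheap way to check how many runs a sweep would start before committing compute to it.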