Should we ensure `relative_hidden_states` involves T_f^+ and T_f^-?
#60
Reading the paper, I got the feeling that `relative_hidden_states[layer]` should be the difference between the positive and negative function/concept templates. `RepReadingPipeline.get_directions` seemingly relies on `train_inputs[::2]` being the positive and `train_inputs[1::2]` the negative ones, but it does not ensure that. This may affect some downstream usages of `RepReadingPipeline.get_directions`.
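Here is my reading of that slicing as a minimal sketch (my reconstruction, not the library's actual code; `pairwise_differences` is a hypothetical name):

```python
import numpy as np

# Sketch of the ordering assumption as I understand it
# (illustrative only, not the actual RepReadingPipeline code).
def pairwise_differences(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: one row per training input, in input order."""
    # Assumes inputs were interleaved as [pos_0, neg_0, pos_1, neg_1, ...].
    positives = hidden_states[::2]
    negatives = hidden_states[1::2]
    # These differences only correspond to T_f^+ - T_f^- per pair if the
    # interleaving actually holds; with shuffled pairs, roughly half of
    # them flip sign.
    return positives - negatives
```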
For example, in `honesty.ipynb` the training data gets prepared by `honesty_function_dataset`, which shuffles the labels of prompts (see `representation-engineering/examples/honesty/utils.py`, lines 53 to 57 at 5455d8a).
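To illustrate the kind of shuffling I mean (a hypothetical sketch, not the verbatim `utils.py` code): each (honest, untruthful) pair can end up in either order, so position no longer encodes the label.

```python
import random

pairs = [
    ("honest statement 1", "untruthful statement 1"),
    ("honest statement 2", "untruthful statement 2"),
]
train_inputs = []
for honest, untruthful in pairs:
    pair = [honest, untruthful]
    random.shuffle(pair)      # the positive may land at an odd index
    train_inputs.extend(pair)
# Now train_inputs[::2] is a mix of honest and untruthful statements.
```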
When such prepared data is passed to `RepReadingPipeline.get_directions`, `train_inputs[::2]` are not necessarily the positive function/concept templates.

Am I reading this right?
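If so, one cheap guard might be to keep explicit labels alongside the inputs and re-interleave before calling `get_directions` (a hypothetical helper, not an existing API; it assumes each adjacent pair of inputs is one shuffled pos/neg pair, so filtering by label preserves the pair correspondence):

```python
def interleave_by_label(inputs, labels):
    """Reorder inputs so that inputs[::2] are the positives."""
    positives = [x for x, is_pos in zip(inputs, labels) if is_pos]
    negatives = [x for x, is_pos in zip(inputs, labels) if not is_pos]
    assert len(positives) == len(negatives), "expected matched pos/neg pairs"
    interleaved = []
    for pos, neg in zip(positives, negatives):
        interleaved.extend([pos, neg])
    return interleaved
```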