Should we ensure `relative_hidden_states` involves T_f^+ and T_f^-?
#60
Reading the paper, I got the feeling that `relative_hidden_states[layer]` should be the difference between the positive and negative function/concept templates. `RepReadingPipeline.get_directions` seemingly relies on `train_inputs[::2]` being the positive and `train_inputs[1::2]` the negative ones, but it does not ensure that. This may affect some downstream usages of `RepReadingPipeline.get_directions`.
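Here is my reading of that slicing as a minimal sketch (my reconstruction, not the library's actual code; `pairwise_differences` is a hypothetical name):

```python
import numpy as np

# Sketch of the ordering assumption as I understand it
# (illustrative only, not the actual RepReadingPipeline code).
def pairwise_differences(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: one row per training input, in input order."""
    # Assumes inputs were interleaved as [pos_0, neg_0, pos_1, neg_1, ...].
    positives = hidden_states[::2]
    negatives = hidden_states[1::2]
    # These differences only correspond to T_f^+ - T_f^- per pair if the
    # interleaving actually holds; with shuffled pairs, roughly half of
    # them flip sign.
    return positives - negatives
```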
For example, in `honesty.ipynb` the training data gets prepared by `honesty_function_dataset`, which shuffles the labels of prompts (see `representation-engineering/examples/honesty/utils.py`, lines 53 to 57 at 5455d8a).
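To illustrate the kind of shuffling I mean (a hypothetical sketch, not the verbatim `utils.py` code): each (honest, untruthful) pair can end up in either order, so position no longer encodes the label.

```python
import random

pairs = [
    ("honest statement 1", "untruthful statement 1"),
    ("honest statement 2", "untruthful statement 2"),
]
train_inputs = []
for honest, untruthful in pairs:
    pair = [honest, untruthful]
    random.shuffle(pair)      # the positive may land at an odd index
    train_inputs.extend(pair)
# Now train_inputs[::2] is a mix of honest and untruthful statements.
```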
When such prepared data is passed to `RepReadingPipeline.get_directions`, `train_inputs[::2]` are not necessarily the positive function/concept templates.

Am I reading this right?
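If so, one cheap guard might be to keep explicit labels alongside the inputs and re-interleave before calling `get_directions` (a hypothetical helper, not an existing API; it assumes each adjacent pair of inputs is one shuffled pos/neg pair, so filtering by label preserves the pair correspondence):

```python
def interleave_by_label(inputs, labels):
    """Reorder inputs so that inputs[::2] are the positives."""
    positives = [x for x, is_pos in zip(inputs, labels) if is_pos]
    negatives = [x for x, is_pos in zip(inputs, labels) if not is_pos]
    assert len(positives) == len(negatives), "expected matched pos/neg pairs"
    interleaved = []
    for pos, neg in zip(positives, negatives):
        interleaved.extend([pos, neg])
    return interleaved
```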