Hi Conner,
I think I found a possible bug in the buffer code. During the buffer refresh, it's possible for self.token_pointer to be greater than or equal to min(self.token_pointer + self.cfg["model_batch_size"], num_batches), in which case the token slice is empty and an empty batch gets passed to the two models. However, this doesn't immediately raise an error.
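Here's a minimal sketch of what I mean (the names token_pointer, model_batch_size, and num_batches mirror the buffer code; the concrete numbers are just for illustration):

```python
# Stand-in for the stored token batches
tokens = list(range(10))
num_batches = len(tokens)
model_batch_size = 4

# Suppose the pointer has run past num_batches during refresh
token_pointer = 12

end = min(token_pointer + model_batch_size, num_batches)  # min(16, 10) -> 10
batch = tokens[token_pointer:end]  # tokens[12:10] -> empty, no error raised
print(batch)  # -> []
```

Python slicing silently returns an empty sequence here instead of raising, which is why the problem doesn't surface until much later (if at all).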
If you run train.py in this branch https://github.com/tim-hua-01/crosscoder_fun/tree/issues_demo, you can see that the loss appears to go down even while empty tokens are added to the buffer.
I rewrote the buffer code here: https://github.com/tim-hua-01/crosscoder_fun/blob/main/buffer.py, although I'm not 100% sure that's correct either.
Thanks!
Tim