Unfortunately, the validation loss used during training is currently calculated on the test set, which means that the reported test-set perplexity is not a reliable indicator of out-of-sample generalisation (cf. main.py lines 256-282).
The original intention to calculate the validation loss on the validation set is clear from main.py lines 244-251; however, the variables defined there are not subsequently used in the "evaluate" function, which appears to be applied to the test data instead.
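For illustration, here is a minimal sketch of the suspected pattern and the one-line fix. All identifiers (`evaluate`, `val_data`, `test_data`) and the stub bodies are assumptions standing in for whatever main.py actually defines; only the structure of the issue is being asserted, not the exact code.

```python
import math

# Hypothetical reconstruction of the training loop (names assumed,
# not copied from main.py). Stubs stand in for the real training and
# evaluation logic so the sketch runs on its own.

val_data, test_data = object(), object()    # placeholders for real batches

def train():
    pass                                     # stub: one training epoch

def evaluate(data):
    return 1.0                               # stub: mean cross-entropy on data

best_val_loss = float("inf")
for epoch in range(1, 4):
    train()
    # Suspected bug (cf. lines 256-282): the per-epoch "validation" loss
    # is computed on the test set, so test data leaks into model selection.
    val_loss = evaluate(test_data)           # fix: evaluate(val_data)
    best_val_loss = min(best_val_loss, val_loss)

# The held-out test set should be evaluated exactly once, after training:
test_loss = evaluate(test_data)
print(f"test perplexity: {math.exp(test_loss):.2f}")
```

With that one-argument change, the validation set drives the per-epoch monitoring and the test-set perplexity is only ever reported once, which restores it as an honest estimate of out-of-sample performance.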