What do you think about combining your architecture with existing pre-trained encoders? Could using BERT as a **prior_encoder** help achieve better results?