Why the max_seq_length = 512 for XLNet? #263

Open
vr25 opened this issue Apr 23, 2020 · 4 comments
Comments


vr25 commented Apr 23, 2020

Hi,

Just a conceptual question:
The paper mentions that XLNet derives parts of its architecture from Transformer-XL, which is not limited to a fixed-length context, yet the hyperparameters section says that the maximum sequence length is 512.

Can you please help me better understand it?

Thanks!

@mihaidobri

I have the same question. @zihangdai, could you please help us with this?

@mihaidobri

Or maybe @kimiyoung?

@zihangdai (Owner)

Assuming you are familiar with Transformer-XL, max_seq_length is the length of each training segment within which you can back-propagate (the gradient does not flow into the cached memory in Transformer-XL).
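For intuition, here is a minimal sketch of the memory-caching step in the spirit of the Transformer-XL-style recurrence (simplified; the function and argument names here are illustrative rather than the exact ones in the repo). The stop-gradient on the cached states is what confines back-prop to the current segment:

```python
import tensorflow as tf

def cache_mem(curr_out, prev_mem, mem_len):
  """Cache the current segment's hidden states as memory for the next segment.

  curr_out: [curr_len, batch, d_model] hidden states of the current segment.
  prev_mem: [mem_len, batch, d_model] states cached from earlier segments, or None.
  """
  if prev_mem is None:
    new_mem = curr_out[-mem_len:]
  else:
    new_mem = tf.concat([prev_mem, curr_out], axis=0)[-mem_len:]
  # Stop-gradient: the next segment attends to this memory as extra context,
  # but gradients never flow back into it, so back-prop stays inside the
  # current segment of max_seq_length tokens.
  return tf.stop_gradient(new_mem)
```

So the 512 limit is on how far gradients propagate within one segment; the effective context can still grow across segments through the cached (but non-differentiable) memory.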

Then, why the value 512?
(1) A longer sequence requires more pretraining time.
(2) Most of the tasks considered at the time do not really require handling long sequences: GLUE uses 128 and SQuAD uses 512. RACE performance can be improved slightly if you also increase max_seq_length during finetuning. Technically, you can increase the sequence length during finetuning if you want, but if it is too long, generalization may suffer because such long sequences were never seen during pretraining.

@mihaidobri

@zihangdai thank you for your fast reply!
