Hello. Thank you for your great work!
I have a question about the full-resolution Swin-T baseline reported in the FastVQA paper. The paper mentions that fixed recognition features were regressed to obtain this baseline. Does this mean that all frames of the video were used (no temporal sampling) and that no fragmentation or resizing was applied? Or was the temporally sampled video used as the input to the Swin-T model when generating the fixed features?