Love the work! I am just having difficulty understanding the architecture of the SI + DI model.
From what I can see in the architecture of the resnext.mat model, there is a temporal max pooling layer just before the softmax layer, and its inputs are the merged conv7 features and Video2. I am assuming the merged conv7 features come from running the dynamic image through the ResNeXt model. Where does Video2 come from?
Are we supposed to pass the whole video or just a single frame from the video clip?
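For reference, here is a sketch of how I currently understand the temporal max pooling step (NumPy, with hypothetical tensor shapes, not the repo's actual code): per-frame conv7 feature maps are pooled element-wise over the time axis, collapsing the clip into a single feature tensor before the softmax.

```python
import numpy as np

# Hypothetical shapes: T frames, C channels, H x W spatial feature maps.
T, C, H, W = 10, 2048, 7, 7
conv7_per_frame = np.random.rand(T, C, H, W)  # one conv7 feature map per frame

# Temporal max pooling: element-wise max across the T frames,
# producing one C x H x W tensor for the whole clip.
pooled = conv7_per_frame.max(axis=0)
print(pooled.shape)  # (2048, 7, 7)
```

If this is right, it would suggest the layer expects features from multiple frames rather than a single one, which is why I am confused about what exactly to feed in.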