In the attention.py demo, the `get_attention_by_gradcam` method takes both `image_input` and `text_input`, and I want to know why `text_input` is the one processed here (it is the input used to build the mask). The demo is shown below.
```python
import numpy as np

def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    # Tell the target cross-attention layer to save its attention map.
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True

    output = model(image_input, text_input)
    # Backpropagate from the positive (match) logit.
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()

    image_size = 256
    temp = int(np.sqrt(image_size))
    # The mask multiplies padding tokens by 0 so that they do not contribute
    # to the cams and grads; because of how ALBEF and TCL preprocess text,
    # the mask is actually unnecessary here.
    mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()
```
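For reference, ALBEF's public visualization notebook combines these tensors roughly as follows (a sketch, not this repo's code; the head count of 12 and the slicing of the image [CLS] column at index 0 are assumptions carried over from ALBEF):

```python
# Sketch of the usual continuation, adapted from ALBEF's visualization
# notebook; head count (12) and [CLS]-column slicing are assumptions.
cams = cams[:, :, :, 1:].reshape(image_input.size(0), 12, -1, temp, temp) * mask
grads = grads[:, :, :, 1:].clamp(0).reshape(image_input.size(0), 12, -1, temp, temp) * mask
# Grad-CAM per text token: attention weighted by its positive gradients,
# averaged over heads; padded text tokens were zeroed by the mask above.
gradcam = (cams * grads).mean(1)  # shape: (batch, num_text_tokens, temp, temp)
```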
A related question concerns the 'albef' attention demo, where `attr_name` is 'text_encoder'. Why is 'text_encoder' the attribute used here?
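My guess is that `getattr` resolves 'text_encoder' to ALBEF's fusion encoder, whose upper layers carry the `crossattention` modules being hooked. A hypothetical sanity check (with `model` and `target_layer` as in the demo above; this is not code from the repo):

```python
# Hypothetical check with an ALBEF checkpoint: the text encoder's upper
# (fusion) layers expose the crossattention modules that Grad-CAM hooks.
encoder = getattr(model, 'text_encoder', None)  # same lookup as in the demo
layer = encoder.encoder.layer[target_layer]
print(hasattr(layer, 'crossattention'))  # expected: True for fusion layers
```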