Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why attention demo chooses language model layer to catch model attention? #14

Open
SunJiaheng66 opened this issue Oct 23, 2023 · 0 comments

Comments

@SunJiaheng66
Copy link

SunJiaheng66 commented Oct 23, 2023

In attention.py demo, get_attention_by_gradcam method's inputs have image_input and text_input, I want to know why choosing text_input to deal. The demo is showed below.

def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
    output = model(image_input, text_input)
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()
    image_size = 256
    temp = int(np.sqrt(image_size))
    # the effect of mask is let those padding tokens multiply with 0 so that they won't be calculated in cams and
    # grads , because of the text preprocess of ALBEF and TCL, mask is unuseful here
    mask = **text_input**.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = **encoder_name**.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()

Another same question is in 'albef' attention, demo shows atter_name is 'text_encoder', The demo is showed below.

def getAttMap(self, image_path, text):
    if self.model_name.lower() == 'albef':
        engine = ALBEF('ALBEF_4M.pth')
        model, tokenizer = engine.load_model(engine.model_id)
        image_input = engine.load_data(src_type='local', data=[image_path])[0]
        text_input = tokenizer(engine.pre_caption(text), return_tensors="pt")
        self.get_attention_by_gradcam(model, tokenizer, image_path, image_input, text_input,
                                          attr_name='text_encoder', target_layer=8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant