How to visualize the attention map for a model with multi-head attention (e.g., ViLBERT) #917
Hi @CCYChongyanChen, thanks for using mmf. Most of what you said seems correct to me. In ViLBERT, the image representations we use are pre-extracted features from bounding boxes detected by an object detector. The attention map on the image will therefore be at the bounding-box level, i.e., which bounding box was more/less attended to, as opposed to the pixel level. Alternatively, you can use ViLBERT with a grid feature extractor, which will give you more detail in terms of attention coverage. Let me know if this helps.
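For intuition, here is a toy sketch of the shapes involved; the 100 regions and 2048-dimensional features below are typical Faster R-CNN settings assumed for illustration, not values taken from this thread:

```python
import torch

# One feature vector and one box per detected region (assumed typical sizes).
region_feats = torch.rand(100, 2048)   # (num_boxes, feature_dim)
region_boxes = torch.rand(100, 4)      # (num_boxes, [x1, y1, x2, y2])

# Image attention is computed over these region features, so each attention
# weight belongs to a bounding box rather than a pixel; a grid feature
# extractor would instead give roughly one weight per spatial grid cell.
```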
Thanks! Do you have a pretrained ViLBERT model with the grid feature extractor? @ytsheng
@CCYChongyanChen, if possible, can you share the code for future users?
@apsdehal
Step 1: Allow visualization: (1) set … (2) set …
Step 2: Edit the prediction_loop function in evaluation_loop.py (mmf/mmf/trainers/core/evaluation_loop.py, line 83 at commit f024b7b); a rough sketch of dumping the attention from there follows.
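Here is a minimal sketch of what that edit could save, assuming (purely hypothetically) that the model has been configured to return its attention tensors in the batch output under a key such as "attention_weights" and that the output behaves like a dict; the names below are illustrative, not the actual mmf code:

```python
import os
import torch

def dump_attention(model_output, batch_idx, out_dir="attention_dumps"):
    """Save the attention tensors returned for one batch so they can be
    visualized offline (hypothetical key name "attention_weights")."""
    attn = model_output.get("attention_weights")
    if attn is None:
        return
    os.makedirs(out_dir, exist_ok=True)
    # Detach and move to CPU so the dump does not keep GPU memory alive.
    attn = [a.detach().cpu() for a in attn]
    torch.save(attn, os.path.join(out_dir, f"attn_batch_{batch_idx}.pt"))
```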
Step 3: Visualization! Visualize the attention map following #145, using the attention_bbox_interpolation(im, bboxes, att) and visualize_pred(im_path, boxes, att_weights) functions. Note that a small adjustment is needed in attention_bbox_interpolation, since the output attention of ViLBERT ranges over (0, 1); it should be rescaled and resized to the image size. A rough sketch of such an overlay follows.
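This is not the exact code from #145, just a minimal sketch of how a box-level attention overlay can be built, with the rescaling step being one way to handle the (0, 1) range mentioned above:

```python
import cv2
import numpy as np

def attention_overlay(im, bboxes, att, alpha=0.6):
    """im: HxWx3 uint8 image, bboxes: (N, 4) [x1, y1, x2, y2], att: (N,) weights in (0, 1)."""
    h, w = im.shape[:2]
    heat = np.zeros((h, w), dtype=np.float32)
    for (x1, y1, x2, y2), a in zip(np.asarray(bboxes).astype(int), att):
        # Keep the maximum weight wherever boxes overlap.
        heat[y1:y2, x1:x2] = np.maximum(heat[y1:y2, x1:x2], a)
    # Stretch the (0, 1) attention values to the full 0-255 range before colouring.
    heat = (255 * heat / (heat.max() + 1e-8)).astype(np.uint8)
    heat = cv2.applyColorMap(heat, cv2.COLORMAP_JET)
    return cv2.addWeighted(im, 1 - alpha, heat, alpha, 0)
```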
FYI, for some other models, #1052 looks useful for exporting attention weights. I checked it with MMBT and VisualBERT.
❓ Questions and Help
Overall goal: I am trying to extract the visual attention map from ViLBERT to explore where the model is looking in the image.
My questions
Question 1:
I know ViLBERT has three kinds of attention: image attention, text attention, and co-attention. I don't know whether I should go with image attention or co-attention. Currently, I am going with image attention.
Question 2:
I know the image attention outputs 6 tensors, each of size (1, 8, 100, 100). I would like to know (1) what the 8, 100, and 100 represent, (2) which tensor I should select, and (3) how I can visualize the attention map from the image attention weights.
My understanding of Question 2:
According to https://github.com/facebookresearch/mmf/blob/3947693aafcc9cc2a16d7c1c5e1479bf0f88ed4b/mmf/configs/models/vilbert/defaults.yaml, it seems that 8 is the number of attention heads. My guess is that 1 is the batch size (I changed the batch size to 1), and 100 is the image width and height.
If that is correct, then my Question 2 becomes: how do I deal with multiple attention heads?
Possible solution for Question 2:
I know how to visualize an attention map when the attention weights are a 1D or 2D array. For a 4D tensor, I am not sure whether it makes sense to directly use squeeze() to transform it into 2D for visualization, or whether I should average the multi-head attention to get 2D attention weights; a sketch of the averaging option is below.
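For what it's worth, a minimal sketch of the averaging option, assuming the (1, 8, 100, 100) layout is (batch, heads, query regions, key regions):

```python
import torch

att = torch.rand(1, 8, 100, 100)       # stand-in for one of the 6 attention tensors
att_2d = att.squeeze(0).mean(dim=0)    # average over the 8 heads -> (100, 100)

# To get one weight per image region, pick a query row (which row to pick is an
# assumption here, e.g. the first/global token) and normalize it.
region_weights = att_2d[0]
region_weights = region_weights / region_weights.sum()
```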
Other questions
(1) I am worried that the way the image is represented in transformers makes it impossible to visualize the image attention map for ViLBERT.
(2) I got two image attention weights from Pythia; which one should I use for visualization?
Thank you in advance!