# AVVP-Learning-List

## Introduction

This repository introduces the Weakly-supervised Audio-Visual Video Parsing (AVVP) and Audio-Visual Event Localization (AVEL) tasks and collects related works.

## Content

## Definition

- Weakly-supervised AVVP: Weakly-supervised Audio-Visual Video Parsing is a task that aims to parse a video into temporal event segments and label each event as audible, visible, or both (see the sketch after this list).

- AVE: Audio-Visual Event Localization (AVEL) defines an audio-visual event as an event that is both audible and visible in a video segment. It comprises fully- and weakly-supervised audio-visual event localization tasks and a cross-modality localization task. The former aims to predict the event label for each video segment, while the latter aims to locate the segment in one modality (visual/auditory) that corresponds to a given segment of synchronized content in the other modality (auditory/visual). The cross-modality localization task includes visual localization from audio (A2V) and audio localization from visual content (V2A).
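
As a rough illustration of the difference between the weak training labels and the desired parsing output in AVVP, here is a minimal sketch; the video ID, event names, timings, and dictionary layout are made up for illustration and are not taken from any dataset.

```python
# Minimal sketch of the weakly-supervised AVVP setting (illustrative only;
# the event names, timings, and structure are assumptions, not a dataset format).

# Training supervision: only a set of video-level event tags per 10-second video,
# with no information about modality or temporal extent.
weak_label = {"video_id": "example_video", "events": ["Dog", "Speech"]}

# Desired parsing output: each event labeled as audible, visible, or both,
# together with its temporal segment (in seconds).
parsed_output = {
    "video_id": "example_video",
    "events": [
        {"category": "Dog",    "modality": "audio",        "start": 0, "end": 4},
        {"category": "Dog",    "modality": "visual",       "start": 2, "end": 6},
        {"category": "Speech", "modality": "audio-visual", "start": 5, "end": 10},
    ],
}
```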

## Datasets

- LLP: The Look, Listen, and Parse (LLP) dataset is the only available dataset for the AVVP task. LLP contains 11,849 YouTube video clips collected from AudioSet, spanning 25 categories for a total of 32.9 hours. It covers a wide range of video events (e.g., human speaking, singing, baby crying, dog barking, violin playing, car running, and vacuum cleaning) from diverse domains (e.g., human activities, animal activities, music performances, vehicle sounds, and domestic environments). Each video is 10 s long and contains at least 1 s of audio or visual events. 7,202 videos contain events from more than one event category, and each video has 1.64 different event categories on average. For 1,849 randomly selected videos, individual audio and visual events are annotated with second-wise temporal boundaries, yielding 6,626 event annotations in total (4,131 audio events and 2,495 visual events) and 2,488 audio-visual event annotations. The training set consists of 10,000 videos with weak (video-level) labels, while the validation and testing sets have 649 and 1,200 fully annotated videos, respectively.
- AVE: The Audio-Visual Event (AVE) dataset is the only available dataset for the AVEL task. It is a subset of AudioSet and contains 4,143 videos covering 28 event categories, temporally labeled with audio-visual event boundaries. Each video contains at least one audio-visual event that is at least 2 s long. The dataset covers a wide range of audio-visual events (e.g., man speaking, woman speaking, dog barking, playing guitar, and frying food) from different domains, such as human activities, animal activities, music performances, and vehicle sounds. Each event category contains between 60 and 188 videos, and 66.4% of the videos in AVE contain audio-visual events that span the full 10 seconds.

## Evaluation Metrics

- Weakly-supervised Audio-Visual Video Parsing
  - F-scores are used as metrics at both the segment level and the event level on individual audio, visual, and audio-visual events. Segment-level F-scores evaluate snippet-wise event labeling performance. Event-level F-scores evaluate the ability to extract whole events by concatenating consecutive positive snippets of the same event category, using mIoU = 0.5 as the threshold (illustrative sketches of these metrics follow this list).
  - Type@AV and Event@AV are used to evaluate the overall performance. Type@AV averages the audio, visual, and audio-visual event evaluation results. Event@AV computes the F-scores considering all audio and visual events for each sample.
- Audio-Visual Event Localization
  - Supervised audio-visual event localization (SEL): The overall accuracy of the category prediction for each one-second segment is used as the evaluation metric; "background" is also a category in this classification task.
  - Cross-modality localization (CML): The percentage of correct matchings is used as the evaluation metric. A matching is correct only if the matched audio/visual segment is exactly the same as its ground truth; otherwise, it is counted as incorrect (see the sketches after this list).
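
As referenced above, the following is a simplified sketch of how segment-level and event-level F-scores could be computed for a single event category in a single modality. It is not the official evaluation code of the LLP benchmark; the 1-second snippet granularity and the mIoU = 0.5 threshold come from the task description, while the function names, toy labels, and the exact true-positive counting are assumptions.

```python
import numpy as np

def segments_to_events(labels):
    """Group consecutive positive 1-second snippets into (start, end) events."""
    events, start = [], None
    for t, v in enumerate(labels):
        if v and start is None:
            start = t
        elif not v and start is not None:
            events.append((start, t))
            start = None
    if start is not None:
        events.append((start, len(labels)))
    return events

def segment_f1(pred, gt):
    """Segment-level F-score: snippet-wise precision/recall on binary labels."""
    tp = np.sum(pred & gt)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

def event_f1(pred, gt, miou_threshold=0.5):
    """Event-level F-score: a predicted event counts as a true positive if it
    overlaps some ground-truth event of the same category with IoU >= threshold."""
    pred_events, gt_events = segments_to_events(pred), segments_to_events(gt)

    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    tp = sum(any(iou(p, g) >= miou_threshold for g in gt_events) for p in pred_events)
    precision = tp / max(len(pred_events), 1)
    recall = tp / max(len(gt_events), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Toy example: 10 one-second snippets of one event category in one modality.
pred = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=bool)
gt   = np.array([0, 1, 1, 0, 0, 0, 1, 1, 1, 0], dtype=bool)
print(segment_f1(pred, gt), event_f1(pred, gt))
```

Under this sketch, Type@AV would average such scores over the audio, visual, and audio-visual results, while Event@AV would pool all audio and visual events of each sample before computing the F-score.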
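
Similarly, a rough sketch of the two AVEL metrics, assuming per-second category predictions for SEL and predicted segment positions for CML; the array shapes, class indices, and toy values are assumptions for illustration only.

```python
import numpy as np

def sel_accuracy(pred_labels, gt_labels):
    """SEL: overall accuracy of the per-second category predictions,
    where background is treated as just another category index."""
    return float(np.mean(np.asarray(pred_labels) == np.asarray(gt_labels)))

def cml_accuracy(pred_positions, gt_positions):
    """CML: percentage of correct matchings; a matching is correct only if
    the localized segment exactly equals its ground-truth position."""
    return float(np.mean(np.asarray(pred_positions) == np.asarray(gt_positions)))

# Toy example: category indices for ten 1-second segments of one video (SEL),
# and predicted vs. ground-truth segment positions for three A2V/V2A queries (CML).
print(sel_accuracy([3, 3, 3, 0, 0, 5, 5, 5, 5, 0], [3, 3, 0, 0, 0, 5, 5, 5, 5, 0]))  # 0.9
print(cml_accuracy([2, 7, 4], [2, 6, 4]))  # ~0.667
```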

## Related Works

### Weakly-supervised Audio-Visual Video Parsing

### Audio-Visual Event Localization

## Reference