Joint Visual and Audio Learning for Video Highlight Detection

Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, Li Cheng

August 2021

Abstract

In video highlight detection, the goal is to identify the interesting moments within an unedited video. Although the audio component of the video provides important cues for highlight detection, the majority of existing efforts focus almost exclusively on the visual component. In this paper, we argue that both audio and visual components of a video should be modeled jointly to retrieve its best moments. To this end, we propose an audio-visual network for video highlight detection. At the core of our approach lies a bimodal attention mechanism, which captures the interaction between the audio and visual components of a video, and produces fused representations to facilitate highlight detection. Furthermore, we introduce a noise sentinel technique to adaptively discount a noisy visual or audio modality. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach over the state-of-the-art methods.
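The abstract names two ideas: attention between the audio and visual streams, and a noise sentinel that lets the model discount an unreliable modality. The sketch below is one possible illustration of that idea, not the authors' architecture. It assumes scaled dot-product cross-attention, clip-aligned features of equal dimension, and a sentinel vector appended as an extra key that contributes nothing to the output. All function and variable names are hypothetical, and the sentinels are passed in as plain arrays rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend_with_sentinel(query, context, sentinel):
    """Attend from one modality (query) over the other (context).

    query:    (T, d) per-clip features of the attending modality
    context:  (T, d) per-clip features of the other modality
    sentinel: (d,)   extra key (assumption: learned in a real model)

    Attention mass placed on the sentinel is mass taken away from the
    context, so a large sentinel weight means the context is discounted.
    """
    d = query.shape[-1]
    keys = np.vstack([context, sentinel[None, :]])  # (T + 1, d)
    scores = query @ keys.T / np.sqrt(d)            # (T, T + 1)
    weights = softmax(scores, axis=-1)
    attended = weights[:, :-1] @ context            # sentinel adds a zero vector
    sentinel_weight = weights[:, -1]                # (T,) in (0, 1)
    return attended, sentinel_weight


def fuse(visual, audio, visual_sentinel, audio_sentinel):
    """Build a fused per-clip representation from both modalities."""
    v_ctx, discount_audio = cross_attend_with_sentinel(visual, audio, audio_sentinel)
    a_ctx, discount_visual = cross_attend_with_sentinel(audio, visual, visual_sentinel)
    # Residual connection per modality, then concatenate along features.
    fused = np.concatenate([visual + v_ctx, audio + a_ctx], axis=-1)
    return fused, discount_audio, discount_visual


rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))  # 5 clips, 8-dim visual features
audio = rng.normal(size=(5, 8))   # 5 clips, 8-dim audio features
fused, discount_audio, discount_visual = fuse(
    visual, audio, np.zeros(8), np.zeros(8)
)
```

In this sketch the fused features have shape `(5, 16)` and could be fed to a per-clip scoring head. The two discount vectors show, per clip, how much attention was diverted away from the other modality.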

Type

Conference paper

Publication

Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)

ICCV Highlight Detection

Taivanbat Badamdorj

Master Graduate

Li Cheng

Professor