Exploring CNN-Based Architectures for Multimodal Salient Event Detection in Videos

Petros Koutras, Athanasia Zlatinsi, Petros Maragos

Year: 2018
Citations: 7

Abstract

Multimodal attention plays a significant role in many machine-based understanding, computer vision, and robotic applications, such as action recognition and summarization. In this paper, we address the problem of audio-visual salient event detection by employing modern Convolutional Neural Network (CNN) based architectures for the visual and audio modalities. This extends our previous work, which examined a hand-crafted frontend: an energy-based synergistic approach in which a nonparametric classification technique was used to classify salient vs. non-salient events. Our comparative evaluations over the COGNIMUSE database [1], which consists of movies and travel documentaries together with ground-truth data denoting the perceptually mono- and multimodal salient events, provide strong evidence that the CNN-based approach, even in this task, outperforms the hand-crafted frontend in almost all cases and for all modalities (i.e., audio, visual, and audiovisual), achieving good average results.

Keywords

Automatic summarization, Computer science, Salient, Convolutional neural network, Modalities, Artificial intelligence, Event (particle physics), Audio visual, Task (project management), Visualization
