Exploring CNN-Based Architectures for Multimodal Salient Event Detection in Videos

Petros Koutras, Athanasia Zlatinsi, Petros Maragos

Year: 2018
Citations: 7

Abstract

Multimodal attention plays a significant role in many machine-based understanding, computer vision, and robotics applications, such as action recognition and summarization. In this paper, we address audio-visual salient event detection using modern Convolutional Neural Network (CNN) based architectures for the visual and audio modalities. This extends our previous work, which examined a hand-crafted frontend, namely an energy-based synergistic approach, and used a nonparametric classification technique to separate salient from non-salient events. Comparative evaluations on the COGNIMUSE database [1], which consists of movies and travel documentaries together with ground-truth annotations of the perceptually salient mono- and multimodal events, provide strong evidence that the CNN-based approach outperforms the hand-crafted frontend in almost all cases across all modalities (i.e., audio, visual, and audiovisual), with good average results.

Keywords

Automatic summarization, Computer science, Salient, Convolutional neural network, Modalities, Artificial intelligence, Event (particle physics), Audio visual, Task (project management), Visualization
