Exploring CNN-Based Architectures for Multimodal Salient Event Detection in Videos
Petros Koutras, Athanasia Zlatinsi, Petros Maragos
- Publication year
- 2018
- Citations
- 7
Abstract
Multimodal attention plays a significant role in many machine-based understanding, computer vision, and robotics applications, such as action recognition and summarization. In this paper, we present our approach to audio-visual salient event detection, employing modern Convolutional Neural Network (CNN) based architectures for the visual and audio modalities. We thereby extend our previous work, which examined a hand-crafted frontend: an energy-based synergistic approach in which a nonparametric classification technique was used to classify salient vs. non-salient events. Our comparative evaluations over the COGNIMUSE database [1], which consists of movies and travel documentaries together with ground-truth data denoting the perceptually mono- and multimodal salient events, provide strong evidence that the CNN-based approach, even in this task, outperforms the hand-crafted frontend in almost all cases and for all modalities (i.e., audio, visual, and audiovisual), achieving good average results.