AVOT: Audio-Visual Object Tracking of Multiple Objects for Robotics
Justin Wilson, Ming C. Lin
- 发表年份
- 2020
- 引用次数
- 22
摘要
Existing state-of-the-art object tracking can run into challenges when objects collide, occlude, or come close to one another. These visually based trackers may also fail to differentiate between objects with the same appearance but different materials. Existing methods may stop tracking or incorrectly start tracking another object. These failures are uneasy for trackers to recover from since they often use results from previous frames. By using audio of the impact sounds from object collisions, rolling, etc., our audio-visual object tracking (AVOT) neural network can reduce tracking error and drift. We train AVOT end to end and use audio-visual inputs over all frames. Our audio-based technique may be used in conjunction with other neural networks to augment visually based object detection and tracking methods. We evaluate its runtime frames-per-second (FPS) performance and intersection over union (IoU) performance against OpenCV object tracking implementations and a deep learning method. Our experiments, using the synthetic Sound-20K audio-visual dataset, demonstrate that AVOT outperforms single-modality deep learning methods, when there is audio from object collisions. A proposed scheduler network to switch between AVOT and other methods based on audio onset maximizes accuracy and performance over all frames in multimodal object tracking.
关键词
相关论文
Statistical Learning Theory
Yuhai Wu, Vladimir Vapnik
1999
Artificial intelligence: a modern approach
1995
Applied Nonlinear Control
Jean-Jacques Slotine, Weiping Li
1991
A new optimizer using particle swarm theory
R.C. Eberhart, James Kennedy
2002