This paper presents a framework for automating the labelling of gestures in musical performance videos with a 3D Convolutional Neural Network (CNN). While this idea was proposed in a previous study, this paper introduces several novelties: (i) it presents a novel method to overcome the class-imbalance challenge and make learning possible for co-existing gestures, using a batch-balancing approach and spatial-temporal representations of gestures; (ii) it performs a detailed study of 7 and 18 categories of gestures generated during video-recorded performances of musical pieces (guitar playing); (iii) it investigates the possibility of using audio features; (iv) it extends the analysis to multiple videos. The novel methods significantly improve gesture-identification performance by 12 percentage points over the previous work (51 % in this study versus 39 % previously). We successfully validate the proposed methods on 7 super-classes (72 %), an ensemble of the 18 gestures/classes, and additional videos (75 %).
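The abstract names a batch-balancing approach for the class-imbalance problem. Below is a minimal sketch of one common way to realize that idea, assuming PyTorch and inverse-class-frequency weighted sampling; the dataset, clip shape, and sampler choice are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of batch balancing via inverse-frequency sampling,
# assuming PyTorch; the dataset, class count, and clip shape are
# illustrative placeholders, not taken from the paper.
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

NUM_CLASSES = 18  # the 18 gesture classes studied in the paper

# Toy long-tailed label distribution standing in for the real annotations.
rng = np.random.default_rng(0)
labels = rng.choice(NUM_CLASSES, size=200, p=rng.dirichlet(np.ones(NUM_CLASSES)))
clips = torch.randn(200, 3, 8, 32, 32)  # (N, C, T, H, W) input to a 3D CNN
dataset = TensorDataset(clips, torch.from_numpy(labels))

# Weight every sample by the inverse frequency of its class so that rare
# gestures are drawn about as often as common ones when batches are formed.
class_counts = np.bincount(labels, minlength=NUM_CLASSES).astype(np.float64)
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(
    weights=torch.from_numpy(sample_weights),
    num_samples=len(dataset),
    replacement=True,
)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)
for batch_clips, batch_labels in loader:
    pass  # each batch is now approximately balanced across gesture classes
```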