Automatic height and age estimation of speakers from acoustic features is widely used in applications such as human-computer interaction and forensics. In this work, we propose a novel approach that uses an attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder, which is able to capture long-term dependencies in the input acoustic features. We modify conventional attention (which calculates context vectors as the sum of attention only across timeframes) by introducing a modified context vector that also takes into account the total attention across encoder units, giving us a new cross-attention mechanism. Apart from this, we also investigate a multi-task learning approach for jointly estimating speaker height and age. We train and test our model on the TIMIT corpus. Our model outperforms several approaches in the literature. We achieve a root mean square error (RMSE) of 6.92 cm and 6.34 cm for male and female heights respectively, and an RMSE of 7.85 years and 8.75 years for male and female ages respectively. By tracking the attention weights allocated to different phones, we find that vowel phones are most important while stop phones are least important for the estimation task.
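
The difference between attention across timeframes only and attention that also spans encoder units can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names, the shape of the score matrix `S`, and in particular the choice to combine the two normalisations by multiplying them are hypothetical and should not be read as the model described above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def time_attention_context(H, scores):
    """Conventional attention: one weight per timeframe.

    H      : T x D list of encoder outputs (T timeframes, D encoder units)
    scores : T unnormalised attention scores
    Returns a length-D context vector, the attention-weighted sum over time.
    """
    alpha = softmax(scores)
    n_units = len(H[0])
    return [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(n_units)]


def cross_attention_context(H, S):
    """Hypothetical cross-attention variant (an assumption, not the paper's formula).

    S is a T x D score matrix with one score per (timeframe, encoder unit).
    Scores are normalised across timeframes for each unit and across units
    for each timeframe; the two weights are multiplied before summing over time.
    """
    T, D = len(H), len(H[0])
    over_time = [softmax([S[t][d] for t in range(T)]) for d in range(D)]
    over_units = [softmax(S[t]) for t in range(T)]
    return [
        sum(over_time[d][t] * over_units[t][d] * H[t][d] for t in range(T))
        for d in range(D)
    ]
```

In this sketch both functions return one value per encoder unit, so either context vector could feed the same regression layer for height or age. Only the weighting differs.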