Vision-based pose estimators perform impressively, but they generally degrade under adverse visual conditions and often do not satisfy customers' privacy requirements. As a result, researchers have begun to study tactile sensing systems as an alternative. However, these systems suffer from noisy and ambiguous recordings. To tackle this problem, we propose a novel method for pose estimation from ambiguous pressure data. Our method comprises a spatio-temporal vision transformer with an encoder-decoder architecture. Detailed experiments on two popular public datasets show that our model outperforms existing solutions in this area. Moreover, we observe that increasing the number of temporal crops in the early stages of the network improves performance, and that pre-training the network in a self-supervised setting with a masked auto-encoder approach improves the results further.
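To make the masked auto-encoder pre-training idea concrete, the following is a minimal sketch of how a pressure-map sequence could be divided into spatio-temporal patches and randomly masked before encoding. It is not the implementation described above: the function names `tubelet_grid` and `random_mask`, the sequence and patch sizes, and the 0.75 mask ratio are all hypothetical choices made for illustration.

```python
import random


def tubelet_grid(frames, height, width, t_size, p_size):
    """Return the (t, row, col) origin of each non-overlapping spatio-temporal patch.

    All sizes are illustrative assumptions, not values from the paper.
    """
    return [
        (t, r, c)
        for t in range(0, frames - t_size + 1, t_size)
        for r in range(0, height - p_size + 1, p_size)
        for c in range(0, width - p_size + 1, p_size)
    ]


def random_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (visible, masked) lists, in the style of a masked auto-encoder."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical input: 8 frames of a 64 x 32 pressure map, patches of 2 frames x 8 x 8 cells.
tokens = tubelet_grid(frames=8, height=64, width=32, t_size=2, p_size=8)
visible, masked = random_mask(len(tokens), mask_ratio=0.75, seed=0)
```

In a setup like this, only the visible tokens would be passed to the encoder, and the decoder would be trained to reconstruct the pressure values of the masked ones.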