In this paper, we present a method to detect hand-object interaction from an egocentric perspective. In contrast to massive data-driven, discriminator-based methods such as \cite{Shan20}, we propose a novel workflow that exploits cues from both the hand and the object. Specifically, we train networks that predict hand pose, the hand mask and the in-hand object mask, and combine their outputs to jointly predict the hand-object interaction (HOI) status. We compare our method with the recent work of Shan et al. \cite{Shan20} on selected images from the EPIC-KITCHENS dataset \cite{damen2018scaling} and achieve $89\%$ accuracy on HOI detection, which is comparable to Shan's ($92\%$). In terms of real-time performance, however, our method runs at over $\textbf{30}$ FPS on the same machine, which is far more efficient than Shan's ($\textbf{1}\sim\textbf{2}$ FPS). Furthermore, our approach allows us to segment script-less activities by extracting the frames detected as HOI. We achieve $\textbf{68.2\%}$ and $\textbf{82.8\%}$ F1 scores on the GTEA \cite{fathi2011learning} and UTGrasp \cite{cai2015scalable} datasets respectively, both of which are comparable to state-of-the-art methods.
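As a rough illustration (not a description of the paper's actual pipeline), the three per-frame cues could be fused into a binary interaction decision as in the following Python sketch; the \texttt{hoi\_status} helper, the overlap/proximity rule and the thresholds are illustrative assumptions rather than the decision logic used in our networks.

\begin{verbatim}
import numpy as np

def hoi_status(hand_mask, object_mask, fingertip_xy,
               overlap_thresh=0.02, dist_thresh=10.0):
    """Toy rule: report hand-object interaction when the in-hand object
    mask overlaps the hand mask, or when a fingertip keypoint lies close
    to an object pixel.

    hand_mask, object_mask : HxW boolean arrays from the segmentation nets
    fingertip_xy           : (K, 2) float array of fingertip keypoints (x, y)
    """
    # Cue 1: mask overlap -- fraction of object pixels lying on the hand.
    if object_mask.any():
        overlap = np.logical_and(hand_mask, object_mask).sum() / object_mask.sum()
        if overlap > overlap_thresh:
            return True
    # Cue 2: fingertip proximity -- any fingertip near an object pixel.
    ys, xs = np.nonzero(object_mask)
    if len(xs) == 0 or len(fingertip_xy) == 0:
        return False
    obj_pts = np.stack([xs, ys], axis=1).astype(np.float32)
    dists = np.linalg.norm(fingertip_xy[:, None, :] - obj_pts[None, :, :], axis=-1)
    return bool(dists.min() < dist_thresh)
\end{verbatim}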