Hand reconstruction has achieved great success in real-time applications such as virtual reality and augmented reality, while reconstructing two interacting hands with efficient transformers remains unexplored. In this paper, we propose a method called Lightweight Attention Hand (LWA-HAND) that reconstructs hands with low FLOPs from a single RGB image. To address the occlusion and interaction challenges in efficient attention architectures, we introduce three mobile attention modules. The first is a lightweight feature attention module that extracts both local occlusion representations and global image patch representations in a coarse-to-fine manner. The second is a cross image and graph bridge module that fuses image context with hand vertex features. The third is a lightweight cross-attention mechanism that uses element-wise operations to attend across the two hands with linear complexity. The resulting model achieves performance comparable to state-of-the-art models on the InterHand2.6M benchmark, while reducing the computation to $0.47$ GFLOPs, compared with the heavy $10$ to $20$ GFLOPs required by state-of-the-art models.
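To make the third module concrete, below is a minimal sketch of an element-wise cross-attention between two hands with linear complexity. The abstract only states that element-wise operations replace the usual $O(N^2)$ query-key similarity, so the specific formulation here (pooling one hand into a single context vector and gating the other hand element-wise) is an assumption for illustration, not the authors' implementation; the class and variable names are hypothetical.

```python
# Sketch: linear-complexity cross-attention via element-wise operations.
# Assumption: each hand's other-hand context is summarized into one pooled
# vector (O(N) work), then applied by element-wise gating, avoiding the
# O(N^2) token-to-token attention matrix.
import torch
import torch.nn as nn


class ElementwiseCrossAttention(nn.Module):
    """Attends features of one hand to the other hand's pooled context."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_context = nn.Linear(dim, dim)  # scores tokens of the other hand
        self.to_gate = nn.Linear(dim, dim)     # produces element-wise gates
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x, other: (batch, num_tokens, dim) features of the two hands.
        # Pool the other hand into a single context vector: O(N) cost.
        weights = self.to_context(other).softmax(dim=1)
        context = (weights * other).sum(dim=1, keepdim=True)  # (batch, 1, dim)
        # Element-wise gating of x by the pooled context: no N x N matrix.
        gated = torch.sigmoid(self.to_gate(x)) * context
        return x + self.proj(gated)


if __name__ == "__main__":
    # Hypothetical sizes: 778 MANO-like vertex tokens, 64-dim features.
    left = torch.randn(2, 778, 64)
    right = torch.randn(2, 778, 64)
    attn = ElementwiseCrossAttention(64)
    print(attn(left, right).shape)  # torch.Size([2, 778, 64])
```

Because the other hand is reduced to one context vector before interaction, the cost grows linearly with the number of tokens, which is consistent with the complexity claim in the abstract.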