Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations stemming from user discomfort, privacy concerns, and limited scalability. We explore exocentric video paired with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment, encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) an inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, which misaligns actions that share similar temporal patterns but differ in spatio-semantic context. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features, discovered via online clustering, provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach first establishes spatial correspondence through mutual supervision and then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
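To make the idea of adaptive negative weighting concrete, the following is a minimal PyTorch sketch of one way such a weighted contrastive objective could be realized; the function name, the `false_neg_margin` parameter, and the specific weighting scheme are illustrative assumptions and not the exact loss used in DETACH.

```python
# Illustrative sketch (not the paper's exact formulation): an InfoNCE-style
# contrastive loss in which each negative pair receives a weight derived from
# its similarity. Easy negatives (low similarity) get small weights, hard
# negatives (high similarity) are emphasized, and suspected false negatives
# (negatives almost as similar as the positive) are suppressed.
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(video_emb, sensor_emb, temperature=0.07,
                              false_neg_margin=0.05):
    """video_emb, sensor_emb: (B, D) embeddings of paired video/sensor clips."""
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(sensor_emb, dim=-1)
    sim = v @ s.t() / temperature                      # (B, B) similarity logits
    pos = sim.diag().unsqueeze(1)                      # positive logits, (B, 1)

    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hardness weights: softmax over negatives, so higher-similarity negatives
    # contribute more and easy negatives are down-weighted.
    hardness = torch.softmax(sim.masked_fill(~neg_mask, float('-inf')), dim=1)
    # Suspected false negatives: negatives nearly as similar as the positive.
    false_neg = (sim > pos - false_neg_margin) & neg_mask
    weights = hardness.masked_fill(false_neg, 0.0)     # drop likely false negatives

    # Weighted InfoNCE: positive vs. re-weighted sum of negative similarities.
    neg_term = (weights * sim.exp()).sum(dim=1)
    loss = -(pos.squeeze(1) - torch.log(pos.exp().squeeze(1) + neg_term + 1e-8))
    return loss.mean()
```

In this sketch the same weight matrix serves both roles: the softmax over negative similarities handles the easy/hard distinction, while the margin test against the positive similarity zeroes out pairs that are likely semantic matches mislabeled as negatives.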