We present Long Short-term TRansformer (LSTR), a temporal modeling algorithm for online action detection, which employs a long- and short-term memory mechanism to model prolonged sequence data. It consists of an LSTR encoder that dynamically leverages coarse-scale historical information from an extended temporal window (e.g., 2048 frames spanning up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 frames spanning 8 seconds) to model the fine-scale characteristics of the data. Compared to prior work, LSTR provides an effective and efficient method to model long videos with fewer heuristics, which is validated by extensive empirical analysis. LSTR achieves state-of-the-art performance on three standard online action detection benchmarks: THUMOS'14, TVSeries, and HACS Segment.
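The two-stage design described above can be sketched in simplified form: a long-memory encoder compresses an extended window of frame features into a small set of latent tokens via cross-attention, and a short-window decoder then attends to that compressed memory. This is a minimal NumPy illustration, not the paper's implementation; the latent-query compression, single-head attention, random (untrained) queries, and the `lstr_sketch` function name are all simplifying assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory):
    # Scaled dot-product attention: each query token attends over all
    # memory tokens and returns a weighted sum of them.
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)
    return softmax(scores) @ memory

def lstr_sketch(long_frames, short_frames, n_latents=16, seed=0):
    # Stage 1 (encoder, simplified): compress the long-term memory
    # (e.g. 2048 frame features) into a few latent tokens. Real LSTR
    # learns these queries; random ones are used here for illustration.
    rng = np.random.default_rng(seed)
    latent_queries = rng.standard_normal((n_latents, long_frames.shape[-1]))
    compressed = cross_attention(latent_queries, long_frames)
    # Stage 2 (decoder, simplified): the short window (e.g. 32 frames)
    # attends to the compressed long-term memory, yielding per-frame
    # features that fuse fine-scale and coarse-scale context.
    return cross_attention(short_frames, compressed)
```

Because the short window attends to a fixed number of latent tokens rather than to all 2048 raw frames, the per-step cost of the decoder stays constant regardless of how long the historical window grows.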