This paper presents U-LanD, a framework for joint detection of key frames and landmarks in videos. We tackle a particularly challenging problem in which training labels are noisy and highly sparse. U-LanD builds upon a pivotal observation: a deep Bayesian landmark detector trained solely on key video frames has significantly lower predictive uncertainty on those frames than on the other frames of a video. We use this observation as an unsupervised signal to automatically recognize the key frames, on which we then detect landmarks. As a test-bed for our framework, we use ultrasound imaging videos of the heart, where sparse and noisy clinical labels are available for only a single frame in each video. Using data from 4,493 patients, we demonstrate that U-LanD substantially outperforms the state-of-the-art non-Bayesian counterpart, by an absolute margin of 42% in R² score, with almost no overhead imposed on the model size. Our approach is generic and can potentially be applied to other challenging data with noisy and sparse training labels.
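The key-frame selection idea can be illustrated with a minimal sketch. It is not the U-LanD implementation: it assumes that predictive uncertainty is estimated from repeated stochastic forward passes of the detector (for example, with dropout kept active at test time), that the spread across passes serves as the uncertainty score, and that the frame with the lowest score is taken as the key frame. The function name `select_key_frame` and the array layout are hypothetical.

```python
import numpy as np


def select_key_frame(samples):
    """Pick the frame with the lowest predictive uncertainty.

    samples: array of shape (n_samples, n_frames, n_landmarks, 2) holding
    landmark coordinates from repeated stochastic forward passes of a
    detector (an assumed stand-in for a deep Bayesian model).
    """
    samples = np.asarray(samples, dtype=float)
    # Variance across stochastic passes, averaged over landmarks and
    # coordinates, gives one uncertainty score per frame.
    uncertainty = samples.var(axis=0).mean(axis=(1, 2))
    # Assumption: the key frame is the one the detector is most certain about.
    key = int(np.argmin(uncertainty))
    # Landmark estimate on the key frame: mean over the stochastic passes.
    landmarks = samples[:, key].mean(axis=0)
    return key, landmarks, uncertainty
```

Calling `select_key_frame` on the stacked predictions for one video returns the index of the selected frame, the averaged landmark coordinates on that frame, and the per-frame uncertainty scores. A threshold on the scores, rather than a single minimum, would be a natural variant when a video contains several key frames.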