Metric learning has received conflicting assessments of its suitability for instance segmentation. It has been dismissed as theoretically flawed because the CNNs employed are shift equivariant and hence unable to distinguish same-looking objects. Yet it has been shown to yield state-of-the-art results for a variety of tasks, and practical issues have mainly been reported in the context of tile-and-stitch approaches, where discontinuities at tile boundaries have been observed. To date, neither of the reported issues has undergone thorough formal analysis. In our work, we contribute a comprehensive formal analysis of the shift equivariance properties of encoder-decoder-style CNNs, which yields a clear picture of what can and cannot be achieved with metric learning in the face of same-looking objects. In particular, we prove that a standard encoder-decoder network that takes $d$-dimensional images as input, with $l$ pooling layers and pooling factor $f$, has the capacity to distinguish at most $f^{dl}$ same-looking objects, and we show that this upper limit can be reached. Furthermore, we show that to avoid discontinuities in a tile-and-stitch approach, assuming the standard batch size of 1, it is necessary to employ valid convolutions in combination with a training output window size strictly greater than $f^l$, while at test time it is necessary to crop tiles to size $n\cdot f^l$ before stitching, with $n\geq 1$. We complement these theoretical findings by discussing a number of insightful special cases for which we show empirical results on synthetic data.
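As a rough illustration of the quantities stated above, the following sketch evaluates the bound $f^{dl}$ and the tile-size conditions for a given architecture. The function names are hypothetical and chosen only for this example, and the sketch assumes that the window and tile sizes are measured in pixels along each spatial dimension; it is plain arithmetic on the stated formulas, not the authors' implementation.

```python
def max_distinguishable_objects(d: int, l: int, f: int) -> int:
    """Upper bound f**(d*l) on the number of same-looking objects that a
    network with l pooling layers of factor f can distinguish in
    d-dimensional images."""
    return f ** (d * l)


def min_training_output_size(l: int, f: int) -> int:
    """Smallest integer training output window size that is strictly
    greater than f**l (assumed to apply per spatial dimension)."""
    return f ** l + 1


def stitch_tile_size(output_size: int, l: int, f: int) -> int:
    """Largest size n * f**l (n >= 1) that fits into a network output of
    the given size; tiles are cropped to this size before stitching."""
    period = f ** l
    n = output_size // period
    if n < 1:
        raise ValueError("network output is smaller than f**l; cannot crop")
    return n * period


# Example: 2D images, 3 pooling layers, pooling factor 2.
print(max_distinguishable_objects(d=2, l=3, f=2))  # 2**(2*3) = 64
print(min_training_output_size(l=3, f=2))          # 2**3 + 1 = 9
print(stitch_tile_size(output_size=100, l=3, f=2)) # 12 * 8 = 96
```

For this example configuration, the bound is 64 objects, the training output window must be at least 9 pixels wide, and a 100-pixel output would be cropped to 96 pixels before stitching.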