In this work, instead of directly predicting pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
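To make the quantization-free regression idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) contrasting a classification head over discretized coordinate bins with a regression head that outputs continuous, normalized (x, y) vertex coordinates at each decoding step; all module names and dimensions are illustrative assumptions.

```python
# Minimal sketch (illustrative, not PolyFormer's actual code): a quantized
# classification head vs. a regression head for polygon vertex prediction.
import torch
import torch.nn as nn


class QuantizedVertexHead(nn.Module):
    """Predicts a vertex by classifying over a discrete coordinate grid,
    which introduces a quantization error of up to half a bin width."""

    def __init__(self, d_model: int = 256, num_bins: int = 1000):
        super().__init__()
        self.cls_x = nn.Linear(d_model, num_bins)
        self.cls_y = nn.Linear(d_model, num_bins)
        self.num_bins = num_bins

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) decoder hidden state at the current step
        x = self.cls_x(h).argmax(-1).float() / (self.num_bins - 1)
        y = self.cls_y(h).argmax(-1).float() / (self.num_bins - 1)
        return torch.stack([x, y], dim=-1)  # normalized (x, y) in [0, 1]


class RegressionVertexHead(nn.Module):
    """Regresses continuous (x, y) coordinates directly, so there is no
    coordinate quantization error."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 2),
            nn.Sigmoid(),  # keep coordinates normalized to [0, 1]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.mlp(h)


if __name__ == "__main__":
    h = torch.randn(4, 256)                   # hypothetical decoder states
    print(QuantizedVertexHead()(h).shape)     # torch.Size([4, 2])
    print(RegressionVertexHead()(h).shape)    # torch.Size([4, 2])
```

In an autoregressive decoder along these lines, the predicted floating-point vertex would be fed back (e.g., re-embedded) as input for the next step, and the resulting closed polygon rasterized into a segmentation mask.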