We introduce a novel privacy-preserving methodology for performing Visual Question Answering on the edge. Our method constructs a symbolic representation of the visual scene using a low-complexity computer vision model that jointly predicts classes, attributes, and predicates. This symbolic representation is non-differentiable, so it cannot be used to recover the original image, which therefore remains private. Our proposed hybrid solution uses a vision model that is more than 25 times smaller than current state-of-the-art (SOTA) vision models, and 100 times smaller than end-to-end SOTA VQA models. We report a detailed error analysis and discuss the trade-offs of using a distilled vision model and a symbolic representation of the visual scene.