With ever-increasing dataset sizes, subset selection techniques are becoming increasingly important for a wide range of tasks. It is often necessary to guide the subset selection toward certain desiderata, such as targeting some data points while avoiding others. Examples of such problems include: (i) targeted learning, where the goal is to find subsets containing rare classes or rare attributes on which the model is underperforming, and (ii) guided summarization, where data (e.g., an image collection, text, a document, or a video) is summarized for quicker human consumption according to additional, user-specified intent. Motivated by such applications, we present PRISM, a rich class of PaRameterIzed Submodular information Measures. Through novel functions and their parameterizations, PRISM offers a variety of modeling capabilities that enable a trade-off between desired qualities of a subset, such as diversity or representation, and its similarity or dissimilarity to a given set of data points. We demonstrate how PRISM can be applied to the two real-world problems mentioned above, both of which require guided subset selection. In doing so, we show that PRISM generalizes some past work, which reinforces its broad utility. Through extensive experiments on diverse datasets, we demonstrate the superiority of PRISM over the state of the art in targeted learning and in guided image-collection summarization.
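The trade-off described above, between representing the whole dataset and staying close to a set of target points, can be illustrated with a minimal sketch. The code below is not one of the PRISM measures; it is a generic facility-location-style objective maximized greedily, and the names `similarity`, `objective`, `greedy_select`, and the trade-off weight `eta` are assumptions introduced purely for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def similarity(x: Point, y: Point) -> float:
    """Illustrative similarity in (0, 1]: 1 / (1 + squared Euclidean distance)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 / (1.0 + d2)


def objective(selected: List[int], data: List[Point],
              queries: List[Point], eta: float) -> float:
    """Representation of the dataset plus eta times relevance to the queries.

    Both terms are facility-location sums (sum of per-point maxima), so the
    combined function is monotone submodular for eta >= 0.
    """
    if not selected:
        return 0.0
    representation = sum(
        max(similarity(x, data[j]) for j in selected) for x in data
    )
    relevance = sum(
        max(similarity(q, data[j]) for j in selected) for q in queries
    )
    return representation + eta * relevance


def greedy_select(data: List[Point], queries: List[Point],
                  budget: int, eta: float) -> List[int]:
    """Standard greedy maximization: repeatedly add the largest marginal gain."""
    selected: List[int] = []
    for _ in range(min(budget, len(data))):
        current = objective(selected, data, queries, eta)
        best_gain, best_idx = None, None
        for j in range(len(data)):
            if j in selected:
                continue
            gain = objective(selected + [j], data, queries, eta) - current
            if best_gain is None or gain > best_gain:
                best_gain, best_idx = gain, j
        selected.append(best_idx)
    return selected


# Hypothetical toy data: two dense clusters and one isolated "rare" point.
toy_data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
toy_queries = [(10.0, 0.0)]  # the user's target lies near the rare point

print(greedy_select(toy_data, toy_queries, budget=2, eta=0.0))
print(greedy_select(toy_data, toy_queries, budget=2, eta=10.0))
```

With `eta = 0` the selection ignores the query and only tries to cover the dataset; raising `eta` pulls the selection toward points similar to the query set, which is the kind of guidance that targeted learning and guided summarization call for.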