Embeddings are one of the fundamental building blocks for data analysis tasks. They are already essential tools for large language models and image analysis, and their use is being extended to many other research domains. Generating these distributed representations is often expensive in both data and computation; yet the holistic analysis and adjustment of embeddings after they have been created is still a developing area. In this paper, we first propose a very general quantitative measure for the presence of a feature in embedding data, based on whether the feature can be learned from the embedding. We then devise a method to remove or alleviate undesired features in the embedding while retaining the essential structure of the data. We use a Domain Adversarial Network (DAN) to generate a non-affine transformation, adding constraints to ensure that the essential structure of the embedding is preserved. Our empirical results demonstrate that the proposed algorithm significantly outperforms the state-of-the-art unsupervised algorithm on several data sets, including novel applications from industry.
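The idea of measuring a feature's presence by whether it can be learned can be illustrated with a simple probe. The sketch below is not the paper's implementation: it assumes a binary feature, uses a plain logistic-regression probe, and rescales held-out accuracy against the majority-class baseline. The names `train_probe`, `feature_presence`, and `make_synthetic`, the synthetic data, and the rescaling are all hypothetical choices made for illustration. Zeroing one coordinate stands in for feature removal here; it is not the constrained DAN transformation described above.

```python
import math
import random


def train_probe(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent (illustrative)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def feature_presence(X, y, seed=0):
    """Held-out probe accuracy, rescaled so 0 is the majority baseline and 1 is perfect.

    This rescaling is an assumption of the sketch, not the paper's definition.
    """
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    w, b = train_probe([X[i] for i in train], [y[i] for i in train])
    correct = 0
    for i in test:
        z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
        correct += int((z > 0) == (y[i] == 1))
    acc = correct / len(test)
    positives = sum(y[i] for i in test)
    baseline = max(positives, len(test) - positives) / len(test)
    if baseline >= 1.0:  # only one class in the held-out split: nothing to learn
        return 0.0
    return max(0.0, (acc - baseline) / (1.0 - baseline))


def make_synthetic(n, seed=0, dim=4, leak=2.0):
    """Gaussian 'embeddings' whose first coordinate leaks a binary feature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += leak * (2 * label - 1)
        X.append(x)
        y.append(label)
    return X, y


X, y = make_synthetic(400, seed=1)
# Crude stand-in for feature removal: zero the leaking coordinate.
X_cleaned = [[0.0] + x[1:] for x in X]

print("presence before removal:", round(feature_presence(X, y), 3))
print("presence after removal: ", round(feature_presence(X_cleaned, y), 3))
```

Under these assumptions, a score near 1 means the probe recovers the feature from the embedding almost perfectly, and a score near 0 means it does no better than always guessing the majority class. A stronger probe, or cross-validation in place of a single split, would tighten the estimate.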