Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification

\emph{Funnelling} (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this paper we describe \emph{Generalized Funnelling} (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary \emph{view-generating functions}, i.e., language-dependent functions that each produce a language-independent representation ("view") of the (monolingual) document. We describe an instance of gFun in which the metaclassifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated with other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by \emph{Word-Class Embeddings}), word-word correlations (as encoded by \emph{Multilingual Unsupervised or Supervised Embeddings}), and word-context correlations (as encoded by \emph{multilingual BERT}). By reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification, we show that this instance of gFun substantially improves over Fun and over state-of-the-art baselines. Our code that implements gFun is publicly available.
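
The two-tier idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it is a toy in pure Python, and every name in it (`NaiveBayesView`, `CentroidMeta`, `classify`, the tiny English/Italian corpus) is hypothetical. It also simplifies in three ways that should be kept in mind: it is single-label rather than multilabel, it uses smoothed naive Bayes posteriors as a stand-in for properly calibrated probabilities, and it swaps in a nearest-centroid rule as the metaclassifier. It further assumes that the metaclassifier can be trained directly on the 1st-tier outputs for the training documents.

```python
import math
from collections import Counter


class NaiveBayesView:
    """1st-tier, language-specific component (hypothetical stand-in).

    Maps a tokenized document in one language to a vector of class
    posteriors, i.e. a language-independent "view" of the document.
    """

    def __init__(self, classes):
        self.classes = classes

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
        return self

    def posteriors(self, doc):
        n_docs = sum(self.prior.values())
        log_p = []
        for c in self.classes:
            # Laplace-smoothed multinomial naive Bayes, in log space.
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c] / n_docs)
            for w in doc:
                if w in self.vocab:  # ignore out-of-vocabulary tokens
                    lp += math.log((self.counts[c][w] + 1) / total)
            log_p.append(lp)
        # Normalize with the max-shift trick so the vector sums to 1.
        m = max(log_p)
        exps = [math.exp(lp - m) for lp in log_p]
        z = sum(exps)
        return [e / z for e in exps]


class CentroidMeta:
    """2nd-tier metaclassifier (assumption: nearest centroid, for brevity)."""

    def fit(self, vectors, labels):
        counts = Counter(labels)
        sums = {}
        for v, y in zip(vectors, labels):
            acc = sums.setdefault(y, [0.0] * len(v))
            for i, x in enumerate(v):
                acc[i] += x
        self.centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
        return self

    def predict(self, v):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(v, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


CLASSES = ["sports", "finance"]

# Toy training data: each language has its own vocabulary (feature space).
train = {
    "en": [(["goal", "match", "team"], "sports"),
           (["bank", "stock", "market"], "finance")],
    "it": [(["gol", "partita", "squadra"], "sports"),
           (["banca", "borsa", "mercato"], "finance")],
}

# Tier 1: one classifier per language.
first_tier = {
    lang: NaiveBayesView(CLASSES).fit([d for d, _ in data], [y for _, y in data])
    for lang, data in train.items()
}

# Tier 2: a single metaclassifier trained on posterior vectors pooled
# from all languages, since these vectors share one common space.
meta_X, meta_y = [], []
for lang, data in train.items():
    for doc, y in data:
        meta_X.append(first_tier[lang].posteriors(doc))
        meta_y.append(y)
meta = CentroidMeta().fit(meta_X, meta_y)


def classify(lang, doc):
    """Route a document through its language's view, then the metaclassifier."""
    return meta.predict(first_tier[lang].posteriors(doc))


print(classify("it", ["partita", "gol"]))
print(classify("en", ["stock", "bank"]))
```

The point of the sketch is the data flow rather than the particular learners: documents in different languages enter through different feature spaces, but the metaclassifier only ever sees vectors indexed by class, so training examples from all languages can be pooled. In gFun, as described above, a posterior-producing classifier is just one kind of view-generating function; other views would be produced by additional components and aggregated before reaching the metaclassifier.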
