Don't guess what's true: choose what's optimal. A probability transducer for machine-learning classifiers

In fields such as medicine and drug discovery, the ultimate goal of a classification is not to guess a class, but to choose the optimal course of action among a set of possible ones, which is usually not in one-to-one correspondence with the set of classes. This decision-theoretic problem requires sensible probabilities for the classes. In many important cases, probabilities conditional on the features are computationally almost impossible to find. The main idea of the present work is to calculate probabilities conditional not on the features, but on the trained classifier's output. This calculation is cheap, needs to be made only once, and provides an output-to-probability "transducer" that can be applied to all future outputs of the classifier. In conjunction with problem-dependent utilities, the probabilities from the transducer allow us to find the optimal choice among the classes, or among a set of more general decisions, by means of expected-utility maximization. This idea is demonstrated in a simplified drug-discovery problem with a highly imbalanced dataset. The transducer and utility maximization together always lead to improved results, sometimes close to the theoretical maximum, for all sets of problem-dependent utilities. The one-time-only calculation of the transducer also provides, automatically: (i) a quantification of the uncertainty about the transducer itself; (ii) the expected utility of the augmented algorithm (including its uncertainty), which can be used for algorithm selection; (iii) the possibility of using the algorithm in a "generative mode", useful if the training dataset is biased.
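
As a rough illustration of the idea, the following Python sketch builds an output-to-probability table from held-out calibration data and then picks the decision with the highest expected utility. The abstract does not say how the probabilities are estimated, so this sketch substitutes a simple histogram over binned classifier outputs with a pseudo-count for smoothing; it is not the method of the paper. The function names `fit_transducer` and `choose_decision`, the bin edges, the toy data, and the utility values are all hypothetical.

```python
import numpy as np


def fit_transducer(outputs, labels, n_classes, bin_edges, prior_count=1.0):
    """Estimate P(class | binned classifier output) from calibration data.

    Assumption: a histogram with a pseudo-count stands in for whatever
    estimator the paper actually uses.
    Returns an array of shape (n_bins, n_classes) whose rows sum to 1.
    """
    bins = np.digitize(outputs, bin_edges)      # bin index of each output
    n_bins = len(bin_edges) + 1
    counts = np.full((n_bins, n_classes), prior_count, dtype=float)
    np.add.at(counts, (bins, labels), 1.0)      # accumulate (bin, class) counts
    return counts / counts.sum(axis=1, keepdims=True)


def choose_decision(output, transducer, bin_edges, utilities):
    """Pick the decision that maximizes expected utility.

    utilities[d, c] is the utility of decision d when the true class is c.
    The set of decisions need not match the set of classes.
    """
    probs = transducer[np.digitize(output, bin_edges)]
    expected = utilities @ probs                # expected utility per decision
    return int(np.argmax(expected)), expected


# Toy calibration set (hypothetical): one scalar classifier output per item.
outputs = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])    # 1 = "active" compound
bin_edges = [0.5]

transducer = fit_transducer(outputs, labels, n_classes=2, bin_edges=bin_edges)

# Hypothetical utilities: rows are decisions (discard, test further),
# columns are true classes (inactive, active). Missing an active is costly.
utilities = np.array([[1.0, -10.0],
                      [-1.0, 5.0]])

decision, expected = choose_decision(0.3, transducer, bin_edges, utilities)
print("P(class | output bin):", transducer)
print("decision:", decision, "expected utilities:", expected)
```

With these made-up utilities, an output of 0.3 falls in the bin where "inactive" is the more probable class, yet the chosen decision is "test further", because wrongly discarding an active compound is penalized heavily. The table is computed once and reused for every later output, which mirrors the one-time calculation described in the abstract.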
