Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high-recall review tasks. Attorneys and legal technologists have debated whether review should be a single iterative process (one-phase TAR workflows) or whether model training and review should be separate (two-phase TAR workflows), with implications for the choice of active learning algorithm. The relative cost of manual labeling for different purposes (training vs. review) and of different documents (positive vs. negative examples) is a key but neglected factor in this debate. Using a novel cost dynamics analysis, we show analytically and empirically that these relative costs strongly affect whether a one-phase or two-phase workflow minimizes cost. We also show how category prevalence, classification task difficulty, and collection size affect the optimal choice not only of workflow type, but also of active learning method and stopping point.
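To make the role of relative labeling costs concrete, the following is a minimal sketch, not the paper's actual cost model. It assumes total cost is a simple weighted sum of label counts, with a separate unit cost for each combination of purpose (training vs. review) and document class (positive vs. negative). The names `CostStructure` and `workflow_cost`, the unit costs, and all label counts are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostStructure:
    """Hypothetical per-document labeling costs (arbitrary units)."""
    train_pos: float   # labeling a positive document during training
    train_neg: float   # labeling a negative document during training
    review_pos: float  # labeling a positive document during review
    review_neg: float  # labeling a negative document during review


def workflow_cost(costs, n_train_pos, n_train_neg, n_review_pos, n_review_neg):
    """Total cost as a weighted sum of label counts (an assumed linear model)."""
    return (costs.train_pos * n_train_pos
            + costs.train_neg * n_train_neg
            + costs.review_pos * n_review_pos
            + costs.review_neg * n_review_neg)


# Made-up label counts for reaching the same recall target.
# One-phase: every label comes from the iterative (training) process.
ONE_PHASE = (900, 2100, 0, 0)
# Two-phase: training stops early, then the remaining documents are reviewed.
TWO_PHASE = (300, 700, 600, 2400)

uniform = CostStructure(1.0, 1.0, 1.0, 1.0)
costly_training = CostStructure(10.0, 10.0, 1.0, 1.0)

uniform_one = workflow_cost(uniform, *ONE_PHASE)
uniform_two = workflow_cost(uniform, *TWO_PHASE)
costly_one = workflow_cost(costly_training, *ONE_PHASE)
costly_two = workflow_cost(costly_training, *TWO_PHASE)

print("uniform costs:   one-phase", uniform_one, "two-phase", uniform_two)
print("costly training: one-phase", costly_one, "two-phase", costly_two)
```

With these invented counts, the one-phase workflow is cheaper when all labels cost the same, while the two-phase workflow becomes cheaper once training labels cost ten times as much as review labels. The reversal illustrates how the cost structure alone can change which workflow is preferable.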