Textual redundancy is one of the main challenges to ensuring that legal texts remain comprehensible and maintainable. Drawing inspiration from the refactoring literature in software engineering, which has developed methods to expose and eliminate duplicated code, we introduce the duplicated phrase detection problem for legal texts and propose the Dupex algorithm to solve it. Leveraging the Minimum Description Length principle from information theory, Dupex identifies a set of duplicated phrases, called patterns, that together best compress a given input text. Through an extensive set of experiments on the Titles of the United States Code, we confirm that our algorithm works well in practice: Dupex will help you simplify your law.
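To make the compression idea concrete, the following is a minimal sketch of MDL-style duplicated phrase selection. It is not the Dupex algorithm itself: it assumes a deliberately crude cost model that counts tokens rather than bits, a greedy search, and no nested patterns. All function names (`occurrences`, `gain`, `substitute`, `find_patterns`, `description_length`) and the example sentence are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def occurrences(tokens, n):
    """Map each n-gram of words to the start indices of its
    non-overlapping occurrences, scanning left to right."""
    found = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # Integers are references to earlier patterns; this sketch
        # does not allow patterns to contain other patterns.
        if any(isinstance(t, int) for t in gram):
            continue
        starts = found[gram]
        if not starts or i >= starts[-1] + n:
            starts.append(i)
    return found


def gain(n, k):
    """Tokens saved by turning k occurrences of an n-token phrase into
    k references plus one dictionary entry of n tokens and a delimiter.
    Assumed cost model: every token and every reference costs 1."""
    return k * n - (k + n + 1)


def substitute(tokens, starts, n, ref):
    """Replace the n-token phrase at each start index with `ref`."""
    out, i, start_set = [], 0, set(starts)
    while i < len(tokens):
        if i in start_set:
            out.append(ref)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def find_patterns(tokens, max_len=8):
    """Greedily add the phrase with the largest positive gain until
    no phrase shortens the description any further."""
    tokens, patterns = list(tokens), []
    while True:
        best = None  # (gain, phrase, start indices)
        for n in range(2, max_len + 1):
            for gram, starts in occurrences(tokens, n).items():
                g = gain(n, len(starts))
                if g > 0 and (best is None or g > best[0]):
                    best = (g, gram, starts)
        if best is None:
            return patterns, tokens
        _, gram, starts = best
        tokens = substitute(tokens, starts, len(gram), len(patterns))
        patterns.append(gram)


def description_length(patterns, tokens):
    """Total cost: the rewritten text plus the pattern dictionary."""
    return len(tokens) + sum(len(p) + 1 for p in patterns)


text = ("the secretary of the interior shall report . "
        "the secretary of the interior shall consult . "
        "the secretary of the interior may act .")
words = text.split()
patterns, compressed = find_patterns(words)
print(patterns)    # [('the', 'secretary', 'of', 'the', 'interior')]
print(len(words), description_length(patterns, compressed))  # 24 18
```

Under this toy cost model, the five-word phrase repeated three times is the only pattern worth extracting: replacing it shrinks the description from 24 to 18 tokens, while longer candidates that occur only twice save less. A faithful implementation would measure cost in bits and would need a far more careful search than this greedy loop.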