Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature. Such automated approaches enable quantitative studies on large corpora that would not be feasible by manual inspection alone. However, due to copyright restrictions, the availability of relevant digitized literary works is limited. Derived Text Formats (DTFs) have been proposed as a solution: textual materials are transformed in such a way that copyright-critical features are removed while the use of certain analytical methods remains possible. Contextualized word embeddings produced by transformer encoders (such as BERT) are promising candidates for DTFs because they allow for state-of-the-art performance on various analytical tasks and, at first sight, do not disclose the original text. However, in this paper we demonstrate that under certain conditions the reconstruction of the original copyrighted text becomes feasible, and that its publication in the form of contextualized token representations is therefore not safe. Our attempts to invert BERT suggest that publishing the encoder as a black box together with the contextualized embeddings is critical, since it allows an adversary to generate data for training a decoder whose reconstruction accuracy is sufficient to violate copyright law.
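The core of the attack described above can be illustrated with a toy sketch: an attacker who can query the encoder as a black box embeds text of their own choosing to obtain (embedding, token) training pairs, then fits a decoder that maps embeddings back to tokens. Everything below is illustrative and not the paper's actual setup; a small deterministic "encoder" with Gaussian noise stands in for BERT, and a nearest-centroid classifier stands in for the trained decoder.

```python
import math
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "far"]
DIM = 16

# Toy stand-in for a black-box encoder: a fixed random vector per token,
# plus small noise to mimic contextual variation. The attacker can only
# query this function, not inspect its internals.
_token_vecs = {tok: [random.gauss(0, 1) for _ in range(DIM)] for tok in VOCAB}

def black_box_encode(token):
    return [v + random.gauss(0, 0.05) for v in _token_vecs[token]]

# Step 1: the attacker queries the encoder on known text to build
# (embedding, token) training pairs.
train = [(black_box_encode(tok), tok) for tok in VOCAB for _ in range(20)]

# Step 2: "train" a decoder; here a nearest-centroid classifier suffices.
centroids = {}
for tok in VOCAB:
    embs = [e for e, t in train if t == tok]
    centroids[tok] = [sum(col) / len(embs) for col in zip(*embs)]

def decode(emb):
    return min(centroids, key=lambda t: math.dist(emb, centroids[t]))

# Step 3: reconstruct a "published" embedded sentence token by token.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
published = [black_box_encode(tok) for tok in sentence]
reconstructed = [decode(e) for e in published]
```

In this toy setting the reconstruction is exact; the paper's point is that even for a real contextualized encoder like BERT, black-box query access lets an attacker train a far more capable neural decoder along the same lines.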