Inscriptis provides a library, a command-line client, and a Web service for converting HTML to plain text. Its development was motivated by the need for accurate text representations in knowledge extraction tasks, i.e., representations that preserve the spatial alignment of text without relying on heavyweight, browser-based solutions such as Selenium. In contrast to related software packages, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers; and (ii) supports annotation rules, i.e., user-provided mappings that annotate the extracted text based on structural and semantic information encoded in HTML tags and attributes. These features allow downstream knowledge extraction components to operate on accurate text representations and, where useful, to draw on the semantics and structure of the original HTML document.
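To make the notion of an annotation rule more concrete, the following is a minimal sketch of the idea using only Python's standard `html.parser` module. It is not Inscriptis's implementation or API: the class `AnnotatingExtractor`, the function `annotate`, and the shape of the `rules` mapping (tag name to label) are hypothetical names chosen for illustration, and the sketch performs no layout-aware rendering. It only shows how a user-provided mapping from HTML tags to labels can yield character-offset annotations over the extracted text.

```python
from html.parser import HTMLParser


class AnnotatingExtractor(HTMLParser):
    """Toy extractor (hypothetical, not Inscriptis): collects text and
    labels the spans enclosed by tags that appear in `rules`."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules        # assumed mapping: tag name -> label
        self.parts = []           # text fragments collected so far
        self.length = 0           # number of characters emitted so far
        self.open_spans = []      # stack of (tag, start offset)
        self.annotations = []     # finished (start, end, label) triples

    def handle_starttag(self, tag, attrs):
        # Remember where the text of an annotated element begins.
        if tag in self.rules:
            self.open_spans.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the innermost open span if it matches this end tag.
        if self.open_spans and self.open_spans[-1][0] == tag:
            _, start = self.open_spans.pop()
            self.annotations.append((start, self.length, self.rules[tag]))

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)


def annotate(html, rules):
    """Return the plain text and a list of (start, end, label) annotations."""
    parser = AnnotatingExtractor(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.annotations


text, labels = annotate(
    "<h1>Title</h1><p>Some <b>bold</b> text.</p>",
    {"h1": "heading", "b": "emphasis"},
)
for start, end, label in labels:
    print(label, repr(text[start:end]))
```

Because the annotations are expressed as offsets into the extracted text, a downstream component can work on the plain text alone and still look up which spans originated from a heading or an emphasized element.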