We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.
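To make the structured-prompting idea concrete, the following is a minimal sketch of the kind of hyper-text prompt the abstract describes for zero-shot summarization: the input text is placed in the body of a simplified HTML document, and the model infills the title element. The `<mask>` token and the surrounding boilerplate here are illustrative assumptions, not necessarily the paper's exact prompt template.

```html
<!-- Illustrative zero-shot summarization prompt (format is an assumption,
     not the paper's exact template): the model infills the masked <title>
     element given the article text placed in <body>. -->
<html>
  <head>
    <title><mask></title>
  </head>
  <body>
    <p>Full text of the article to be summarized goes here ...</p>
  </body>
</html>
```

Decoding the tokens generated in place of `<mask>` yields the summary; analogous prompts that mask class or id attributes could, in the same spirit, encode classification tasks.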