A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.