We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web, in data formats (XML, HTML, JSON), and in programming languages (Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings, and the speedup is even larger on ASCII strings.
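To make concrete what "validate and transcode" involves, here is a minimal sketch of a conventional scalar UTF-8 to UTF-16 converter. It illustrates the byte-at-a-time style of approach that the abstract contrasts with, not the vectorized method of the library itself. The function name `utf8_to_utf16` and the choice to signal invalid input with `std::nullopt` are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: scalar, validating UTF-8 -> UTF-16 transcoder.
// Returns std::nullopt if the input is not valid UTF-8.
std::optional<std::u16string> utf8_to_utf16(std::string_view in) {
    std::u16string out;
    // A valid UTF-8 sequence never yields more UTF-16 units than it has bytes.
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
        if (b0 < 0x80) {  // ASCII: one byte maps to one 16-bit unit
            out.push_back(static_cast<char16_t>(b0));
            ++i;
            continue;
        }
        std::size_t len;      // total bytes in this sequence
        std::uint32_t cp;     // code point being assembled
        std::uint32_t min;    // smallest code point allowed for this length
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte
        if (in.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const std::uint8_t b = static_cast<std::uint8_t>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                      // overlong form
        if (cp > 0x10FFFF) return std::nullopt;                 // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Every non-ASCII character in this sketch goes through several data-dependent branches, which is one plausible reason why scalar code of this kind tends to fall well short of disk bandwidth on non-ASCII text.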