We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1,900 languages and spanning 2013 to 2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with both aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms, and applies no artificial filtering. Each character entry includes Unicode metadata (category, script, block), so users can filter and analyze the data by domain or by any other downstream criterion. The full dataset is released in both CSV and Parquet formats, with associated metadata, and is available on GitHub (https://github.com/Bin-2/FineFreq) and HuggingFace.
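To make the shape of a per-character entry concrete, the following is a minimal sketch of how aggregate and year-level counts with a Unicode category could be computed for a small in-memory sample. It is not the FineFreq pipeline: the function name `char_frequencies`, the `(year, text)` input format, and the output field names are all hypothetical, and only the Unicode general category is attached because Python's standard `unicodedata` module does not expose script or block.

```python
from collections import Counter
import unicodedata


def char_frequencies(docs):
    """Count characters over (year, text) pairs.

    Hypothetical helper for illustration only; field names are assumptions,
    not the actual FineFreq schema.
    """
    totals = Counter()
    by_year = {}
    for year, text in docs:
        counts = Counter(text)  # counts Unicode code points, not bytes
        totals.update(counts)
        by_year.setdefault(year, Counter()).update(counts)

    table = {}
    for ch, n in totals.items():
        table[ch] = {
            "total": n,
            # keep only the years in which the character actually occurs
            "by_year": {y: c[ch] for y, c in sorted(by_year.items()) if c[ch]},
            # Unicode general category, e.g. "Ll" (lowercase letter), "So" (other symbol)
            "category": unicodedata.category(ch),
        }
    return table


# Tiny made-up sample; nothing is filtered, so the space and emoji are counted too.
sample = [(2013, "abc a"), (2025, "a😀")]
table = char_frequencies(sample)
```

Because every code point is kept, downstream filtering can be done on the attached metadata, for example by selecting only entries whose category starts with `L` to restrict the table to letters.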