We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX-16M and LimiX-2M, two instantiations of our large structured-data models (LDMs). Both models treat structured data as a joint distribution over variables and missingness, and can therefore address a wide range of tabular tasks through query-based conditional prediction with a single model. They are pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, supporting rapid, training-free adaptation at inference. We evaluate LimiX models across 11 large structured-data benchmarks spanning broad regimes of sample size, feature dimensionality, number of classes, categorical-to-numerical feature ratio, missingness rate, and sample-to-feature ratio. LimiX-16M consistently surpasses strong baselines, as shown in Figure 1 and Figure 2. This advantage holds across a wide range of tasks, including classification, regression, missing-value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. Notably, LimiX-2M delivers strong results under tight compute and memory budgets. We also present the first scaling-law study for LDMs, revealing how data and model scaling jointly influence downstream performance and offering quantitative guidance for tabular foundation modeling. All LimiX models are publicly available under the Apache 2.0 license.
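The episodic, masked, context-conditional objective can be pictured with a minimal sketch. Assuming each episode is one table split into fully observed context rows and query rows with randomly hidden cells, a toy version might look as follows; all names and the calling convention here are illustrative assumptions, not the authors' implementation:

```python
import torch

def masked_episode_loss(model, X, mask_rate=0.15, n_context=512):
    """One pretraining episode on a single table X of shape (n_rows, n_features).

    Context rows are fully observed; a random subset of query cells is
    masked, and the model predicts those cells conditioned on the context.
    """
    perm = torch.randperm(X.shape[0])
    ctx, qry = X[perm[:n_context]], X[perm[n_context:]]
    mask = torch.rand_like(qry) < mask_rate          # cells to hide
    qry_in = qry.masked_fill(mask, float('nan'))     # NaN doubles as "missing"
    pred = model(ctx, qry_in)                        # context-conditional prediction
    return ((pred - qry) ** 2)[mask].mean()          # loss only on masked cells

# Stand-in "model" (zero-imputes the queries) just to make the sketch
# executable end to end; a real LDM would attend over the context rows.
toy_model = lambda ctx, qry: torch.nan_to_num(qry, nan=0.0)
loss = masked_episode_loss(toy_model, torch.randn(1000, 20))
```

A squared-error loss is shown for numeric cells only; categorical variables and the modeling of missingness itself would need their own prediction heads, which this sketch omits.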