Processing and analyzing tabular data productively and efficiently is essential for building successful machine learning applications in fields such as healthcare. However, the lack of a unified framework for representing and standardizing tabular information poses a significant challenge to researchers and professionals alike. In this work, we present TabText, a methodology that leverages the unstructured data format of language to encode tabular data from different table structures and time periods efficiently and accurately. Using two healthcare datasets and four prediction tasks, we show that features extracted via TabText outperform those extracted with traditional processing methods by 2-5%. Furthermore, we analyze the sensitivity of our framework to different choices of sentence representation for missing values, meta information, and language descriptiveness, and we provide insights into winning strategies that improve performance.
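To make the idea of encoding a table row as language concrete, the following is a minimal illustrative sketch, not the TabText implementation. The function name `row_to_text`, the sentence template "X is Y.", the two missing-value strategies ("skip" and "state"), and the example patient record are all assumptions introduced here for illustration; the abstract does not specify them.

```python
import math
from typing import Any, Mapping, Optional


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_to_text(
    row: Mapping[str, Any],
    descriptions: Optional[Mapping[str, str]] = None,
    missing: str = "skip",
) -> str:
    """Turn one table row into a short paragraph of sentences.

    descriptions: optional map from column name to a more descriptive
        label (a hypothetical stand-in for meta information).
    missing: "skip" omits missing cells; "state" writes "<name> is missing."
        Both strategies are assumptions made for this sketch.
    """
    if missing not in ("skip", "state"):
        raise ValueError("missing must be 'skip' or 'state'")
    descriptions = descriptions or {}
    sentences = []
    for column, value in row.items():
        # Fall back to the raw column name when no description is given.
        name = descriptions.get(column, column)
        if is_missing(value):
            if missing == "skip":
                continue
            sentences.append(f"{name} is missing.")
        else:
            sentences.append(f"{name} is {value}.")
    return " ".join(sentences)


# Hypothetical record; column names and values are made up.
patient = {"age": 67, "hr": 88, "lactate": None}
labels = {"hr": "heart rate"}

text_skip = row_to_text(patient, labels)
text_state = row_to_text(patient, labels, missing="state")
print(text_skip)
print(text_state)
```

Because every row becomes plain text, rows from tables with different columns can share one representation, and the resulting text could then be passed to a language model to obtain features. Varying the `missing` argument or the `descriptions` map is one simple way to probe the kind of sensitivity to sentence representation that the abstract describes.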