Bankruptcy prediction is an important research area that relies heavily on data science. It aims to help investors, managers, and regulators better understand the operational status of corporations and anticipate potential financial risks. To improve prediction, researchers and practitioners have begun to use many types of data, ranging from traditional financial indicators to unstructured data, to build and optimize bankruptcy forecasting models. Over time, both the data used and the methods for structuring, cleaning, and analyzing it have improved. With the aid of advanced analytical techniques that deploy machine learning and deep learning algorithms, bankruptcy assessment has become more accurate. However, due to the sensitivity of financial data, the scarcity of valid public datasets remains a key bottleneck for the rapid modeling and evaluation of machine learning algorithms for targeted tasks. This study therefore introduces a taxonomy of datasets for bankruptcy research and summarizes their characteristics. It also proposes a set of metrics to measure the quality and informativeness of public datasets. The taxonomy, coupled with the informativeness measure, aims to help researchers and practitioners develop applications for various aspects of credit assessment and decision making by pointing them to appropriate datasets for their studies.