Data scarcity and technological limitations for resource-poor languages in developing countries like India pose a threat to the development of sophisticated NLU systems for healthcare. To assess the current status of state-of-the-art language models in healthcare, this paper proposes two healthcare datasets, Indian Healthcare Query Intent-WebMD and 1mg (IHQID-WebMD and IHQID-1mg), and one real-world Indian hospital query dataset, in English and multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati), which are annotated with query intents as well as entities. Our aim is to detect query intents and extract the corresponding entities. We perform extensive experiments on a set of models in various realistic settings and explore two scenarios: access to English data only (less costly) and access to target-language data (more expensive). We assess context-specific practical relevance through empirical analysis. The results, expressed in terms of overall F1 score, show that our approach is practically useful for identifying intents and entities.