We describe CARTOLABE, a web-based multi-scale system for visualizing and exploring large textual corpora based on topics, introducing a novel mechanism for the progressive visualization of filtering queries. Initially designed to represent and navigate through scientific publications in different disciplines, CARTOLABE has evolved into a generic framework that accommodates various corpora, ranging from Wikipedia (4.5M entries) to the French National Debate (4.3M entries). CARTOLABE consists of two modules. The first relies on Natural Language Processing methods: it converts a corpus and its entities (documents, authors, concepts) into high-dimensional vectors, computes their projection onto the 2D plane, and extracts meaningful labels for regions of the plane. The second module is a web-based visualization that displays tiles computed from the multidimensional projection of the corpus using the UMAP projection method. This visualization module aims to enable users with no expertise in visualization or data analysis to get an overview of their corpus and to interact with it: exploring, querying, filtering, panning, and zooming in on regions of semantic interest. Three use cases are discussed to illustrate CARTOLABE's versatility and its ability to bring large-scale textual corpus visualization and exploration to a wide audience.
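To make the idea of multi-scale tiles more concrete, the following is a minimal sketch of how points already projected onto the 2D plane could be binned into tiles at several zoom levels. It is not CARTOLABE's implementation: the quadtree-style scheme with 2^zoom tiles per axis, and the names `normalize`, `tile_key`, and `build_tiles`, are assumptions made for illustration only, and the input coordinates stand in for the output of a projection method such as UMAP.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
TileKey = Tuple[int, int, int]  # (zoom, column, row); hypothetical key layout


def normalize(points: List[Point]) -> List[Point]:
    """Rescale arbitrary 2D coordinates to the unit square [0, 1] x [0, 1]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    # Fall back to a span of 1.0 when all points share a coordinate.
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0
    return [((x - min_x) / span_x, (y - min_y) / span_y) for x, y in points]


def tile_key(x: float, y: float, zoom: int) -> TileKey:
    """Return the tile containing a unit-square point at the given zoom level."""
    n = 2 ** zoom  # assumed: 2^zoom tiles per axis
    col = min(int(x * n), n - 1)  # clamp so x == 1.0 lands in the last column
    row = min(int(y * n), n - 1)
    return (zoom, col, row)


def build_tiles(points: List[Point], max_zoom: int) -> Dict[TileKey, List[int]]:
    """Group point indices by tile for every zoom level from 0 to max_zoom."""
    tiles: Dict[TileKey, List[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(normalize(points)):
        for zoom in range(max_zoom + 1):
            tiles[tile_key(x, y, zoom)].append(idx)
    return dict(tiles)


if __name__ == "__main__":
    # Toy stand-in for projected document coordinates.
    projected = [(0.0, 0.0), (10.0, 10.0), (2.0, 8.0)]
    for key, members in sorted(build_tiles(projected, max_zoom=1).items()):
        print(key, members)
```

Under this scheme, a client that pans or zooms only needs the tiles intersecting the current viewport at the current zoom level, which is one plausible way to keep a corpus of several million entries responsive in a browser.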