Telegram is one of the most popular instant messaging apps. In addition to providing a private messaging service, Telegram, through its channels, is an effective medium for rapidly broadcasting content to a large audience (e.g., COVID-19 announcements) but, unfortunately, also for disseminating radical ideologies and coordinating attacks (e.g., the Capitol Hill riot). This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages, making it, to the best of our knowledge, the largest collection of Telegram channels. After a brief introduction to the data collection process, we analyze the languages used within our dataset and the topics covered by English-language channels. Finally, we discuss some use cases in which our dataset can help to better understand the Telegram ecosystem and to study the diffusion of questionable news. In addition to the raw dataset, we release the scripts we used to analyze it and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.