In the era of big data, we continuously, and at times unknowingly, leave behind digital traces by browsing, sharing, posting, liking, searching, watching, and listening to online content. When aggregated, these digital traces can provide powerful insights into the behavior, preferences, activities, and traits of people. While many have raised privacy concerns around the use of aggregated digital traces, their use has undeniably brought many advances, from search engines that learn from their users and give us access to unprecedented amounts of data, knowledge, and information, to the discovery of previously unknown adverse drug reactions from search engine logs. Whether in online services, journalism, digital forensics, law, or research, we increasingly set out to explore large amounts of digital traces to discover new information. Consider, for instance, the Enron scandal, Hillary Clinton's email controversy, or the Panama Papers: cases that revolve around analyzing, searching, investigating, exploring, and turning upside down large amounts of digital traces to gain new insights, knowledge, and information. This discovery task is at its core about "finding evidence of activity in the real world." This dissertation revolves around discovery in digital traces, and sits at the intersection of Information Retrieval, Natural Language Processing, and applied Machine Learning. We propose computational methods that aim to support the exploration and sense-making process of large collections of digital traces. We focus on textual traces, e.g., emails and social media streams, and address two aspects that are central to discovery in digital traces.