Click-stream data, generated in massive volumes by human activity on websites, has become a prominent resource for newsrooms to identify readers' characteristics since the digitization of news outlets. Elastic architectures are essential for processing this streaming data, particularly under unprecedented traffic, and they enable more comprehensive analyses such as recommending the most closely related articles to readers. Although click-stream data follows a similar logic across websites, it has inherent limitations for recognizing human behavior when viewed from a broad perspective, which makes it necessary to narrow the problem to niche areas. This study investigates anonymized readers' click activities on news organizations' websites to identify news consumption patterns following referrals from Twitter, that is, among readers who arrive incidentally and whose interest lies mainly in the news content they were routed to. The investigation is broadened by linking the log data with news content to enrich the insights rather than restricting the analysis to the web journey. Ensemble cluster analysis methods with mixed-type embedding strategies are applied and compared to find similar reader groups and interests independent of time. Our results demonstrate that the quality of clustering a mixed-type data set approaches optimal internal validation scores when the data are embedded with Uniform Manifold Approximation and Projection (UMAP) and the consensus function is used as a key to select the most applicable hyperparameter configurations in the given ensemble, rather than using the consensus function's results directly. Evaluation of the resulting clusters highlights specific clusters that recur across samples, which provides insights for news organizations and helps overcome the degradation of behavior models caused by changes in interest over time.
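The idea of using the consensus function as a key for choosing a configuration, rather than as the final clustering, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the consensus is modeled as a pairwise co-association matrix, agreement is measured as one minus the mean absolute difference between a member's co-membership and that matrix, and the configuration names (`cfg_a` to `cfg_d`) and label vectors are hypothetical toy data. The UMAP embedding and the base clustering steps are omitted.

```python
from itertools import combinations


def co_association(ensemble):
    """For every pair of points, the fraction of ensemble members
    that place both points in the same cluster."""
    n = len(ensemble[0])
    return {
        (i, j): sum(labels[i] == labels[j] for labels in ensemble) / len(ensemble)
        for i, j in combinations(range(n), 2)
    }


def agreement_with_consensus(labels, coassoc):
    """Score one member against the co-association matrix.
    1.0 means its pairwise co-membership matches the ensemble exactly."""
    errors = [abs((labels[i] == labels[j]) - p) for (i, j), p in coassoc.items()]
    return 1.0 - sum(errors) / len(errors)


def select_configuration(ensemble_by_config):
    """Return the configuration whose clustering agrees most with the
    ensemble consensus, together with the scores of all configurations."""
    coassoc = co_association(list(ensemble_by_config.values()))
    scores = {
        config: agreement_with_consensus(labels, coassoc)
        for config, labels in ensemble_by_config.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy ensemble: one label vector per hyperparameter configuration.
toy_ensemble = {
    "cfg_a": [0, 0, 0, 1, 1, 1],
    "cfg_b": [1, 1, 1, 0, 0, 0],  # same partition as cfg_a, labels swapped
    "cfg_c": [0, 0, 1, 1, 2, 2],
    "cfg_d": [0, 1, 0, 1, 0, 1],
}

best, scores = select_configuration(toy_ensemble)
print(best, scores)
```

Because the score depends only on pairwise co-membership, it is unaffected by label permutations, so `cfg_a` and `cfg_b` receive the same score. The selected configuration's own clustering would then be evaluated with internal validation indices, instead of the consensus partition itself.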