以树为基础的有重点的网络与加强学习 (Tree-based Focused Web Crawling with Reinforcement Learning) - 专知论文

会员服务 ·

0

WEB · Learning · Agent · Markov · 可约的 ·

2022 年 7 月 1 日

Tree-based Focused Web Crawling with Reinforcement Learning

翻译：以树为基础的有重点的网络与加强学习

Andreas Kontogiannis,Dimitrios Kelesis,Vasilis Pollatos,Georgios Paliouras,George Giannakopoulos

A focused crawler aims at discovering as many web pages relevant to a target topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL) has been utilized to optimize focused crawling. In this paper, we propose TRES, an RL-empowered framework for focused crawling. We model the crawling environment as a Markov Decision Process, which the RL agent aims at solving by determining a good crawling strategy. Starting from a few human provided keywords and a small text corpus, that are expected to be relevant to the target topic, TRES follows a keyword set expansion procedure, which guides crawling, and trains a classifier that constitutes the reward function. To avoid a computationally infeasible brute force method for selecting a best action, we propose Tree-Frontier, a decision-tree-based algorithm that adaptively discretizes the large state and action spaces and finds only a few representative actions. Tree-Frontier allows the agent to be likely to select near-optimal actions by being greedy over selecting the best representative action. Experimentally, we show that TRES significantly outperforms state-of-the-art methods in terms of harvest rate (ratio of relevant pages crawled), while Tree-Frontier reduces by orders of magnitude the number of actions needed to be evaluated at each timestep.

翻译：集中的爬行器旨在尽可能多地发现与目标主题相关的网页,同时避免不相干的内容。强化学习(RL)已被用于优化重点爬行。在本文中,我们提议TRES,这是一个有重点爬行的RL动力框架。我们把爬行环境模型成一个Markov 决策程序,RL代理商旨在通过确定一个良好的爬行战略来解决这个问题。从几个与目标主题相关的人类提供的关键字和一个小文本体开始,TRE遵循一个关键字集扩展程序,该程序引导爬行,并训练一个构成奖赏功能的分类师。为了避免一种计算上不可行的粗力方法选择最佳行动,我们提议了树-Frontier,一种基于决定的算法,它适应性地将大型状态和行动空间分散,只找到少数具有代表性的行动。树-Frontier允许该代理商通过贪婪地选择最佳的代表性行动来选择接近最优化的行动。实验,我们显示TRES明显地超越了选择最先进的标准状态,同时按不同程度的收成品级排序,同时评估每一程度的顺序。

0

相关内容

WEB

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

75+阅读 · 2022年6月28日

【深度学习表格检测、信息提取和结构化】《Table Detection, Information Extraction and Structuring using Deep Learning》by Vihar Kurama

专知会员服务

38+阅读 · 2020年1月23日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Latest News & Announcements of the Workshop

【ICIG2021】Latest News & Announcements of the Workshop

中国图象图形学学会CSIG

0+阅读 · 2021年12月20日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

翘嘴鳜高密度遗传连锁图谱构建及生长性状QTL多家系定位

国家自然科学基金

0+阅读 · 2015年12月31日

KCuxSy热电材料声子和载流子传输特性的协同调控

国家自然科学基金

0+阅读 · 2015年12月31日

四倍体马铃薯SNP遗传图谱的构建及重要农艺性状的关联分析

国家自然科学基金

0+阅读 · 2014年12月31日

基于三角形坐标系的铝石榴石型荧光粉的光谱调控

国家自然科学基金

0+阅读 · 2013年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

异构动态移动通信网络的延时优化

国家自然科学基金

2+阅读 · 2013年12月31日

有限长区域中的空间耦合多元Rateless码研究

国家自然科学基金

0+阅读 · 2012年12月31日

CuInGaSe2太阳能电池界面结构、界面态及其钝化

国家自然科学基金

0+阅读 · 2012年12月31日

聚集诱导发光化合物的分子设计与性能研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

FocusFormer: Focusing on What We Need via Architecture Sampler

Arxiv

0+阅读 · 2022年8月23日

Entropy Enhanced Multi-Agent Coordination Based on Hierarchical Graph Learning for Continuous Action Space

Arxiv

0+阅读 · 2022年8月23日

Prioritizing Samples in Reinforcement Learning with Reducible Loss

Prioritizing Samples in Reinforcement Learning with Reducible Loss

Arxiv

0+阅读 · 2022年8月22日

Reinforcement Learning on Graph: A Survey

Arxiv

67+阅读 · 2022年4月13日

Reinforcement Learning based Air Combat Maneuver Generation

Reinforcement Learning based Air Combat Maneuver Generation

Arxiv

91+阅读 · 2022年1月14日

Recent Advances in Reinforcement Learning in Finance

Arxiv

11+阅读 · 2021年12月8日

A Survey on Reinforcement Learning for Recommender Systems

Arxiv

22+阅读 · 2021年9月22日

A Survey on Multi-Task Learning

Arxiv

31+阅读 · 2021年3月29日

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Arxiv

20+阅读 · 2020年3月10日

A Multi-Objective Deep Reinforcement Learning Framework

A Multi-Objective Deep Reinforcement Learning Framework

Arxiv

16+阅读 · 2018年6月27日

VIP会员

文章信息

相关主题

相关VIP内容

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

75+阅读 · 2022年6月28日

【深度学习表格检测、信息提取和结构化】《Table Detection, Information Extraction and Structuring using Deep Learning》by Vihar Kurama

专知会员服务

38+阅读 · 2020年1月23日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《美国海军陆战队软件定义网络应用案例：分布式防火墙自动化系统》148页

《多体环境下定位导航授时（PNT）系统研究》228页

软件定义无线电（SDR）：商业与军事领域的技术、应用及未来趋势

《攻势防空作战中无人追击者/规避者最优轨迹研究（含动态交战区建模）》95页

相关资讯

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

ACM TOMM Call for Papers

ACM TOMM Call for Papers

CCF多媒体专委会

2+阅读 · 2022年3月23日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Latest News & Announcements of the Workshop

【ICIG2021】Latest News & Announcements of the Workshop

中国图象图形学学会CSIG

0+阅读 · 2021年12月20日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

相关论文

FocusFormer: Focusing on What We Need via Architecture Sampler

Arxiv

0+阅读 · 2022年8月23日

Entropy Enhanced Multi-Agent Coordination Based on Hierarchical Graph Learning for Continuous Action Space

Arxiv

0+阅读 · 2022年8月23日

Prioritizing Samples in Reinforcement Learning with Reducible Loss

Prioritizing Samples in Reinforcement Learning with Reducible Loss

Arxiv

0+阅读 · 2022年8月22日

Reinforcement Learning on Graph: A Survey

Arxiv

67+阅读 · 2022年4月13日

Reinforcement Learning based Air Combat Maneuver Generation

Reinforcement Learning based Air Combat Maneuver Generation

Arxiv

91+阅读 · 2022年1月14日

Recent Advances in Reinforcement Learning in Finance

Arxiv

11+阅读 · 2021年12月8日

A Survey on Reinforcement Learning for Recommender Systems

Arxiv

22+阅读 · 2021年9月22日

A Survey on Multi-Task Learning

Arxiv

31+阅读 · 2021年3月29日

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

Arxiv

20+阅读 · 2020年3月10日

A Multi-Objective Deep Reinforcement Learning Framework

A Multi-Objective Deep Reinforcement Learning Framework

Arxiv

16+阅读 · 2018年6月27日

相关基金

翘嘴鳜高密度遗传连锁图谱构建及生长性状QTL多家系定位

国家自然科学基金

0+阅读 · 2015年12月31日

KCuxSy热电材料声子和载流子传输特性的协同调控

国家自然科学基金

0+阅读 · 2015年12月31日

四倍体马铃薯SNP遗传图谱的构建及重要农艺性状的关联分析

国家自然科学基金

0+阅读 · 2014年12月31日

基于三角形坐标系的铝石榴石型荧光粉的光谱调控

国家自然科学基金

0+阅读 · 2013年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

异构动态移动通信网络的延时优化

国家自然科学基金

2+阅读 · 2013年12月31日

有限长区域中的空间耦合多元Rateless码研究

国家自然科学基金

0+阅读 · 2012年12月31日

CuInGaSe2太阳能电池界面结构、界面态及其钝化

国家自然科学基金

0+阅读 · 2012年12月31日

聚集诱导发光化合物的分子设计与性能研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于本体的Deep Web搜索技术

国家自然科学基金

2+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员