Author: Daulet Nurmanbetov
Compiled by: ronghuaiyang
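Semantic search is one of the NLP problems most worth solving, and one of the hardest.
We spend a lot of time hunting for specific information in large volumes of documents, usually with CTRL + F. Then there is the well-known Google-fu: in the 21st-century workplace, searching Google effectively is a valuable skill. All of human knowledge is available to us; the challenge is asking the right question and knowing how to scan the results for the relevant answer.
Our brains perform the semantic search: we look through the results and pick out the sentences that resemble our query. This is especially true in finance and law, where documents keep getting longer and we have to try many keywords to find the right sentence or paragraph. The cumulative human effort spent on searching to date is staggering.
Machine learning has been trying to solve this problem since the early days of NLP, and a whole research field, semantic search, has grown up around it. Recently, thanks to advances in deep learning, computers have become able to surface relevant information accurately with minimal human effort.
NLP has a term for the way a word appears when it is mentioned: its "surface form". For example, the word "president" by itself means a head of state, but depending on context and time it could refer to Trump or Obama.
Advances in NLP let us map these surface forms effectively and capture the context of the words in what are called "embeddings". An embedding is a vector of numbers; two words with similar meanings have similar vectors, which lets us compute how similar they are.
Extending this idea, we should be able to compute the similarity between any two sentences in vector space. This is what sentence embedding models do: they turn any given sentence into a vector, so the similarity or dissimilarity of any pair of sentences can be computed quickly.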
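To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure used for the scores later in this post. The three-dimensional vectors are made up purely for illustration; a real sentence embedding model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values, not produced by any model).
toy_vectors = {
    "economy is improving": [0.9, 0.1, 0.2],
    "economy is recovering": [0.8, 0.2, 0.3],
    "vaccine trial begins": [0.1, 0.9, 0.1],
}

query = toy_vectors["economy is improving"]
for sentence, vector in toy_vectors.items():
    print(f"{sentence}: {cosine_similarity(query, vector):.4f}")
```

Vectors pointing in nearly the same direction score close to 1, and unrelated ones score close to 0, regardless of their length.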
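The idea is not new: word2vec, one of the earliest papers on the topic, proposed representing individual words as vectors back in 2013. Since then, BERT and other Transformer-based models have taken us a long way by capturing the context of words far more effectively.
How do recent embedding models compare with older ones such as word2vec or GloVe?
These modified and fine-tuned BERT models are very good at identifying similar sentences, much better than earlier models. Let's see what that means in practice.
I have a set of article headlines from April 2020, and I want to find the headlines most similar to each of a set of search queries.
Here are my search queries: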
1. The economy is more resilient and improving.
2. The economy is in a lot of trouble.
3. Trump is hurting his own reelection chances.
My article headlines are as follows:
Coronavirus:
White House organizing program to slash development time for coronavirus vaccine by as much as eight months (Bloomberg)
Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)
AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)
Trump contradicts US intel, says Covid-19 started in Wuhan lab. (The Hill)
Reopening:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill)
Florida plans to start reopening on Monday with restaurants and retail in most areas allowed to resume business in most areas (Bloomberg)
California Governor Newsom plans to order closure of all state beaches and parks starting Friday due to concerns about overcrowding (CNN)
Japan preparing to extend coronavirus state of emergency, which is scheduled to end 6-May, by about another month (Reuters)
Policy/Stimulus:
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico)
Global economy:
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg)
Japan's March factory output fell at the fastest pace in five months, while retail sales also dropped (Reuters)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT)
US-China:
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ)
Oil:
Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)
Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)
Norway, Europe's biggest oil producer, joins international efforts to cut supply for first time in almost two decades (Bloomberg)
IEA says coronavirus could drive 6% decline in global energy demand in 2020 (FT)
Corporate:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg)
Tesla posts third straight quarterly profit while Musk rants on call about need for lockdowns to be lifted (Bloomberg)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ)
Royal Dutch Shell cuts dividend for first time since World War II and also suspends next tranche of buyback program (Reuters)
Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)
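After computing the similarity between each query embedding and each headline embedding, here are the top 5 most similar sentences for each query: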
======================
Query: The economy is more resilient and improving.
Top 5 most similar sentences in corpus:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ) (Score: 0.5362)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg) (Score: 0.4632)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ) (Score: 0.3558)
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico) (Score: 0.3052)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.2885)
======================
Query: The economy is in a lot of trouble.
Top 5 most similar sentences in corpus:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.4667)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ) (Score: 0.4338)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg) (Score: 0.4283)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT) (Score: 0.4252)
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg) (Score: 0.4052)
======================
Query: Trump is hurting his own reelection chances.
Top 5 most similar sentences in corpus:
Trump contradicts US intel, says Covid-19 started in Wuhan lab. (The Hill) (Score: 0.7472)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ) (Score: 0.7408)
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters) (Score: 0.7111)
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.6213)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.6181)
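As you can see, the model does a good job of picking out the most similar sentences.
The code I used is below.
Install the sentence-transformers package:
!pip install -U sentence-transformers
import scipy.spatial.distance
from sentence_transformers import SentenceTransformer
# Load a BERT model fine-tuned on NLI data, with mean pooling over tokens
model = SentenceTransformer('bert-base-nli-mean-tokens')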
The corpus:
# Get a sample corpus to search over
_c="""
Coronavirus:
White House organizing program to slash development time for coronavirus vaccine by as much as eight months (Bloomberg)
Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)
AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)
Reopening:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill)
Florida plans to start reopening on Monday with restaurants and retail in most areas allowed to resume business in most areas (Bloomberg)
California Governor Newsom plans to order closure of all state beaches and parks starting Friday due to concerns about overcrowding (CNN)
Japan preparing to extend coronavirus state of emergency, which is scheduled to end 6-May, by about another month (Reuters)
Policy/Stimulus:
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico)
Global economy:
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg)
Japan's March factory output fell at the fastest pace in five months, while retail sales also dropped (Reuters)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT)
US-China:
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ)
Oil:
Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)
Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)
Norway, Europe's biggest oil producer, joins international efforts to cut supply for first time in almost two decades (Bloomberg)
IEA says coronavirus could drive 6% decline in global energy demand in 2020 (FT)
Corporate:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg)
Tesla posts third straight quarterly profit while Musk rants on call about need for lockdowns to be lifted (Bloomberg)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ)
Royal Dutch Shell cuts dividend for first time since World War II and also suspends next tranche of buyback program (Reuters)
Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)
Trump contradicts US intel, says Covid-19 started in Wuhan lab.
"""
# Convert the corpus into a list of headlines, dropping blank lines and
# short section labels such as "Oil:"
corpus = [i for i in _c.split('\n') if i != '' and len(i.split(' ')) >= 4]
# Get a vector for each headline (sentence) in the corpus
corpus_embeddings = model.encode(corpus)
# Define search queries and embed them to vectors as well
queries = [
    'The economy is more resilient and improving.',
    'The economy is in a lot of trouble.',
    'Trump is hurting his own reelection chances.',
]
query_embeddings = model.encode(queries)
# For each search term return 5 closest sentences
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    # Cosine distance between the query and every headline
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    # Sort headline indices from smallest distance (most similar) upwards
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        # Similarity score = 1 - cosine distance
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
The results:
======================
Query: The economy is more resilient and improving.
Top 5 most similar sentences in corpus:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ) (Score: 0.5362)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg) (Score: 0.4632)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ) (Score: 0.3558)
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico) (Score: 0.3052)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.2885)
======================
Query: The economy is in a lot of trouble.
Top 5 most similar sentences in corpus:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.4667)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ) (Score: 0.4338)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg) (Score: 0.4283)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT) (Score: 0.4252)
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg) (Score: 0.4052)
======================
Query: Trump is hurting his own reelection chances.
Top 5 most similar sentences in corpus:
Trump contradicts US intel, says Covid-19 started in Wuhan lab. (Score: 0.7472)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ) (Score: 0.7408)
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters) (Score: 0.7111)
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.6213)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.6181)
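The example above is simple, but it illustrates an important point about semantic search. A human would need several minutes to find the most similar sentences. Semantic search lets us find specific information in text without a person in the loop, which means we can search thousands of documents for the phrases we care about at machine speed.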
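When the corpus grows from a few dozen headlines to many thousands, sorting every distance for every query becomes wasteful. The following is a minimal sketch of a vectorized top-k lookup in NumPy; the helper name `top_k_cosine` and the random stand-in embeddings are assumptions for illustration, not part of the original code.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    corpus_matrix = np.asarray(corpus_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Normalise so that a plain dot product equals cosine similarity.
    corpus_unit = corpus_matrix / np.linalg.norm(corpus_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = corpus_unit @ query_unit
    k = min(k, len(scores))
    # argpartition selects the k best without sorting the whole array;
    # only those k candidates are then sorted by score.
    best = np.argpartition(-scores, k - 1)[:k]
    best = best[np.argsort(-scores[best])]
    return best, scores[best]

# Random vectors standing in for model.encode(corpus) on a large corpus.
rng = np.random.default_rng(0)
fake_corpus = rng.normal(size=(10_000, 768))
# A query that is a slightly perturbed copy of row 42.
fake_query = fake_corpus[42] + 0.01 * rng.normal(size=768)

indices, scores = top_k_cosine(fake_query, fake_corpus, k=5)
print(indices, scores)
```

With real data, `fake_corpus` would be replaced by the matrix of headline embeddings and `fake_query` by a single query embedding.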
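This technique is already being used to find similar sentences across two documents, or to pull key information out of quarterly earnings reports. For example, with this kind of semantic search we can easily find the daily active user figures of social companies such as Twitter, Facebook and Snapchat, even though each defines and names the metric differently: daily active users (DAU), monthly active users (MAU), or monetizable active users (mMAU). BERT-powered semantic search can recognize that all of these surface forms mean semantically the same thing, a measure of performance, and can extract the sentences we are interested in from the reports.
It is not a far-fetched idea that hedge funds use semantic search to parse and surface metrics from quarterly reports (10-Q/10-K) and feed them into quantitative trading signals as soon as the reports are released.
The experiment above shows how effective semantic search has become over the past year.
Another major use of these sentence embeddings is clustering. We can quickly group sentences from a single document, or from multiple documents, into clusters of similar sentences.
Reusing the embeddings from the code above, we can apply a simple k-means from sklearn.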
from sklearn.cluster import KMeans
import numpy as np

num_clusters = 10
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

# Print the headlines assigned to each cluster
for i in range(num_clusters):
    print()
    print(f'Cluster {i + 1} contains:')
    clust_sent = np.where(cluster_assignment == i)
    for k in clust_sent[0]:
        print(f'- {corpus[k]}')
Again, the results are accurate for a machine. Here are a few of the clusters:
Cluster 2 contains:
- AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)
- Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)
Cluster 3 contains:
- Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)
- Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)
- Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)
Cluster 4 contains:
- Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)
- Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)
- Trump contradicts US intel, says Covid-19 started in Wuhan lab. (The Hill)
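One design detail worth keeping in mind: the search step ranks headlines by cosine similarity, whereas k-means groups points by Euclidean distance. If the embeddings are first scaled to unit length, the two measures agree in ranking, because for unit vectors the squared Euclidean distance equals 2 - 2 * cosine similarity. The sketch below checks that identity on random stand-in vectors; the helper `l2_normalize` is an assumption for illustration, and whether normalizing changes the clusters for this particular corpus was not tested in the original experiment.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale every row of a 2-D array to unit length."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Random vectors standing in for corpus_embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 8))
unit = l2_normalize(fake_embeddings)

a, b = unit[0], unit[1]
squared_euclidean = np.sum((a - b) ** 2)
cosine = a @ b

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(np.isclose(squared_euclidean, 2 - 2 * cosine))
```

The normalized matrix could then be passed to the clustering step in place of the raw embeddings, for example `clustering_model.fit(l2_normalize(corpus_embeddings))`.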
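Interestingly, Elasticsearch now supports dense vector fields (https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch), alongside other industry tools for fast vector comparison such as Facebook's faiss. This technology is cutting-edge yet very much operational, and can be rolled out in a matter of weeks. State-of-the-art AI is within reach for anyone who knows where to look.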
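As a rough sketch of what the Elasticsearch route could look like, the functions below build the JSON bodies for an index mapping with a `dense_vector` field and for a `script_score` query that ranks documents by cosine similarity. The field names `text` and `headline_vector` are hypothetical, 768 is the output size of BERT-base models, and the script syntax shown is the Elasticsearch 7.6+ form as I understand it (earlier 7.x releases used a slightly different one), so check it against the documentation for your version. Nothing here talks to a cluster; it only constructs the request bodies.

```python
import json

def build_index_mapping(field="headline_vector", dims=768):
    """Mapping with a text field and a dense_vector field (names are hypothetical)."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                field: {"type": "dense_vector", "dims": dims},
            }
        }
    }

def build_similarity_query(query_vector, field="headline_vector", size=5):
    """script_score query ranking documents by cosine similarity to query_vector."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 keeps the score non-negative, which Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    # Plain Python floats so the body is JSON-serialisable.
                    "params": {"query_vector": [float(x) for x in query_vector]},
                },
            }
        },
    }

# Short made-up vector for display; a real query embedding would have 768 values.
print(json.dumps(build_index_mapping(), indent=2))
print(json.dumps(build_similarity_query([0.1, 0.2, 0.3]), indent=2))
```

In practice the query vector would come from encoding the search phrase with the same sentence embedding model used to index the documents.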
Original English article: https://towardsdatascience.com/cutting-edge-semantic-search-and-sentence-similarity-53380328c655