当前位置: 首页 > news >正文

Elasticsearch:创建一个简单的 “你的意思是?” 推荐搜索

你的意思是” 是搜索引擎中一个非常重要的功能,因为它们通过显示建议的术语来帮助用户,以便他可以进行更准确的搜索。比如,在百度中,我们进行搜索时,它通常会显示一些更为常用推荐的搜索选项来供我们选择:

 

为了创建 “你的意思是”,我们将使用 phrase suggester,因为通过它我们将能够建议句子更正,而不仅仅是术语。在我之前的文章 “Elasticsearch:如何实现短语建议 - phrase suggester”,我有涉及到这个问题。

首先,我们将使用一个 shingle 过滤器,因为它将提供一个分词,短语建议器将使用该标记来进行匹配并返回更正。有关 shingle 过滤器的描述,请阅读之前的文章 “Elasticsearch: Ngrams, edge ngrams, and shingles”。

准备数据

我们首先来定义映射:

PUT movies
{"settings": {"analysis": {"analyzer": {"en_analyzer": {"tokenizer": "standard","filter": ["lowercase","stop"]},"shingle_analyzer": {"type": "custom","tokenizer": "standard","filter": ["lowercase","shingle_filter"]}},"filter": {"shingle_filter": {"type": "shingle","min_shingle_size": 2,"max_shingle_size": 3}}}},"mappings": {"properties": {"title": {"type": "text","analyzer": "en_analyzer","fields": {"suggest": {"type": "text","analyzer": "shingle_analyzer"}}},"actors": {"type": "text","analyzer": "en_analyzer","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"description": {"type": "text","analyzer": "en_analyzer","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"director": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"genre": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"metascore": {"type": "long"},"rating": {"type": "float"},"revenue": {"type": "float"},"runtime": {"type": "long"},"votes": {"type": "long"},"year": {"type": "long"},"title_suggest": {"type": "completion","analyzer": "simple","preserve_separators": true,"preserve_position_increments": true,"max_input_length": 50}}}
}

我们接下来使用 _bulk 命令来写入一些文档到这个索引中去。我们使用这个链接中的内容。我们使用如下的方法:

POST movies/_bulk
{"index": {}}
{"title": "Guardians of the Galaxy", "genre": "Action,Adventure,Sci-Fi", "director": "James Gunn", "actors": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana", "description": "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.", "year": 2014, "runtime": 121, "rating": 8.1, "votes": 757074, "revenue": 333.13, "metascore": 76}
{"index": {}}
{"title": "Prometheus", "genre": "Adventure,Mystery,Sci-Fi", "director": "Ridley Scott", "actors": "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron", "description": "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.", "year": 2012, "runtime": 124, "rating": 7, "votes": 485820, "revenue": 126.46, "metascore": 65}....

 在上面,为了说明的方便,我省去了其它的文档。你需要把整个 movies.txt 的文件拷贝过来,并全部写入到 Elasticsearch 中。它共有1000 个文档。

搜索数据

现在让我们运行一个基本查询来查看 suggest 的结果:

GET movies/_search?filter_path=suggest
{"suggest": {"text": "transformers revenge of the falen","did_you_mean": {"phrase": {"field": "title.suggest","size": 5}}}
}

上面命令显示的结果为:

{"suggest": {"did_you_mean": [{"text": "transformers revenge of the falen","offset": 0,"length": 33,"options": [{"text": "transformers revenge of the fallen","score": 0.004467494},{"text": "transformers revenge of the fall","score": 0.00020402104},{"text": "transformers revenge of the face","score": 0.00006419608}]}]}
}

请注意,在几行中你已经获得了一些有希望的结果。

现在让我们通过使用更多短语建议功能来增加我们的查询。让我们使用 max_errors = 2,这样我们希望句子中最多有两个术语。 添加了 highlight 显示以突出​​显示建议的术语。

GET movies/_search?filter_path=suggest
{"suggest": {"text": "transformer revenge of the falen","did_you_mean": {"phrase": {"field": "title.suggest","size": 5,"confidence": 1,"max_errors":2,"highlight": {"pre_tag": "<strong>","post_tag": "</strong>"}}}}
}

上面命令返回的结果为:

{"suggest": {"did_you_mean": [{"text": "transformer revenge of the falen","offset": 0,"length": 32,"options": [{"text": "transformers revenge of the fallen","highlighted": "<strong>transformers</strong> revenge of the <strong>fallen</strong>","score": 0.004382903},{"text": "transformers revenge of the fall","highlighted": "<strong>transformers</strong> revenge of the <strong>fall</strong>","score": 0.00020015794},{"text": "transformers revenge of the face","highlighted": "<strong>transformers</strong> revenge of the <strong>face</strong>","score": 0.00006298054},{"text": "transformers revenge of the falen","highlighted": "<strong>transformers</strong> revenge of the falen","score": 0.00006159308},{"text": "transformer revenge of the fallen","highlighted": "transformer revenge of the <strong>fallen</strong>","score": 0.000048000533}]}]}
}

我们再改进一点好吗? 我们添加了 “collate”,我们可以对每个结果执行查询,改进建议的结果。 我使用了带有 “and” 运算符的匹配项,以便在同一个句子中匹配所有术语。 如果我仍然想要不符合查询条件的结果,我使用 prune = true。

GET movies/_search?filter_path=suggest
{"suggest": {"text": "transformer revenge of the falen","did_you_mean": {"phrase": {"field": "title.suggest","size": 5,"confidence": 1,"max_errors":2,"collate": {"query": { "source" : {"match": {"{{field_name}}": {"query": "{{suggestion}}","operator": "and"}}}},"params": {"field_name" : "title"}, "prune" :true},"highlight": {"pre_tag": "<strong>","post_tag": "</strong>"}}}}
}

现在的结果是:

请注意,答案已更改,我有一个新字段 “collat​​e_match”,它指示结果中是否匹配整理规则(这是因为 prune = true)。

让我们设置 prune 为 false:

GET movies/_search?filter_path=suggest
{"suggest": {"text": "transformer revenge of the falen","did_you_mean": {"phrase": {"field": "title.suggest","size": 5,"confidence": 1,"max_errors":2,"collate": {"query": { "source" : {"match": {"{{field_name}}": {"query": "{{suggestion}}","operator": "and"}}}},"params": {"field_name" : "title"}, "prune" :false},"highlight": {"pre_tag": "<strong>","post_tag": "</strong>"}}}}
}

这次我们得到的结果是:

我们可以看到只有一个结果是最相关的建议。 

http://www.lryc.cn/news/23576.html

相关文章:

  • urllib之ProxyHandler代理以及CookieJar的cookie内存传递和本地保存与读取的使用详解
  • 华为造车锚定智选模式, 起点赢家赛力斯驶入新能源主航道
  • [oeasy]python0096_游戏娱乐行业_雅达利_米洛华_四人赛马_影视结合游戏
  • 使用python测试框架完成自动化测试并生成报告-实例练习
  • JavaWeb 实战 01 - 计算机是如何工作的
  • 线性代数学习-1
  • 人工智能写的十段代码,九个通过测试了
  • 巴塞尔问题数值逼近方法
  • 【深度学习环境】Docker
  • 基于vscode开发vue项目的详细步骤教程 2 第三方图标库FontAwesome
  • 今天面了个腾讯拿25K出来的软件测试工程师,让我见识到了真正的天花板...
  • OSG三维渲染引擎编程学习之六十九:“第六章:OSG场景工作机制” 之 “6.9 OSG数据变量”
  • Tektronix泰克TDP3500差分探头3.5GHz
  • 轻松实现内网穿透:实现远程访问你的私人网络
  • MySQL长字符截断
  • python计算量比指标
  • 下拉框推荐-Suggest-SUG
  • Nmap的几种扫描方式以及相应的命令
  • Qt::QOpenGLWidget 渲染天空壳
  • 谷歌搜索技巧大全 | 谷歌高级搜索语法指令
  • JAVA开发(JAVA垃圾回收的几种常见算法)
  • 你还不会用CAD一键布置停车位?赶紧学起来!
  • 【MySQL之MySQL底层分析篇】系统学习MySQL,从应用SQL语法到底层知识讲解,这将是你见过最完成的知识体系
  • 单核CPU是否有线程可见性问题?
  • MyBatis 架构介绍
  • 加密算法---RSA 非对称加密原理及使用
  • MySQL-查询语句
  • 【算法】【数组与矩阵模块】求数组中需要排序的最短子数组长度
  • centos安装Anaconda3
  • 【微信小程序】-- WXML 模板语法 - 列表渲染 -- wx:for wx:key(十二)