
[Elasticsearch] Search Highlighting

news 2025/7/6 8:41:32

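Search Highlighting

1. What Is a Fragment
2. Hands-On Examples
- 2.1 Preparing Test Data
- 2.2 Basic Highlight Syntax
- 2.3 Custom Highlight Tags
- 2.4 Multi-Field Highlighting
- 2.5 Returning the Full Field Content
- 2.6 Long Text with Multiple Match Points
- 2.7 Limiting Fragment Count and Size
3. Why Fragments Are Sometimes Longer Than fragment_size
- 3.1 Example Analysis
  - 3.1.1 Query 1: Strictly Limiting Fragment Size
  - 3.1.2 Query 2: Observing Fragment Expansion
- 3.2 Key Takeaways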

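Highlighting marks the matching pieces of text in search results.

1. What Is a Fragment

An Elasticsearch highlight fragment is a short piece of text extracted from the original field that contains the search keywords. Its purpose is to let users quickly see where the match occurs in the original text, rather than returning the entire field content.

The fragments returned by a search are controlled jointly by fragment_size and number_of_fragments.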

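Property	Description
`fragment_size`	Target length, in characters, of each highlighted fragment (the actual length may be somewhat larger because words are kept whole). Default: 100. For a long field, only the fragments that contain matches are returned instead of the whole field, which reduces the amount of data sent over the network and improves readability.
`number_of_fragments`	Maximum number of fragments returned per field (for a single field of a single document). Default: 5. A long text may match the query in several places; this parameter controls how many of those match points are shown. If set to 0, the entire field content is returned without being split into fragments, which suits short text or cases that need the full content.
Fragment content	Each returned fragment contains at least one matched keyword, wrapped in highlight tags (such as `<em>`).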
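
To make the two settings concrete, here is a minimal Python sketch that assembles the highlight part of a search body. The helper name build_highlight is hypothetical (it is not part of any Elasticsearch client), and the defaults of 100 and 5 are simply the values listed in the table above. The script only prints the JSON body; it does not contact a cluster.

```python
import json


def build_highlight(field, fragment_size=100, number_of_fragments=5):
    """Return the "highlight" section of a search body for one field.

    Hypothetical helper for illustration only; the default arguments
    mirror the defaults described in the table above.
    """
    if fragment_size < 0 or number_of_fragments < 0:
        raise ValueError("sizes must be non-negative")
    return {
        "fields": {
            field: {
                "fragment_size": fragment_size,
                "number_of_fragments": number_of_fragments,
            }
        }
    }


body = {
    "query": {"match": {"content": "Elasticsearch"}},
    "highlight": build_highlight("content", fragment_size=150, number_of_fragments=3),
}
print(json.dumps(body, indent=2))
```

Passing number_of_fragments=0 to the same helper produces the "return the whole field" variant described in the table.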

2. Hands-On Examples

2.1 Preparing Test Data

First, create an index named blog_posts and insert some test data:

PUT /blog_posts
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" },
      "author": { "type": "keyword" },
      "views": { "type": "integer" },
      "publish_date": { "type": "date" },
      "tags": { "type": "keyword" }
    }
  }
}

POST /blog_posts/_bulk
{"index":{}}
{"title":"Elasticsearch Basics","content":"Learn the basics of Elasticsearch and how to perform simple queries.","author":"John Doe","views":1500,"publish_date":"2023-01-15","tags":["search","database"]}
{"index":{}}
{"title":"Advanced Search Techniques","content":"Explore advanced search techniques in Elasticsearch including aggregations and filters.","author":"Jane Smith","views":3200,"publish_date":"2023-02-20","tags":["search","advanced"]}
{"index":{}}
{"title":"Data Analytics with ELK","content":"How to use the ELK stack for data analytics and visualization.","author":"John Doe","views":2800,"publish_date":"2023-03-10","tags":["analytics","elk"]}
{"index":{}}
{"title":"Elasticsearch Performance Tuning","content":"Tips and tricks for optimizing Elasticsearch performance in production environments.","author":"Mike Johnson","views":4200,"publish_date":"2023-04-05","tags":["performance","optimization"]}
{"index":{}}
{"title":"Kibana Dashboard Guide","content":"Creating effective dashboards in Kibana for monitoring and analysis.","author":"Jane Smith","views":1900,"publish_date":"2023-05-12","tags":["kibana","visualization"]}


2.2 Basic Highlight Syntax

The basic form specifies no highlight parameters. The defaults still apply in that case: fragment_size is 100 and number_of_fragments is 5.

GET /blog_posts/_search
{
  "query": {
    "match": { "content": "Elasticsearch" }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
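
On the client side, the highlighted fragments arrive under a highlight key on each hit, next to _source. The sketch below walks such a response in Python. The function name collect_highlights and the sample response (including its document IDs) are made up for illustration, and the response is trimmed to the keys used here.

```python
def collect_highlights(response, field):
    """Yield (doc_id, fragments) for every hit that has highlights on `field`."""
    for hit in response.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get(field)
        if fragments:
            yield hit["_id"], fragments


# Illustrative response, trimmed to the relevant keys; the IDs are invented.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "doc-1",
                "_source": {"title": "Elasticsearch Basics"},
                "highlight": {
                    "content": [
                        "Learn the basics of <em>Elasticsearch</em> and how to perform simple queries."
                    ]
                },
            },
            # A hit without a "highlight" key is skipped by the helper.
            {"_id": "doc-2", "_source": {"title": "No highlight here"}},
        ]
    }
}

for doc_id, fragments in collect_highlights(sample_response, "content"):
    print(doc_id, fragments)
```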


2.3 Custom Highlight Tags

GET /blog_posts/_search
{
  "query": {
    "match": { "content": "techniques" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "content": {
        "fragment_size": 150,
        "number_of_fragments": 3
      }
    }
  }
}


2.4 Multi-Field Highlighting

GET /blog_posts/_search
{
  "query": {
    "multi_match": {
      "query": "search",
      "fields": ["title", "content"]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "content": {
        "fragment_size": 100,
        "number_of_fragments": 2
      }
    }
  }
}
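
With a multi-field query, a hit may match in only one of the requested fields, and its highlight object then contains only that field. A minimal Python sketch of one way to handle this when rendering results, falling back to the raw _source value; the function name display_text and the sample hit are invented for illustration:

```python
def display_text(hit, field):
    """Return the highlighted fragments for `field`, or the raw field
    value wrapped in a one-item list when the field was not highlighted."""
    fragments = hit.get("highlight", {}).get(field)
    if fragments:
        return fragments
    return [str(hit.get("_source", {}).get(field, ""))]


# Made-up hit that matched "search" in the title only.
hit = {
    "_source": {"title": "Search Tips", "content": "Nothing relevant here."},
    "highlight": {"title": ["<em>Search</em> Tips"]},
}

print(display_text(hit, "title"))    # highlighted fragment
print(display_text(hit, "content"))  # raw value, no highlight available
```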


2.5 Returning the Full Field Content

Set number_of_fragments to 0. This suits short text such as titles.

GET /blog_posts/_search
{
  "query": {
    "match": { "title": "Kibana" }
  },
  "highlight": {
    "fields": {
      "title": { "number_of_fragments": 0 }
    }
  }
}


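2.6 Long Text with Multiple Match Points

Simulate a long text field by updating an existing document. The bulk request in 2.1 did not specify IDs, so the document ID below was auto-generated; replace it with an ID from your own index.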

POST /blog_posts/_update/Nmgc2ZcB9mA5oeTvZT0A
{
  "doc": {
    "content": "Elasticsearch is a tool. Elasticsearch is fast. Elasticsearch scales well. Repeat: Elasticsearch is a tool."
  }
}

GET /blog_posts/_search
{
  "query": {
    "match": { "content": "Elasticsearch" }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 30,
        "number_of_fragments": 3
      }
    }
  }
}

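For the updated document this returns at most 3 fragments, so the first 3 match points are shown and the 4th (repeated) match is omitted.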


2.7 Limiting Fragment Count and Size

Each fragment targets about 30 characters, and only 2 fragments are returned, even if more places match.

GET /blog_posts/_search
{
  "query": {
    "match": { "content": "Elasticsearch" }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 30,
        "number_of_fragments": 2
      }
    }
  }
}


If number_of_fragments is not set, the default of 5 applies. The updated document has only 4 match points, so all of them are returned here.

GET /blog_posts/_search
{
  "query": {
    "match": { "content": "Elasticsearch" }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 30
      }
    }
  }
}


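3. Why Fragments Are Sometimes Longer Than fragment_size

Even when fragment_size is set, the returned fragment can be somewhat longer. The reasons include:

Word integrity: Elasticsearch does not cut in the middle of a word, so a fragment is extended to the next space or punctuation mark.

// With fragment_size=20 and the matched term "Elasticsearch"
"This is a test with Elasticsearch and other words"
→ may return: "test with <em>Elasticsearch</em> and other" (33 characters, excluding the tags)

Highlight tag length: the HTML highlight tags (such as <em>) add extra characters to the returned string, but these do not count toward fragment_size.
Boundary expansion: to keep the context readable, Elasticsearch may extend the fragment slightly.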
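
The word-integrity effect can be reproduced with a deliberately simplified model. The Python sketch below is an assumption-laden toy, not the algorithm Elasticsearch or Lucene actually uses: it centres a window of fragment_size characters on the matched term and then widens both ends to the nearest whitespace. The function name toy_fragment is hypothetical, and the output differs from what a real highlighter returns; the point is only that any boundary-respecting strategy can overshoot the target.

```python
import re


def toy_fragment(text, term, fragment_size):
    """Toy model: centre a window of `fragment_size` characters on `term`,
    then widen both ends until they land on whitespace.

    NOT the Elasticsearch/Lucene algorithm; it only illustrates why
    respecting word boundaries pushes a fragment past its target size.
    """
    match = re.search(r"\b" + re.escape(term) + r"\b", text)
    if match is None:
        return None
    # Characters of context left over once the term itself is accounted for.
    slack = max(fragment_size - len(term), 0)
    start = max(match.start() - slack // 2, 0)
    end = min(match.end() + (slack - slack // 2), len(text))
    # Widen to the nearest whitespace so that no word is cut in half.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return text[start:end].strip()


text = "This is a test with Elasticsearch and other words"
fragment = toy_fragment(text, "Elasticsearch", 20)
print(len(fragment), repr(fragment))  # 22 'with Elasticsearch and'
```

The raw 20-character window starts inside "with", so widening it to whole words yields 22 characters, two more than requested.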

3.1 Example Analysis

Insert the test data.

PUT /test/_doc/1
{
  "text": "Elasticsearch is a distributed search engine. It is built on top of Lucene. Elasticsearch provides powerful full-text search capabilities. Many companies use Elasticsearch for log analytics."
}

3.1.1 Query 1: Strictly Limiting Fragment Size

GET /test/_search
{
  "query": {
    "match": { "text": "Elasticsearch" }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 20,
        "number_of_fragments": 2
      }
    }
  }
}

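Possible result (illustrative; the exact output depends on the Elasticsearch version and the highlighter type):

"highlight": {
  "text": [
    "<em>Elasticsearch</em> is a",     // 18 characters, excluding the tags
    "use <em>Elasticsearch</em> for"   // 21 characters, excluding the tags
  ]
}

Note: fragment_size=20 is a target. The second fragment comes out at 21 characters because words are kept whole.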

3.1.2 Query 2: Observing Fragment Expansion

GET /test/_search
{
  "query": {
    "match": { "text": "search" }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 10,
        "number_of_fragments": 1
      }
    }
  }
}

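Example result:

"highlight": {
  "text": [
    "distributed <em>search</em> engine"   // 25 characters excluding the tags, well above 10
  ]
}

Reason: a minimum amount of context is kept around the matched term "search", so the fragment cannot be cut to exactly 10 characters.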
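
The lengths of the fragment strings quoted in 3.1.1 and 3.1.2 can be checked directly. This small Python sketch strips the default <em> tags and prints both the visible length and the raw string length; the function name visible_length is hypothetical.

```python
import re

# Matches the default highlight tags only; adjust if pre_tags/post_tags differ.
TAG_PATTERN = re.compile(r"</?em>")


def visible_length(fragment):
    """Length of a fragment with the <em>/</em> highlight tags removed."""
    return len(TAG_PATTERN.sub("", fragment))


fragments = [
    "<em>Elasticsearch</em> is a",
    "use <em>Elasticsearch</em> for",
    "distributed <em>search</em> engine",
]
for fragment in fragments:
    # visible length, raw length (tags included), fragment
    print(visible_length(fragment), len(fragment), fragment)
```

The visible lengths are 18, 21, and 25 characters; each raw string is 9 characters longer because of the <em> and </em> tags.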

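3.2 Key Takeaways

A fragment is a piece of text around the matched term, used to show the keyword in context.
fragment_size is a target value; the returned string can be longer because:
- words are kept whole
- highlight tags are added (they do not count toward fragment_size)
- a minimum amount of context is kept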

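When a tighter limit is needed, one suggested combination is "type": "plain" together with "boundary_scanner": "chars" (this may split words). Note that the Elasticsearch reference documents boundary_scanner only for the unified and fvh highlighters, and its chars mode only for fvh, so check whether the setting has any effect on your version. For the plain highlighter, the documented splitting option is fragmenter (simple or span).

"highlight": {
  "fields": {
    "text": {
      "fragment_size": 20,
      "number_of_fragments": 1,
      "type": "plain",               // use the plain highlighter instead of the default
      "boundary_scanner": "chars"    // intended to split on characters rather than words
    }
  }
}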
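
If a hard cap is required regardless of the highlighter settings, another option is to trim the fragment on the client side after the response arrives. The Python sketch below is one possible approach, not an Elasticsearch feature: the function name truncate_fragment is hypothetical, it assumes the fragment contains only the given pre/post tags, and like any strict cut it may split a word.

```python
def truncate_fragment(fragment, limit, pre_tag="<em>", post_tag="</em>"):
    """Cut `fragment` to at most `limit` visible characters.

    Highlight tags are copied through without counting toward the limit,
    and a tag left open at the cut point is closed so the markup stays
    balanced.
    """
    out = []
    visible = 0
    tag_open = False
    i = 0
    while i < len(fragment) and visible < limit:
        if fragment.startswith(pre_tag, i):
            out.append(pre_tag)
            tag_open = True
            i += len(pre_tag)
        elif fragment.startswith(post_tag, i):
            out.append(post_tag)
            tag_open = False
            i += len(post_tag)
        else:
            out.append(fragment[i])
            visible += 1
            i += 1
    if tag_open:
        out.append(post_tag)  # close a highlight that was cut short
    return "".join(out)


print(truncate_fragment("distributed <em>search</em> engine", 15))
# distributed <em>sea</em>
```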