
Learn Tech, Learn English: The Two Phases of an Elasticsearch Search, Querying and Fetching

To understand Elasticsearch’s distributed search, let’s take a moment to understand how querying and fetching work. Unlike simple CRUD tasks, distributed search is like navigating through a maze of shards spread across the cluster.

In Elasticsearch, CRUD operations handle individual documents, each identified by the unique combination of _index, _type, and routing value (usually the document’s _id). That combination tells the cluster exactly which shard holds the document. Search queries are more complex. They don’t have a fixed destination and must consult a copy of every shard in the index or indices to locate potential matches.
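A minimal sketch of that difference, assuming the simplified "hash of the routing value modulo the number of primary shards" rule. The function names are hypothetical, and `zlib.crc32` is only a stand-in hash chosen for illustration; Elasticsearch uses its own hash function internally.

```python
import zlib
from typing import List


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick the single shard for a document from its routing value.

    zlib.crc32 is a stand-in hash (an assumption for this sketch);
    only the "hash modulo shard count" shape is being illustrated.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def shards_for_search(num_primary_shards: int) -> List[int]:
    """A search without a routing value has to visit every shard."""
    return list(range(num_primary_shards))


if __name__ == "__main__":
    n = 5
    print("GET by _id 'doc-42' goes to shard", shard_for("doc-42", n))
    print("a search goes to shards", shards_for_search(n))
```

A document lookup resolves to exactly one shard number, while the search has to fan out to all of them, which is why search needs the extra merge step described next.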

However, discovering matching documents marks just the beginning. The search API needs to combine results from various shards into a unified, organized list before displaying them to the user. This initiates the two-step process of querying and fetching.

By default, Elasticsearch utilizes a search method known as “Query Then Fetch.” This approach progresses through the following steps:

  1. The client sends a query to Elasticsearch.
  2. The query is broadcast to each shard.
  3. Each shard finds all matching documents and calculates scores using its local term/document frequencies.
  4. Each shard builds a priority queue of results (sorting, pagination with from/size, etc.).
  5. Each shard returns metadata about its results to the requesting node. Note that the actual documents are not sent yet, only their IDs and scores.
  6. The scores from all shards are merged and sorted on the requesting node, and documents are selected according to the query criteria.
  7. Finally, the actual documents are retrieved from the individual shards where they reside.
  8. The results are returned to the client.

Note: The coordinating node is responsible for steps 1, 2, and 8. It is also the requesting node that merges the shard results in step 6.

Query Phase (3,4,5,6): the search query is sent to every shard, initiating local execution and the creation of a priority queue containing matching documents.

Fetch Phase (7): while the query phase identifies relevant documents, the fetch phase is responsible for fetching the actual documents from their respective shards.
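The two phases can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not Elasticsearch code: the shards are plain dictionaries, the scores are precomputed, and every function name here is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical in-memory shard: doc id -> (precomputed score, source document).
Shard = Dict[str, Tuple[float, dict]]


def query_phase(shard: Shard, size: int) -> List[Tuple[float, str]]:
    """Steps 3-5: a shard returns only (score, doc_id) for its local top hits."""
    scored = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(size, scored)


def merge(per_shard: List[List[Tuple[float, str]]], size: int) -> List[Tuple[float, str, int]]:
    """Step 6: the coordinating node merges the shard lists into a global top `size`."""
    candidates = [
        (score, doc_id, shard_no)
        for shard_no, hits in enumerate(per_shard)
        for score, doc_id in hits
    ]
    return heapq.nlargest(size, candidates)


def fetch_phase(shards: List[Shard], winners: List[Tuple[float, str, int]]) -> List[dict]:
    """Step 7: only the winning documents are loaded from the shards that hold them."""
    return [shards[shard_no][doc_id][1] for _score, doc_id, shard_no in winners]


def search(shards: List[Shard], size: int) -> List[dict]:
    per_shard = [query_phase(shard, size) for shard in shards]
    winners = merge(per_shard, size)
    return fetch_phase(shards, winners)


if __name__ == "__main__":
    shards = [
        {"a": (1.2, {"title": "alpha"}), "b": (0.4, {"title": "beta"})},
        {"c": (2.5, {"title": "gamma"}), "d": (0.9, {"title": "delta"})},
    ]
    print(search(shards, size=2))
```

Each shard sends back at most `size` lightweight (score, id) pairs, and full documents cross the network only for the final winners. That is the point of splitting the work into two phases.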

This divided method guarantees effective and scalable search operations in a distributed setting. In the query phase, the search query navigates through each shard copy (primary or replica shards) to initiate local searches and compile a prioritized list of matching documents. This phase marks the initial step in refining the search results.

The fetch phase then retrieves the selected documents and delivers the final search results. It bridges query execution and result retrieval, completing the search process.

Additional information:

Elasticsearch’s slow logs can be enabled separately for the query and fetch phases, which allows precise monitoring and optimization of search performance. By setting separate thresholds for query and fetch durations, administrators can pinpoint potential bottlenecks and adjust system parameters.

For instance, configuring slow logs with specific thresholds for query and fetch phases can be done as follows:

PUT *,-.*/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.fetch.warn": "100ms"
}

# or with curl
curl -XPUT "http://localhost:9200/*,-.*/_settings" -H "Content-Type: application/json" -d'
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.fetch.warn": "100ms"
}'
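The same settings request could also be built from a script. The sketch below uses only the Python standard library and constructs the request without sending it; the helper name `build_slowlog_request` is hypothetical, and the host and index pattern are the ones from the example above.

```python
import json
import urllib.request


def build_slowlog_request(base_url: str, index_pattern: str,
                          query_warn: str, fetch_warn: str) -> urllib.request.Request:
    """Build (but do not send) a PUT _settings request with per-phase thresholds."""
    body = {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_slowlog_request("http://localhost:9200", "*,-.*", "1s", "100ms")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the settings against a running cluster:
    # urllib.request.urlopen(req)
```

Keeping the two thresholds as separate parameters mirrors the idea in the text: the query and fetch phases are tuned independently.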

Elasticsearch query vs fetch times

Fetch time is generally expected to be much lower than query time. A topic on the Elastic discussion forum covers this difference in speed.

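Summary:

  1. The two-phase process of distributed search

    • Elasticsearch’s distributed search is split into a Query Phase and a Fetch Phase.

    • Query phase: the search request is broadcast to every shard; each shard executes the query locally and returns metadata (such as scores) for the matching documents.

    • Fetch phase: based on the results of the query phase, the actual document content is retrieved from the individual shards.

  2. Workflow of the query phase

    • The client sends the query request to the coordinating node.

    • The coordinating node broadcasts the query to a copy of every shard in the index (primary or replica).

    • Each shard executes the query locally, calculates document scores, and builds a priority queue.

    • Each shard returns metadata (such as document IDs and scores) to the coordinating node, which merges and sorts the results from all shards.

  3. Workflow of the fetch phase

    • Based on the results of the query phase, the coordinating node requests the actual document content from the relevant shards.

    • The shards return the document content, and the coordinating node returns the final results to the client.

  4. Slow log monitoring

    • Slow logs can be enabled separately for the query phase and the fetch phase to monitor and optimize search performance.

    • Example configuration: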


      PUT *,-.*/_settings
      {
        "index.search.slowlog.threshold.query.warn": "1s",
        "index.search.slowlog.threshold.fetch.warn": "100ms"
      }

  5. Query time versus fetch time

    • Fetch time is usually much lower than query time, because the query phase involves more computation and sorting.

