Elasticsearch: Deduplicating Query Results on a Single Field

ES Queries

The full data set

GET wkl_test/_search
{"query": {"match_all": {}}
}

Result:

{"took" : 123,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : 1.0,"hits" : [{"_index" : "wkl_test","_type" : "_doc","_id" : "aK0tFpABTkLj5j4c34pE","_score" : 1.0,"_source" : {"name" : "zhangsan","aa" : 1}},{"_index" : "wkl_test","_type" : "_doc","_id" : "aa0uFpABTkLj5j4cFYrJ","_score" : 1.0,"_source" : {"name" : "lisi","aa" : 2}},{"_index" : "wkl_test","_type" : "_doc","_id" : "aq0uFpABTkLj5j4cKYqF","_score" : 1.0,"_source" : {"name" : "wangwu","aa" : 2}},{"_index" : "wkl_test","_type" : "_doc","_id" : "a60uFpABTkLj5j4c2IoF","_score" : 1.0,"_source" : {"name" : "maliu","aa" : 2}},{"_index" : "wkl_test","_type" : "_doc","_id" : "bK1IFpABTkLj5j4cqYop","_score" : 1.0,"_source" : {"name" : "gouqi","aa" : 3}}]}
}

1: collapse (field collapsing) - query the deduplicated list of documents (supported since ES 5.3)

  • Why it is recommended: high performance and a small memory footprint.
  • Note: with this approach, documents that are missing the collapse field are not filtered out.
  • The collapse field must be a numeric type (e.g. long) or a keyword field.
  • Field collapsing cannot be combined with scroll, rescore, or search_after.
GET wkl_test/_search
{"query": {"match_all": {}},"collapse": {"field": "aa"}
}

Result: hits.total is still 5, but only the 3 deduplicated documents are returned.

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : null,"hits" : [{"_index" : "wkl_test","_type" : "_doc","_id" : "aK0tFpABTkLj5j4c34pE","_score" : 1.0,"_source" : {"name" : "zhangsan","aa" : 1},"fields" : {"aa" : [1]}},{"_index" : "wkl_test","_type" : "_doc","_id" : "aa0uFpABTkLj5j4cFYrJ","_score" : 1.0,"_source" : {"name" : "lisi","aa" : 2},"fields" : {"aa" : [2]}},{"_index" : "wkl_test","_type" : "_doc","_id" : "bK1IFpABTkLj5j4cqYop","_score" : 1.0,"_source" : {"name" : "gouqi","aa" : 3},"fields" : {"aa" : [3]}}]}
}

2: cardinality - query the deduplicated total count

  • Aggregation + cardinality: a distinct count, similar to COUNT(DISTINCT ...) in SQL: deduplicate first, then count.
  • Note: with this approach, documents that are missing the field are excluded from the count.
GET wkl_test/_search
{"query": {"match_all": {}},"size": 0, "aggs": {"distinct_count": {"cardinality": {"field": "aa"}}}
}

Result: distinct_count = 3, i.e. there are 3 distinct values. The aggregations section returns the deduplicated count for the aa field, but only the count; no documents are returned.

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : null,"hits" : [ ]},"aggregations" : {"distinct_count" : {"value" : 3}}
}
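
Keep in mind that cardinality is an approximate count (it is based on the HyperLogLog++ algorithm), so on high-cardinality fields the value can deviate slightly. If you need near-exact counts up to a given size, you can raise precision_threshold; a sketch (the value 40000 is just an illustration, and is also the maximum ES accepts):

GET wkl_test/_search
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "aa",
        "precision_threshold": 40000
      }
    }
  }
}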

3: The combined query

  • With collapse, the response returns the deduplicated documents, but hits.total still counts all documents.
  • With the cardinality aggregation, the correct deduplicated count is returned under aggs, but the hits section still reflects all of the data.
  • So we need to combine the two, as follows:
GET wkl_test/_search
{"query": {"match_all": {}},"collapse": {"field": "aa"}, "aggs": {"distinct_count": {"cardinality": {"field": "aa"}}}
}

Result:

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : null,"hits" : [{"_index" : "wkl_test","_type" : "_doc","_id" : "aK0tFpABTkLj5j4c34pE","_score" : 1.0,"_source" : {"name" : "zhangsan","aa" : 1},"fields" : {"aa" : [1]}},{"_index" : "wkl_test","_type" : "_doc","_id" : "aa0uFpABTkLj5j4cFYrJ","_score" : 1.0,"_source" : {"name" : "lisi","aa" : 2},"fields" : {"aa" : [2]}},{"_index" : "wkl_test","_type" : "_doc","_id" : "bK1IFpABTkLj5j4cqYop","_score" : 1.0,"_source" : {"name" : "gouqi","aa" : 3},"fields" : {"aa" : [3]}}]},"aggregations" : {"distinct_count" : {"value" : 3}}
}

Note: we use distinct_count from the cardinality aggregation as the deduplicated total, and the collapsed hits list as the result set.

Notes on using this for pagination:

  • 1. hits.total is the total before deduplication, i.e. the raw number of documents; it is good to know, but we do not use it for pagination. The hits array holds the deduplicated documents, and its size matches the value of the distinct_count aggregation (as long as everything fits on one page).

  • 2. The distinct_count value under aggregations is the actual count after deduplication, and it is the total used for pagination.

  • 3. from is the query offset, i.e. where to start reading.

  • 4. size is the number of results to return per request.

  • From here you can treat it as an ordinary paginated query: just pass in from and size, as in the sketch below.
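
For example, a paginated request could look like this (a sketch; the from and size values of 0 and 2 are arbitrary illustration values):

GET wkl_test/_search
{
  "from": 0,
  "size": 2,
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "aa"
  },
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "aa"
      }
    }
  }
}

Each page returns at most size collapsed hits starting at from, while distinct_count still reports the overall deduplicated total, which is what you use to compute the number of pages.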

Java API usage

1: collapse - retrieving the deduplicated result set

// Use collapse to specify the deduplication field, e.g. "your_distinct_field"
CollapseBuilder collapseBuilder = new CollapseBuilder("your_distinct_field");
searchSourceBuilder.collapse(collapseBuilder);

2: cardinality - retrieving the deduplicated total count

// Add a cardinality aggregation to count the unique values of the deduplication field
CardinalityAggregationBuilder aggregation = AggregationBuilders
        .cardinality("distinct_count")      // name of the aggregation result
        .field("your_distinct_field")       // field to aggregate on
        .precisionThreshold(40000);         // adjust the precision threshold as needed
searchSourceBuilder.aggregation(aggregation);

3: Putting it together

package com.wenge.system.utils;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.CardinalityAggregationBuilder;
import org.elasticsearch.search.aggregations.metrics.ParsedCardinality;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.collapse.CollapseBuilder;

import java.io.IOException;
import java.util.Map;

/**
 * @author wangkanglu
 * @version 1.0
 * @description
 * @date 2024-06-17 16:48
 */
public class TestES {

    public static void main(String[] args) throws IOException {
        // Create the ES client
        RestHighLevelClient esClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        try {
            // Create a search request against the target index
            SearchRequest searchRequest = new SearchRequest("your_index");

            // Build the search source
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
            // Query condition; match_all here, adjust it to your own business logic
            searchSourceBuilder.query(QueryBuilders.matchAllQuery());

            // Use collapse to specify the deduplication field, e.g. "your_distinct_field"
            CollapseBuilder collapseBuilder = new CollapseBuilder("your_distinct_field");
            searchSourceBuilder.collapse(collapseBuilder);

            // Add a cardinality aggregation to count the unique values of the field
            CardinalityAggregationBuilder aggregation = AggregationBuilders
                    .cardinality("distinct_count")      // name of the aggregation result
                    .field("your_distinct_field")       // field to aggregate on
                    .precisionThreshold(40000);         // adjust the precision threshold as needed
            searchSourceBuilder.aggregation(aggregation);

            // Attach the source to the request
            searchRequest.source(searchSourceBuilder);

            // Execute the search
            SearchResponse searchResponse = esClient.search(searchRequest, RequestOptions.DEFAULT);
            SearchHit[] hits = searchResponse.getHits().getHits();
            for (SearchHit hit : hits) {
                Map<String, Object> sourceAsMap = hit.getSourceAsMap();
                System.out.println("Deduplicated hit: " + sourceAsMap);
            }

            // Read the deduplicated count from the aggregation result
            ParsedCardinality parsedCardinality = searchResponse.getAggregations().get("distinct_count");
            long distinctCount = parsedCardinality.getValue();
            System.out.println("Deduplicated count: " + distinctCount);
        } finally {
            // Close the client
            esClient.close();
        }
    }
}
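
If you also need pagination on the Java side, the from and size described above map directly onto SearchSourceBuilder. A minimal sketch (the page number and page size here are illustrative values, not part of the original example):

// Pagination over the collapsed hits: from is the offset, size is the page size
int pageNo = 1;     // 1-based page number (illustrative)
int pageSize = 10;  // page size (illustrative)
searchSourceBuilder.from((pageNo - 1) * pageSize);
searchSourceBuilder.size(pageSize);

The distinct_count value read from the aggregation is then used as the total when computing how many pages there are.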