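Using Elasticsearch - Metrics Aggregations
Metrics aggregations
Metrics aggregations compute metrics from values extracted from the documents being aggregated. The values can come from a document field or be produced by a script.
A search that runs aggregations can also return the matching documents (hits), paginate them, and report the total hit count.
Aggregations can be combined with a query to restrict the data they run over: the query is served by the inverted index, while the aggregation reads field values from columnar storage (doc values).
For reference material, see "Metrics aggregations" in the Elasticsearch documentation.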
Syntax:
GET <index>/_search
{
  "aggs": {
    "<aggs_name>": {
      "<aggs_type>": {
        "field": "<field_name>"
      }
    }
  }
}
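min, max, sum, avg
These compute the minimum, maximum, sum, and average respectively. All four are single-value metrics aggregations. The values come from a numeric field of the aggregated documents or from a script.
- field: the field to aggregate on.
- missing: a default value to use for documents that have no value for the field. By default such documents are ignored.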
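The effect of `missing` can be sketched in plain Python. This is only an illustration of the semantics described above, not how Elasticsearch implements it; the `metric_values` helper and the sample documents are hypothetical.

```python
from statistics import mean

def metric_values(docs, field, missing=None):
    """Collect the values a metrics aggregation would see for `field`.

    Documents without the field are skipped unless `missing` is given,
    in which case `missing` is used as their value.
    """
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)
    return values

# Hypothetical documents: the third one has no "price" field.
docs = [{"price": 10.0}, {"price": 30.0}, {"name": "no price"}]

ignored = metric_values(docs, "price")              # default behaviour
filled = metric_values(docs, "price", missing=0.0)  # "missing": 0

print(min(ignored), max(ignored), sum(ignored), mean(ignored))
print(min(filled), max(filled), sum(filled), mean(filled))
```

With the default behaviour the average is taken over two values; with `missing` set to 0 it is taken over three, so the same data yields a lower average and a lower minimum.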
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sum_taxful_total_price": { "sum": { "field": "taxful_total_price" } },
    "avg_taxful_total_price": { "avg": { "field": "taxful_total_price" } },
    "max_taxful_total_price": { "max": { "field": "taxful_total_price" } },
    "min_taxful_total_price": { "min": { "field": "taxful_total_price" } }
  }
}
The output is as follows:
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_taxful_total_price" : { "value" : 2250.0 },
    "sum_taxful_total_price" : { "value" : 350884.12890625 },
    "avg_taxful_total_price" : { "value" : 75.05542864304813 },
    "min_taxful_total_price" : { "value" : 6.98828125 }
  }
}
All four aggregations can also run on values produced by a script, here through a runtime field:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "runtime_mappings": {
    "unit_price": {
      "type": "double",
      "script": """emit(doc['taxful_total_price'].value / doc['total_quantity'].value)"""
    }
  },
  "aggs": {
    "avg_unit_price": { "avg": { "field": "unit_price" } }
  }
}
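stats
The stats aggregation. A multi-value metrics aggregation that computes statistics over values taken from a numeric field of the aggregated documents or from a script. The statistics are count, min, max, sum, and avg.
- field: the field to aggregate on.
- missing: a default value to use for documents that have no value for the field. By default such documents are ignored.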
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "stats_taxful_total_price": { "stats": { "field": "taxful_total_price" } }
  }
}
The output is as follows:
{
  "took" : 22,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "stats_taxful_total_price" : {
      "count" : 4675,
      "min" : 6.98828125,
      "max" : 2250.0,
      "avg" : 75.05542864304813,
      "sum" : 350884.12890625
    }
  }
}
The stats aggregation can also run on values produced by a script:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "runtime_mappings": {
    "unit_price": {
      "type": "double",
      "script": """emit(doc['taxful_total_price'].value / doc['total_quantity'].value)"""
    }
  },
  "aggs": {
    "stat_price": { "stats": { "field": "unit_price" } }
  }
}
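extended_stats
The extended stats aggregation. A multi-value metrics aggregation that computes statistics over values taken from a numeric field of the aggregated documents or from a script. In addition to the stats metrics it returns the sum of squares, variance, standard deviation, and standard deviation bounds.
- field: the field to aggregate on.
- sigma: how many standard deviations above and below the mean std_deviation_bounds should span.
- missing: a default value to use for documents that have no value for the field. By default such documents are ignored.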
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "extend_stats_total_price": { "extended_stats": { "field": "taxful_total_price" } }
  }
}
The output is as follows:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "extend_stats_total_price" : {
      "count" : 4675,
      "min" : 6.98828125,
      "max" : 2250.0,
      "avg" : 75.05542864304813,
      "sum" : 350884.12890625,
      "sum_of_squares" : 3.9367749294174194E7, // sum of squares
      "variance" : 2787.59157113862, // variance
      "variance_population" : 2787.59157113862,
      "variance_sampling" : 2788.187974983536,
      "std_deviation" : 52.79764740155209, // standard deviation
      "std_deviation_population" : 52.79764740155209,
      "std_deviation_sampling" : 52.80329511482722,
      "std_deviation_bounds" : {
        "upper" : 180.6507234461523,
        "lower" : -30.53986616005605,
        "upper_population" : 180.6507234461523,
        "lower_population" : -30.53986616005605,
        "upper_sampling" : 180.66201887270256,
        "lower_sampling" : -30.551161586606312
      }
    }
  }
}
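The extended statistics are related to count, sum, and sum_of_squares by the textbook formulas. The sketch below rechecks the figures from the response above in Python; it assumes `sigma` is 2, since the request does not set it, and it is not necessarily how Elasticsearch computes the values internally.

```python
import math

# Values copied from the extended_stats response above.
count = 4675
total = 350884.12890625
sum_of_squares = 3.9367749294174194e7
sigma = 2  # assumption: the request above does not set `sigma`

avg = total / count
# Population variance: E[x^2] - (E[x])^2
variance_population = sum_of_squares / count - avg ** 2
# Sample variance applies Bessel's correction n / (n - 1).
variance_sampling = variance_population * count / (count - 1)
std_population = math.sqrt(variance_population)

# std_deviation_bounds: the mean plus or minus sigma standard deviations.
upper = avg + sigma * std_population
lower = avg - sigma * std_population

print(avg, variance_population, variance_sampling)
print(std_population, upper, lower)
```

Up to floating-point rounding, the printed values agree with avg, variance_population, variance_sampling, std_deviation, and the upper and lower bounds in the response.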
The extended_stats aggregation can likewise run on values produced by a script:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "runtime_mappings": {
    "unit_price": {
      "type": "double",
      "script": """emit(doc['taxful_total_price'].value / doc['total_quantity'].value)"""
    }
  },
  "aggs": {
    "extended_stat_unit_price": { "extended_stats": { "field": "unit_price" } }
  }
}
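percentiles
Pronounced [pə'sentaɪlz]; the percentiles aggregation.
A multi-value metrics aggregation that computes one or more percentiles over values taken from a numeric field, a histogram field, or a script.
A percentile is the point below which a given percentage of the observed values fall. For example, the 95th percentile is the value that is greater than 95% of the observed values.
Percentiles are often used to find outliers. In a normal distribution, the 0.13th and 99.87th percentiles lie three standard deviations from the mean. Data beyond three standard deviations is usually considered anomalous.
When a range of percentiles is retrieved, they can be used to estimate the data distribution and to determine whether the data is skewed, bimodal, and so on.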
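Both statements above can be checked with Python's standard `statistics` module. This is a sketch on exact, in-memory data; the `observations` sample is made up for illustration, and Elasticsearch itself computes approximate percentiles.

```python
from statistics import NormalDist, quantiles

# Percentiles that correspond to three standard deviations from the mean
# of a normal distribution.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
low_pct = standard_normal.cdf(-3.0) * 100
high_pct = standard_normal.cdf(3.0) * 100
print(round(low_pct, 2), round(high_pct, 2))  # 0.13 99.87

# Exact 95th percentile of a small, made-up sample (the integers 1..100),
# using linear interpolation between neighbouring observations.
observations = list(range(1, 101))
cut_points = quantiles(observations, n=100, method="inclusive")
p95 = cut_points[94]  # cut_points[i - 1] is the i-th percentile
below = sum(v < p95 for v in observations)
print(p95, below)  # 95 of the 100 observations lie below p95
```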
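- field: the field to aggregate on.
- keyed: defaults to true, which returns the percentiles as key-value pairs; if set to false, they are returned as an array.
- percents: the percentiles to compute.
- tdigest: selects the TDigest algorithm, which balances memory use against estimation accuracy. It uses a number of nodes to approximate the percentiles: the more nodes available, the higher the accuracy, but also the higher the memory use. The number of nodes is capped at compression * 20. A node takes roughly 32 bytes of memory, so in the worst case the default settings produce a TDigest of about 64 KB.
  - compression: the compression parameter. Defaults to 100.
- hdr: computes percentiles with an HDR Histogram (High Dynamic Range Histogram). It can be faster than TDigest but uses more memory. It maintains a fixed worst-case percentage error, specified as a number of significant digits. For example, if the histogram records values from 1 microsecond to 1 hour (3,600,000,000 microseconds) with 3 significant digits, it keeps a resolution of 1 microsecond for values up to 1 millisecond and a resolution of 3.6 seconds (or better) for the maximum tracked value (1 hour).
  - number_of_significant_value_digits: the number of significant digits. Must not be negative.
- missing: a default value to use for documents that have no value for the field. By default such documents are ignored.
- script: compute the values with a script.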
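The sizing figures quoted for `tdigest` and `hdr` follow from simple arithmetic, sketched here in Python. The constants are the approximate figures given above, and `hdr_resolution` is a hypothetical helper that reads "N significant digits" as a resolution of value / 10^N; it is a rough bound, not the HDR Histogram implementation.

```python
BYTES_PER_NODE = 32         # approximate per-node size quoted above
NODES_PER_COMPRESSION = 20  # node limit is compression * 20

def tdigest_worst_case_bytes(compression=100):
    """Worst-case TDigest size in bytes for a given compression value."""
    return compression * NODES_PER_COMPRESSION * BYTES_PER_NODE

def hdr_resolution(value, significant_digits):
    """Rough resolution kept at `value`, in the same unit as `value`."""
    return value / 10 ** significant_digits

print(tdigest_worst_case_bytes())        # 64000 bytes, about 64 KB
print(tdigest_worst_case_bytes(200))     # 128000 bytes with compression 200
print(hdr_resolution(3_600_000_000, 3))  # 3600000.0 microseconds = 3.6 s
print(hdr_resolution(1_000, 3))          # 1.0 microsecond at 1 millisecond
```

Doubling `compression` doubles the worst-case memory, which is the cost of the extra accuracy.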
Compute percentiles of taxful_total_price (using the TDigest algorithm):
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentiles_taxful_total_price": {
      "percentiles": {
        "field": "taxful_total_price",
        "percents": [1, 5, 25, 50, 75, 95, 99],
        "tdigest": { "compression": 200 }
      }
    }
  }
}
The output is as follows:
{
  "took" : 58,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "percentiles_taxful_total_price" : {
      "values" : {
        "1.0" : 21.984375,
        "5.0" : 27.984375,
        "25.0" : 44.96875,
        "50.0" : 63.96875,
        "75.0" : 93.0,
        "95.0" : 156.0,
        "99.0" : 222.0
      }
    }
  }
}
The same request using an HDR Histogram:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentiles_taxful_total_price": {
      "percentiles": {
        "field": "taxful_total_price",
        "percents": [1, 5, 25, 50, 75, 95, 99],
        "hdr": { "number_of_significant_value_digits": 3 }
      }
    }
  }
}
The percentiles aggregation also accepts a script:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentiles_taxful_total_price": {
      "percentiles": {
        "script": {
          "lang": "painless",
          "source": "doc['taxful_total_price'].value / params.timeUnit",
          "params": { "timeUnit": 1000 }
        }
      }
    }
  }
}
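percentile_ranks
Pronounced [pərˈsentaɪl]; the percentile ranks aggregation.
A multi-value metrics aggregation that computes one or more percentile ranks over values taken from a numeric field, a histogram field, or a script.
A percentile rank is the percentage of observed values that fall below a given value. For example, if a value is greater than or equal to 95% of the observed values, it is at the 95th percentile rank.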
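The definition can be written out exactly for a small in-memory sample. This is a sketch with made-up prices and a hypothetical `percentile_rank` helper; Elasticsearch computes approximate ranks with TDigest or HDR rather than scanning every value.

```python
def percentile_rank(values, threshold):
    """Percentage of `values` that are less than or equal to `threshold`."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# Made-up order totals, for illustration only.
prices = [20, 35, 50, 80, 95, 120, 150, 180, 210, 400]

print(percentile_rank(prices, 100))  # 50.0: half of the orders are <= 100
print(percentile_rank(prices, 200))  # 80.0
```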
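- field: the field to aggregate on.
- values: the values whose percentile ranks are computed.
- keyed: defaults to true, which returns the ranks as key-value pairs; if set to false, they are returned as an array.
- hdr: computes percentile ranks with an HDR Histogram (High Dynamic Range Histogram). It can be faster than TDigest but uses more memory. It maintains a fixed worst-case percentage error, specified as a number of significant digits. For example, if the histogram records values from 1 microsecond to 1 hour (3,600,000,000 microseconds) with 3 significant digits, it keeps a resolution of 1 microsecond for values up to 1 millisecond and a resolution of 3.6 seconds (or better) for the maximum tracked value (1 hour).
  - number_of_significant_value_digits: the number of significant digits. Must not be negative.
- missing: a default value to use for documents that have no value for the field. By default such documents are ignored.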
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentile_ranks_total_price": {
      "percentile_ranks": {
        "field": "taxful_total_price",
        "values": [100, 200]
      }
    }
  }
}
The output is as follows:
{
  "took" : 22,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "percentile_ranks_total_price" : {
      "values" : {
        "100.0" : 79.31550802139039,
        "200.0" : 98.43850267379679
      }
    }
  }
}
The percentile_ranks aggregation also accepts a script:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentile_ranks_taxful_total_price": {
      "percentile_ranks": {
        "values": [90, 100],
        "script": {
          "lang": "painless",
          "source": "doc['taxful_total_price'].value / params.timeUnit",
          "params": { "timeUnit": 10 }
        }
      }
    }
  }
}
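cardinality
Pronounced [kɑːdɪ'nælɪtɪ]; the cardinality aggregation.
A single-value metrics aggregation that computes an approximate count of distinct values. It is based on the HyperLogLog++ algorithm.
- field: the field to aggregate on.
- precision_threshold: trades memory for accuracy. The default is 3000 and the maximum is 40000. Below the threshold, the distinct counts are expected to be close to accurate; above it, they carry some error.
- missing: a default value to use for documents that have no value for the field. By default such documents are ignored.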
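What cardinality estimates is the number of distinct values, as opposed to the number of values. The Python sketch below computes it exactly with a set on made-up customer ids; an exact set needs memory proportional to the number of distinct values, which is the cost an approximate algorithm such as HyperLogLog++ avoids.

```python
# Made-up customer ids, one per order, for illustration only.
customer_ids = ["27", "52", "27", "13", "52", "27"]

number_of_values = len(customer_ids)        # every value counts
exact_cardinality = len(set(customer_ids))  # each distinct value counts once

print(number_of_values)   # 6
print(exact_cardinality)  # 3
```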
Count the number of distinct customers:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "cardinality_customer_id": {
      "cardinality": {
        "field": "customer_id",
        "precision_threshold": 3000
      }
    }
  }
}
The output is as follows:
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "cardinality_customer_id" : { "value" : 46 }
  }
}
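value_count
The value count aggregation. A single-value metrics aggregation that counts the number of values extracted from the aggregated documents.
- field: the field to aggregate on.
Count how many products customers bought:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "value_count_products_id": { "value_count": { "field": "products._id.keyword" } }
  }
}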
For a field of type histogram, value_count returns the sum of all elements in the counts arrays.
PUT metrics_index
{
  "mappings": {
    "properties": {
      "network.name": { "type": "keyword" },
      "latency_histo": { "type": "histogram" }
    }
  }
}
PUT metrics_index/_doc/1
{
  "network.name" : "net-1",
  "latency_histo" : {
    "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
    "counts" : [3, 7, 23, 12, 6]
  }
}
PUT metrics_index/_doc/2
{
  "network.name" : "net-2",
  "latency_histo" : {
    "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
    "counts" : [8, 17, 8, 7, 6]
  }
}
GET /metrics_index/_search?size=0
{
  "aggs": {
    "total_requests": { "value_count": { "field": "latency_histo" } }
  }
}
The output is as follows:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "total_requests" : { "value" : 97 }
  }
}
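The value 97 can be reproduced by summing the counts arrays of the two indexed documents. A short Python check, with the documents copied from the requests above:

```python
# latency_histo of the two documents indexed into metrics_index.
histograms = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [8, 17, 8, 7, 6]},
]

# value_count on a histogram field adds up every element of `counts`.
total_requests = sum(sum(h["counts"]) for h in histograms)
print(total_requests)  # 97
```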
string_stats
The string stats aggregation. A multi-value metrics aggregation that works only on keyword data.
Compute statistics over customer names:
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "string_stats_customer_name": { "string_stats": { "field": "customer_full_name.keyword" } }
  }
}
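The output is as follows:
{
  "took" : 81,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "string_stats_customer_name" : {
      "count" : 4675,
      "min_length" : 7,
      "max_length" : 25,
      "avg_length" : 13.309304812834224,
      "entropy" : 4.773147238719484
    }
  }
}
- count: the number of non-empty values.
- min_length: the length of the shortest value.
- max_length: the length of the longest value.
- avg_length: the average length.
- entropy: the Shannon entropy of the values. It is a useful measure of broad properties of a data set such as diversity, similarity, and randomness.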
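The five metrics can be sketched in Python as follows. This assumes the entropy is the Shannon entropy in bits over the characters of all collected strings; the `string_stats` function and the sample terms are hypothetical, and details such as how Elasticsearch measures string length may differ.

```python
import math
from collections import Counter

def string_stats(terms):
    """Length statistics and character-level Shannon entropy of `terms`."""
    if not terms:
        raise ValueError("terms must not be empty")
    lengths = [len(t) for t in terms]
    char_counts = Counter("".join(terms))
    total_chars = sum(char_counts.values())
    # Shannon entropy in bits: -sum(p * log2(p)) over character frequencies.
    entropy = -sum(
        (n / total_chars) * math.log2(n / total_chars)
        for n in char_counts.values()
    )
    return {
        "count": len(terms),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "entropy": entropy,
    }

# Two made-up terms built from two equally frequent characters.
stats = string_stats(["aabb", "ab"])
print(stats)
```

In this sample the characters a and b are equally frequent, so the entropy is 1 bit; a field whose values use many different characters evenly has a higher entropy.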
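top_hits
The top hits aggregation. It tracks the most relevant documents being aggregated and is usually used as a sub-aggregation, so that the best-matching documents are returned for each bucket. In short, under a bucket aggregation, top_hits returns the top documents of each group.
- size: the maximum number of documents to return per bucket.
- sort: the field and order by which the documents are sorted.
- _source: which fields of the documents are returned.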
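The combination of bucketing, sorting, truncating, and source filtering can be sketched in Python. The `top_hits_per_bucket` helper and the sample orders are hypothetical and only mirror the roles of `size`, `sort`, and `_source` described above.

```python
from collections import defaultdict

def top_hits_per_bucket(docs, bucket_field, sort_field, size, source_fields):
    """Group docs by `bucket_field`, keep the `size` newest docs per group."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc)
    result = {}
    for key, members in buckets.items():
        # sort: descending on the sort field
        ranked = sorted(members, key=lambda d: d[sort_field], reverse=True)
        # size + _source: truncate, then keep only the requested fields
        result[key] = [
            {f: d[f] for f in source_fields if f in d} for d in ranked[:size]
        ]
    return result

# Made-up orders; ISO dates sort correctly as strings.
orders = [
    {"customer_id": "27", "order_date": "2022-08-01", "total": 10.0},
    {"customer_id": "27", "order_date": "2022-08-13", "total": 79.0},
    {"customer_id": "27", "order_date": "2022-08-05", "total": 42.0},
    {"customer_id": "52", "order_date": "2022-08-13", "total": 75.0},
]

top = top_hits_per_bucket(orders, "customer_id", "order_date", 2,
                          ["order_date", "total"])
print(top)
```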
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "aggs_customer_id": {
      "terms": { "field": "customer_id", "size": 10 },
      "aggs": {
        "top_hits": {
          "top_hits": {
            "size": 2,
            "_source": { "includes": ["customer_id", "order_date", "products"] },
            "sort": [ { "order_date": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}
top_hits can also be used under a nested aggregation. The following example indexes a document with nested comments and returns the top hit for each comment author:
PUT sales
{
  "mappings": {
    "properties": {
      "tags": { "type": "keyword" },
      "comments": {
        "type": "nested",
        "properties": {
          "username": { "type": "keyword" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}
PUT sales/_doc/1?refresh
{
  "tags": ["car", "auto"],
  "comments": [
    { "username": "baddriver007", "comment": "This car could have better brakes" },
    { "username": "dr_who", "comment": "Where's the autopilot? Can't find it" },
    { "username": "ilovemotorbikes", "comment": "This car has two extra wheels" }
  ]
}
GET sales/_search
{
  "query": { "term": { "tags": "car" } },
  "aggs": {
    "by_sale": {
      "nested": { "path": "comments" },
      "aggs": {
        "by_user": {
          "terms": { "field": "comments.username", "size": 1 },
          "aggs": {
            "by_nested": { "top_hits": {} }
          }
        }
      }
    }
  }
}
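top_metrics
The top metrics aggregation. It sorts documents by a chosen field and order, and returns metric values from the documents at the top of that ordering.
- metrics: the field or fields whose values are returned from the top documents.
- sort: the field and order used to rank the documents.
- size: the number of top documents to return metrics for. Defaults to 1. The index-level limit is 10 and can be changed with index.top_metrics_max_size.
e.g. bucket by customer, sort each bucket by order date in descending order, and return the total amount of the first order in that ordering, that is, the customer's most recent order.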
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "aggs_customer_id": {
      "terms": { "field": "customer_id", "size": 2 },
      "aggs": {
        "top_metrics_total_price": {
          "top_metrics": {
            "metrics": { "field": "taxful_total_price" },
            "sort": { "order_date": "desc" },
            "size": 1
          }
        }
      }
    }
  }
}
The output is as follows:
{
  "took" : 14,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4675, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "aggs_customer_id" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 4139,
      "buckets" : [
        {
          "key" : "27",
          "doc_count" : 348,
          "top_metrics_total_price" : {
            "top" : [
              {
                "sort" : [ "2022-08-13T18:38:53.000Z" ],
                "metrics" : { "taxful_total_price" : 79.0 }
              }
            ]
          }
        },
        {
          "key" : "52",
          "doc_count" : 188,
          "top_metrics_total_price" : {
            "top" : [
              {
                "sort" : [ "2022-08-13T21:41:46.000Z" ],
                "metrics" : { "taxful_total_price" : 75.0 }
              }
            ]
          }
        }
      ]
    }
  }
}
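Within one bucket, top_metrics amounts to sorting the documents and reading the metric from the first `size` of them. A Python sketch of that idea follows; the `top_metrics` function is hypothetical, and only the last sample document is taken from the response above (customer 27), the other two are made up.

```python
def top_metrics(docs, metric_field, sort_field, descending=True, size=1):
    """Return the metric of the first `size` docs after sorting."""
    ranked = sorted(docs, key=lambda d: d[sort_field], reverse=descending)
    return [
        {"sort": [d[sort_field]], "metrics": {metric_field: d[metric_field]}}
        for d in ranked[:size]
    ]

# Orders in one customer bucket; the first two are made-up examples.
bucket_27 = [
    {"order_date": "2022-08-01T09:00:00.000Z", "taxful_total_price": 35.5},
    {"order_date": "2022-08-07T12:30:00.000Z", "taxful_total_price": 120.0},
    {"order_date": "2022-08-13T18:38:53.000Z", "taxful_total_price": 79.0},
]

top = top_metrics(bucket_27, "taxful_total_price", "order_date")
print(top)
```

Because the sort is descending on order_date, the returned metric belongs to the most recent order, not to the largest total.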
To change the top_metrics size limit:
PUT kibana_sample_data_ecommerce/_settings
{
  "index.top_metrics_max_size": 100
}