
Elasticsearch Mapping and Analysis

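Contents

Type deprecation

Why

Mapping

Viewing an index's mapping

Creating an index with a mapping

Adding a field with a mapping

Data migration

1. Create a new index with the correct mapping

2. Reindex the data into that index

Analysis (tokenization)

POST _analyze

Custom dictionary

IK analyzer

circuit_breaking_exception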


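Type deprecation

ES 6.x: Type deprecation begins; an index created in 6.x can hold only one type

ES 7.x: Types are deprecated in the APIs but still accepted for backward compatibility

ES 8.x: Types are removed completely

Once types are gone, each index holds a single kind of document.

If you need to distinguish different kinds of documents, there are two options:

  • Create a separate index for each kind.
  • Add a custom field to the documents that records their kind.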
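
The second option can be sketched in a few lines of Python. This is only an illustration: the field name doc_type and the helper names are assumptions, not part of Elasticsearch, and the field would need to be mapped as keyword for the term filter to match exactly.

```python
import json

def tag_document(doc, kind):
    """Return a copy of doc carrying a custom 'doc_type' field (hypothetical name)."""
    tagged = dict(doc)
    tagged["doc_type"] = kind
    return tagged

def search_by_kind(kind, text_field, text):
    """Build a search body: full-text match restricted to one kind of document."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],
                "filter": [{"term": {"doc_type": kind}}],
            }
        }
    }

account = tag_document({"account_number": 1, "address": "mill lane"}, "account")
body = search_by_kind("account", "address", "mill lane")
print(json.dumps(account))
print(json.dumps(body, indent=2))
```

The printed body can be pasted after GET /{indexName}/_search in the console.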

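Why

Elasticsearch's underlying storage engine (Lucene) is organized by index, not by type.

Within one index, documents of different types could declare fields with the same name but different data types. Such field type conflicts lead to inconsistent data and query errors.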

GET /bank/_search
{
  "query": {"match": {"address": "mill lane"}},
  "_source": ["account_number", "address"]
}

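As the query shows, a search targets an index and does not name a type. If several types each defined their own address field, the query would run into a conflict.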
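
To make the conflict concrete, here is a small Python sketch that checks whether several types in one index declare the same field with different data types. The second type (branch) and the function name are made up for illustration; this is not how Elasticsearch itself detects the problem.

```python
def find_field_conflicts(type_mappings):
    """type_mappings: {type_name: {field_name: field_type}}.

    Returns {field_name: {field_type: [type_names]}} for every field that is
    declared with more than one data type across the given types.
    """
    seen = {}
    for type_name, fields in type_mappings.items():
        for field, field_type in fields.items():
            seen.setdefault(field, {}).setdefault(field_type, []).append(type_name)
    return {field: by_type for field, by_type in seen.items() if len(by_type) > 1}

mappings = {
    "account": {"address": "text", "age": "long"},
    "branch": {"address": "keyword", "city": "keyword"},  # hypothetical second type
}
conflicts = find_field_conflicts(mappings)
print(conflicts)
```

Here address is reported because one type wants it analyzed as text while the other wants an exact keyword, and a single Lucene index cannot hold both under one field name.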


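Mapping

A mapping defines how documents and their fields are stored and searched.

A mapping is the mechanism Elasticsearch uses to define document structure and field types. It is similar to a table schema in a relational database: it describes which fields a document contains, each field's data type (text, numeric, date, and so on), and other field attributes (such as whether the field is analyzed or indexed).

It is one of Elasticsearch's core concepts.

Viewing an index's mapping

Use the _mapping endpoint: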

GET /bank/_mapping

Response

{
  "bank" : {
    "mappings" : {
      "properties" : {
        "account_number" : {"type" : "long"},
        "address" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "age" : {"type" : "long"},
        "balance" : {"type" : "long"},
        "city" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "email" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "employer" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "firstname" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "gender" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "lastname" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}},
        "state" : {"type" : "text", "fields" : {"keyword" : {"type" : "keyword", "ignore_above" : 256}}}
      }
    }
  }
}
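  • A text field can carry a sub-field named keyword, of type keyword. The keyword sub-field stores the exact, unanalyzed value, which is what exact matching, sorting, and aggregations need.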
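
The difference between the two views of the same field shows up in the queries you write. The sketch below builds a full-text match query against the text field and an exact term query against its keyword sub-field; the helper names are illustrative only.

```python
import json

def match_query(field, text):
    """Full-text search: the text is analyzed and matched against the text field."""
    return {"query": {"match": {field: text}}}

def exact_query(field, value, subfield="keyword"):
    """Exact search: the value is compared as-is against the keyword sub-field."""
    return {"query": {"term": {f"{field}.{subfield}": value}}}

full_text = match_query("address", "mill lane")
exact = exact_query("address", "990 Mill Road")  # sample value, assumed for illustration

print(json.dumps(full_text))
print(json.dumps(exact))
```

With the bank mapping above, the first body matches documents whose address contains mill or lane, while the second matches only documents whose address is exactly the given string.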

Creating an index with a mapping

PUT /{indexName}

PUT /my_index
{
  "mappings": {
    "properties": {
      "account_number": {"type": "long"},
      "address": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
      "city": {"type": "keyword"}
    }
  }
}
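
When a mapping has many fields, writing the body by hand gets repetitive. As a rough sketch (the helper names are my own, not an Elasticsearch client API), the same request body can be assembled in Python from a compact field list:

```python
import json

def text_with_keyword(ignore_above=256):
    """A text field with a keyword sub-field, as in the mapping above."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": ignore_above}},
    }

def build_index_body(fields):
    """fields: {name: type string, or a full field definition dict}."""
    properties = {}
    for name, spec in fields.items():
        properties[name] = {"type": spec} if isinstance(spec, str) else spec
    return {"mappings": {"properties": properties}}

body = build_index_body({
    "account_number": "long",
    "address": text_with_keyword(),
    "city": "keyword",
})
print(json.dumps(body, indent=2))
```

The printed JSON is the body for PUT /my_index.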

Adding a field with a mapping

  • PUT /{indexName}/_mapping, with the new fields under properties in the request body
PUT /my_index/_mapping
{
  "properties": {
    "state": {"type": "keyword", "index": false}
  }
}
  • "index": false means the field is not indexed, so it cannot be searched on; it is still returned in _source. The default is true.
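
Outside the console, the same call is an ordinary HTTP PUT. The sketch below only builds the request with Python's standard library and prints it. The base URL http://localhost:9200 is an assumption about where the cluster listens, and the helper name is illustrative.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local single-node cluster

def add_field_request(index, field, definition, base_url=ES_URL):
    """Build (but do not send) a PUT /{index}/_mapping request adding one field."""
    body = json.dumps({"properties": {field: definition}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_mapping",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = add_field_request("my_index", "state", {"type": "keyword", "index": False})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```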

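Data migration

Elasticsearch does not allow changing the type of a field that already exists in a mapping. To change an existing field's mapping, create a new index and migrate the data into it.

1. Create a new index with the correct mapping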

PUT /my_bank
{
  "mappings": {
    "properties": {
      "account_number": {"type": "long"},
      "address": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
      "age": {"type": "integer"},
      "balance": {"type": "long"},
      "city": {"type": "keyword"},
      "email": {"type": "keyword"},
      "employer": {"type": "keyword"},
      "firstname": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
      "gender": {"type": "keyword"},
      "lastname": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
      "state": {"type": "keyword"}
    }
  }
}
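
Before migrating, it helps to see exactly which fields differ between the old and new mappings. The following Python sketch compares the top-level field types of bank (from the GET /bank/_mapping response) and my_bank; the function name is illustrative and sub-fields are ignored for brevity.

```python
def changed_fields(old_types, new_types):
    """Return {field: (old_type, new_type)} for fields whose type differs."""
    return {
        field: (old_types[field], new_types[field])
        for field in old_types
        if field in new_types and old_types[field] != new_types[field]
    }

bank = {
    "account_number": "long", "address": "text", "age": "long",
    "balance": "long", "city": "text", "email": "text", "employer": "text",
    "firstname": "text", "gender": "text", "lastname": "text", "state": "text",
}
my_bank = {
    "account_number": "long", "address": "text", "age": "integer",
    "balance": "long", "city": "keyword", "email": "keyword",
    "employer": "keyword", "firstname": "text", "gender": "keyword",
    "lastname": "text", "state": "keyword",
}

diff = changed_fields(bank, my_bank)
for field in sorted(diff):
    print(field, diff[field])
```

Every field listed is one that could not have been changed in place, which is why a new index is needed.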

2. Reindex the data into that index

POST _reindex
{
  "source": {"index": "bank", "type": "account"},
  "dest": {"index": "my_bank"}
}
  • The type parameter in source is deprecated as of ES 7.0 and removed in ES 8.0; on 8.x, leave it out.
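
A small helper can produce the reindex body for either case, with or without the legacy type. This is a sketch with an illustrative function name; only the body structure comes from the request above.

```python
import json

def reindex_body(source_index, dest_index, doc_type=None):
    """Build a _reindex body; include 'type' only for legacy typed indices."""
    source = {"index": source_index}
    if doc_type is not None:
        source["type"] = doc_type  # accepted on 7.x with a warning, rejected on 8.x
    return {"source": source, "dest": {"index": dest_index}}

legacy = reindex_body("bank", "my_bank", doc_type="account")
typeless = reindex_body("bank", "my_bank")
print(json.dumps(legacy))
print(json.dumps(typeless))
```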


Analysis (tokenization)

Analysis splits text into individual terms (tokens).

POST _analyze

Standard analyzer

POST _analyze
{
  "analyzer": "standard",
  "text": ["it's test data", "hello world"]
}

Response

{
  "tokens" : [
    {"token" : "it's", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0},
    {"token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1},
    {"token" : "data", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2},
    {"token" : "hello", "start_offset" : 15, "end_offset" : 20, "type" : "<ALPHANUM>", "position" : 3},
    {"token" : "world", "start_offset" : 21, "end_offset" : 26, "type" : "<ALPHANUM>", "position" : 4}
  ]
}
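
To get a feel for what the token fields mean, here is a rough Python approximation of the response above. It is not the standard analyzer (which follows Unicode text segmentation rules); it uses a simple regular expression, lowercases each token, and assumes an offset gap of one character between array entries, which is what the offsets in the response suggest.

```python
import re

# Words made of letters/digits, optionally joined by apostrophes (e.g. "it's").
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")

def analyze(texts):
    """Approximate tokenization with offsets and positions across several texts."""
    tokens = []
    base = 0  # offset of the current text within the concatenated input
    for text in texts:
        for match in TOKEN_RE.finditer(text):
            tokens.append({
                "token": match.group().lower(),
                "start_offset": base + match.start(),
                "end_offset": base + match.end(),
                "position": len(tokens),
            })
        base += len(text) + 1  # assumed gap of 1 between array entries
    return tokens

tokens = analyze(["it's test data", "hello world"])
for t in tokens:
    print(t)
```

start_offset and end_offset are character positions in the input, while position is the token's ordinal, which phrase queries rely on.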

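Custom dictionary

Under the nginx/html directory, create es/term.text and add the dictionary entries, one per line.

Configure the IK remote dictionary in /elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml.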
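
The remote dictionary is set through the remote_ext_dict entry of that file. As a hedged sketch, the Python below sets the entry on a simplified copy of the configuration; the sample XML omits the XML declaration and DOCTYPE of the real file, and the URL http://nginx-host/es/term.text is a placeholder for wherever your nginx serves the file.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for IKAnalyzer.cfg.xml (the real file also has an XML
# declaration and a DOCTYPE, which this sketch leaves out).
SAMPLE_CFG = """<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict"></entry>
    <entry key="ext_stopwords"></entry>
</properties>"""

def set_remote_dict(cfg_xml, url):
    """Return the config XML with the remote_ext_dict entry pointing at url."""
    root = ET.fromstring(cfg_xml)
    entry = root.find("entry[@key='remote_ext_dict']")
    if entry is None:
        entry = ET.SubElement(root, "entry", {"key": "remote_ext_dict"})
    entry.text = url
    return ET.tostring(root, encoding="unicode")

updated = set_remote_dict(SAMPLE_CFG, "http://nginx-host/es/term.text")  # placeholder URL
print(updated)
```

In practice most people edit the file by hand; the point is only that a single entry with key remote_ext_dict carries the dictionary URL. Restart Elasticsearch after changing the file.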

Test

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "尚硅谷项目谷粒商城"
}

尚硅谷 and 谷粒商城 are entries in the term.text dictionary.

Response

{
  "tokens" : [
    {"token" : "尚硅谷", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0},
    {"token" : "项目", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 1},
    {"token" : "谷粒商城", "start_offset" : 5, "end_offset" : 9, "type" : "CN_WORD", "position" : 2}
  ]
}
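
Why a dictionary changes the result can be shown with a toy segmenter. The Python below uses greedy forward longest-match, which is a deliberately simplified stand-in and not IK's actual algorithm; the base dictionary is invented for the example.

```python
def segment(text, dictionary):
    """Greedy forward longest-match; unknown characters become one-char tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

base = {"项目", "商城"}                  # hypothetical built-in words
custom = base | {"尚硅谷", "谷粒商城"}    # plus the entries from term.text
text = "尚硅谷项目谷粒商城"

print(segment(text, base))
print(segment(text, custom))
```

With only the base words, the names fall apart into single characters; once the custom entries are present, the segmenter keeps them whole, matching the tokens in the response above.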

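IK analyzer

A Chinese-language analyzer plugin.

GitHub repository

https://github.com/infinilabs/analysis-ik

Installation

bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/7.4.2

Run the install command from inside the Elasticsearch Docker container, then restart the container. The plugin version must match the Elasticsearch version (7.4.2 here).

Uninstalling the plugin

elasticsearch-plugin remove analysis-ik

Test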

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我要成为java高手"
}

Response

{
  "tokens" : [
    {"token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0},
    {"token" : "要", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1},
    {"token" : "成为", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2},
    {"token" : "java", "start_offset" : 4, "end_offset" : 8, "type" : "ENGLISH", "position" : 3},
    {"token" : "高手", "start_offset" : 8, "end_offset" : 10, "type" : "CN_WORD", "position" : 4}
  ]
}

circuit_breaking_exception

The circuit breaker was triggered: the parent breaker estimated that handling the request would push memory use above its limit, so Elasticsearch rejected the request with HTTP 429.

{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
        "bytes_wanted": 124604192,
        "bytes_limit": 123273216,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
    "bytes_wanted": 124604192,
    "bytes_limit": 123273216,
    "durability": "PERMANENT"
  },
  "status": 429
}
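
The numbers in the error are enough to estimate how tight the heap is. The Python sketch below reads bytes_wanted and bytes_limit from a trimmed copy of the response (the reason strings are left out for brevity) and, assuming the parent breaker limit is 95% of the JVM heap (the 7.x default when real-memory tracking is on), back-computes the heap size.

```python
import json

# Trimmed copy of the error response above; only the numeric fields are kept.
error_response = json.loads("""
{
  "error": {
    "type": "circuit_breaking_exception",
    "bytes_wanted": 124604192,
    "bytes_limit": 123273216,
    "durability": "PERMANENT"
  },
  "status": 429
}
""")

def breaker_summary(response, limit_fraction=0.95):
    """Summarize a circuit_breaking_exception.

    limit_fraction is an assumption: the share of heap the parent breaker allows.
    """
    err = response["error"]
    over = err["bytes_wanted"] - err["bytes_limit"]
    implied_heap = round(err["bytes_limit"] / limit_fraction)
    return {
        "over_limit_bytes": over,
        "implied_heap_mb": implied_heap / 1024 ** 2,
    }

summary = breaker_summary(error_response)
print(summary)
```

Under that assumption the limit corresponds to a heap of roughly 124 MB, so the node is running with a very small heap and even modest requests can trip the breaker.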

Check the Elasticsearch logs

docker logs elasticsearch

Check Elasticsearch memory usage

GET /_cat/nodes?v&h=name,heap.percent,ram.percent
  • If heap.percent or ram.percent is close to 100%, the node is short on memory.
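
The _cat output is plain whitespace-separated text with a header row (because of ?v), so it is easy to check from a script. In the sketch below the sample output, the node name, and the 90% threshold are all made up for illustration, and node names are assumed to contain no spaces.

```python
SAMPLE_OUTPUT = """\
name   heap.percent ram.percent
node-1           96          89
"""  # hypothetical output, not taken from a real cluster

def parse_cat_nodes(text):
    """Parse `_cat/nodes?v` text into a list of {column: value} dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

def nodes_under_pressure(rows, threshold=90):
    """Names of nodes whose heap or RAM usage is at or above the threshold."""
    return [
        row["name"]
        for row in rows
        if int(row["heap.percent"]) >= threshold or int(row["ram.percent"]) >= threshold
    ]

rows = parse_cat_nodes(SAMPLE_OUTPUT)
print(rows)
print(nodes_under_pressure(rows))
```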

Increase the Elasticsearch heap

Remove the container and recreate it with adjusted -Xms and -Xmx values (here the maximum heap is raised to 256m):

docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms64m -Xmx256m" \
  -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.4.2
