
5. Installing the IK Analyzer

news 2025/9/16 9:39:34

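When Elasticsearch (es) builds an inverted index, it has to tokenize each document.

At search time it also has to tokenize the user's input. The default tokenization rules, however, handle Chinese poorly.

With the english analyzer, every Chinese character becomes its own token, while an English word such as java is kept as a single token.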

POST /_analyze
{
  "text": "我住在北京这个大城市学习java",
  "analyzer": "english"
}

The chinese analyzer behaves the same way: every Chinese character becomes its own token.

POST /_analyze
{
  "text": "我住在北京这个大城市学习java",
  "analyzer": "chinese"
}


The standard analyzer is no different: every Chinese character becomes its own token.

POST /_analyze
{
  "text": "我住在北京这个大城市学习java",
  "analyzer": "standard"
}

All three share the same problem: they split Chinese text into single characters instead of into words.
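The three requests above differ only in the analyzer name, so they can be issued from a small script and compared side by side. The following is a minimal sketch using only the Python standard library; the base URL http://localhost:9200, the absence of authentication, and the helper names are assumptions, not part of the original setup.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable at this address without authentication.
ES_URL = "http://localhost:9200"


def build_analyze_body(text, analyzer):
    """Return the JSON body for POST /_analyze."""
    return {"text": text, "analyzer": analyzer}


def analyze(text, analyzer, base_url=ES_URL):
    """Send the text to /_analyze and return the list of token strings."""
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=json.dumps(build_analyze_body(text, analyzer)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [token["token"] for token in result["tokens"]]
```

Calling analyze("我住在北京这个大城市学习java", "standard") against a running node returns the token strings only, which makes the one-character-per-token behaviour easy to see.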

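To tokenize Chinese properly, use the IK analyzer.

It provides two analyzers: ik_smart and ik_max_word.

ik_smart does the coarsest split and produces fewer tokens.

ik_max_word does the finest split and produces more tokens.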
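One common way to combine the two analyzers is to index with ik_max_word and search with ik_smart. The sketch below only builds the index-creation body as a Python dict; the field name content is hypothetical, and this pairing is a usage suggestion rather than something the installation steps here require.

```python
import json


def build_ik_mapping(field_name):
    """Build an index-creation body that uses IK for one text field."""
    return {
        "mappings": {
            "properties": {
                field_name: {
                    "type": "text",
                    # Fine-grained tokens at index time.
                    "analyzer": "ik_max_word",
                    # Coarse-grained tokens for the search input.
                    "search_analyzer": "ik_smart",
                }
            }
        }
    }


# "content" is a hypothetical field name used only for illustration.
print(json.dumps(build_ik_mapping("content"), indent=2))
```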

Download link (Xunlei can fetch it directly):

https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.12.1/elasticsearch-analysis-ik-7.12.1.zip

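After downloading, unzip the archive and name the resulting folder ik. (The folder must be named ik here; otherwise the es Docker container reports an error on restart.)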
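The unzip-and-rename step can be done in one go by extracting the archive straight into a folder called ik. This is a minimal sketch with the standard library; the function name and the choice of target directory are illustrative assumptions.

```python
import zipfile
from pathlib import Path


def extract_plugin(zip_path, parent_dir):
    """Unpack the plugin archive into <parent_dir>/ik and return that path."""
    target = Path(parent_dir) / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Example call (not executed here): creates ./ik next to the downloaded zip.
# extract_plugin("elasticsearch-analysis-ik-7.12.1.zip", ".")
```

The resulting ik folder is what gets uploaded to the server in the next step.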

Then upload the ik folder to the CentOS 7 server and place it in the plugin directory used by the Docker container:

/home/xiankejin/es-plugins/

Restart the es Docker container.
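After the restart, one way to confirm that the plugin was picked up is to list the installed plugins through GET /_cat/plugins?format=json. The sketch below assumes that the IK plugin is listed under the component name analysis-ik and that the node is reachable at whatever base URL is passed in; both are assumptions to verify against the actual environment.

```python
import json
import urllib.request


def has_ik_plugin(rows):
    """Return True if any row of the plugin listing names the IK plugin."""
    # Assumption: the plugin reports itself as "analysis-ik".
    return any(row.get("component") == "analysis-ik" for row in rows)


def fetch_plugin_rows(base_url):
    """Fetch GET /_cat/plugins?format=json as a list of dicts."""
    url = base_url + "/_cat/plugins?format=json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Example call (not executed here; the URL is an assumption):
# print(has_ik_plugin(fetch_plugin_rows("http://localhost:9200")))
```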

Test the result:

POST /_analyze
{
  "text": "我住在北京这个大城市学习java",
  "analyzer": "ik_smart"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "住在",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "北京",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "这个",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "大城市",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "学习",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "java",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "ENGLISH",
      "position" : 6
    }
  ]
}
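Each token in the response above carries start_offset and end_offset values that point back into the original sentence. As a small sketch, the following parses a trimmed copy of that response (first three tokens only) and checks that slicing the sentence at those offsets reproduces each token; the helper names are illustrative.

```python
import json

# First three tokens of the ik_smart response above, trimmed for brevity.
RESPONSE = """
{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "住在", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1},
    {"token": "北京", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2}
  ]
}
"""


def token_spans(response_text):
    """Return (token, start_offset, end_offset) for every token in the response."""
    data = json.loads(response_text)
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in data["tokens"]]


def offsets_match(sentence, spans):
    """Check that each token equals the slice of the sentence its offsets point at."""
    return all(sentence[start:end] == token for token, start, end in spans)


spans = token_spans(RESPONSE)
print(spans)
print(offsets_match("我住在北京这个大城市学习java", spans))
```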

POST /_analyze
{
  "text": "我住在北京这个大城市学习java",
  "analyzer": "ik_max_word"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "住在",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "北京",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "这个",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "个大",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "大城市",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "大城",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "城市学",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "城市",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "学习",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "java",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "ENGLISH",
      "position" : 10
    }
  ]
}
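To see exactly what the finer split adds, the two token lists from the responses above can be compared directly. This short sketch copies the token strings from both outputs and lists the ones that only ik_max_word produced; the function name is illustrative.

```python
# Token strings copied from the ik_smart and ik_max_word responses above.
SMART = ["我", "住在", "北京", "这个", "大城市", "学习", "java"]
MAX_WORD = ["我", "住在", "北京", "这个", "个大", "大城市", "大城",
            "城市学", "城市", "学习", "java"]


def extra_tokens(coarse, fine):
    """Return tokens that appear in the fine list but not in the coarse one."""
    seen = set(coarse)
    return [token for token in fine if token not in seen]


print(extra_tokens(SMART, MAX_WORD))
# prints ['个大', '大城', '城市学', '城市']
```

For this sentence, every ik_smart token also appears in the ik_max_word output; the additional tokens are overlapping sub-words and neighbouring combinations around 大城市.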