[筆記] ElasticSearch Analyzer
分析器隨筆
flowchart LR
A[text] --> F[character filter]
subgraph Analyzer
F --> G[tokenizer]
G --> H[token filter]
end
H --> J[Index]
組成
- character filter: 主要做篩選與過濾 e.g. 將HTML字元濾掉
- tokenizer: 主要將字串進行分詞 e.g. “It’s a cool guy” » “it’s” “a” “cool” “guy”
- token filter: 可將拆出的分詞進行 增加,刪除,修改 e.g. 大小寫轉換
觸發時機
- 查詢時
- 寫入時
使用分析器
PUT user
## 範例index
Index 特定field
PUT user/_mapping
{
"properties": {
"name": {
"type": "text",
"index": true,
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
},
"sex": {
"type":"keyword",
"index": true
},
"tel": {
"type": "keyword",
"index": false
}
}
}
Index 全局預設
POST /user/_close
PUT user/_settings
{
"analysis": {
"analyzer": {
"default": {
"type": "ik_max_word"
},
"default_search":{
"type": "ik_max_word"
}
}
}
}
POST /user/_open
檢視Index當前設定值
GET /user/_settings
return
{
"user" : {
"settings" : {
"index" : {
"routing" : {
"allocation" : {
"include" : {
"_tier_preference" : "data_content"
}
}
},
"number_of_shards" : "1",
"provided_name" : "user",
"creation_date" : "1699429776184",
"analysis" : {
"analyzer" : {
"default_search" : {
"type" : "ik_max_word"
},
"default" : {
"type" : "ik_max_word"
}
}
},
"number_of_replicas" : "1",
"uuid" : "81Jc8eyxSN6uYqQskE7YFg",
"version" : {
"created" : "7170699"
}
}
}
}
}
自訂分析器
自訂分析器, 先將html符號濾掉 再用分析器拆分 最後被用大寫輸出
PUT test_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"custom",
"char_filter":[
"html_strip"
],
"tokenizer":"standard",
"filter":[
"uppercase"
]
}
}
}
},
"mappings":{
"properties":{
"my_text":{
"type":"text",
"analyzer":"my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
}
# character filer 使用 html_strip 去除html
# tokenizer 使用 standard 拆分
# tokenfilter 使用 lowercase 處理輸出分詞
其他內建標準
tokenizer
token_filter
character filter
測試自訂分析器
POST _analyze
{
"char_filter":[
"html_strip"
],
"tokenizer": "standard",
"filter": [
"uppercase"
],
"text": "<title>Hello World</title>"
}
or
POST test_index/_analyze
{
"analyzer": "my_analyzer",
"text": "<title>Hello World</title>"
}
return
{
"tokens" : [
{
"token" : "HELLO",
"start_offset" : 7,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "WORLD",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}