Problems I ran into after starting with ES

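A question from the chat group

A Hive table holds about 1.9 billion rows (a small number of ids are duplicated). After writing it to ES with id as the ES _id, the document count in ES is tens of thousands higher than the Hive count after deduplicating by id.

Try a force merge and then run the count again; the number you saw may include several versions of the same document.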

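ES tips

Dynamic mapping

Dynamic mapping creates field mappings automatically. The catch: if you map a field by hand when creating the index but the documents you write use a different field name, dynamic mapping adds a second field. Elasticsearch does not rename fields itself; most likely the client serializer converted camelCase to snake_case before writing (fastjson's SnakeCase strategy does exactly that), so both fields end up in the mapping:

"brand_id" : { // added by dynamic mapping; the field was mapped by hand as brandId
 	"type" : "long"
 },
"brandId" : { 
 	"type" : "keyword"
 }

So it is best to use snake_case consistently: define the mapping in snake_case and write snake_case data.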

Helpful for understanding ES compound queries

https://cloud.tencent.com/developer/article/1689238

DSL for filtering shop products by parameters

Full-text search combined with filters

GET /es_idx_item/_search/
{
  "from": 0,
  "size": 20, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name_en": "Shoes"
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must":[
                {"term":{"category_id":"7038"}},
                {"term":{"brand_id":"123534847"}}
              ]
          }
        }
      ]
    }
  }
}

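What is the difference between CreateIndexRequest and IndexRequest?

The former creates and configures an index; the latter indexes a document into an index, which makes the document searchable.

How do I find out which branch a git branch was created from?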

git reflog show branch_name

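How should queries be set up for multilingual search?

On ES 7.15, mixed-language text was tokenized without configuring an analyzer in the mapping. That is the default standard analyzer at work: it splits text on Unicode word boundaries, so it copes with most languages out of the box.

Logstash: specifying the config file at startup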

bin/logstash -f conf/conf_file

https://doc.yonyoucloud.com/doc/logstash-best-practice-cn/get_start/hello_world.html

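fastjson property naming strategies

// SnakeCase  -> underscore-separated, e.g. person_id
// CamelCase  -> camel case, e.g. personId
// KebabCase  -> hyphen-separated, e.g. person-id
// PascalCase -> every word capitalized, e.g. PersonId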

https://github.com/alibaba/fastjson/wiki/PropertyNamingStrategy_cn
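
To make the effect of SnakeCase concrete, here is a minimal stdlib-only sketch of a camelCase to snake_case conversion. It is not fastjson's implementation, only an illustration of the renaming; the class name NamingDemo and the method camelToSnake are hypothetical.

```java
final class NamingDemo {
    /** Converts a camelCase identifier to snake_case, e.g. brandId -> brand_id. */
    static String camelToSnake(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c)) {
                // Start a new word: add the separator (except at position 0) and lowercase the letter
                if (i > 0) {
                    sb.append('_');
                }
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(camelToSnake("brandId"));
        System.out.println(camelToSnake("categoryId"));
    }
}
```

If documents are serialized this way, the index mapping has to use the same snake_case names, otherwise dynamic mapping creates extra fields.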

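ES confusions

There are many query types; which one should I pick? And for compound queries, when do I need a bool combination and when is a single query enough?

First: full-text search. This is for full-text scenarios, for example in big-data systems, such as finding content that contains "牙膏黑人". The core query types include, but are not limited to, match, match_phrase and query_string.

Second: exact matching. This is for exact-value scenarios, such as looking up postal codes or phone numbers. The query types include, but are not limited to, terms, term, exists, range and wildcard.

A single query of either kind cannot express combined conditions. For example: body = XXX, title = YYY, publish time between XXX and XXX, and source = Xinhua News. That calls for a compound query, which means bool. The bool syntax combines, among others, must, should, must_not, filter and minimum_should_match.

Important: syncing data to ES with BulkProcessor

Environment problems aside, failures are probably caused by a mismatch between the field type in the mapping and the type of the field in the source data.

Searching, updating and indexing documents can all go through an index alias. Creating an index cannot: you must create it under a concrete index name and then attach the alias.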

public void processorUpdateItem(List<Item> itemList) {
    final List<ItemIndexTemplate> templates = itemList.stream()
            .parallel()
            .map(templateConvert::convert).collect(Collectors.toList());

    templates.forEach(temp -> {
        final String source = JSONObject.toJSONString(temp, config);
        final IndexRequest indexRequest = new IndexRequest(itemIndexAlias)
                .id(String.valueOf(temp.getId()))
                .source(source, XContentType.JSON);
        processor.add(indexRequest);
    });

}

Elasticsearch exception [type=illegal_argument_exception, reason=no write index is defined for alias [item_index_alias]. The write index may be explicitly disabled using is_write_index=false or the alias points to multiple indices without one being designated as a write index]"

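The fix is to designate a write index: when an alias points to several indices and none of them is marked as the write index, any write through the alias fails with the error above.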

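ES data modeling

Fields with the same or similar meaning must share one field name and one field type
Model special-purpose fields separately
Index lifecycle management options
- 6.6+
  - ILM
- 7.9+
  - data stream
How many shards?
- The number of primary shards cannot be changed after index creation (short of shrink, split or reindex)
- Consider data volume and the number of data nodes
refresh_interval
- A refresh moves data from the on-heap index buffer to the off-heap filesystem cache, where it forms a searchable segment
Paging
- search_after
  - paging forward one page at a time
- scroll
  - full export
Ingest pipelines for preprocessing
- similar to the filter stage in Logstash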
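
The search_after style of paging listed above can be sketched without a cluster: each request passes the sort key of the last hit from the previous page, and only later keys come back. This is an in-memory illustration under the assumption of a unique, ascending sort key; SearchAfterSketch and nextPage are hypothetical names, not part of any ES client.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SearchAfterSketch {
    /**
     * Returns the next page from a list sorted ascending by a unique key.
     * after == null requests the first page; otherwise only keys greater
     * than the last key of the previous page are returned.
     */
    static List<Long> nextPage(List<Long> sortedKeys, Long after, int size) {
        return sortedKeys.stream()
                .filter(k -> after == null || k > after)
                .limit(size)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> keys = List.of(1L, 2L, 3L, 4L, 5L);
        List<Long> first = nextPage(keys, null, 2);
        // The last key of the previous page becomes the cursor for the next one
        List<Long> second = nextPage(keys, first.get(first.size() - 1), 2);
        System.out.println(first + " " + second);
    }
}
```

Because the cursor is a sort value rather than an offset, the cost of a page does not grow with its depth, which is why this style suits page-by-page traversal.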

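When an alias points to multiple indices

ES automatically queries the shards of every index behind the alias.
Each of those indices has its own shards (by default 1 primary and 1 replica), and a search through the alias hits all of them.

What is the default sort mode?

For multi-valued fields the default depends on the sort order: min for ascending and max for descending. In the Java client this is the SortMode enum; ScoreMode is a different enum, used for example by nested queries.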

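Another day of being tormented by nested

How do I update a single object inside a nested field?

That is the wrong question: just overwrite the whole document. Elasticsearch reindexes the entire document on every update anyway.

A document can be created through an alias, provided the alias was bound to the index. This works as-is only when the alias points to a single index; with several indices, the one that should receive writes must have is_write_index=true.

There is still a lot to learn

pri: number of primary shards

rep: number of replicas

docs.count: number of documents

docs.deleted: number of deleted documents

store.size: total storage size of all shards, including replicas

pri.store.size: combined size of the primary shards only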

Finally found the supporting evidence

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/ilm-rollover.html#ilm-rollover

To see the current index size, use the _cat indices API. The pri.store.size value shows the combined size of all primary shards.

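Circular dependency caused by Spring constructor injection

Switch the affected dependency to @Autowired field (or setter) injection.

Running an action after all Spring beans have been instantiated

For reading data from the DB, reshaping it and writing it to ES, use SmartInitializingSingleton

For setting up third-party resources such as a cache, @PostConstruct is enough

Querying nested fields together with fields on the parent document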

GET item_index_2021_12_10_102614/_search
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "spu_name_thai": {
              "query": "fruit"
            }
          }
        },
        {
          "match": {
            "spu_name_en": {
              "query": "fruit"
            }
          }
        },
        {
          "match": {
            "brand_name_thai": {
              "query": "fruit"
            }
          }
        },
        {
          "match": {
            "brand_name_en": {
              "query": "fruit"
            }
          }
        }
      ],
      "filter": [
        {
          "nested": {
            "query": {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "skus.active": {
                        "value": true
                      }
                    }
                  },
                  {
                    "term": {
                      "skus.visible": {
                        "value": true
                      }
                    }
                  }
                ],
                "adjust_pure_negative": true
              }
            },
            "path": "skus",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1
          }
        },
        {
          "term": {
            "is_gift": {
              "value": false
            }
          }
        },
        {
          "term": {
            "visible": {
              "value": true
            }
          }
        },
        {
          "term": {
            "active": {
              "value": true
            }
          }
        }
      ]
    }
  }
}

Querying a specific field inside nested documents

POST /item_index2021-12-03-182031/_search
{
  "from": 0,
  "size": 20,
  "query": {
    "nested": {
      "path": "skus",
      "query": {
        "term": {
          "skus.id": {
            "value": "382620383",
            "boost": 1
          }
        }
      },
      "ignore_unmapped": false,
      "score_mode": "none",
      "boost": 1
    }
  }
}

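How can the program better understand what the user wants?

ES from starts at 0

ES worked examples

How to find the most popular clothes of a given brand? (aggregation + filter)

Use an aggregation search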

PUT /shirts/_doc/1?refresh
{
  "brand": "gucci",
  "color": "red",
  "model": "slim"
}
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggs": {
    "models": {
      "terms": { "field": "model" } 
    }
  }
}

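post_filter filters the hits after the aggregations have been computed, so it narrows the hits without affecting the aggregations

For example, keep only red items in hits

GET /shirts/_search
{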
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "brand": "gucci"
          }
        }
      ]
    }
  },
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      }
    },
    "colors_red": {
      "filter": {
        "term": {
          "color": "red"
        }
      },
      "aggs": {
        "models": {
          "terms": {
            "field": "model"
          }
        }
      }
    }
  },
  "post_filter": {
    "term": {
      "color": "red"
    }
  }
}

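Rescoring can improve search precision

It reorders only the top documents (window_size per shard) returned by the query and post_filter phases. Each shard rescores independently, and the node coordinating the search request then sorts the combined results.
If the request specifies an explicit sort other than _score in descending order together with rescore, an error is thrown.

When paging, keep window_size the same from page to page; changing it can shift the top hits between pages.

By default the original query score and the rescore query score are added to produce the final _score; query_weight and rescore_query_weight control the weight of each.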
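
The default combination can be written out as plain arithmetic. This is a sketch of the default score_mode (total) only; RescoreMath and combinedScore are hypothetical names used for illustration, not an ES API.

```java
final class RescoreMath {
    /**
     * Final score under the default score_mode (total):
     * query_weight * original score + rescore_query_weight * rescore score.
     * Both weights default to 1.
     */
    static double combinedScore(double originalScore, double rescoreScore,
                                double queryWeight, double rescoreQueryWeight) {
        return queryWeight * originalScore + rescoreQueryWeight * rescoreScore;
    }

    public static void main(String[] args) {
        // With default weights the two scores are simply added
        System.out.println(combinedScore(2.0, 3.0, 1.0, 1.0));
        // Halving the original query and doubling the rescore query changes the balance
        System.out.println(combinedScore(2.0, 3.0, 0.5, 2.0));
    }
}
```

Raising rescore_query_weight relative to query_weight lets the rescore query dominate the ordering inside the rescore window.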

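A request can contain several rescores. They run in sequence: the first works on the results of the query, the second on the results of the first, and so on, so a later rescore sees the ordering produced by the earlier one.

Specify both analyzer and search_analyzer when creating the template

so that search terms are tokenized the way you expect

Why doesn't the best match get the highest score?

Deduplicating a list with a stream and a TreeSet (sorted)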

final ArrayList<Integer> list = Stream.of(3, 3, 6, 3, 2, 4, 5, 6, 9).collect(Collectors.collectingAndThen(Collectors.toCollection(TreeSet::new), ArrayList::new));

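The line above deduplicates and sorts in ascending order. For descending order, give the TreeSet a reversed comparator:

public static void main(String[] args) {
    // TreeSet drops duplicates; Comparator.reverseOrder() makes it sort descending
    final ArrayList<Integer> list = Stream.of(3, 3, 6, 3, 2, 4, 5, 6, 9).collect(Collectors.collectingAndThen(Collectors.toCollection(() -> new TreeSet<Integer>(Comparator.reverseOrder())), ArrayList::new));
    System.out.println(list); // [9, 6, 5, 4, 3, 2]
}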

Finding where software is installed inside a Docker container

docker exec -it elasticsearch /bin/bash

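ES paging parameters

The from parameter is the offset of the first hit to return, not the page number

With a page size of 10, page 1 is from=0, size=10; page 2 is from=10, size=10; page 3 is from=20, size=10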
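
The page-to-offset conversion is a one-liner; a small sketch, assuming 1-based page numbers (Paging and fromOffset are hypothetical names):

```java
final class Paging {
    /** Converts a 1-based page number and a page size into the ES "from" offset. */
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        // multiplyExact fails loudly instead of silently overflowing on huge page numbers
        return Math.multiplyExact(page - 1, size);
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {
            System.out.println("page " + page + " -> from=" + fromOffset(page, 10) + ", size=10");
        }
    }
}
```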

analyzer icu_analyzer has not been configured in mappings

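I meant to connect to the test environment (ICU plugin installed) but was actually connected to local (no ICU plugin)

Sorting on a nested field

Always specify the nested path (the nested.path sort option; the older nested_path parameter is deprecated)

Unrecognized SSL message, plaintext connection?

The client is speaking https to an endpoint that answers in plain text, so the SSL handshake cannot be parsed. Either use http, or keep https and configure certificates on the server.

Freezing an index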

POST sale_index_test_bulk/_freeze
POST sale_index_test_bulk/_unfreeze

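A frozen index cannot be written to, and searches skip it by default; after unfreezing it can be written and searched again.

This is handy for testing search degradation (fallback) behavior

Vector search (like Taobao's search-by-photo for the same item)

Options: Alibaba's knn (poor write performance, high memory usage), JD's vearch, scann/faiss [these are libraries, so you have to wrap them in a service yourself], hnsw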

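How to designate the write index (switching the write index behind an alias)

Set is_write_index=true when binding the alias to the index. Only one index per alias may have is_write_index=true; if it is left unset and the alias points to more than one index, writes through the alias are rejected.

# Create aliases for the indices
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_index", // previous write index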
        "alias": "my_alias",
        "is_write_index":false
      }
    },
    {
      "add": {
        "index": "test_index01",
        "alias": "my_alias",
        "is_write_index":true
      }
    }
  ]
}

How to migrate data

Create the new index with the modified mapping

Migrate the data

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001",
    "op_type": "create" // only create documents that do not already exist in dest
  }
}

After the migration, switch the write index behind the alias

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-000002",
        "alias": "my-index-alias",
        "is_write_index": false // better to state this explicitly
      }
    },
    {
      "add": {
        "index": "my-new-index-000002",
        "alias": "my-index-alias",
        "is_write_index": true // required
      }
    }
  ]
}

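Warn: before migrating, make sure no data is in the middle of being modified

Warn: after migrating, _freeze the original index to prevent the data from diverging

How to handle data that may be lost during the migration?

Compensation mechanism

For example, from the start of the migration keep a separate index (or a cache whose expiry is set from the estimated migration time) X that records itemId values, marking the items that need to be recomputed

When the migration is done, switch the alias, stop recording itemIds, then read those items again and rewrite them to ES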
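
The compensation idea can be sketched in memory as follows. This is only an illustration under simplifying assumptions: the record lives in a concurrent set instead of a separate index or cache, and MigrationCompensation with its methods are hypothetical names. Re-reading the items and rewriting them to ES is left to the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class MigrationCompensation {
    // Ids of items that changed while the migration was running
    private final Set<Long> changedIds = ConcurrentHashMap.newKeySet();
    private volatile boolean recording;

    /** Call right before the reindex starts. */
    void start() {
        changedIds.clear();
        recording = true;
    }

    /** Call from the write path for every item that is created or updated. */
    void onItemChanged(long itemId) {
        if (recording) {
            changedIds.add(itemId);
        }
    }

    /** Call after the alias switch; returns the ids that must be re-read and rewritten. */
    List<Long> stopAndDrain() {
        recording = false;
        List<Long> ids = new ArrayList<>(changedIds);
        changedIds.clear();
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) {
        MigrationCompensation compensation = new MigrationCompensation();
        compensation.start();
        compensation.onItemChanged(3L);
        compensation.onItemChanged(1L);
        compensation.onItemChanged(3L); // duplicates collapse in the set
        System.out.println(compensation.stopAndDrain());
    }
}
```

A write that races with stopAndDrain can still slip through in this sketch, so a real implementation would need to close that window, for example by draining once more after the switch.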

Rollover in practice

# 1. Create the first index (date math in the name, URL-encoded) and bind the write alias
PUT %3Cmy-index-%7Bnow%2Fd%7D-000001%3E
{
  "aliases": {
    "my-alias1": {
      "is_write_index": true
    }
  }
}


# 2. Bulk-load data
PUT my-alias1/_bulk
{"index":{"_id":1}}
{"title":"testing 01"}
{"index":{"_id":2}}
{"title":"testing 02"}
{"index":{"_id":3}}
{"title":"testing 03"}
{"index":{"_id":4}}
{"title":"testing 04"}
{"index":{"_id":5}}
{"title":"testing 05"}
 
# 3. Roll over the index
POST my-alias1/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 5,
    "max_primary_shard_size": "50gb"
  }
}
 
GET my-alias1/_count
 
# 4. With the rollover condition met, write another document
PUT my-alias1/_bulk
{"index":{"_id":6}}
{"title":"testing 06"}
 
# 5. Search to verify that the rollover took effect
GET my-alias1/_search

GET my-index-2021.12.28-000001

How to replace special characters with spaces in ES

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "\\u005c=>\\u0020"
      ]
    }
  ],
  "text": "My license plate is/"
}

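In essence this is a native-to-ASCII conversion: the characters are written as Unicode escapes, where \u005c is the backslash and \u0020 is the space.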
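
What a mapping char filter does to the text can be mimicked in a few lines of Java. This is an illustrative sketch only, not how ES implements the filter; CharMappingDemo and applyMappings are hypothetical names, and the code points are written as hex casts to avoid Java's own handling of Unicode escapes in source code.

```java
import java.util.HashMap;
import java.util.Map;

final class CharMappingDemo {
    /** Replaces every character that has an entry in mappings; other characters pass through. */
    static String applyMappings(String text, Map<Character, Character> mappings) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(mappings.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char backslash = (char) 0x5c; // code point U+005C
        char space = (char) 0x20;     // code point U+0020
        Map<Character, Character> mappings = new HashMap<>();
        mappings.put(backslash, space);
        mappings.put('/', space);
        System.out.println(applyMappings("AB/12" + backslash + "34", mappings));
    }
}
```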

Collected special-character mappings

"char_filter": {
        "my_en_char_filter": {
          "type": "mapping",
          "mappings": [
            ".=>\\u0020",
            "/=>\\u0020",
            "#=>\\u0020",
            "(=>\\u0020",
            ")=>\\u0020",
            "_=>\\u0020",
            "==>\\u0020",
            "+=>\\u0020",
            "&=>\\u0020",
            "!=>\\u0020",
            "@=>\\u0020",
            "$=>\\u0020",
            "%=>\\u0020",
            "^=>\\u0020",
            "*=>\\u0020",
            "?=>\\u0020",
            ",=>\\u0020",
            "'=>\\u0020",
            "\"=>\\u0020",
            "[=>\\u0020",
            "]=>\\u0020",
            "{=>\\u0020",
            "}=>\\u0020",
            ":=>\\u0020",
            ";=>\\u0020",
            "\\u005c=>\\u0020",
            "|=>\\u0020"
          ]
        }
      }

Sending an asynchronous Reindex request from Java and handling the response

public Response<String> reindexAsync(String src, String dest) throws IOException {
    final ReindexRequest reindexRequest = new ReindexRequest().setSourceIndices(src).setDestIndex(dest).setRefresh(true);
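    // submitReindex builds the request with wait_for_completion=false, so the response carries the task id;
    // reindex(...) waits for completion and returns the full reindex result instead
    final Request request = RequestConverters.submitReindex(reindexRequest);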
    final org.elasticsearch.client.Response response = lowerClient.performRequest(request);
    final InputStream inputStream = response.getEntity().getContent();

    if (inputStream != null) {
        final InputStreamReader inputStreamReader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
        final Gson gson = new Gson();
        final ReindexAsyncResponse asyncResponse = gson.fromJson(inputStreamReader, ReindexAsyncResponse.class);
        return Response.success(asyncResponse.getTask());
    }
    return Response.errorMsg(null, "cannot fetch response");
}

@Data
public class ReindexAsyncResponse {
    private String task;
}

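How do I delete completed asynchronous tasks in ES to save space?

Someone else had the same question and asked on the official forum...
https://discuss.elastic.co/t/how-to-delete-a-task-in-elasticsearch-v7-14/283321

The answer there was to use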

DELETE .tasks/task/_doc/

"This is simply a document DELETE."

However, in my own tests the task was removed automatically once it finished; my version is 7.15

How do I find out which index behind an alias has is_write_index=true?

GET test_item_index_222_alias/_alias
{
  "test_item_index_666" : {
    "aliases" : {
      "test_item_index_222_alias" : {
        "is_write_index" : true
      }
    }
  },
  "test_item_index_222" : {
    "aliases" : {
      "test_item_index_222_alias" : {
        "is_write_index" : false
      }
    }
  }
}
// in Java code you will probably have to iterate over the response and check each entry
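
The iteration could look roughly like this. The sketch assumes the response has already been parsed into a map from index name to that index's is_write_index flag for the alias; WriteIndexFinder and findWriteIndex are hypothetical names, and parsing the response is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class WriteIndexFinder {
    /**
     * Returns the name of the index whose is_write_index flag is true, if any.
     * A null flag (is_write_index absent in the response) counts as not being the write index.
     */
    static Optional<String> findWriteIndex(Map<String, Boolean> isWriteIndexByIndex) {
        return isWriteIndexByIndex.entrySet().stream()
                .filter(entry -> Boolean.TRUE.equals(entry.getValue()))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        flags.put("test_item_index_666", true);
        flags.put("test_item_index_222", false);
        System.out.println(findWriteIndex(flags).orElse("no write index"));
    }
}
```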

posted @ 2022-01-19 16:10 夜旦