Bridging ES and Lucene applications: applying a complete ES query, including query_string, from Lucene

To get it out of the way first: this works, barely, but it is not perfect and comes with extra usage constraints. It is also the last post in the "subtraction" (strip-ES-down) series.

The original goal was only to extract the analyzed terms from an ES query_string, and that step alone was already enough to bridge ES with the rest of the Java and big-data ecosystem.

Because the goal was narrow — query_string only — I went with the subtraction approach.

Later I added support for non-analyzed queries such as term, terms and range.

That worked, but the analyzed and non-analyzed queries were handled independently.

For non-analyzed queries I parse the JSON string, handle each block separately, and then aggregate the results — roughly like the plain-Lucene sketch after the two blocks below.

For example:

{
    "terms": {
        "domain": [
            "www.github.com"
        ]
    }
}
{
    "range": {
        "date": {
            "gte": "2021-02-03T00:00:00",
            "lte": "2021-02-04T00:00:00"
        }
    }
}
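
To make "parse each block separately, then aggregate" concrete, here is a minimal plain-Lucene sketch. It is not the project's actual code — just an illustration using the field names from the blocks above, with the range handled as a string range for simplicity (a real ES date field is indexed differently).

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TermRangeQuery;

    public class NonAnalyzedBlocksSketch {
        public static void main(String[] args) {
            // "terms" block: domain in ["www.github.com"] -> OR of TermQuery
            BooleanQuery.Builder termsBlock = new BooleanQuery.Builder();
            termsBlock.add(new TermQuery(new Term("domain", "www.github.com")),
                    BooleanClause.Occur.SHOULD);

            // "range" block: date between gte and lte, both ends inclusive
            Query rangeBlock = TermRangeQuery.newStringRange(
                    "date", "2021-02-03T00:00:00", "2021-02-04T00:00:00", true, true);

            // aggregate the two blocks the way a bool/must would
            BooleanQuery.Builder must = new BooleanQuery.Builder();
            must.add(termsBlock.build(), BooleanClause.Occur.MUST);
            must.add(rangeBlock, BooleanClause.Occur.MUST);

            // prints roughly: +(domain:www.github.com) +date:[2021-02-03T00:00:00 TO 2021-02-04T00:00:00]
            System.out.println(must.build());
        }
    }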

A real ES query, though, can combine analyzed and non-analyzed clauses and nest them level after level, so the next step is to try applying such a query directly.

Honestly, when this requirement first came up I was not optimistic.

{
    "query": {
        "bool": {
            "must": [
                {
                    "terms": {
                        "domain": [
                            "www.github.com"
                        ]
                    }
                },
                {
                    "range": {
                        "date": {
                            "gte": "2021-02-03T00:00:00",
                            "lte": "2021-02-04T00:00:00"
                        }
                    }
                }
            ]
        }
    }
}

I found the ES query parsing class:

[Screen Shot 2021-03-02 at 5.30.23 PM]

and the corresponding test class:

[Screen Shot 2021-03-02 at 5.29.28 PM]

I extracted the test-case code from it.

Test case — no analyzed fields:

{
    "query": {
        "bool": {
            "must": [
                {
                    "terms": {
                        "domain": [
                            "www.github.com"
                        ]
                    }
                },
                {
                    "range": {
                        "date": {
                            "gte": "2021-02-03T00:00:00",
                            "lte": "2021-02-04T00:00:00"
                        }
                    }
                }
            ]
        }
    }
}
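
The parse path for this request body, reduced to a sketch: a SearchSourceBuilder is parsed from the JSON, and its QueryBuilder tree is converted into a Lucene Query. This assumes a QueryShardContext `qc` built the way the earlier posts in this series did (that setup is omitted here), and uses the ES 6.8 APIs.

    import org.apache.lucene.search.Query;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.xcontent.LoggingDeprecationHandler;
    import org.elasticsearch.common.xcontent.NamedXContentRegistry;
    import org.elasticsearch.common.xcontent.XContentParser;
    import org.elasticsearch.common.xcontent.json.JsonXContent;
    import org.elasticsearch.search.SearchModule;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import java.util.Collections;

    // register the named XContent parsers for the search DSL (bool, terms, range, query_string, ...)
    SearchModule searchModule = new SearchModule(Settings.EMPTY, false, Collections.emptyList());
    NamedXContentRegistry registry = new NamedXContentRegistry(searchModule.getNamedXContents());

    String restContent = "...";   // the JSON request body above
    try (XContentParser parser = JsonXContent.jsonXContent
            .createParser(registry, LoggingDeprecationHandler.INSTANCE, restContent)) {
        SearchSourceBuilder searchSourceBuilder = SearchSourceBuilder.fromXContent(parser);
        // the QueryBuilder tree (ES view) and the Lucene Query produced from it
        System.out.println("searchSourceBuilder query: " + searchSourceBuilder.query());
        Query luceneQuery = searchSourceBuilder.query().toQuery(qc); // qc: the assumed QueryShardContext
        System.out.println("searchSourceBuilder query: " + luceneQuery);
    }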

For these non-analyzed fields the conversion succeeds; the output:

searchSourceBuilder query: {
  "bool" : {
    "must" : [
      {
        "terms" : {
          "domain" : [
            "www.github.com"
          ],
          "boost" : 1.0
        }
      },
      {
        "range" : {
          "date" : {
            "from" : "2021-02-03T00:00:00",
            "to" : "2021-02-04T00:00:00",
            "include_lower" : true,
            "include_upper" : true,
            "boost" : 1.0
          }
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}
searchSourceBuilder query: +domain:(www.github.com) +date:[2021-02-03T00:00:00 TO 2021-02-04T00:00:00]

If an analyzed query_string field is added:

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "analyzer": "ik_smart",
                        "query": "hello word",
                        "fields": [
                            "content",
                            "title"
                        ]
                    }
                },
                {
                    "terms": {
                        "domain": [
                            "www.github.com"
                        ]
                    }
                },
                {
                    "range": {
                        "date": {
                            "gte": "2021-02-03T00:00:00",
                            "lte": "2021-02-04T00:00:00"
                        }
                    }
                }
            ]
        }
    }
}

[cclient_test_index] QueryShardException[[query_string] analyzer [ik_smart] not found
]
at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:1011)
at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:105)
at org.elasticsearch.index.query.BoolQueryBuilder.addBooleanClauses(BoolQueryBuilder.java:415)
at org.elasticsearch.index.query.BoolQueryBuilder.doToQuery(BoolQueryBuilder.java:383)

At this point the root cause is clear, and it is the fundamental one: ES loads plugins and analyzers from its config/plugins directories, defines field types via mappings, and applies both when executing queries.

Here I have neither plugins nor analyzers — in particular, no ik_smart plugin is installed.

If I delete the "analyzer": "ik_smart" line,

the query executes, but the output is wrong:

searchSourceBuilder query: +MatchNoDocsQuery("unmapped fields [null]") +domain:(www.github.com) +date:[2021-02-03T00:00:00 TO 2021-02-04T00:00:00]

My own approach sets the field and analyzer on QueryStringQueryParser from the outside, builds my own QueryStringQueryParser instance, and uses it to parse the query string into a Lucene Query.

Official ES, however, goes through QueryStringQueryBuilder, which instantiates QueryStringQueryParser internally, so the externally configured field and analyzer never reach it.

That means the QueryStringQueryParser instances created inside QueryStringQueryBuilder have to be changed so that every instance gets the field and analyzer.

This continues the previous post, which worked with QueryStringQueryParser directly — which is why I did not want to modify QueryStringQueryBuilder.

The requirements are creeping ever closer to native ES behavior. If that had been the goal from the start, I would have picked the "addition" approach (building up a real ES environment) in the first place.

I do not feel like patching QueryStringQueryBuilder any further and plan to switch to the addition route, simulating a real ES environment.

Still, might as well see this through to the end.

QueryStringQueryBuilder creates QueryStringQueryParser in too many places — the snippet below alone contains five instantiations. Setting the field and analyzer on each one separately would mean a lot of duplicated code:

        if (defaultField != null) {
            if (Regex.isMatchAllPattern(defaultField)) {
                queryParser = new QueryStringQueryParser(context, lenient == null ? true : lenient);
            } else {
                queryParser = new QueryStringQueryParser(context, defaultField, isLenient);
            }
        } else if (fieldsAndWeights.size() > 0) {
            final Map<String, Float> resolvedFields = QueryParserHelper.resolveMappingFields(context, fieldsAndWeights);
            queryParser = new QueryStringQueryParser(context, resolvedFields, isLenient);
        } else {
            List<String> defaultFields = context.defaultFields();
            if (context.getMapperService().allEnabled() == false &&
                    defaultFields.size() == 1 && AllFieldMapper.NAME.equals(defaultFields.get(0))) {
                // For indices created before 6.0 with _all disabled
                defaultFields = Collections.singletonList("*");
            }
            boolean isAllField = defaultFields.size() == 1 && Regex.isMatchAllPattern(defaultFields.get(0));
            if (isAllField) {
                queryParser = new QueryStringQueryParser(context, lenient == null ? true : lenient);
            } else {
                final Map<String, Float> resolvedFields = QueryParserHelper.resolveMappingFields(context,
                        QueryParserHelper.parseFieldsAndWeights(defaultFields));
                queryParser = new QueryStringQueryParser(context, resolvedFields, isLenient);
            }
        }

So instead the hook goes into the constructor: QueryStringQueryParser gets two static fields, outForceAnalyzer and outMultiFields, which can be set from the outside and are picked up whenever an instance is created.

    // Static hooks set from the outside; every QueryStringQueryParser instance
    // created inside QueryStringQueryBuilder picks them up in its constructor.
    static Analyzer outForceAnalyzer = null;
    static List<String> outMultiFields = null;

    private QueryStringQueryParser(QueryShardContext context, String defaultField,
                                   Map<String, Float> fieldsAndWeights,
                                   boolean lenient, Analyzer analyzer) {
        super(defaultField, analyzer);
        this.context = context;
        this.fieldsAndWeights = Collections.unmodifiableMap(fieldsAndWeights);
        this.queryBuilder = new MultiMatchQuery(context);
        queryBuilder.setZeroTermsQuery(MatchQuery.ZeroTermsQuery.NULL);
        queryBuilder.setLenient(lenient);
        this.lenient = lenient;
        if (outForceAnalyzer != null) {
            this.forceAnalyzer = outForceAnalyzer;
        }
        if (outMultiFields != null) {
            this.multiFields = outMultiFields;
        }
    }
        // Point the static hooks at the analyzed fields and the analyzer before parsing the request body
        QueryStringQueryParser.outMultiFields = Arrays.asList("content", "title");
        QueryStringQueryParser.outForceAnalyzer = new StandardAnalyzer();
        try (XContentParser parser = queryParser.createParser(JsonXContent.jsonXContent, restContent)) {
            SearchSourceBuilder searchSourceBuilder = SearchSourceBuilder.fromXContent(parser);
            System.out.println("searchSourceBuilder: " + searchSourceBuilder);
            System.out.println("searchSourceBuilder query: " + searchSourceBuilder.query());
            // toQuery(qc) turns the QueryBuilder tree into the final Lucene Query
            System.out.println("searchSourceBuilder query: " + searchSourceBuilder.query().toQuery(qc));
        }

searchSourceBuilder query: +((title:hello title:word) | (content:hello content:word)) +domain:(www.github.com) +date:[2021-02-03T00:00:00 TO 2021-02-04T00:00:00]

Verified successfully.

This also ties off what the previous post left open: generating the Lucene Query through QueryStringQueryBuilder itself.

At this point, ES query DSL that contains no analyzed query_string fields can be parsed directly.

Queries that do contain analyzed query_string fields only partly work, and come with conditions:

  • 1 If the analyzer is one built into ES, e.g. standard, nothing needs to change. [Screen Shot 2021-03-02 at 10.20.37 PM]

  • 2 If the analyzer comes from a third-party plugin, e.g. https://github.com/medcl/elasticsearch-analysis-ik:

    "query_string": {
        "analyzer": "ik_smart",
        "query": "hello word",
        "fields": [
            "content",
            "title"
        ]
    }
    

    then the "analyzer": "ik_smart" line has to be removed, otherwise it fails with [cclient_test_index] QueryShardException[[query_string] analyzer [ik_smart] not found.

    As explained several times already, this approach has no mechanism for loading third-party analysis plugins.

  • 3 The analyzed fields and the analyzer have to be set manually. The example here uses new StandardAnalyzer(); in practice the IK analyzer can be passed in from outside instead (see the sketch right after this list).

    QueryStringQueryParser.outMultiFields= Arrays.asList("content","title");
    QueryStringQueryParser.outForceAnalyzer=new StandardAnalyzer();
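
What "pass in the IK analyzer" might look like — a hypothetical sketch only: org.wltea.analyzer.lucene.IKAnalyzer comes from the IK project and is assumed to be on the classpath, and the exact constructor differs between IK releases (the classic standalone IK jar takes a boolean useSmart flag, while elasticsearch-analysis-ik versions take a Configuration object), so adjust to the IK version actually in use:

    import java.util.Arrays;
    import org.wltea.analyzer.lucene.IKAnalyzer;   // from the IK analyzer project (assumed available)

    QueryStringQueryParser.outMultiFields = Arrays.asList("content", "title");
    // true = smart segmentation, roughly equivalent to ik_smart; constructor signature depends on the IK version
    QueryStringQueryParser.outForceAnalyzer = new IKAnalyzer(true);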
    

That wraps up this approach.

I will not take it any further.

The demands on ES compatibility keep growing, and the subtraction approach inevitably loses some information along the way, so a different route is worth investigating.

The current solution is not perfect, but it is good enough for a fair share of query scenarios.

It covers queries only — no ES aggregations. Aggregations have to be implemented externally, which is fine: the whole point of this work was to bridge to other OLAP systems, so aggregations can be done there, e.g. with SQL engines, Druid, ClickHouse or Kylin.

Also, everything above was done against ES 6.8, while the current ES releases are already at 7.10.2 / 7.11.1. Any follow-up will target the newer versions.
