陋室铭
永远也不要停下学习的脚步(大道至简至易)

posts - 2169,comments - 570,views - 413万
               

今天碰到一个问题,数据之前入solr的时候并没有计算条数,现在需要计算出某几个表中去重后的总数。
由于solr的ISearch并没有相关的Distinct功能.想到一个解决方案是用Solr的Facet分组进行GrupBy,但是因为Facet只能返回100条,而数据肯定大于100个分组.所有该方案PASS了。
后来在网上搜到Solr Count Distinct,这么一个东西,是Solr已经发布的脚本(Solr Search Requests)其中有类似的功能

A 100% accurate count of distinct values (count distinct) is not generally possible without actually observing all of the values together. However there are a number of ways to estimate the count.

“unique” Facet Function
  The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
  It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default).

When the number of unique values does exceed 100 in any given shard, the following algorithm is used:

It estimates the count by sending the top 100 results from each shard along with the total exact “unique” count for each shard.
  totalSeen is the number of actual results we saw from all shards (i.e. not deduped yet).
  uniqueSeen is the number of unique values we saw from all shards (i.e. deduped).
  notSeen is the number of unique values from each shard that were not sent (because of the 100 cutoff).
  factor = uniqueSeen / totalSeen (i.e. what fraction of values that we saw were unique)
  estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)
  Example use:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'
  • 1
  • 2
  • 3
  • 4
  • 5

For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions


Aggregation Functions
Faceting involves breaking up the domain into multiple buckets and providing information about each bucket.
There are multiple aggregation functions / statistics that can be used:

AggregationExampleEffect
sum sum(sales) summation of numeric values
avg avg(popularity) average of numeric values
sumsq sumsq(rent) sum of squares
min min(salary) minimum value
max max(mul(price,popularity)) maximum value
unique unique(state) number of unique values (count distinct)
hll hll(state) number of unique values using the HyperLogLog algorithm
percentile percentile(salary,50,75,99,99.9)    calculates percentiles

下面是我写的一个例子

curl http://192.168.1.1:8080/solr/xxshard/query?q=*:* -d '
    json.facet={
        x:"unique(RB040002)"
    }'
  • 1
  • 2
  • 3
  • 4

详细用法及其他功能在下面原文中

http://yonik.com/solr-count-distinct/
  http://yonik.com/solr-facet-functions/

           
posted on   宏宇  阅读(686)  评论(0编辑  收藏  举报
编辑推荐:
· AI与.NET技术实操系列:基于图像分类模型对图像进行分类
· go语言实现终端里的倒计时
· 如何编写易于单元测试的代码
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
阅读排行:
· 分享一个免费、快速、无限量使用的满血 DeepSeek R1 模型,支持深度思考和联网搜索!
· 25岁的心里话
· 基于 Docker 搭建 FRP 内网穿透开源项目(很简单哒)
· ollama系列01:轻松3步本地部署deepseek,普通电脑可用
· 按钮权限的设计及实现
历史上的今天:
2017-07-19 AngularJs+bootstrap搭载前台框架——准备工作
2015-07-19 关于WinPE安装操作系统
< 2025年3月 >
23 24 25 26 27 28 1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31 1 2 3 4 5

点击右上角即可分享
微信分享提示