Lucene.Net: Counting Search Results by Category

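Today someone in our chat group asked: "How do I count search results by category? Do I really have to loop over the categories and run a query for each one to get its total?"

With the current Lucene.Net API, that is the only way. Running one query per category usually causes performance problems, though, so a better solution is to modify the library itself.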

Note: this article applies only to Lucene.Net and uses version 2.1 as its example. Other versions may differ, and the Java version differs even more.

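Before changing the library we need a plan. A Lucene.Net query returns a Hits object, whose Length method gives the total number of results. That total is an exact count, and it is actually computed in the Collect method of the TopDocCollector class. If you wanted to turn the exact count into an estimate, this is where the algorithm would go, and it is also the natural place to count hits per category.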

        public override void  Collect(int doc, float score)
        {
            if (score > 0.0f)
            {
                totalHits++;
                if (hq.Size() < numHits || score >= minScore)
                {
                    hq.Insert(new ScoreDoc(doc, score));
                    minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
                }
            }
        }

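This method already receives the document id, so as soon as we can get hold of the Document we can read its category. Both IndexSearcher and IndexReader can fetch a Document by id. IndexReader is the more economical choice here, because IndexSearcher itself wraps an IndexReader.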

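Collect is called from several places, all of them in the Scorer family of classes, such as TermScorer and BooleanScorer2. Adding a parameter to Collect for category counting would therefore mean a fairly large number of changes. Instead, we modify the TopDocCollector constructor so that it receives the IndexReader.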
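The design choice can be illustrated outside of Lucene.Net. The sketch below is a simplified, self-contained model and not library code: HitCollectorSketch, CategoryCollectorSketch, ScorerSketch and the lookupCategory delegate are all hypothetical names, with the delegate standing in for the IndexReader. Because the extra dependency arrives through the constructor, code that only knows the two-argument Collect callback does not need to change.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a hit collector: scorers only know this signature.
public abstract class HitCollectorSketch
{
    public abstract void Collect(int doc, float score);
}

// Hypothetical collector: the extra dependency is injected through the constructor.
public class CategoryCollectorSketch : HitCollectorSketch
{
    // Stands in for the IndexReader: maps a document id to its category.
    private readonly Func<int, int> lookupCategory;
    public readonly List<int> SeenCategories = new List<int>();

    public CategoryCollectorSketch(Func<int, int> lookupCategory)
    {
        this.lookupCategory = lookupCategory;
    }

    public override void Collect(int doc, float score)
    {
        // Same guard as in the article's Collect: ignore non-positive scores.
        if (score > 0.0f)
            SeenCategories.Add(lookupCategory(doc));
    }
}

public static class ScorerSketch
{
    // Models a scorer: it reports hits through the unchanged two-argument callback.
    public static void Score(int[] docs, float[] scores, HitCollectorSketch collector)
    {
        for (int i = 0; i < docs.Length; i++)
            collector.Collect(docs[i], scores[i]);
    }
}
```

ScorerSketch.Score never sees the lookup delegate, which mirrors why TermScorer, BooleanScorer2 and the other scorers can stay untouched when only the constructor changes.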

        private IndexReader reader;

        public TopDocCollector(int numHits, IndexReader reader)
            : this(numHits, new HitQueue(numHits), reader)
        {
        }

        internal TopDocCollector(int numHits, PriorityQueue hq, IndexReader reader)
        {
            this.numHits = numHits;
            this.hq = hq;
            this.reader = reader;
        }

Two places that call this constructor need to be updated as well.

The TopFieldDocCollector constructor:

        public TopFieldDocCollector(IndexReader reader, Sort sort, int numHits)
            : base(numHits, new FieldSortedHitQueue(reader, sort.fields, numHits), reader) {
        }

The Search method of IndexSearcher:

        public override TopDocs Search(Weight weight, Filter filter, int nDocs)
        {

            if (nDocs <= 0)
                // null might be returned from hq.top() below.
                throw new System.ArgumentException("nDocs must be > 0");

            TopDocCollector collector = new TopDocCollector(nDocs, this.reader);
            Search(weight, filter, collector);
            return collector.TopDocs();
        }

Now the TopDocCollector class can read the category of each hit.

        public override void  Collect(int doc, float score)
        {
            if (score > 0.0f)
            {
                Document d = reader.Document(doc);
                int category = int.Parse(d.Get("category"));

                totalHits++;
                if (hq.Size() < numHits || score >= minScore)
                {
                    hq.Insert(new ScoreDoc(doc, score));
                    minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
                }
            }
        }

Finally, the counts have to be exposed through the Hits class. What gets returned is determined by TopDocCollector's public virtual TopDocs TopDocs() method, so add a field to TopDocs:

        public System.Collections.Generic.Dictionary<int, int> category_count;

Change the Collect method to:

        // Per-category hit counts, keyed by the integer value of the "category" field.
        private System.Collections.Generic.Dictionary<int, int> category_count = new System.Collections.Generic.Dictionary<int,int>();
        public override void  Collect(int doc, float score)
        {
            if (score > 0.0f)
            {
                // Load the stored document to read its category (assumes every
                // document stores an integer "category" field).
                Document d = reader.Document(doc);
                int category = int.Parse(d.Get("category"));
                // Increment the counter for this category, creating it on first sight.
                if (category_count.ContainsKey(category))
                    category_count[category]++;
                else
                    category_count.Add(category, 1);
                totalHits++;
                if (hq.Size() < numHits || score >= minScore)
                {
                    hq.Insert(new ScoreDoc(doc, score));
                    minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
                }
            }
        }
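One thing to watch: this Collect method calls int.Parse on the result of d.Get("category"). Document.Get returns null when the document has no field with that name, and int.Parse throws on null or non-numeric input, so a single document without a usable category would abort the whole search. Below is a minimal sketch of a more defensive counting step, assuming you would rather skip such documents than fail. CategoryCounterSketch and TryCount are hypothetical names and are not part of Lucene.Net.

```csharp
using System.Collections.Generic;

public class CategoryCounterSketch
{
    private readonly Dictionary<int, int> counts = new Dictionary<int, int>();

    public Dictionary<int, int> Counts { get { return counts; } }

    // rawCategory is the string that d.Get("category") would return; it may be null.
    // Returns false (and counts nothing) when the value is missing or not an integer.
    public bool TryCount(string rawCategory)
    {
        int category;
        if (!int.TryParse(rawCategory, out category))
            return false;

        int current;
        counts.TryGetValue(category, out current); // current is 0 when the key is absent
        counts[category] = current + 1;
        return true;
    }
}
```

Inside Collect, the ContainsKey/Add branch could then be replaced by a single call such as counter.TryCount(d.Get("category")), with the counter's dictionary handed to TopDocs in the same way as category_count.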

Change the TopDocs method to:

        public virtual TopDocs TopDocs()
        {
            ScoreDoc[] scoreDocs = new ScoreDoc[hq.Size()];
            for (int i = hq.Size() - 1; i >= 0; i--)
                // put docs in array
                scoreDocs[i] = (ScoreDoc) hq.Pop();

            float maxScore = (totalHits == 0) ? System.Single.NegativeInfinity : scoreDocs[0].score;
            TopDocs docs = new TopDocs(totalHits, scoreDocs, maxScore);
            docs.category_count = category_count;
            return docs;
        }

Add the following to the Hits class:

        // Requires "using System.Collections.Generic;" at the top of the file.
        private Dictionary<int, int> category_count;
        public Dictionary<int, int> Category_Count {
            get {
                return category_count;
            }
        }

And modify GetMoreDocs so that it copies the counts over:

        private void  GetMoreDocs(int min)
        {
            if (hitDocs.Count > min)
            {
                min = hitDocs.Count;
            }

            int n = min * 2; // double # retrieved
            TopDocs topDocs = (sort == null) ? searcher.Search(weight, filter, n) : searcher.Search(weight, filter, n, sort);
            category_count = topDocs.category_count;
            length = topDocs.totalHits;
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;

            float scoreNorm = 1.0f;

            if (length > 0 && topDocs.GetMaxScore() > 1.0f)
            {
                scoreNorm = 1.0f / topDocs.GetMaxScore();
            }

            int end = scoreDocs.Length < length?scoreDocs.Length:length;
            for (int i = hitDocs.Count; i < end; i++)
            {
                hitDocs.Add(new HitDoc(scoreDocs[i].score * scoreNorm, scoreDocs[i].doc));
            }
        }

That is all there is to it. To read a count from the results, for example for the category with ID 1, use hits.Category_Count[1].
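Keep in mind that Collect only adds a key for categories it has actually seen, and the indexer of Dictionary<int, int> throws KeyNotFoundException for a missing key. A small helper along the following lines avoids that for categories with no hits. CategoryCountReader and GetCount are hypothetical names introduced for this sketch.

```csharp
using System.Collections.Generic;

public static class CategoryCountReader
{
    // Returns 0 for categories that produced no hits, or when counts is null.
    public static int GetCount(Dictionary<int, int> counts, int category)
    {
        if (counts == null)
            return 0;

        int n;
        return counts.TryGetValue(category, out n) ? n : 0;
    }
}
```

With the modified Hits class, a call would look like CategoryCountReader.GetCount(hits.Category_Count, 1).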

Posted 2009-01-09 17:42 by Birdshover
