Scoring mechanism: TF/IDF

Algorithm overview

Put simply, the relevance score algorithm computes how closely a piece of text in an index matches the search text.

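Elasticsearch's relevance scoring is built on the term frequency/inverse document frequency algorithm, TF/IDF for short: TF stands for Term Frequency and IDF for Inverse Document Frequency. Note that since Elasticsearch 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output further down reports BM25-style idf and tf formulas.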

Term frequency: how many times each term of the search text appears in the field's text. The more often it appears, the more relevant the document.

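Example: the search request is hello world

doc1: hello you and me, and world is very good

doc2: hello, how are you

doc1 contains both hello and world, while doc2 contains only hello, so doc1 is more relevant.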
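
The comparison above can be sketched in a few lines of Python. This is only an illustration of raw term counting: the tokenizer (lowercase, split on non-letters) and the helper name term_frequencies are assumptions made for the example, not how Elasticsearch analyzes text.

```python
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Lowercase the text, extract runs of letters, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

query_terms = ["hello", "world"]
docs = {
    "doc1": "hello you and me, and world is very good",
    "doc2": "hello, how are you",
}

# For each document, count how often each query term occurs.
# A Counter returns 0 for terms that never appear.
tf_table = {
    name: {term: term_frequencies(text)[term] for term in query_terms}
    for name, text in docs.items()
}

for name, counts in tf_table.items():
    print(name, counts, "total =", sum(counts.values()))
```

doc1 matches both query terms and doc2 only one, which is the intuition behind ranking doc1 higher.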

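Inverse document frequency: how often each term of the search text appears across all documents in the whole index. The more documents a term appears in, the less that term contributes to relevance.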

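Example: the search request is hello world

doc1: hello, today is very good

doc2: hi world, how are you

Suppose the whole index holds 100 million documents: 1,000 of them contain hello and 100 of them contain world.

doc2 is more relevant, because world is the rarer term and therefore carries more weight.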
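
To put rough numbers on this, here is a small sketch that plugs the counts from the example into the idf formula that appears in the explain output later in this post, log(1 + (N - n + 0.5) / (n + 0.5)). The function name idf is just a label for this example, and the natural logarithm is assumed.

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), using the natural logarithm."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

N = 100_000_000            # documents in the index (from the example)
idf_hello = idf(N, 1000)   # hello appears in 1,000 documents
idf_world = idf(N, 100)    # world appears in 100 documents

print(f"idf(hello) = {idf_hello:.4f}")
print(f"idf(world) = {idf_world:.4f}")
```

The rarer term world gets the larger idf, so a document matching world (doc2) is weighted above one matching only hello (doc1).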

Field-length norm: the length of the field. The longer the field in which a term matches, the weaker the relevance.

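Example: the search request is hello world

doc1: {"title": "hello article", "content": "blah blah ... (10,000 words)"}

doc2: {"title": "my article", "content": "blah blah ... (10,000 words), world"}

hello matches in doc1's short title field, while world matches in doc2's very long content field, so doc1 is more relevant.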
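
As a rough illustration, the sketch below uses the classic Lucene TF/IDF length norm, 1 / sqrt(number of terms in the field). That choice is an assumption for this example; BM25, which the explain output later in this post reflects, normalizes length differently, through dl / avgdl. The term counts (2 for the title, 10,001 for the content) are read off the example above.

```python
import math

def field_length_norm(num_terms: int) -> float:
    """Classic TF/IDF length norm: shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_terms)

title_norm = field_length_norm(2)          # "hello article": 2 terms
content_norm = field_length_norm(10_001)   # 10,000 filler words plus "world"

print(f"title norm   = {title_norm:.4f}")
print(f"content norm = {content_norm:.4f}")
```

A match in the two-word title keeps most of its weight, while a match buried in a 10,001-word field is scaled down heavily.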

How _score is calculated

POST /article/_search
http://139.196.183.130:9200/article/_search

{
    "explain":true,
    "query":{
        "match":{
            "content":"老铁,奥利给"
        }
    }
}
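
If you would rather issue the search from a script, the following is a minimal sketch using only the Python standard library. The helper name build_search_request and the base URL http://localhost:9200 are assumptions for illustration; the body sets "explain" to true, which is what makes Elasticsearch include an _explanation block for every hit.

```python
import json
import urllib.request

def build_search_request(base_url: str, index: str, field: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST /<index>/_search request with explain enabled."""
    body = {
        "explain": True,  # ask for a score breakdown on every hit
        "query": {"match": {field: text}},
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_search_request("http://localhost:9200", "article", "content", "老铁,奥利给")

# To actually send it against a running cluster (not executed here):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```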

The response looks like this:

{
    "took":2,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":{
            "value":3,
            "relation":"eq"
        },
        "max_score":1.0393288,
        "hits":[
            {
                "_shard":"[article][0]",
                "_node":"SoA0mqjkTzSn8EoI7Vb7OQ",
                "_index":"article",
                "_type":"_doc",
                "_id":"0352c029-92c1-4bdd-a6e9-672e1ebf6f3e",
                "_score":1.0393288,
                "_source":{
                    "_class":"com.example.pojo.Article",
                    "id":"0352c029-92c1-4bdd-a6e9-672e1ebf6f3e",
                    "title":"Java从入门到精通",
                    "content":"老铁们,今天就不拍什么搞笑段子了,给大家表演一个铁锅炖大鹅,奥利给",
                    "read":500,
                    "types":"Java",
                    "author":"鸡哥"
                },
                "_explanation":{
                    "value":1.0393288,
                    "description":"sum of:",
                    "details":[
                        {
                            "value":0.91478837,
                            "description":"weight(content:老铁 in 0) [PerFieldSimilarity], result of:",
                            "details":[
                                {
                                    "value":0.91478837,
                                    "description":"score(freq=1.0), product of:",
                                    "details":[
                                        {
                                            "value":2.2,
                                            "description":"boost",
                                            "details":[

                                            ]
                                        },
                                        {
                                            "value":0.98082924,
                                            "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details":[
                                                {
                                                    "value":1,
                                                    "description":"n, number of documents containing term",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":3,
                                                    "description":"N, total number of documents with field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        },
                                        {
                                            "value":0.42394015,
                                            "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details":[
                                                {
                                                    "value":1,
                                                    "description":"freq, occurrences of term within document",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":1.2,
                                                    "description":"k1, term saturation parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":0.75,
                                                    "description":"b, length normalization parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":20,
                                                    "description":"dl, length of field",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":17,
                                                    "description":"avgdl, average length of field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value":0.1245405,
                            "description":"weight(content:奥利给 in 0) [PerFieldSimilarity], result of:",
                            "details":[
                                {
                                    "value":0.1245405,
                                    "description":"score(freq=1.0), product of:",
                                    "details":[
                                        {
                                            "value":2.2,
                                            "description":"boost",
                                            "details":[

                                            ]
                                        },
                                        {
                                            "value":0.13353139,
                                            "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details":[
                                                {
                                                    "value":3,
                                                    "description":"n, number of documents containing term",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":3,
                                                    "description":"N, total number of documents with field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        },
                                        {
                                            "value":0.42394015,
                                            "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details":[
                                                {
                                                    "value":1,
                                                    "description":"freq, occurrences of term within document",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":1.2,
                                                    "description":"k1, term saturation parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":0.75,
                                                    "description":"b, length normalization parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":20,
                                                    "description":"dl, length of field",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":17,
                                                    "description":"avgdl, average length of field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard":"[article][0]",
                "_node":"SoA0mqjkTzSn8EoI7Vb7OQ",
                "_index":"article",
                "_type":"_doc",
                "_id":"077f557a-c26f-479a-9111-c44a587dbfe5",
                "_score":0.16058116,
                "_source":{
                    "_class":"com.example.pojo.Article",
                    "id":"077f557a-c26f-479a-9111-c44a587dbfe5",
                    "title":"Java从入门到精通",
                    "content":"Java是一门面向对象语言,奥利给",
                    "read":500,
                    "types":"PHP",
                    "author":"雷哥"
                },
                "_explanation":{
                    "value":0.16058116,
                    "description":"sum of:",
                    "details":[
                        {
                            "value":0.16058116,
                            "description":"weight(content:奥利给 in 0) [PerFieldSimilarity], result of:",
                            "details":[
                                {
                                    "value":0.16058116,
                                    "description":"score(freq=1.0), product of:",
                                    "details":[
                                        {
                                            "value":2.2,
                                            "description":"boost",
                                            "details":[

                                            ]
                                        },
                                        {
                                            "value":0.13353139,
                                            "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details":[
                                                {
                                                    "value":3,
                                                    "description":"n, number of documents containing term",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":3,
                                                    "description":"N, total number of documents with field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        },
                                        {
                                            "value":0.54662377,
                                            "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details":[
                                                {
                                                    "value":1,
                                                    "description":"freq, occurrences of term within document",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":1.2,
                                                    "description":"k1, term saturation parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":0.75,
                                                    "description":"b, length normalization parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":10,
                                                    "description":"dl, length of field",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":17,
                                                    "description":"avgdl, average length of field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard":"[article][0]",
                "_node":"SoA0mqjkTzSn8EoI7Vb7OQ",
                "_index":"article",
                "_type":"_doc",
                "_id":"76eae24a-925f-48f8-99b7-7807b03c14e0",
                "_score":0.12180669,
                "_source":{
                    "_class":"com.example.pojo.Article",
                    "id":"76eae24a-925f-48f8-99b7-7807b03c14e0",
                    "title":"PHP是世界上最好的语言",
                    "content":"学习不可三天打鱼两天晒网,打铁还需自身硬,干就完了,加油奥利给",
                    "read":0,
                    "types":"Java",
                    "author":"吴彦祖"
                },
                "_explanation":{
                    "value":0.12180669,
                    "description":"sum of:",
                    "details":[
                        {
                            "value":0.12180669,
                            "description":"weight(content:奥利给 in 0) [PerFieldSimilarity], result of:",
                            "details":[
                                {
                                    "value":0.12180669,
                                    "description":"score(freq=1.0), product of:",
                                    "details":[
                                        {
                                            "value":2.2,
                                            "description":"boost",
                                            "details":[

                                            ]
                                        },
                                        {
                                            "value":0.13353139,
                                            "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details":[
                                                {
                                                    "value":3,
                                                    "description":"n, number of documents containing term",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":3,
                                                    "description":"N, total number of documents with field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        },
                                        {
                                            "value":0.41463417,
                                            "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details":[
                                                {
                                                    "value":1,
                                                    "description":"freq, occurrences of term within document",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":1.2,
                                                    "description":"k1, term saturation parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":0.75,
                                                    "description":"b, length normalization parameter",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":21,
                                                    "description":"dl, length of field",
                                                    "details":[

                                                    ]
                                                },
                                                {
                                                    "value":17,
                                                    "description":"avgdl, average length of field",
                                                    "details":[

                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}
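
To check that the numbers in the _explanation blocks really add up, here is a small sketch that recomputes the scores from the formulas and values printed in the response. The function names are made up for this example; the formulas, parameters (k1 = 1.2, b = 0.75) and inputs are copied from the output above. Note that the reported boost of 2.2 equals k1 + 1.

```python
import math

def bm25_idf(N: int, n: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(boost: float, N: int, n: int, freq: float, dl: float, avgdl: float) -> float:
    """Per-term score: the product of boost, idf and tf."""
    return boost * bm25_idf(N, n) * bm25_tf(freq, dl, avgdl)

# First hit: both terms match, the field length is 20, the average length is 17.
score_laotie = term_score(boost=2.2, N=3, n=1, freq=1, dl=20, avgdl=17)   # 老铁
score_aoligei = term_score(boost=2.2, N=3, n=3, freq=1, dl=20, avgdl=17)  # 奥利给
first_hit = score_laotie + score_aoligei  # "sum of:" in the explanation

# Second and third hits: only 奥利给 matches, with field lengths 10 and 21.
second_hit = term_score(boost=2.2, N=3, n=3, freq=1, dl=10, avgdl=17)
third_hit = term_score(boost=2.2, N=3, n=3, freq=1, dl=21, avgdl=17)

print(first_hit, second_hit, third_hit)
```

Up to floating-point rounding, the three results line up with the _score values 1.0393288, 0.16058116 and 0.12180669 in the response. The term 老铁 dominates the first hit because it occurs in only one of the three documents, and the second hit outranks the third purely because its content field is shorter.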

How a single document gets matched

POST /article/_explain/0352c029-92c1-4bdd-a6e9-672e1ebf6f3e
{
    "query":{
        "match":{
            "content":"老铁,奥利给"
        }
    }
}
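
The explanation tree that comes back is deeply nested and hard to read as raw JSON. As a convenience, here is a small sketch that flattens such a tree into indented lines. The helper name format_explanation is an assumption for this example, and the sample node is an abridged copy of one branch from the search response above; it relies only on the value, description and details keys seen there.

```python
def format_explanation(node: dict, depth: int = 0) -> list:
    """Return one indented line per node: '<value>  <description>'."""
    lines = [f"{'  ' * depth}{node['value']}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Abridged branch copied from the explain output above.
sample = {
    "value": 0.1245405,
    "description": "score(freq=1.0), product of:",
    "details": [
        {"value": 2.2, "description": "boost", "details": []},
        {
            "value": 0.13353139,
            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details": [
                {"value": 3, "description": "n, number of documents containing term", "details": []},
                {"value": 3, "description": "N, total number of documents with field", "details": []},
            ],
        },
    ],
}

lines = format_explanation(sample)
print("\n".join(lines))
```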
Posted by BNTang