Scoring mechanism: TF/IDF
Algorithm overview
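Put simply, the relevance score measures how closely the text of an indexed document matches the search text.
Elasticsearch's scoring is based on the term frequency/inverse document frequency algorithm, TF/IDF for short. TF stands for term frequency and IDF for inverse document frequency. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF built from the same ingredients, which is why the explain output later in this article shows BM25-style formulas.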
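Term frequency: how many times each term of the search text appears in the field text. The more often it appears, the more relevant the document.
Example: search request: hello world
doc1: hello you and me,and world is very good
doc2: hello,how are you
doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.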
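The term-frequency idea can be sketched in a few lines of Python. This is only an illustration: the helper names `tokenize` and `term_frequency` are hypothetical, and the tokenizer is a naive assumption (lowercase, runs of ASCII letters), far simpler than a real Elasticsearch analyzer.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Naive tokenizer (an assumption): lowercase, keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(query: str, field_text: str) -> dict:
    """Count how often each query term occurs in the field text."""
    counts = Counter(tokenize(field_text))
    # Counter returns 0 for terms that never occur in the field.
    return {term: counts[term] for term in tokenize(query)}


doc1 = "hello you and me,and world is very good"
doc2 = "hello,how are you"

tf_doc1 = term_frequency("hello world", doc1)
tf_doc2 = term_frequency("hello world", doc2)
print(tf_doc1)  # {'hello': 1, 'world': 1}
print(tf_doc2)  # {'hello': 1, 'world': 0}
```

doc1 matches both query terms, doc2 only one, which is the intuition behind ranking doc1 higher.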
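Inverse document frequency: how many documents in the whole index contain each term of the search text. The more documents contain a term, the less that term contributes to relevance.
Example: search request: hello world
doc1: hello,today is very good
doc2: hi world,how are you
Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.
doc2 is more relevant, because world is the rarer term and therefore carries more weight.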
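To put numbers on this example, here is a small sketch that plugs the document counts into the idf formula that appears in the explain output later in this article, log(1 + (N - n + 0.5) / (n + 0.5)). The function name `idf` is hypothetical, and classic TF/IDF uses a slightly different formula with the same trend.

```python
import math


def idf(n: int, N: int) -> float:
    """idf = ln(1 + (N - n + 0.5) / (n + 0.5)).

    n: number of documents containing the term
    N: total number of documents
    """
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


N = 100_000_000            # documents in the index (from the example)
idf_hello = idf(1_000, N)  # hello appears in 1,000 documents
idf_world = idf(100, N)    # world appears in 100 documents

print(f"hello: {idf_hello:.2f}")
print(f"world: {idf_world:.2f}")
```

The rarer term world gets the larger idf (roughly 13.8 versus 11.5), so a match on world counts for more than a match on hello.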
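Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match inside it.
Example: search request: hello world
doc1: {"title": "hello article", "content": "balabalabal x 10,000"}
doc2: {"title": "my article", "content": "balabalabal x 10,000,world"}
doc1 is more relevant: hello matches in the short title field, while world matches in the very long content field.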
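The effect of field length can be illustrated with the tf formula that appears in the explain output later in this article, freq / (freq + k1 * (1 - b + b * dl / avgdl)), where dl is the field length. The numbers below are assumptions chosen for illustration only: a 2-term title, a 10,000-term content field, and the same average length of 100 for both.

```python
def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Length-normalized term frequency, as printed in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


AVGDL = 100  # assumed average field length, for illustration only

tf_short_field = bm25_tf(freq=1, dl=2, avgdl=AVGDL)       # e.g. a short title
tf_long_field = bm25_tf(freq=1, dl=10_000, avgdl=AVGDL)   # e.g. a long content field

print(f"short field: {tf_short_field:.4f}")
print(f"long field:  {tf_long_field:.4f}")
```

A single occurrence in the short field yields a far larger tf component than a single occurrence in the long field, which matches the rule stated above.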
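How _score is calculated
Setting explain to true in the search request makes every hit carry an _explanation that breaks its score down term by term.
POST /article/_search
http://139.196.183.130:9200/article/_search
{
  "explain": true,
  "query": {
    "match": {
      "content": "老铁,奥利给"
    }
  }
}
The response looks like this: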
{
"took":2,
"timed_out":false,
"_shards":{
"total":1,
"successful":1,
"skipped":0,
"failed":0
},
"hits":{
"total":{
"value":3,
"relation":"eq"
},
"max_score":1.0393288,
"hits":[
{
"_shard":"[article][0]",
"_node":"SoA0mqjkTzSn8EoI7Vb7OQ",
"_index":"article",
"_type":"_doc",
"_id":"0352c029-92c1-4bdd-a6e9-672e1ebf6f3e",
"_score":1.0393288,
"_source":{
"_class":"com.example.pojo.Article",
"id":"0352c029-92c1-4bdd-a6e9-672e1ebf6f3e",
"title":"Java从入门到精通",
"content":"老铁们,今天就不拍什么搞笑段子了,给大家表演一个铁锅炖大鹅,奥利给",
"read":500,
"types":"Java",
"author":"鸡哥"
},
"_explanation":{
"value":1.0393288,
"description":"sum of:",
"details":[
{
"value":0.91478837,
"description":"weight(content:老铁 in 0) [PerFieldSimilarity], result of:",
"details":[
{
"value":0.91478837,
"description":"score(freq=1.0), product of:",
"details":[
{
"value":2.2,
"description":"boost",
"details":[
]
},
{
"value":0.98082924,
"description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details":[
{
"value":1,
"description":"n, number of documents containing term",
"details":[
]
},
{
"value":3,
"description":"N, total number of documents with field",
"details":[
]
}
]
},
{
"value":0.42394015,
"description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details":[
{
"value":1,
"description":"freq, occurrences of term within document",
"details":[
]
},
{
"value":1.2,
"description":"k1, term saturation parameter",
"details":[
]
},
{
"value":0.75,
"description":"b, length normalization parameter",
"details":[
]
},
{
"value":20,
"description":"dl, length of field",
"details":[
]
},
{
"value":17,
"description":"avgdl, average length of field",
"details":[
]
}
]
}
]
}
]
},
{
"value":0.1245405,
"description":"weight(content:奥利给 in 0) [PerFieldSimilarity], result of:",
"details":[
{
"value":0.1245405,
"description":"score(freq=1.0), product of:",
"details":[
{
"value":2.2,
"description":"boost",
"details":[
]
},
{
"value":0.13353139,
"description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details":[
{
"value":3,
"description":"n, number of documents containing term",
"details":[
]
},
{
"value":3,
"description":"N, total number of documents with field",
"details":[
]
}
]
},
{
"value":0.42394015,
"description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details":[
{
"value":1,
"description":"freq, occurrences of term within document",
"details":[
]
},
{
"value":1.2,
"description":"k1, term saturation parameter",
"details":[
]
},
{
"value":0.75,
"description":"b, length normalization parameter",
"details":[
]
},
{
"value":20,
"description":"dl, length of field",
"details":[
]
},
{
"value":17,
"description":"avgdl, average length of field",
"details":[
]
}
]
}
]
}
]
}
]
}
},
{
"_shard":"[article][0]",
"_node":"SoA0mqjkTzSn8EoI7Vb7OQ",
"_index":"article",
"_type":"_doc",
"_id":"077f557a-c26f-479a-9111-c44a587dbfe5",
"_score":0.16058116,
"_source":{
"_class":"com.example.pojo.Article",
"id":"077f557a-c26f-479a-9111-c44a587dbfe5",
"title":"Java从入门到精通",
"content":"Java是一门面向对象语言,奥利给",
"read":500,
"types":"PHP",
"author":"雷哥"
},
"_explanation":{
"value":0.16058116,
"description":"sum of:",
"details":[
{
"value":0.16058116,
"description":"weight(content:奥利给 in 0) [PerFieldSimilarity], result of:",
"details":[
{
"value":0.16058116,
"description":"score(freq=1.0), product of:",
"details":[
{
"value":2.2,
"description":"boost",
"details":[
]
},
{
"value":0.13353139,
"description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details":[
{
"value":3,
"description":"n, number of documents containing term",
"details":[
]
},
{
"value":3,
"description":"N, total number of documents with field",
"details":[
]
}
]
},
{
"value":0.54662377,
"description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details":[
{
"value":1,
"description":"freq, occurrences of term within document",
"details":[
]
},
{
"value":1.2,
"description":"k1, term saturation parameter",
"details":[
]
},
{
"value":0.75,
"description":"b, length normalization parameter",
"details":[
]
},
{
"value":10,
"description":"dl, length of field",
"details":[
]
},
{
"value":17,
"description":"avgdl, average length of field",
"details":[
]
}
]
}
]
}
]
}
]
}
},
{
"_shard":"[article][0]",
"_node":"SoA0mqjkTzSn8EoI7Vb7OQ",
"_index":"article",
"_type":"_doc",
"_id":"76eae24a-925f-48f8-99b7-7807b03c14e0",
"_score":0.12180669,
"_source":{
"_class":"com.example.pojo.Article",
"id":"76eae24a-925f-48f8-99b7-7807b03c14e0",
"title":"PHP是世界上最好的语言",
"content":"学习不可三天打鱼两天晒网,打铁还需自身硬,干就完了,加油奥利给",
"read":0,
"types":"Java",
"author":"吴彦祖"
},
"_explanation":{
"value":0.12180669,
"description":"sum of:",
"details":[
{
"value":0.12180669,
"description":"weight(content:奥利给 in 0) [PerFieldSimilarity], result of:",
"details":[
{
"value":0.12180669,
"description":"score(freq=1.0), product of:",
"details":[
{
"value":2.2,
"description":"boost",
"details":[
]
},
{
"value":0.13353139,
"description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details":[
{
"value":3,
"description":"n, number of documents containing term",
"details":[
]
},
{
"value":3,
"description":"N, total number of documents with field",
"details":[
]
}
]
},
{
"value":0.41463417,
"description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details":[
{
"value":1,
"description":"freq, occurrences of term within document",
"details":[
]
},
{
"value":1.2,
"description":"k1, term saturation parameter",
"details":[
]
},
{
"value":0.75,
"description":"b, length normalization parameter",
"details":[
]
},
{
"value":21,
"description":"dl, length of field",
"details":[
]
},
{
"value":17,
"description":"avgdl, average length of field",
"details":[
]
}
]
}
]
}
]
}
]
}
}
]
}
}
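The numbers in the first hit's _explanation can be checked by hand. The sketch below recomputes them from the formulas printed in the description fields; the function names are hypothetical, and the boost of 2.2 is taken as-is from the output (it appears to correspond to k1 + 1). Elasticsearch works in 32-bit floats, so the last digits may differ slightly.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n: int, N: int, freq: float, dl: float, avgdl: float,
               boost: float = 2.2) -> float:
    """Per-term score: the product of boost, idf and tf."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Values read from the first hit (dl=20, avgdl=17, N=3).
score_laotie = term_score(n=1, N=3, freq=1, dl=20, avgdl=17)   # term 老铁
score_aoligei = term_score(n=3, N=3, freq=1, dl=20, avgdl=17)  # term 奥利给
total = score_laotie + score_aoligei                           # "sum of:"

print(round(score_laotie, 4), round(score_aoligei, 4), round(total, 4))
# 0.9148 0.1245 1.0393
```

The other two hits contain only 奥利给, so their scores differ solely through dl (10 and 21): the shorter content field scores 0.16058116, the longer one 0.12180669.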
How a single document was matched
POST /article/_explain/0352c029-92c1-4bdd-a6e9-672e1ebf6f3e
{
  "query": {
    "match": {
      "content": "老铁,奥利给"
    }
  }
}
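The explain API answers with a matched flag and an explanation tree that has the same value/description/details shape as the _explanation nodes shown earlier. Such a tree is easier to read when flattened into indented lines. The helper below is a hypothetical sketch, and `sample` is a fragment copied from the output above rather than a live response.

```python
def format_explanation(node: dict, depth: int = 0) -> list:
    """Flatten a nested explanation node into indented 'value  description' lines."""
    lines = ["  " * depth + f"{node['value']}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Fragment copied from the idf node of the search response above.
sample = {
    "value": 0.13353139,
    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
    "details": [
        {"value": 3, "description": "n, number of documents containing term", "details": []},
        {"value": 3, "description": "N, total number of documents with field", "details": []},
    ],
}

formatted = format_explanation(sample)
print("\n".join(formatted))
```

Passing the explanation object of a real response to `format_explanation` would give a compact, indented view of how each term contributed to the score.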