LLM && LVLM evaluation
TinyEval -- code walkthrough of LLM evaluation principles
https://github.com/datawhalechina/tiny-universe/tree/main/content/TinyEval
https://huzixia.github.io/2024/05/29/eval/
https://meeting.tencent.com/user-center/shared-record-info?id=8b9cf6ca-add6-477b-affe-5b62e2d8f27e&from=3
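First, choose an appropriate evaluation metric based on the task type of the target dataset.
Design the model's guiding prompt based on the format of the target data.
Choose a suitable answer-extraction method based on the model's initial predictions.
Compute the score between each pred and its answer.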
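The four steps above can be sketched for a multiple-choice task roughly as follows. This is a minimal illustration and not TinyEval's actual code: the prompt wording, the names `build_prompt`, `extract_choice`, and `accuracy`, and the sample dictionary keys are all assumptions made for this example, and accuracy is assumed to be the metric.

```python
import re
from typing import List, Optional

# Step 1 (metric): accuracy is assumed here, which suits multiple-choice tasks.

# Step 2 (prompt): a hypothetical template; real benchmarks tune this wording.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question with a single letter.\n"
    "Question: {question}\n"
    "A. {A}\nB. {B}\nC. {C}\nD. {D}\n"
    "Answer:"
)


def build_prompt(sample: dict) -> str:
    """Fill the template from a sample with keys question, A, B, C, D."""
    return PROMPT_TEMPLATE.format(**sample)


def extract_choice(raw_output: str) -> Optional[str]:
    """Step 3 (extraction): take the first standalone A-D letter.

    Letters embedded in ASCII words (the "A" in "Answer") are skipped.
    This is a naive rule; a standalone article "A" would still match.
    """
    match = re.search(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])", raw_output)
    return match.group(1) if match else None


def accuracy(preds: List[Optional[str]], answers: List[str]) -> float:
    """Step 4 (scoring): fraction of predictions equal to the reference."""
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(preds, answers))
    return correct / len(answers)
```

In practice the extraction rule usually needs the most iteration, since it depends on how a given model phrases its answers.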
opencompass -- LLM evaluation tool
https://opencompass.org.cn/home
Large Model Evaluation System (Shanghai AI Laboratory): an open-source, efficient, and comprehensive large model evaluation system and open platform.
C-Eval -- Chinese evaluation dataset
https://opendatalab.com/OpenDataLab/C-Eval/tree/main
https://hub.opencompass.org.cn/dataset-detail/C-Eval
New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that requires advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 could achieve an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.
Meta Data
Each record in the data set has the following fields:
Question: The body of the question
A, B, C, D: The options which the model should choose from
Answer: (Only in dev and val set) The correct answer to the question
Explanation: (Only in dev set) The reason for choosing the answer.
Example
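Question: For the UDP protocol, if reliable transmission is required, at which layer should it be implemented? ____
A. Data link layer
B. Network layer
C. Transport layer
D. Application layer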
Answer: D
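As a rough sketch of how a record with these fields might be turned into a few-shot prompt, the snippet below models one record and renders it in the layout of the example above. The class name `CEvalRecord`, its attribute names, and the helpers `format_record` and `few_shot_prompt` are hypothetical; they mirror the field list here rather than any official loader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CEvalRecord:
    """One multiple-choice item; attribute names are assumptions."""
    question: str
    A: str
    B: str
    C: str
    D: str
    answer: Optional[str] = None       # present only in the dev and val sets
    explanation: Optional[str] = None  # present only in the dev set


def format_record(rec: CEvalRecord, include_answer: bool = False) -> str:
    """Render a record as text, optionally revealing the answer."""
    lines = [f"Question: {rec.question}"]
    for letter in "ABCD":
        lines.append(f"{letter}. {getattr(rec, letter)}")
    answer = rec.answer if include_answer and rec.answer else ""
    # rstrip drops the trailing space when the answer is left blank
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)


def few_shot_prompt(dev_examples: List[CEvalRecord], target: CEvalRecord) -> str:
    """Prepend answered dev examples, then the unanswered target question."""
    shots = [format_record(r, include_answer=True) for r in dev_examples]
    return "\n\n".join(shots + [format_record(target)])
```

Since only the dev set carries answers and explanations, it is the natural source of few-shot examples, while val questions are rendered with the answer left blank for the model to complete.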
lmdeploy -- LLM deployment tool, similar to vllm
https://lmdeploy.readthedocs.io/zh-cn/latest/benchmark/evaluate_with_opencompass.html
https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/evaluation_turbomind.html
issue with openai
https://github.com/open-compass/opencompass/discussions/1100
https://github.com/open-compass/opencompass/issues/673
dataset
https://github.com/open-compass/opencompass/releases/tag/0.2.2.rc1
https://zhuanlan.zhihu.com/p/669291064
LVLM -- large vision-language models
https://mmbench.opencompass.org.cn/home
https://github.com/open-compass/VLMEvalKit
https://github.com/open-compass/MMBench/tree/main/samples
VLMEvalKit -- evaluation toolkit
https://github.com/open-compass/VLMEvalKit/blob/main/docs/zh-CN/Quickstart_zh-CN.md
langchain eval -- application-level evaluation
https://developer.aliyun.com/article/1518981
https://python.langchain.com.cn/docs/guides/evaluation/agent_vectordb_sota_pg