HumanEval (2021)

HumanEval: Hand-Written Evaluation Set

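HumanEval is an evaluation dataset for code-generation LLMs, built by OpenAI in 2021. It consists of 164 hand-written Python programming problems.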

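Data format
All records live in a single JSON Lines file (one JSON object per line; the openai/human-eval repository ships it as HumanEval.jsonl.gz). Each record contains a task ID, a prompt, the name of the entry-point function, a canonical (reference) solution, and test code with several assertions.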
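As a rough illustration of the format described above, the following sketch reads such a file into a dictionary keyed by task_id. The function name load_humaneval and the file path in the usage note are assumptions made for this example; the field names are the ones visible in the record shown in this post.

```python
import gzip
import json
from typing import Dict


def load_humaneval(path: str) -> Dict[str, dict]:
    """Read a JSON Lines file (optionally gzip-compressed) into {task_id: record}.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # Pick the opener from the file extension: .gz files need gzip.open.
    opener = gzip.open if path.endswith(".gz") else open
    problems: Dict[str, dict] = {}
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            problems[record["task_id"]] = record
    return problems
```

Assuming the file has been downloaded to the working directory, load_humaneval("HumanEval.jsonl.gz")["HumanEval/0"]["entry_point"] would give the entry-point name of the first problem.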

Below is the first record, with the string fields shown unescaped for readability:

{
"task_id": "HumanEval/0",
"prompt":
"from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
",

"entry_point": "has_close_elements",

"canonical_solution":
" for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False
",

"test":
"METADATA = {
'author': 'jt',
'dataset': 'test'
}

def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

"
}

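Prompt style
The prompt style is: the function signature line followed by the function's docstring, where the docstring states what the function does and gives several input/output examples.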
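To make the structure concrete, here is a small sketch that takes the HumanEval/0 prompt apart with Python's standard ast and doctest modules. The helper name split_prompt is made up for this example, and the blank lines inside PROMPT are an assumption about the original whitespace.

```python
import ast
import doctest

# Prompt text of HumanEval/0, as shown in the record above.
PROMPT = '''from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
'''


def split_prompt(prompt: str):
    """Return (function name, [(call, expected output), ...]) for a prompt.

    Hypothetical helper for illustration only.
    """
    # The prompt ends right after the docstring; a docstring alone is a
    # valid function body, so the prompt parses without modification.
    tree = ast.parse(prompt)
    func = next(node for node in tree.body if isinstance(node, ast.FunctionDef))
    docstring = ast.get_docstring(func) or ""
    # DocTestParser extracts the ">>>" examples embedded in the docstring.
    examples = doctest.DocTestParser().get_examples(docstring)
    return func.name, [(ex.source.strip(), ex.want.strip()) for ex in examples]
```

Calling split_prompt(PROMPT) yields the function name has_close_elements together with the two call/expected-output pairs from the docstring, which is exactly the information the model has to work from.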

Next, format the prompt above in ollama's prompt format.
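A minimal sketch of two ways to hand a multi-line prompt to ollama, under some assumptions: the interactive ollama session accepts a multi-line message wrapped in triple double quotes, and a local ollama server listens on its default port and exposes the /api/generate endpoint. The helper names and the model tag used in the usage note are illustrative, not taken from the original setup.

```python
import json
import urllib.request

# Default address of a locally running ollama server (assumption about the setup).
OLLAMA_URL = "http://localhost:11434/api/generate"


def as_multiline_message(prompt: str) -> str:
    """Wrap a prompt in triple double quotes for pasting into the ollama REPL."""
    return f'"""{prompt}"""'


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With a server running and a model pulled, generate("codestral:22b", prompt_text) would return the completion as a string; the model tag here is only an example and should match whatever tag is installed locally.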

Prompt
"""from typing import List

Testing a newly released LLM: codestral 22b, quantized to Q4_1 (oddly, a 4060 Ti with 16 GB cannot fully hold the Q5_0 quantization).

Output
This problem can be solved by sorting the list of numbers and then checking if any two adjacent numbers are less
than or equal to the threshold apart from each other. If such a pair is found, we return True; otherwise, we
return False after checking all pairs. Here's how you could do this in Python:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    numbers.sort()  # sort the list of numbers
    for i in range(1, len(numbers)):
        if abs(numbers[i] - numbers[i - 1]) <= threshold:
            return True  # two close elements are found
    return False  # no two close elements are found

You can test this function with the provided examples:

print(has_close_elements([1.0, 2.0, 3.0], 0.5))  # prints: False
print(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))  # prints: True

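The code in this answer passes all the test cases, which suggests that codestral handles this prompt style well and reasons correctly. Two details differ from the canonical solution: it compares with <= where the canonical solution uses a strict <, and it sorts the caller's list in place. Neither difference is exercised by these test cases.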
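One way to check such a result is to concatenate the generated code, the record's test code, and a call to check on the entry point, then execute the whole program. The sketch below does this for the codestral answer (comments stripped); the helper name passes is made up, and a real harness would run untrusted model output in a sandboxed subprocess rather than with a bare exec.

```python
# Function definition taken from the codestral output above.
COMPLETION = '''from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    numbers.sort()
    for i in range(1, len(numbers)):
        if abs(numbers[i] - numbers[i - 1]) <= threshold:
            return True
    return False
'''

# The check function from the "test" field of HumanEval/0.
TEST = '''def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False
'''


def passes(completion: str, test: str, entry_point: str) -> bool:
    """Return True if the completion satisfies every assertion in the test code.

    Hypothetical helper; exec is unsafe for untrusted code outside a sandbox.
    """
    program = f"{completion}\n{test}\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # Any failed assertion or runtime error counts as a failure.
        return False
    return True
```

Here passes(COMPLETION, TEST, "has_close_elements") returns True, matching the result reported above.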

Testing an older, established LLM: codellama:34b-python-q4_K_M

Output
# turn into list sorted in ascending order
numbers = sorted(numbers)

for i, element in enumerate(numbers):
    if (i >= len(numbers)):
        return False

    diff = abs(element - numbers[i+1])
    print(f"diff: {diff}")

    if (diff < threshold):
        return True

return False

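This code was also judged to pass all the test cases, suggesting that codellama 34B handles this prompt style and reasons correctly too, which is a decent result. As printed, though, the guard i >= len(numbers) can never be true inside the enumerate loop, so numbers[i+1] raises an IndexError on the last iteration whenever no close pair is found (the second test case, for example). The result is worth re-checking.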
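Because this completion is only a function body, checking it means splicing it under the prompt's signature first. The sketch below does that by hand, assuming the body is used exactly as printed above with one level of indentation added; if the original run post-processed the output differently, the outcome may differ. The helper run_case is made up for this example.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # Body copied from the codellama output above, indented one level.
    # turn into list sorted in ascending order
    numbers = sorted(numbers)

    for i, element in enumerate(numbers):
        if (i >= len(numbers)):
            return False

        diff = abs(element - numbers[i+1])
        print(f"diff: {diff}")

        if (diff < threshold):
            return True

    return False


def run_case(numbers: List[float], threshold: float):
    """Return the function's result, or the name of the exception it raised."""
    try:
        return has_close_elements(numbers, threshold)
    except Exception as exc:
        return type(exc).__name__
```

With the first test input, run_case([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) returns True. With the second, run_case([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) returns "IndexError", because the loop reaches the last element without finding a close pair and then reads past the end of the list.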

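Summary
This dataset can be used to test LLMs.
Its prompt style can serve as a template when writing prompts for local LLMs.
The dataset is somewhat dated.
The expected solutions are roughly 10 lines of code each.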

References
https://blog.csdn.net/qq_27590277/article/details/135163862
https://github.com/openai/human-eval

posted @ 2024-05-30 21:38 lusonixs
