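Generating Synthetic Data with LLMs

Writing prompts that have the LLM generate structured data directly
Prompting the LLM to generate code that produces structured data
Prompting the LLM to synthesize text data
Handling imbalanced or non-diverse text data

Synthetic data is data created by artificial generation rather than collected directly from the real world. It is typically used to replace or supplement real data, especially when real data is hard to obtain, expensive, subject to privacy constraints, or insufficient in quantity. Synthetic data is widely used in artificial intelligence, machine learning, computer vision, data privacy protection, and other fields.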

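Writing prompts that have the LLM generate structured data directly

Example 1. Have GPT-4o-mini generate a table of housing data in CSV format:

Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m²)
 - house price
 - location
 - number of bedrooms

Make sure the numbers are plausible (for example, more rooms usually means a larger area, a more expensive location raises the price, and a larger area usually means a higher price). Respond with the CSV only.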
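
A reply to a prompt like this still has to be parsed and sanity-checked before use. The following is a minimal sketch of that step, not part of the original walkthrough. The column names and the `sample_reply` text are assumptions made up for illustration; a real reply may use different headers.

```python
import csv
import io

# Hypothetical model reply, used only to exercise the parser.
sample_reply = """id,house_size_m2,house_price,location,bedrooms
1,50,150000,Suburb,1
2,85,260000,Suburb,2
3,120,540000,City Center,3
"""

def parse_housing_csv(text):
    """Parse the CSV reply into a list of dicts with numeric fields converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text.strip())):
        rows.append({
            "id": int(raw["id"]),
            "house_size_m2": float(raw["house_size_m2"]),
            "house_price": float(raw["house_price"]),
            "location": raw["location"],
            "bedrooms": int(raw["bedrooms"]),
        })
    return rows

def check_housing_rows(rows):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for expected_id, row in enumerate(rows, start=1):
        if row["id"] != expected_id:
            problems.append(f"row {expected_id}: id is {row['id']}, expected {expected_id}")
        if row["house_size_m2"] <= 0 or row["house_price"] <= 0:
            problems.append(f"row {expected_id}: non-positive size or price")
        if row["bedrooms"] < 0:
            problems.append(f"row {expected_id}: negative bedroom count")
    return problems

rows = parse_housing_csv(sample_reply)
problems = check_housing_rows(rows)
print(len(rows), problems)
```

These checks only cover basic validity (ids, signs); whether prices and sizes are mutually plausible still needs a domain-specific rule or a manual look.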

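Prompting the LLM to generate code that produces structured data

Because of cost and context-window limits, having the LLM write code that generates the data is often a more efficient and controllable approach.
Example 1. Have GPT-4o-mini write a Python program that generates housing data:

Create a Python program that generates 100 rows of housing data. The final output should be a pandas DataFrame with 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m²)
 - house price
 - location
 - number of bedrooms

Make sure the generated numbers are plausible (for example, more rooms usually means a larger area, a more expensive location raises the price, and a larger area usually means a higher price).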
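
For orientation, here is a rough sketch of the kind of program the LLM might return for this prompt. It is an illustration under stated assumptions, not actual model output: the location names, the `PRICE_PER_M2` values, and the size and price formulas are all invented. It uses only the standard library and builds a list of dicts.

```python
import random

# Assumed base price per square metre for each location (illustrative values only).
PRICE_PER_M2 = {"City Center": 5000, "Suburb": 3000, "Countryside": 1500}

def generate_housing_rows(n_rows=100, seed=42):
    """Generate housing rows whose size and price follow simple plausibility rules."""
    rng = random.Random(seed)
    rows = []
    for row_id in range(1, n_rows + 1):
        bedrooms = rng.randint(1, 5)
        # More bedrooms means a larger area, plus some noise.
        size_m2 = round(30 + 25 * bedrooms + rng.uniform(-10, 10), 1)
        location = rng.choice(list(PRICE_PER_M2))
        # Price scales with area and location, with about 10% noise either way.
        price = round(size_m2 * PRICE_PER_M2[location] * rng.uniform(0.9, 1.1))
        rows.append({
            "id": row_id,
            "house_size_m2": size_m2,
            "house_price": price,
            "location": location,
            "bedrooms": bedrooms,
        })
    return rows

rows = generate_housing_rows()
print(rows[0])
```

Passing `rows` to `pandas.DataFrame(rows)` would give the DataFrame the prompt asks for. Fixing the seed keeps the generated data reproducible between runs.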

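Example 2. Real datasets are often more complex. Housing data, for instance, may span several related tables (housing, locations, house types), so the prompt must describe the relationships between the tables, ask for row counts that are consistent with one another, and require correct primary and foreign keys:

Create a Python program that generates three different pandas DataFrames.

1. **Housing data**
   - I need 100 rows. Each row should include the following fields:
     - id (incrementing integer starting at 1)
     - house size (m²)
     - house price
     - location
     - number of bedrooms
     - house type
     - any relevant foreign keys

2. **Location**
   - Each row should include the following fields:
     - id (incrementing integer starting at 1)
     - country
     - city
     - population
     - area (m²)
     - any relevant foreign keys

3. **House types**
   - id (incrementing integer starting at 1)
   - house type
   - average house type price
   - number of houses
   - any relevant foreign keys

Make sure the generated data is logical (for example, more rooms usually means a larger area, more expensive locations usually have higher house prices, and a larger area usually means a higher price).
Make sure the relationships between the DataFrames pass common-sense checks, for example that the DataFrame sizes are plausible relative to one another.
Make sure the foreign keys match, and that each DataFrame can be created using the previously generated DataFrames.
You can use a previously generated DataFrame to generate the next one.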
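
Since the prompt asks for matching foreign keys, it is worth verifying them on whatever data the generated program produces. The sketch below shows one way to do that on plain lists of dicts. The field names `location_id` and `house_type_id` and the sample rows are assumptions for illustration; the generated program may name its keys differently.

```python
def find_orphan_keys(child_rows, fk_field, parent_rows, pk_field="id"):
    """Return the sorted foreign-key values in child_rows that have no match in parent_rows."""
    parent_keys = {row[pk_field] for row in parent_rows}
    return sorted({row[fk_field] for row in child_rows} - parent_keys)

# Tiny hypothetical tables standing in for the three generated DataFrames.
locations = [{"id": 1, "city": "Berlin"}, {"id": 2, "city": "Paris"}]
house_types = [{"id": 1, "house_type": "Apartment"}, {"id": 2, "house_type": "Villa"}]
housing = [
    {"id": 1, "location_id": 1, "house_type_id": 1},
    {"id": 2, "location_id": 2, "house_type_id": 2},
    {"id": 3, "location_id": 3, "house_type_id": 1},  # location 3 does not exist
]

orphan_locations = find_orphan_keys(housing, "location_id", locations)
orphan_types = find_orphan_keys(housing, "house_type_id", house_types)
print(orphan_locations)  # [3]
print(orphan_types)      # []
```

An empty result for every foreign key means each housing row points at an existing location and house type.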

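Prompting the LLM to synthesize text data

Synthetic text data is commonly used to train or fine-tune language models.
Example 1. A retailer needs to train a language model to generate product descriptions. Define the inputs, the outputs, and a fixed data format:

I am creating input-output training pairs to fine-tune my GPT model. The use case is a retailer generating product descriptions from a product catalog. I want the input to be the product name and category (the category the product belongs to), and the output to be a description.

The format should be as follows:
1.
Input: product_name, category
Output: description
2.
Input: product_name, category
Output: description

Do not add any extra characters around this format, as that will break the output parsing.
Create as many training pairs as possible.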
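
The fixed format is what makes the reply machine-readable. As a minimal sketch of the parsing step, assuming the reply follows the format exactly and that the category itself contains no comma, a line-based parser could look like this (the `sample_reply` text is invented):

```python
# Hypothetical model reply in the requested format.
sample_reply = """1.
Input: Trail Runner X, shoes
Output: Lightweight running shoes with a cushioned sole.
2.
Input: QuietBeat, headphones
Output: Wireless headphones with noise cancellation.
"""

def parse_pairs(text):
    """Collect one dict per Input/Output pair, in the order they appear."""
    pairs = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Input:"):
            # Split at the last comma so product names may contain commas.
            name, _, category = line[len("Input:"):].rpartition(",")
            current = {"product": name.strip(), "category": category.strip()}
        elif line.startswith("Output:") and current:
            current["description"] = line[len("Output:"):].strip()
            pairs.append(current)
            current = {}
    return pairs

pairs = parse_pairs(sample_reply)
print(pairs[0])
```

Lines that match neither prefix, such as the `1.` counters, are simply skipped, so minor deviations in numbering do not break the parse.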

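Handling imbalanced or non-diverse text data

High-quality synthetic text data should satisfy several conditions:

Accuracy: whether the data is factually correct;
Consistency: whether the same input yields (essentially) the same output;
Diversity: whether the data points cover as much of the real-world distribution as possible;
Balance: whether each category has a comparable number of data points.

Clustering can help reveal imbalance and lack of diversity in the data:

Large differences in the number of data points per cluster: imbalanced
Some expected clusters contain no data: non-diverse

By running the generate-then-cluster analysis recursively, we can automatically produce high-quality synthetic text data.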
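
The two cluster-based symptoms above can be turned into a simple check. The following is a sketch only: the function name, the `max_ratio` threshold of 2.0, and the idea of passing in a list of expected clusters are assumptions for illustration, not something the original walkthrough prescribes.

```python
from collections import Counter

def assess_clusters(labels, expected_clusters, max_ratio=2.0):
    """Summarize cluster sizes, list empty clusters, and flag imbalance."""
    counts = Counter(labels)
    empty = [c for c in expected_clusters if counts[c] == 0]
    non_empty = [counts[c] for c in expected_clusters if counts[c] > 0]
    # Ratio between the largest and smallest non-empty cluster.
    ratio = max(non_empty) / min(non_empty) if non_empty else 0.0
    return {
        "counts": {c: counts[c] for c in expected_clusters},
        "empty": empty,               # non-diverse: expected clusters with no data
        "ratio": ratio,
        "imbalanced": ratio > max_ratio,
    }

# Hypothetical labels: five populated clusters, while a sixth expected one is empty.
labels = [0] * 5 + [1] * 7 + [2] * 8 + [3] * 6 + [4] * 2
report = assess_clusters(labels, expected_clusters=range(6))
print(report)
```

A report like this can serve as the stopping condition for the generate-then-cluster loop: keep generating until no cluster is empty and the ratio falls below the chosen threshold.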
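Example 1. A retailer needs to train a language model to generate product descriptions. Define the product topics, the inputs and outputs, and a fixed data format:

I am creating input-output training pairs to fine-tune my GPT model. I want the input to be the product name and category, and the output to be a description. The categories should be things like: mobile phones, shoes, headphones, laptops, electric toothbrushes, etc. More importantly, these categories should fall under four main topics: vehicle, clothing, toiletries, food.

After the number of each example, also state the topic area. The format should be as follows:
1. topic_area
   Input: product_name, category
   Output: description

Do not add any extra characters around this format, as that will break the output parsing.

Here are some helpful examples so you get the output style right.

1) clothing
   Input: "Shoe Name, Shoes"
   Output: "Experience unparalleled comfort. These shoes blend modern style with traditional superior cushioning, perfect for those who are always on the move."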
 
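Sample output:
1. Vehicle
Input: "Tesla Model 3, Electric Car"
Output: "The Tesla Model 3 is a revolutionary electric car with impressive range and cutting-edge technology, designed to provide an exhilarating driving experience while minimizing environmental impact."

2. Clothing
Input: "Nike Air Max, Shoes"
Output: "Elevate your sneaker style with the Nike Air Max. Combining iconic style with superior comfort and support, these shoes suit both workouts and casual outings."

3. Toiletries
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "Achieve a superior clean with the Oral-B Pro 1000. This electronic toothbrush features 3D cleaning action that pulsates and vibrates to remove more plaque than a regular manual toothbrush."

4. Food
Input: "Chobani Greek Yogurt, Yogurt"
Output: "Enjoy a nutritious snack with Chobani Greek Yogurt. Packed with protein and delicious flavors, it is an ideal choice for a healthy breakfast or a treat at any time."

5. Vehicle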

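Parse the output with a Python regular expression and extract the fields, including the product name column:

import re

import pandas as pd

# output_string holds the raw text returned by the model.
# Groups: number, topic, product, category, description.
pattern = re.compile(r'(\d+)\.\s*(\w+)\s*Input:\s*"(.+?),\s*(.+?)"\s*Output:\s*"(.*?)"', re.DOTALL)
matches = pattern.findall(output_string)

topics = []
products = []
categories = []
descriptions = []

for match in matches:
    number, topic, product, category, description = match
    topics.append(topic)
    products.append(product)
    categories.append(category)
    descriptions.append(description)

# Column names match the ones used by the clustering code that follows.
df = pd.DataFrame({"Product": products, "Category": categories, "Description": descriptions})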

First we convert the text into embedding vectors. The number of clusters can then be estimated with k-means and the elbow method.

import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before requesting the embedding
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model)

    return response.data[0].embedding

embedding_model = "text-embedding-3-small"
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))

# Stack the per-row embeddings into one (n_samples, n_dims) matrix
if len(df.embedding.values) > 0:
    matrix = np.vstack(df.embedding.values)
else:
    matrix = np.array([])
inertias = []
range_of_clusters = range(1, 13)  # range of cluster counts to try

for n_clusters in range_of_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
    kmeans.fit(matrix)
    inertias.append(kmeans.inertia_)
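
The elbow is usually read off a plot of inertia against the number of clusters. As a rough programmatic alternative, the sketch below picks the k where the curve bends most sharply, using the second difference of the inertia values. This heuristic is an assumption added for illustration, not the method used in this walkthrough, and the example inertia values are made up.

```python
def pick_elbow(inertias, first_k=1):
    """Return the k at which the inertia curve bends most sharply."""
    if len(inertias) < 3:
        raise ValueError("need at least three inertia values")
    # Second difference: how much the drop in inertia shrinks around each interior k.
    bends = [
        (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        for i in range(1, len(inertias) - 1)
    ]
    best = max(range(len(bends)), key=bends.__getitem__)
    # bends[j] belongs to the k at list position j + 1.
    return first_k + best + 1

# Made-up inertia values for k = 1..6.
example_inertias = [100, 60, 30, 25, 22, 20]
elbow_k = pick_elbow(example_inertias)
print(elbow_k)  # 3
```

On real data the curve is often smooth enough that several neighbouring values of k are defensible, so a result like this is a starting point rather than a final answer.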

The estimate suggests that 3, 4, or 5 clusters would all be reasonable; here we use 5:

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

Next, analyze the clustered data. First check how balanced it is:

cluster_counts = df["Cluster"].value_counts().sort_index()
print(cluster_counts)

The result:

Cluster
0    5
1    7
2    8
3    6
4    2
Name: count, dtype: int64

We can then ask the LLM which topic each of the new clusters corresponds to.

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input-output training pairs and then I clustered them based on category. From each cluster I picked 3 example data points, which you can find below.
    I want you to identify the broad topic areas these clusters belong to.
    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    Do not add any extra characters around that formatting as it will make the output parsing break.
    """

datagen_model = "gpt-4o-mini"  # assumed: the model named in the earlier examples

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content

import json
import re

# Extract (cluster number, topic) pairs from the reply
pattern = r"Cluster: (\d+), topic: ([^\n]+)"
matches = re.findall(pattern, res)
clusters = [{"cluster": int(cluster), "topic": topic} for cluster, topic in matches]
json_output = json.dumps(clusters, indent=2)
print(json_output)

We can then ask the LLM specifically for more data in the clusters that have few data points, which reduces the imbalance.
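
To decide how much extra data to request per cluster, one simple option is to top every cluster up to the size of the largest one. The sketch below does that for the cluster counts printed earlier; the function name and the "match the largest cluster" target are assumptions for illustration.

```python
def samples_needed(cluster_counts, target=None):
    """Return how many extra samples each cluster needs to reach the target size."""
    if target is None:
        # Default target: the size of the largest cluster.
        target = max(cluster_counts.values())
    return {cluster: max(target - n, 0) for cluster, n in cluster_counts.items()}

# Counts taken from the value_counts() output shown earlier.
cluster_counts = {0: 5, 1: 7, 2: 8, 3: 6, 4: 2}
needed = samples_needed(cluster_counts)
print(needed)  # {0: 3, 1: 1, 2: 0, 3: 2, 4: 6}
```

The resulting numbers can be written into the generation prompt, for example by asking for six more examples in the topic that cluster 4 represents.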
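In addition, to increase diversity, we can randomly sample a few data points from each cluster and have the LLM propose additional topic areas (the code reuses the cluster-naming step above):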

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input-output training pairs and then I clustered them based on category. From each cluster I picked 3 example data points, which you can find below.
    I want to promote diversity in my examples across categories so follow the procedure below:
    1. You must identify the broad topic areas these clusters belong to.
    2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.


    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:

    1. Cluster topic mapping
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    2. New topics
    1. topic
    2. topic
    3. topic
    4. topic

    Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content
print(res)
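
The reply to this prompt has two parts, so it needs slightly more parsing than the earlier cluster-naming reply. The following is a minimal sketch, assuming the model sticks to the requested format; the function name and the `sample` text are invented for illustration.

```python
import re

def parse_topic_response(text):
    """Split the reply into a cluster-to-topic mapping and a list of new topics."""
    mapping = {
        int(cluster): topic.strip()
        for cluster, topic in re.findall(r"Cluster: (\d+), topic: ([^\n]+)", text)
    }
    # Everything after the "New topics" header is treated as the numbered list of new topics.
    _, _, tail = text.partition("New topics")
    new_topics = [t.strip() for t in re.findall(r"^\s*\d+\.\s*(.+)$", tail, re.MULTILINE)]
    return mapping, new_topics

# Hypothetical reply in the requested format.
sample = """1. Cluster topic mapping
Cluster: 0, topic: Automotive
Cluster: 1, topic: Personal Care

2. New topics
1. Home Appliances
2. Outdoor Equipment
"""

mapping, new_topics = parse_topic_response(sample)
print(mapping)     # {0: 'Automotive', 1: 'Personal Care'}
print(new_topics)  # ['Home Appliances', 'Outdoor Equipment']
```

If the header is missing, `new_topics` comes back empty, which is a convenient signal to retry the request.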

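Below are the resulting topics, the existing cluster topics followed by the newly proposed ones:

1. Automotive
2. Personal care
3. Footwear
4. Food
5. Electric vehicles

6. Household appliances
7. Outdoor equipment
8. Smart home technology
9. Fitness equipment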

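Finally, have the LLM use these new topics to generate more diverse data:

I am creating input-output training pairs to fine-tune my GPT model. I want the input to be the product name and category, and the output to be a description. The categories should be things like: mobile phones, shoes, headphones, laptops, electric toothbrushes, etc. More importantly, these categories should fall under some main topics: automotive, personal care, footwear, food, electric vehicles, household appliances, outdoor equipment, smart home technology, fitness equipment.

After the number of each example, also state the topic area. The format should be as follows:
1. topic_area
   Input: product_name, category
   Output: description

Do not add any extra characters around this format, as that will break the output parsing.

Here are some helpful examples so you get the output style right.

1) clothing
   Input: "Shoe Name, Shoes"
   Output: "Experience unparalleled comfort. These shoes blend modern style with traditional quality cushioning, ideal for people who are frequently on the move."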
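
Because the topic list changes on every round of generation and clustering, it helps to build this prompt from a list rather than by hand. The following is a small sketch of that idea; the function name is hypothetical and the prompt wording is a shortened paraphrase of the prompt above.

```python
def build_generation_prompt(topics):
    """Build the data-generation prompt for a given list of topic areas."""
    topic_list = ", ".join(topics)
    return (
        "I am creating input-output training pairs to fine-tune my GPT model. "
        "I want the input to be the product name and category, and the output to be a description. "
        f"The categories should fall under these main topics: {topic_list}.\n\n"
        "After the number of each example, also state the topic area. "
        "The format should be as follows:\n"
        "1. topic_area\n"
        "   Input: product_name, category\n"
        "   Output: description\n\n"
        "Do not add any extra characters around this format, as that will break the output parsing."
    )

existing_topics = ["Automotive", "Personal care", "Footwear", "Food", "Electric vehicles"]
new_topics = ["Household appliances", "Outdoor equipment", "Smart home technology", "Fitness equipment"]
prompt = build_generation_prompt(existing_topics + new_topics)
print(prompt)
```

Each new round can then call the function with the topics recovered from the previous round's clustering.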

Based on the OpenAI Cookbook, where the original English article can be found.

posted @ 2024-08-19 01:45 LexLuc


Lex个人随想乡

Attention before pay attention
