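How to Automatically Build a Knowledge Graph with a Large Language Model (LLM): An OpenAI-Based Implementation (with Python Code)
Exploring how to turn an LLM into a more powerful information extraction tool
LLMs can not only process complex, unstructured raw text, but also turn that text into structured facts that are easy to query. After reviewing a few key concepts, we will focus on how to use OpenAI's GPT-3.5 Turbo to build a knowledge graph from raw text data (e-commerce product titles).
After all, most companies hold large amounts of unstructured data that they do not use effectively. Building a knowledge graph helps extract as much valuable information as possible from that data and use it to make better-informed decisions. Beyond this direct application, the same idea can also be used to build structured datasets for fine-tuning large models.
Quick primer: what is a knowledge graph?
A knowledge graph is a semantic network. It represents and interconnects real-world entities such as people, organizations, objects, events, and concepts, and it also describes the attributes of those entities and their values. Entities are linked through triples of the following two forms, which are the building blocks of a knowledge graph: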
head → relation → tail
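entity → relation → entity (for example, "Einstein → born in → Germany")
entity → attribute → attribute value (for example, "Eiffel Tower → height → 300 m")
This network representation lets us extract and analyze the complex relationships between entities, as well as the detailed descriptions of the entities themselves.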
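As a minimal sketch of the two triple forms, each fact can be held as a plain (head, relation, tail) record and looked up by its head. The `Triple` type and the `facts_about` helper are illustrative names chosen for this example, not part of any library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the graph: head -> relation -> tail."""
    head: str
    relation: str
    tail: str


# The two example facts from the text: an entity-relation-entity triple
# and an entity-attribute-value triple.
triples = [
    Triple("Einstein", "born in", "Germany"),
    Triple("Eiffel Tower", "height", "300 m"),
]


def facts_about(entity, graph):
    """Return every (relation, tail) pair whose head is `entity`."""
    return [(t.relation, t.tail) for t in graph if t.head == entity]


print(facts_about("Eiffel Tower", triples))  # [('height', '300 m')]
```

Both forms share the same shape, which is why a single list of triples is enough to represent the whole graph.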
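Task description
In this article we use OpenAI's gpt-3.5-turbo model to build a knowledge graph from the title data in an Amazon products dataset on Kaggle.
The dataset contains several fields for each product, but here we only use the title field. If you have another dataset with richer information, you can use several fields at once; in that case, simply concatenate them into a new field named "text" (for example title, description, and specifications joined together). We then use ChatGPT to extract the entities and relations needed to build the knowledge graph.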
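If you do combine several fields, the concatenation could look roughly like the sketch below. The two-row DataFrame and the column names "description" and "specs" are assumptions made up for this example; adapt them to whatever your dataset actually contains.

```python
import pandas as pd

# Hypothetical two-row dataset; the column names "title", "description"
# and "specs" are assumptions made for this example.
df = pd.DataFrame({
    "title": ["Foam Wall Sticker", "Steel Water Bottle"],
    "description": ["Waterproof, self adhesive", None],
    "specs": ["70*70 cm", "750 ml"],
})

fields = ["title", "description", "specs"]

# Replace missing values with empty strings, then join each row's
# non-empty fields with a newline into a single "text" column.
df["text"] = df[fields].fillna("").apply(
    lambda row: "\n".join(value for value in row if value), axis=1
)

print(df["text"].tolist())
```

Skipping empty fields keeps stray separators out of the text that is later sent to the model.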
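Implementation
1. Installing the Python dependencies
First, run the following command in a terminal to install the required Python packages (installing them from your IDE works too).
pip install pandas openai sentence-transformers networkx matplotlib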
- pandas: a powerful Python data analysis library for cleaning, transforming, analyzing, and visualizing data; it is particularly good with tabular data.
- openai: the official OpenAI Python library, used to call OpenAI's models through its API.
- sentence-transformers: a PyTorch-based library for computing sentence and text embeddings, usable for NLP tasks such as sentence similarity and clustering.
- networkx: a Python library for creating, manipulating, and studying complex networks; it supports several graph types, including undirected, directed, and multigraphs.
2. Importing libraries and reading the data
Next, we import the installed packages and read the Amazon products dataset into a pandas DataFrame.
import json
import logging
import matplotlib.pyplot as plt
import networkx as nx
from networkx import connected_components
from openai import OpenAI
import pandas as pd
from sentence_transformers import SentenceTransformer, util
data = pd.read_csv("amazon_products.csv")
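3. Information extraction
Next, we use ChatGPT to extract entities and the relations between them from the product data, returned as an array of JSON objects.
To guide ChatGPT toward correct entity and relation extraction, we give it a fixed set of entity types and relation types. These types are mapped to the corresponding entities and relations in Schema.org: each key is the entity or relation type passed to ChatGPT, and each value is the URL of the related Schema.org object or property.
Each returned JSON object therefore needs the following keys:
- head: the text of an entity extracted from the provided data;
- head_type: the type of the extracted head entity;
- tail: likewise, the text of an entity extracted from the data;
- tail_type: the type of the tail entity;
- relation: the type of relation between "head" and "tail".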
# ENTITY TYPES:
entity_types = {
"product": "https://schema.org/Product",
"rating": "https://schema.org/AggregateRating",
"price": "https://schema.org/Offer",
"characteristic": "https://schema.org/PropertyValue",
"material": "https://schema.org/Text",
"manufacturer": "https://schema.org/Organization",
"brand": "https://schema.org/Brand",
"measurement": "https://schema.org/QuantitativeValue",
"organization": "https://schema.org/Organization",
"color": "https://schema.org/Text",
}
# RELATION TYPES:
relation_types = {
"hasCharacteristic": "https://schema.org/additionalProperty",
"hasColor": "https://schema.org/color",
"hasBrand": "https://schema.org/brand",
"isProducedBy": "https://schema.org/manufacturer",
"hasMeasurement": "https://schema.org/hasMeasurement",
"isSimilarTo": "https://schema.org/isSimilarTo",
"madeOfMaterial": "https://schema.org/material",
"hasPrice": "https://schema.org/offers",
"hasRating": "https://schema.org/aggregateRating",
"relatedTo": "https://schema.org/isRelatedTo"
}
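Because the model is only asked to stay within these types, it can be worth checking each returned object before it enters the graph. The sketch below is one possible check; `is_valid_triple` and `REQUIRED_KEYS` are hypothetical names, and the reduced `demo_*` mappings exist only so the example runs on its own.

```python
REQUIRED_KEYS = {"head", "head_type", "relation", "tail", "tail_type"}


def is_valid_triple(triple, entity_types, relation_types):
    """Check one extracted object against the expected keys and type lists."""
    if not isinstance(triple, dict) or not REQUIRED_KEYS.issubset(triple):
        return False
    return (
        triple["head_type"] in entity_types
        and triple["tail_type"] in entity_types
        and triple["relation"] in relation_types
    )


# Reduced copies of the mappings, so the example runs on its own.
demo_entity_types = {
    "product": "https://schema.org/Product",
    "color": "https://schema.org/Text",
}
demo_relation_types = {"hasColor": "https://schema.org/color"}

good = {"head": "Wall Sticker", "head_type": "product",
        "relation": "hasColor", "tail": "White", "tail_type": "color"}
bad = {"head": "Wall Sticker", "head_type": "product",
       "relation": "hasFlavor", "tail": "Mint", "tail_type": "color"}

print(is_valid_triple(good, demo_entity_types, demo_relation_types))  # True
print(is_valid_triple(bad, demo_entity_types, demo_relation_types))   # False
```

In the real pipeline the full `entity_types` and `relation_types` dictionaries would be passed instead of the reduced copies.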
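We then create an OpenAI client with gpt-3.5-turbo as the default model. It is capable enough for this simple demo, so there is no need for the more expensive gpt-4 (unless cost is no concern for you).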
client = OpenAI(api_key="<YOUR_API_KEY>")


def extract_information(text, model="gpt-3.5-turbo"):
    # system_prompt and user_prompt are defined in the next section;
    # both must exist before this function is called.
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt.format(
                    entity_types=entity_types,
                    relation_types=relation_types,
                    specification=text
                )
            }
        ]
    )
    # The reply is expected to be a JSON array serialized as text.
    return completion.choices[0].message.content
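The `user_prompt.format(...)` call relies on Python's `str.format`, which treats single braces as placeholders. Any literal JSON braces in a prompt template therefore have to be doubled. The short template below is an illustrative stand-in, not the actual prompt used in this article.

```python
# A shortened stand-in for a prompt template; the wording is illustrative only.
template = """# Output
[
{{
"head": "Example Product",
"relation": "hasColor"
}}
]
# Specification
{specification}
"""

# Doubled braces come out as single literal braces, while the named
# placeholder is replaced by the keyword argument.
rendered = template.format(specification="Blue Ceramic Mug 350 ml")
print(rendered)

# A dict passed to format() is inserted using its str() form.
types_line = "{entity_types}".format(
    entity_types={"color": "https://schema.org/Text"}
)
print(types_line)  # {'color': 'https://schema.org/Text'}
```

This is also why the type mappings reach the model as Python dictionary text rather than as strict JSON.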
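4. Prompt engineering
- system_prompt: instructs ChatGPT to extract entities and relations from the raw text and return the result as an array of JSON objects, each with the keys "head", "head_type", "relation", "tail", and "tail_type".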
system_prompt = """You are an expert agent specialized in analyzing product specifications in an online retail store.
Your task is to identify the entities and relations requested with the user prompt, from a given product specification.
You must generate the output in a JSON containing a list with JSON objects having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt, the "head_type"
key must contain the type of the extracted head entity which must be one of the types from the provided user list,
the "relation" key must contain the type of relation between the "head" and the "tail", the "tail" key must represent the text of an
extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
"""
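- user_prompt: provides a single example of the desired output for one specification from the dataset, and asks ChatGPT to extract entities and relations from the provided specification in the same way.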
user_prompt = """Based on the following example, extract entities and relations from the provided text.
Use the following entity types:
# ENTITY TYPES:
{entity_types}
Use the following relation types:
{relation_types}
--> Beginning of example
# Specification
"YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,
Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT) -40 Tiles
[Made of soft PE foam,Anti Children's Collision,take care of your family.Waterproof, moist-proof and sound insulated. Easy clean and maintenance with wet cloth,economic wall covering material.,Self adhesive peel and stick wallpaper,Easy paste And removement .Easy To cut DIY the shape according to your room area,The embossed 3d wall sticker offers stunning visual impact. the tiles are light, water proof, anti-collision, they can be installed in minutes over a clean and sleek surface without any mess or specialized tools, and never crack with time.,Peel and stick 3d wallpaper is also an economic wall covering material, they will remain on your walls for as long as you wish them to be. The tiles can also be easily installed directly over existing panels or smooth surface.,Usage range: Featured walls,Kitchen,bedroom,living room, dinning room,TV walls,sofa background,office wall decoration,etc. Don't use in shower and rugged wall surface]
Provide high quality foam 3D wall panels self adhesive peel and stick wallpaper, made of soft PE foam,children's collision, waterproof, moist-proof and sound insulated,easy cleaning and maintenance with wet cloth,economic wall covering material, the material of 3D foam wallpaper is SAFE, easy to paste and remove . Easy to cut DIY the shape according to your decor area. Offers best quality products. This wallpaper we are is a real wallpaper with factory done self adhesive backing. You would be glad that you it. Product features High-density foaming technology Total Three production processes Can be use of up to 10 years Surface Treatment: 3D Deep Embossing Damask Pattern."
################
# Output
[
{{
"head": "YUVORA 3D Brick Wall Stickers",
"head_type": "product",
"relation": "isProducedBy",
"tail": "YUVORA",
"tail_type": "manufacturer"
}},
{{
"head": "YUVORA 3D Brick Wall Stickers",
"head_type": "product",
"relation": "hasCharacteristic",
"tail": "Waterproof",
"tail_type": "characteristic"
}},
{{
"head": "YUVORA 3D Brick Wall Stickers",
"head_type": "product",
"relation": "hasCharacteristic",
"tail": "Self Adhesive",
"tail_type": "characteristic"
}},
{{
"head": "YUVORA 3D Brick Wall Stickers",
"head_type": "product",
"relation": "hasColor",
"tail": "White",
"tail_type": "color"
}},
{{
"head": "YUVORA 3D Brick Wall Stickers",
"head_type": "product",
"relation": "hasMeasurement",
"tail": "70*70 CMT",
"tail_type": "measurement"
}},
{{
"head": "YUVORA 3D Brick Wall Stickers",
"head_type": "product",
"relation": "hasMeasurement",
"tail": "40 tiles",
"tail_type": "measurement"
}}
]
--> End of example
For the following specification, extract entities and relations as in the provided example.
# Specification
{specification}
################
# Output
"""
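5. Graph construction
Now we call extract_information for each product in the dataset and collect all extracted triples in a list; these triples make up our knowledge graph. In this example we build the graph from the titles of only 100 products.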
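kg = []
# Only the title column is used here; switch to data['text'] if you
# built a combined "text" field as described earlier.
for content in data['title'].values[:100]:
    try:
        extracted_relations = extract_information(content)
        extracted_relations = json.loads(extracted_relations)
        kg.extend(extracted_relations)
    except Exception as e:
        # Log the error and skip products whose response cannot be processed.
        logging.error(e)

kg_relations = pd.DataFrame(kg)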
The resulting data is shown in the figure below:
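One caveat for the loop above: `json.loads` fails if the reply is wrapped in a Markdown code fence, and a reply that parses to a single object rather than a list would make `kg.extend` add its keys instead of a triple. Whether and how often this happens depends on the model and prompt, so treat the helper below as a defensive sketch; `parse_triples` is a hypothetical name.

```python
import json
import re

# Built at runtime so the marker itself never appears literally in this file.
FENCE = "`" * 3
FENCED_REPLY = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def parse_triples(raw):
    """Parse a model reply into a list of dicts, or [] if it is not usable."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence if the reply has one.
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only JSON objects; stray strings or numbers are discarded.
    return [item for item in parsed if isinstance(item, dict)]


print(parse_triples('[{"head": "Mug", "relation": "hasColor", "tail": "Blue"}]'))
print(parse_triples("Sorry, no entities found."))  # []
```

With such a helper, the loop body could call `kg.extend(parse_triples(extract_information(content)))` in place of the bare `json.loads` call.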
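6. Entity resolution
Entity resolution matches the entities in a dataset to real-world concepts. Here we use NLP techniques to perform basic entity resolution on the head and tail entities in the dataset.
In this example we use the "all-MiniLM-L6-v2" sentence transformer to create an embedding for each head entity and compute the cosine similarity between head entities. Entities whose similarity is greater than 0.95 are treated as the same entity, and their text values are normalized to be identical. The same applies to tail entities. For example, if one entity has the value "Microsoft" and another has the value "Microsoft Corporation", the two are merged into one.
We load the embedding model and use it to compute the similarity between the first and second head entities as follows.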
heads = kg_relations['head'].values
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(heads)
similarity = util.cos_sim(embeddings[0], embeddings[1])
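The snippet above compares only the first two heads. A rough sketch of the full merging step described in the text follows: every pair above the 0.95 threshold is linked, and each name is replaced by the first-seen name in its cluster. It uses plain Python with toy 2-D vectors so that it runs without downloading a model; in the real pipeline the vectors would come from `embedding_model.encode(heads)`. The helpers `cosine` and `resolve_entities` are hypothetical names, and the union-find clustering is one possible choice, not necessarily the method used to produce the figures in this article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resolve_entities(names, vectors, threshold=0.95):
    """Map every name to a canonical name shared by its similarity cluster."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links up to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity exceeds the threshold.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) > threshold:
                root_i, root_j = find(i), find(j)
                # Keep the smaller index as root so the first-seen name wins.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return {name: names[find(i)] for i, name in enumerate(names)}


# Toy 2-D vectors standing in for real sentence embeddings.
names = ["Microsoft", "Microsoft Corp", "Apple"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
mapping = resolve_entities(names, vectors)
print(mapping)
```

The resulting mapping could then be applied with something like `kg_relations['head'] = kg_relations['head'].map(mapping)`. Note that the merging is transitive and the pairwise comparison is quadratic, so it is worth deduplicating the names first on larger datasets.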
7. Graph visualization
Finally, we can use the networkx Python library to visualize the graph data.
G = nx.Graph()
for _, row in kg_relations.iterrows():
    G.add_edge(row['head'], row['tail'], label=row['relation'])
pos = nx.spring_layout(G, seed=47, k=0.9)
labels = nx.get_edge_attributes(G, 'label')
plt.figure(figsize=(15, 15))
nx.draw(G, pos, with_labels=True, font_size=10, node_size=700, node_color='lightblue', edge_color='gray', alpha=0.6)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=8, label_pos=0.3, verticalalignment='baseline')
plt.title('Product Knowledge Graph')
plt.show()
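The result is shown in the figure below:
At this point, we have used an LLM to extract entity-relation triples from raw text data, automatically built a knowledge graph, and even produced a visualization.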