Using GraphFrames

Using GraphFrames in PySpark

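1 A brief introduction to GraphFrames

GraphFrames is a graph-processing package for Spark that is built on the DataFrame interface and offers a uniform graph API for Scala, Java, and Python.

GraphFrames is an open-source project; the source code is at https://github.com/graphframes/graphframes

References:

Official site: https://graphframes.github.io/graphframes/docs/_site/index.html
GraphFrames user guide (Python), Databricks documentation: https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html

In GraphFrames, both the vertices and the edges of a graph are stored as DataFrames:

Vertex DataFrame: must contain a column named "id", which is the unique identifier of each vertex.
Edge DataFrame: must contain columns named "src" and "dst", which hold the ids of the source and destination vertices of each edge.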
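
As a small illustration of these naming rules, the sketch below checks two lists of column names before a graph is built. The helper `check_graph_columns` is hypothetical (it is not part of GraphFrames); it only encodes the "id" / "src" / "dst" requirement stated above.

```python
# Hypothetical helper, not part of GraphFrames: it only checks the
# column names that GraphFrames expects on the two DataFrames.
REQUIRED_VERTEX_COLUMNS = ("id",)
REQUIRED_EDGE_COLUMNS = ("src", "dst")


def check_graph_columns(vertex_columns, edge_columns):
    """Return a list of readable problems (empty if the names are fine)."""
    problems = []
    for name in REQUIRED_VERTEX_COLUMNS:
        if name not in vertex_columns:
            problems.append(f"vertex DataFrame is missing column '{name}'")
    for name in REQUIRED_EDGE_COLUMNS:
        if name not in edge_columns:
            problems.append(f"edge DataFrame is missing column '{name}'")
    return problems


# Column names as they might look right after reading raw CSV files
print(check_graph_columns(["_c0"], ["from", "to"]))
# Column names after renaming
print(check_graph_columns(["id"], ["src", "dst"]))
```

With real DataFrames, the two arguments would be the `columns` attribute of the vertex DataFrame and of the edge DataFrame.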

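2 Setting up GraphFrames

2.1 Install GraphFrames for Python

pip install graphframes

2.2 Download the jar that matches your Spark (and Scala) version

https://spark-packages.org/package/graphframes/graphframes

2.3 Put the jar in /usr/local/lib/python3.9/dist-packages/pyspark/jars

The jar belongs in the jars directory of the PySpark installation, here /usr/local/lib/python3.9/dist-packages/pyspark/jars (the exact path depends on the Python environment).

Either move the downloaded jar into this directory, or change into the directory and download the jar there with wget, which makes the separate download in step 2.2 unnecessary.

GraphFrames can then be used from PySpark.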
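
An alternative to copying the jar by hand, sketched here as an assumption rather than something the steps above rely on, is to let Spark resolve the package through the `spark.jars.packages` setting. The helper names are hypothetical and the version numbers are illustrative only; they must be replaced with a combination that is actually published for your Spark and Scala versions. The setting has to be in place before the first SparkSession is created.

```python
def graphframes_coordinate(graphframes_version, spark_version, scala_version):
    """Build a coordinate of the form graphframes:graphframes:<gf>-spark<x.y>-s_<scala>."""
    return (
        "graphframes:graphframes:"
        f"{graphframes_version}-spark{spark_version}-s_{scala_version}"
    )


def build_session(coordinate, app_name="graphtest"):
    """Create a SparkSession that asks Spark to fetch the given package."""
    # Imported here so the helper above can be used without PySpark installed
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


# Illustrative versions only (assumption): check spark-packages.org for real ones
coordinate = graphframes_coordinate("0.8.2", "3.2", "2.12")
print(coordinate)
```

Calling `build_session(coordinate)` would then take the place of the plain `SparkSession.builder` call, with no manual jar handling.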

3 Using GraphFrames

3.1 Building a graph

Dataset: http://opendata.pku.edu.cn/dataset.xhtml?persistentId=doi:10.18170/DVN/KVBL82

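# TODO: use motif finding to list everyone who is followed by at least two people and who also follows at least two people.

# Imports
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Create the SparkSession, the entry point of the application
spark = SparkSession.builder \
    .appName("graphtest") \
    .getOrCreate()

# Read the vertex data without a header row, so Spark names the first column "_c0"
nodes_df = spark.read.format("csv") \
    .option("header", "false") \
    .load("nodes.csv")
# GraphFrames requires the vertex id column to be called "id", so rename "_c0"
nodes_df = nodes_df.withColumnRenamed("_c0", "id")
nodes_df.show()

# Read the edge data, using the first row as the header
edges_df = spark.read.format("csv") \
    .option("header", "true") \
    .load("edges.csv")
# Rename "from" to "src" (source vertex) and "to" to "dst" (destination vertex)
edges_df = edges_df.withColumnRenamed("from", "src") \
    .withColumnRenamed("to", "dst")
edges_df.show()

# Build the graph from the vertex DataFrame and the edge DataFrame
graph = GraphFrame(nodes_df, edges_df)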

3.2 Motif finding

Reference: https://blog.csdn.net/as604049322/article/details/123009617

# Find every edge that matches the pattern (src)-[to]->(dst);
# each of the result columns src, to and dst is a struct
motif = graph.find("(src)-[to]->(dst)")
motif.show()

# Number of accounts each user follows: group by src and count
src_count = motif.select('src').groupby('src').count().\
    withColumnRenamed('src', 'user')
# Keep the users who follow at least two accounts (count >= 2)
src_count = src_count.filter("count >= 2").\
    withColumnRenamed('count', 'following_count')
# Number of followers each user has: group by dst and count
dst_count = motif.select('dst').groupby('dst').count().\
    withColumnRenamed('dst', 'user')
# Keep the users who have at least two followers (count >= 2)
dst_count = dst_count.filter("count >= 2").\
    withColumnRenamed('count', 'follower_count')
# An inner join on "user" keeps only the users that satisfy both conditions
result = dst_count.join(src_count, on='user')
result.show()
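
To make the intended result concrete, here is a plain-Python sketch of the same filter on a tiny, made-up edge list (the names are invented for illustration and are not taken from the dataset). It counts outgoing and incoming edges per user and keeps the users for whom both counts are at least two.

```python
from collections import Counter

# Made-up (src, dst) pairs: src follows dst
edges = [
    ("a", "b"), ("a", "c"),
    ("b", "a"), ("b", "c"),
    ("c", "a"),
    ("d", "a"),
]

following = Counter(src for src, _ in edges)  # accounts each user follows
followers = Counter(dst for _, dst in edges)  # followers each user has

# Users that follow at least two accounts and have at least two followers,
# mapped to (follower count, following count)
result = {
    user: (followers[user], following[user])
    for user in following
    if following[user] >= 2 and followers[user] >= 2
}
print(result)
```

Only "a" qualifies here: "b" follows two accounts but has a single follower, and "c" has two followers but follows only one account.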

3.3 Using the PageRank algorithm

# Print the ten users with the highest PageRank
# Spark SQL function library
from pyspark.sql import functions as F
# Run PageRank: resetProbability is the probability of resetting to a random vertex (0.15 here);
# maxIter=5 runs five iterations
pagerank = graph.pageRank(resetProbability=0.15, maxIter=5)
# pageRank returns a GraphFrame whose vertices DataFrame carries a "pagerank" column.
# Sort it in descending order, round the scores to two decimals and take the first ten (id, pagerank) rows.
pageRank_sort = pagerank.vertices.orderBy("pagerank", ascending=False).\
    withColumn('pagerank', F.round('pagerank', 2)).\
    select(['id', 'pagerank']).take(10)
print(pageRank_sort)
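
For readers who want to see what the two parameters control, the following is a minimal plain-Python sketch of a textbook PageRank iteration on a made-up graph. It is not the GraphFrames implementation, and its scores are normalized to sum to 1, so the numbers will generally differ from those in the `pagerank` column; `reset_probability` and `max_iter` merely play the roles of `resetProbability` and `maxIter`.

```python
def pagerank(edges, reset_probability=0.15, max_iter=5):
    """Textbook PageRank by power iteration; the ranks sum to 1."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every vertex starts with the share it receives from random resets
        new_ranks = {v: reset_probability / n for v in nodes}
        for v in nodes:
            # A vertex without out-links spreads its rank over all vertices
            targets = out_links[v] or nodes
            share = (1.0 - reset_probability) * ranks[v] / len(targets)
            for t in targets:
                new_ranks[t] += share
        ranks = new_ranks
    return ranks


# Made-up edges: (src, dst) means src points to dst
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)
for vertex, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(vertex, round(score, 2))
```

A larger reset probability pulls all scores toward the uniform value, while more iterations let rank propagate further along the edges.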

3.4 Using in-degrees

# graph.inDegrees is a DataFrame with the columns "id" and "inDegree"
indegrees = graph.inDegrees
# Sort by in-degree in descending order and take the first ten; take() returns a list of Row objects
indegrees_sort = indegrees.orderBy('inDegree', ascending=False).take(10)
print(indegrees_sort)
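
In-degree is simply the number of edges that end at a vertex. The sketch below computes it in plain Python for a made-up edge list (illustrative data only). A vertex with no incoming edges, such as "d" here, gets no entry at all; the GraphFrames documentation describes the same behaviour for `inDegrees`.

```python
from collections import Counter

# Made-up edges: (src, dst) means src points to dst
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]

# Count how many edges end at each vertex
in_degree = Counter(dst for _, dst in edges)

# Up to ten vertices with the highest in-degree, largest first
top = in_degree.most_common(10)
print(top)
```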

Posted 2022-06-30 17:03 by ArkonLu
