FATE Tutorial
Preface: this tutorial covers installing FATE and a hands-on case study: horizontal (homogeneous) logistic regression.
Installing FATE
Standalone Deployment
Installing from the Docker Image
# 1. Pull the image
docker pull federatedai/standalone_fate:1.8.0
# 2. Start the container
docker run -it --name standalone_fate -p 8080:8080 federatedai/standalone_fate:1.8.0
# 3. Enter the container
docker exec -it standalone_fate /bin/bash
# 4. Load the environment variables
source bin/init_env.sh
# Visit localhost:8080 in a browser to open FATE Board; the default username and password are both admin
# Back in the terminal, start testing
# 5. Toy test
flow test toy -gid 10000 -hid 10000
# On success it prints: success to calculate secure_sum, it is 1999.9999999999986
# In FATE Board, click JOBS in the top-right corner; if two jobs appear with status success, the test passed
# 6. Unit tests
fate_test unittest federatedml --yes
# On success it prints: there are 0 failed test
To mount a local directory into the container:
docker run -it --name standalone_fate -p 8080:8080 -v D:/Docker/data:/dataset federatedai/standalone_fate:1.8.0
Native Installation
- Install FATE
# 1. Check that the required ports are free
netstat -apln|grep 8080
netstat -apln|grep 9360
netstat -apln|grep 9380
# 2. Download the installation package and unpack it
sudo wget https://webank-ai-1251170195.cos.ap-guangzhou.myqcloud.com/fate/1.8.0/release/standalone_fate_install_1.8.0_release.tar.gz --no-check-certificate
tar -xzvf standalone_fate_install_1.8.0_release.tar.gz
# 3. Install
cd standalone_fate_install_1.8.0
bash init.sh init
# 4. Start, then check the status
bash init.sh start
bash init.sh status
# 5. Load the environment variables
source bin/init_env.sh
# 6. Run the tests described above
Installing FATE-Client, FATE-Test, and FATE-Flow
- To make FATE easier to use, install the interactive client tool FATE-Client and the testing tool FATE-Test in the environment
python -m pip install fate-client
python -m pip install fate-test
- Installing FATE-Client also installs FATE-Flow, the secure scheduling platform for multi-party federated jobs in end-to-end federated learning pipelines and the core component that executes jobs
# initialize flow
flow init --ip 127.0.0.1 --port 9380
FATE in Practice: Horizontal Logistic Regression
Dataset Preprocessing
Dataset: the breast cancer dataset (built into scikit-learn)
from sklearn.datasets import load_breast_cancer
import pandas as pd
breast_dataset = load_breast_cancer()
breast = pd.DataFrame(breast_dataset.data, columns=breast_dataset.feature_names)
breast = (breast - breast.mean()) / breast.std()  # z-score standardization
# breast.shape
# (569, 30)
col_names = breast.columns.values.tolist()
columns = {}
for idx, n in enumerate(col_names):
    columns[n] = "x%d" % idx
breast = breast.rename(columns=columns)
# name the label column y
breast['y'] = breast_dataset.target
breast['idx'] = range(breast.shape[0])
idx = breast['idx']
breast.drop(labels=['idx'], axis=1, inplace=True)
# move the idx column to the front
breast.insert(0, 'idx', idx)
breast = breast.sample(frac=1)  # shuffle the rows
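As a quick optional check that the preprocessing did what we expect (idx first, the 30 renamed features, label y last):

# Optional sanity check on the preprocessed frame
assert breast.columns[0] == 'idx' and breast.columns[-1] == 'y'
assert breast.shape == (569, 32)  # idx + 30 features + y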
Horizontal Dataset Split
To simulate a horizontal federated modeling scenario, the dataset is split into horizontal partitions that share the same feature set.
The split strategy is as follows:
- the first 469 rows are used as training data, the last 100 rows as test data
- within the training data, the first 200 rows belong to organization A and are saved as breast_1_train.csv; the remaining 269 rows belong to organization B and are saved as breast_2_train.csv
- the test data is not split and is saved as breast_eval.csv
train_data = breast.iloc[:469]
eval_data = breast.iloc[469:]
breast_1_train = train_data.iloc[:200]
breast_2_train = train_data.iloc[200:]
breast_1_train.to_csv('data/breast/breast_1_train.csv', index=False, header=True)
breast_2_train.to_csv('data/breast/breast_2_train.csv', index=False, header=True)
eval_data.to_csv('data/breast/breast_eval.csv', index=False, header=True)
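Note that pandas' to_csv does not create missing directories; if data/breast did not already exist, the calls above would fail, so create it beforehand:

import os

os.makedirs('data/breast', exist_ok=True)  # ensure the output directory exists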
Running Training and Prediction Jobs via dsl and conf Files
Preparation
- Start the container with a directory mount
docker run -it --name standalone_fate -p 8080:8080 -v ${host_path}:${container_path} federatedai/standalone_fate:1.8.0
# for example
docker run -it --name standalone_fate -p 8080:8080 -v D:\Dropbox\学习计划\FATE\data\breast:/workspace federatedai/standalone_fate:1.8.0
- Enter the container
docker exec -it standalone_fate /bin/bash
- Inside the container, change to the mounted directory
[root@9605a317930a]# cd /workspace
[root@9605a317930a workspace]#
Data Upload
The template conf file for data upload is at /data/projects/fate/examples/dsl/v2/upload/upload_conf.json, with the following content:
{
"file": "/data/projects/fate/examples/data/breast_hetero_guest.csv", // path of the data file
"table_name": "breast_hetero_guest", // name of the DTable the data is converted into
"namespace": "experiment", // namespace that the DTable name belongs to
"head": 1, // whether the data file has a header row, 1: yes, 0: no
"partition": 8, // number of partitions used to store the data
"work_mode": 0, // work mode, 1: cluster, 0: standalone
"backend": 0 // backend, 0: EggRoll, 1: Spark + RabbitMQ, 2: Spark + Pulsar
}
Create the following four files in the mounted directory on the host; they are synced into the Docker container automatically. Note that the // comments in the listings are explanatory only and must not appear in the actual JSON files.
upload_train_host_conf.json: uploads the training data of organization A (the host)
{
"file": "breast_1_train.csv",
"table_name": "homo_breast_1_train",
"namespace": "homo_host_breast_train",
"head": 1,
"partition": 8,
"work_mode": 0,
"backend": 0
}
upload_train_guest_conf.json: uploads the training data of organization B (the guest)
{
"file": "breast_2_train.csv",
"table_name": "homo_breast_2_train",
"namespace": "homo_guest_breast_train",
"head": 1,
"partition": 8,
"work_mode": 0,
"backend": 0
}
upload_eval_host_conf.json: uploads the test data for organization A (the host)
{
"file": "breast_eval.csv",
"table_name": "homo_breast_1_eval",
"namespace": "homo_host_breast_eval",
"head": 1,
"partition": 8,
"work_mode": 0,
"backend": 0
}
upload_eval_guest_conf.json: uploads the test data for organization B (the guest)
{
"file": "breast_eval.csv",
"table_name": "homo_breast_2_eval",
"namespace": "homo_guest_breast_eval",
"head": 1,
"partition": 8,
"work_mode": 0,
"backend": 0
}
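Since the four files differ only in three fields, you can optionally generate them with a short script instead of writing them by hand (a minimal sketch; file, table, and namespace names exactly as above):

import json

# (data file, table_name, namespace, conf file name) for the four uploads
uploads = [
    ("breast_1_train.csv", "homo_breast_1_train", "homo_host_breast_train", "upload_train_host_conf.json"),
    ("breast_2_train.csv", "homo_breast_2_train", "homo_guest_breast_train", "upload_train_guest_conf.json"),
    ("breast_eval.csv", "homo_breast_1_eval", "homo_host_breast_eval", "upload_eval_host_conf.json"),
    ("breast_eval.csv", "homo_breast_2_eval", "homo_guest_breast_eval", "upload_eval_guest_conf.json"),
]
for data_file, table, namespace, conf_name in uploads:
    conf = {"file": data_file, "table_name": table, "namespace": namespace,
            "head": 1, "partition": 8, "work_mode": 0, "backend": 0}
    with open(conf_name, "w") as f:
        json.dump(conf, f, indent=4)  # plain JSON, no comments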
- Upload the data (right-clicking in the bash window pastes the copied command):
# initialize flow
[root@9605a317930a workspace]# flow init --ip 127.0.0.1 --port 9380
[root@9605a317930a workspace]# flow data upload -c upload_train_host_conf.json
[root@9605a317930a workspace]# flow data upload -c upload_train_guest_conf.json
[root@9605a317930a workspace]# flow data upload -c upload_eval_host_conf.json
[root@9605a317930a workspace]# flow data upload -c upload_eval_guest_conf.json
- Check the uploads in FATE Board (username and password are both admin) under JOBS
Model Training
In the DSL provided by FATE, task modules are organized into a directed acyclic graph (DAG), so users can flexibly combine algorithm modules as needed.
Logistic regression serves as the example here:
- For the model training stage, the official example dsl file is at /data/projects/fate/examples/dsl/v2/homo_logistic_regression/homo_lr_train_dsl.json. Its components are:
- reader_0 & data_transform_0: data IO components that convert local data into a DTable
- scale_0: feature engineering component
- homo_lr_0: horizontal logistic regression component
- evaluation_0: model evaluation component; if no test set is provided, the training set is used automatically
The file content is as follows:
{
"components": {
"reader_0": {
"module": "Reader",
"output": {
"data": ["data"]
}
},
"data_transform_0": {
"module": "DataTransform",
"input": {
"data": {
"data": ["reader_0.data"]
}
},
"output": {
"data": ["data"],
"model": ["model"]
}
},
"scale_0": {
"module": "FeatureScale",
"input": {
"data": {"data": ["data_transform_0.data"]
}
},
"output": {
"data": ["data"],
"model": ["model"]
}
},
"homo_lr_0": {
"module": "HomoLR",
"input": {
"data": {
"train_data": ["scale_0.data"]
}
},
"output": {
"data": ["data"],
"model": ["model"]
}
},
"evaluation_0": {
"module": "Evaluation",
"input": {
"data": {
"data": ["homo_lr_0.data"]
}
},
"output": {
"data": ["data"]
}
}
}
}
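To see the DAG concretely, the edges can be read straight off the dsl file: each entry in a component's data input names its upstream component before the dot (e.g. "reader_0.data"). A small sketch, assuming the dsl above is saved as homo_lr_train_dsl.json:

import json

with open("homo_lr_train_dsl.json") as f:
    dsl = json.load(f)

# Each data input such as "reader_0.data" points at the upstream component,
# so the DAG edges fall out of the dsl directly.
for name, comp in dsl["components"].items():
    for inputs in comp.get("input", {}).get("data", {}).values():
        for upstream in inputs:
            print(upstream.split(".")[0], "->", name)
# prints: reader_0 -> data_transform_0, data_transform_0 -> scale_0,
#         scale_0 -> homo_lr_0, homo_lr_0 -> evaluation_0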
For each module, FATE saves the parameters of every party into a single runtime configuration file (the Submit Runtime Conf), which all parties share. Therefore, besides the dsl file, users must also prepare a conf file that sets the parameters of each component.
- For the model training stage, the official example conf file is at /data/projects/fate/examples/dsl/v2/homo_logistic_regression/homo_lr_train_conf.json. The roles are as follows:
- arbiter: assists the parties in joint modeling, mainly by aggregating gradients or models; in vertical LR, for instance, each party sends its half of the gradient to the arbiter, which then performs the joint optimization
- initiator: the party that initiates the job
- host: the data provider
- guest: the data consumer
- local: local jobs; this role is used only in the upload and download stages
{
"dsl_version": 2,
// the initiator
"initiator": {
"role": "guest",
"party_id": 10000
},
// all roles participating in this job
// each entry maps a role to the party_ids that play it; a list, since one role may be played by multiple parties
"role": {
"guest": [10000],
"host": [10000],
"arbiter": [10000] // 仲裁者
},
// hyperparameters for model training
"component_parameters": {
"common": {
"data_transform_0": {
"with_label": true,
"output_format": "dense"
},
"homo_lr_0": {
"penalty": "L2",
"tol": 1e-05,
"alpha": 0.01,
"optimizer": "sgd",
"batch_size": -1,
"learning_rate": 0.15,
"init_param": {
"init_method": "zeros"
},
"max_iter": 30,
"early_stop": "diff",
"encrypt_param": {
"method": null
},
"cv_param": {
"n_splits": 4,
"shuffle": true,
"random_seed": 33,
"need_cv": false
},
"decay": 1,
"decay_sqrt": true
},
"evaluation_0": {
"eval_type": "binary"
}
},
"role": {
"host": {
"0": {
"reader_0": {
"table": {
"name": "breast_homo_host", // DTable 的表名,对应配置文件中的 table_name
"namespace": "experiment" // 命名空间,对应配置文件中的 namespace
}
},
"evaluation_0": {
"need_run": false
}
}
},
"guest": {
"0": {
"reader_0": {
"table": {
"name": "breast_homo_guest", // DTable 的表名,对应配置文件中的 table_name
"namespace": "experiment" // 命名空间,对应配置文件中的 namespace
}
}
}
}
}
}
}
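For intuition only, the homo_lr_0 hyperparameters above roughly correspond to a local scikit-learn model; this is a loose analogy added here for illustration, not FATE's implementation (for example, batch_size: -1 means full-batch in FATE, while SGDClassifier updates per sample):

from sklearn.linear_model import SGDClassifier

# Loose local analogy to homo_lr_0: L2 penalty with alpha=0.01,
# constant learning rate 0.15, at most 30 iterations, tolerance 1e-5
clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=0.01,
                    learning_rate="constant", eta0=0.15,
                    max_iter=30, tol=1e-5)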
- In the mounted directory on the host, create homo_lr_train_dsl.json with the same content as the template
- In the mounted directory on the host, create homo_lr_train_conf.json with the content modified as follows:
// hyperparameters for model training
"component_parameters": {
"common": {
// add label_name and label_type
"data_transform_0": {
"with_label": true,
"label_name": "y",
"label_type": "int",
"output_format": "dense"
},
......
"role": {
"host": {
"0": {
"reader_0": {
"table": {
"name": "homo_breast_1_train", // DTable 的表名,对应配置文件中的 table_name
"namespace": "homo_host_breast_train" // 命名空间,对应配置文件中的 namespace
}
},
"evaluation_0": {
"need_run": false
}
}
},
"guest": {
"0": {
"reader_0": {
"table": {
"name": "homo_breast_2_train", // DTable 的表名,对应配置文件中的 table_name
"namespace": "homo_guest_breast_train" // 命名空间,对应配置文件中的 namespace
}
}
}
}
}
}
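Since the listings above carry // annotations that plain JSON forbids, a quick parse check on the file you actually created catches mistakes before submitting (a small sketch):

import json

# json.load raises an error if stray // comments or trailing commas remain
with open("homo_lr_train_conf.json") as f:
    conf = json.load(f)
print(conf["role"])  # quick look at the participating parties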
- Submit the job:
flow job submit -c ${conf_path} -d ${dsl_path}
[root@9605a317930a workspace]# flow job submit -c homo_lr_train_conf.json -d homo_lr_train_dsl.json
{
"data": {
"board_url": "http://127.0.0.1:8080/index.html#/dashboard?job_id=202305230940139350290&role=guest&party_id=10000",
"code": 0,
"dsl_path": "/data/projects/fate/fateflow/jobs/202305230940139350290/job_dsl.json",
"job_id": "202305230940139350290",
"logs_directory": "/data/projects/fate/fateflow/logs/202305230940139350290",
"message": "success",
"model_info": {
"model_id": "arbiter-10000#guest-10000#host-10000#model",
"model_version": "202305230940139350290"
},
"pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202305230940139350290/pipeline_dsl.json",
"runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202305230940139350290/guest/10000/job_runtime_on_party_conf.json",
"runtime_conf_path": "/data/projects/fate/fateflow/jobs/202305230940139350290/job_runtime_conf.json",
"train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202305230940139350290/train_runtime_conf.json"
},
"jobId": "202305230940139350290",
"retcode": 0,
"retmsg": "success"
}
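Besides FATE Board, you can also watch the job from the command line; a hedged convenience sketch using subprocess (this assumes the flow CLI's job query subcommand, present in FATE Flow CLI v2, and reuses the job_id from the response above):

import subprocess

job_id = "202305230940139350290"  # job_id from the submit response above
result = subprocess.run(["flow", "job", "query", "-j", job_id],
                        capture_output=True, text=True)
print(result.stdout)  # JSON describing the job's current status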
- Check in FATE Board; if every component shows success, the run completed and the outputs can be inspected
Model Evaluation
For the model evaluation stage, the official example configuration files are at
/data/projects/fate/examples/dsl/v2/homo_logistic_regression/homo_lr_train_eval_conf.json
/data/projects/fate/examples/dsl/v2/homo_logistic_regression/homo_lr_train_eval_dsl.json
The procedure mirrors the training stage; the conf now additionally binds a reader_1 to the uploaded evaluation tables. We modify the configuration files as follows:
- In the mounted directory on the host, create homo_lr_train_eval_dsl.json with the same content as the template
- In the mounted directory on the host, create homo_lr_train_eval_conf.json with the content modified as follows:
"component_parameters": {
"common": {
"data_transform_0": {
"with_label": true,
"label_type": "int",
"label_name": "y",
"output_format": "dense"
},
......
"role": {
"host": {
"0": {
"evaluation_0": {
"need_run": false
},
"reader_1": {
"table": {
"name": "homo_breast_1_eval",
"namespace": "homo_host_breast_eval"
}
},
"reader_0": {
"table": {
"name": "homo_breast_1_train",
"namespace": "homo_host_breast_train"
}
}
}
},
"guest": {
"0": {
"reader_1": {
"table": {
"name": "homo_breast_2_eval",
"namespace": "homo_guest_breast_eval"
}
},
"reader_0": {
"table": {
"name": "homo_breast_2_train",
"namespace": "homo_guest_breast_train"
}
}
}
}
}
}
- Submit the job:
flow job submit -c ${conf_path} -d ${dsl_path}
[root@9605a317930a workspace]# flow job submit -c homo_lr_train_eval_conf.json -d homo_lr_train_eval_dsl.json
{
"data": {
"board_url": "http://127.0.0.1:8080/index.html#/dashboard?job_id=202305230954229348000&role=guest&party_id=10000",
"code": 0,
"dsl_path": "/data/projects/fate/fateflow/jobs/202305230954229348000/job_dsl.json",
"job_id": "202305230954229348000",
"logs_directory": "/data/projects/fate/fateflow/logs/202305230954229348000",
"message": "success",
"model_info": {
"model_id": "arbiter-10000#guest-10000#host-10000#model",
"model_version": "202305230954229348000"
},
"pipeline_dsl_path": "/data/projects/fate/fateflow/jobs/202305230954229348000/pipeline_dsl.json",
"runtime_conf_on_party_path": "/data/projects/fate/fateflow/jobs/202305230954229348000/guest/10000/job_runtime_on_party_conf.json",
"runtime_conf_path": "/data/projects/fate/fateflow/jobs/202305230954229348000/job_runtime_conf.json",
"train_runtime_conf_path": "/data/projects/fate/fateflow/jobs/202305230954229348000/train_runtime_conf.json"
},
"jobId": "202305230954229348000",
"retcode": 0,
"retmsg": "success"
}