1. What is an Artifact?
An Artifact is a file or directory produced by a TFX component. It can be passed to a downstream component, which then consumes it as an input.
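For example, wiring two standard TFX components together makes ExampleGen's output Artifact the input Artifact of StatisticsGen (a minimal sketch; the input_base path is a placeholder):
from tfx import v1 as tfx

# CsvExampleGen produces an "examples" output Artifact (TFRecord files).
example_gen = tfx.components.CsvExampleGen(input_base="/path/to/csv_data")

# StatisticsGen declares that output Artifact as its input Artifact,
# so it runs downstream of CsvExampleGen and reads the examples it produced.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])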
2. How does TFX pass an Artifact between components?
A TFX pipeline has a pipeline_root argument, set when the pipeline is instantiated; it is the directory where components write their output Artifacts. When a component finishes executing, it stores its output Artifacts under pipeline_root and records the uri (namely the path) of each output Artifact in the metadata database (metadb). When a downstream component has an input Artifact, namely it needs an output Artifact of an upstream component, it queries the metadb using the input Artifact's channel (information about which pipeline, pipeline run, and pipeline node the Artifact belongs to) to find the Artifact's record. That record contains the Artifact's uri, and the downstream component reads the input Artifact's data from that uri. The pipeline_root and the metadata database are configured when instantiating the pipeline:
tfx.dsl.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    metadata_connection_config=tfx.orchestration.metadata
        .sqlite_metadata_connection_config(metadata_path),
    components=components,
)
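To see how each output Artifact's uri ends up in the metadb, you can query the metadata store directly with the ml_metadata (MLMD) library. This is a rough sketch, assuming the SQLite store configured above (metadata_path is the same path passed to sqlite_metadata_connection_config); conceptually this is what a downstream component's input resolution amounts to:
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the same SQLite metadb the pipeline writes to.
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = metadata_path
connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)

# Each output Artifact is recorded with its uri under pipeline_root,
# e.g. <pipeline_root>/CsvExampleGen/examples/<execution_id>.
for artifact in store.get_artifacts_by_type("Examples"):
    print(artifact.id, artifact.uri)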
3. When running a TFX pipeline on Kubeflow Pipelines, how is an Artifact passed between components?
When a TFX pipeline is run with Kubeflow Pipelines, the pipeline runs on a Kubernetes cluster: each component runs in its own pod, and a container in a pod has its own standalone file system. Even though pipeline_root is the same path in every component's container, the containers have separate file systems, so these are different directories. One component therefore cannot read an Artifact from its uri (pipeline_root/xxx), since the Artifact was stored at pipeline_root/xxx of another container's file system; moreover, when a container finishes, its file system is gone, so none of its files exist any more.
pipeline_root needs to be a persistent directory that all components of the pipeline can read and write. One solution is to mount one PersistentVolume at the same directory (pipeline_root) in each component's container. The PersistentVolume should be NFS (network file system), since the components of a pipeline normally run on different machines (nodes) of the Kubernetes cluster.
# create the PersistentVolume tfx-pv in the kubernetes cluster
kubectl apply -f tfx_pv.yaml
# create the PersistentVolumeClaim tfx-pv-claim in the kubernetes cluster
# Attention: tfx-pv-claim needs to be in the same namespace as the objects that use it,
# in this case the components of the tfx pipeline.
# A PVC waits for its first consumer (a pod that uses the claim) before binding to an available PV (PersistentVolume).
kubectl apply -f tfx_pv_claim.yaml
# pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: detect-anomolies-on-wafer-tfdv-schema-
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.8.0,
    pipelines.kubeflow.org/pipeline_compilation_time: '2024-01-07T22:16:36.438482',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Constructs a Kubeflow
      pipeline.", "inputs": [{"default": "pipelines/detect_anomolies_on_wafer_tfdv_schema",
      "name": "pipeline-root"}], "name": "detect_anomolies_on_wafer_tfdv_schema"}'}
  labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.8.0}
spec:
  entrypoint: detect-anomolies-on-wafer-tfdv-schema
  ...
  volumes:
  - name: tfx-pv
    persistentVolumeClaim:
      claimName: tfx-pv-claim
  templates:
  - name: detect-anomolies-on-wafer-tfdv-schema
    inputs:
      parameters:
      - {name: pipeline-root}
    dag:
      tasks:
      - name: importexamplegen
        template: importexamplegen
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: wafer-data
        - name: tfx-pv
      - name: pusher
        template: pusher
        dependencies: [trainer]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
      - name: schema-importer
        template: schema-importer
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
        - name: schema-path
      - name: statisticsgen
        template: statisticsgen
        dependencies: [importexamplegen]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
      - name: trainer
        template: trainer
        dependencies: [importexamplegen, transform]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: trainer-module
        - name: tfx-pv
      - name: transform
        template: transform
        dependencies: [importexamplegen, schema-importer]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: transform-module
        - name: tfx-pv
        - name: schema-path
  - name: importexamplegen
    container:
      ...
      volumeMounts:
      - mountPath: /maye/trainEvalData
        name: wafer-data
      - mountPath: /tfx/tfx_pv
        name: tfx-pv
# tfx_pv_claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfx-pv-claim
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
# tfx_pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfx-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  nfs:
    server: nfs-server-ip
    path: /home/maye/nfs/tfx_pv
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - maye-inspiron-5547
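The volumes / volumeMounts entries for tfx-pv in pipeline.yaml above can also be generated when compiling the pipeline, instead of editing the YAML by hand. The following is only a sketch, assuming TFX's KubeflowDagRunner and KFP SDK 1.x; create_pipeline and the exact pipeline_root value are placeholders:
from kfp import onprem
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Mount PersistentVolumeClaim tfx-pv-claim at /tfx/tfx_pv in every component
# container, and put pipeline_root on that mount so all components share it.
pipeline_root = "/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema"  # placeholder

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=(
        kubeflow_dag_runner.get_default_pipeline_operator_funcs()
        + [onprem.mount_pvc("tfx-pv-claim", "tfx-pv", "/tfx/tfx_pv")]
    ),
)

# create_pipeline: your own function that returns the tfx.dsl.Pipeline (hypothetical name).
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(
    create_pipeline(pipeline_root=pipeline_root))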
Note:
- The backend of Kubeflow Pipelines is Argo Workflows. Kubeflow Pipelines passes an artifact by copying it to MinIO when a component produces an output artifact, and by copying it from MinIO into the component's container directory when a component consumes an input artifact.
  (Attention: the output artifact here is Kubeflow Pipelines' artifact, defined in the Kubeflow pipeline definition; a TFX output artifact is defined in tfx_ir, inside pipeline.yaml. pipeline.yaml is the Kubeflow pipeline definition file.)
- Output artifacts and input artifacts are defined in pipeline.yaml:
# pipeline.yaml
...
templates:
- name: detect-anomolies-on-wafer-tfdv-schema
  inputs:
    parameters:
    - {name: pipeline-root}
  dag:
    tasks:
    - name: importexamplegen
      template: importexamplegen
      arguments:
        parameters:
        - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
    - name: statisticsgen
      template: statisticsgen
      dependencies: [importexamplegen]
      arguments:
        parameters:
        - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        artifacts:
        - {name: import_example_gen_outputs, from: "{{tasks.importexamplegen.outputs.artifacts.import_example_gen_outputs}}"}
- name: importexamplegen
  container:
    ...
  inputs:
    parameters:
    - {name: pipeline-root}
  outputs:
    artifacts:
    - {name: mlpipeline-ui-metadata, path: /mlpipeline-ui-metadata.json}
    - {name: import_example_gen_outputs, path: /tmp/pipelines}
- name: statisticsgen
  container:
    ...
  inputs:
    parameters:
    - {name: pipeline-root}
    artifacts:
    - {name: import_example_gen_outputs, path: /tmp/pipelines}
  outputs:
    artifacts:
    - {name: mlpipeline-ui-metadata, path: /mlpipeline-ui-metadata.json}
Note:
- The parent directory of inputs.artifacts.path must already exist; if it does not, no error is raised, the artifact copy simply does not happen.
- artifact.path: the local path of the artifact inside the container.
- task.arguments: the actual arguments passed to the task.
- task.template.inputs: the formal parameters of the task's template.