1. What is Artifact?

An Artifact is a file or directory produced by a tfx component. It can be passed to a downstream component, which then uses it.
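
For example, a minimal sketch with standard tfx components (the csv directory path here is just a placeholder): CsvExampleGen produces an "examples" Artifact, and StatisticsGen declares it as its input, so the Artifact is passed downstream.

from tfx import v1 as tfx

# CsvExampleGen produces an "examples" Artifact (a directory of data files)
# under pipeline_root; the input path below is a placeholder.
example_gen = tfx.components.CsvExampleGen(input_base="/path/to/csv_data")

# StatisticsGen takes the upstream output as its input, so the "examples"
# Artifact produced by CsvExampleGen is passed to StatisticsGen.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])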

2. How does tfx pass an Artifact between components?

A tfx pipeline has an argument "pipeline_root" when it is instantiated; this is the directory where components put their output Artifacts. When a component finishes executing, it stores its output Artifacts under pipeline_root and records the uri (namely the path) of each output Artifact in the metadata database (metadb). When a downstream component has an input Artifact, namely it needs an output Artifact of its upstream component, it queries metadb according to the input Artifact's channel (namely information about which pipeline, which pipeline run, and which pipeline node the Artifact belongs to) to find the Artifact's record in metadb. The record contains the Artifact's uri, and the downstream component reads the input Artifact's data from that uri.


from tfx import v1 as tfx

tfx.dsl.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    metadata_connection_config=tfx.orchestration.metadata
        .sqlite_metadata_connection_config(metadata_path),
    components=components,
)
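
The Artifact records in metadb can also be inspected directly with the ml-metadata client. A minimal sketch, assuming the same sqlite metadata file at metadata_path as above, and querying the "Examples" artifact type produced by ExampleGen:

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = metadata_path
store = metadata_store.MetadataStore(connection_config)

# Each Artifact record stores a uri pointing under pipeline_root; this is the
# path a downstream component resolves before reading the Artifact's data.
for artifact in store.get_artifacts_by_type("Examples"):
    print(artifact.id, artifact.uri)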

3. When running a tfx pipeline using kubeflow pipeline, how is an Artifact passed between components?

  When a tfx pipeline is run using kubeflow pipeline, the pipeline runs on a kubernetes cluster: each component runs in its own pod, and a container in a pod has a standalone file system. Even though pipeline_root is the same in every component's container, these are different directories, since they belong to standalone file systems. So one component cannot read an Artifact from its uri (pipeline_root/xxx), since the Artifact is stored at pipeline_root/xxx of another container's file system, and when a container finishes, its file system is gone as well, namely all files in it no longer exist.
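
As an illustration only (the uri below is hypothetical, following the usual pipeline_root/<component name>/<output name>/<execution id> layout): checking the recorded uri from inside the downstream component's container fails, because the file was written in another container's file system.

import os

# Hypothetical uri recorded in metadb by the upstream component's container.
uri = "pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/1"

# False in the downstream component's container, unless pipeline_root is on
# storage shared by all containers.
print(os.path.exists(uri))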

pipeline_root therefore needs to be a persistent directory which can be read and written by all components of the pipeline. One solution is to mount a PersistentVolume at each component's container directory (pipeline_root), and the PersistentVolume should be nfs (network file system), since the components of a pipeline normally run on different computers (namely nodes) in the kubernetes cluster.

# create the resource PersistentVolume tfx-pv in the kubernetes cluster
kubectl apply -f tfx_pv.yaml

# create the resource PersistentVolumeClaim tfx-pv-claim in the kubernetes cluster
# Attention: tfx-pv-claim needs to be in the same namespace as the objects that use it,
# in this case, the components of the tfx pipeline.
# A pv claim waits for its first consumer (namely a pod that uses the claim) before
# binding to an available pv (persistent volume).
kubectl apply -f tfx_pv_claim.yaml
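
If pipeline.yaml is generated with tfx's KubeflowDagRunner, the volume mount does not have to be added to the yaml by hand; it can be attached at compile time through a pipeline operator function. A minimal sketch, assuming the kfp v1 SDK, where tfx_pipeline stands for the tfx.dsl.Pipeline object instantiated in section 2, and the mount path /tfx/tfx_pv matches the volumeMounts shown in pipeline.yaml below:

from kfp import onprem
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Mount PersistentVolumeClaim tfx-pv-claim into every component's container
# at /tfx/tfx_pv, so pipeline_root can live on the shared nfs storage.
operator_funcs = kubeflow_dag_runner.get_default_pipeline_operator_funcs()
operator_funcs.append(
    onprem.mount_pvc(pvc_name="tfx-pv-claim",
                     volume_name="tfx-pv",
                     volume_mount_path="/tfx/tfx_pv"))

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=operator_funcs)

# Compiles the tfx pipeline into a kubeflow pipeline definition
# (an argo Workflow like the pipeline.yaml shown below).
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(tfx_pipeline)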


# pipeline.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: detect-anomolies-on-wafer-tfdv-schema-
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.8.0,
    pipelines.kubeflow.org/pipeline_compilation_time: '2024-01-07T22:16:36.438482',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Constructs a Kubeflow
      pipeline.", "inputs": [{"default": "pipelines/detect_anomolies_on_wafer_tfdv_schema",
      "name": "pipeline-root"}], "name": "detect_anomolies_on_wafer_tfdv_schema"}'}
  labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.8.0}
spec:
  entrypoint: detect-anomolies-on-wafer-tfdv-schema

  ...
  volumes:
  - name: tfx-pv
    persistentVolumeClaim:
      claimName: tfx-pv-claim

  templates:
  - name: detect-anomolies-on-wafer-tfdv-schema
    inputs:
      parameters:
      - {name: pipeline-root}

    dag:
      tasks:
      - name: importexamplegen
        template: importexamplegen
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: wafer-data
        - name: tfx-pv

      - name: pusher
        template: pusher
        dependencies: [trainer]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv

      - name: schema-importer
        template: schema-importer
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
        - name: schema-path

      - name: statisticsgen
        template: statisticsgen
        dependencies: [importexamplegen]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv

      - name: trainer
        template: trainer
        dependencies: [importexamplegen, transform]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: trainer-module
        - name: tfx-pv

      - name: transform
        template: transform
        dependencies: [importexamplegen, schema-importer]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: transform-module
        - name: tfx-pv
        - name: schema-path

  - name: importexamplegen
    container:
      ...
      volumeMounts:
      - mountPath: /maye/trainEvalData
        name: wafer-data
      - mountPath: /tfx/tfx_pv
        name: tfx-pv

# tfx_pv_claim.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfx-pv-claim
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

# tfx_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfx-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  nfs:
    server: nfs-server-ip
    path: /home/maye/nfs/tfx_pv
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - maye-inspiron-5547
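
After both resources are applied, the claim's binding status can be checked, for example with the official kubernetes Python client (a sketch, assuming the client is installed and a kubeconfig is available). As noted above, the claim stays Pending until its first consumer pod is scheduled:

from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

# Phase becomes "Bound" once a pod using tfx-pv-claim has been scheduled and
# the claim has bound to the persistent volume tfx-pv.
pvc = core_v1.read_namespaced_persistent_volume_claim("tfx-pv-claim", "kubeflow")
print(pvc.status.phase)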

Note:

  1. The backend of kubeflow pipeline is argo workflow. Kubeflow pipeline passes an artifact by copying it to minio when a component produces an output artifact, and by copying it from minio into the component's container directory when a component needs an input artifact.
    (Attention: here "output artifact" means kubeflow pipeline's output artifact, defined in kubeflow pipeline; a tfx output artifact is defined in tfx_ir, in the file pipeline.yaml. pipeline.yaml is the kubeflow pipeline definition file.)

  2. Output artifacts and input artifacts are defined in the file pipeline.yaml:

# pipeline.yaml
...
  templates:
  - name: detect-anomolies-on-wafer-tfdv-schema
    inputs:
      parameters:
      - {name: pipeline-root}
    dag:
      tasks:
      - name: importexamplegen
        template: importexamplegen
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
     
      - name: statisticsgen
        template: statisticsgen
        dependencies: [importexamplegen]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}          
          artifacts:
          - {name: import_example_gen_outputs, from: "{{tasks.importexamplegen.outputs.artifacts.import_example_gen_outputs}}"}
          
  - name: importexamplegen
    container:
      ...
    inputs:
      parameters:
      - {name: pipeline-root}
    outputs:
      artifacts:
      - {name: mlpipeline-ui-metadata, path: /mlpipeline-ui-metadata.json}   
      - {name: import_example_gen_outputs, path: /tmp/pipelines}

  - name: statisticsgen
    container:
      ...
    inputs:
      parameters:
      - {name: pipeline-root}      
      artifacts:
      - {name: import_example_gen_outputs, path: /tmp/pipelines}   
                                                                   
    outputs:                                                         
      artifacts:                                                   
      - {name: mlpipeline-ui-metadata, path: /mlpipeline-ui-metadata.json}
    

Note:

  1. The parent directory of inputs.artifacts.path needs to exist; otherwise no error is raised, the artifact passing just does not work.
  2. artifact.path: the local path of the artifact inside the component's container.
    task.arguments: the actual arguments of the task.
    task.template.inputs: the formal parameters of the task.