1. Computer environment

OS: Ubuntu 20.04
Kubernetes: 1.26.12
Attention:
Kubeflow Pipelines 2.0 is compatible with Kubernetes up to v1.26.

2. Prepare file kubeflow_pipeline_deployment.yaml

2.1 generate file kubeflow_pipeline_deployment.yaml

$ git clone https://github.com/kubeflow/pipelines.git
$ cd pipelines/manifests/kustomize

$ KFP_ENV=platform-agnostic
$ kustomize build "env/${KFP_ENV}/" > kubeflow_pipeline_deployment.yaml
$ kustomize build "env/${KFP_ENV}/" > kubeflow_pipeline_deployment.yaml
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'patchesJson6902' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'patchesStrategicMerge' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
$ 
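
If you want the manifests of a specific KFP release instead of master, you can check out the corresponding tag before building; a sketch, assuming the 2.0.5 release tag (matching the image versions used later in this article) exists in the repository:

$ git checkout 2.0.5
$ kustomize build "env/${KFP_ENV}/" > kubeflow_pipeline_deployment.yaml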

2.2 change the image registry in kubeflow_pipeline_deployment.yaml to one accessible in China

gedit kubeflow_pipeline_deployment.yaml  

replace "gcr.io" with "gcr.lank8s.cn"
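
The replacement can also be done non-interactively with sed, then checked with grep:

$ sed -i 's#gcr\.io#gcr.lank8s.cn#g' kubeflow_pipeline_deployment.yaml
$ grep -F 'gcr.io/' kubeflow_pipeline_deployment.yaml   # should print nothing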

3. Install Kubeflow Pipelines Standalone using Kustomize Manifests

(env/platform-agnostic) install on any Kubernetes cluster

Install:

#$ git clone https://github.com/kubeflow/pipelines.git

#$ cd pipelines/manifests/kustomize

#$ KFP_ENV=platform-agnostic
$ kubectl apply -k cluster-scoped-resources/
$ kubectl wait crd/applications.app.k8s.io --for condition=established --timeout=60s
$ kubectl apply -k cluster-scoped-resources/
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/applications.app.k8s.io created
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/scheduledworkflows.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/viewers.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/workfloweventbindings.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtaskresults.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtasksets.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/kubeflow-pipelines-cache-deployer-sa created
clusterrole.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-clusterrole created

$ kubectl wait crd/applications.app.k8s.io --for condition=established --timeout=60s
customresourcedefinition.apiextensions.k8s.io/applications.app.k8s.io condition met
$ 
# need to change the image registry to one accessible in China, or the pods will not become Ready.
#$ kubectl apply -k "env/${KFP_ENV}/"
$ kubectl apply -f kubeflow_pipeline_deployment.yaml
$ kubectl wait pods -l application-crd-id=kubeflow-pipelines -n kubeflow --for condition=Ready --timeout=1800s
$ kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
$ kubectl apply -f kubeflow_pipeline_deployment.yaml 
serviceaccount/argo created
serviceaccount/kubeflow-pipelines-cache created
serviceaccount/kubeflow-pipelines-container-builder created
serviceaccount/kubeflow-pipelines-metadata-writer created
serviceaccount/kubeflow-pipelines-viewer created
serviceaccount/metadata-grpc-server created
serviceaccount/ml-pipeline created
serviceaccount/ml-pipeline-persistenceagent created
serviceaccount/ml-pipeline-scheduledworkflow created
serviceaccount/ml-pipeline-ui created
serviceaccount/ml-pipeline-viewer-crd-service-account created
serviceaccount/ml-pipeline-visualizationserver created
serviceaccount/mysql created
serviceaccount/pipeline-runner created
role.rbac.authorization.k8s.io/argo-role created
role.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-role created
role.rbac.authorization.k8s.io/kubeflow-pipelines-cache-role created
role.rbac.authorization.k8s.io/kubeflow-pipelines-metadata-writer-role created
role.rbac.authorization.k8s.io/ml-pipeline created
role.rbac.authorization.k8s.io/ml-pipeline-persistenceagent-role created
role.rbac.authorization.k8s.io/ml-pipeline-scheduledworkflow-role created
role.rbac.authorization.k8s.io/ml-pipeline-ui created
role.rbac.authorization.k8s.io/ml-pipeline-viewer-controller-role created
role.rbac.authorization.k8s.io/pipeline-runner created
rolebinding.rbac.authorization.k8s.io/argo-binding created
rolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-cache-binding created
rolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-cache-deployer-rolebinding created
rolebinding.rbac.authorization.k8s.io/kubeflow-pipelines-metadata-writer-binding created
rolebinding.rbac.authorization.k8s.io/ml-pipeline created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-persistenceagent-binding created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-scheduledworkflow-binding created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-ui created
rolebinding.rbac.authorization.k8s.io/ml-pipeline-viewer-crd-binding created
rolebinding.rbac.authorization.k8s.io/pipeline-runner-binding created
configmap/kfp-launcher created
configmap/metadata-grpc-configmap created
configmap/ml-pipeline-ui-configmap created
configmap/pipeline-install-config created
configmap/workflow-controller-configmap created
secret/mlpipeline-minio-artifact created
secret/mysql-secret created
service/cache-server created
service/metadata-envoy-service created
service/metadata-grpc-service created
service/minio-service created
service/ml-pipeline created
service/ml-pipeline-ui created
service/ml-pipeline-visualizationserver created
service/mysql created
service/workflow-controller-metrics created
priorityclass.scheduling.k8s.io/workflow-controller created
persistentvolumeclaim/minio-pvc created
persistentvolumeclaim/mysql-pv-claim created
deployment.apps/cache-deployer-deployment created
deployment.apps/cache-server created
deployment.apps/metadata-envoy-deployment created
deployment.apps/metadata-grpc-deployment created
deployment.apps/metadata-writer created
deployment.apps/minio created
deployment.apps/ml-pipeline created
deployment.apps/ml-pipeline-persistenceagent created
deployment.apps/ml-pipeline-scheduledworkflow created
deployment.apps/ml-pipeline-ui created
deployment.apps/ml-pipeline-viewer-crd created
deployment.apps/ml-pipeline-visualizationserver created
deployment.apps/mysql created
(base) maye@maye-Inspiron-5547:~/Documents/kubernetes_install/kubeflow_pipeline_deployment/pipelines/manifests/kustomize$ 

(base) maye@maye-Inspiron-5547:~$ kubectl wait pods -l application-crd-id=kubeflow-pipelines -n kubeflow --for condition=Ready --timeout=1800s
pod/cache-deployer-deployment-5f4468f56-v45tp condition met
pod/cache-server-787f58d7d8-vt4v6 condition met
pod/metadata-envoy-deployment-8448bbb7cf-zbc2b condition met
pod/metadata-grpc-deployment-659594dfcb-g9jfl condition met
pod/metadata-writer-7994b79f84-gxrqt condition met
pod/minio-5c79479b46-cq4hw condition met
pod/ml-pipeline-96899cdb-cmmjt condition met
pod/ml-pipeline-persistenceagent-557dcfbfdc-dqzw4 condition met
pod/ml-pipeline-scheduledworkflow-86dbb6ffb7-jh7nm condition met
pod/ml-pipeline-ui-6dcf65dcb5-twzrc condition met
pod/ml-pipeline-viewer-crd-584b649c6-m4sk6 condition met
pod/ml-pipeline-visualizationserver-d799cdc66-xff9z condition met
pod/mysql-6b95d686-zsp98 condition met
pod/workflow-controller-5b6c7f9779-xrv6f condition met
(base) maye@maye-Inspiron-5547:~$ 

(base) maye@maye-Inspiron-5547:~$ kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
Forwarding from 127.0.0.1:8080 -> 3000
Forwarding from [::1]:8080 -> 3000

Now you can access the Kubeflow Pipelines UI in your browser at http://localhost:8080. [1]

4. Uninstall

### 1. namespace scoped
# Depends on how you installed it:
kubectl kustomize env/platform-agnostic | kubectl delete -f -
# or
kubectl kustomize env/dev | kubectl delete -f -
# or
kubectl kustomize env/gcp | kubectl delete -f -
# or
kubectl delete applications/pipeline -n kubeflow

### 2. cluster scoped
kubectl delete -k cluster-scoped-resources/

5. Error & Solution

[ERROR: User "system:serviceaccount:kubeflow:kubeflow-pipelines-metadata-writer" cannot watch resource "pods" in API group "" at the cluster scope, reason: Forbidden]

(base) maye@maye-Inspiron-5547:~/Documents/kubernetes_install/kubeflow_pipeline_deployment/pipelines/manifests/kustomize$ kubectl logs metadata-writer-b6cd5d484-fk77j -n kubeflow
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 142, in stream
    resp = func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 13630, in list_pod_for_all_namespaces
    (data) = self.list_pod_for_all_namespaces_with_http_info(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 13724, in list_pod_for_all_namespaces_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 344, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 178, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 365, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '5d64491c-9a0b-468a-b880-00e93a5b3d94', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '50ac8f2b-4c6e-47cd-93b1-bf724a63bd9c', 'X-Kubernetes-Pf-Prioritylevel-Uid': '17070a51-1bcd-4d89-8fd0-311289f9092f', 'Date': 'Sun, 14 Jan 2024 08:45:34 GMT', 'Content-Length': '303'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \\"system:serviceaccount:kubeflow:kubeflow-pipelines-metadata-writer\\" cannot watch resource \\"pods\\" in API group \\"\\" at the cluster scope","reason":"Forbidden","details":{"kind":"pods"},"code":403}\n'

(base) maye@maye-Inspiron-5547:~/Documents/kubernetes_install/kubeflow_pipeline_deployment/pipelines/manifests/kustomize$ 

[SOLUTION]
Create a ClusterRole kubeflow-pipelines-metadata-writer-role and a
ClusterRoleBinding kubeflow-pipelines-metadata-writer-binding:

kubectl apply -f kubeflow_pipelines_metadata_writer_role.yaml
# kubeflow_pipelines_metadata_writer_role.yaml
apiVersion: rbac.authorization.k8s.io/v1
#kind: Role

kind: ClusterRole

metadata:
  labels:
    app: kubeflow-pipelines-metadata-writer-role
    application-crd-id: kubeflow-pipelines
  name: kubeflow-pipelines-metadata-writer-role
  #namespace: kubeflow
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  verbs:
  - get
  - list
  - watch
  - update
  - patch
---

apiVersion: rbac.authorization.k8s.io/v1
#kind: RoleBinding

kind: ClusterRoleBinding

metadata:
  labels:
    application-crd-id: kubeflow-pipelines
  name: kubeflow-pipelines-metadata-writer-binding
  #namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  #kind: Role
  kind: ClusterRole
  
  name: kubeflow-pipelines-metadata-writer-role
subjects:
- kind: ServiceAccount
  name: kubeflow-pipelines-metadata-writer
  namespace: kubeflow
---

Note:
RoleBinding vs. ClusterRoleBinding
A role binding grants the permissions defined in a role to a user or a group of users. It contains a list of subjects (users, groups, or service accounts) and a reference to the role being granted. Permissions within a namespace are granted with a RoleBinding object, while cluster-wide permissions are granted with a ClusterRoleBinding object.

A RoleBinding can reference a Role defined in the same namespace.

A ClusterRole can grant the same permissions as a Role, but because it is a cluster-scoped object it can also be used to grant access to:

  • cluster-scoped resources (e.g. nodes)
  • non-resource endpoints (e.g. "/healthz")
  • namespaced resources across all namespaces (e.g. pods, as listed by kubectl get pods --all-namespaces) [2]
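
After applying the ClusterRole and ClusterRoleBinding above, the grant can be verified immediately with kubectl's built-in authorization check, without waiting for the metadata-writer pod to retry:

kubectl auth can-i watch pods --all-namespaces --as=system:serviceaccount:kubeflow:kubeflow-pipelines-metadata-writer
# expected output: yes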

[ERROR: mysql: Out of range value for column 'create_time_since_epoch']

INFO:absl:MetadataStore with gRPC connection initialized
WARNING:absl:mlmd client InternalError: Cannot create node for type_id: 11 name: "detect_anomolies_on_wafer_tfdv_schema"mysql_query failed: errno: Out of range value for column 'create_time_since_epoch' at row 1, error: Out of range value for column 'create_time_since_epoch' at row 1

[SOLUTION]
This happens because the value of create_time_since_epoch written by the Python code is a millisecond timestamp, which needs int64 (namely MySQL BIGINT) to store, but the datatype of the column 'create_time_since_epoch' is INT (namely int32).

  1. check which tables have the column 'create_time_since_epoch':
mysql> SELECT TABLE_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME = 'create_time_since_epoch';

+------------+
| TABLE_NAME |
+------------+
| Artifact |
| Context |
| Execution |
+------------+
3 rows in set (0.00 sec)

  2. change the datatype of column 'create_time_since_epoch' to BIGINT for each table that has this column.
kubectl exec -it <mysql-pod-name> -n kubeflow -- bash

inside the mysql pod, start the mysql client and run:

mysql> ALTER TABLE table_name MODIFY create_time_since_epoch BIGINT NOT NULL DEFAULT 0;
  3. check the column 'create_time_since_epoch':
kubectl exec -it <mysql-pod-name> -n kubeflow -- bash

inside the mysql pod, start the mysql client and run:

mysql> DESCRIBE table_name;

Attention:
Setting the datatype of column 'create_time_since_epoch' to TIMESTAMP does not work, because the value of create_time_since_epoch in the Python code is a plain integer (e.g. 2325547), while the format of MySQL TIMESTAMP is %Y-%m-%d HH:MM:SS, so mysql will raise the error "Incorrect datetime value".
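
The three ALTER TABLE statements can also be run from outside the pod in one go; a sketch, assuming the MLMD database is named metadb (see the Note further below) and that root can connect without a password as in the default mysql manifests (adjust the credentials and the hypothetical pod name to your deployment):

for t in Artifact Context Execution; do
  kubectl exec -n kubeflow <mysql-pod-name> -- \
    mysql -u root metadb -e "ALTER TABLE ${t} MODIFY create_time_since_epoch BIGINT NOT NULL DEFAULT 0;"
done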


[ERROR: pod has unbound immediate PersistentVolumeClaim]

Warning  FailedScheduling  33s (x2 over 5m38s)  default-scheduler  0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..

[SOLUTION]

  1. Create a persistent volume:
    1.1 if there is no default storageclass, create a storageclass, which will be set as the default storageclass later.
    1.1.1 write the definition yaml file of the storageclass:
# storageclass_local.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

1.1.2 create the storageclass resource (resource=object):

kubectl apply -f <path of the storageclass definition yaml file>
(base) maye@maye-Inspiron-5547:~$ kubectl apply -f /home/maye/Documents/kubernetes_install/storageclass_local.yaml 
storageclass.storage.k8s.io/local-storage created
(base) maye@maye-Inspiron-5547:~$ 

1.1.3 check storageclass

kubectl get storageclass
(base) maye@maye-Inspiron-5547:~$ kubectl get storageclass
NAME            PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  66s
(base) maye@maye-Inspiron-5547:~$ 

1.1.4 Changing the default StorageClass
List the StorageClasses in your cluster:

kubectl get storageclass

The output is similar to this:

NAME                 PROVISIONER               AGE
standard (default)   kubernetes.io/gce-pd      1d
gold                 kubernetes.io/gce-pd      1d

The default StorageClass is marked by (default).

Mark the default StorageClass as non-default:

The default StorageClass has an annotation storageclass.kubernetes.io/is-default-class set to true. Any other value or absence of the annotation is interpreted as false.
To mark a StorageClass as non-default, you need to change its value to false:

kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

where standard is the name of your chosen StorageClass.

Mark a StorageClass as default:
Similar to the previous step, you need to add/set the annotation storageclass.kubernetes.io/is-default-class=true.

kubectl patch storageclass gold -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Attention:
Please note that at most one StorageClass can be marked as default. If two or more of them are marked as default, a PersistentVolumeClaim without storageClassName explicitly specified cannot be created.

Verify that your chosen StorageClass is default:

kubectl get storageclass

The output is similar to this:

NAME             PROVISIONER               AGE
standard         kubernetes.io/gce-pd      1d
gold (default)   kubernetes.io/gce-pd      1d
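
For the local-storage StorageClass created in step 1.1 above, that is:

kubectl patch storageclass local-storage -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'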

1.2 insert a USB disk (any disk works, e.g. a hard disk) and mount it
1.2.1 find the device name of the USB disk:

sudo fdisk -l

/dev/sdb

1.2.2 mount the USB disk

sudo mkdir -p /mnt/sdb
sudo umount /dev/sdb
sudo mount /dev/sdb /mnt/sdb

1.2.3 check that the USB disk is mounted correctly

df -h
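
To make the mount survive a node reboot (otherwise the local PV path disappears on restart), an /etc/fstab entry can be added; a sketch, assuming the disk is formatted as ext4 and mounted at /mnt/sdb as above:

echo '/dev/sdb  /mnt/sdb  ext4  defaults  0  2' | sudo tee -a /etc/fstab
sudo mount -a    # verify the entry mounts cleanly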

1.3 write persistent volume definition yaml file

#local_pv_kioxia_32g.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-kioxia-32g
spec:
  capacity:
    storage: 30Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/sdb
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - maye-inspiron-5547

1.4 create the PersistentVolume resource on the Kubernetes cluster

# on kubernetes cluster control plane node
kubectl apply -f  <path of pv definition yaml file>

1.5 check pv

kubectl get pv --all-namespaces  
(base) maye@maye-Inspiron-5547:~$ kubectl get pv --all-namespaces
NAME                  CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv-kioxia-32g   30Gi       RWO            Retain           Available           local-storage            15s
(base) maye@maye-Inspiron-5547:~$ 

Once the PV and StorageClass are created and the default StorageClass is marked, the unbound PVC will be bound automatically:

(base) maye@maye-Inspiron-5547:~$ kubectl get pvc --all-namespaces
NAMESPACE      NAME              STATUS   VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS    AGE
istio-system   authservice-pvc   Bound    local-pv-kioxia-32g   30Gi       RWO            local-storage   175m
(base) maye@maye-Inspiron-5547:~$ 

Attention:
If a PV is created with persistentVolumeReclaimPolicy: Retain, then after the PVC bound to it is deleted, the status of the PV will be "Released", not "Available", and a new PVC cannot bind to it.
You need to delete the "claimRef" field from its spec manually; the PV then becomes "Available" and can bind to a new PVC.

kubectl edit pv <pv-name>   
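
Alternatively, the claimRef can be removed with a single patch instead of editing interactively, using the PV name from this article:

kubectl patch pv local-pv-kioxia-32g --type=json -p='[{"op":"remove","path":"/spec/claimRef"}]'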

Note

  1. A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.

A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany, ReadWriteMany, or ReadWriteOncePod, see AccessModes).

A control loop in the control plane watches for new PVCs, finds a matching PV (if possible), and binds them together.

Claims will remain unbound indefinitely if a matching volume does not exist. Claims will be bound as matching volumes become available; and if the StorageClass sets volumeBindingMode: WaitForFirstConsumer, binding is additionally delayed until the first consumer pod of the claim is scheduled.

  2. A persistent volume can be a standalone disk, or a partition of a standalone disk.

On Ubuntu, to partition a USB disk:
2.1 unmount the USB disk;
2.2 run sudo fdisk -l to find the USB disk device name, e.g. /dev/sde
2.3 sudo fdisk /dev/sde

Command (m for help): p    # list partitions
Command (m for help): d    # delete a partition
Command (m for help): n    # create a new partition; if the USB disk is not a system boot disk, you can create one extended partition with all the capacity, then create multiple logical partitions on the extended partition.

Command (m for help): w    # save and exit

2.4 format the partition:

sudo mkfs -t ext4 <partition name, e.g. /dev/sde5>

Attention:
A partition cannot be mounted if it has not been formatted.


[ERROR: mysql-pod: chown: changing ownership of '/var/lib/mysql/': Operation not permitted]

kubectl describe pod <mysql-pod-name> -n kubeflow

kubeflow       mysql-6b95d686-rgqtj                              0/1     CrashLoopBackOff 
Exit Code:    1


kubectl logs mysql-6b95d686-rgqtj -n kubeflow --all-containers

(base) maye@maye-Inspiron-5547:~$ kubectl logs mysql-6b95d686-rgqtj -n kubeflow --all-containers
2024-01-10 12:11:19+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.26-1debian10 started.
chown: changing ownership of '/var/lib/mysql/': Operation not permitted
chown: changing ownership of '/var/lib/mysql/datadir': Operation not permitted
(base) maye@maye-Inspiron-5547:~$ 

[SOLUTION]
This is caused by the folder bound to mysql-pvc being owned by root on the host, with mode rwxr-xr-x,
so other users, including the user in the container, have no permission to write to it. On the host, run chmod 777 <the-folder-path-on-host>.
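
A sketch of that fix, assuming the mysql PVC is bound to the local PV created earlier (host path /mnt/sdb); run it on the node that hosts the volume:

kubectl get pv local-pv-kioxia-32g -o jsonpath='{.spec.local.path}{"\n"}'   # confirm the host path behind the PV
sudo chmod -R 777 /mnt/sdb                                                  # open it up so the mysql container user can write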


[ERROR: mysql: data directory has files in it, but --initialize was specified]
[SOLUTION]
Empty the mysql data directory:

rm -rf <host path bound to mysql-pvc>

[ERROR: ml-pipeline-pod: exit code: 137]

[ANALYSIS]
This happens when kubelet tells containerd to stop a container: containerd first sends SIGTERM, and if the container does not exit within the stop timeout it sends SIGKILL to the container process (exit code 137).

journalctl -fu containerd  

1月 11 13:50:01 maye-Inspiron-5547 containerd[6794]: time="2024-01-11T13:50:01.917354662+08:00" level=info msg="CreateContainer within sandbox "d853b1ca36011217e11dd812e8312233d45b1360a980136c8a2aec02866aa658" for &ContainerMetadata{Name:ml-pipeline-api-server,Attempt:199,} returns container id "2a9f4cf46804f8b62b94a145dd5aebde32f83bcace3ba180d62d21f314ebacea""
1月 11 13:50:01 maye-Inspiron-5547 containerd[6794]: time="2024-01-11T13:50:01.918137372+08:00" level=info msg="StartContainer for "2a9f4cf46804f8b62b94a145dd5aebde32f83bcace3ba180d62d21f314ebacea""
1月 11 13:50:02 maye-Inspiron-5547 containerd[6794]: time="2024-01-11T13:50:02.450201053+08:00" level=info msg="StartContainer for "2a9f4cf46804f8b62b94a145dd5aebde32f83bcace3ba180d62d21f314ebacea" returns successfully"
1月 11 13:50:59 maye-Inspiron-5547 containerd[6794]: time="2024-01-11T13:50:59.932650029+08:00" level=info msg="StopContainer for "2a9f4cf46804f8b62b94a145dd5aebde32f83bcace3ba180d62d21f314ebacea" with timeout 30 (s)"
1月 11 13:50:59 maye-Inspiron-5547 containerd[6794]: time="2024-01-11T13:50:59.933136978+08:00" level=info msg="Stop container "2a9f4cf46804f8b62b94a145dd5aebde32f83bcace3ba180d62d21f314ebacea" with signal terminated"
1月 11 13:51:29 maye-Inspiron-5547 containerd[6794]: time="2024-01-11T13:51:29.941363418+08:00" level=info msg="Kill container "2a9f4cf46804f8b62b94a145dd5aebde32f83bcace3ba180d62d21f314ebacea""

From the containerd log, it can be seen that the container is stopped because it does not exit before the stop timeout.
[SOLUTION]
This is due to the pod metadata-grpc-deployment not being Ready; once pod metadata-grpc-deployment becomes Ready, ml-pipeline recovers automatically.
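
One way to confirm this is to watch the pods in the kubeflow namespace; once metadata-grpc-deployment shows 1/1 Running, ml-pipeline should stop being restarted:

kubectl get pods -n kubeflow -w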


[ERROR: pod metadata-grpc-deployment: Exit Code: 139]

$ kubectl describe pod metadata-grpc-deployment-659594dfcb-k8qmz -n kubeflow

Exit Code: 139

Example 1:

$ kubectl logs metadata-grpc-deployment-659594dfcb-2jhn4 -n kubeflow

I0111 14:57:55.153116     1 metadata_store_server_main.cc:551] Retry attempt 0
W0111 14:57:56.759151     1 metadata_store_server_main.cc:550] Connection Aborted with error: ABORTED: There are a subset of tables in MLMD instance. This may be due to concurrent connection to the empty database. Please retry the connection. checks: 15 errors: 14, present tables: type_table, missing tables: parent_type_table, type_property_table, artifact_table, artifact_property_table, execution_table, execution_property_table, event_table, event_path_table, mlmd_env_table, context_table, parent_context_table, context_property_table, association_table, attribution_table Errors: INTERNAL: mysql_query failed: errno: Table 'metadb.ParentType' doesn't exist, error: Table 'metadb.ParentType' doesn't exist [mysql-error-info='\x08\xfa\x08']

Example 2:

I0210 08:45:45.443547     1 metadata_store_server_main.cc:551] Retry attempt 3
W0210 08:45:50.460564     1 metadata_store_server_main.cc:550] Connection Aborted with error: ABORTED: There are a subset of tables in MLMD instance. This may be due to concurrent connection to the empty database. Please retry the connection. checks: 15 errors: 13, present tables: type_table, type_property_table, missing tables: parent_type_table, artifact_table, artifact_property_table, execution_table, execution_property_table, event_table, event_path_table, mlmd_env_table, context_table, parent_context_table, context_property_table, association_table, attribution_table Errors: INTERNAL: mysql_query failed: errno: Table 'metadb.ParentType' doesn't exist, error: Table 'metadb.ParentType' doesn't exist [mysql-error-info='\x08\xfa\x08']

[SOLUTION]
"Exit Code: 139" means the process was killed by SIGSEGV (segmentation fault), i.e. it accessed memory it should not.
Here the error occurs because only a subset of the required tables exists in database metadb; when the metadata-grpc-deployment pod queries a table that does not exist, it crashes with exit code 139. The debugging and solution details for this error are in another article, <<Debug: kubeflow pipeline: There are a subset of tables in MLMD instance>>, https://www.cnblogs.com/zhenxia-jiuyou/p/17997363 .
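
To see which MLMD tables actually exist, the database can be inspected directly; a sketch, assuming passwordless root as in the default mysql manifests (hypothetical pod name):

kubectl exec -n kubeflow <mysql-pod-name> -- mysql -u root -e "SHOW TABLES IN metadb;"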


[ERROR: ml-pipeline-ui webpage: Failed to list pipelines: InternalServerError: Failed to start transaction to list pipelines: dial tcp: lookup mysql on 10.96.0.10:53: no such host]

{"error":"Failed to list pipelines in namespace . Check error stack: Failed to list pipelines with context \u0026{0xc00017a040}, options \u0026{10 0xc000968080}: InternalServerError: Failed to start transaction to list pipelines: dial tcp: lookup mysql on 10.96.0.10:53: no such host","code":13,"message":"Failed to list pipelines in namespace . Check error stack: Failed to list pipelines with context \u0026{0xc00017a040}, options \u0026{10 0xc000968080}: InternalServerError: Failed to start transaction to list pipelines: dial tcp: lookup mysql on 10.96.0.10:53: no such host","details":[{"@type":"type.googleapis.com/google.rpc.Status","code":13,"message":"Internal Server Error"}]}

[SOLUTION]
This is due to a transient network/DNS problem, so the API server could not connect to mysql. Just refresh the web page and it works again.
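
If the error keeps coming back instead of disappearing on refresh, the in-cluster DNS lookup can be tested directly from a throwaway pod; a sketch, assuming the busybox:1.36 image can be pulled:

kubectl run dns-test -n kubeflow --rm -it --restart=Never --image=busybox:1.36 -- nslookup mysql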

Note:

  1. database metadb is created by pod metadata-grpc-deployment,
    database mlpipeline is created by pod ml-pipeline .

  2. enter and exit a pod

# enter a pod
kubectl exec -it <pod-name> -n <namespace> -- bash 
# exit a pod
root@mysql-6b95d686-gkszv:/#  exit  
  3. mirror websites of gcr.io :
    3.1 gcr.lank8s.cn
    3.2 registry.aliyuncs.com/google_containers
    3.3 registry.cn-hangzhou.aliyuncs.com/google_containers
    3.4 gcr.dockerproxy.com

  4. pull a container image via https://dockerproxy.com/ :
    Step 1: enter the original image address on https://dockerproxy.com/ to get the proxied image address.
    gcr.io/ml-pipeline/api-server:2.0.5
    Step 2: pull the image through the proxy
    nerdctl pull gcr.dockerproxy.com/ml-pipeline/api-server:2.0.5
    Step 3: re-tag the image
    nerdctl tag gcr.dockerproxy.com/ml-pipeline/api-server:2.0.5 gcr.io/ml-pipeline/api-server:2.0.5
    Step 4: delete the proxied image
    nerdctl rmi gcr.dockerproxy.com/ml-pipeline/api-server:2.0.5

Attention:
Every node (i.e. computer) in the k8s cluster needs to pull the same images and tag them with the same names, because without affinity rules k8s schedules pods onto arbitrary nodes in the cluster. If some node has not pulled the images, the local image will not be found there and the pod status will be "ErrImageNeverPull".
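
A quick way to check whether an image is already present on a node (images pulled for kubernetes live in the k8s.io containerd namespace); run it on each node:

nerdctl --namespace k8s.io images | grep 'ml-pipeline/api-server'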

  5. clean cache/buffers on Ubuntu:
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

/proc/sys/vm/drop_caches:
Writing to this will cause the kernel to drop clean caches, as well as reclaimable slab objects like dentries and
inodes. Once dropped, their memory becomes free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (including dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects. To increase the number of
objects freed by this operation, the user may run 'sync' prior to writing to /proc/sys/vm/drop_caches.

  6. On Ubuntu, restart the network service:
sudo systemctl restart NetworkManager.service
  7. what each pod of the kubeflow pipeline deployment does:
    7.1 ml-pipeline:
    creates database mlpipeline and manages creating, deleting, and listing pipelines, as well as creating and listing pipeline runs. The ml-pipeline-ui web page or kfp.Client() is the user interface for creating and deleting pipelines or pipeline runs; the back-end server is the ml-pipeline pod. Its main container is api-server.
    7.2 workflow-controller:
    the workflow controller of Argo; it creates a workflow from the file pipeline.yaml (namely the definition file of the workflow) and runs the components of the workflow. When ml-pipeline creates a pipeline run, it calls workflow-controller to create a workflow, i.e. one pipeline run = one workflow.
    7.3 metadata-grpc-deployment:
    creates database metadb and manages connecting to mysql and querying the database (i.e. reading or writing records) for metadata. When a tfx pipeline component needs to store an output artifact or read an input artifact, it calls metadata-grpc-deployment to connect to mysql and query the database: it records the metadata of the output artifact, including its uri, and then writes the artifact data to that uri, or it retrieves the metadata of the input artifact, including its uri, and then reads the artifact data from that uri.
    7.4 minio:
    the artifact transfer station of kubeflow pipeline: kubeflow pipeline copies an output artifact from the component container that produces it to minio, and copies an input artifact from minio to the component container that needs it.
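
These components can be seen as separate Deployments in the kubeflow namespace:

kubectl get deployments -n kubeflow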

References:


  1. https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize

  2. https://www.cnblogs.com/peitianwang/p/11528740.html