Debug: tf distribute strategy parameter server: NOT_FOUND: No such file or directory

[ERROR: NOT_FOUND: /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory]

log of pod tfx-trainer-component:

ERROR:tensorflow: /job:worker/task:0 encountered the following error when processing closure: NotFoundError():Graph execution error:

2 root error(s) found.
  (0) NOT_FOUND:   /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNextAsOptional]]
Additional GRPC error information from remote target /job:worker/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:
:{"created":"@1707896978.099891609","description":"Error received from peer ipv4:10.102.74.8:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]","grpc_status":5}
	 [[Cast_27/_24]]
Additional GRPC error information from remote target /job:ps/replica:0/task:0/device:CPU:0 while calling /tensorflow.eager.EagerService/RunComponentFunction:
:{"created":"@1707896978.103488999","description":"Error received from peer ipv4:10.96.200.160:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]\nAdditional GRPC error information from remote target /job:worker/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:\n:{"created":"@1707896978.099891609","description":"Error received from peer ipv4:10.102.74.8:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]","grpc_status":5}\n\t [[Cast_27/_24]]","grpc_status":5}
  (1) NOT_FOUND:  /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNextAsOptional]]
0 successful operations.
0 derived errors ignored.

[SOLUTION]

This error occurs because the pipeline_root directory has not been mounted into the file system of the worker container. Mount it in the YAML definition file of the worker service, using the volumeMounts and volumes entries:

# definition yaml file of worker service
kind: Service
apiVersion: v1
metadata:
  name: dist-strat-example-worker-0
  
  namespace: kubeflow
  
spec:

  type: LoadBalancer

  selector:
    app: dist-strat-example-worker-0
       
  ports:
  - port: 5000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dist-strat-example-worker-0

  name: dist-strat-example-worker-0
  
  namespace: kubeflow

spec:
  replicas: 1
  
  selector:
    matchLabels:
      app: dist-strat-example-worker-0

      
  template:
    metadata:
      labels:
        app: dist-strat-example-worker-0
    
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - maye-inspiron-5547
      
      containers:

      - name: tensorflow
        image: tf_std_server:v1
        resources:
          limits:
            #nvidia.com/gpu: 2

        env:

        - name: TF_CONFIG
          value: "{
  \"cluster\": {
    \"worker\": [\"dist-strat-example-worker-0:5000\",\"dist-strat-example-worker-1:5000\"], 
    \"ps\": [\"dist-strat-example-ps-0:5000\"]},
  \"task\": {
    \"type\": \"worker\",
    \"index\": \"0\"
  }
}"

        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: "/var/secrets/google/key.json"
        ports:
        - containerPort: 5000

        command:
        - "/usr/bin/python"
        - "/tf_std_server.py"
        - ""
        
        
        volumeMounts:
        - mountPath: /tfx/tfx_pv
          name: tfx-pv  
        
        #- name: credential
        #  mountPath: /var/secrets/google
        
        
      volumes:
      - name: tfx-pv
        persistentVolumeClaim:
          claimName: tfx-pv-claim


      #- name: credential
      #  secret:
      #    secretName: credential
---
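After applying the manifest, it can help to confirm from inside the worker container that the mount is actually visible before starting training. The following is a minimal sketch, not part of the original setup: the helper name `check_pipeline_root` and the `**/*.gz` pattern are assumptions chosen for illustration, and `/tfx/tfx_pv` is the mountPath from the Deployment above.

```python
import glob
import os


def check_pipeline_root(mount_path, pattern="**/*.gz"):
    """Return the files under mount_path that match pattern.

    Raises FileNotFoundError if mount_path is not a directory, which is
    what a missing volume mount looks like from inside the container.
    """
    if not os.path.isdir(mount_path):
        raise FileNotFoundError(f"{mount_path} is not mounted in this container")
    # recursive=True lets "**" match any depth of subdirectories.
    return sorted(glob.glob(os.path.join(mount_path, pattern), recursive=True))


if __name__ == "__main__":
    # /tfx/tfx_pv is the mountPath used in the worker Deployment.
    try:
        files = check_pipeline_root("/tfx/tfx_pv")
        print(f"found {len(files)} .gz file(s) under /tfx/tfx_pv")
    except FileNotFoundError as err:
        print(f"mount check failed: {err}")
```

Running this in each worker pod (for example through an interactive shell in the container) shows whether the TFRecord files named in the NOT_FOUND message are reachable from that worker.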

Attention:

A persistentVolumeClaim can only be used by resources in the same namespace. In this example, the persistentVolumeClaim "tfx-pv-claim" is in namespace "kubeflow", so the worker Service and worker Deployment should also specify namespace "kubeflow". Otherwise the pod cannot be scheduled and the following error is raised:

...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  83s   default-scheduler  0/2 nodes are available: persistentvolumeclaim "tfx-pv-claim" not found. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
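
The namespace rule above can be checked before deploying. The sketch below is an illustration only: it assumes the Deployment manifest has already been loaded into a Python dict (for example with a YAML parser), and both the helper name `pvc_namespace_mismatches` and the `pvc_namespaces` mapping are hypothetical. It also assumes a manifest without an explicit namespace lands in "default", which depends on the kubectl context.

```python
def pvc_namespace_mismatches(deployment, pvc_namespaces):
    """Return claim names whose namespace differs from the Deployment's.

    deployment: a Deployment manifest as a dict.
    pvc_namespaces: hypothetical mapping of claim name -> namespace
        in which that persistentVolumeClaim was created.
    """
    # Assumption: no explicit namespace means "default".
    namespace = deployment.get("metadata", {}).get("namespace", "default")
    volumes = deployment["spec"]["template"]["spec"].get("volumes", [])
    mismatches = []
    for volume in volumes:
        claim = volume.get("persistentVolumeClaim")
        # A claim that is unknown or lives elsewhere cannot be bound by this pod.
        if claim and pvc_namespaces.get(claim["claimName"]) != namespace:
            mismatches.append(claim["claimName"])
    return mismatches


# Trimmed-down version of the worker Deployment from this post.
worker = {
    "metadata": {"name": "dist-strat-example-worker-0", "namespace": "kubeflow"},
    "spec": {"template": {"spec": {"volumes": [
        {"name": "tfx-pv", "persistentVolumeClaim": {"claimName": "tfx-pv-claim"}},
    ]}}},
}
print(pvc_namespace_mismatches(worker, {"tfx-pv-claim": "kubeflow"}))
```

An empty list means every claim referenced by the Deployment lives in the Deployment's own namespace; any name in the list would lead to the FailedScheduling event shown above.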

posted on 2024-02-14 17:58 zhenxia-jiuyou
