[ERROR: tf distribute strategy parameter server: tfx component trainer: OutOfRangeError(), Node: 'cond/IteratorGetNext' End of sequence]

Log of pod tfx-component-trainer (before the fix):

2024-02-14 09:43:48.571820: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:58] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 5019, Output num: 7
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1707903828.498704799","description":"Error received from peer ipv4:10.105.206.29:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 5019, Output num: 7","grpc_status":3} [type.googleapis.com/tensorflow.core.platform.ErrorSourceProto='\x08\x05']
ERROR:tensorflow: /job:worker/task:1 encountered the following error when processing closure: OutOfRangeError():Graph execution error:

Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
End of sequence
	 [[{{node cond/IteratorGetNext}}]]
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1707903828.595072820","description":"Error received from peer ipv4:10.102.137.138:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" End of sequence\n\t [[{{node cond/IteratorGetNext}}]]","grpc_status":11} [Op:__inference_test_function_29421]
ERROR:tensorflow:Start cancelling closures due to error OutOfRangeError(): Graph execution error:

[SOLUTION]

This error occurs because the validation dataset is finite: during evaluation, the 'cond/IteratorGetNext' op reaches the end of the dataset and the worker reports OutOfRangeError ("End of sequence") instead of ending the pass gracefully. With parameter server training, datasets passed to Model.fit should repeat indefinitely, with the number of steps given explicitly.
The solution is:

# repeat the validation dataset indefinitely
validation_dataset = validation_dataset.repeat()
# specify validation_steps; one step = one batch
model.fit(validation_steps=<an integer>, ...)
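
For reference, a minimal sketch of how this fix might sit inside the TFX Trainer module. The helper names _input_fn and _build_keras_model, and the use of fn_args.train_steps / fn_args.eval_steps, are assumptions based on a typical TFX Trainer layout, not taken from detect_anomalies_in_wafer_trainer.py:

# Sketch only; _input_fn and _build_keras_model are assumed helpers.
def run_fn(fn_args):
    train_dataset = _input_fn(fn_args.train_files).repeat()        # never exhausted
    validation_dataset = _input_fn(fn_args.eval_files).repeat()    # never exhausted

    model = _build_keras_model()
    model.fit(
        train_dataset,
        validation_data=validation_dataset,
        epochs=50,
        steps_per_epoch=fn_args.train_steps,     # bounds each training epoch
        validation_steps=fn_args.eval_steps,     # bounds each validation pass
    )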

Log of pod tfx-component-trainer (after the fix):

Epoch 1/50
/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py:467: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.
  warnings.warn("To make it possible to preserve tf.data options across "
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
21/21 - 31s - loss: 0.7040 - cross entropy: 0.7036 - tp: 882.0000 - fp: 820.0000 - tn: 547.0000 - fn: 439.0000 - precision: 0.5182 - recall: 0.6677 - auc: 0.5415 - prc: 0.5361 - val_loss: 0.6753 - val_cross entropy: 0.6749 - val_tp: 30.0000 - val_fp: 181.0000 - val_tn: 166.0000 - val_fn: 7.0000 - val_precision: 0.1422 - val_recall: 0.8108 - val_auc: 0.7620 - val_prc: 0.3268 - 31s/epoch - 1s/step
Epoch 2/50
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
21/21 - 14s - loss: 0.5821 - cross entropy: 0.5817 - tp: 1036.0000 - fp: 439.0000 - tn: 926.0000 - fn: 287.0000 - precision: 0.7024 - recall: 0.7831 - auc: 0.8096 - prc: 0.8139 - val_loss: 0.5677 - val_cross entropy: 0.5673 - val_tp: 25.0000 - val_fp: 82.0000 - val_tn: 271.0000 - val_fn: 6.0000 - val_precision: 0.2336 - val_recall: 0.8065 - val_auc: 0.8646 - val_prc: 0.3799 - 14s/epoch - 657ms/step
Epoch 3/50
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.

Note:

  1. Keras Model.fit with parameter server training assumes that each worker receives the same dataset, except when it is shuffled differently. Therefore, by calling Dataset.shuffle, you ensure more even iterations over the data.
    Because workers do not synchronize, they may finish processing their datasets at different times. Therefore, the easiest way to define epochs with parameter server training is to use Dataset.repeat (which repeats a dataset indefinitely when called without an argument) and to specify the steps_per_epoch argument in the Model.fit call (see the standalone sketch after this note). [1]
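
A standalone illustration of this note, using dummy in-memory data and an assumed batch size of 64 (so 1344 examples give the 21 steps per epoch seen in the log above); this is not the actual pipeline code:

import tensorflow as tf

# Dummy in-memory data, for illustration only.
features = tf.random.uniform((1344, 8))
labels = tf.random.uniform((1344, 1), maxval=2, dtype=tf.int32)

train_dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1344)   # under parameter server training, each worker shuffles differently
    .repeat()                    # repeats indefinitely, so IteratorGetNext never hits "End of sequence"
    .batch(64)
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# steps_per_epoch bounds each epoch because the dataset never ends:
# 1344 examples / 64 per batch = 21 steps, matching the "21/21" lines above.
model.fit(train_dataset, epochs=2, steps_per_epoch=21)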

References:

  1. https://tensorflow.google.cn/tutorials/distribute/parameter_server_training