[ERROR: tf.distribute ParameterServerStrategy, TFX Trainer component: model.save() fails with "failed to connect to all addresses"]
log of pod tfx-component-trainer:
2024-02-14 13:56:45.656154: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'examples' with dtype string and shape [?]
[[{{node examples}}]]
WARNING:absl:Found untraced functions such as _update_step_xla while saving (showing 1 of 1). These functions will not be directly callable after loading.
2024-02-14 13:56:58.654607: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'serving_default_examples' with dtype string and shape [?]
[[{{node serving_default_examples}}]]
ERROR:absl:Execution 81 failed.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
main(sys.argv[1:])
File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
execution_info = component_launcher.launch()
File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
executor_output = self._run_executor(execution_info)
File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
executor_output = self._executor_operator.run_executor(execution_info)
File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
return run_with_executor(execution_info, executor)
File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
result = executor.Do(execution_info.input_dict, output_dict,
File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
run_fn(fn_args)
File "/tmp/tmp6whwtr0z/detect_anomalies_in_wafer_trainer.py", line 339, in run_fn
model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnavailableError: Graph execution error:
failed to connect to all addresses
Additional GRPC error information from remote target /job:chief/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:
:{"created":"@1707919046.196861281","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1707919046.183670123","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
[[{{node num_shards/_4}}]]
Additional GRPC error information from remote target /job:ps/replica:0/task:0/device:CPU:0:
:{"created":"@1707919046.428929072","description":"Error received from peer ipv4:10.105.27.97:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" failed to connect to all addresses\nAdditional GRPC error information from remote target /job:chief/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:\n:{"created":"@1707919046.196861281","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1707919046.183670123","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}\n\t [[{{node num_shards/_4}}]]","grpc_status":14} [Op:__inference_tf_function_save_108956]
2024-02-14 13:57:51.914254: I tensorflow/core/common_runtime/eager/kernel_and_device.cc:94] Ignoring error status when releasing multi-device function handle UNIMPLEMENTED: Releasing a multi-device component handle on a remote device is not yet implemented.
INFO:tensorflow:ClusterCoordinator destructor: stopping cluster
INFO:tensorflow:ClusterCoordinator destructor: stopping cluster
INFO:tensorflow:Stopping cluster, starting with failure handler
INFO:tensorflow:Stopping cluster, starting with failure handler
INFO:tensorflow:Stopping workers
INFO:tensorflow:Stopping workers
INFO:tensorflow:Stopping queue
INFO:tensorflow:Stopping queue
INFO:tensorflow:Start cancelling remote resource-building functions
INFO:tensorflow:Start cancelling remote resource-building functions
time="2024-02-14T13:58:57.578Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-02-14T13:58:58.482Z" level=error msg="cannot save artifact /mlpipeline-ui-metadata.json" argo=true error="stat /mlpipeline-ui-metadata.json: no such file or directory"
Error: exit status 1
[ANALYSIS]
The traceback points at model.save() inside run_fn(); the trace below follows how fn_args.serving_model_dir is derived and where the save happens.
# In run_fn() of the Trainer component's module file, the failing call:
model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)

# Trainer definition in the pipeline:
trainer = tfx.components.Trainer(
    module_file=module_file,
    examples=example_gen.outputs['examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=_STEPS_PER_EPOCH),
    custom_config={"epochs": 5},
)
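For context, a minimal run_fn sketch showing where the failing save sits. This is an assumed shape, not the real detect_anomalies_in_wafer_trainer.py (which is not shown); the model, cluster resolver, and training loop are placeholders:

import tensorflow as tf
from tfx.components.trainer.fn_args_utils import FnArgs

def run_fn(fn_args: FnArgs):
  # Variables live on the parameter servers, so the model must be built
  # inside the strategy scope (cluster resolver assumed to come from
  # TF_CONFIG set by the pipeline).
  strategy = tf.distribute.experimental.ParameterServerStrategy(
      tf.distribute.cluster_resolver.TFConfigClusterResolver())
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder
    model.compile(optimizer='adam', loss='mse')
  # ... training via tf.distribute.coordinator.ClusterCoordinator elided ...
  # The failing call: saving reads variables back from the ps/chief tasks
  # over gRPC, so the whole cluster must still be reachable at this point.
  model.save(fn_args.serving_model_dir, save_format='tf')

fn_args is built by TFX before run_fn is called; the relevant internals: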
# tfx/components/trainer/fn_args_utils.py
@attr.s
class FnArgs:
  """Args to pass to user defined training/tuning function(s).

  Attributes:
    working_dir: Working dir.
    train_files: A list of patterns for train files.
    eval_files: A list of patterns for eval files.
    train_steps: Number of train steps.
    eval_steps: Number of eval steps.
    schema_path: A single uri for schema file. Will be None if not specified.
    schema_file: Deprecated, use `schema_path` instead.
    transform_graph_path: An optional single uri for transform graph produced
      by TFT. Will be None if not specified.
    transform_output: Deprecated, use `transform_graph_path` instead.
    data_accessor: Contains factories that can create tf.data.Datasets or
      other means to access the train/eval data. They provide a uniform way
      of accessing data, regardless of how the data is stored on disk.
    serving_model_dir: A single uri for the output directory of the serving
      model.
    eval_model_dir: A single uri for the output directory of the eval model.
      Note that this is estimator only, Keras doesn't require it for TFMA.
    model_run_dir: A single uri for the output directory of model training
      related files.
    base_model: An optional base model path that will be used for this
      training.
    hyperparameters: An optional keras_tuner.HyperParameters config.
    custom_config: An optional dictionary passed to the component.
  """
  working_dir = attr.ib(type=str, default=None)
  train_files = attr.ib(type=List[str], default=None)
  eval_files = attr.ib(type=List[str], default=None)
  train_steps = attr.ib(type=int, default=None)
  eval_steps = attr.ib(type=int, default=None)
  schema_path = attr.ib(type=str, default=None)
  schema_file = attr.ib(type=str, default=None)
  transform_graph_path = attr.ib(type=str, default=None)
  transform_output = attr.ib(type=str, default=None)
  data_accessor = attr.ib(type=DataAccessor, default=None)
  serving_model_dir = attr.ib(type=str, default=None)
  eval_model_dir = attr.ib(type=str, default=None)
  model_run_dir = attr.ib(type=str, default=None)
  base_model = attr.ib(type=str, default=None)
  hyperparameters = attr.ib(type=Dict[str, Any], default=None)
  custom_config = attr.ib(type=Dict[str, Any], default=None)
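Aside: the data_accessor factories described above are typically consumed like this in a module file (a sketch following the pattern of the TFX penguin example; the 'label' key and batch size are assumptions):

from tfx_bsl.public import tfxio

def _input_fn(file_pattern, data_accessor, schema, batch_size=32):
  # fn_args.train_files / fn_args.eval_files plus fn_args.data_accessor
  # yield a batched tf.data.Dataset; 'label' is a stand-in label key.
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(batch_size=batch_size, label_key='label'),
      schema).repeat()

get_common_fn_args() in the same file builds these fields: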
def get_common_fn_args(input_dict: Dict[str, List[types.Artifact]],
                       exec_properties: Dict[str, Any],
                       working_dir: Optional[str] = None) -> FnArgs:
  """Get common args of training and tuning."""
  if input_dict.get(standard_component_specs.TRANSFORM_GRAPH_KEY):
    transform_graph_path = artifact_utils.get_single_uri(
        input_dict[standard_component_specs.TRANSFORM_GRAPH_KEY])
  else:
    transform_graph_path = None

  if input_dict.get(standard_component_specs.SCHEMA_KEY):
    schema_path = io_utils.get_only_uri_in_dir(
        artifact_utils.get_single_uri(
            input_dict[standard_component_specs.SCHEMA_KEY]))
  else:
    schema_path = None

  train_args = trainer_pb2.TrainArgs()
  eval_args = trainer_pb2.EvalArgs()
  proto_utils.json_to_proto(
      exec_properties[standard_component_specs.TRAIN_ARGS_KEY], train_args)
  proto_utils.json_to_proto(
      exec_properties[standard_component_specs.EVAL_ARGS_KEY], eval_args)

  # Default behavior is train on `train` split (when splits is empty in train
  # args) and evaluate on `eval` split (when splits is empty in eval args).
  if not train_args.splits:
    train_args.splits.append('train')
    absl.logging.info("Train on the 'train' split when train_args.splits is "
                      'not set.')
  if not eval_args.splits:
    eval_args.splits.append('eval')
    absl.logging.info("Evaluate on the 'eval' split when eval_args.splits is "
                      'not set.')

  train_files = []
  for train_split in train_args.splits:
    train_files.extend([
        io_utils.all_files_pattern(uri)
        for uri in artifact_utils.get_split_uris(
            input_dict[standard_component_specs.EXAMPLES_KEY], train_split)
    ])
  eval_files = []
  for eval_split in eval_args.splits:
    eval_files.extend([
        io_utils.all_files_pattern(uri)
        for uri in artifact_utils.get_split_uris(
            input_dict[standard_component_specs.EXAMPLES_KEY], eval_split)
    ])

  data_accessor = DataAccessor(
      tf_dataset_factory=tfxio_utils.get_tf_dataset_factory_from_artifact(
          input_dict[standard_component_specs.EXAMPLES_KEY],
          _TELEMETRY_DESCRIPTORS),
      record_batch_factory=tfxio_utils.get_record_batch_factory_from_artifact(
          input_dict[standard_component_specs.EXAMPLES_KEY],
          _TELEMETRY_DESCRIPTORS),
      data_view_decode_fn=tfxio_utils.get_data_view_decode_fn_from_artifact(
          input_dict[standard_component_specs.EXAMPLES_KEY],
          _TELEMETRY_DESCRIPTORS))

  # https://github.com/tensorflow/tfx/issues/45: Replace num_steps=0 with
  # num_steps=None. Conversion of the proto to python will set the default
  # value of an int as 0 so modify the value here. Tensorflow will raise an
  # error if num_steps <= 0.
  train_steps = train_args.num_steps or None
  eval_steps = eval_args.num_steps or None

  # Load and deserialize custom config from execution properties.
  # Note that in the component interface the default serialization of custom
  # config is 'null' instead of '{}'. Therefore we need to default the
  # json_utils.loads to 'null' then populate it with an empty dict when
  # needed.
  custom_config = json_utils.loads(
      exec_properties.get(standard_component_specs.CUSTOM_CONFIG_KEY, 'null'))

  # TODO(ruoyu): Make this a dict of tag -> uri instead of list.
  if input_dict.get(standard_component_specs.BASE_MODEL_KEY):
    base_model_artifact = artifact_utils.get_single_instance(
        input_dict[standard_component_specs.BASE_MODEL_KEY])
    base_model = path_utils.serving_model_path(
        base_model_artifact.uri,
        path_utils.is_old_model_artifact(base_model_artifact))
  else:
    base_model = None

  return FnArgs(
      working_dir=working_dir,
      train_files=train_files,
      eval_files=eval_files,
      train_steps=train_steps,
      eval_steps=eval_steps,
      schema_path=schema_path,
      transform_graph_path=transform_graph_path,
      data_accessor=data_accessor,
      base_model=base_model,
      custom_config=custom_config,
  )
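Applied to the Trainer definition above, get_common_fn_args() would produce roughly the following (a values sketch; the URIs are placeholders, not real paths from this run):

FnArgs(
    train_files=['<examples-uri>/Split-train/*'],  # defaults to the 'train' split
    eval_files=['<examples-uri>/Split-eval/*'],    # defaults to the 'eval' split
    train_steps=_STEPS_PER_EPOCH,  # from TrainArgs(num_steps=_STEPS_PER_EPOCH)
    eval_steps=None,               # EvalArgs() defaults num_steps=0 -> None
    transform_graph_path='<transform-graph-uri>',
    custom_config={'epochs': 5},   # deserialized from exec_properties
    ...
)

serving_model_dir is then filled in by the Trainer executor: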
# tfx/components/trainer/executor.py
def _GetFnArgs(self, input_dict: Dict[str, List[types.Artifact]],
               output_dict: Dict[str, List[types.Artifact]],
               exec_properties: Dict[str, Any]) -> fn_args_utils.FnArgs:
  if input_dict.get(standard_component_specs.HYPERPARAMETERS_KEY):
    hyperparameters_file = io_utils.get_only_uri_in_dir(
        artifact_utils.get_single_uri(
            input_dict[standard_component_specs.HYPERPARAMETERS_KEY]))
    hyperparameters_config = json.loads(
        file_io.read_file_to_string(hyperparameters_file))
  else:
    hyperparameters_config = None

  output_path = artifact_utils.get_single_uri(
      output_dict[standard_component_specs.MODEL_KEY])
  serving_model_dir = path_utils.serving_model_dir(output_path)
  ...

# tfx/utils/path_utils.py
def serving_model_dir(output_uri: str, is_old_artifact: bool = False) -> str:
  """Returns directory for exported model for serving purpose."""
  if is_old_artifact:
    return os.path.join(output_uri, _OLD_SERVING_MODEL_DIR)
  return os.path.join(output_uri, path_constants.SERVING_MODEL_DIR)

# tfx/utils/path_constants.py
SERVING_MODEL_DIR = 'Format-Serving'
So fn_args.serving_model_dir is the URI of the Trainer output artifact "model" joined with 'Format-Serving'.
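A quick illustration of that join (the artifact URI is a made-up example, not a path from this run):

import os

output_uri = '/pipeline-root/Trainer/model/81'  # hypothetical artifact URI
serving_model_dir = os.path.join(output_uri, 'Format-Serving')
print(serving_model_dir)  # -> /pipeline-root/Trainer/model/81/Format-Serving

That directory is exactly what the failing save targets: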
File "/tmp/tmp6whwtr0z/detect_anomalies_in_wafer_trainer.py", line 339, in run_fn
model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnavailableError: Graph execution error:
failed to connect to all addresses
[SOLUTION]
The error means the coordinator failed to reach a peer over gRPC ("failed to connect to all addresses"). On inspection, the pod metadata-grpc-deployment had gone down and been recreated during the run, so the failure was transient. Simply retry the pipeline run.