[ERROR: stuck at "INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})"]
# service dist-strat-example-ps-0 definition yaml file
---
kind: Service
apiVersion: v1
metadata:
  name: dist-strat-example-ps-0
spec:
  type: LoadBalancer
  selector:
    app: dist-strat-example-ps-0
  ports:
  - port: 5000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dist-strat-example-ps-0
  name: dist-strat-example-ps-0
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dist-strat-example-ps-0
  template:
    metadata:
      labels:
        app: dist-strat-example-ps-0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - maye-inspiron-5547
      containers:
      - name: tensorflow
        image: tf_std_server:v1
        resources:
          limits:
            #nvidia.com/gpu: 2
        env:
        - name: TF_CONFIG
          value: "{
              \"cluster\": {
                \"worker\": [\"dist-strat-example-worker-0:5000\",\"dist-strat-example-worker-1:5000\"],
                \"ps\": [\"dist-strat-example-ps-0:5000\"]},
              \"task\": {
                \"type\": \"ps\",
                \"index\": \"0\"
              }
            }"
        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: "/var/secrets/google/key.json"
        ports:
        - containerPort: 5000
        command:
        - "/usr/bin/python"
        - "/tf_std_server.py"
        - ""
        #volumeMounts:
        #- name: credential
        #  mountPath: /var/secrets/google
      #volumes:
      #- name: credential
      #  secret:
      #    secretName: credential
---
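The hand-escaped TF_CONFIG string in the Deployment above is easy to get wrong. As a minimal sketch, the same JSON can be generated with Python and pasted into the manifest. The helper name build_tf_config is hypothetical and not part of TensorFlow. Note one difference from the manifest: the sketch emits the task index as an integer, which is the form used in the TensorFlow documentation examples, whereas the manifest uses the string "0".

```python
import json


def build_tf_config(workers, ps, task_type, task_index):
    """Return a TF_CONFIG JSON string for one task of the cluster.

    Hypothetical helper for illustration only; it is not a TensorFlow API.
    """
    return json.dumps({
        "cluster": {"worker": list(workers), "ps": list(ps)},
        "task": {"type": task_type, "index": task_index},
    })


# Addresses copied from the manifest above.
tf_config = build_tf_config(
    workers=["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"],
    ps=["dist-strat-example-ps-0:5000"],
    task_type="ps",
    task_index=0,
)
print(tf_config)
```

The same call with task_type="worker" and task_index 0 or 1 would give the values for the two worker Deployments.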
# run_fn in module file of tfx component trainer
def run_fn(fn_args: tfx.components.FnArgs):
    cluster_dict = {}
    ### Use the ClusterIP here, not the service name; otherwise
    ### the strategy hangs while connecting, as shown above.
    cluster_dict["worker"] = ["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"]
    cluster_dict["ps"] = ["dist-strat-example-ps-0:5000"]
    #cluster_dict["worker"] = ["10.102.74.8:5000", "10.100.198.218:5000"]
    #cluster_dict["ps"] = ["10.96.200.160:5000"]
    cluster_spec = tf.train.ClusterSpec(cluster_dict)
    cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
        cluster_spec, rpc_layer="grpc")
    strategy = tf.distribute.ParameterServerStrategy(
        cluster_resolver,)
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=_TRAIN_BATCH_SIZE,
    )
    resampled_train_dataset = _resample_train_dataset(train_dataset,
                                                      batch_size=_TRAIN_BATCH_SIZE)
    #tf.print(f"resampled_train_dataset {resampled_train_dataset.cardinality()}")
    val_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=_EVAL_BATCH_SIZE,
    )
    with strategy.scope():
        model = _build_keras_model()
    trainer_train_history = model.fit(
        resampled_train_dataset,
        epochs=fn_args.custom_config['epochs'],
        steps_per_epoch=fn_args.train_steps,
        validation_data=val_dataset,
        #callbacks=[tensorboard_callback],
    )
    with open('trainer_train_history.json', 'w') as f:
        json.dump(trainer_train_history.history, f)
    signatures = {
        'serving_default': _get_serve_tf_examples_fn(model, tf_transform_output),
    }
    model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
$ kubectl logs pod-tfx-trainer-component -n kubeflow
...
INFO:absl:Successfully installed '/tfx/pipelines/tfx_user_code_Trainer-0.0+a0a99f38e703a50fc266bc1da356164d31c1f23c893900324e04c03582c72555-py3-none-any.whl'.
INFO:absl:Training model.
INFO:tensorflow:`tf.distribute.experimental.ParameterServerStrategy` is initialized with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:`tf.distribute.experimental.ParameterServerStrategy` is initialized with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl logs dist-strat-example-ps-0-85fdfdddcb-9x6mt
2024-02-14 05:51:36.101034: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 05:51:38.566981: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: UNKNOWN ERROR (34)
2024-02-14 05:51:38.570913: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://dist-strat-example-ps-0:5000
(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$
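The trainer log stops at "is now connecting to cluster" without any error, so it does not say which address is the problem. A minimal sketch of a diagnostic, using only the Python standard library, is to try a plain TCP connection to every "host:port" entry before the strategy is created. The function name unreachable_addresses and the 3-second timeout are assumptions for illustration; the check only shows whether a name resolves and the port accepts connections, not whether the gRPC server behind it is healthy.

```python
import socket


def unreachable_addresses(addresses, timeout=3.0):
    """Return (address, error) pairs for every "host:port" that cannot be reached over TCP.

    Hypothetical diagnostic helper; name resolution failures and refused or
    timed-out connections are all reported as OSError subclasses.
    """
    failures = []
    for address in addresses:
        host, port = address.rsplit(":", 1)
        try:
            # Open and immediately close a TCP connection to the address.
            with socket.create_connection((host, int(port)), timeout=timeout):
                pass
        except OSError as exc:
            failures.append((address, str(exc)))
    return failures
```

Called inside the trainer pod with the entries of cluster_dict["worker"] and cluster_dict["ps"], an empty result would mean every address is reachable; any returned pair names an address that could not be resolved or connected to.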
[SOLUTION]
This hang occurs because service names, e.g. "dist-strat-example-worker-0", are passed to tf.distribute.ParameterServerStrategy(). "dist-strat-example-worker-0" is the service name of worker-0, not its host name, so tf.distribute.ParameterServerStrategy() treats worker-0 as not ready and keeps waiting.
The ClusterIP of each service, e.g. of "dist-strat-example-worker-0", should be used instead, so that tf.distribute.ParameterServerStrategy() can connect to it. The ClusterIPs are listed by kubectl get service:
(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl get service
NAME                          TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
dist-strat-example-ps-0       LoadBalancer   10.96.200.160    <pending>     5000:32409/TCP   53m
dist-strat-example-worker-0   LoadBalancer   10.102.74.8      <pending>     5000:30550/TCP   53m
dist-strat-example-worker-1   LoadBalancer   10.100.198.218   <pending>     5000:31080/TCP   53m
kubernetes                    ClusterIP      10.96.0.1        <none>        443/TCP          4d17h
def run_fn(fn_args: tfx.components.FnArgs):
    cluster_dict = {}
    #cluster_dict["worker"] = ["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"]
    #cluster_dict["ps"] = ["dist-strat-example-ps-0:5000"]
    cluster_dict["worker"] = ["10.102.74.8:5000", "10.100.198.218:5000"]
    cluster_dict["ps"] = ["10.96.200.160:5000"]
    cluster_spec = tf.train.ClusterSpec(cluster_dict)
    cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
        cluster_spec, rpc_layer="grpc")
    strategy = tf.distribute.ParameterServerStrategy(
        cluster_resolver,)
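Hard-coded ClusterIPs stop being valid if the services are deleted and recreated. One possible alternative, sketched below and not tested against this cluster, is to resolve the names to IP addresses when run_fn starts. The helper resolve_cluster is hypothetical. It only helps if the names actually resolve from inside the trainer pod: that pod's log was fetched with -n kubeflow while kubectl get service was run without a namespace, so the services may live in a different namespace, in which case the namespace-qualified form <service>.<namespace>.svc.cluster.local may be needed. That is an assumption about this setup, not something the logs confirm.

```python
import socket


def resolve_cluster(cluster_dict, resolve=socket.gethostbyname):
    """Replace the host part of every "host:port" entry with its resolved IP address.

    Hypothetical helper; `resolve` is injectable so the lookup can be
    swapped out (for example in tests).
    """
    resolved = {}
    for job, addresses in cluster_dict.items():
        resolved[job] = []
        for address in addresses:
            host, port = address.rsplit(":", 1)
            resolved[job].append(f"{resolve(host)}:{port}")
    return resolved
```

The result has the same shape as cluster_dict, so it could be passed to tf.train.ClusterSpec in place of the hand-written IP lists. A name that does not resolve raises socket.gaierror immediately, which is easier to diagnose than a silent wait.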
Log of pod tfx-trainer-component after the fix (training now starts):
. Setting to DenseTensor.
INFO:absl:Feature feature_998 has a shape dim {
size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature feature_999 has a shape dim {
size: 1
}
. Setting to DenseTensor.
INFO:absl:Model: "model"
INFO:absl:__________________________________________________________________________________________________
INFO:absl: Layer (type) Output Shape Param # Connected to
INFO:absl:==================================================================================================
INFO:absl: feature_2 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1244 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_3 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_352 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_969 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1533 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1485 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_399 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1277 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1230 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_105 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1279 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_27 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1409 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_4 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1508 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1437 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_21 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_735 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_466 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1292 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1248 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1168 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_188 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_628 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1059 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1436 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1082 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1502 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1501 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_567 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_380 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1403 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_639 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_479 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_12 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1514 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_458 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_140 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1231 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1455 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1463 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1024 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_50 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_156 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1528 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: feature_1511 (InputLayer) [(None, 1)] 0 []
INFO:absl:
INFO:absl: concatenate (Concatenate) (None, 48) 0 ['feature_2[0][0]',
INFO:absl: 'feature_1244[0][0]',
INFO:absl: 'feature_3[0][0]',
INFO:absl: 'feature_352[0][0]',
INFO:absl: 'feature_969[0][0]',
INFO:absl: 'feature_1533[0][0]',
INFO:absl: 'feature_1[0][0]',
INFO:absl: 'feature_1485[0][0]',
INFO:absl: 'feature_399[0][0]',
INFO:absl: 'feature_1277[0][0]',
INFO:absl: 'feature_1230[0][0]',
INFO:absl: 'feature_105[0][0]',
INFO:absl: 'feature_1279[0][0]',
INFO:absl: 'feature_27[0][0]',
INFO:absl: 'feature_1409[0][0]',
INFO:absl: 'feature_4[0][0]',
INFO:absl: 'feature_1508[0][0]',
INFO:absl: 'feature_1437[0][0]',
INFO:absl: 'feature_21[0][0]',
INFO:absl: 'feature_735[0][0]',
INFO:absl: 'feature_466[0][0]',
INFO:absl: 'feature_1292[0][0]',
INFO:absl: 'feature_1248[0][0]',
INFO:absl: 'feature_1168[0][0]',
INFO:absl: 'feature_188[0][0]',
INFO:absl: 'feature_628[0][0]',
INFO:absl: 'feature_1059[0][0]',
INFO:absl: 'feature_1436[0][0]',
INFO:absl: 'feature_1082[0][0]',
INFO:absl: 'feature_1502[0][0]',
INFO:absl: 'feature_1501[0][0]',
INFO:absl: 'feature_567[0][0]',
INFO:absl: 'feature_380[0][0]',
INFO:absl: 'feature_1403[0][0]',
INFO:absl: 'feature_639[0][0]',
INFO:absl: 'feature_479[0][0]',
INFO:absl: 'feature_12[0][0]',
INFO:absl: 'feature_1514[0][0]',
INFO:absl: 'feature_458[0][0]',
INFO:absl: 'feature_140[0][0]',
INFO:absl: 'feature_1231[0][0]',
INFO:absl: 'feature_1455[0][0]',
INFO:absl: 'feature_1463[0][0]',
INFO:absl: 'feature_1024[0][0]',
INFO:absl: 'feature_50[0][0]',
INFO:absl: 'feature_156[0][0]',
INFO:absl: 'feature_1528[0][0]',
INFO:absl: 'feature_1511[0][0]']
INFO:absl:
INFO:absl: dense (Dense) (None, 16) 784 ['concatenate[0][0]']
INFO:absl:
INFO:absl: dense_1 (Dense) (None, 16) 272 ['dense[0][0]']
INFO:absl:
INFO:absl: dense_2 (Dense) (None, 1) 17 ['dense_1[0][0]']
INFO:absl:
INFO:absl:==================================================================================================
INFO:absl:Total params: 1,073
INFO:absl:Trainable params: 1,073
INFO:absl:Non-trainable params: 0
INFO:absl:__________________________________________________________________________________________________
Epoch 1/50
/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py:467: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.
warnings.warn("To make it possible to preserve tf.data options across "
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
(base) maye@maye-Inspiron-5547:~$