Debug: tf.distribute ParameterServerStrategy: stuck at "INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster"

[ERROR: stuck at "INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})"]

# YAML definition of the Service and Deployment for dist-strat-example-ps-0
---
kind: Service
apiVersion: v1
metadata:
  name: dist-strat-example-ps-0
spec:
  type: LoadBalancer
  
  selector:
    app: dist-strat-example-ps-0  
  
  ports:
  - port: 5000
---

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dist-strat-example-ps-0

  name: dist-strat-example-ps-0

spec:
        
  replicas: 1
  
  selector:
    matchLabels:
      app: dist-strat-example-ps-0  
  
  
  template:
    metadata:
      labels:
        app: dist-strat-example-ps-0 
  
  
    spec:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - maye-inspiron-5547   

    
      containers:

      - name: tensorflow
        image: tf_std_server:v1
        resources:
          limits:
            #nvidia.com/gpu: 2

        env:

        - name: TF_CONFIG
          value: "{
  \"cluster\": {
    \"worker\": [\"dist-strat-example-worker-0:5000\",\"dist-strat-example-worker-1:5000\"],
    \"ps\": [\"dist-strat-example-ps-0:5000\"]},
  \"task\": {
    \"type\": \"ps\",
    \"index\": \"0\"
  }
}"

        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: "/var/secrets/google/key.json"
        ports:
        - containerPort: 5000

        command:
        - "/usr/bin/python"
        - "/tf_std_server.py"
        - ""
        #volumeMounts:
        #- name: credential
        #  mountPath: /var/secrets/google
      #volumes:
      #- name: credential
      #  secret:
      #    secretName: credential
---
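
The hand-escaped TF_CONFIG string in the Deployment above is easy to get wrong. As a minimal sketch (the helper name build_tf_config is hypothetical and not part of TensorFlow), the same JSON can be generated with json.dumps and then pasted into the manifest. The sketch writes the task index as an integer, which is how the TensorFlow documentation shows it; the manifest above uses the string "0".

```python
import json

# Addresses copied from the manifest above.
CLUSTER = {
    "worker": ["dist-strat-example-worker-0:5000",
               "dist-strat-example-worker-1:5000"],
    "ps": ["dist-strat-example-ps-0:5000"],
}


def build_tf_config(cluster, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index {task_index} out of range for {task_type!r}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


if __name__ == "__main__":
    # TF_CONFIG for the parameter server pod defined above.
    print(build_tf_config(CLUSTER, "ps", 0))
```

Generating the value this way also rejects a task index that has no matching address in the cluster, which a hand-written string would silently accept.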

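# run_fn in the module file of the TFX Trainer component.
# Imports are shown for completeness; the underscore-prefixed helpers and
# constants (_input_fn, _build_keras_model, ...) are defined elsewhere in the module file.
import json

import tensorflow as tf
import tensorflow_transform as tft
from tfx import v1 as tfx


def run_fn(fn_args: tfx.components.FnArgs):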
    
    cluster_dict = {}

    # Using the Service names below makes the trainer hang at
    # "ParameterServerStrategyV2 is now connecting to cluster".
    # Use the ClusterIPs (commented out further down) instead.
    cluster_dict["worker"] = ["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"]
    cluster_dict["ps"] = ["dist-strat-example-ps-0:5000"]
    
    #cluster_dict["worker"] = ["10.102.74.8:5000", "10.100.198.218:5000"]
    #cluster_dict["ps"] = ["10.96.200.160:5000"]
    
    cluster_spec = tf.train.ClusterSpec(cluster_dict)
    
    cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
      cluster_spec, rpc_layer="grpc")
    
    strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)
    
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)
    
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=_TRAIN_BATCH_SIZE,
    )
    
    
    resampled_train_dataset = _resample_train_dataset(train_dataset, 
                                                      batch_size=_TRAIN_BATCH_SIZE)
    
    #tf.print(f"resampled_train_dataset {resampled_train_dataset.cardinality()}")
    
    val_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=_EVAL_BATCH_SIZE,
    )
          
    with strategy.scope():
        model = _build_keras_model()

    trainer_train_history = model.fit(
        resampled_train_dataset,
        epochs=fn_args.custom_config['epochs'],
        steps_per_epoch=fn_args.train_steps,
        validation_data=val_dataset,
        #callbacks=[tensorboard_callback],
    )
    
    with open('trainer_train_history.json', 'w') as f:
        json.dump(trainer_train_history.history, f)
    
    signatures = {
        'serving_default': _get_serve_tf_examples_fn(model, tf_transform_output),
    }
    
    model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)

$ kubectl logs pod-tfx-trainer-component -n kubeflow
...
INFO:absl:Successfully installed '/tfx/pipelines/tfx_user_code_Trainer-0.0+a0a99f38e703a50fc266bc1da356164d31c1f23c893900324e04c03582c72555-py3-none-any.whl'.
INFO:absl:Training model.
INFO:tensorflow:`tf.distribute.experimental.ParameterServerStrategy` is initialized with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:`tf.distribute.experimental.ParameterServerStrategy` is initialized with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})

(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl logs dist-strat-example-ps-0-85fdfdddcb-9x6mt 
2024-02-14 05:51:36.101034: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 05:51:38.566981: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: UNKNOWN ERROR (34)
2024-02-14 05:51:38.570913: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://dist-strat-example-ps-0:5000
(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$

[SOLUTION]

The trainer hangs because Service names such as "dist-strat-example-worker-0" are passed to tf.distribute.ParameterServerStrategy(). "dist-strat-example-worker-0" is the name of the Kubernetes Service in front of worker-0, not a host name that the trainer pod can reach in this setup, so tf.distribute.ParameterServerStrategy() treats worker-0 as not ready and keeps waiting.
Use the ClusterIP of each Service (for example, the ClusterIP of "dist-strat-example-worker-0") instead, so that tf.distribute.ParameterServerStrategy() can connect to it.
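
Whether a given name actually resolves from inside the trainer pod can be checked before the strategy is created. The following is a minimal sketch and not part of the original setup: unresolved_addresses is a hypothetical helper that tries to resolve every host in cluster_dict with the standard-library socket.getaddrinfo and returns the entries that fail. In the logs here the trainer pod runs in the kubeflow namespace while the Services are listed without -n, so they may live in a different namespace; a bare Service name normally resolves only from within the Service's own namespace, which would be consistent with the hang.

```python
import socket


def unresolved_addresses(cluster_dict, resolve=socket.getaddrinfo):
    """Return (job, address) pairs whose host part fails name resolution.

    `resolve` is injectable so the check can be exercised without a cluster.
    """
    failures = []
    for job, addresses in cluster_dict.items():
        for address in addresses:
            # Split "host:port" on the last colon.
            host, _, port = address.rpartition(":")
            try:
                resolve(host, int(port))
            except socket.gaierror:
                failures.append((job, address))
    return failures
```

Calling unresolved_addresses(cluster_dict) at the top of run_fn and logging the result would show which entries the pod cannot resolve. An empty list only means the names resolve; it does not prove that the servers are listening on the port.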

(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl get service
NAME                          TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
dist-strat-example-ps-0       LoadBalancer   10.96.200.160    <pending>     5000:32409/TCP   53m
dist-strat-example-worker-0   LoadBalancer   10.102.74.8      <pending>     5000:30550/TCP   53m
dist-strat-example-worker-1   LoadBalancer   10.100.198.218   <pending>     5000:31080/TCP   53m
kubernetes                    ClusterIP      10.96.0.1        <none>        443/TCP          4d17h
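
ClusterIPs change whenever the Services are recreated, so hard-coded values go stale. As a rough sketch, the table printed by kubectl get service can be parsed into the cluster_dict. The helper cluster_dict_from_table and the naming convention it relies on (dist-strat-example-<job>-<index>) are assumptions taken from the Service names above, not an API of TensorFlow or Kubernetes.

```python
import re

# Output of `kubectl get service`, copied from above.
KUBECTL_OUTPUT = """\
NAME                          TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
dist-strat-example-ps-0       LoadBalancer   10.96.200.160    <pending>     5000:32409/TCP   53m
dist-strat-example-worker-0   LoadBalancer   10.102.74.8      <pending>     5000:30550/TCP   53m
dist-strat-example-worker-1   LoadBalancer   10.100.198.218   <pending>     5000:31080/TCP   53m
kubernetes                    ClusterIP      10.96.0.1        <none>        443/TCP          4d17h
"""

# Assumed naming convention: dist-strat-example-<job>-<index>.
SERVICE_NAME = re.compile(r"^dist-strat-example-(ps|worker)-(\d+)$")


def cluster_dict_from_table(table, port=5000):
    """Build {"ps": [...], "worker": [...]} from `kubectl get service` text."""
    found = {}
    for line in table.splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 3:
            continue
        match = SERVICE_NAME.match(columns[0])
        if match is None:
            continue  # unrelated Service, e.g. "kubernetes"
        job, index = match.group(1), int(match.group(2))
        # The third column is CLUSTER-IP.
        found.setdefault(job, {})[index] = f"{columns[2]}:{port}"
    # Order each job's addresses by task index.
    return {job: [tasks[i] for i in sorted(tasks)] for job, tasks in found.items()}


if __name__ == "__main__":
    print(cluster_dict_from_table(KUBECTL_OUTPUT))
```

For the table shown here, the result contains the same ps and worker addresses that are hard-coded in run_fn.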

def run_fn(fn_args: tfx.components.FnArgs):
    
    cluster_dict = {}
    #cluster_dict["worker"] = ["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"]
    #cluster_dict["ps"] = ["dist-strat-example-ps-0:5000"]
    
    cluster_dict["worker"] = ["10.102.74.8:5000", "10.100.198.218:5000"]
    cluster_dict["ps"] = ["10.96.200.160:5000"]
    
    cluster_spec = tf.train.ClusterSpec(cluster_dict)
    
    cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
      cluster_spec, rpc_layer="grpc")
    
    strategy = tf.distribute.ParameterServerStrategy(
    cluster_resolver,)

Log of pod tfx-trainer-component after the fix:

. Setting to DenseTensor.
INFO:absl:Feature feature_998 has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature feature_999 has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Model: "model"
INFO:absl:__________________________________________________________________________________________________
INFO:absl: Layer (type)                   Output Shape         Param #     Connected to                     
INFO:absl:==================================================================================================
INFO:absl: feature_2 (InputLayer)         [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1244 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_3 (InputLayer)         [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_352 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_969 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1533 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1 (InputLayer)         [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1485 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_399 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1277 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1230 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_105 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1279 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_27 (InputLayer)        [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1409 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_4 (InputLayer)         [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1508 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1437 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_21 (InputLayer)        [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_735 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_466 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1292 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1248 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1168 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_188 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_628 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1059 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1436 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1082 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1502 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1501 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_567 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_380 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1403 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_639 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_479 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_12 (InputLayer)        [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1514 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_458 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_140 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1231 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1455 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1463 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1024 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_50 (InputLayer)        [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_156 (InputLayer)       [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1528 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: feature_1511 (InputLayer)      [(None, 1)]          0           []                               
INFO:absl:                                                                                                  
INFO:absl: concatenate (Concatenate)      (None, 48)           0           ['feature_2[0][0]',              
INFO:absl:                                                                  'feature_1244[0][0]',           
INFO:absl:                                                                  'feature_3[0][0]',              
INFO:absl:                                                                  'feature_352[0][0]',            
INFO:absl:                                                                  'feature_969[0][0]',            
INFO:absl:                                                                  'feature_1533[0][0]',           
INFO:absl:                                                                  'feature_1[0][0]',              
INFO:absl:                                                                  'feature_1485[0][0]',           
INFO:absl:                                                                  'feature_399[0][0]',            
INFO:absl:                                                                  'feature_1277[0][0]',           
INFO:absl:                                                                  'feature_1230[0][0]',           
INFO:absl:                                                                  'feature_105[0][0]',            
INFO:absl:                                                                  'feature_1279[0][0]',           
INFO:absl:                                                                  'feature_27[0][0]',             
INFO:absl:                                                                  'feature_1409[0][0]',           
INFO:absl:                                                                  'feature_4[0][0]',              
INFO:absl:                                                                  'feature_1508[0][0]',           
INFO:absl:                                                                  'feature_1437[0][0]',           
INFO:absl:                                                                  'feature_21[0][0]',             
INFO:absl:                                                                  'feature_735[0][0]',            
INFO:absl:                                                                  'feature_466[0][0]',            
INFO:absl:                                                                  'feature_1292[0][0]',           
INFO:absl:                                                                  'feature_1248[0][0]',           
INFO:absl:                                                                  'feature_1168[0][0]',           
INFO:absl:                                                                  'feature_188[0][0]',            
INFO:absl:                                                                  'feature_628[0][0]',            
INFO:absl:                                                                  'feature_1059[0][0]',           
INFO:absl:                                                                  'feature_1436[0][0]',           
INFO:absl:                                                                  'feature_1082[0][0]',           
INFO:absl:                                                                  'feature_1502[0][0]',           
INFO:absl:                                                                  'feature_1501[0][0]',           
INFO:absl:                                                                  'feature_567[0][0]',            
INFO:absl:                                                                  'feature_380[0][0]',            
INFO:absl:                                                                  'feature_1403[0][0]',           
INFO:absl:                                                                  'feature_639[0][0]',            
INFO:absl:                                                                  'feature_479[0][0]',            
INFO:absl:                                                                  'feature_12[0][0]',             
INFO:absl:                                                                  'feature_1514[0][0]',           
INFO:absl:                                                                  'feature_458[0][0]',            
INFO:absl:                                                                  'feature_140[0][0]',            
INFO:absl:                                                                  'feature_1231[0][0]',           
INFO:absl:                                                                  'feature_1455[0][0]',           
INFO:absl:                                                                  'feature_1463[0][0]',           
INFO:absl:                                                                  'feature_1024[0][0]',           
INFO:absl:                                                                  'feature_50[0][0]',             
INFO:absl:                                                                  'feature_156[0][0]',            
INFO:absl:                                                                  'feature_1528[0][0]',           
INFO:absl:                                                                  'feature_1511[0][0]']           
INFO:absl:                                                                                                  
INFO:absl: dense (Dense)                  (None, 16)           784         ['concatenate[0][0]']            
INFO:absl:                                                                                                  
INFO:absl: dense_1 (Dense)                (None, 16)           272         ['dense[0][0]']                  
INFO:absl:                                                                                                  
INFO:absl: dense_2 (Dense)                (None, 1)            17          ['dense_1[0][0]']                
INFO:absl:                                                                                                  
INFO:absl:==================================================================================================
INFO:absl:Total params: 1,073
INFO:absl:Trainable params: 1,073
INFO:absl:Non-trainable params: 0
INFO:absl:__________________________________________________________________________________________________
Epoch 1/50
/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py:467: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.
  warnings.warn("To make it possible to preserve tf.data options across "
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
(base) maye@maye-Inspiron-5547:~$

posted on 2024-02-14 16:17 by zhenxia-jiuyou
