TensorFlow Debug: InvalidArgumentError: Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated
The original code was:
#Learning Algorithm for CADE
# config = tf.ConfigProto(allow_soft_placement = True)
sess = tf.InteractiveSession()
maxIter = 100
ite = int(0)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
while ite<maxIter:
    t1 = time()
    print('Iteration%d start at %.4f...'%(ite,t1))
    for i in range(train_usernums):
        _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
    print('\t loss:%f'%(_loss))
    out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})
    out = out*(validuimat.todense()==0)
    out = np.argsort(out)[:,::-1]
    for _k in [1,5,10]:
        _MAP = MAP(testuidict,out,_k)
        print('Iteration%d : MAP@%d %f'%(ite,_k,_MAP))
    print('Iteration%d used time:%.4f s'%(ite,time()-t1))
    ite+=1
It then raised this error:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1322 try:
-> 1323 return fn(*args)
   1324 except errors.OpError as e:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301 feed_dict, fetch_list, target_list,
-> 1302 status, run_metadata)
   1303

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472 compat.as_text(c_api.TF_Message(self.status.status)),
--> 473 c_api.TF_GetCode(self.status.status))
    474 # Delete the underlying status object from memory otherwise it stays alive

InvalidArgumentError: Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices:
Const: GPU CPU
VariableV2: GPU CPU
UnsortedSegmentSum: GPU CPU
Identity: GPU CPU
L2Loss: GPU CPU
Shape: GPU CPU
Mul: GPU CPU
Gather: GPU CPU
SparseApplyAdagrad: CPU
Cast: GPU CPU
Unique: GPU CPU
StridedSlice: GPU CPU
    [[Node: gradients/embedding_lookup_grad/ToInt32 = Cast[DstT=DT_INT32, SrcT=DT_INT64, _class=["loc:@EmbeddingParams"]](gradients/embedding_lookup_grad/Shape)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError Traceback (most recent call last)
<ipython-input-6-01dec1a8f42a> in <module>()
     10 print('Iteration%d start at %.4f...'%(ite,t1))
     11 for i in range(train_usernums):
---> 12 _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
     13 print('\t loss:%f'%(_loss))
     14 out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    887 try:
    888 result = self._run(None, fetches, feed_dict, options_ptr,
--> 889 run_metadata_ptr)
    890 if run_metadata:
    891 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1118 if final_fetches or final_targets or (handle and feed_dict_tensor):
   1119 results = self._do_run(handle, final_targets, final_fetches,
-> 1120 feed_dict_tensor, options, run_metadata)
   1121 else:
   1122 results = []

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1315 if handle is None:
   1316 return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317 options, run_metadata)
   1318 else:
   1319 return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1334 except KeyError:
   1335 pass
-> 1336 raise type(e)(node_def, op, message)
   1337
   1338 def _extend_graph(self):

InvalidArgumentError: Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices:
Const: GPU CPU
VariableV2: GPU CPU
UnsortedSegmentSum: GPU CPU
Identity: GPU CPU
L2Loss: GPU CPU
Shape: GPU CPU
Mul: GPU CPU
Gather: GPU CPU
SparseApplyAdagrad: CPU
Cast: GPU CPU
Unique: GPU CPU
StridedSlice: GPU CPU
    [[Node: gradients/embedding_lookup_grad/ToInt32 = Cast[DstT=DT_INT32, SrcT=DT_INT64, _class=["loc:@EmbeddingParams"]](gradients/embedding_lookup_grad/Shape)]]

Caused by op 'gradients/embedding_lookup_grad/ToInt32', defined at:
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 486, in start
    self.io_loop.start()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 127, in start
    self.asyncio_loop.run_forever()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 536, in <lambda>
    self.io_loop.add_callback(lambda : self._handle_events(self.socket, 0))
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2903, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-d4a6590ba166>", line 27, in <module>
    train = optimizer.minimize(loss)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
    grad_loss=grad_loss)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 414, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
    return grad_fn() # Exit early
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 367, in _GatherGrad
    params_shape = math_ops.to_int32(params_shape)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 826, in to_int32
    return cast(x, dtypes.int32, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 745, in cast
    return gen_math_ops.cast(x, base_type, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 892, in cast
    "Cast", x=x, DstT=DstT, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

...which was originally created as op 'embedding_lookup', defined at:
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 23 identical lines from previous traceback]
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-d4a6590ba166>", line 11, in <module>
    ve = tf.nn.embedding_lookup(embedding_params,v)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 328, in embedding_lookup
    transform_fn=None)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 150, in _embedding_lookup_and_transform
    result = _clip(_gather(params[0], ids, name=name), ids, max_norm)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 54, in _gather
    return array_ops.gather(params, ids, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2486, in gather
    params, indices, validate_indices=validate_indices, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1834, in gather
    validate_indices=validate_indices, name=name)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/likewise-open/SENSETIME/liupengcheng/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'gradients/embedding_lookup_grad/ToInt32': Could not satisfy explicit device specification '' because the node was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices:
Const: GPU CPU
VariableV2: GPU CPU
UnsortedSegmentSum: GPU CPU
Identity: GPU CPU
L2Loss: GPU CPU
Shape: GPU CPU
Mul: GPU CPU
Gather: GPU CPU
SparseApplyAdagrad: CPU
Cast: GPU CPU
Unique: GPU CPU
StridedSlice: GPU CPU
    [[Node: gradients/embedding_lookup_grad/ToInt32 = Cast[DstT=DT_INT32, SrcT=DT_INT64, _class=["loc:@EmbeddingParams"]](gradients/embedding_lookup_grad/Shape)]]
Googling led me to: https://github.com/tensorflow/tensorflow/issues/2292
It says this is a GPU configuration problem:
I just follow mrry's suggestion here, adding "allow_soft_placement=True" as follows:

    config = tf.ConfigProto(allow_soft_placement = True)
    sess = tf.Session(config = config)

Then it works. I reviewed the Using GPUs in tutorial. It mentions adding "allow_soft_placement" under the error "Could not satisfy explicit device specification '/gpu:X' ". But it not mentions it could also solve the error "no supported kernel for GPU devices is available". Maybe it's better to add this in tutorial text in order to avoid confusing future users.
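The "Colocation Debug Info" in the first traceback already hints at why soft placement matters here: every op in the group lists both GPU and CPU except one. Below is a minimal sketch in plain Python (no TensorFlow needed) that parses that text and lists the op types without a GPU kernel. The helper names `parse_colocation_info` and `ops_without` are made up for illustration, and the input string is copied from the error above.

```python
import re

# Text copied from the "Colocation Debug Info" section of the error above.
DEBUG_INFO = (
    "Const: GPU CPU VariableV2: GPU CPU UnsortedSegmentSum: GPU CPU "
    "Identity: GPU CPU L2Loss: GPU CPU Shape: GPU CPU Mul: GPU CPU "
    "Gather: GPU CPU SparseApplyAdagrad: CPU Cast: GPU CPU "
    "Unique: GPU CPU StridedSlice: GPU CPU"
)


def parse_colocation_info(text):
    """Map each op type to the set of device types listed after it."""
    pairs = re.findall(r"(\w+):((?:\s+(?:GPU|CPU)\b)+)", text)
    return {op: set(devices.split()) for op, devices in pairs}


def ops_without(device, info):
    """Return the op types that do not list the given device type."""
    return sorted(op for op, devices in info.items() if device not in devices)


info = parse_colocation_info(DEBUG_INFO)
print(ops_without("GPU", info))
```

Only SparseApplyAdagrad is CPU-only, yet its group is pinned to GPU:0, so the placer cannot satisfy both constraints at once. That appears to be the conflict that allow_soft_placement is meant to relax.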
After adding that statement (the commented-out line in the original code), I got this error:
InvalidArgumentError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1322 try:
-> 1323 return fn(*args)
   1324 except errors.OpError as e:

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301 feed_dict, fetch_list, target_list,
-> 1302 status, run_metadata)
   1303

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472 compat.as_text(c_api.TF_Message(self.status.status)),
--> 473 c_api.TF_GetCode(self.status.status))
    474 # Delete the underlying status object from memory otherwise it stays alive

InvalidArgumentError: AttrValue must not have reference type value of float_ref for attr 'tensor_type'
    ; NodeDef: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
    [[Node: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError Traceback (most recent call last)
<ipython-input-6-3171867edb26> in <module>()
     10 print('Iteration%d start at %s...'%(ite,t1))
     11 for i in range(train_usernums):
---> 12 _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
     13 print('\t loss:%f'%(_loss))
     14 out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    887 try:
    888 result = self._run(None, fetches, feed_dict, options_ptr,
--> 889 run_metadata_ptr)
    890 if run_metadata:
    891 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1118 if final_fetches or final_targets or (handle and feed_dict_tensor):
   1119 results = self._do_run(handle, final_targets, final_fetches,
-> 1120 feed_dict_tensor, options, run_metadata)
   1121 else:
   1122 results = []

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1315 if handle is None:
   1316 return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317 options, run_metadata)
   1318 else:
   1319 return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1334 except KeyError:
   1335 pass
-> 1336 raise type(e)(node_def, op, message)
   1337
   1338 def _extend_graph(self):

InvalidArgumentError: AttrValue must not have reference type value of float_ref for attr 'tensor_type'
    ; NodeDef: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
    [[Node: EmbeddingParams/Adagrad/_41 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_470_EmbeddingParams/Adagrad", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^Adagrad/learning_rate/_43, ^Adagrad/update_EmbeddingParams/UnsortedSegmentSum, ^Adagrad/update_EmbeddingParams/Unique)]]
Googling this one led me to: https://github.com/tensorflow/tensorflow/issues/13880
I used one of the fixes suggested there: replacing InteractiveSession with a regular Session. That solved the problem:
#Learning Algorithm for CADE
config = tf.ConfigProto(allow_soft_placement = True)
with tf.Session(config=config) as sess:
    maxIter = 100
    ite = int(0)
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    while ite<maxIter:
        t1 = time()
        print('Iteration%d start at %.4f s...'%(ite,t1))
        for i in range(train_usernums):
            _loss,_ = sess.run((loss,train),feed_dict={x:trainuimat.todense(),v:trainuni,_drop_rate:drop_rate})
        print('\t loss:%f'%(_loss))
        out = sess.run(out2,feed_dict={x:validuimat.todense(),v:validuni,_drop_rate:0})
        out = out*(validuimat.todense()==0)
        out = np.argsort(out)[:,::-1]
        for _k in [1,5,10]:
            _MAP = MAP(testuidict,out,_k)
            print('Iteration%d : MAP@%d %f'%(ite,_k,_MAP))
        print('Iteration%d used time:%.4f s'%(ite,time()-t1))
        ite+=1
But then a new problem appeared:
InvalidArgumentError: indices[69165,0] = 69166 is not in [0, 69166)
    [[Node: embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@EmbeddingParams"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](EmbeddingParams/read, _arg_Placeholder_1_0_1)]]
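Looking at the Node, the problem is in the embedding lookup.
I remembered that Xianwen mentioned yesterday that the embedding's input dimension needs a +1, so I changed it: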
embedding_params = tf.get_variable('EmbeddingParams',
                                   shape=[train_usernums+1,K],
                                   dtype=tf.float32,
                                   initializer=tf.glorot_normal_initializer(),
                                   regularizer=tf.contrib.layers.l2_regularizer(lamda))
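Why the +1 helps can be checked without TensorFlow. The error says index 69166 is outside [0, 69166), which would happen if the user ids start at 1 while the table has exactly train_usernums rows. The sketch below assumes that 1-based numbering (the post does not show how the ids are built), and `check_embedding_ids` is a hypothetical helper, not a TensorFlow function.

```python
def check_embedding_ids(ids, num_rows):
    """Return the ids that fall outside the valid range [0, num_rows)."""
    return [i for i in ids if not 0 <= i < num_rows]


train_usernums = 69166  # upper bound reported in the error message
# Assumption: ids are numbered 1..train_usernums rather than 0..train_usernums-1.
ids_one_based = list(range(1, train_usernums + 1))

# With exactly train_usernums rows, the largest id does not fit.
bad = check_embedding_ids(ids_one_based, num_rows=train_usernums)
print(bad)

# Fix used in this post: one extra row (row 0 is then never looked up).
fixed_by_extra_row = check_embedding_ids(ids_one_based, train_usernums + 1)

# Alternative: shift the ids to start at 0 and keep the table size.
fixed_by_shift = check_embedding_ids([i - 1 for i in ids_one_based], train_usernums)
```

Both variants leave no out-of-range ids. The extra-row version wastes one unused row but needs no change to the input data.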
And yet another new problem:
InternalError: Dst tensor is not initialized.
    [[Node: embedding_lookup/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_38_embedding_lookup", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
    [[Node: add/_33 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_414_add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
A search suggested that the GPU had run out of memory. My guess was that I had not turned on:
config.gpu_options.allow_growth = True
I added it, but it made no difference.
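One plausible reason it made no difference: allow_growth changes when GPU memory is reserved, not how much a single tensor needs. A rough back-of-the-envelope sketch follows, assuming float32 values and a hypothetical item count (the post never states the number of columns in trainuimat); `dense_bytes` is an illustrative helper.

```python
def dense_bytes(num_rows, num_cols, bytes_per_value=4):
    """Bytes needed to hold a dense num_rows x num_cols matrix."""
    return num_rows * num_cols * bytes_per_value


train_usernums = 69166  # upper bound reported in the earlier index error
num_items = 10000       # hypothetical; the real item count is not given

full = dense_bytes(train_usernums, num_items)  # whole matrix fed at once
one_row = dense_bytes(1, num_items)            # a single user's row

print('full matrix: %.2f GiB' % (full / 2**30))
print('single row : %.1f KiB' % (one_row / 2**10))
```

Under these assumed numbers the dense matrix alone is several gigabytes, before counting any activations or gradients computed from it.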
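Then I checked with nvidia-smi and found that more than 3 GB of GPU memory was actually in use. That is when it hit me: shouldn't I be feeding the placeholder one row at a time instead of passing in the whole matrix at once...
And with that the problem was solved... what a dumb mistake...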
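The post does not show the final loop, so here is a minimal sketch of the idea rather than the actual fix. It uses a small hypothetical generator, `iter_row_batches`, that yields row ranges; `batch_size=1` matches the one-row-at-a-time approach, and larger values trade memory for fewer sess.run calls.

```python
def iter_row_batches(num_rows, batch_size):
    """Yield (start, stop) row ranges that cover 0..num_rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_rows, batch_size):
        yield start, min(start + batch_size, num_rows)


batches = list(iter_row_batches(num_rows=10, batch_size=4))
print(batches)

# Hypothetical use inside the training loop from this post (not run here).
# It assumes trainuimat supports row slicing (for example a SciPy CSR matrix)
# and that trainuni is an array aligned with its rows:
#
# for start, stop in iter_row_batches(train_usernums, batch_size):
#     _loss, _ = sess.run((loss, train), feed_dict={
#         x: trainuimat[start:stop].todense(),
#         v: trainuni[start:stop],
#         _drop_rate: drop_rate})
```

Only the rows of the current batch are densified and copied to the GPU, so peak memory scales with batch_size instead of with train_usernums.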