Incompatible shapes halfway through training: Invalid argument: Incompatible shapes: [1,63,4] vs. [1,64,4] - youxiaogeo

I was using the Faster R-CNN model from the TensorFlow models repository, but training kept dying partway through.

It usually failed somewhere around step 40 to 120:

INFO:tensorflow:global step 65: loss = 4.8004 (0.854 sec/step)
INFO:tensorflow:global step 66: loss = 0.2637 (0.868 sec/step)
INFO:tensorflow:global step 67: loss = 1.5711 (0.845 sec/step)
INFO:tensorflow:global step 68: loss = 0.2334 (0.866 sec/step)
INFO:tensorflow:global step 69: loss = 0.6833 (0.846 sec/step)
2017-07-11 14:47:16.293535: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Incompatible shapes: [1,63,4] vs. [1,64,4]


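This error has also been discussed in the TensorFlow models issues on GitHub. The main
suggested fixes are:
1. Class labels should start from 1, not 0,
or
2. For bounding boxes that touch the border, such as 0.00, 0.99, change them to something like 0.01, 0.98 to stay off the border.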
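The two suggestions above can be sketched as a small check run over the annotations before they are written out. This is only an illustration: the function name, the (ymin, xmin, ymax, xmax) normalized box layout, and the default limits of 0.01 and 0.98 are assumptions taken from the numbers quoted above, not part of the Object Detection API.

```python
def sanitize_annotation(class_id, box, lo=0.01, hi=0.98):
    """Apply the two commonly suggested fixes to one annotation.

    class_id: integer label; 0 is assumed to be reserved for background.
    box: (ymin, xmin, ymax, xmax) in normalized [0, 1] coordinates (assumed).
    lo, hi: hypothetical limits that keep the box away from the image border.
    """
    if class_id < 1:
        raise ValueError("class ids should start at 1, got %d" % class_id)

    # Pull every coordinate inside [lo, hi].
    clamped = tuple(min(max(v, lo), hi) for v in box)

    ymin, xmin, ymax, xmax = clamped
    if ymin >= ymax or xmin >= xmax:
        raise ValueError("degenerate box after clamping: %r" % (clamped,))
    return class_id, clamped


cid, box = sanitize_annotation(1, (0.0, 0.0, 0.99, 0.99))
print(cid, box)
```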

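But neither of these really helped. I kept assuming my data was at fault, since I was training on my own dataset and it has only one class.
So I checked the data: I looked at which image training stopped on, then went back to see whether its annotations had too few boxes, too many boxes, boxes on the border,
and so on. It turned out the error was random, though.
In the end I had to rely on myself: read the error output carefully and follow the traceback.
The failure was here: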

diff = prediction_tensor - target_tensor

One operand is [1,63,4] and the other is [1,64,4]. We clearly configured 64, so where does 63 come from?
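The failure itself is easy to reproduce outside the model. The sketch below uses NumPy rather than TensorFlow, and the array names are made up for illustration, but the rule is the same: an element-wise subtraction cannot line up 63 rows against 64.

```python
import numpy as np

prediction = np.zeros((1, 63, 4), dtype=np.float32)  # one row went missing
target = np.zeros((1, 64, 4), dtype=np.float32)      # 64 rows, as configured

try:
    diff = prediction - target
    error_message = None
except ValueError as err:
    # NumPy refuses to broadcast 63 against 64, much like the TensorFlow op.
    error_message = str(err)

print(error_message)
```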
The only option was to follow the tensors back through the TensorFlow code,
which led to this snippet:
      refined_box_encodings_masked_by_class_targets = tf.boolean_mask(
          refined_box_encodings_with_background,
          tf.greater(flat_cls_targets_with_background, 0))
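To see how a mask like this can change the number of rows, here is a minimal NumPy sketch. The shapes are assumptions: 64 proposals, two columns (background plus my single class), and one hypothetical row of targets with no positive entry. I have not confirmed which rows trigger this in the real model, only that the count is no longer guaranteed.

```python
import numpy as np

num_proposals, num_columns = 64, 2  # assumed: background column + 1 class
encodings = np.zeros((num_proposals, num_columns, 4), dtype=np.float32)

targets = np.zeros((num_proposals, num_columns), dtype=np.float32)
targets[:, 1] = 1.0    # every proposal assigned to the single class ...
targets[10, :] = 0.0   # ... except one hypothetical row with no positive entry

# Boolean indexing is the NumPy analogue of tf.boolean_mask here: a
# [64, 2] mask over a [64, 2, 4] array yields [number_of_True, 4].
masked = encodings[targets > 0]
print(masked.shape)
```

A row with two positive entries would push the count the other way, so the result can be larger than 64 as well as smaller.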

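Aha, this is the culprit that turns 64 into 63! The mask keeps only the entries whose class target is greater than 0, so the number of rows that survive depends on the targets instead of being fixed at 64. I can't understand why it was written like this when it can so obviously go wrong.
Then it occurred to me: my local code was a copy I had modified myself and never synced with GitHub. Could the authors have updated it since?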

So I went to GitHub and compared the authors' source with my local copy, and sure enough:

refined_box_encodings_masked_by_class_targets = tf.boolean_mask(
    refined_box_encodings_with_background,
    tf.greater(one_hot_flat_cls_targets_with_background, 0))

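It's different. The developers had made some changes, so I applied their changes to my own copy, and sure enough, it worked. The developers had also added a comment: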

# For anchors with multiple labels, picks refined_location_encodings
# for just one class to avoid over-counting for regression loss and
# (optionally) mask loss.
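Judging only from the new variable name and the comment, the targets seem to be reduced to exactly one class per row before masking. The NumPy sketch below shows that idea under my own assumptions (argmax followed by one-hot, a helper name I made up, 64 proposals with two columns); it is not the upstream code, which should be read in the file linked below.

```python
import numpy as np


def to_one_hot_rows(targets):
    """Replace each row by a one-hot row at that row's argmax."""
    num_columns = targets.shape[1]
    return np.eye(num_columns, dtype=targets.dtype)[np.argmax(targets, axis=1)]


num_proposals, num_columns = 64, 2  # assumed: background column + 1 class
encodings = np.zeros((num_proposals, num_columns, 4), dtype=np.float32)

targets = np.zeros((num_proposals, num_columns), dtype=np.float32)
targets[:, 1] = 1.0
targets[10, :] = 0.0   # hypothetical row with no positive entry
targets[20, :] = 1.0   # hypothetical row with two positive entries

one_hot = to_one_hot_rows(targets)
masked = encodings[one_hot > 0]   # exactly one True per row
print(masked.shape)
```

With one True per row, the masked result always has as many rows as there are proposals, so the later reshape and subtraction see 64 on both sides.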


After that everything ran fine. The developers' code is in the _loss_box_classifier function of faster_rcnn_meta_arch.py.
Link: https://github.com/tensorflow/models/blob/master/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py

Posted 2017-10-30 20:02 by youxiaogeo