Fix multi-GPU training failure in Mask R-CNN

Tested with:

Keras: 2.2.4
Python: 3.6.9
Tensorflow: 1.12.0

==================

Problem:

Using the code from https://github.com/matterport/Mask_RCNN

Setting GPU_COUNT > 1 raises this error:

RuntimeError: It looks like you are subclassing `Model` and you forgot to call `super(YourClass, self).__init__()`. Always start with this line.

Traceback (most recent call last):

  File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 313, in __setattr__

    is_graph_network = self._is_graph_network

  File "parallel_model.py", line 46, in __getattribute__

    return super(ParallelModel, self).__getattribute__(attrname)

AttributeError: 'ParallelModel' object has no attribute '_is_graph_network'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "parallel_model.py", line 159, in <module>

    model = ParallelModel(model, GPU_COUNT)

  File "parallel_model.py", line 35, in __init__

    self.inner_model = keras_model

  File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 316, in __setattr__

    'It looks like you are subclassing `Model` and you '

RuntimeError: It looks like you are subclassing `Model` and you forgot to call `super(YourClass, self).__init__()`. Always start with this line.

Solution 1:

Change the ParallelModel constructor in mrcnn/parallel_model.py as follows:

class ParallelModel(KM.Model):
    def __init__(self, keras_model, gpu_count):
        """Class constructor.
        keras_model: The Keras model to parallelize
        gpu_count: Number of GPUs. Must be > 1
        """
        # Initialize the base Model first so that attribute assignment works.
        super(ParallelModel, self).__init__()
        self.inner_model = keras_model
        self.gpu_count = gpu_count
        merged_outputs = self.make_parallel()
        # Initialize again, this time with the merged tower outputs.
        super(ParallelModel, self).__init__(inputs=self.inner_model.inputs,
                                            outputs=merged_outputs)
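
To see why the extra no-argument super().__init__() call matters, here is a minimal pure-Python sketch of the kind of guard the first traceback points to. Base, Broken, Fixed, and try_build are hypothetical names used only for illustration. Base is a simplified stand-in and not the actual Keras Network class.

```python
class Base:
    """Toy stand-in for the Keras base class (not the real implementation)."""

    def __init__(self):
        # Bypass the guard below while setting up the base state.
        object.__setattr__(self, "_is_graph_network", False)

    def __setattr__(self, name, value):
        try:
            # Raises AttributeError if Base.__init__ has not run yet.
            self._is_graph_network
        except AttributeError:
            raise RuntimeError(
                "It looks like you are subclassing `Model` and you forgot "
                "to call `super(YourClass, self).__init__()`."
            )
        object.__setattr__(self, name, value)


class Broken(Base):
    def __init__(self, inner):
        self.inner_model = inner  # assigned before Base.__init__ has run
        super().__init__()


class Fixed(Base):
    def __init__(self, inner):
        super().__init__()        # base state exists before any assignment
        self.inner_model = inner


def try_build(cls):
    """Return "ok" if construction succeeds, else the exception class name."""
    try:
        cls("model")
        return "ok"
    except RuntimeError as exc:
        return type(exc).__name__
```

In this sketch, Broken trips the guard in the same way as the original constructor, while Fixed mirrors the reordered constructor and builds without error.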

If you then get an error asking for two arguments (inputs and outputs), upgrade Keras to 2.2.4.
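
A quick way to confirm the Keras version before training is to compare version strings numerically. The sketch below is one possible check. version_tuple and meets_minimum are hypothetical helper names, and the 2.2.4 default simply reflects the version this note was tested with.

```python
import re


def version_tuple(version):
    """Turn a version string such as "2.2.4" into a comparable tuple (2, 2, 4)."""
    # Keep only the first three numeric components; suffixes are ignored.
    return tuple(int(number) for number in re.findall(r"\d+", version)[:3])


def meets_minimum(installed, required="2.2.4"):
    """Return True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)
```

In a training script this could be called as meets_minimum(keras.__version__) right after importing keras.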

If you get this colocation error:

No node-device colocations were active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation.
Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:
with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>

No node-device colocations were active during op 'anchors/Variable' creation.
No device assignments were active during op 'anchors/Variable' creation.

Traceback (most recent call last):

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call

    return fn(*args)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _run_fn

    self._extend_graph()

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1352, in _extend_graph

    tf_session.ExtendSession(self._session)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes {{colocation_node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0}} and {{colocation_node anchors/Variable}}: Cannot merge devices with incompatible ids: '/device:GPU:0' and '/device:GPU:1'

         [[{{node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0}} = Identity[T=DT_FLOAT, _class=["loc:@anchors/Variable"], _device="/device:GPU:1"](tower_1/mask_rcnn/anchors/Variable/cond/Merge)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "train_mul.py", line 448, in <module>

    "mrcnn_bbox", "mrcnn_mask"])

  File "M:\new\mrcnn\model.py", line 2132, in load_weights

    saving.load_weights_from_hdf5_group_by_name(f, layers)

  File "D:\Anaconda33\lib\site-packages\keras\engine\saving.py", line 1022, in load_weights_from_hdf5_group_by_name

    K.batch_set_value(weight_value_tuples)

  File "D:\Anaconda33\lib\site-packages\keras\backend\tensorflow_backend.py", line 2440, in batch_set_value

    get_session().run(assign_ops, feed_dict=feed_dict)

  File "D:\Anaconda33\lib\site-packages\keras\backend\tensorflow_backend.py", line 197, in get_session

    [tf.is_variable_initialized(v) for v in candidate_vars])

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 929, in run

    run_metadata_ptr)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run

    feed_dict_tensor, options, run_metadata)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run

    run_metadata)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call

    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936) having device Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:

  with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>  and node anchors/Variable (defined at M:\new\mrcnn\model.py:1936) having device No device assignments were active during op 'anchors/Variable' creation. : Cannot merge devices with incompatible ids: '/device:GPU:0' and '/device:GPU:1'

         [[node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936)  = Identity[T=DT_FLOAT, _class=["loc:@anchors/Variable"], _device="/device:GPU:1"](tower_1/mask_rcnn/anchors/Variable/cond/Merge)]]

No node-device colocations were active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation.

Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:

  with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>

No node-device colocations were active during op 'anchors/Variable' creation.

No device assignments were active during op 'anchors/Variable' creation.

Caused by op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0', defined at:

  File "train_mul.py", line 417, in <module>

    model_dir=MODEL_DIR)

  File "M:\new\mrcnn\model.py", line 1839, in __init__

    self.keras_model = self.build(mode=mode, config=config)

  File "M:\new\mrcnn\model.py", line 2064, in build

    model = ParallelModel(model, config.GPU_COUNT)

  File "M:\new\mrcnn\parallel_model.py", line 36, in __init__

    merged_outputs = self.make_parallel()

  File "M:\new\mrcnn\parallel_model.py", line 80, in make_parallel

    outputs = self.inner_model(inputs)

  File "D:\Anaconda33\lib\site-packages\keras\engine\base_layer.py", line 457, in __call__

    output = self.call(inputs, **kwargs)

  File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 570, in call

    output_tensors, _, _ = self.run_internal_graph(inputs, masks)

  File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 724, in run_internal_graph

    output_tensors = to_list(layer.call(computed_tensor, **kwargs))

  File "D:\Anaconda33\lib\site-packages\keras\layers\core.py", line 682, in call

    return self.function(inputs, **arguments)

  File "M:\new\mrcnn\model.py", line 1936, in <lambda>

    anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 183, in __call__

    return cls._variable_v1_call(*args, **kwargs)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 146, in _variable_v1_call

    aggregation=aggregation)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 125, in <lambda>

    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2444, in default_variable_creator

    expected_shape=expected_shape, import_scope=import_scope)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 187, in __call__

    return super(VariableMetaclass, cls).__call__(*args, **kwargs)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 1329, in __init__

    constraint=constraint)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 1480, in _init_from_args

    self._initial_value),

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 2177, in _try_guard_against_uninitialized_dependencies

    return self._safe_initial_value_from_tensor(initial_value, op_cache={})

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 2195, in _safe_initial_value_from_tensor

    new_op = self._safe_initial_value_from_op(op, op_cache)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 2241, in _safe_initial_value_from_op

    name=new_op_name, attrs=op.node_def.attr)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func

    return func(*args, **kwargs)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op

    op_def=op_def)

  File "D:\Anaconda33\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__

    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Cannot colocate nodes node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936) having device Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:

  with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>  and node anchors/Variable (defined at M:\new\mrcnn\model.py:1936) having device No device assignments were active during op 'anchors/Variable' creation. : Cannot merge devices with incompatible ids: '/device:GPU:0' and '/device:GPU:1'

         [[node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936)  = Identity[T=DT_FLOAT, _class=["loc:@anchors/Variable"], _device="/device:GPU:1"](tower_1/mask_rcnn/anchors/Variable/cond/Merge)]]

No node-device colocations were active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation.

Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:

  with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>

No node-device colocations were active during op 'anchors/Variable' creation.

No device assignments were active during op 'anchors/Variable' creation.

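Add these lines to enable soft device placement:

import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

# Allow TensorFlow to pick another device when the requested placement cannot be satisfied.
config = tf.ConfigProto()
config.allow_soft_placement = True
session = tf.Session(config=config)
KTF.set_session(session)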
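
The colocation error above comes from two ops that must share a device but were requested on different GPUs. The following is a toy model of strict versus soft placement, written in plain Python. It is not TensorFlow's placer: place_group is a hypothetical function, and falling back to the first device in sorted order is an arbitrary assumption made for this sketch.

```python
def place_group(requested_devices, allow_soft_placement=False):
    """Pick one device for a group of ops that must live together.

    requested_devices: device strings such as "/device:GPU:0";
    None means the op made no explicit request.
    """
    distinct = sorted({d for d in requested_devices if d is not None})
    if not distinct:
        return None  # nothing requested; the caller is free to choose
    if len(distinct) == 1:
        return distinct[0]
    if not allow_soft_placement:
        # Strict placement: conflicting requests are an error.
        raise ValueError(
            "Cannot merge devices with incompatible ids: "
            + " and ".join(repr(d) for d in distinct)
        )
    # Soft placement: ignore the conflict and fall back to a single device
    # (the first one in sorted order; an arbitrary choice for this toy).
    return distinct[0]
```

With requests for both "/device:GPU:0" and "/device:GPU:1", the strict call raises, while the call with allow_soft_placement=True returns a single device for the whole group.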

Solution 2 (not recommended):

Downgrade Keras to 2.1.3:

conda install keras=2.1.3

(This works for some people, but it did not work for me.)

References:

https://github.com/matterport/Mask_RCNN/issues/921

https://github.com/tensorflow/tensorflow/issues/2285