14 Replies Latest reply on Jan 1, 2018 3:01 AM by Anju_Paul

    NaN error while training tensorflow object detection

    KimChuan

      Traceback (most recent call last):
      File "object_detection/train.py", line 163, in <module>
      tf.app.run()
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
      _sys.exit(main(_sys.argv[:1] + flags_passthrough))
      File "object_detection/train.py", line 159, in main
      worker_job_name, is_chief, FLAGS.train_dir)
      File "/home/u7485/models/research/object_detection/trainer.py", line 332, in train
      saver=saver)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 755, in train
      sess, train_op, global_step, train_step_kwargs)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 488, in train_step
      run_metadata=run_metadata)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
      run_metadata_ptr)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run
      feed_dict_tensor, options, run_metadata)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
      options, run_metadata)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
      raise type(e)(node_def, op, message)
      tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
      [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

      Caused by op 'CheckNumerics', defined at:
      File "object_detection/train.py", line 163, in <module>
      tf.app.run()
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
      _sys.exit(main(_sys.argv[:1] + flags_passthrough))
      File "object_detection/train.py", line 159, in main
      worker_job_name, is_chief, FLAGS.train_dir)
      File "/home/u7485/models/research/object_detection/trainer.py", line 263, in train
      total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 413, in check_numerics
      message=message, name=name)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
      op_def=op_def)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
      original_op=self._default_original_op, op_def=op_def)
      File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
      self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

      InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
      [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

      The submitted job script is as follows:
      #PBS -l nodes=1:knl
      cd $PBS_O_WORKDIR
      source activate py35
      protoc object_detection/protos/*.proto --python_out=.
      export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
      python object_detection/train.py --logtostderr --pipeline_config_path=object_detection/data/faster_rcnn_resnet101_pets.config --train_dir=object_detection/data/checkpoint
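
The traceback above ends in a `CheckNumerics` node wrapped around `total_loss`, so the error only says that the summed loss became non-finite, not which term caused it. As a rough illustration of what such a guard does, here is a plain-Python analogue. It is a sketch, not TensorFlow's implementation, and the helper name and the per-component loss names are hypothetical.

```python
import math


def check_numerics(values, message):
    """Return `values` unchanged, or raise if any entry is NaN or +/-inf.

    Plain-Python analogue of the guard seen in the traceback; the real
    tf.check_numerics operates on tensors inside the graph.
    """
    for v in values:
        if math.isnan(v):
            raise ValueError("{} : Tensor had NaN values".format(message))
        if math.isinf(v):
            raise ValueError("{} : Tensor had Inf values".format(message))
    return values


def non_finite_entries(named_values):
    """List the names whose value is NaN or inf, to narrow down a bad term."""
    return sorted(name for name, v in named_values.items()
                  if not math.isfinite(v))


# Hypothetical loss components, for illustration only.
example_losses = {"localization": 0.8, "classification": float("nan")}
bad_terms = non_finite_entries(example_losses)
```

Checking each loss term separately, rather than only the total, is one way to see which part of the model first produces a NaN.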

        • 1. Re: NaN error while training tensorflow object detection
          Anju_Paul

          Hi,

           

          Try reducing your learning rate.

           

          Note: this concerns the job script line "#PBS -l nodes=1:knl".

          I am not sure whether you are running on Dev Cloud.

          If you are, note that Dev Cloud contains no KNLs, only SKLs.

           

          Regards,

          Anju
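
For context on the suggestion to reduce the learning rate: the toy example below shows the general mechanism by which an overly large step size drives a loss to inf and then NaN. It minimises f(w) = w**2 with plain gradient descent. The function name and the chosen rates are illustrative assumptions, and the sketch does not establish that the learning rate is the cause of the error reported in this thread.

```python
import math


def final_loss(learning_rate, steps=2000, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final loss.

    The gradient of f is 2*w, so each update multiplies w by
    (1 - 2*learning_rate). If that factor has magnitude above 1,
    w grows without bound and the floats eventually overflow.
    """
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w * w


small_step_loss = final_loss(0.1)   # factor 0.8 per step: converges
large_step_loss = final_loss(1.5)   # factor -2 per step: diverges

print(small_step_loss, math.isfinite(large_step_loss))
```

With the larger rate the weight overflows to infinity, and the next update subtracts infinity from infinity, which is how a NaN appears in the loss.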

          • 2. Re: NaN error while training tensorflow object detection
            Ajit Kumar Pookalangara

            Hi Kim,

            To help resolve this issue, please let me know the following:

            1. Which version of TensorFlow are you using with Python 3.5?

            2. Which input dataset are you training on for object detection? (I see the pets config file, but wanted to confirm it.)

            3. How many classes are in the dataset?

            4. What is the learning rate (try lowering it further)? Please share the .config file.

            5. Are you using transfer learning or training from scratch?

            • 3. Re: NaN error while training tensorflow object detection
              Ajit Kumar Pookalangara

              Hi Kim

              We have a faster cluster called DevCloud (on Skylake processors). Maybe you can request access to this cluster through this link: https://access.colfaxresearch.com

              Once you have access to this environment, you can run your experiments there.

              • 4. Re: NaN error while training tensorflow object detection
                Rishabh_Intel

                Hi Kim,

                 

                Did your issue get resolved?

                Please let us know if you have any further concerns.

                 

                Thanks,

                Rishabh

                • 5. Re: NaN error while training tensorflow object detection
                  KimChuan

                  Hi Anju,

                  Submitting with skl gives me a system-busy error message.

                   

                  [u7485@c001 research]$ qsub train.sh

                  qsub: submit error (Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty, all systems are busy, or no nodes have the requested feature))

                   

                  cat train.sh

                  #PBS -l nodes=1:skl

                  cd $PBS_O_WORKDIR

                  source activate py35

                  protoc object_detection/protos/*.proto --python_out=.

                  export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim

                  python object_detection/train.py --logtostderr --pipeline_config_path=object_detection/data/faster_rcnn_resnet101_pets.config --train_dir=object_detection/data/checkpoint

                  • 6. Re: NaN error while training tensorflow object detection
                    Anju_Paul

                    Hi Kim,

                     

                    That implies that you are not running on Dev Cloud.

                    You might be running on the Colfax Cluster, which has only KNLs.

                     

                    Regards,

                    Anju

                    • 7. Re: NaN error while training tensorflow object detection
                      KimChuan

                      Hi Ajit,

                       

                      I was trying to reproduce the TensorFlow Object Detection API training sample, faster_rcnn_resnet101_pets.config, from the tensorflow/models repository on GitHub.

                       

                      The version of the Intel build of TensorFlow (Python 3.5) is as follows:

                      (py35) [u7485@c001 research]$ conda list | grep tensor

                      tensorflow                1.3.0                         0

                      tensorflow-base           1.3.0            py35h79a3156_1

                      tensorflow-tensorboard    0.1.5                    py35_0

                      • 8. Re: NaN error while training tensorflow object detection
                        Rishabh_Intel

                        Hi Kim,

                         

                        You are running things on Colfax (c001). Please try running as Anju suggested:

                        1) On Dev Cloud (c009)

                        2) Try reducing the learning rate.

                         

                        Thanks,

                        Rishabh

                        • 9. Re: NaN error while training tensorflow object detection
                          KimChuan

                          Increasing the learning rate to 0.03 still gives the NaN error.

                           

                          INFO:tensorflow:Scale of 0 disables regularizer.

                          INFO:tensorflow:Scale of 0 disables regularizer.

                          INFO:tensorflow:Scale of 0 disables regularizer.

                          INFO:tensorflow:Scale of 0 disables regularizer.

                          INFO:tensorflow:Scale of 0 disables regularizer.

                          INFO:tensorflow:depth of additional conv before box predictor: 0

                          INFO:tensorflow:Scale of 0 disables regularizer.

                          INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.

                          2017-12-27 15:41:56.890696: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

                          2017-12-27 15:41:56.890844: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

                          2017-12-27 15:41:56.890892: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

                          2017-12-27 15:41:56.890938: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

                          2017-12-27 15:41:56.890983: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.

                          2017-12-27 15:41:56.891028: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

                          /home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:95: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.

                            "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

                          INFO:tensorflow:Restoring parameters from object_detection/data/checkpoint/model.ckpt-0

                          2017-12-27 15:43:19.352974: I tensorflow/core/common_runtime/simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0

                          INFO:tensorflow:Starting Session.

                          INFO:tensorflow:Saving checkpoint to path object_detection/data/checkpoint/model.ckpt

                          INFO:tensorflow:Starting Queues.

                          INFO:tensorflow:global_step/sec: 0

                          INFO:tensorflow:Recording summary at step 0.

                          INFO:tensorflow:global step 1: loss = 4.4225 (33.565 sec/step)

                          INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values

                                   [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

                           

                           

                          Caused by op 'CheckNumerics', defined at:

                            File "object_detection/train.py", line 167, in <module>

                              tf.app.run()

                            File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run

                              _sys.exit(main(_sys.argv[:1] + flags_passthrough))

                            File "object_detection/train.py", line 163, in main

                              worker_job_name, is_chief, FLAGS.train_dir)

                            File "/home/u7485/models/research/object_detection/trainer.py", line 263, in train

                              total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')

                            File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 413, in check_numerics

                           

                           

                           

                          batch_size: 1

                            optimizer {

                              momentum_optimizer: {

                                learning_rate: {

                                  manual_step_learning_rate {

                                    initial_learning_rate: 0.03

                                    schedule {

                                      step: 0

                                      learning_rate: .03

                                    }

                                    schedule {

                                      step: 900000

                                      learning_rate: .003

                                    }

                                    schedule {

                                      step: 1200000

                                      learning_rate: .0003

                                    }

                                  }

                                }

                                momentum_optimizer_value: 0.9

                              }

                              use_moving_average: false

                            }

                            gradient_clipping_by_norm: 10.0
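
To make the schedule in the config above concrete, the sketch below looks up the rate that such a stepwise schedule would yield at a given global step. The (step, rate) pairs are copied from the config; the function name and the exact behaviour at a boundary step (treated here as already using the new rate) are assumptions, not the Object Detection API's code.

```python
import bisect

# (step, learning_rate) pairs copied from the config above.
SCHEDULE = [(0, 0.03), (900000, 0.003), (1200000, 0.0003)]
INITIAL_LEARNING_RATE = 0.03


def learning_rate_at(global_step, schedule=SCHEDULE,
                     initial=INITIAL_LEARNING_RATE):
    """Return the piecewise-constant rate in effect at `global_step`.

    Assumption: a schedule entry applies from its own step onwards;
    before the first entry the initial rate is used.
    """
    boundaries = [step for step, _ in schedule]
    idx = bisect.bisect_right(boundaries, global_step)
    if idx == 0:
        return initial
    return schedule[idx - 1][1]


for step in (1, 899999, 900000, 1500000):
    print(step, learning_rate_at(step))
```

Since the log shows the failure right after global step 1, only the first entry (0.03) is in effect when the NaN appears; the later entries never come into play.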

                          • 10. Re: NaN error while training tensorflow object detection
                            Anju_Paul

                            Hi,

                             

                            We mentioned decreasing the learning rate, not increasing it.

                             

                            Regards,

                            Anju

                            • 11. Re: NaN error while training tensorflow object detection
                              KimChuan

                              Both decreasing (0.00003) and increasing (0.03) the learning rate resulted in the NaN error.

                              • 12. Re: NaN error while training tensorflow object detection
                                Anju_Paul

                                Hi,

                                 

                                From the logs (INFO:tensorflow:Restoring parameters from object_detection/data/checkpoint/model.ckpt-0),

                                it seems the model is being initialized from older, incorrect weights.

                                 

                                Does the directory "object_detection/data/checkpoint" get created/populated automatically when the code runs?

                                Did the directory contain any code or data when you first ran the program?

                                If the answer is yes to the first question and no to the second, please try running again after deleting or emptying this folder.

                                 

                                Regards,

                                Anju
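
The suggestion above, clearing the training directory so that a stale `model.ckpt-0` is not restored, can be scripted before each run. This is a minimal sketch under stated assumptions: the helper name is hypothetical, the demo uses a throwaway temporary directory rather than the real `object_detection/data/checkpoint`, and the stale file name is invented for the demonstration.

```python
import os
import shutil
import tempfile


def reset_train_dir(train_dir):
    """Delete `train_dir` if it exists, then recreate it empty."""
    if os.path.isdir(train_dir):
        shutil.rmtree(train_dir)
    os.makedirs(train_dir)
    return train_dir


# Demonstration on a temporary directory (not the real training directory).
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "checkpoint")
os.makedirs(demo_dir)
# Hypothetical leftover file standing in for an old checkpoint.
open(os.path.join(demo_dir, "model.ckpt-0.index"), "w").close()

reset_train_dir(demo_dir)
remaining = os.listdir(demo_dir)   # contents after the reset
shutil.rmtree(demo_root)           # clean up the demo directory
```

Note that this removes everything under the directory, so it should only be pointed at a folder that holds nothing but training output.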

                                • 13. Re: NaN error while training tensorflow object detection
                                  KimChuan

                                  Deleting the folder and then proceeding with the training (initial learning rate 0.03) still results in the NaN error.

                                   

                                  [u7485@c001 research]$ cat train.sh.e27486

                                   

                                  INFO:tensorflow:Scale of 0 disables regularizer.

                                  INFO:tensorflow:Scale of 0 disables regularizer.

                                  INFO:tensorflow:Scale of 0 disables regularizer.

                                  INFO:tensorflow:Scale of 0 disables regularizer.

                                  INFO:tensorflow:Scale of 0 disables regularizer.

                                  INFO:tensorflow:depth of additional conv before box predictor: 0

                                  INFO:tensorflow:Scale of 0 disables regularizer.

                                  INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.

                                  2017-12-28 16:56:38.297831: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

                                  2017-12-28 16:56:38.297978: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

                                  2017-12-28 16:56:38.298026: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

                                  2017-12-28 16:56:38.298072: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

                                  2017-12-28 16:56:38.298117: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.

                                  2017-12-28 16:56:38.298161: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

                                  2017-12-28 16:58:01.456051: I tensorflow/core/common_runtime/simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0

                                  /home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:95: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.

                                    "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

                                  INFO:tensorflow:Restoring parameters from /home/u7485/models/research/object_detection/data/faster_rcnn_resnet101_coco_11_06_2017/model.ckpt

                                  INFO:tensorflow:Starting Session.

                                  INFO:tensorflow:Saving checkpoint to path object_detection/data/checkpoint/model.ckpt

                                  INFO:tensorflow:Starting Queues.

                                  INFO:tensorflow:global_step/sec: 0

                                  INFO:tensorflow:Recording summary at step 0.

                                  INFO:tensorflow:global step 1: loss = 4.6975 (32.468 sec/step)

                                  INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values

                                           [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

                                   

                                   

                                  Caused by op 'CheckNumerics', defined at:

                                    File "object_detection/train.py", line 167, in <module>

                                      tf.app.run()

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run

                                      _sys.exit(main(_sys.argv[:1] + flags_passthrough))

                                    File "object_detection/train.py", line 163, in main

                                      worker_job_name, is_chief, FLAGS.train_dir)

                                    File "/home/u7485/models/research/object_detection/trainer.py", line 263, in train

                                      total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 413, in check_numerics

                                      message=message, name=name)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op

                                      op_def=op_def)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op

                                      original_op=self._default_original_op, op_def=op_def)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__

                                      self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

                                   

                                   

                                  InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values

                                           [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

                                   

                                   

                                  Traceback (most recent call last):

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call

                                      return fn(*args)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn

                                      status, run_metadata)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/contextlib.py", line 66, in __exit__

                                      next(self.gen)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status

                                      pywrap_tensorflow.TF_GetCode(status))

                                  tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values

                                           [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

                                   

                                   

                                  During handling of the above exception, another exception occurred:

                                   

                                   

                                  Traceback (most recent call last):

                                    File "object_detection/train.py", line 167, in <module>

                                      tf.app.run()

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run

                                      _sys.exit(main(_sys.argv[:1] + flags_passthrough))

                                    File "object_detection/train.py", line 163, in main

                                      worker_job_name, is_chief, FLAGS.train_dir)

                                    File "/home/u7485/models/research/object_detection/trainer.py", line 332, in train

                                      saver=saver)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 755, in train

                                      sess, train_op, global_step, train_step_kwargs)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 488, in train_step

                                      run_metadata=run_metadata)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run

                                      run_metadata_ptr)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run

                                      feed_dict_tensor, options, run_metadata)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run

                                      options, run_metadata)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call

                                      raise type(e)(node_def, op, message)

                                  tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values

                                           [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

                                   

                                   

                                  Caused by op 'CheckNumerics', defined at:

                                    File "object_detection/train.py", line 167, in <module>

                                      tf.app.run()

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run

                                      _sys.exit(main(_sys.argv[:1] + flags_passthrough))

                                    File "object_detection/train.py", line 163, in main

                                      worker_job_name, is_chief, FLAGS.train_dir)

                                    File "/home/u7485/models/research/object_detection/trainer.py", line 263, in train

                                      total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 413, in check_numerics

                                      message=message, name=name)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op

                                      op_def=op_def)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op

                                      original_op=self._default_original_op, op_def=op_def)

                                    File "/home/u7485/.conda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__

                                      self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

                                   

                                   

                                  InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values

                                           [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/cpu:0"](total_loss)]]

                                  • 14. Re: NaN error while training tensorflow object detection
                                    Anju_Paul

                                    Hi,

                                     

                                    I tried to recreate your problem here, but it seems the dataset is too big to fit in memory.

                                    Do you have a smaller image set that I could try it on?

                                     

                                    Also, please try deleting the folder again and running with a learning rate of 0.001.

                                    Kindly let me know if that helps.

                                     

                                    Regards,

                                    Anju