With the GPU enabled, TensorFlow freezes unless I force the "Discounted Monte-Carlo returns" computation onto the CPU. Wrapping the body of discounted_return(reward, length, discount) in with tf.device("/cpu") seems to address the issue; a sketch of the workaround is below.
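For reference, here is a minimal sketch of the workaround, assuming reward has shape [batch, time] and length is an int32 tensor of per-episode lengths; the actual discounted_return() in this codebase may differ in details.

import tensorflow as tf  # TF 1.x graph mode

def discounted_return(reward, length, discount):
  """Discounted Monte-Carlo returns, with the whole computation pinned to CPU."""
  with tf.device('/cpu:0'):
    # Zero out rewards past each episode's length.
    timestep = tf.range(tf.shape(reward)[1])
    mask = tf.cast(timestep[None, :] < length[:, None], reward.dtype)
    # Accumulate back-to-front: R_t = r_t + discount * R_{t+1}.
    reversed_rewards = tf.reverse(mask * reward, [1])
    returns = tf.scan(
        lambda agg, cur: cur + discount * agg,
        tf.transpose(reversed_rewards),           # scan over the time axis
        initializer=tf.zeros_like(reward[:, 0]))
    return tf.stop_gradient(tf.reverse(tf.transpose(returns), [1]))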
I've seen this with TF 1.7, 1.11, and 1.12, CUDA 8, 9, and 10, and CUDA compute capabilities from 5.2 to 7.5.
I'm not sure how to debug TensorFlow when it quietly freezes (or crashes). I tried the TensorFlow Debugger, but it doesn't really show where the hang happens and it also has gRPC issues. GDB shows that the process is in the following place, but with so many threads it is hard to tell whether this is relevant:
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4902dde6db in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4902dddcf9 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4902ddb2bb in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4902ddb793 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f490280594c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f490280599b in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f490280c5fb in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f4902815598 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f48ff8afc8c in tensorflow::SessionRef::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f48ffaa49b4 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) ()
from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f48ffaa57e6 in TF_SessionRun () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f48ff8acffd in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f48ff8ad032 in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f48ff867d84 in _wrap_TF_SessionRun_wrapper () from /home/dmitry/.local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
Thank you for investigating this! Computing returns on the CPU seems reasonable anyway. Would you like to create a PR with this change? I don't know of a more efficient way to debug deadlocks in TensorFlow -- personally I usually disable code until I can locate the problematic part.
I don't really understand why forcing this particular part of the graph onto the CPU fixes (or masks) the issue, so I won't be able to explain what this PR does ;)
Haha, okay. It seems like there is a problem when running the tf.scan() or tf.reverse() ops used to compute the return on the GPU. Placing this loop on the CPU should be faster anyway, so this seems like a reasonable change to me.
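For anyone who wants to narrow this down further, a standalone repro sketch along these lines (hypothetical, not code from this repo) isolates just the scan/reverse pattern, so you can check whether the hang appears with the ops on the GPU and disappears when the device string is changed to '/cpu:0':

import tensorflow as tf

reward = tf.random_uniform([16, 100])          # [batch, time] dummy rewards
with tf.device('/gpu:0'):                      # swap for '/cpu:0' to compare
  reversed_rewards = tf.reverse(reward, [1])
  accumulated = tf.scan(
      lambda agg, cur: cur + 0.99 * agg,       # R_t = r_t + discount * R_{t+1}
      tf.transpose(reversed_rewards),          # scan over the time axis
      initializer=tf.zeros([16]))
  returns = tf.reverse(tf.transpose(accumulated), [1])

# allow_soft_placement lets ops without a GPU kernel fall back to the CPU.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
  print(sess.run(returns).shape)               # hangs here if the bug triggers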