Segfault running MNIST example lenet-stn.jl #369
I have exactly the same problem with this.
It always segfaults when I start training; it is not a problem of the … My Debian version is 9.3.
Running under gdb gives a bit more info about the segfault.
I hope that this helps to solve the problem. I am busy this week, so I can only do more tests next weekend.
Furthermore, I think that it is a …
Here is my gdb trace:

Thread 37 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff35a15700 (LWP 13819)]
0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=...,
input_data=...) at src/operator/spatial_transformer.cc:120
120 *(g_input + data_index + 1) += *(grad + grad_index) * top_left_y_w
(gdb) bt
#0 0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=...,
input_data=...) at src/operator/spatial_transformer.cc:120
#1 0x00007fff83e5f18c in mxnet::op::SpatialTransformerOp<mshadow::cpu, float>::Backward (this=0x38bcd30, ctx=...,
out_grad=std::vector of length 1, capacity 1 = {...}, in_data=std::vector of length 2, capacity 2 = {...},
out_data=std::vector of length 3, capacity 3 = {...}, req=std::vector of length 2, capacity 2 = {...},
in_grad=std::vector of length 2, capacity 2 = {...}, aux_args=std::vector of length 0, capacity 0)
at src/operator/./spatial_transformer-inl.h:136

I guess something is wrong with the shapes.
Oh…
I can reproduce the segfault by changing the optimizer to ADAM:

% ./train_mnist.py --network lenet --add_stn --optimizer adam
INFO:root:start with arguments Namespace(add_stn=True, batch_size=64, disp_batches=100, dtype='float32', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='adam', test_io=0, top_k=0, wd=0.0001)
Segmentation fault: 11
Stack trace returned 10 entries:
[bt] (0) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2559619) [0x7f642acdd619]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f645935b4b0]
[bt] (2) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2527f9d) [0x7f642acabf9d]
[bt] (3) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x252a9f6) [0x7f642acae9f6]
[bt] (4) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2203f87) [0x7f642a987f87]
[bt] (5) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fea13b) [0x7f642a76e13b]
[bt] (6) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fee562) [0x7f642a772562]
[bt] (7) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fd0cbd) [0x7f642a754cbd]
[bt] (8) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fd48c1) [0x7f642a7588c1]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f6454474c80]

So… simply switching to SGD works:

diff --git a/examples/mnist/lenet-stn.jl b/examples/mnist/lenet-stn.jl
index 23ca9de..60f2def 100644
--- a/examples/mnist/lenet-stn.jl
+++ b/examples/mnist/lenet-stn.jl
@@ -57,6 +57,6 @@
model = mx.FeedForward(lenet, context=mx.cpu())
# optimizer
-optimizer = mx.ADAM(lr=0.01, weight_decay=0.00001)
+optimizer = mx.SGD(lr=0.1, momentum=.9)
# fit parameters
Make its optimizer configuration the same as the Python example's; fixes #369.
So, does this mean there is something wrong in …
@rickhg12hs It seems ADAM makes some value fall negative, then …
Well, not exactly, IMO.

I changed …

🤔 Ignore my post,

Try this? 8e99fa9

Got this on my machine …
Using the edits in 8e99fa9, I get a segfault.
Hmm, I believe it's a bug in libmxnet now.
I reported this issue upstream: apache/mxnet#9050
I think it is a bug in the STN layer. I have also had issues with it: I train a model using the simple_bind API, and sometimes I get segfaults, sometimes not. It seems to depend on the random parameter initialization. A gdb stack trace told me the crash was in the BilinearSamplingBackward method, the same as mentioned here before.
@adrianloy Do you have a GPU, and can you try cuDNN?
The lenet.jl example seems to run OK, but lenet-stn.jl segfaults.