Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on unresponsive engine #33

Open
sderr opened this issue Dec 16, 2020 · 3 comments
Open

Crash on unresponsive engine #33

sderr opened this issue Dec 16, 2020 · 3 comments

Comments

@sderr
Copy link
Contributor

sderr commented Dec 16, 2020

Hi, I got a an abort in free():

(gdb) bt
#0  0x00007f6fa421e2c7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f6fa421f9b8 in __GI_abort () at abort.c:90
#2  0x00007f6fa4260e17 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f6fa43727f8 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f6fa4269609 in malloc_printerr (ar_ptr=0x7f6fa45ae760 <main_arena>, ptr=<optimized out>, str=0x7f6fa436feb1 "free(): invalid pointer", action=3)
    at malloc.c:5004
#4  _int_free (av=0x7f6fa45ae760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x00007f6fa426482a in __GI__IO_free_backup_area (fp=fp@entry=0xf24d40) at genops.c:202
#6  0x00007f6fa4263f95 in _IO_new_file_overflow (f=0xf24d40, ch=-1) at fileops.c:828
#7  0x00007f6fa4262900 in _IO_new_file_xsputn (f=0xf24d40, data=<optimized out>, n=17) at fileops.c:1326
#8  0x00007f6fa4234f73 in _IO_vfprintf_internal (s=<optimized out>, format=<optimized out>, ap=ap@entry=0x7f6fa31e4728) at vfprintf.c:1635
#9  0x00007f6fa423b347 in __fprintf (stream=<optimized out>, format=format@entry=0x40d4ac "%s -> %s\n") at fprintf.c:32
#10 0x0000000000405eeb in engine_readln (w=w@entry=0xf25f00, e=e@entry=0x7f6fa31e4e70, line=line@entry=0x7f6fa31e4890) at src/engine.c:281
#11 0x0000000000406c3e in uci_engine_bestmove (info=0x7f6fa31e49d0, pv=0x7f6fa31e4bc0, best=0x7f6fa31e4c00, timeLeft=0x7f6fa31e4be0, e=0x7f6fa31e4e70, w=0xf25f00)
    at src/engine.c:322
#12 engine_bestmove (w=w@entry=0xf25f00, e=e@entry=0x7f6fa31e4e70, timeLeft=timeLeft@entry=0x7f6fa31e4be0, best=best@entry=0x7f6fa31e4c00, pv=pv@entry=0x7f6fa31e4bc0,
    info=info@entry=0x7f6fa31e49d0) at src/engine.c:420
#13 0x0000000000407e42 in game_play (w=w@entry=0xf25f00, g=g@entry=0x7f6fa31e4db0, o=o@entry=0x6e4d60 <options>, engines=engines@entry=0x7f6fa31e4e70,
    eo=eo@entry=0x7f6fa31e4d60, reverse=<optimized out>) at src/game.c:318
#14 0x000000000040a10d in thread_start (arg=0xf25f00) at src/main.c:119
#15 0x00007f6fa48bedd5 in start_thread (arg=0x7f6fa31e5700) at pthread_create.c:307
#16 0x00007f6fa42e602d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

With in the main thread:

[Switching to thread 6 (Thread 0x7f6fa4cdc740 (LWP 15145))]
#0  0x00007f6fa42d77a0 in __close_nocancel () at ../sysdeps/unix/syscall-template.S:81
81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007f6fa42d77a0 in __close_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fa426311c in _IO_new_file_close_it (fp=fp@entry=0xf24f80) at fileops.c:179
#2  0x00007f6fa4256138 in _IO_new_fclose (fp=0xf24f80) at iofclose.c:58
#3  0x000000000040c648 in worker_destroy (w=0xf25f68) at src/workers.c:91
#4  0x000000000040a404 in main_destroy () at src/main.c:37
#5  0x00007f6fa4221c29 in __run_exit_handlers (status=status@entry=1, listp=0x7f6fa45ae6c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#6  0x00007f6fa4221c77 in __GI_exit (status=status@entry=1) at exit.c:99
#7  0x000000000040a931 in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:189

Looks like there's a race when worker_destroy() is called while the worker thread is still running. w->log was set to NULL while it was still being used. This happens on DIE() as we don't pthread_join() the workers.

I think the proper way to fix this would be to implement (using poll/select or whatever) a version of engine_readln() that checks the deadline of the current thread, instead of checking the deadlines in the main thread.

@lucasart
Copy link
Owner

Thank you for the detailled bug report. I'm a bit reluctant to using poll/select in the worker threads though, as this would create further complexity. But I don't have an alternative solution at this point. Will think about it. Now really a high priority bug to fix, because DIE() is fatal anwyay, and the OS cleans up our mess anyway ;-)

@sderr
Copy link
Contributor Author

sderr commented Dec 19, 2020

Yes, I understand, indeed if it was a simple fix I would have sent a pull request rather than an issue.

Regarding the priority of this bug however, I think it would be nice to be able to treat that engine failure as a single defeat rather than aborting the whole tournament.

@lucasart
Copy link
Owner

lucasart commented Dec 22, 2020

I looked into poll, but it's impractical because:

  • I want a str_getline() with timeout. It's not really practical to write that with poll(), because the only thing poll() tells you is whether or not there is something to read. You don't know if that something contains a '\n' before reading it (and risk blocking).
  • Even if I had a str_getline() with timeout, often I want to read a bunch of things, knowing that I am under the protection of a deadline (observed/enforced by master thread). To remove that protection I would have to set a timeout for all str_getline() calls. And that would mean keeping time of wall clock between each call, to know how much of the alloted time remains.
  • From what I understand, I would have to use O_NONBLOCK, and basically do away with the C I/O library, and write my own buffering (+polling), and design an async I/O library with callback functions.

Perhaps this is really the way to do it. But for a simple project like c-chess-cli, it seems overkill.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants