
calc: bgsave not working ... #9082

Open
mmeeks opened this issue May 16, 2024 · 11 comments
Labels
24.04 bug Something isn't working

Comments

@mmeeks
Contributor

mmeeks commented May 16, 2024

It seems background save doesn't always work - sometimes we have a stray thread:

May 16 16:08:42 ip-172-31-14-76 coolwsd[216014]: kit-216014-216014 2024-05-16 16:08:42.530632 +0000 [ kitbroker_148 ] WRN Failed to ensure we have just one, we have: 2| kit/Kit.cpp:1388

And then we cannot save ... but what thread is it? =)
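The check in kit/Kit.cpp isn't shown here, but a single-thread assertion of this kind can be sketched by counting the entries under /proc/self/task (this is an assumption about how such a check might work; `countOwnThreads` is a hypothetical helper, not the actual code):

```cpp
#include <dirent.h>

// Count the live threads of this process by listing /proc/self/task.
// A hedged sketch of a "we should have exactly one thread" check -
// the real logic in kit/Kit.cpp may differ.
static int countOwnThreads()
{
    DIR* dir = opendir("/proc/self/task");
    if (!dir)
        return -1;
    int count = 0;
    while (struct dirent* ent = readdir(dir))
    {
        if (ent->d_name[0] != '.') // skip "." and ".."
            ++count;
    }
    closedir(dir);
    return count;
}
```

A warning like the one above would fire when this count is greater than one at the point of forking for a background save.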

@mmeeks mmeeks added bug Something isn't working unconfirmed 24.04 and removed unconfirmed labels May 16, 2024
@mmeeks
Contributor Author

mmeeks commented May 16, 2024

@caolanm you noticed this - the logs support the thesis, but I wonder what creates the extra thread that we don't know about.

@caolanm
Contributor

caolanm commented May 16, 2024

FWIW the coolwsd.xml for perf-staging has experimental:true and deepl:true, but that doesn't seem to have any thread related stuff, so I presume it has then to be some core thread. Add dumping of /proc/self/task/*/comm ?
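The suggested dump of /proc/self/task/*/comm can be sketched like this (a minimal standalone version; `dumpThreadNames` is a hypothetical helper, not existing code):

```cpp
#include <dirent.h>
#include <fstream>
#include <string>
#include <vector>

// Collect the comm (thread name) of every task in this process by
// walking /proc/self/task/<tid>/comm - useful for spotting stray threads.
static std::vector<std::string> dumpThreadNames()
{
    std::vector<std::string> names;
    DIR* dir = opendir("/proc/self/task");
    if (!dir)
        return names;
    while (struct dirent* ent = readdir(dir))
    {
        if (ent->d_name[0] == '.')
            continue; // skip "." and ".."
        std::ifstream comm(std::string("/proc/self/task/") + ent->d_name + "/comm");
        std::string name;
        std::getline(comm, name);
        names.push_back(name);
    }
    closedir(dir);
    return names;
}
```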

@mmeeks
Contributor Author

mmeeks commented May 17, 2024

Ah - sure; I just need a way to reproduce it; it may well be something like the officecfg config-writing thread or, as you say, a DeepL or some other thread =) I wonder if those threads set comm, though =)

Also - it's not possible to access /proc/self/task once we've dropped capabilities; it is only possible here because we hold a directory file-descriptor open from before dropping capabilities; so for new directories unknown at that time, we can't inspect them, at least inside the forkit process itself =( so manual hunting is better for now, I think.
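The pre-dropped-capabilities directory-fd trick mentioned above can be sketched with openat(): open the task directory early, then resolve paths relative to that fd later, when a fresh open of /proc paths may be denied. All names here (`g_procTaskFd`, `openProcTaskEarly`, `readTaskCommLater`) are hypothetical illustrations, not the actual forkit code:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Held open from before capabilities are dropped.
static int g_procTaskFd = -1;

// Call before dropping capabilities.
static void openProcTaskEarly()
{
    g_procTaskFd = open("/proc/self/task", O_RDONLY | O_DIRECTORY);
}

// Call after the drop: resolve "<tid>/comm" relative to the saved fd.
static std::string readTaskCommLater(const char* tid)
{
    std::string result;
    if (g_procTaskFd < 0)
        return result;
    const std::string rel = std::string(tid) + "/comm";
    const int fd = openat(g_procTaskFd, rel.c_str(), O_RDONLY);
    if (fd < 0)
        return result;
    char buf[64] = {};
    const ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n > 0 && buf[n - 1] == '\n')
        buf[n - 1] = '\0';
    result = buf;
    return result;
}
```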

@mmeeks
Contributor Author

mmeeks commented May 17, 2024

Easy to reproduce, it seems, from staging-perf:

$ cat /proc/25814/task/*/comm
kitbroker_001
kit_spare_001
kitbroker_001
kitbroker_001
kitbroker_001

@mmeeks
Contributor Author

mmeeks commented May 17, 2024

gdb is behaving oddly - but during a save, at least:

Thread 296 (Thread 0x7f05cd175700 (LWP 26816) "WakeUpThread"):
#0 0x00007f05f1641a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f05ed529c1a in __gthread_cond_timedwait (__abs_timeout=0x7f05cd164df0, __mutex=, __cond=0x4575ec70) at /opt/rh/devtoolset-12/root/usr/include/c++/12/x86_64-redhat-linux/bits/gthr-default.h:872
#2 std::__condvar::wait_until (__abs_time=..., __m=..., this=0x4575ec70) at /opt/rh/devtoolset-12/root/usr/include/c++/12/bits/std_mutex.h:162
#3 std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__lock=..., __atime=..., this=0x4575ec70) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:221
#4 std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x4575ec70) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:134
#5 std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, framework::WakeUpThread::execute()::<lambda()> > (__p=..., __atime=..., __lock=..., this=0x4575ec70) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:151
#6 std::condition_variable::wait_for<long int, std::ratio<1, 1000>, framework::WakeUpThread::execute()::<lambda()> > (__rtime=..., __p=..., __lock=..., this=0x4575ec70) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:174
#7 framework::WakeUpThread::execute (this=0x4575ec40) at /home/collabora/online-buildscripts/staging/builddir/libreoffice/framework/source/helper/wakeupthread.cxx:34
#8 0x00007f05ea17d1cb in salhelper::Thread::run (this=0x4575ec40) at /home/collabora/online-buildscripts/staging/builddir/libreoffice/salhelper/source/thread.cxx:39
#9 0x00007f05ea17dcc0 in osl::threadFunc (param=0x4575ec50) at /home/collabora/online-buildscripts/staging/builddir/libreoffice/include/osl/thread.hxx:189
#10 0x00007f05ea1ff1ab in ?? () from /opt/collaboraoffice/program/libuno_sal.so.3
#11 0x00007ffc5b087eff in ?? ()
#12 0x0000000000000000 in ?? ()

several:
#2 0x00000000005d7373 in ThreadPool::work (this=0x309579f0) at ./common/ThreadPool.hpp:137

...

Thread 2 (Thread 0x7f05d4810700 (LWP 25821) "kit_spare_001"):
#0 0x00007f05f1641a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000599c45 in __gthread_cond_timedwait (__abs_timeout=0x7f05d47ffe40, __mutex=0x314096d8, __cond=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/x86_64-redhat-linux/bits/gthr-default.h:872
#2 std::__condvar::wait_until (__abs_time=..., __m=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/bits/std_mutex.h:162
#3 std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__lock=..., __atime=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:221
#4 std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:134
#5 std::condition_variable::wait_for<long, std::ratio<1l, 1000l> > (__rtime=..., __lock=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:162
#6 Watchdog::checkTime (this=0x314096c0) at ./common/Watchdog.hpp:107

Ah - and I'm suckered - the save is synchronous because it is manually triggered in this case: bother ... =)

@mmeeks
Contributor Author

mmeeks commented May 17, 2024

At the point of auto-saving I have:

Thread 5 (Thread 0x7f05ce177700 (LWP 26232) "kitbroker_001"):
#0 0x00007f05f164170c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f05f1a6f397 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00000000005d7373 in ThreadPool::work (this=0x309579f0) at ./common/ThreadPool.hpp:137
#3 0x0000000000ab1173 in execute_native_thread_routine ()
#4 0x00007f05f163a6ea in start_thread () from /lib64/libpthread.so.0
#5 0x00007f05f155149f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f05ce978700 (LWP 26231) "kitbroker_001"):
#0 0x00007f05f164170c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f05f1a6f397 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00000000005d7373 in ThreadPool::work (this=0x309579f0) at ./common/ThreadPool.hpp:137
#3 0x0000000000ab1173 in execute_native_thread_routine ()
#4 0x00007f05f163a6ea in start_thread () from /lib64/libpthread.so.0
#5 0x00007f05f155149f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f05cf179700 (LWP 26230) "kitbroker_001"):
#0 0x00007f05f164170c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f05f1a6f397 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2 0x00000000005d7373 in ThreadPool::work (this=0x309579f0) at ./common/ThreadPool.hpp:137
#3 0x0000000000ab1173 in execute_native_thread_routine ()
#4 0x00007f05f163a6ea in start_thread () from /lib64/libpthread.so.0
#5 0x00007f05f155149f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f05d4810700 (LWP 25821) "kit_spare_001"):
#0 0x00007f05f1641a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000599c45 in __gthread_cond_timedwait (__abs_timeout=0x7f05d47ffe40, __mutex=0x314096d8, __cond=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/x86_64-redhat-linux/bits/gthr-default.h:872
#2 std::__condvar::wait_until (__abs_time=..., __m=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/bits/std_mutex.h:162
#3 std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__lock=..., __atime=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:221
#4 std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:134
#5 std::condition_variable::wait_for<long, std::ratio<1l, 1000l> > (__rtime=..., __lock=..., this=0x31409700) at /opt/rh/devtoolset-12/root/usr/include/c++/12/condition_variable:162
#6 Watchdog::checkTime (this=0x314096c0) at ./common/Watchdog.hpp:107
#7 0x0000000000ab1173 in execute_native_thread_routine ()
#8 0x00007f05f163a6ea in start_thread () from /lib64/libpthread.so.0
#9 0x00007f05f155149f in clone () from /lib64/libc.so.6

Which looks fine; perhaps we don't shut the watchdog down properly - let me poke at that.

@mmeeks
Contributor Author

mmeeks commented May 20, 2024

This is assumed to be either the webdav thread or perhaps the configmgr thread - it's really rather tricky to decide - both are patched to join nicely:

https://gerrit.libreoffice.org/c/core/+/167868 - configmgr
https://gerrit.libreoffice.org/c/core/+/167858 - webdav

Quite probably there are more to find; let's see ...

@caolanm
Contributor

caolanm commented May 20, 2024

After https://gerrit.libreoffice.org/c/core/+/167858 should there now be a matching getLOKit()->startThreads() in Document::startThreads?

#9114 for that thought

@mmeeks
Contributor Author

mmeeks commented May 21, 2024

No more instances of "WRN Failed to ensure we have just one, we have: 2| kit/Kit.cpp:1388" on the staging server since this was deployed; let's assume this is closed then =)
Thanks!

@mmeeks mmeeks closed this as completed May 21, 2024
@mmeeks
Contributor Author

mmeeks commented May 22, 2024

Interestingly, we see this problem while running the unit tests:

https://cpci.cbg.collabora.co.uk:8080/job/github_online_master_debug_vs_co-24.04/1653/consoleText

contains the immortal:

kit-4140617-4140617 2024-05-22 17:24:28.849983 +0000 [ kitbroker_002 ] WRN Failed to ensure we have just one, we have: 2| kit/Kit.cpp:1397

in the check before:

TST_LOG("Background save exited early as expected");

which is exciting =)

@mmeeks mmeeks reopened this May 22, 2024
@mmeeks
Contributor Author

mmeeks commented May 24, 2024

A ton of code-reading later, I came up with this:

https://gerrit.libreoffice.org/c/core/+/168032
and
https://gerrit.libreoffice.org/c/core/+/168033

which I hope together fix this; it turns out we start and stop that progress thread a lot - 18 times during the save-torture test, for example. I can't prove it's the problem, but it seems quite plausible :-)
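The class of bug the fixes target - a frequently started/stopped helper thread that isn't reliably joined, leaving a stray thread alive at fork time - suggests a stop-with-join pattern. A minimal hypothetical sketch (the `ProgressThread` class here is illustrative, not the actual LibreOffice code):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of a worker thread whose stop() guarantees the thread is gone
// before returning, so repeated start/stop cycles never leave a stray
// thread behind to trip a fork-time "exactly one thread" check.
class ProgressThread
{
    std::thread m_thread;
    std::atomic<bool> m_stop{false};

public:
    void start()
    {
        m_stop = false;
        m_thread = std::thread([this] {
            while (!m_stop.load())
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
        });
    }

    void stop()
    {
        m_stop = true;
        if (m_thread.joinable())
            m_thread.join(); // the thread has fully exited when this returns
    }

    bool running() const { return m_thread.joinable(); }
};
```

Without the join in stop(), the window between signalling the thread and its actual exit is exactly when a background-save fork could observe two threads.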
