[VPP-1413] VPP main thread gets stuck in a deadlock on running CLI in loop #2877

Closed
vvalderrv opened this issue Feb 1, 2025 · 6 comments
Description

I am running a script that fires a VPP CLI command in a loop and keeps collecting some data. After some time the CLI hangs and the main thread is stuck in "mheap_maybe_lock".

This happens with other CLI commands as well, and for me it is easily reproducible.

Here is a snippet from "info thr":

  2 Thread 0x7fb2337fc700 (LWP 5026) "vpp_stats" 0x00007fb4597bdf3d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7fb45be9c740 (LWP 5016) "vpp_main" 0x00007fb459f09a3b in mheap_maybe_lock (v=0x7fb398fc9000)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mheap.c:66

Here is the backtrace taken with GDB:

(gdb) bt
#0 0x00007fb459f09a3b in mheap_maybe_lock (v=0x7fb398fc9000)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mheap.c:66
#1 mheap_get_aligned (v=0x7fb398fc9000, n_user_data_bytes=8, n_user_data_bytes@entry=5, align=, align@entry=4,
    align_offset=0, align_offset@entry=4, offset_return=offset_return@entry=0x7fb399c16698)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mheap.c:675
#2 0x00007fb459f33730 in clib_mem_alloc_aligned_at_offset (os_out_of_memory_on_failure=1, align_offset=4, align=4, size=5)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mem.h:91
#3 vec_resize_allocate_memory (v=, length_increment=length_increment@entry=1, data_bytes=5,
    header_bytes=, header_bytes@entry=0, data_align=data_align@entry=4)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/vec.c:59
#4 0x00007fb45b83477b in _vec_resize (data_align=, header_bytes=, data_bytes=,
    length_increment=, v=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/vec.h:142
#5 unix_cli_add_pending_output (uf=0x7fb399ccbd4c, buffer=0x7fb45b84a8d6 "\r", buffer_bytes=1, cf=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/unix/cli.c:528
#6 0x00007fb45b8374c0 in unix_cli_file_welcome (cf=0x7fb39a5d4a40, cm=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/unix/cli.c:1137
#7 0x00007fb459f3e1ab in timer_interrupt (signum=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/timer.c:125
#8
#9 mheap_maybe_unlock (v=0x7fb398fc9000) at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mheap.c:85
#10 mheap_get_aligned (v=0x7fb398fc9000, n_user_data_bytes=, n_user_data_bytes@entry=12, align=,
    align@entry=4, align_offset=0, align_offset@entry=4, offset_return=offset_return@entry=0x7fb399c16ef8)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mheap.c:717
#11 0x00007fb459f33730 in clib_mem_alloc_aligned_at_offset (os_out_of_memory_on_failure=1, align_offset=4, align=4, size=12)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/mem.h:91
#12 vec_resize_allocate_memory (v=v@entry=0x0, length_increment=1, data_bytes=12, header_bytes=,
    header_bytes@entry=0, data_align=data_align@entry=4)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/vec.c:59
#13 0x00007fb45b83af23 in _vec_resize (data_align=0, header_bytes=0, data_bytes=,
    length_increment=, v=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/vec.h:142
#14 vlib_process_get_events (data_vector=, vm=0x7fb45ba572c0 <vlib_global_main>)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/node_funcs.h:562
#15 unix_cli_process (vm=0x7fb45ba572c0 <vlib_global_main>, rt=0x7fb399c06000, f=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/unix/cli.c:2414
#16 0x00007fb45b8037a6 in vlib_process_bootstrap (_a=)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/main.c:1231
#17 0x00007fb459efe808 in clib_calljmp ()
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vppinfra/longjmp.S:110
#18 0x00007fb39a5c8c20 in ?? ()
#19 0x00007fb45b804ae9 in vlib_process_startup (f=0x0, p=0x7fb399c06000, vm=0x7fb45ba572c0 <vlib_global_main>)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/main.c:1253
#20 dispatch_process (vm=0x7fb45ba572c0 <vlib_global_main>, p=0x7fb399c06000, last_time_stamp=0, f=0x0)
    at /root/vpp_dt/build-root/rpmbuild/vpp-18.01.1.0/build-data/../src/vlib/main.c:1296
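
Reading the backtrace, frames #9 to #20 are the CLI process allocating memory, frame #7 is timer_interrupt running on top of that interrupted allocation, and frame #0 is the handler waiting on the same heap. The sketch below is a simplified model of why that can never make progress on a single thread. It is not VPP's mheap code: toy_lock_t, toy_try_lock, toy_unlock, handler_can_lock and reentry_would_deadlock are all hypothetical names, and a single try-lock attempt stands in for the spin loop so the example terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a heap lock; not VPP's mheap implementation. */
typedef struct {
    atomic_flag locked;
} toy_lock_t;

/* Try once to take the lock; returns true on success. A real spinlock
   would loop here, which hangs if the holder can never run again. */
static bool toy_try_lock(toy_lock_t *l)
{
    return !atomic_flag_test_and_set(&l->locked);
}

static void toy_unlock(toy_lock_t *l)
{
    atomic_flag_clear(&l->locked);
}

/* Models a handler that interrupts the lock holder on the same thread.
   Returns true if the handler could have acquired the lock. */
static bool handler_can_lock(toy_lock_t *l)
{
    bool got = toy_try_lock(l);
    if (got)
        toy_unlock(l);
    return got;
}

/* Returns 1 if re-entry while the lock is held would spin forever. */
int reentry_would_deadlock(void)
{
    toy_lock_t l = { ATOMIC_FLAG_INIT };
    bool first = toy_try_lock(&l);       /* interrupted code holds the lock */
    assert(first);
    bool nested = handler_can_lock(&l);  /* "handler" runs before the unlock */
    toy_unlock(&l);
    bool after = handler_can_lock(&l);   /* once released, it succeeds */
    return (!nested && after) ? 1 : 0;
}
```

In this model the nested attempt fails because the only code that could release the lock is the code the handler interrupted, so a spinning handler would wait forever.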

Assignee

Chris Luke

Reporter

Siddarth Rai

Comments

  • chrisluke (Wed, 10 Jul 2019 20:47:46 +0000): The proposed change has now been merged. Since this issue duplicates VPP-1711, I shall close this one.
  • chrisluke (Wed, 10 Jul 2019 03:49:32 +0000): Siddarth Rai, FYI: it's possible this is the same issue as https://jira.fd.io/browse/VPP-1711, for which I have a proposed fix at https://gerrit.fd.io/r/#/c/20573/
  • jhahn (Sun, 17 Feb 2019 23:12:40 +0000): Siddarth Rai, is this still an issue?
  • chrisluke (Mon, 10 Sep 2018 18:32:32 +0000): Interesting; which means it's a non-interactive session.

So I see two problems here.

  1. It's in the unix_cli_file_welcome function at a point suggesting it is going to send a banner; it shouldn't be, since it's non-interactive. It got here from a timer, meaning the session failed to do Telnet negotiation and timed out. There was a bug for this fixed last week in master.

  2. The timer happened to fire whilst in a memory management function; the crash/deadlock happens because the timer function also causes memory management to happen, meaning we have some non-reentrant functionality. I previously thought the timer interrupt was protected from this, so I now suspect we need to handle that differently.

Since there was a fix for 1) posted recently (specifically the Telnet protocol state machine had an issue where it would stall until new input arrived) I suggest trying a build from master to see if it improves your lot. The fix was in https://gerrit.fd.io/r/#/c/14684/ and is probably backportable.
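
For problem 2), one common way to keep a timer signal away from non-reentrant code is to have the handler only set a flag and let the main loop do the real work later. The sketch below illustrates that pattern in plain C under stated assumptions; it is not the fix that was proposed for VPP, and on_timer, send_banner, poll_once and deferred_timer_demo are hypothetical names. malloc stands in for any allocator that is unsafe to call from a signal handler.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Set by the handler, consumed by the main loop. volatile sig_atomic_t is
   the object type the C standard allows a handler to write safely. */
static volatile sig_atomic_t timer_expired = 0;

/* Hypothetical handler: records the event and returns without allocating. */
static void on_timer(int signum)
{
    (void) signum;
    timer_expired = 1;
}

/* Hypothetical deferred work: allocating is fine here because this runs in
   normal context, never on top of an interrupted allocator call. */
static size_t send_banner(void)
{
    const char *banner = "welcome\r\n";
    size_t len = strlen(banner);
    char *out = malloc(len + 1);
    if (out == NULL)
        return 0;
    memcpy(out, banner, len + 1);
    free(out);
    return len;
}

/* One iteration of a hypothetical main loop; returns bytes "sent". */
static size_t poll_once(void)
{
    if (!timer_expired)
        return 0;
    timer_expired = 0;
    return send_banner();
}

/* Installs the handler, raises SIGALRM synchronously, then polls once. */
size_t deferred_timer_demo(void)
{
    if (signal(SIGALRM, on_timer) == SIG_ERR)
        return 0;
    raise(SIGALRM);              /* handler has run by the time this returns */
    assert(timer_expired == 1);
    return poll_once();
}
```

The trade-off is latency: the banner goes out on the next loop iteration rather than at the instant the timer fires, in exchange for never entering the allocator from signal context.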

  • siddsr (Mon, 10 Sep 2018 18:05:25 +0000): I am using a vppctl command only.

Something like this:

#!/bin/sh
for i in `seq 1 90000`
do
vppctl sh node counters
done

  • chrisluke (Mon, 10 Sep 2018 17:49:50 +0000): Can you be more specific about your "script which fires a VPP CLI in loop"? I see it's trying to display the banner, which means you're probably not using vppctl with the CLI command as a parameter, correct? How are you calling the CLI? Which commands?

Also, 18.01 is probably considered old at this point; it's two releases behind the current release, and there have been several code changes since that may relate to this (event loop, memory management, CLI).

Original issue: https://jira.fd.io/browse/VPP-1413

