[CSIT-1947] Rare VPP crash in nat avf tests #4029

Open
vvalderrv opened this issue Feb 4, 2025 · 7 comments

Comments

@vvalderrv
Contributor

Description

So far seen only in small-scale CPS tests, originally UDP and 4C only. [0] [1]

Probably not a duplicate of CSIT-1901: although that issue also involves 4C and AVF, it needs high traffic.

Also unlikely to be a duplicate of CSIT-1937: although that issue also involves UDP, it appears on a different NIC+driver and only causes a small packet drop, not a crash.

RC1 testing is in progress, so I will try to get core dumps later.

[0] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s8-t3-k3-k5-k1-k1

[1] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s22-t3-k3-k5-k1-k1

Assignee

Unassigned

Reporter

Vratko Polak

Comments

  • vrpolak (Wed, 13 Nov 2024 11:50:52 +0000): ... and so is the clib_dlist_remove symptom [7].

[7] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2410-2n-icx/51/log.html.gz#s1-s1-s1-s2-s4-t1-k3-k4-k1

  • vrpolak (Wed, 13 Nov 2024 09:49:55 +0000):

    The nat44_ed_in2out_fast_path_node_fn_inline symptom is still present [6] in rls2410. Ticket VPP-2117 may describe the same underlying cause as a different symptom.

[6] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2410-2n-icx/35/log.html.gz#s1-s1-s1-s2-s10-t3-k3-k4-k1

  • vrpolak (Wed, 24 Jul 2024 10:59:58 +0000): 1C2T failure is also possible, seen in soak [5].

[5] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-2n-clx/94/log.html.gz#s1-s1-s1-s2-s5-t1-k3-k4

  • vrpolak (Wed, 24 Jul 2024 10:49:56 +0000):

    Also seen at non-small scale and with only 2C. Core [4] points to clib_dlist_remove called by nat44_session_update_lru (slow path). I still assume this is all one issue that somehow corrupts NAT state, so the crash does not happen in a single place.

Still happening only rarely; most iterative runs have no failure.

[4] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-2n-clx/79/log.html.gz#s1-s1-s1-s2-s37-t2-k3-k4-k1

  • vrpolak (Wed, 24 Jul 2024 10:32:21 +0000): In rls2406 I see this happening also in TCP (small scale CPS AVF 4c); I assume it is the same issue. Core [3] points to nat44_ed_in2out_fast_path_node_fn_inline.

[3] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-2n-clx/72/log.html.gz#s1-s1-s1-s2-s20-t3-k3-k4-k1

  • vrpolak (Wed, 27 Mar 2024 09:39:51 +0000): Seems hard to reproduce in verify jobs. So far I got one crash [2] with a debug image, but the mechanism is not clear to me yet.

[2] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2204-x86_64-2n-spr/37/csit_current/0/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k4-k1

Original issue: https://jira.fd.io/browse/CSIT-1947
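
Illustration only: the comments above suspect corrupted NAT state that surfaces in different places, one of them being clib_dlist_remove. The Python sketch below is not VPP code. It models a generic index-linked doubly linked list stored in a pool, and every name in it (Elt, INVALID, build, dlist_remove, walk) is hypothetical. It shows how a stale index used twice fails inside the remove call rather than at the point where the state first went wrong, which is one way a single root cause could produce crashes in more than one function.

```python
INVALID = 0xFFFFFFFF  # assumed marker meaning "element is not linked"


class Elt:
    """One list element; next/prev are indices into the pool, not pointers."""

    def __init__(self, value):
        self.next = INVALID
        self.prev = INVALID
        self.value = value


def build(values):
    """Create a pool whose element 0 is the list head, then append values."""
    pool = [Elt(None)] + [Elt(v) for v in values]
    pool[0].next = 0  # an empty list is a head that links to itself
    pool[0].prev = 0
    for index in range(1, len(pool)):
        dlist_addtail(pool, 0, index)
    return pool


def dlist_addtail(pool, head, new):
    """Link element `new` just before the head, i.e. at the tail."""
    old_tail = pool[head].prev
    pool[new].next = head
    pool[new].prev = old_tail
    pool[old_tail].next = new
    pool[head].prev = new


def dlist_remove(pool, index):
    """Unlink element `index`; complain if it is already unlinked."""
    elt = pool[index]
    if elt.next == INVALID or elt.prev == INVALID:
        # Without this check the code below would index pool[INVALID],
        # far away from whatever left the stale index behind.
        raise RuntimeError(f"element {index} is not linked")
    pool[elt.next].prev = elt.prev
    pool[elt.prev].next = elt.next
    elt.next = elt.prev = INVALID


def walk(pool, head):
    """Return the values in list order, starting after the head."""
    values = []
    index = pool[head].next
    while index != head:
        values.append(pool[index].value)
        index = pool[index].next
    return values


if __name__ == "__main__":
    pool = build(["a", "b", "c"])
    dlist_remove(pool, 2)      # first owner of index 2 removes it
    print(walk(pool, 0))
    try:
        dlist_remove(pool, 2)  # a second, stale holder of index 2
    except RuntimeError as err:
        print("detected:", err)
```

In this model the first removal succeeds and the second one is where the failure shows up, even though the actual mistake is whichever earlier step left two holders of the same index. Whether the NAT44 crashes in this ticket follow that pattern is an open question, not something the cores linked here establish.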


@vrpolakatcisco
Contributor

I do not see this crash in rls2502 results, but it still occasionally happens [8] in periodic jobs (without core).

[8] https://logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-soak-weekly-master-2n-spr/44/log.html.gz#s1-s1-s1-s2-s6-t1-k3-k5-k1-k1
