Reset sx_fbtbc to 0 once we are sure it's haproxy v2 header #322

taodd · 2024-09-29T11:44:16Z

When receiving a haproxy v2 message, after processing the first fragment which only contain the proxy header, the ::recv might return EGAIN if the next fragment is not ready to read yet, then the function svc_vc_recv would return, but left xd->sx_fbtbx == 0x0d0a0d0a (PP2_SIG_UINT32), which was set when parsing the haproxy v2 header. When the next fragment is ready, the epoll event for the same xprt will be trigged, then a non-zero xd->sx_fbtbc would deceit sv_vc_recv into derefrencing a NULL uv pointer.

Stack trace:
Thread 76 "ganesha.nfsd" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fc9937ee640 (LWP 2931)]
0x00007fc9f3e30fa6 in svc_vc_recv (xprt=0x7fc9dc00e3b0) at /git/nfs-ganesha/src/libntirpc/src/svc_vc.c:1070 1070 flags = uv->u.uio_flags;
(gdb) bt
#0 0x00007fc9f3e30fa6 in svc_vc_recv (xprt=0x7fc9dc00e3b0) at /git/nfs-ganesha/src/libntirpc/src/svc_vc.c:1070
#1 0x00007fc9f3e2baab in svc_rqst_xprt_task_recv (wpe=0x7fc9dc00e680) at /git/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1210
#2 0x00007fc9f3e2c740 in svc_rqst_epoll_loop (wpe=0x3003cb8) at /git/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1585
#3 0x00007fc9f3e39be8 in work_pool_thread (arg=0x7fc9e007e030) at /git/nfs-ganesha/src/libntirpc/src/work_pool.c:187
#4 0x00007fc9f3aa0c52 in start_thread () from /lib64/libc.so.6
#5 0x00007fc9f3b24f74 in clone () from /lib64/libc.so.6
(gdb) p uv
$1 = (struct xdr_ioq_uv *) 0x0
(gdb) p xd->sx_fbtbc
$2 = 218762506

taodd · 2024-10-01T00:35:57Z

Reporter with the same issue here:
nfs-ganesha/nfs-ganesha#1158

It's pretty easy to reproduce:

Deploy single haproxy + single nfs-ganesha (v5.5)
Add "default-server send-proxy-v2" option to haproxy config
Repeated do the mount and umount the nfs client in a for loop, it's highly likely to reproduce the crash within 1 minute.

ffilz

I'd like to understand the flow path of the failure better.

Also, it seems like it might be better to reset sx_fbtbc down below in the switch after the call to handle_haproxy_header().

I'm also curious down there is we should do a go to again for local command also? Not quite sure how that's supposed to be handled...

taodd · 2024-10-09T12:30:11Z

I'd like to understand the flow path of the failure better.
The flow path is like below:

svc_vc_recv started to process a new haproxy v2 message and the xd->sx_fbtbc will be set to 0x0d0a0d0a (PP2_SIG_UINT32)
Once the header is processed, the flow jumped to the again block @ https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L1033, which will try to receive the remaining body message.
In the case of the remaining message not ready, ::recv will return EAGAIN and the function svc_vc_recv will just return at https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L993, which will left the xd->sx_fbtbc to a non-zero value 0x0d0a0d0a.
After the remaining body message is ready and the same xprt handler with xd->fbtbc == 0x0d0a0d0a will be passed to svc_vc_recv function and the if condition at https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L967 will fail and go directly to https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L1069, which will crash due to an empty ioq.
I think svc_vc_recv only expect a non-zero xd->fbtbc which represent the remaining bytes of the message to be received. But this is apparently not the case when xd->fbtbc represents a proxy v2 protocal, so we xd->fbtbc==0x0d0a0d0a should be a temporary thing and reset it once it's no longer needed.

Also, it seems like it might be better to reset sx_fbtbc down below in the switch after the call to handle_haproxy_header().

I did try to reset it after the ::recv returned EAGAIN, right before https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L993, but there is a race condition: Before the sx_fbtbc can be reset in this thread, a new thread which is trigged by the epoll event on the same socket fd, trying to recv the body message, will call svc_vc_recv with the same xprt and crash at the same place.
So I think we should reset it before the haproxy v2 header was completely received ? However I can test to reset it after the the call to handle_haproxy_header.

When receiving a haproxy v2 message, after processing the first fragment which only contain the proxy header, the ::recv might return EGAIN if the next fragment is not ready to read yet, then the function svc_vc_recv would return, but left xd->sx_fbtbx == 0x0d0a0d0a (PP2_SIG_UINT32), which was set when parsing the haproxy v2 header. When the next fragment is ready, the epoll event for the same xprt will be trigged, then a non-zero xd->sx_fbtbc would deceit sv_vc_recv into derefrencing a NULL uv pointer. Stack trace: Thread 76 "ganesha.nfsd" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fc9937ee640 (LWP 2931)] 0x00007fc9f3e30fa6 in svc_vc_recv (xprt=0x7fc9dc00e3b0) at /git/nfs-ganesha/src/libntirpc/src/svc_vc.c:1070 1070 flags = uv->u.uio_flags; (gdb) bt (gdb) p uv $1 = (struct xdr_ioq_uv *) 0x0 (gdb) p xd->sx_fbtbc $2 = 218762506

taodd · 2024-10-13T00:42:09Z

@ffilz I revised the patch by resetting the sx_fbtbc after the call to handle_haproxy_header, I can't reproduce the crash with it, works perfect! please take a look

ffilz

Glad this works. That looks great and easily understood (and the code is tight enough here to not need a comment explaining why sx_fbtbc is being reset).

ivveh mentioned this pull request Oct 1, 2024

ganesha.nfsd: segfault at 50 ip 00007fbbf1be3286 sp 00007fbb905f5270 error 4 in libntirpc.so.5.0 nfs-ganesha/nfs-ganesha#1158

Open

dang requested a review from ffilz October 7, 2024 14:18

ffilz reviewed Oct 7, 2024

View reviewed changes

taodd force-pushed the fix-haproxy-v2-segfault branch from 7cb5246 to 17e95c3 Compare October 13, 2024 00:24

ffilz approved these changes Oct 14, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reset sx_fbtbc to 0 once we are sure it's haproxy v2 header #322

Reset sx_fbtbc to 0 once we are sure it's haproxy v2 header #322

taodd commented Sep 29, 2024 •

edited

Loading

taodd commented Oct 1, 2024

ffilz left a comment

taodd commented Oct 9, 2024 •

edited

Loading

taodd commented Oct 13, 2024

ffilz left a comment

Reset sx_fbtbc to 0 once we are sure it's haproxy v2 header #322

Are you sure you want to change the base?

Reset sx_fbtbc to 0 once we are sure it's haproxy v2 header #322

Conversation

taodd commented Sep 29, 2024 • edited Loading

taodd commented Oct 1, 2024

ffilz left a comment

Choose a reason for hiding this comment

taodd commented Oct 9, 2024 • edited Loading

taodd commented Oct 13, 2024

ffilz left a comment

Choose a reason for hiding this comment

taodd commented Sep 29, 2024 •

edited

Loading

taodd commented Oct 9, 2024 •

edited

Loading