Added protection for cases when client drives sockets itself after HTTP protocol upgrade #12761

dkarpov1970 · 2024-01-23T00:34:20Z

Added crash protection for cases when client drives sockets itself after upgrading libcurl HTTP transfer to a different protocol (i.e. WebSockets) and steps on socket errors.

In such cases, it is possible that client performs read/write operations and calls libcurl socket filters after libcurl has cleared the transfer easy handle.

…grading HTTP libcurl transfer to a different protocol and steps on socket errors. In such cases, it is possible that client performs read/write operations and calls libcurl socket filters after libcurl has cleared the transfer easy handle.

jay · 2024-01-23T04:58:14Z

Ref: https://curl.se/mail/lib-2024-01/0038.html

As I said in that thread I am not sure this is correct, I would expect data always be != NULL.

/cc @icing

bagder · 2024-01-23T07:38:21Z

Can you elaborate more on how you get this problem? Maybe an example reproducing it? To me it feels like this is just hiding the problem when maybe it could be fixed earlier...

dkarpov1970 · 2024-01-23T08:49:17Z

As I mentioned in the e-mail thread, this problem happens when some client code uses Curl HTTP transfer for "upgrading" to some bi-directional protocol like WebSocket.

After the "upgrade", the client code extract SSL handle and drives read/write operations itself performing protocol operations.

It works OK unless there is some socket/network error, and it looks like that Curl easy handle that was used to start the protocol transfer gets cleared by the error (maybe socket was closed or something like that), and the subsequent client SSL read operations crash in the libcurl socket filter function with a null data upon socket read failure.

It works OK with 7.84, but back then there was not socket filter and read/write operations just failed on socket/network errors indicating that the WebSocket is done.

The main factor here seems to the fact that the "data" gets cleared by some socket error, but subsequent client read/write socket operations still use libcurl socket filter and thus step on the crash due to null data.

dkarpov1970 · 2024-01-31T19:09:54Z

@bagder As I can see these changes didn't get into 8.6.0, I am wondering if it is possible to get them into the next release?

bagder · 2024-02-10T16:21:52Z

I still want more information on how to reproduce this, and possibly @icing's comment on this. It does not feel right to just add this check. To me, it seems to indicate a previous problem that we should deal with rather than to paint it over with this extra condition.

I think we want:

asserts in both these places to trigger hard in debug situations to help us catch these problems
both functions that you change with this patch access ->data in other places so this is not a complete fix

dkarpov1970 · 2024-02-13T06:00:47Z

As I mentioned, this problem happens when some client drives sockets itself after "upgrading" to some protocol.
So, it performs read/write operation on its own. This works OK, until some socket error occurs, and then when the client tries to read/write it crashes because 'data' is null the in filter function.

Maybe, you are right, and the problem is bigger, but it happens because the 'data' gets null as a result of socket error before the socket filter is unregistered - so when client does its read/write it steps on it.

My change is specifically for handling socket error parts in read/write functions, and I guess the other places where 'data' can be null should be checked/handled differently.

I guess the main question to @icing (as I understand he is the author of the socket filter code) - if it was thought about cases when client performing socket operations itself may do read/writes after socket was closed because of error (that's when we get null 'data' in the filter)?

Unfortunately, I don't know how to reliably reproduce these crashes, because they randomly happen on remote devices with limited debugging resources and inside 3rd party code, but they happen after the client calls SSL_read() some time after socket error has occurred.

icing · 2024-02-13T09:09:23Z

@dkarpov1970 If I understand you correctly, the application does a "connect only", getting the socket from curl and then reading/writing to that socket directly. When it is done, it calls curl_easy_cleanup(), right?

My question is then: how do connection filters get involved after the application has taken over? If you see a crash with data == NULL, what is the call stack?

Thanks for your help in understanding this.

dkarpov1970 · 2024-02-13T20:00:23Z

@icing Yes, the application does this kind of thing - "connect only", then doing SSL read/writes to run its own WebSocket implementation and curl_easy_cleanup() at the end of WebSocket life.

Everything works smoothly until some socket error occurs, and the next SSL read after it crashes.

The call stack of the crash is the following:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0xb647c6ce in Curl_failf (data=0x0, fmt=0xb64a70c8 = "Recv failure: %s") at ../../../../../../os/dist/curl/Carbon14_dev.OBJ/src/lib/curl_trc.c:90
#1 0xb6473e78 in nw_in_read (reader_ctx=0xb43fc928, buf=, len=5, err=0xb43fc964) at ../../../../../../os/dist/curl/Carbon14_dev.OBJ/src/lib/cf-socket.c:866
#2 0xb6475304 in cf_socket_recv (cf=0xb4d10810, data=0x0, buf=0xb4dd6ddb, len=5, err=0xb43fc964) at ../../../../../../os/dist/curl/Carbon14_dev.OBJ/src/lib/cf-socket.c:1414
#3 0xb6498234 in ossl_bio_cf_in_read (bio=0xb4d10560, buf=0xb4dd6ddb, blen=5) at ../../../../../../os/dist/curl/Carbon14_dev.OBJ/src/lib/vtls/openssl.c:763
#4 0xb6240c9c in bread_conv (bio=, data=, datal=, readbytes=0xb43fc9cc) at crypto/bio/bio_meth.c:123
#5 0xb6242bd2 in bio_read_intern (b=0xb4d10560, data=0xb4dd6ddb, dlen=5, readbytes=0xb43fc9cc) at crypto/bio/bio_lib.c:292
#6 0xb6242c70 in BIO_read (dlen=, data=, b=) at crypto/bio/bio_lib.c:318
#7 0xb6242c70 in BIO_read (b=, data=, dlen=) at crypto/bio/bio_lib.c:310
#8 0xb6428ff4 in ssl3_read_n (s=0xb4d527a0, n=5, max=, extend=, clearold=1, readbytes=0xb43fca4c) at ssl/record/rec_layer_s3.c:293
#9 0xb64299ba in ssl3_get_record (s=0xb4d527a0) at ssl/record/ssl3_record.c:210
#10 0xb64299ba in ssl3_read_bytes (s=0xb4d527a0, type=, recvd_type=, buf=, len=, peek=, readbytes=) at ssl/record/rec_layer_s3.c:1350
#11 0xb64153ea in ssl3_read_internal (s=0xb4d527a0, buf=0xb43fcb68, len=8192, peek=0, readbytes=0xb43fcb1c) at ssl/s3_lib.c:4463
#12 0xb641542e in ssl3_read (s=, buf=, len=, readbytes=) at ssl/s3_lib.c:4486
#13 0xb6418ba2 in SSL_read (s=, buf=, num=) at ssl/ssl_lib.c:1958 ...
...

dkarpov1970 · 2024-02-13T21:22:52Z

I think I found a not over complicated sequence that can reproduce the crash with HTTPs transfer done via curl_easy_perform(), socket closure and post transfer SSL read:

Transfer setup:

Transfer should register SSL context callback via CURLOPT_SSL_CTX_FUNCTION to get pointer to SSL_CTX instance.
Transfer should register one of SSL-specific callbacks to get pointer to SSL handle used later in SSL_read() calls.
(I.e. the SSL_CTX_set_verify() can be used to register peer verification callback and then use X509_STORE_CTX_get_ex_data() to get SSL handle).
- Transfer should register socket option callback via CURLOPT_SOCKOPTFUNCTION to get socket descriptors and save them.

Steps:

perform HTTPs transfer (i.e. https://www.google.com)
close all the socket handles obtained in socket option callback (this will make them invalid and trigger errors on read/write operations).
Do post-transfer SSL_read() (i.e. 8 KiB) using SSL handle obtained in the SSL callbacks, which should step into a null data in the nw_in_read()/Curl_failf() and crash.

icing · 2024-02-14T07:59:54Z

I think I found a not over complicated sequence that can reproduce the crash with HTTPs transfer done via curl_easy_perform(), socket closure and post transfer SSL read:

Transfer setup:

Transfer should register SSL context callback via CURLOPT_SSL_CTX_FUNCTION to get pointer to SSL_CTX instance.

Transfer should register one of SSL-specific callbacks to get pointer to SSL handle used later in SSL_read() calls.
(I.e. the SSL_CTX_set_verify() can be used to register peer verification callback and then use X509_STORE_CTX_get_ex_data() to get SSL handle).

Transfer should register socket option callback via CURLOPT_SOCKOPTFUNCTION to get socket descriptors and save them.

Steps:

perform HTTPs transfer (i.e. https://www.google.com)

close all the socket handles obtained in socket option callback (this will make them invalid and trigger errors on read/write operations).

Do post-transfer SSL_read() (i.e. 8 KiB) using SSL handle obtained in the SSL callbacks, which should step into a null data in the nw_in_read()/Curl_failf() and crash.

Re Step 3: you cannot use any SSL_* function on the instance after the transfer has been cleaned up (if I understand you correctly). That will not work.

dkarpov1970 · 2024-02-14T19:39:13Z

@icing I think I should've clarified that the step 3 should be done after the transfer has been performed (by curl_easy_perform() in my example) but before its easy handle is cleared by the curl_easy_cleanup().

In this case, the post-transfer SSL_read() will still work because the SSL context and SSL handle are still alive.
And closing transfer sockets before calling SSL_read(), will make a null data and a crash in SSL_read().

If I don't close sockets before calling SSL_read(), then nw_in_read() in the call chain will return -1 without crashes with CURLE_AGAIN error because of EWOULDBLOCK kind of errors.

But if I close the sockets, then the nw_in_read() goes to the non-"EWOULDBLOCK" error part where it tries to access null data for the failf() function.

The real application which stepped on the crash, as far as I can tell, uses a multi-handle and drives multiple transfers, but the crash is caused by the same or a similar sequence - SSL_read() after socket error.

dkarpov1970 · 2024-02-22T22:47:26Z

@icing @bagder Any update on this one?

icing · 2024-02-23T08:04:53Z

@dkarpov1970 Thanks for the reminder. Question: would curl_easy_recv() not be the right way to read more from the connection?

I still don't like the idea that apps do I/O on an SSL they obtained from a callback. Maybe @bagder has an idea how this is supposed to work.

bagder · 2024-02-23T09:36:45Z

Is the SSL_read() is done by the application after having extracted the SSL handle/context from the connection libcurl setup? If so, then I don't think this is covered by libcurl's responsibility. We cannot guarantee that this will work. It is not the libcurl API. We provide curl_easy_recv() for reading from a connect-only created connection.

dkarpov1970 · 2024-02-23T19:50:25Z

Yes, the SSL_read is done by the application after having extracted the SSL handle/context from the connection libcurl.

But the app developers told me that they can't use the curl_easy_recv() because they pass the SSL handle to some WebSocket client library code via its public API.
This WS library which can work with different HTTP clients, and it cannot be easily modified and use curl_easy_recv() because it works with SSL handles internally.

They are now facing a crashing regression because everything worked fine for many years in the previous Curl releases.

While I agree that this case is probably not quite aligned with Curl usage model, it is still possible that "Connect-only" clients extract SSL handles and pass them to some 3rd party libraries which work with SSL handles internally and can't be modified to use curl_easy_recv() because they may work not only with libcurl.

I think my change provides a very simple protection for such cases and helps to avoid crashes which may occur in the scenarios that I described.

bagder · 2024-02-26T08:30:20Z

I think my change provides a very simple protection for such cases and helps to avoid crashes which may occur in the scenarios that I described.

I already explained how it is not a complete fix.

They are now facing a crashing regression because everything worked fine for many years in the previous Curl releases.

I don't think this is a convincing argument. They extract internal data from the libcurl engine and run with that, but libcurl makes no promises that this works nor will we make any such promises going forward. libcurl owns that handle and sets up stuff for it in ways that has been tested and works assuming the API is used as documented.

If someone goes beyond that, then we cannot guarantee behavior or functionality.

icing · 2024-02-26T08:41:25Z

Yes, the SSL_read is done by the application after having extracted the SSL handle/context from the connection libcurl.

But the app developers told me that they can't use the curl_easy_recv() because they pass the SSL handle to some WebSocket client library code via its public API. This WS library which can work with different HTTP clients, and it cannot be easily modified and use curl_easy_recv() because it works with SSL handles internally.

They are now facing a crashing regression because everything worked fine for many years in the previous Curl releases.

This sort of use is not part of curl's API. Even if it worked in the past, it is not a use case we want libcurl to accommodate for.

While your proposed patch solves the current issue, we are not able to guarantee this will continue to work in the future. The SSL* is an internal detail of our implementation. This is exactly the reason why it broke for your application. We changed the way it is implemented. We want to be free to change/enhance it again in the future.

I think my change provides a very simple protection for such cases and helps to avoid crashes which may occur in the scenarios that I described.

While the change is simple, you ask us to "make this use case work" in future releases as well. And that is what we do not want to do. For example, your application might one day use a HTTP proxy setup and that will make drag in other parts of libcurl. Which then need also be fixed.

And the SSL* might get a SESSION update from the server, which will try to update curl's internal session cache. etc. etc. We'll open one can of worms after another here.

dkarpov1970 · 2024-02-26T20:37:41Z

I would argue that fixing the crash is not really opening a box of warms here.
It doesn't require any additional guarantees for the scenarios that I described beyond the crash fix.

And the scenario that I described is not fully artificial, because 3rd party libraries driving high-level protocols after "connect only" phase may require low level read/write handles like SSL handle or sockets to make protocol functional, and they can't be modified to use curl_easy_recv() instead.

I understand that this scenario is not supported and it may not work in future, but preventing the crash doesn't create a claim that from now on such scenarios are fully supported.

It still will be responsibility of the client to take the risk of unsupported case, use this mechanism properly, and take care of the issues, but at least crashes when reporting read/write socket errors will be avoided.

So, this change is just a basic null pointer protection and not a requirement to add a full support for such use case in the future.
And we can add additional "debug" info message to verbose output, indicating unexpected null data when preventing the crash.

Also, libcurl provides a way to extract SSL context and sockets via CURLOPT_SSL_CTX_FUNCTION and CURLOPT_SOCKOPTFUNCTION, so the "box of warms" is already opened, and my trivial change doesn't add any new "warm" to it.

dkarpov1970 · 2024-02-27T19:33:05Z

Another point. While I agree that using curl_easy_send() and curl_easy_recv() is the preferable way to drive send and receive operations for the "connect only" case, it may not work if some application needs to use 3rd party libraries which implement some complex protocols on top of HTTP.

Such libraries may be written without capabilities to pass curl_easy_send() and curl_easy_recv() as part of their IO-related API, and may require sockets or SSL handle to be passed in.

These libraries may be designed to work with different HTTP clients, not just libcurl, and they take care of all nuances between their protocol client and backend.

This is a valid use case, and the only way how libcurl clients may work with such libraries is by extracting sockets/SSL handle and passing them to the libraries. And this approach used to work just fine, and it still works OK even if it isn't officially supported.

My simple change doesn't add anything new, it just allows this approach to work without crashes.
It is still a full responsibility of the client code to make sure that it works on the protocol level, but the change allows client not to crash in libcurl code in case of socket errors.

jay · 2024-02-29T07:58:05Z

I have to agree that this is not something we can support.

My simple change doesn't add anything new, it just allows this approach to work without crashes.

Your change is covering up something that should not happen.

dkarpov1970 · 2024-03-01T20:10:32Z

@jay I don't ask for a full support for such scenarios. My change just allows to avoid the crash and let it work, and it is a trivial null pointer check.
It is up to the clients to take the risk of using it if it can't be supported.

And they may have no other choice if some 3rd party high-level protocol library needs low-level IO read/read handles.
As I mentioned, such libraries may be designed to work with different HTTP clients, and they cannot be easily modified to change their IO code to use curl_easy_send() and curl_easy_recv() as part of their implementation just for libcurl.

Your change is covering up something that should not happen.

This scenario has been working fine for many years, and it is possible that not only my client uses some 3rd party libraries requiring sockets/SSL handles. So, it may happen even if it shouldn't.

And as I suggested in my previous comments, we can add a debug warning message about null data as an indication that it is not a supported scenario, but we at least we can help to avoid the crashes.

dkarpov1970 · 2024-03-02T01:08:58Z

I want to talk a bit more about using direct SSL_read()/SSL_write() use case for the "connect only" case.

The scenario works OK because curl_easy_recv() / curl_easy_send() also use SSL_read()/SSL_write() internally, but they just make sure that they attach easy handle data to the reader/writer context.

And if we look at the nw_in_read() and cf_socket_send() then we will see that the easy handle data is only needed to report errors, and for nothing else.

That's why direct SSL_read()/SSL_write() work the same way as curl_easy_recv() / curl_easy_send() except the case of socket errors where they crash because of null easy data needed only to report these socket errors!

So, if we provide a trivial protection for a null easy data in the nw_in_read() and cf_socket_send() (which is used only for error reporting) then the SSL_read()/SSL_write() use case will continue working just fine, and no crashes will occur in the clients that have been using this mechanism for many years.

And the CURL_TRC_CF() macro used in nw_in_read() and cf_socket_send() already has such protection - it checks internally if the "data" is null, so my change make it consistent with that macro while avoiding socket error crashes.

bagder · 2024-03-04T12:37:29Z

Such libraries may be written without capabilities to pass curl_easy_send() and curl_easy_recv() as part of their IO-related API

I don't think anyone denies that there might be such code bases. I think what we might be proposing is that the appropriate fix here should rather be to make such users use the correct API instead of letting them keep walking blind-folded on the edge of a cliff.

And if we look at the nw_in_read() and cf_socket_send() then we will see that the easy handle data is only needed to report errors, and for nothing else.

The missing words from that sentence are: for the moment. It is quite possible that these functions will be modified going forward to do other things involving the easy handles.

Your change is covering up something that should not happen.

This scenario has been working fine for many years

It has probably worked for decades but I don't think that is a strong argument. It is undocumented behavior.

Covering up for cases in code that can't happen is generally a bad idea. It makes the code harder to read and it generally also tricks static code analyzers (and humans) etc to believe that a NULL pointer is okay in that variable at that time. It also risks being a cat and mouse game of further similar changes - if we do this change for a case that can't happen, why should we turn down additional checks for other situations that only can trigger if the API is "abused" (in other ways)?

I would argue that we should assert on NULLs instead, because they should not be NULL and if they are that's a sign something is terribly wrong.

dkarpov1970 · 2024-03-06T02:17:59Z

I don't think anyone denies that there might be such code bases. I think what we might be proposing is that the appropriate fix here should rather be to make such users use the correct API instead of letting them keep walking blind-folded on the edge of a cliff.

@bagder But what would be the appropriate fix here? The problem is that some client app needs to use some 3rd party library which implements high-level protocol on top of HTTP upgrade which internally uses low-level IO read/write primitives like sockets or SSL handle.

And if it doesn't provide a way to "inject" curl_easy_recv() / curl_easy_send() for its internal IO read/write operations, then what are good options that client has to integrate it with libcurl?

I don't see any viable options except extracting sockets/SSL handle in libcurl callbacks and passing them to these libraries even though it is not a supported scenario.

And argument that it has been working for decades, and still works is a strong one - it indicates that there is nothing fundamentally wrong with that use case, and there may be a significant number of client apps that had to use it by the reasons I mentioned above.

Don't get me wrong, I fully understand the hesitation of doing any changes that will support the unsupported scenario that just happens to work somehow.
But on the other hand, if the price for keeping it working is very low (yes, with the current code base, maybe not for the future), and client is accepting the risk that it may break in the future then why not do it?

I would argue that we should assert on NULLs instead, because they should not be NULL and if they are that's a sign something is terribly wrong.

I am not sure that my "not normal" case is the only case when the crash that my change is trying to prevent may happen.

I suspect that there may be some other "normal" cases when the "data" can be null, besides my "not normal" scenario, and that's why the CURL_TRC_CF() macro does check it, and if we assert on that cases then something normal may get broken.

So, my change may potentially prevent crashes in some "normal" cases.

bagder · 2024-06-30T21:37:00Z

So, my change may potentially prevent crashes in some "normal" cases.

I don't think so. If we would like to catch internal mistakes, I think we should rather insert DEBUGASSERT()s for those cases. As this is made, it is not a change I think benefits us. It leads us down the wrong path.

github-actions bot added the connecting & proxies label Jan 23, 2024

bagder closed this Jun 30, 2024

Uh oh!

Added protection for cases when client drives sockets itself after HTTP protocol upgrade #12761

Added protection for cases when client drives sockets itself after HTTP protocol upgrade #12761

Uh oh!

Conversation

dkarpov1970 commented Jan 23, 2024

Uh oh!

jay commented Jan 23, 2024

Uh oh!

bagder commented Jan 23, 2024

Uh oh!

dkarpov1970 commented Jan 23, 2024

Uh oh!

dkarpov1970 commented Jan 31, 2024

Uh oh!

bagder commented Feb 10, 2024

Uh oh!

dkarpov1970 commented Feb 13, 2024

Uh oh!

icing commented Feb 13, 2024

Uh oh!

dkarpov1970 commented Feb 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dkarpov1970 commented Feb 13, 2024

Uh oh!

icing commented Feb 14, 2024

Uh oh!

dkarpov1970 commented Feb 14, 2024

Uh oh!

dkarpov1970 commented Feb 22, 2024

Uh oh!

icing commented Feb 23, 2024

Uh oh!

bagder commented Feb 23, 2024

Uh oh!

dkarpov1970 commented Feb 23, 2024

Uh oh!

bagder commented Feb 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

icing commented Feb 26, 2024

Uh oh!

dkarpov1970 commented Feb 26, 2024

Uh oh!

dkarpov1970 commented Feb 27, 2024

Uh oh!

jay commented Feb 29, 2024

Uh oh!

dkarpov1970 commented Mar 1, 2024

Uh oh!

dkarpov1970 commented Mar 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bagder commented Mar 4, 2024

Uh oh!

dkarpov1970 commented Mar 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bagder commented Jun 30, 2024

Uh oh!

Uh oh!

dkarpov1970 commented Feb 13, 2024 •

edited

Loading

bagder commented Feb 26, 2024 •

edited

Loading

dkarpov1970 commented Mar 2, 2024 •

edited

Loading

dkarpov1970 commented Mar 6, 2024 •

edited

Loading