Uncaught exceptions lead to on_execute / on_operation lifecycle hooks completing before some resolvers #3414

Open
kkom opened this issue Mar 19, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@kkom

kkom commented Mar 19, 2024

TL;DR

Strawberry short-circuits the HTTP response whenever there is an uncaught exception. This reduces the latency, but leads to:

(i) (a) incomplete and (b) nondeterministic responses (edit: established in the comments that it's expected)
(ii) hooks being completed before some resolvers, leading to apparent violation of a contract

I wonder if it would be possible to make Strawberry run all resolvers to the end, even if some of them raise uncaught exceptions?

Describe the Bug

  1. Strawberry executes all resolvers, even if there was an uncaught exception which triggered an early HTTP response with errors.
  2. However, it eagerly returns a response with errors, as soon as an (edit: incoercible) exception is raised.
  3. Finally, it completes all lifecycle hooks before returning the response – including on_execute and on_operation.

This last point can lead to issues – it violates the invariant that on_execute / on_operation lifecycle hooks wrap around all resolver executions.

This can be problematic when these hooks do state management, like in the example given in Strawberry's docs. As a result, in addition to seeing the original uncaught exception in our observability suite, we have additional noise from knock-on failures – caused by premature completion of hooks.

Is this natural behaviour given how various async tasks are orchestrated, or is it possible to tweak this a little? I'm thinking:

  1. cancelling the tasks that won't be used for the response anyway (as it's been already returned)
  2. waiting until all resolvers finish to return the response

In fact, option 2 may have other benefits – making the responses more (a) complete and (b) predictable. Currently, the GraphQL responses (i.e. which fields will return data and which won't) are non-deterministic (albeit a little faster thanks to the uncaught-exception short circuit). (edit: established in the comments that the short-circuiting is expected)

Repro code

Schema:

from asyncio import sleep
from datetime import datetime

import strawberry


@strawberry.type
class Query:
    @strawberry.field
    @staticmethod
    async def fail() -> str:
        await sleep(0.5)
        raise Exception(f"'fail' resolver has failed ({datetime.now()})")

    @strawberry.field
    @staticmethod
    async def wait() -> str:
        await sleep(2)
        print(f"'wait' resolver is about to return ({datetime.now()})")
        return "foo"

Logging extension:

from collections.abc import Callable, Generator
from datetime import datetime
from typing import override

from graphql import GraphQLResolveInfo
from strawberry.extensions import SchemaExtension
from strawberry.utils.await_maybe import AwaitableOrValue, await_maybe


class MyCustomExtension(SchemaExtension):
    @override
    def on_execute(self) -> Generator[None, None, None]:
        print(f"'on_execute' start ({datetime.now()})")
        yield
        print(f"'on_execute' end ({datetime.now()})")

    @override
    async def resolve(
        self,
        _next: Callable[..., object],
        root: object,
        info: GraphQLResolveInfo,
        *args,
        **kwargs,
    ) -> AwaitableOrValue[object]:
        print(f"'{info.field_name}' resolver start ({datetime.now()})")
        result = await await_maybe(_next(root, info, *args, **kwargs))
        print(f"'{info.field_name}' resolver end ({datetime.now()})")
        return result
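
For completeness, the schema and extension are presumably wired up roughly like this (the exact setup may differ):

# Rough wiring of the schema with the logging extension (illustrative, not part of the original snippet).
schema = strawberry.Schema(query=Query, extensions=[MyCustomExtension])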

Example query:

query {
  fail
  wait
}

Example response:

{
  "data": null,
  "errors": [
    {
      "message": "'fail' resolver has failed (2024-03-19 21:08:12.088337)",
      "locations": [
        {
          "line": 2,
          "column": 3
        }
      ],
      "path": [
        "fail"
      ]
    }
  ]
}

Logs demonstrating that the resolvers continue being executed after hooks complete:

'on_execute' start (2024-03-19 21:08:11.587192)
'fail' resolver start (2024-03-19 21:08:11.587345)
'wait' resolver start (2024-03-19 21:08:11.587378)
'fail' resolver has failed (2024-03-19 21:08:12.088337)

GraphQL request:2:3
1 | query {
2 |   fail
  |   ^
3 |   wait
Traceback (most recent call last):
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/graphql/execution/execute.py", line 528, in await_result
    return_type, field_nodes, info, path, await result
                                          ^^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/backend/api/graphql/extensions/extensions.py", line 30, in resolve
    result = await await_maybe(_next(root, info, *args, **kwargs))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/strawberry/utils/await_maybe.py", line 12, in await_maybe
    return await value
           ^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/strawberry/schema/schema_converter.py", line 682, in _async_resolver
    return await await_maybe(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/.venv/lib/python3.12/site-packages/strawberry/utils/await_maybe.py", line 12, in await_maybe
    return await value
           ^^^^^^^^^^^
  File "/Users/kkom/Repos/isometric/python/services/backend/backend/api/graphql/schemas/public.py", line 55, in fail
    raise Exception(f"'fail' resolver has failed ({datetime.now()})")
Exception: 'fail' resolver has failed (2024-03-19 21:08:12.088337)
'on_execute' end (2024-03-19 21:08:12.096968)
INFO:     127.0.0.1:58138 - "POST /graphql HTTP/1.1" 200 OK
'wait' resolver is about to return (2024-03-19 21:08:13.588281)
'wait' resolver end (2024-03-19 21:08:13.588422)

System Information

  • Strawberry version (if applicable): 0.220.0

Additional Context

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
Fund with Polar
@kkom kkom added the bug Something isn't working label Mar 19, 2024
@kkom kkom changed the title Uncaught exceptions lead to (i) incomplete responses and (ii) on_execute lifecycle hook completing before some resolvers Uncaught exceptions lead to (i) incomplete & non-deterministic responses and (ii) on_execute lifecycle hook completing before some resolvers Mar 19, 2024
@patrick91
Member

I think that's expected when fields aren't optional; if you make the failing field optional it should work as expected. Here's a test without the exception: https://play.strawberry.rocks/?gist=05181d8366f2584d390ccdac21d5d000

@kkom
Author

kkom commented Mar 20, 2024

Thanks!

I think that's expected when fields aren't optional

When you say this, do you mean according to the GraphQL spec or Strawberry's design decision?

@patrick91
Member

@kkom from the spec: https://spec.graphql.org/October2021/#sec-Handling-Field-Errors

Here's an example that better shows the behaviour of failures in nested fields: https://play.strawberry.rocks/?gist=b1ca6f1c33325331a76e36fb0d80ebfa

@kkom
Author

kkom commented Mar 20, 2024

Ah, I see it now – thanks! Yeah, the short-circuiting behaviour makes perfect sense :)

I tested it in anger in a lot of configurations and every behaviour agreed with the GraphQL spec.

PS: I only found some small non-determinism in the errors field. The speed with which coercible and incoercible errors occur affects the contents of errors. Say you have two fields that fail – one nullable and one not. If the nullable one fails first – you'll see two error entries. But if the non-nullable one fails first – we won't include the error for the nullable one. I don't think that's a huge deal though and doubt the spec worries about it.
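
For illustration, a minimal schema along these lines reproduces it (the field names, messages and delays below are made up):

from asyncio import sleep
from typing import Optional

import strawberry


@strawberry.type
class Query:
    # Nullable field: its error is coerced to null and recorded in "errors".
    @strawberry.field
    async def fail_nullable(self, delay: float = 0.1) -> Optional[str]:
        await sleep(delay)
        raise Exception("nullable field failed")

    # Non-null field: its error is incoercible, bubbles up and nulls "data".
    @strawberry.field
    async def fail_non_null(self, delay: float = 0.5) -> str:
        await sleep(delay)
        raise Exception("non-null field failed")


# Flip the delays to change which field fails first and compare the contents
# of the "errors" array in the two cases.
schema = strawberry.Schema(query=Query)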

@kkom
Author

kkom commented Mar 20, 2024

However, we still have the problem of on_execute and on_operation hooks completing before resolvers can finish. 😓

We use a custom Strawberry extension for context management. Basically the same as this – it opens and closes a SQLAlchemy session that can be used in the resolvers.

If a response is prematurely concluded because of an incoercible field error, the unfinished resolvers will run in the background. However, it's very likely they'll raise more errors – which we do see in our Datadog logs.

The errors are varied, but all of them are noise – they don't matter anymore, and they only happen because the lifecycle hooks have already torn down the context the field resolvers rely on...

Solutions I'm thinking of

  1. Strawberry could somehow instruct the Python event loop to abandon the no-longer-necessary tasks.
    • Not sure if this is possible.
    • But if it was possible, it would have resource usage benefits (though not latency). Marginal though.
  2. Strawberry could be configured to wait for all resolvers to finish, even the unneeded ones. Only then would it finish the on_execute and on_operation lifecycle hooks and return the response.
    • If this was a configuration option, we'd definitely turn it on. We'd increase the latency of failed responses, but they're failed anyway ¯\_(ツ)_/¯ and the benefit of clean error logs seems more important to us right now.
  3. Maybe there is a way to instruct the Datadog extension to ignore errors that happen in orphaned resolver execution. Not sure if this is easy or not.
    • Even if possible at one point, I feel like it could end up being an endless game of Whac-A-Mole...

What do you think? Any ideas appreciated! 🙏

@kkom
Author

kkom commented Mar 20, 2024

I guess there is theoretically a 4th option: the lifecycle hooks could also wait for the unneeded resolvers to finish.

I wonder though if the HTTP response concluding before the on_execute / on_operation lifecycle hooks finish would cause even more problems...

@kkom
Author

kkom commented Mar 20, 2024

Oh, I think there is a decent variation on option 4!

What if it was possible to configure a schema extension to wait for all resolvers to finish before concluding? Then we could selectively apply it to the extension where it matters (e.g. the one that manages our DB session).

Not sure how difficult it would be to implement and how much complexity it would add to the code, but I think it would solve all our problems. 🤞
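
To make the idea concrete, here's a minimal sketch of such an extension (the name is made up, it assumes one extension instance per operation, and it only waits for resolvers that have already started by the time execution ends – an illustration, not a built-in Strawberry option):

import asyncio
from collections.abc import AsyncGenerator, Callable

from graphql import GraphQLResolveInfo
from strawberry.extensions import SchemaExtension
from strawberry.utils.await_maybe import AwaitableOrValue, await_maybe


class WaitForResolversExtension(SchemaExtension):
    # Keeps the on_execute hook open until every resolver that has started has
    # finished, so that extensions further down the chain (e.g. one managing a
    # DB session) don't tear down state that orphaned resolvers still need.

    async def on_execute(self) -> AsyncGenerator[None, None]:
        self._in_flight = 0
        self._all_done = asyncio.Event()
        self._all_done.set()
        yield
        # At this point graphql-core may already have produced a result with
        # errors; wait for the resolvers still running in the background.
        await self._all_done.wait()

    async def resolve(
        self,
        _next: Callable[..., object],
        root: object,
        info: GraphQLResolveInfo,
        *args,
        **kwargs,
    ) -> AwaitableOrValue[object]:
        self._in_flight += 1
        self._all_done.clear()
        try:
            return await await_maybe(_next(root, info, *args, **kwargs))
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._all_done.set()

It would need to exit before the session-managing extension closes the session, so extension ordering matters here.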

@kkom kkom changed the title Uncaught exceptions lead to (i) incomplete & non-deterministic responses and (ii) on_execute lifecycle hook completing before some resolvers Uncaught exceptions lead to on_execute / on_operation lifecycle hooks completing before some resolvers Mar 20, 2024
@patrick91
Member

What if it was possible to configure a schema extension to wait for all resolvers to finish before concluding? Then we could selectively apply it to the extension where it matters (e.g. the one that manages our DB session).

Is there any particular reason why you're doing this in an extension instead of using your web framework's functionality?

In any case I think we could introduce a hook for when the request is done 😊 @nrbnlulu what do you think?

@nrbnlulu
Member

nrbnlulu commented Mar 21, 2024

What an endeavour @kkom 🙃!

So what I understand from this issue:

When a NonNull field fails, on_execute / on_operation exits BEFORE all the other resolvers are gathered.
This can be problematic for life-cycle resources because they might have been cleaned up by the exit hook.

I think this issue does not relate to extensions but to graphql-core here.
They don't cancel tasks if the execution is terminated, so I suggest opening an issue there.

BTW:

  • I don't know how it will behave with the new execution mechanism in graphql-core (since @defer / @stream support was added).

@kkom
Author

kkom commented Mar 22, 2024

Is there any particular reason why you're doing this on an extension and not using your web framework functionality?

That's a very good question! The context we prepare is highly dependent on whether the GraphQL operation is a query or a mutation.

For queries we instantiate a special DB client optimised for reads – without the overhead of explicit transactions, and with a separate session for each SQL query (to support concurrent reads). For mutations we instantiate a DB client that works off a single transaction for the entire operation.

from collections.abc import AsyncGenerator
from copy import copy
from typing import cast, override

from strawberry.extensions import SchemaExtension
from strawberry.types.graphql import OperationType

from backend.api.context import get_read_context, get_write_context
from backend.domain.context import ReadContext, RequestContext, WriteContext


class ContextManagerExtension(SchemaExtension):
    """
    Depending on whether it's a mutation or query, replace appropriate keys in the context. At the
    end, go back to the original context injected by FastAPI.

    Mutating context in this way is suggested in Strawberry docs, see:

    https://strawberry.rocks/docs/guides/custom-extensions#execution-context
    """

    # We are managing the context in the more narrow `on_execute` hook (which wraps just the
    # execution of the query/mutation, after parsing & validation have completed), instead of the
    # overarching `on_operation` hook, because we need the knowledge whether the operation is a
    # query or a mutation (so that we can prepare a "read" or "write" context).
    @override
    async def on_execute(self) -> AsyncGenerator[None, None]:
        # Initially, we have just RequestContext injected by FastAPI
        context = cast(RequestContext, self.execution_context.context)

        def mutate_context(new_context: RequestContext | ReadContext | WriteContext) -> None:
            # Clear existing keys
            keys = list(context.keys())
            for k in keys:
                del context[k]

            # Replace keys with new values
            for k, v in new_context.items():
                context[k] = v

        request_context = copy(context)

        # `self.execution_context.operation_type` relies on the parsing step having already run.
        # This is why this extension uses the `on_execute` lifecycle hook.
        match self.execution_context.operation_type:
            case OperationType.MUTATION:
                async with get_write_context(context) as write_context:
                    mutate_context(write_context)

                    try:
                        yield
                    finally:
                        mutate_context(request_context)

            case OperationType.QUERY:
                read_context = get_read_context(context)

                mutate_context(read_context)

                try:
                    yield
                finally:
                    mutate_context(request_context)

            case OperationType.SUBSCRIPTION:
                raise NotImplementedError()

from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from typing import Annotated

from fastapi import Depends

# (project-specific imports for RequestContext, ReadContext, WriteContext,
# get_request_context, the DB clients and the build_* helpers are omitted)


@asynccontextmanager
async def get_write_context(
    request_context: Annotated[RequestContext, Depends(get_request_context)],
) -> AsyncGenerator[WriteContext, None]:
    async with request_context["session_maker"]() as session:
        db_client = ReadWriteDbClient(session)

        actions = build_actions(
            db_client,
            request_context["authentication_context"],
            request_context["clients"]["event_logger"],
        )

        adhoc_data_loaders = build_adhoc_data_loaders(db_client)

        queries = build_queries(
            db_client,
            request_context["authentication_context"],
            request_context["clients"]["blob_store"],
            actions,
        )

        services = build_services(
            db_client,
            request_context["authentication_context"],
            request_context["clients"],
        )

        yield {
            "actions": actions,
            "authentication_context": request_context["authentication_context"],
            "clients": request_context["clients"],
            "adhoc_data_loaders": adhoc_data_loaders,
            "queries": queries,
            "services": services,
            "db_client": db_client,
        }

        await session.commit()


def get_read_context(
    request_context: Annotated[RequestContext, Depends(get_request_context)],
) -> ReadContext:
    db_client = ReadOnlyDbClient(request_context["read_only_session_maker"])

    actions = build_actions(
        db_client,
        request_context["authentication_context"],
        request_context["clients"]["event_logger"],
    )

    adhoc_data_loaders = build_adhoc_data_loaders(db_client)

    queries = build_queries(
        db_client,
        request_context["authentication_context"],
        request_context["clients"]["blob_store"],
        actions,
    )

    return {
        "actions": actions,
        "authentication_context": request_context["authentication_context"],
        "clients": request_context["clients"],
        "adhoc_data_loaders": adhoc_data_loaders,
        "queries": queries,
        "db_client": db_client,
    }

That's why we use the on_execute hook – to leverage self.execution_context.operation_type already prepared by Strawberry. It is quite elaborate, but the performance benefits of using different DB access patterns are huge.

But thanks for the suggestion – maybe there is a decent way to leverage the web framework for it! We could use a heuristic on the body of the request to distinguish queries from mutations before Strawberry kicks in. Or always construct both kinds of contexts (though this may be inefficient).
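
For example, a rough sketch of that heuristic using graphql-core's parser (the function name is made up, and it re-parses the document that Strawberry will parse again later):

from graphql import OperationDefinitionNode, OperationType, parse


def is_mutation_request(document: str, operation_name: str | None = None) -> bool:
    # Parse the raw GraphQL document from the HTTP body and inspect the
    # operation type of the selected (or first) operation definition.
    ast = parse(document)
    for definition in ast.definitions:
        if isinstance(definition, OperationDefinitionNode):
            if operation_name is None or (
                definition.name is not None and definition.name.value == operation_name
            ):
                return definition.operation is OperationType.MUTATION
    return False

A FastAPI dependency could call this on the request body and decide up front whether to build the read or the write context.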

I think this issue does not relate to extensions but to graphql-core here.
They don't cancel tasks if the execution is terminated, so I suggest opening an issue there.

Thanks so much for the pointer! I'll try to report it there – feels like fixing this should be valuable whichever way we go! :)

@kkom
Author

kkom commented Mar 22, 2024

Ok, reported this to graphql-core: graphql-python/graphql-core#217 - thanks for the suggestion @nrbnlulu !

@patrick91 – I'll play with using FastAPI for it, but I've realised that there may be a problem. If these tasks are somehow abandoned on the event loop, I'm not sure if even FastAPI would be able to wrap around them. Remember that the HTTP response is returned by that time...

@kkom
Author

kkom commented Mar 23, 2024

A little update from graphql-core - it was acknowledged as something the maintainer did want to address as well. :) graphql-python/graphql-core#217 (comment)

So the issue is definitely real – but it's not yet clear when it'll be addressed.
