Performance Degradation Under Load From Fragments (Blueprint Walk Churns Memory?) #1352

Open
jeffutter opened this issue Jan 30, 2025 · 4 comments

Comments

@jeffutter
Contributor

jeffutter commented Jan 30, 2025

Summary

My team and I have been struggling to track down a GraphQL-related performance 'regression' for the past couple of weeks. We have two versions of a query that are effectively identical: one has more of its fragments as spread/named fragments, the other has more of them as inline fragments. The text documents for both are > 200kb and the Blueprints are very large (> 5MB from :erts_debug.size, IIRC).

The version with more inline fragments has a slightly smaller text size and blueprint size.

Under heavy load the version with the in-lined fragments has 2-5x worse p95 latency and uses ~2x the CPU.

[Image: p95 latency and CPU comparison under load]

I'm hoping folks can read this explanation and see if my hypothesis makes sense. I'm making some logical jumps that I haven't been able to specifically measure and substantiate yet.

Details & Background

  • The two versions of the query are:
    • A.) Handwritten by our client's devs (referred to as the "Good" query from here on out)
    • B.) That same query passed through Apollo Router with only one subgraph behind the router, essentially Apollo doing the optimizations it thinks are appropriate (referred to as the "Bad" query from here on out, because who likes slow p95s?)
  • Both versions
    • Have a lot of fragments
    • Often have @skip or @include directives on the fragments
    • Return the exact same data
    • Seem to execute the same as far as resolvers go
  • The "Bad" query has
    • Smaller byte size
    • Smaller blueprint size
    • Many of the fragments that were only used once were converted to inline fragments (by Apollo Router)
  • We've seen similar poor performance with a 3rd variation of the query where all fragments were converted to named fragments
  • We have completely removed Apollo Router from the equation by dumping the queries it makes and injecting them into our load test framework
    • Now the load tests make the same requests the router would make as if the router were in place
    • This removes any theory about it being some kind of connection pooling / load balancing issue with Apollo Router
  • In any kind of isolated testing the "Bad" query actually performs faster (It's slightly smaller, why wouldn't it?)
    • We've tried looking at the following
      • Traces in Datadog
      • eprof profiles
      • cprof profiles
      • flamegraphs
    • In all of these the "Bad" query actually looks (as one might expect) better

The only difference

The only inkling of a difference I can find between the two queries is this:

I instrumented the application with the Datadog native continuous profiler and compared two load tests (left "Good", right "Bad"). The time spent doing Elixir work is nearly identical between the two, but on the "Bad" side we see this big blob of time spent in... garbage collection.

[Image: continuous profiler comparison, with a large garbage-collection block on the "Bad" side]

When we look at the comparison table it's even more obvious.

[Image: profiler comparison table]

For some reason the "Bad" query is causing more churn in memory and thus the GC is burning more CPU cycles, taking those cycles away from doing actual work and leading to increased CPU and p95 latency.

Hypothesis

Why would inline fragments vs spread fragments make any difference to memory usage? One wouldn't think they would, especially when the fragments being inlined were only used once to begin with (so we're not duplicating them inline in different objects).

My hypothesis, while rather unsatisfactory, is that something about the overall shape of the blueprint crosses a tipping point in how the BEAM manages memory. Other, larger queries with or without fragments might fall into this same pitfall. I suspect it has something to do with the width or depth of maps or the size of lists and how walk traverses and updates them. I suspect that since walk maps over the list of children and re-sets it on the map, even if no data is changing, this causes the BEAM to copy the data and GC the old copies.
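
To make that concrete, here's a minimal sketch of the pattern I have in mind (my own illustration, not Absinthe's actual walk code), where even an identity traversal rebuilds every list and parent map it touches:

defmodule WalkChurn do
  # Identity "walk": visit every child and change nothing.
  def walk(%{selections: children} = node) when is_list(children) do
    # Enum.map always builds a fresh list of cons cells, even when every
    # element comes back unchanged...
    new_children = Enum.map(children, &walk/1)

    # ...and the map update produces a new map term for the parent node,
    # so every level of the tree is reallocated on each pass.
    %{node | selections: new_children}
  end

  def walk(node), do: node
end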

Supporting Findings

In trying to figure out why inline vs spread fragments would behave any differently here, I 'instrumented' the run_phase function of Absinthe.Pipeline as follows:

diff --git a/lib/absinthe/pipeline.ex b/lib/absinthe/pipeline.ex
index 16c7d1c2..26d91bef 100644
--- a/lib/absinthe/pipeline.ex
+++ b/lib/absinthe/pipeline.ex
@@ -405,7 +405,19 @@ defmodule Absinthe.Pipeline do
   def run_phase([phase_config | todo] = all_phases, input, done) do
     {phase, options} = phase_invocation(phase_config)
 
-    case phase.run(input, options) do
+    :erlang.garbage_collect()
+    {:memory, before} = :erlang.process_info(self(), :memory)
+
+    res = phase.run(input, options)
+    {:memory, aft} = :erlang.process_info(self(), :memory)
+
+    :erlang.garbage_collect()
+
+    {:memory, aft_gc} = :erlang.process_info(self(), :memory)
+
+    IO.puts("Phase: #{phase} | Allocated: #{aft - before} | Churn: #{aft - aft_gc}")
+
+    case res do
       {:record_phases, result, fun} ->
         result = fun.(result, all_phases)
         run_phase(todo, result, [phase | done])

The idea here is to measure memory used by each phase by doing the following:

  • Force a GC
  • Measure memory used in the process (MemA)
  • Run the phase
  • Measure memory used in the process (MemB)
  • Run a GC
  • Measure memory used in the process (MemC)

This tells us two things:

  • MemB - MemA tells us how much the memory grew while running that pipeline step
  • MemB - MemC tells us how much of that growth is garbage
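
For example (made-up numbers, just to show the arithmetic): if MemA is 10 MB, MemB is 50 MB, and MemC is 12 MB, the phase grew the process heap by 40 MB, of which 38 MB was garbage reclaimed by the forced GC.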

I then ran this against a sample project I created (working on the red tape to publish a public repo for this).
The sample project allows me to build schemas on the fly for an interface with N implementers and then create queries for those with N fragments, either spread/named or inline.

Given the same number of fragments I see a difference like this:

[Image: per-phase Allocated/Churn output, inline version (left) vs spread version (right)]

You can see the inline version on the left has far more allocation and churn than the spread version on the right. The spread version seems to allocate nothing for many of the phases.

What gives?

This is the part I'm having difficulty answering. I dug into some of the more unassuming-looking phases like Phase.Document.Arguments.CoerceLists and saw that they should essentially do nothing given my query, yet processing them seems to allocate memory. This led me to the walk implementation that I highlighted under "Hypothesis".

The final leap I'm making, which I haven't been able to substantiate, is that the inline ("Bad") version's blueprint depth or width is crossing some threshold that causes the reduce over the children and the reassignment to the map to copy. I wonder if it's related to how small vs big maps are handled?

Avoid having more than 32 elements in the map. As soon as there are more than 32 elements in the map, it will require more memory and keys can no longer be shared with other instances of the map.

I'm not even sure there are any large maps involved here, though; it could potentially have something to do with updating the lists of children.
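
As a rough illustration of that threshold (a quick iex experiment of my own, nothing taken from the blueprint itself):

# Small maps (up to 32 keys) use a compact "flat" representation whose keys
# can be shared between instances; at 33 keys the map switches to a hashed
# representation and each instance costs more words.
small = Map.new(1..32, fn i -> {i, i} end)
large = Map.new(1..33, fn i -> {i, i} end)

:erts_debug.size(small)  # words used by the 32-key map
:erts_debug.size(large)  # the 33-key map is noticeably larger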

In conclusion

As you can see, the last bit of my theory depends on a fuzzy and partial understanding of how the BEAM manages memory, particularly when updating maps. If my hypothesis is accurate, I would love to discuss how this could be improved. I've pondered this a bit and quite frankly don't have many good ideas. It doesn't seem feasible that Absinthe could avoid walking the blueprint and reducing/mapping the children as it goes.

Are there any other completely different avenues I should approach this from?

Are there any good techniques for gathering more information about this specific issue?

@benwilson512
Contributor

Hey Jeff,

Thanks for the very thorough issue and investigation. For the sake of expediting our discussion of this, would you be able to provide a pair of queries and perhaps a Benchee script that illustrates this difference? As I dig in I just want to make sure that we're observing the same thing.

@jeffutter
Contributor Author

jeffutter commented Jan 30, 2025

Hi @benwilson512, thanks for the quick response.

While I can't share our actual queries publicly (they are so massive it's really difficult to learn anything from them anyway), I can share the repo I've been using to drive the investigation and isolate the issue: https://github.com/jeffutter/absinthe_test_fragment_performance

It isn't exactly set up perfectly to illustrate my hypothesis, since that requires forking and instrumenting Absinthe, but hopefully it can give you an idea of what I'm doing.

In that repo you can generate one of the schemas I'm using by doing something like this: mix profile dump_schema 120

Here's a trimmed-down version with 5 fragments:

defmodule FragmentBench.Schema.F do
  use Absinthe.Schema

  interface(:thing) do
    field(:id, :id)
    resolve_type(fn _, _ -> nil end)
  end

  object(:a) do
    field(:id, :id)
    interface(:thing)
  end

  object(:b) do
    field(:id, :id)
    interface(:thing)
  end

  object(:c) do
    field(:id, :id)
    interface(:thing)
  end

  object(:d) do
    field(:id, :id)
    interface(:thing)
  end

  object(:e) do
    field(:id, :id)
    interface(:thing)
  end

  query do
    field(:things, list_of(:thing)) do
      resolve(fn _, _ -> {:ok, []} end)
    end
  end

  def query_skip_spread do
    AbsintheTestFragmentPerformance.SchemaBuilder.query_skip_spread(5, 0)
  end

  def query_include_spread do
    AbsintheTestFragmentPerformance.SchemaBuilder.query_include_spread(5, 0)
  end

  def query_skip_inline do
    AbsintheTestFragmentPerformance.SchemaBuilder.query_skip_inline(5, 0)
  end

  def query_include_inline do
    AbsintheTestFragmentPerformance.SchemaBuilder.query_include_inline(5, 0)
  end
end

Likewise, you can generate the queries with mix profile dump_query skip_spread 120 or mix profile dump_query skip_inline 120

Here are the two versions with 5 fragments (the spread version first, then the inline version):

query ($skip: Boolean!) {
  things {
    id
    __typename
    ... AFragment @skip(if: $skip)
    ... BFragment @skip(if: $skip)
    ... CFragment @skip(if: $skip)
    ... DFragment @skip(if: $skip)
    ... EFragment @skip(if: $skip)
  }
}
fragment AFragment on A { id __typename }
fragment BFragment on B { id __typename }
fragment CFragment on C { id __typename }
fragment DFragment on D { id __typename }
fragment EFragment on E { id __typename }

query ($skip: Boolean!) {
  things {
    id
    __typename
    ... on A @skip(if: $skip) { id __typename }
    ... on B @skip(if: $skip) { id __typename }
    ... on C @skip(if: $skip) { id __typename }
    ... on D @skip(if: $skip) { id __typename }
    ... on E @skip(if: $skip) { id __typename }
  }
}

The hypothesis I laid out can be seen with setups with larger numbers of fragments and flipping between skip_spread and skip_inline.

Any kind of benchmarking, and even the profiles, shows the "inline" versions as being faster: taking less time and using fewer function calls. Even looking at memory USAGE it looks like "inline" is more efficient, but what I think all of those methods fail to show is the garbage generated by executing the document that must be GC'd afterwards.
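
For what it's worth, a Benchee script along the lines you asked for might look roughly like this, using the schema and query helpers from the sample repo (I'm assuming here that the query_* helpers return the generated document strings):

schema = FragmentBench.Schema.F
inline = FragmentBench.Schema.F.query_skip_inline()
spread = FragmentBench.Schema.F.query_skip_spread()

Benchee.run(
  %{
    "inline fragments" => fn ->
      Absinthe.run(inline, schema, variables: %{"skip" => false})
    end,
    "spread fragments" => fn ->
      Absinthe.run(spread, schema, variables: %{"skip" => false})
    end
  },
  time: 5,
  # memory_time enables Benchee's memory measurement for each scenario
  memory_time: 2
)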

Please let me know if there's a different way you'd like me to present this and I can try to whip something up.

@benwilson512
Contributor

Incredible, this will be a huge help. I hope to work on this this weekend. Please feel free to ping me Monday if that hasn't happened!

@jeffutter
Contributor Author

Just wanted to post a little update here with some further investigation today. Unfortunately, this investigation doesn't support my hypothesis.

1.) I instrumented the Absinthe app with metrics to count the number of GCs and the number of bytes GC'd across the whole VM via :erlang.statistics(:garbage_collection). To be consistent with my hypothesis, I would expect the "Bad" version to GC considerably more bytes. Instead I see the following (Good - left, Bad - right):

[Image: VM-wide GC count and bytes GC'd during the load test, Good (left) vs Bad (right)]

The "Bad" one incurs slightly fewer GCs and slightly fewer bytes GC'd.

2.) I updated the profiler tool I shared with GC profiles. Rather than checking GC metrics between each phase in Absinthe, it uses the :trace module to watch GC events (similar to how Benchee does its memory profiling) during one whole execution of the blueprint. When comparing the Good (mostly spread) and Bad (mostly inline) generated schemas, the memory usage/allocations/GCs are nearly identical, with the inline version being slightly better:

❯ mix profile skip_inline gc 10 1
Profile Start
Profile Done
%Mix.Tasks.Profile.GCStats{
  word_size: 8,
  memory: 700456,
  minor_gc_count: 21,
  major_gc_count: 2,
  bin_vheap_size: 2888,
  total_freed: 700456,
  minor_freed: 603536,
  major_freed: 96920
}

❯ mix profile skip_spread gc 10 1
Profile Start
Profile Done
%Mix.Tasks.Profile.GCStats{
  word_size: 8,
  memory: 741440,
  minor_gc_count: 22,
  major_gc_count: 1,
  bin_vheap_size: 2888,
  total_freed: 741440,
  minor_freed: 655528,
  major_freed: 85912
}
❯ mix profile skip_inline gc 150 1
Profile Start
Profile Done
%Mix.Tasks.Profile.GCStats{
  word_size: 8,
  memory: 9089592,
  minor_gc_count: 60,
  major_gc_count: 2,
  bin_vheap_size: 34248,
  total_freed: 9089592,
  minor_freed: 7952536,
  major_freed: 1137056
}

❯ mix profile skip_spread gc 150 1
Profile Start
Profile Done
%Mix.Tasks.Profile.GCStats{
  word_size: 8,
  memory: 9656384,
  minor_gc_count: 47,
  major_gc_count: 2,
  bin_vheap_size: 34248,
  total_freed: 9656384,
  minor_freed: 8436080,
  major_freed: 1220304
}
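
For reference, the GC-event tracing in the profiler is shaped roughly like this (a simplified sketch, not the exact code from the repo):

defmodule GCTrace do
  # Run `fun` in a fresh process with :garbage_collection tracing enabled,
  # then count the GC trace events delivered to this (tracer) process.
  def run(fun) do
    parent = self()

    pid =
      spawn(fn ->
        receive do
          :go -> :ok
        end

        fun.()
        send(parent, :done)
      end)

    :erlang.trace(pid, true, [:garbage_collection])
    send(pid, :go)

    receive do
      :done -> :ok
    end

    count_gcs(%{minor: 0, major: 0})
  end

  defp count_gcs(acc) do
    receive do
      {:trace, _pid, :gc_minor_end, _info} -> count_gcs(%{acc | minor: acc.minor + 1})
      {:trace, _pid, :gc_major_end, _info} -> count_gcs(%{acc | major: acc.major + 1})
      {:trace, _pid, _event, _info} -> count_gcs(acc)
    after
      0 -> acc
    end
  end
end

# Freed words per collection can be derived by pairing the heap_size entries
# in the matching :gc_*_start / :gc_*_end info lists.
# e.g. GCTrace.run(fn -> Absinthe.run(query, FragmentBench.Schema.F) end)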

I still feel like there's something going on with memory usage or GCs but I can't quite pinpoint it. For the profile, it's possible it's not exercising the code in the same way as our real queries, as those have various nested fragments. But the overall VM GC stats during a load test seem to refute my original idea.

This somewhat puts me back to square one. Two semantically similar queries perform drastically differently under load, and that's about all I have to go on. If you have any suggestions on other approaches to debug this, I'd greatly appreciate it.

I may try to see if I can share examples of our real queries, if I can manage to obfuscate the fields enough. They might be too large/unwieldy to compare, but maybe someone more familiar with Absinthe internals would know what types of things to look for.
