
Riemann becomes unresponsive when CPU count is increased #1032

Open
vipinvkmenon opened this issue Nov 6, 2022 · 9 comments

Comments

vipinvkmenon commented Nov 6, 2022

Describe the bug
Currently we are on a machine with 48 CPUs and 192 GB of RAM (AWS m5.12xlarge). We've had an almost 2x increase in the number of metrics in the system, so the VMs are usually at 100% CPU with a system load between 80 and 100; the instance is extremely loaded. This causes our Riemann instance to fail occasionally with a high stream and netty queue size.
Naturally we thought of resizing to a larger VM. We decided to go with a 64 CPU / 256 GB machine (m5.18x).

However, the moment we resize the VM, Riemann becomes unresponsive within a few minutes of starting up. We see the 'riemann executor stream-processor queue size' shoot up to 5~10k before no metrics are seen anymore. Running top on the VM shows CPU at around 6200% and system load close to 110.

The metrics are forwarded to a downstream InfluxDB (its CPU and memory are around 30%~40%).

Bringing the VM back to the original size solves it (as in, Riemann is responsive again, though unstable at 100% because of the load).

We haven't updated Riemann, so it shouldn't be an update issue. We are using JDK 13.

On the Java side we've configured the heap to 75% via the MaxRAM value, and the GC is +ParallelGC. (These settings had been working perfectly for us until this load increase.)
Expected behavior
Riemann works on the larger VM and handles the higher load.
Background (please complete the following information):

  • OS: linux/ubuntu
  • Java/JVM version 13
sanel (Contributor) commented Nov 7, 2022

If I got you correctly: at 192 GB, Riemann used 100% of CPU, but when you resized the box to 256 GB, Riemann had a slow startup and extreme CPU usage? This sounds like a GC issue. Try tuning the GC; look at -XX:MaxGCPauseMillis, -XX:GCTimeRatio, and similar options.

If GC turns out to be the problem, try switching to ZGC. However, be aware it needs considerably more heap than ParallelGC.
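
For example, a command-line sketch of flags to experiment with (the flag values, heap sizes, and paths here are illustrative, not recommendations for your workload):

# Illustrative only; tune values for your own workload.
java -XX:+UseParallelGC -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=19 -jar riemann.jar riemann.config

# Or try ZGC (still behind the experimental flag on JDK 13; expect higher heap usage):
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xms64g -Xmx64g -jar riemann.jar riemann.config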

Also, try splitting the load between multiple Riemann instances, as explained in #1003 (comment). I doubt you'll be able to scale it vertically indefinitely. Edit: I just noticed in that thread that you mentioned you already have a similar setup :)

vipinvkmenon (Author) commented:

Some updates on what we tried afterwards:

  • On the 48 CPU / 192 GB VM, CPU ranges between 75~100%. When load is lower it sometimes even drops to 50. (Memory consumption, by the way, is decent: only about 70 GB.)
  • On the larger 64 CPU / 256 GB machine, things get out of hand and CPU usage is a complete disaster.
  • Interestingly, however, on a smaller 48 CPU / 96 GB machine (memory consumption was about 40 GB), CPU still ranged between 70~100%. (c5.12xlarge AWS instance, compute optimized.)
  • In short, memory is fine in both cases but CPU is a disaster. (I'm not surprised that Riemann is CPU intensive, but this is a bit too high.)

We tried the ZGC algorithm: on the 96 GB machine the heap quickly rose to 75 GB, but CPU just got stuck at 100%.

What I've noticed is that if we use ParallelGC (or don't set any GC algorithm at all), Riemann is at least usable, ranging between 70 and 100%.

I’ll probably set the GC flags and see.

What could be the reason for CPU to sit at 100% even when I increase the CPU core count from 48 to 64, though? 🤔

vipinvkmenon (Author) commented Nov 8, 2022

On that note, if it really is GC (which is probably the reason), one more question that strikes me is why the JVM is limiting the heap size here despite having a large amount of headroom. 🤔

sanel (Contributor) commented Nov 8, 2022

On that note, if it really is GC (which is probably the reason), one more question that strikes me is why the JVM is limiting the heap size here despite having a large amount of headroom

Unless you explicitly set Xmx, ParallelGC will use 1/4 of physical memory by default. AFAIK, the usual practice is to set Xms == Xmx to get predictable behavior; if they differ, the JVM will try to find an optimal balance. Probably the best way is to check the official tuning guide.
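
For example (the heap size and paths here are purely illustrative):

java -XX:+UseParallelGC -Xms96g -Xmx96g -jar riemann.jar riemann.config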

What could be the reason for CPU to sit at 100% even when I increase the CPU core count from 48 to 64, though?

You mentioned in the first post that you are already monitoring the stream and netty queue size; try collecting JMX metrics from Riemann or attaching tools like VisualVM. There you'll see how much GC is in use. I think ParallelGC will use N or N-1 threads (where N == number of cores) by default. To be sure, check it with:

java -XX:+PrintFlagsFinal -version 2>&1 | grep ParallelGCThreads
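
If the GC thread count does scale up with the new core count, you can also pin it explicitly (the value and paths below are illustrative):

java -XX:+UseParallelGC -XX:ParallelGCThreads=16 -jar riemann.jar riemann.config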

Since your VM has already been at 100% CPU usage for a while, I'd review the Riemann configuration and streaming rules first. Heavy use of by or rollup, putting everything in the index with large/infinite TTLs, or reinjecting events frequently will all contribute to heavy memory/GC/CPU usage and slow streams. If you absolutely must do these things, try to split between e.g. 2-3 Riemann nodes and have one node dedicated to slow/heavy streams only.
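
A minimal sketch of that kind of split, based on the forwarding pattern from the Riemann howto (the host name, tag, and queue size are assumptions, not something from your setup):

;; Hypothetical: ship events tagged by heavy senders to a dedicated Riemann node.
(def heavy-node
  (async-queue! :heavy {:queue-size 1000}
    (forward
      (riemann.client/tcp-client :host "riemann-heavy.internal"))))

(let [index (index)]
  (streams
    (where (tagged "heavy-sender")
      heavy-node
      (else
        index))))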

vipinvkmenon (Author) commented:

Since the Riemann rules are maintained by several teams and there are hundreds of services that rely on this stack, items like by are used heavily.
We will check the impact of the Xmx values, and most likely look into how we can split.
One approach is to split based on how the metrics come in (tcp-server vs graphite-server); the other is to give separate nodes to the services that send large amounts of metrics.

sanel (Contributor) commented Nov 13, 2022

In that case, I'm not sure how this can be considered a bug on Riemann's side; it is hard to reproduce without the exact setup you are using.

pyr (Contributor) commented Nov 13, 2022

I agree with @sanel here. One thing that could help is moving up in JDK versions and trying the newer GC options.

@vipinvkmenon changed the title from "Stream" to "Riemann becomes unresponsive when CPU count is increased" on Dec 24, 2022
vipinvkmenon (Author) commented Dec 24, 2022

Updates:
The run parameters are as follows:

    # Max heap = 75% of total RAM (in MB); initial heap = half of that.
    MAX_HEAP=$(awk '/MemTotal/ { printf "%.0f", $2/1024*0.75 }' /proc/meminfo)
    HALF_MAX_HEAP=$(expr "$MAX_HEAP" / 2)
    exec chpst -u vcap:vcap java \
      -XX:+UseParallelGC \
      -XX:+ExitOnOutOfMemoryError \
      -Xms"$HALF_MAX_HEAP"m \
      -Xmx"$MAX_HEAP"m \
      -Djava.io.tmpdir="$TMP_DIR" \
      -jar /var/vcap/packages/riemann/riemann.jar \
      "$CONFIG_DIR"/clojure/riemann.config \
      1>>"$LOG_DIR"/"$JOB_NAME".stdout.log \
      2>>"$LOG_DIR"/"$JOB_NAME".stderr.log &

JVM

./jcmd "$(ps -C java -o pid=)" VM.version
37825:
OpenJDK 64-Bit Server VM version 17.0.3.0.1+7-LTS
JDK 17.0.3

Currently, we are on an m5.12xlarge machine. The system is stable, with CPU around 30%~40% generally and 50%~60% during busy hours. On our 192 GB machine, memory consumption is between 45 GB and 70 GB (less than 50% even at peak). So the current situation is not a concern.

Initially, Riemann was on JVM 8 and used CMS:

      -XX:+CMSParallelRemarkEnabled \
      -XX:+UseCompressedOops \
      -XX:+CMSClassUnloadingEnabled \

However, on moving to JDK 13, G1GC didn't seem to help: it was unstable, memory kept accumulating over a few days, and it kept crashing. ZGC didn't seem to help either (nor did JDK 17 after that). ParallelGC as a throughput collector was the most stable. (We will probably revisit whether the GC can be changed to a better option than a stop-the-world collector.)

However, from the stack traces, it didn't seem to be GC. Even when switching to -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC (a no-op GC), CPU was at 100% on the big machine.

What is concerning is that, even with the new JVM, the moment we upgrade to a larger VM (m5.16xlarge [64 CPU / 256 GB] or c5n.18xlarge [72 CPU / 192 GB]), Riemann starts to become unresponsive and CPU stays at 100%, with load touching 120~150.

[screenshot: CPU usage over time across the VM resizes]

In the screenshot, the VM was recreated to try the bigger size. The yellow and green parts of the graph are the smaller 48 CPU machines; the blue region is when we tried the higher-configuration 72 CPU machine.

So this high-CPU issue only exists on the big VMs (72 or 64 CPU). Switching back to the smaller (48 CPU) machine brings everything back to normal and stable.

The slight tapering towards the end of the blue region is when we tried removing rules from riemann.config to see their effect. While that did reduce CPU a bit, the application still hovered around 100%. We also tried removing the downstream sending of metrics to InfluxDB, assuming it might be a bottleneck; again, it reduced CPU a bit but it still sat around 100%.

I am attaching the two thread dumps:
thread_dump1

thread_dump2

Seems like most threads are parked.

sanel (Contributor) commented Dec 24, 2022

Sadly, it is still hard to figure out what is happening without seeing the full riemann.config and accompanying files. Of course, seeing a lot of waiting threads means something in the pipeline needs to be corrected, but from the thread dumps I don't see anything particularly related to Riemann itself (though I might have missed it).

I'd try these things in the given order:

  1. Try with as minimal a riemann.config as possible, e.g. (streams index). Measure it. I expect Riemann not to use 100% of CPU here; if it does, something in the JVM settings might need to be fixed.
  2. Add direct InfluxDB storage, e.g. (streams index influx). Influx can be a bottleneck if it is not configured properly. Check whether Influx is running CQs or heavy queries while Riemann is writing to it; there is plenty of space to optimize here. Depending on how your streams are organized, Riemann might wait until Influx has written all the metrics it receives.
  3. Add batched InfluxDB storage. (batch) is a beast of its own, so again: try, measure, rinse and repeat. (batch) should reduce pressure on Influx, hopefully improving overall CPU usage (see the sketch after this list).
  4. Add your first custom Riemann stream. Measure. Proceed with the next custom Riemann streaming function, and so on.
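
A minimal sketch of steps 2 and 3 (the InfluxDB host and database names are assumptions):

(def influx (influxdb {:host "influxdb.internal"
                       :db   "riemann"}))

(let [index (index)]
  (streams
    index
    ;; step 2: direct writes -- every event is sent to Influx as it arrives
    ;; influx
    ;; step 3: batched writes -- flush after 100 events or every 1/10 s, whichever comes first
    (batch 100 1/10 influx)))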

I like the riemann-configuration-example from @mcorbin, and I find it easy to scale and debug. Try reorganizing riemann.config into something like this, if possible:

(let [index (index)]
  (streams
    index
    influx
    cpu-check-stream
    mem-check-stream
    ;; disk-check-stream     ;; disk stream is disabled
    custom-stream))

By simply commenting out or disabling specific streams, you can quickly see how Riemann behaves.
