[BUG] Condition where the bayes_expiry loop fails because the memory is full #4921
Comments
@japc check how many elements you have in learned_ids; this is almost the only key without a TTL set. Previously there was a bug where it didn't rotate at all. Now it should be split by 10k elements into 5 keys, which look like learned_ids, learned_idsX, learned_idsXX, learned_idsXXX, etc.
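The rotation scheme described can be sketched as a toy model. This is only an illustration of the naming convention (10k elements per bucket, one extra X suffix per bucket), not rspamd's actual implementation; the bucket size constant and helper name are assumptions for the example:

```python
# Toy model of the learned_ids key rotation described above: ids are split
# into buckets of 10k elements named learned_ids, learned_idsX, learned_idsXX, ...
# Illustration only; this is NOT rspamd's actual code.
BUCKET_SIZE = 10_000

def bucket_key(total_learned: int, base: str = "learned_ids") -> str:
    """Name of the key the next learned id would land in."""
    return base + "X" * (total_learned // BUCKET_SIZE)

# 5 full buckets (50k ids) plus 7k more puts new writes in learned_idsXXXXX,
# matching the keyspace reported in the next comment.
print(bucket_key(57_000))  # learned_idsXXXXX
```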
@dragoangel Have learned_ids, learned_idsX, ..., learned_idsXXXX, with 10k elements each, and learned_idsXXXXX with 7k and growing:
Well, that looks fine. I honestly don't understand how you got into this state. I don't know how "stably" you can reproduce this issue, and it would be nice to know, but you need to report it to Redis instead of Rspamd 😉. As for your Redis version, before reporting any issues I would advise you to update to the latest one (7.2.4) and check if the issue is still in place. I assume that after moving to 7.2.4 you will stop getting such OOMs.
The ball is probably in Redis's court. I was thinking of some weird interplay of low memory or OOM on Redis with the Lua script eval (following some of the considerations in "Execution under low memory conditions" here: https://redis.io/docs/latest/develop/interact/programmability/eval-intro/). I set up a small lab, with the same versions and configurations of everything and a 40MB Redis memory limit, and COULD NOT replicate the problem. I tried first with 10MB and the error happened, but that could be because it was too small for proper use. In production the issue occurred 2 or 3 times in the last couple of months. I will try to see if it happens as soon as the memory fills, or if it works as it should (Redis evicting and remaining usable) for a while and then enters the OOM state.
Well, 2 or 3 times in months is not an issue; it will run again later and succeed.
That's 2 or 3 times where it started failing altogether and we had to recover with the workaround I mentioned, not 2 or 3 failed bayes_expiry steps between successes.
Why not consider updating to 7.2?
It is planned, but that and trying to understand why it is happening are not mutually exclusive, especially because I'm not sure the upgrade to 7.2 will change the behaviour. It's happening right now. It started at 04h18 for the Redis learn, and some minutes later for the bayes_expiry:
Redis seems (from the logs) to keep accepting writes until about 06h31. Any suggestions for getting some more debug info before I fix it?
Are you sure your configuration is actually being loaded?
Pretty sure:
I don't know much about how aggressive the Redis evictions are; the Redis memory does seem to go down a few MBs from time to time. But both the rspamd_redis_learned and bayes_expiry steps have been consistently failing for hours.
@japc are you using a separate Redis for bayes, or a shared Redis that serves other modules as well? If it's shared, try moving bayes to a dedicated Redis, or switch to a single DB for all modules if that's not already the case; Redis eviction behaves strangely on multi-DB instances with volatile-ttl. And I don't understand why you don't try updating to a fresher version; Rspamd works well with 7.2. If you're worried about moving to a new version, back up the RDB and update; you can always revert to the older version from the RDB backup.
It's a dedicated instance: we have one for default and history (6379), one for fuzzy (6377), one for classifier_bayes (6378), one for neural (6381) and one for reputation (6380). As for the Redis upgrade, it is scheduled, plus the other considerations I've replied with before.
Upgraded redis to 7.2.4:
After the Redis restart the memory went down a bit, but maxed out again soon:
To the same effect: both learn and expire are failing with OOM:
This explains a bit:
I have a theory, see if it makes any sense: with autolearn enabled, in our scenario the token set rate exceeds the expiry rate, to the point where we end up with all keys having no TTL configured and Redis unable to free memory. A simple info keyspace; sleep 60; info keyspace shows a large increase in keys vs expires, which seems to support that theory:
If that is the case, having some kind of expire on the cache_learn would help.
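The imbalance in that theory can be sketched numerically. This is a toy model, not rspamd code, and the rates are made-up examples (the 1000-per-minute stamping rate echoes a figure quoted later in the thread): tokens are written with no TTL, the expiry pass stamps TTLs at a fixed rate, and under volatile-ttl any key still without a TTL can never be evicted, so the backlog grows without bound once writes outpace stamping:

```python
def no_ttl_backlog(write_rate: int, ttl_rate: int, minutes: int) -> int:
    """Count of keys created without a TTL that the expiry pass hasn't stamped yet.

    write_rate: new tokens per minute (created with no TTL)
    ttl_rate:   tokens the expiry pass can stamp with a TTL per minute
    """
    backlog = 0
    for _ in range(minutes):
        backlog += write_rate               # new tokens arrive without a TTL
        backlog -= min(backlog, ttl_rate)   # expiry pass stamps what it can
    return backlog

# 1500 new tokens/min vs 1000 stamped/min leaves 500 never-evictable keys
# accumulating every minute:
print(no_ttl_backlog(1500, 1000, 60))  # 30000
```

When `write_rate <= ttl_rate` the backlog stays at zero and volatile-ttl eviction keeps working, which matches the "it works for a while, then enters OOM" behaviour described above.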
It should log messages like this about every minute:
And statistics when full cycle is done (depends on database size, about a week):
The bayes expiry module is configured and working. As I said, it just doesn't seem to be setting TTLs fast enough for the new tokens being added (without a TTL), and the DB ends up full, with no keys eligible for eviction and no more writes (neither new keys nor expires on the existing ones).
What TTL did you set for bayes records? This doesn't make any sense, as Moiseev said. I have exactly the same environment and can't reproduce your issues. Please show your statistics.conf; I assume you set the TTL in the config to 0, which should not be used 🤣
Let me help you visualize. I've increased the maxmemory to a bit over the used memory. This is the output of
You can see used_memory increasing along with the keys. The expiry module kicks in and sets TTLs (1000 per minute, per frontend) but the keys increase too fast, to the point where the Redis memory gets full; it frees what it can, more ttl=0 keys are added, and it stops being able to do anything (except reads). My
Use statistic.conf instead of classifier bayes; that config was deprecated a long time ago. Delete it and read the how-tos on statistic.conf. New keys should be set with a TTL by default, not without one, afaik. Moreover, random is not the best option for selecting read replicas; remove that directive, as well as the 20s timeout. Also, just curious, how can you use a password without a user? And since you are using volatile-ttl, an expire of 10d doesn't make much sense; you can set it to 1y, as keys will be evicted by lowest TTL before expiration anyway. P.S. The only keys that should exist in your server at any point in time are learned_ids (and the X variants of it), that's all. In short, your configuration is wrong, which leads to all these issues.
What do you expect if you learn faster than expiry can recycle?
But statistics key TTLs are set right at creation time, and with the volatile-ttl policy it should work just fine no matter how quickly it writes, no?
It doesn't seem to, no. From the behaviour and from
So the only solution is to either reduce the learn rate or increase the Redis memory? Can't the keys be set with a (very, very) long expire to prevent the problem?
It will break the expiry module's logic. You are suggesting allowing Redis to evict not-yet-categorized tokens, which will lead to statistics degradation. You can decrease
I totally agree with @moisseev here; this will likely break the whole learning procedure if we allow Redis to expire some random (potentially significant) tokens. If we want some memory footprint reduction, we might introduce a unigrams mode (instead of the bigrams that are used currently). That would lead to databases at least 5 times smaller, but would potentially degrade the Bayes mode to a naive one. Another option is to reduce the window size (e.g. use only 2 joint bigrams instead of 5). All the options are possible and not very hard to implement, but they will require a full stat reset to be used. It's actually a discussion point here (@moisseev + @fatalbanana to highlight).
@dragoangel Are you sure? Maybe I missed something, but I believe for most setups
Well, I read that in the comments in the config and the docs, @moisseev
Probably we should correct that.
I get it. It was just under the assumption that a just-seen token could be categorized as "infrequent". But that could lead to it being evicted too soon, with the degradation you point out.
I have been doing that. That's what I meant by "a bit more aggressive bayes_expiry" in the first post. I will check what we can do to control the autolearn rate (we already learn from user reports and spamtraps, so we can live with that).
@moisseev https://github.com/rspamd/rspamd/blob/master/conf/statistic.conf#L11-L14
Btw, @japc, how many replicas of rspamd normal/controller do you have?
2 controller processes, 20 normal processes.
Would increasing the controller process count improve the bayes categorization?
Just to clarify, what do you mean by process? 1 instance with 2 controller processes and a 20 normal child process limit, or is this a count of dedicated servers (containers/pods/etc.)?
From the logs it seems to always run on the same controller, so I guess not.
5 rspamd servers, each one multipurpose, with 2 controller + 20 normal each.
Yeah, from the code there is a lock based on the hostname, and only the rspamd with the matching hostname will run it. Try what Moiseev proposed: play with interval/count. I think to start you can try increasing count 5x, and then see how it goes.
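A sketch of the kind of override being discussed. The option names count and interval come from this thread, but the exact file and section depend on the rspamd version in use (see #4972), so treat the placement and values below as assumptions and check the documentation for your version:

```
# local.d/classifier-bayes.conf -- illustrative placement; option names assumed
expiry {
  interval = 60;  # seconds between expiry steps
  count = 5000;   # keys processed per step, 5x the 1000/min figure quoted above
}
```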
That's not what I read from the documentation, or from the includes in statistics.conf.
With requirepass, not ACLs.
I understand the rationale for having a big number there, if everything is functioning properly, but, even more extreme, I usually shrank it down to a couple of hours to make Redis usable (as part of the "a bit more aggressive bayes_expiry" I mentioned).
I don't think that's the case.
Yes, records are first created without a TTL, then processed, and only after that do they get their TTL. I was wrong on that part.
@japc do you still have issues? I updated to Rspamd 3.9.0 (meaning current master, not yet released, built from source) and can say that I hit a situation where bayes expiry wasn't going over the bayes DB quickly enough, resulting in new keys being added much faster than existing ones got their TTLs set. In my case I set the bayes expiry count to 15000, which means 15k keys are processed per interval, and with that configuration everything works. Do you have monitoring of your bayes Redis DBs, like Prometheus with Grafana? It can simplify your understanding of what is going on. P.S. To apply the mentioned changes I had to add #4972 to my rspamd, to load the count and interval settings from
As said before, that has always been my workaround, although I used to restore those values to the defaults after the Redis memory was OK, and would then end up with an unusable Redis again after a while. I have since considered them a definitive solution, keeping a smaller interval and a larger count, and things have been under control ever since. Having those as configurable settings is great. Thank you.
Prerequisites
Describe the bug
We have a kind of deadlock situation, where bayes_expiry fails because the memory is full:
As seen from the logs, the learn process also fails (and possibly doesn't trigger any key setting that would make Redis evict keys according to the policy and free some memory?).
My Redis maxmemory configuration is:
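The report doesn't show the actual values; purely as an illustration, a volatile-ttl setup of the kind discussed in this thread (the 40MB limit matches the lab repro mentioned in the comments, not the production box) looks like:

```
# redis.conf -- illustrative values only, not the reporter's actual config
maxmemory 40mb
maxmemory-policy volatile-ttl   # evict only keys that have a TTL, nearest expiry first
```

Note that under volatile-ttl, keys created without a TTL are never eviction candidates, which is the failure mode described in the comments above.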
Steps to Reproduce
Give it some time until enough statistical tokens fill the database.
Expected behavior
Should work as described on the Bayes expiry module documentation, with the memory optimally full, but functional.
Versions
(rspamd is compiled from git 5f8ea68, but this has always happened, for as far as I can remember)
OS: Debian 12.5 64bits
Redis server v=6.2.7 sha=00000000:0 malloc=jemalloc-5.1.0 bits=64 build=8d5678d625cf855c
Additional Information
As a workaround we temporarily increase the Redis memory and let rspamd run with a somewhat more aggressive bayes_expiry configuration for a while.