
[BUG] Condition where the bayes_expiry loop fails because the memory is full #4921

japc opened this issue Apr 12, 2024 · 39 comments

@japc

japc commented Apr 12, 2024

Prerequisites

Describe the bug

We have a kind of deadlock situation where bayes_expiry fails because Redis memory is full:

2024-04-12 08:46:37 #501(controller) <djphd3>; lua; bayes_expiry.lua:408: cannot perform expiry step: ERR Error running script (call to f_acab6c1e47b0f81a1f2e58a41611fed28f249daf): @user_script:36: @user_script: 36: -OOM command not allowed when used memory > 'maxmemory'.
2024-04-12 08:46:37 #501(controller) <bd8064>; csession; rspamd_redis_learned: cannot learn task: ERR Error running script (call to f_18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6): @user_script:1: @user_script: 1: -OOM command not allowed when used memory > 'maxmemory'.
2024-04-12 08:46:37 #501(controller) <bd8064>; csession; rspamd_task_process: learn error: ERR Error running script (call to f_18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6): @user_script:1: @user_script: 1: -OOM command not allowed when used memory > 'maxmemory'.

As seen in the logs, the learn process also fails (and possibly isn't triggering any key writes that would make Redis evict keys according to the policy and free some memory?).

My Redis maxmemory configuration is:

maxmemory 6GB
maxmemory-policy volatile-ttl

Steps to Reproduce

Give it some time until enough statistical tokens fill the database.

Expected behavior
It should work as described in the Bayes expiry module documentation: memory optimally full, but still functional.

Versions

$ rspamd --version
Rspamd daemon version 3.9.0

CPU architecture x86_64; features: avx2, avx, sse2, sse3, ssse3, sse4.1, sse4.2, rdrand
Hyperscan enabled: TRUE
Jemalloc enabled: FALSE
LuaJIT enabled: TRUE (LuaJIT version: LuaJIT 2.1.0-beta3)
ASAN enabled: FALSE
BLAS enabled: FALSE
Fasttext enabled: FALSE

(rspamd is compiled from git 5f8ea68, but this has always happened, as far as I can remember)

OS: Debian 12.5 64bits

Redis server v=6.2.7 sha=00000000:0 malloc=jemalloc-5.1.0 bits=64 build=8d5678d625cf855c

Additional Information

As a workaround we temporarily increase the Redis memory and let rspamd run with a somewhat more aggressive bayes_expiry configuration for a while.

@japc japc added the bug label Apr 12, 2024
@dragoangel
Contributor

dragoangel commented Apr 16, 2024

@japc check how many elements you have in learned_ids; this is almost the only key without a TTL set. Previously there was a bug where it wasn't rotated at all. Now it should be split into chunks of 10k elements across keys that look like learned_ids, learned_idsX, learned_idsXX, learned_idsXXX, etc.

@japc
Author

japc commented Apr 17, 2024

@dragoangel I have learned_ids, learned_idsX, ..., learned_idsXXXX with 10k elements each, and learned_idsXXXXX with 7k and growing:

127.0.0.1:6378> hlen learned_ids
(integer) 10000
127.0.0.1:6378> hlen learned_idsX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXXX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXXXX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXXXXX
(integer) 7048
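
For reference, the same check can be scripted; an illustrative loop over all learned_ids* hashes (same port as above, auth omitted):

for k in $(redis-cli -p 6378 --scan --pattern 'learned_ids*'); do
    printf '%s: ' "$k"; redis-cli -p 6378 hlen "$k"    # element count of each hash
done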

@dragoangel
Contributor

dragoangel commented Apr 17, 2024

Well, that looks fine. I honestly don't understand how you get OOM command not allowed when used memory > 'maxmemory' while you have maxmemory 6GB and maxmemory-policy volatile-ttl, because in that case Redis is responsible for freeing memory and should not run into OOM. What I'm trying to say is that this is not Rspamd's fault at all; the error you see is a Redis message that was returned to Rspamd, so what do you expect Rspamd to do with such an error from Redis? Answer: nothing.

I don't know how reliably you can reproduce this issue (it would be nice to know), but you should report it to Redis instead of Rspamd 😉. As for your Redis version: before reporting any issues I would advise you to update to the latest one, 7.2.4, and check whether the issue is still there. I assume that after moving to 7.2.4 you will no longer get such OOM errors.

@japc
Author

japc commented Apr 17, 2024

The ball is probably on the Redis side. I was thinking of a weird interplay between low-memory/OOM conditions on Redis and the Lua script eval (following some of the considerations under "Execution under low memory conditions" here: https://redis.io/docs/latest/develop/interact/programmability/eval-intro/).

I set up a small lab, with the same versions and configuration of everything and a 40MB Redis memory limit, and COULD NOT replicate the problem. I first tried 10MB and the error happened, but that could be because it was too small for proper use.

In production the issue occurred 2 or 3 times in the last couple of months. I'll try to see whether it happens as soon as the memory fills, or whether it works as it should (Redis evicting and staying usable) for a while and then enters the OOM state.

@dragoangel
Contributor

Well, 2 or 3 times in months is not an issue; it will run again later and succeed.

@japc
Author

japc commented Apr 17, 2024

That's 2 or 3 times where it started failing altogether and we had to recover with the workaround I mentioned, not 2 or 3 bayes_expiry failed steps between successes.

@dragoangel
Contributor

That's 2 or 3 times where it started failing altogether and we had to recover with the workaround I mentioned, not 2 or 3 bayes_expiry failed steps between successes.

Why not consider updating to 7.2?

@japc
Author

japc commented Apr 19, 2024

Why not consider updating to 7.2?

It is planned, but that and trying to understand why this is happening are not mutually exclusive, especially since I'm not sure the upgrade to 7.2 will change the behaviour.

It's happening right now. Started at 04h18 for the redis learn, and some minutes later for the bayes_expiry:

2024-04-19 04:18:08 #6127(normal) <a9a7a5>; task; rspamd_redis_learned: cannot learn task: ERR Error running script (call to f_18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6): @user_script:1: @user_script: 1: -OOM command not allowed when used memory > 'maxmemory'.

2024-04-19 04:22:57 #4571(controller) <kbbx9s>; lua; bayes_expiry.lua:408: cannot perform expiry step: ERR Error running script (call to f_acab6c1e47b0f81a1f2e58a41611fed28f249daf): @user_script:36: @user_script: 36: -OOM command not allowed when used memory > 'maxmemory'.

From the logs, Redis seems to keep accepting writes until about 06h31.

Any suggestions for getting some more debug info before I fix it?

@moisseev
Member

My Redis maxmemory configuration is:

maxmemory 6GB
maxmemory-policy volatile-ttl

Are you sure your configuration is actually being loaded?
Does # redis-cli -p 6378 info memory | grep maxmemory show the same policy as configured (maxmemory_policy:volatile-ttl) ?

@japc
Author

japc commented Apr 19, 2024

My Redis maxmemory configuration is:

maxmemory 6GB
maxmemory-policy volatile-ttl

Are you sure your configuration is actually being loaded? Does # redis-cli -p 6378 info memory | grep maxmemory show the same policy as configured (maxmemory_policy:volatile-ttl) ?

Pretty sure:

$ redis-cli -p 6378 info memory | grep memory
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
used_memory:7518842472
used_memory_human:7.00G
...
maxmemory:7516192768
maxmemory_human:7.00G
maxmemory_policy:volatile-ttl

I don't know much about how aggressive the Redis evictions are; Redis memory does seem to go down by some MB from time to time. But both rspamd_redis_learned and bayes_expiry have been failing consistently for hours.

@dragoangel
Contributor

dragoangel commented Apr 19, 2024

@japc are you using a separate Redis for Bayes, or a shared Redis that contains other modules as well? If it's shared, try moving Bayes to a dedicated Redis, or use a single db for all modules if that is not already the case. Redis eviction works strangely on multi-db instances with volatile-ttl. And I don't understand why you don't try updating to a fresher version; Rspamd works fine with 7.2. If you're worried about moving to a new version, back up the RDB and update; you can always revert to the older version from the RDB backup.

@japc
Author

japc commented Apr 19, 2024

@dragoangel

It's a dedicated instance: we have one for default and history (6379), one for fuzzy (6377), one for classifier_bayes (6378), one for neural (6381) and one for reputation (6380).

As for the Redis upgrade, it is scheduled, in addition to the other considerations I mentioned before.

@japc
Author

japc commented Apr 19, 2024

Upgraded redis to 7.2.4:

127.0.0.1:6378> info server
# Server
redis_version:7.2.4

After the Redis restart the memory went down a bit, but maxed out again soon:

127.0.0.1:6378> info memory
# Memory
used_memory:7516561904
used_memory_human:7.00G
...
maxmemory:7516192768
maxmemory_human:7.00G
maxmemory_policy:volatile-ttl

To the same effect: both learn and expire are failing with OOM:

2024-04-20 00:05:58 #4571(controller) <kbbx9s>; lua; bayes_expiry.lua:408: cannot perform expiry step: OOM command not allowed when used memory > 'maxmemory'. script: acab6c1e47b0f81a1f2e58a41611fed28f249daf, on @user_script:36.
2024-04-20 00:05:58 #8991(normal) <e4f656>; task; rspamd_redis_learned: cannot learn task: OOM command not allowed when used memory > 'maxmemory'. script: 18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6, on @user_script:1.

@japc
Author

japc commented Apr 20, 2024

This explains a bit:

127.0.0.1:6378> info keyspace
# Keyspace
db0:keys=69893012,expires=0,avg_ttl=0

My theory (see if it makes any sense) is that, with autolearn enabled, in our scenario the rate at which tokens are set exceeds the expiry rate, to the point where we end up with all keys having no TTL configured and Redis unable to free memory.

A simple info keyspace; sleep 60; info keyspace shows a large increase in keys vs. expires, which seems to support that theory:

db0:keys=69933904,expires=40030,avg_ttl=863517322
db0:keys=69946176,expires=42869,avg_ttl=863452830

If this is the case, having some kind of expire on the cache_learn would help.
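
A quick way to check how much volatile-ttl actually has to work with, and whether Redis has managed to evict anything at all (same port as in the examples above, auth omitted):

redis-cli -p 6378 info keyspace | grep '^db0:'          # keys= vs expires= (how many keys carry a TTL)
redis-cli -p 6378 info stats | grep '^evicted_keys:'    # total evictions since the server started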

@moisseev
Member

db0:keys=69893012,expires=0,avg_ttl=0

avg_ttl = 0 suggests that none of the keys has a TTL set. In this case volatile-ttl behaves like noeviction.
Check whether the bayes expiry module is configured and working.

It should log messages like this about every minute:

2024-04-19 23:08:50 #16243(controller) <b5o8ft>; lua; bayes_expiry.lua:443: finished expiry step 754: 1003 items checked, 161 significant (5 made persistent), 25 insignificant (1 ttls set), 0 common (0 discriminated), 817 infrequent (2 ttls set), 2 mean, 9 std

And statistics when a full cycle is done (depends on database size, about a week):

2024-04-19 04:15:01 #82966(controller) <b5o8ft>; lua; bayes_expiry.lua:443: finished expiry cycle in 4182 steps: 4183474 items checked, 700674 significant (19775 made persistent), 77716 insignificant (31 ttls set), 670 common (126 discriminated), 3404414 infrequent (23052 ttls set), 3 mean, 20 std

@japc
Author

japc commented Apr 20, 2024

@moisseev:

The bayes expiry module is configured and working.

As I said, it just doesn't seem to be setting ttls fast enough for the new tokens being added (without ttl), and the DB ends up full, with no keys eligible for eviction, and no more writes (either new keys or expires on the existing).

@dragoangel
Contributor

dragoangel commented Apr 20, 2024

@moisseev:

The bayes expiry module is configured and working.

As I said, it just doesn't seem to be setting ttls fast enough for the new tokens being added (without ttl), and the DB ends up full, with no keys eligible for eviction, and no more writes (either new keys or expires on the existing).

What TTL did you set for Bayes records? This doesn't make any sense, as @moisseev said. I have exactly the same env and can't reproduce your issues. Please show your statistic.conf; I assume you set the TTL in the config to 0, which should not be used 🤣

@japc
Author

japc commented Apr 20, 2024

@dragoangel

Let me help you visualize. I've increased maxmemory to a bit over the used memory. This is the output of info memory / info keyspace once every 60s:

     max_memory      used_memory        keys     expires
     7090000000       7018502016    70069493        1889
     7090000000       7032416960    70081869        6529
     7090000000       7025780096    70076159        3755
     7090000000       7037167640    70085911        8378
     7090000000       7044163360    70089336       12138
     7090000000       7050636544    70094485       13993
     7090000000       7056466136    70099361       15880
     7090000000       7061918592    70104532       18665
     7090000000       7068456624    70108818       20519
     7090000000       7074640480    70112359       24223
     7090000000       7080812968    70118271       27019
     7090000000       7086820608    70127044       30800
     7090000000       7090074792    70098710           0
     7090000000       7090231216    70099355           0
     7090000000       7090353144    70099590           0
     7090000000       7090703600    70099678           0
     7090000000       7090311968    70099756           0

You can see used_memory increasing along with the keys. The expiry module kicks in and sets TTLs (1000 per minute, per frontend), but the keys increase too fast, to the point where Redis memory gets full; it frees what it can, more ttl=0 keys are added, and it stops being able to do anything (except reads).
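
For reference, a table like the one above can be collected with a loop along these lines (same port, auth omitted, column alignment left out):

while true; do
    redis-cli -p 6378 info memory   | grep -E '^(maxmemory|used_memory):'
    redis-cli -p 6378 info keyspace | grep '^db0:'
    sleep 60
done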

My local.d/classifier-bayes.conf is

backend = "redis";
write_servers = "ip1:6378";
read_servers = "random:ip1:6378,ip2:6378,ip2:6378,ip4:6378";
password = "***********";
timeout = 20s;
new_schema = true;
autolearn = [1, 12 ];
lazy = true;
expire = 864000;

@dragoangel
Contributor

dragoangel commented Apr 20, 2024

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago. Delete it and read the how-tos on statistic.conf. New keys should be set with a TTL by default, not without one, afaik. Moreover, random is not the best option for selecting read replicas; remove that directive, as well as the 20s timeout. Also, just curious, how can you use a password without a user? And since you are using volatile-ttl, an expire of 10d doesn't make much sense; you can set it to 1y, since keys would be evicted by lowest TTL before expiration anyway.

P.S. The only key that should exist in your server at any given time is learned_ids (and its X variants), that's all. In short, your configuration is wrong, which leads to all these issues.

@vstakhov
Member

What do you expect if you learn faster than expire can recycle?

@dragoangel
Contributor

dragoangel commented Apr 20, 2024

What do you expect if you learn faster than expire can recycle?

But statistics keys' TTLs are set right at the moment of their creation, and with the volatile-ttl policy it should work just fine no matter how quickly it writes, no?

@japc
Author

japc commented Apr 20, 2024

But statistics keys' TTLs are set right at the moment of their creation, and with the volatile-ttl policy it should work just fine no matter how quickly it writes, no?

They don't seem to be, no, judging from the behaviour and from lualib/redis_scripts/bayes_learn.lua where, if I'm not mistaken, the token msgpack gets expanded and the keys are HSET.

What do you expect if you learn faster than expire can recycle?

So the only solution is to either reduce the learn rate or to increase the redis memory? Can't the keys be set with a (very very) long expire to prevent the problem?

@moisseev
Member

So the only solution is to either reduce the learn rate or to increase the redis memory? Can't the keys be set with a (very very) long expire to prevent the problem?

It will break expiry module's logic. You are suggesting to allow Redis to evict non-categorized (yet) tokens that will lead to statistics degradation.

You can decrease interval or increase count to speed up Bayes expiry module.
interval = 60, -- one iteration step per minute
count = 1000, -- check up to 1000 keys on each iteration
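
For a rough sense of scale, using the numbers quoted in this thread: checking 1000 keys every 60s covers about 1.44M keys per day, so a single controller needs roughly 48 days for a full pass over a ~70M-key database, while the table above shows the keyspace growing by several thousand keys per minute. The check rate therefore has to go up by at least an order of magnitude (larger count and/or smaller interval) before TTLs get assigned faster than new tokens arrive.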

@vstakhov
Member

I totally agree with @moisseev here: this will likely break the whole learning procedure if we allow Redis to expire some random (potentially significant) tokens. If we want some memory footprint reduction, we might introduce a unigrams mode (instead of the bigrams that are used currently). That would lead to databases at least 5 times smaller but would potentially degrade Bayes to a naive mode. Another option is to reduce the window size (e.g. use only 2 joint bigrams instead of 5). All these options are possible and not very hard to implement, but they would require a full stat reset to be used.

It's actually a discussion point here (@moisseev + @fatalbanana to highlight).

@moisseev
Member

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago.

@dragoangel Are you sure? Maybe I missed something, but I believe for most setups local.d/classifier-bayes.conf is a better place, as it is included inside the classifier "bayes" {} section.

@dragoangel
Contributor

dragoangel commented Apr 20, 2024

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago.

@dragoangel Are you sure? Maybe I missed something, but I believe for most setups local.d/classifier-bayes.conf is a better place, as it is included inside the classifier "bayes" {} section.

Well I read that from the comments in config and docs, @moisseev

@moisseev
Member

Well I read that from the comments in config and docs, @moisseev

Probably we should correct that.
@dragoangel Can you point me to the exact places? I can't find it.

@japc
Author

japc commented Apr 20, 2024

It will break expiry module's logic. You are suggesting to allow Redis to evict non-categorized (yet) tokens that will lead to statistics degradation.

I get it. I was just working under the assumption that a just-seen token could be categorized as "infrequent", but that could lead to it being evicted too soon, with the degradation you point out.

You can decrease interval or increase count to speed up Bayes expiry module.

I've been doing that. That's what I meant by "a bit more aggressive bayes_expiry" in the first post.

I'll check what we can do to control the autolearn rate (we already have learning from user reports and spamtraps, so we can live with that).
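
One illustrative knob for that (not something suggested in this thread, and assuming I read the autolearn thresholds in the config above correctly) is widening the autolearn score band in local.d/classifier-bayes.conf so that fewer messages qualify:

autolearn = [-2, 15];   # hypothetical: learn ham only below -2 and spam only above 15, instead of [1, 12]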

@dragoangel
Contributor

dragoangel commented Apr 20, 2024

Well I read that from the comments in config and docs, @moisseev

Probably we should correct that. @dragoangel Can you point me to the exact places? I can't find it.

@moisseev https://github.com/rspamd/rspamd/blob/master/conf/statistic.conf#L11-L14
Well, it's phrased a bit differently from how it was stored in my head 😅, but the main point I remembered was not to use it, and that everything that can be set in classifier-bayes.conf can also be set in statistic.conf, e.g. for multiple Bayes classifiers.

@dragoangel
Contributor

Btw, @japc how many replicas of rspamd normal/controller do you have?

@japc
Author

japc commented Apr 20, 2024

Btw, @japc how many replicas of rspamd normal/controller do you have?

2 controller processes, 20 normal processes.

@japc
Author

japc commented Apr 20, 2024

Would increasing the controller process count improve the bayes categorization?

@dragoangel
Contributor

dragoangel commented Apr 20, 2024

Btw, @japc how many replicas of rspamd normal/controller do you have?

2 controller processes, 20 normal processes.

Just to clarify, what do you mean by process? One instance with 2 controller processes and a limit of 20 normal child processes, or is this a count of dedicated servers (containers/pods/etc.)?

@japc
Author

japc commented Apr 20, 2024

Would increasing the controller process count improve the bayes categorization?

From the logs it seems to always run on the same controller, so I guess not.

Just to clarify, what do you mean by process? One instance with 2 controller processes and a limit of 20 normal child processes, or is this a count of dedicated servers (containers/pods/etc.)?

5 rspamd servers. Each one multipurpose, with 2 controller + 20 normal each.

@dragoangel
Contributor

Would increasing the controller process count improve the bayes categorization?

From the logs it seems to always run on the same controller, so I guess not.

Just to clarify, what do you mean by process? One instance with 2 controller processes and a limit of 20 normal child processes, or is this a count of dedicated servers (containers/pods/etc.)?

5 rspamd servers. Each one multipurpose, with 2 controller + 20 normal each.

Yeah, from the code there is a lock based on hostname, and only the rspamd instance with the matching hostname will run it. Try what @moisseev proposed: play with interval/count. To start, I think you can try increasing count 5x and then see how it goes.

@japc
Author

japc commented Apr 28, 2024

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago. Delete it and read the how-tos on statistic.conf.

That's not what I read in the documentation, or in the includes of statistic.conf.

Also just curious how you can use password without user?

With requirepass, not ACLs.

And since you are using volatile-ttl, an expire of 10d doesn't make much sense; you can set it to 1y, since keys would be evicted by lowest TTL before expiration anyway.

I understand the rationale for having a big number there if things are functioning properly, but, going to the other extreme, I usually shrank it down to a couple of hours to make Redis usable (as part of the "a bit more aggressive bayes_expiry" I mentioned).

P.S. The only key that should exist in your server at any given time is learned_ids (and its X variants), that's all. In short, your configuration is wrong, which leads to all these issues.

Don't think that's the case.

@dragoangel
Contributor

Don't think that's the case.

Yes, records are first created without a TTL, then processed, and only after that do they get their TTL. I was wrong about that part.

@dragoangel
Contributor

@japc you still have issues?

I updated to Rspamd 3.9.0 (meaning current master, not yet released, built from source) and can say that I ran into a situation where bayes expiry was not going over the Bayes DB quickly enough, so new keys were added much faster than existing ones got TTLs set. In my case I set the bayes expiry count to 15000, which means 15k keys are processed per interval, and with that configuration everything works.

Do you have monitoring of your Bayes Redis DBs, e.g. Prometheus with Grafana? It can simplify understanding what is going on.

P.S. To apply the mentioned changes I had to add #4972 to my rspamd, in order to load the count and interval settings from local.d/bayes_expiry.conf.
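
For reference, with that change applied the override would look roughly like this (values from this comment; without #4972 these settings are not read from this file):

# local.d/bayes_expiry.conf
count = 15000;    # keys checked per expiry step (module default: 1000)
interval = 60;    # seconds between steps (module default: 60)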

@japc
Author

japc commented May 20, 2024

@japc you still have issues?
...
In my case I set the bayes expiry count to 15000, which means 15k keys are processed per interval, and with that configuration everything works.

As said before, that has always been my workaround. I used to restore those values to the defaults after Redis memory was OK, and would then end up with an unusable Redis again after a while, but I have since treated them as a definitive solution, keeping a smaller interval and a larger count, and things have been under control ever since.

Having those as configurable settings is great. Thank you.
