Backplane AutoRecover not working as expected (or are my expectations wrong? 😅) #158
-
Hello, the POC consists of 3 projects:
FusionCache is configured to use Redis as the 2nd-level cache and as the backplane. I wanted to test the AutoRecover feature, so I started Redis and both FusionCacheApi and FusionCacheWorker.
Am I missing something? Thanks for your help :)
Replies: 7 comments 3 replies
-
Hi @martinobordin , thanks for considering FusionCache. Your expectations seem correct (note: see below for a correction)! I was about to ask you for some more details, like which cache duration you were using, which configuration, which version, etc., but you've been so kind as to prepare a POC. Therefore I'll take a look at that and will let you know 😊
-
Hi @martinobordin , sorry for the delay but it has been a rough week.
I have some updates for you: your question pushed me into a rabbit hole of changes, optimizations and bugfixes and... I'm not out of it yet 😅
Anyway, I wanted to update you on some things, specifically:
🐞 A Bug
There was in fact a bug or, to be more precise, a scenario that was not supported as well as I would have liked: the one where both the distributed cache and the backplane fail at the same time. In that situation the end result was not always the ideal one, and there was some work to be done. Now, thanks to your input, I'm handling it correctly 🎉
🤔 Your Expectations
Although I told you your expectations were correct in my first answer, I have to change that: they are not actually correct. Let me explain.
It is true that the idea behind the AutoRecovery feature is to automatically handle caches going out of sync when there are connection issues, but there's one thing to keep in mind: the 1st (memory) and 2nd (distributed) levels are, still, caches. They are not, and should not be used as, a data source. FusionCache is built around the fact that the cache is just a cache, and the usual way of working with it is, for example, to use the GetOrSet method with a factory that loads data from the database.
The AutoRecovery feature has been designed to cover a scenario in which there are multiple nodes and, for some time, the backplane (used to keep the nodes in sync) goes away: when that happens, outgoing messages cannot be delivered, so AutoRecovery keeps them in a local queue and, as soon as the backplane is up again, sends them (in an optimized way, with de-duplication of multiple messages for the same cache key and other heuristics like that). But again, this means that there must be a so-called "single source of truth", which normally is a database.
In your POC though, for which again I thank you (and which I'm using, slightly modified, as a testing scenario), you set some data in the cache from the API and then read it from the WORKER: the problem is that in this way the data exists only in the cache (memory and/or distributed) and nowhere else, because there's no database (real or fake).
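To make the queue-and-flush idea above concrete, here is a minimal, language-agnostic sketch in Python (FusionCache itself is C#; all names here are hypothetical and this is not the library's actual implementation): while the backplane is down, outgoing notifications are kept locally and de-duplicated per cache key, then flushed on reconnect.

```python
# Sketch of an auto-recovery queue with per-key de-duplication.
# Hypothetical names; illustrative only, not FusionCache's real internals.

class AutoRecoveryQueue:
    def __init__(self, publish):
        self.publish = publish   # function that sends a message on the backplane
        self.pending = {}        # cache key -> latest undelivered message

    def notify(self, key, message, backplane_up):
        if backplane_up:
            self.publish(key, message)
        else:
            # Keep only the most recent message per key: an older
            # notification for the same key is superseded anyway.
            self.pending[key] = message

    def on_backplane_reconnected(self):
        # Flush the queue: one message per key, no matter how many
        # updates happened while the backplane was down.
        for key, message in self.pending.items():
            self.publish(key, message)
        self.pending.clear()

sent = []
q = AutoRecoveryQueue(lambda k, m: sent.append((k, m)))
q.notify("product:42", "set v1", backplane_up=False)
q.notify("product:42", "set v2", backplane_up=False)  # supersedes "set v1"
q.on_backplane_reconnected()
print(sent)  # [('product:42', 'set v2')]
```

Note how the two queued updates for the same key collapse into a single delivered message: that is the de-duplication heuristic mentioned above.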
Now consider that every cache, be it in-memory or distributed, is ephemeral by definition: in your POC, in fact, Redis is memory-only and not persisted. This means that when the cache itself goes away, like in the part of the test where we stop Redis and then restart it, the data is basically gone for good.
So the normal flow in FusionCache (and the one the AutoRecovery feature has been designed for) is to call GetOrSet with a factory that loads the data from the database. This will:
- check the memory cache first;
- then check the distributed cache, saving the result in the memory cache on a hit;
- if no fresh data is found in either level, call the factory, which reads from the database (the single source of truth), and save the result in both cache levels.
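The flow above can be sketched in a few lines. This is a deliberately simplified Python model (FusionCache's real API is C# and far richer; `TwoLevelCache` and `get_or_set` here are hypothetical names), just to show why a database-backed factory makes the caches safely disposable:

```python
# Illustrative sketch of the memory -> distributed -> factory flow.
# Hypothetical names; not the actual FusionCache (C#) API.

class TwoLevelCache:
    def __init__(self):
        self.memory = {}       # L1: per-node memory cache
        self.distributed = {}  # L2: shared distributed cache (e.g. Redis)

    def get_or_set(self, key, factory):
        # 1. Check the memory cache first.
        if key in self.memory:
            return self.memory[key]
        # 2. Then check the distributed cache, populating L1 on a hit.
        if key in self.distributed:
            value = self.distributed[key]
            self.memory[key] = value
            return value
        # 3. On a full miss, call the factory: the database is the
        #    single source of truth, the caches are only copies of it.
        value = factory()
        self.distributed[key] = value
        self.memory[key] = value
        return value

db = {"product:42": "widget"}  # the "database": the single source of truth
cache = TwoLevelCache()
print(cache.get_or_set("product:42", lambda: db["product:42"]))  # loaded via the factory
cache.distributed.clear()  # simulate Redis restarting: L2 data is gone for good
print(cache.get_or_set("product:42", lambda: db["product:42"]))  # still served (L1, or the factory again)
```

The key point: because the factory can always rebuild the value from the database, losing Redis (or the memory cache) is an inconvenience, not data loss. In the POC there is no `db`, so nothing can be rebuilt.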
I'm not sure I've been able to explain myself fully and clearly, so please let me know!
⏩ The Path Forward
On top of fixing the first situation I mentioned earlier, there are other things I'd like to change while I'm on this "backplane sprint". For example, up until now the processing of the auto-recovery queue would happen as soon as a message arrived at a node (like a wake-up call), in a "passive" way: this means, by the way, that if the backplane is up again but no new messages are received for some time, the queued messages would just sit there unprocessed. Your POC is super simple and isolated, so this becomes more evident, because the only messages going around are the ones for your single cache key.
Now, with the new version I'm working on, this is no longer the case and FusionCache is more "active" in this regard (but still, since the distributed cache is the only source of truth in your POC, it does not work as you expected). I'm also investigating other edge cases and areas for improvement, on top of even better performance in some cases.
Again, thanks for trying out FusionCache, for the POC and in general for your time. I will update you as soon as I have some news about this.
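The "passive" vs "active" distinction above can be sketched as follows (again a hypothetical Python model, not FusionCache's real internals): passively, the pending queue is only drained when some incoming message happens to wake the node up; actively, the node retries on its own, for example on a timer.

```python
# Sketch of passive vs active auto-recovery queue processing.
# Hypothetical names; illustrative only.

class Backplane:
    def __init__(self):
        self.is_up = False
        self.delivered = []

    def send(self, msg):
        self.delivered.append(msg)

class Node:
    def __init__(self, backplane):
        self.backplane = backplane
        self.pending = ["evict key-x"]  # queued while the backplane was down

    def _flush(self):
        while self.backplane.is_up and self.pending:
            self.backplane.send(self.pending.pop(0))

    def on_message_received(self, msg):
        # Passive: processing piggybacks on incoming traffic, so a quiet
        # backplane means the queue is never drained.
        self._flush()

    def tick(self):
        # Active: a periodic check drains the queue even with no traffic.
        self._flush()

bp = Backplane()
node = Node(bp)
bp.is_up = True        # the backplane comes back up...
print(node.pending)    # ['evict key-x'] - passively, still queued: no traffic arrived
node.tick()            # the active check drains it regardless of traffic
print(bp.delivered)    # ['evict key-x']
```

In a single-key POC like the one discussed here, no other traffic ever arrives, so the passive variant never gets its wake-up call; an active periodic check fixes exactly that.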
-
Hey @jodydonetti, I totally agree we're talking about a cache and not a datastore. Anyway, after step 3 this would not be enough, since the worker keeps using the not-yet-expired in-memory entry; it won't get fresh data (from the 2nd-level cache or from the datastore) until the entry expires (or the backplane starts to deliver change notifications again). I'm glad you found a way to handle this edge case. So thank you very much again for your wonderful library and the even better support and documentation you deliver!
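The staleness described above can be shown in a tiny sketch (hypothetical names, Python for illustration, not the actual FusionCache API): a still-fresh L1 (memory) entry shadows a newer value in L2 until it expires or an eviction notification arrives.

```python
# Sketch of a fresh L1 entry shadowing a newer L2 value.
import time

memory = {"counter": {"value": 10, "expires_at": time.monotonic() + 60}}
distributed = {"counter": 11}  # another node already wrote a newer value

def get(key):
    entry = memory.get(key)
    # While the memory entry is not expired it wins, even if L2 has
    # newer data: only expiration or a backplane notification evicts it.
    if entry and entry["expires_at"] > time.monotonic():
        return entry["value"]
    return distributed.get(key)

print(get("counter"))   # 10: the stale but not-yet-expired memory entry
del memory["counter"]   # simulate a backplane eviction notification
print(get("counter"))   # 11: now the fresh value from L2 is visible
```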
-
Hey @martinobordin , I just finished all the work behind the backplane sprint I talked about, and I'll release a new version with these optimizations very soon. You can see most of the changes here and here. Hope this helps.
ps: when done I'll create a PR for your POC so you'll be able to test the changes.
-
ps: if you've found my answer useful, can you please mark it as the answer? Thanks!
-
Hi @martinobordin , v0.23.0 has been released with all of the above included 🎉 Please let me know if all is working correctly now, thanks!
-
Hi @jodydonetti , and thank you for the support. After restarting Redis (step 8), I see that the Worker reconnects, but it continues to print the old value (10, in my screenshot below) and, after a while, it doesn't print anything (the cache entry has been evicted). Given your explanation, maybe this is the expected behavior, since the worker should go directly to the database and get the latest version of the data: is that true? At this point my question would be: when will the backplane update other nodes with the newest value, and when not? 🤔 If you want we can have a shared session so I'll show you. Thank you for your time.