
[feat] redeployment long-lived actors when node left cluster #290

Open · qazwsxedckll opened this issue on Apr 16, 2024 · 20 comments
Assignees: Tochemey
Labels: duplicate (This issue or pull request already exists), feature (New feature or request)

@qazwsxedckll (Collaborator)

No description provided.

@qazwsxedckll changed the title from "[feat-want] redeployment long-lived actors when node left cluster" to "[feat] redeployment long-lived actors when node left cluster" on Apr 16, 2024
@Tochemey (Owner)

@qazwsxedckll, thank you so much for this issue. There is already an issue I created some time back to tackle that; you can find it here. At the moment I need to rethink it properly. However, I am open to any PR in that regard.

@Tochemey added the duplicate (This issue or pull request already exists) and feature (New feature or request) labels on Apr 16, 2024
@qazwsxedckll (Collaborator, Author)

First of all, I am so impressed by your hard work; the documentation is very clear and the code is very clean.
To my understanding, the problem now is that the framework doesn't support spawning actors remotely.
I am contributing to another project, protoactor-go, in which each node announces what kinds of actors it can spawn via service discovery.

P.S. I haven't checked the code entirely yet. I want to know which framework this is most similar to: Akka, Orleans, or Erlang?

@Tochemey (Owner) commented Apr 16, 2024

@qazwsxedckll one can spawn an actor remotely, with some caveats. :)
Kindly check the release notes: https://github.com/Tochemey/goakt/releases/tag/v1.6.0
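
For illustration, here is a minimal sketch of the idea behind registering actor types and spawning by name on a remote node. The type, function, and field names below are assumptions made for this sketch and are not Go-Akt's actual API; the release notes linked above describe the real mechanism and its caveats.

```go
// Sketch only: not Go-Akt's real API. The idea is that every node registers
// the concrete actor types it can construct under a string name, so a peer
// only has to send the (type name, actor name) pair to request a spawn.
package sketch

import "log"

// Actor stands in for the framework's actor interface.
type Actor interface {
	Receive(msg any)
}

// Registry maps registered type names to factory functions.
type Registry map[string]func() Actor

// RemoteSpawnRequest is what one node would send to a peer to ask it to
// create an actor locally.
type RemoteSpawnRequest struct {
	ActorName string // unique actor name, e.g. "client-42"
	ActorType string // registered type name, e.g. "MqttClientActor"
}

// HandleRemoteSpawn is what the receiving node would run: look up the
// factory by type name and instantiate the actor locally.
func HandleRemoteSpawn(reg Registry, req RemoteSpawnRequest) Actor {
	factory, ok := reg[req.ActorType]
	if !ok {
		log.Printf("unknown actor type %q, cannot spawn %q", req.ActorType, req.ActorName)
		return nil
	}
	log.Printf("spawning %q of type %q", req.ActorName, req.ActorType)
	return factory()
}
```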

@Tochemey (Owner)

First of all, I am so impressed by your hard work; the documentation is very clear and the code is very clean. To my understanding, the problem now is that the framework doesn't support spawning actors remotely. I am contributing to another project, protoactor-go, in which each node announces what kinds of actors it can spawn via service discovery.

P.S. I haven't checked the code entirely yet. I want to know which framework this is most similar to: Akka, Orleans, or Erlang?

I have used all these actor frameworks in the past, to be honest, and was contributing to protoactor-go for some time before starting this. So I picked ideas from all of them. :)

@Tochemey (Owner) commented Apr 16, 2024

@qazwsxedckll the challenge Go-Akt is facing is being able to serialize the Actor implementation so that we can pass it over the wire. There are various approaches to that, which I stated in the closed ticket here.

@qazwsxedckll (Collaborator, Author) commented Apr 17, 2024

Sorry, but I don't get it. Why is serialization related to redeployment? Who should send the serialized data, and who should receive it? I saw the RemoteSpawn method; all it needs is the actor type and the actor name. I also saw that the Register method is not automatically called when starting the actor system. How about adding the registered types to Olric? Since nodes are aware of the node-dead event, all the nodes that can spawn the dead node's actors can decide which one should spawn them. Usually all actor types are available on all nodes; a heterogeneous cluster is not common, I think.

Here is my use case: a user can create an MQTT client on the web and subscribe to some topics. I want to create an actor for each client, and when a node goes down I need to recreate its actors and subscribe to the topics again. So recreating the actor whenever it is accessed is not going to work for me.
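
A rough sketch of what such an actor could look like, assuming hypothetical PreStart/PostStop lifecycle hooks and an abstract MQTT client rather than any specific library: because the topics are part of the actor's definition, redeploying and starting it on another node is enough to restore the subscriptions.

```go
// Sketch only: hook names and the MQTT client are placeholders, not the
// actual Go-Akt actor interface or a specific MQTT library.
package sketch

import "context"

// mqttClient abstracts whatever MQTT library the real code would use.
type mqttClient interface {
	Connect(ctx context.Context) error
	Subscribe(topic string) error
	Disconnect() error
}

// MqttClientActor owns one MQTT connection. Its topics are part of its
// definition, so a fresh start on any node re-subscribes automatically.
type MqttClientActor struct {
	client mqttClient
	topics []string
}

// PreStart is the assumed lifecycle hook run when the actor starts, whether
// on the original node or after redeployment on another node.
func (a *MqttClientActor) PreStart(ctx context.Context) error {
	if err := a.client.Connect(ctx); err != nil {
		return err
	}
	for _, topic := range a.topics {
		if err := a.client.Subscribe(topic); err != nil {
			return err
		}
	}
	return nil
}

// PostStop is the assumed hook run when the actor shuts down cleanly.
func (a *MqttClientActor) PostStop(ctx context.Context) error {
	return a.client.Disconnect()
}
```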

@Tochemey (Owner) commented Apr 17, 2024

@qazwsxedckll I will take a deep look into this use case. However, a PR is also welcome in case you think of something you would like to share. That said, I really don't understand this: "So recreating the actor whenever it is accessed is not going to work for me."

Redeployment involves a little more than creating actors remotely, because we need to at least cater for the cluster topology and answer the following question: which node is going to receive the actors of the node that went down?

@qazwsxedckll (Collaborator, Author) commented Apr 18, 2024

When a node leaves the cluster, its actors are re-created whenever they are accessed on other nodes in the cluster. This is feasible because the same actor system runs across the cluster.

You mentioned this in the previous issue. I just want to clarify that in my use case the actor should start automatically, and no other actor actually calls it, since subscribing to topics is an active action. This is the downside of the so-called 'virtual actor', so I am looking for other solutions.

@Tochemey (Owner) commented Apr 18, 2024

@qazwsxedckll This was what I meant:

  • A node dies. In the redeployment case, all the actors on that node will be moved to another node and started. For stateful actors we will need some kind of distributed datastore to recover the previous state. Stateless actors should be fine.
  • Once they are called, they can respond (i.e. they are ready to respond to messages).

Apologies for the misunderstanding.

@Tochemey (Owner)

@qazwsxedckll I will start some draft work on the redeployment feature. However, I cannot promise when it will be ready, because I need to dig into Olric's capabilities a bit and see what can help implement it, or whether I have to add a layer on top of it to achieve it.

@qazwsxedckll (Collaborator, Author) commented Apr 18, 2024

  • A node dies. In the redeployment case, all the actors on that node will be moved to another node and started. For stateful actors we will need some kind of distributed datastore to recover the previous state. Stateless actors should be fine.

Just for your reference, when using actors I always load data from the database on initialization and keep saving data back to the database while processing messages. If I store any state in memory, I assume it can be lost at any time.

If what you mean is something like the Dapr actor state store, I don't find it very useful. For business data, a key-value store is far from enough. For actor state, I have not come up with a good use case yet.

Anyway, I think it is nice to have, but not necessary.

moved to another node and started

Yes, exactly.
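
A small sketch of the pattern described above (load from the database on initialization, persist while processing messages); the Store interface and the hook names are placeholders for illustration, not a specific framework's API.

```go
// Sketch of "load on init, save on every message"; all names are illustrative.
package sketch

import "context"

// Store is whatever durable storage the real system uses.
type Store interface {
	Load(ctx context.Context, actorID string) (map[string]string, error)
	Save(ctx context.Context, actorID string, state map[string]string) error
}

// DurableActor keeps its working state in memory but treats the database
// as the source of truth, so losing the in-memory copy is harmless.
type DurableActor struct {
	id    string
	store Store
	state map[string]string
}

// PreStart (assumed hook) restores the actor's state from the database.
func (a *DurableActor) PreStart(ctx context.Context) error {
	state, err := a.store.Load(ctx, a.id)
	if err != nil {
		return err
	}
	if state == nil {
		state = make(map[string]string)
	}
	a.state = state
	return nil
}

// HandleSet (assumed message handler) updates the state and persists it
// immediately, so a crash never loses more than the in-flight message.
func (a *DurableActor) HandleSet(ctx context.Context, key, value string) error {
	a.state[key] = value
	return a.store.Save(ctx, a.id, a.state)
}
```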

@Tochemey (Owner) commented Apr 18, 2024

I saw the RemoteSpawn method; all it needs is the actor type and the actor name. I also saw that the Register method is not automatically called when starting the actor system. How about adding the registered types to Olric? Since nodes are aware of the node-dead event, all the nodes that can spawn the dead node's actors can decide which one should spawn them. Usually all actor types are available on all nodes; a heterogeneous cluster is not common, I think.

@qazwsxedckll I think I have understood what you meant. That is why I was saying I need to implement some binary encoding mechanism for the Actor implementation, because Olric needs binary encoding for custom type serialization. I hope this makes sense to you.
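
As a hedged illustration of that idea, the sketch below gives a small actor descriptor (registered type name plus actor name) a binary encoding via encoding.BinaryMarshaler/BinaryUnmarshaler, so a cache that requires binary encoding for custom types can store it; the descriptor layout is an assumption, not Go-Akt's actual wire format.

```go
// Sketch only: a minimal binary-encodable actor descriptor, not Go-Akt's
// actual wire format. It satisfies encoding.BinaryMarshaler and
// encoding.BinaryUnmarshaler, which caches commonly accept for custom types.
package sketch

import (
	"bytes"
	"encoding/gob"
)

// ActorDescriptor carries just enough to respawn an actor elsewhere:
// the registered type name and the actor's unique name.
type ActorDescriptor struct {
	TypeName  string
	ActorName string
}

// MarshalBinary encodes the descriptor with gob; any stable binary
// encoding (protobuf, msgpack, ...) would do.
func (d ActorDescriptor) MarshalBinary() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(d); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// UnmarshalBinary decodes a descriptor produced by MarshalBinary.
func (d *ActorDescriptor) UnmarshalBinary(data []byte) error {
	return gob.NewDecoder(bytes.NewReader(data)).Decode(d)
}
```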

@qazwsxedckll (Collaborator, Author) commented Apr 19, 2024

I have not checked Olric yet. Is using a plain string for the actor type not enough?

@Tochemey self-assigned this on Apr 20, 2024
@Tochemey (Owner)

Design considerations

  • start the actor system with all the types of actors the system will be handling
  • each node keeps the list of its peers
  • persist in the cache a map of node id to the list of actor instances created (actor ids)
  • each node should listen to cluster events
  • when a node leaves the cluster, the rest of the nodes should be aware
  • there should be a consensus on which node(s) should receive the dead node's actors
  • the consensus should be based on the load of the other nodes (a rough sketch follows below)

Limitation

  • Olric does not have a mechanism to know the owners of entries, which would help with cluster management.
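
A rough sketch of the bookkeeping described above. The Cache interface stands in for the cluster cache (Olric or otherwise), and the consensus is simplified to "pick the peer hosting the fewest actors"; every name here is illustrative.

```go
// Sketch of the design notes above; the Cache interface stands in for the
// cluster cache that holds the nodeID -> actor IDs map, and the consensus
// is reduced to a least-loaded choice. All names are illustrative.
package sketch

import "context"

// Cache is a minimal stand-in for the distributed cache.
type Cache interface {
	GetActors(ctx context.Context, nodeID string) ([]string, error)
	PutActors(ctx context.Context, nodeID string, actorIDs []string) error
}

// pickLeastLoaded returns the peer currently hosting the fewest actors;
// it would be the target for the dead node's actors.
func pickLeastLoaded(ctx context.Context, cache Cache, peers []string) (string, error) {
	target, best := "", -1
	for _, peer := range peers {
		actors, err := cache.GetActors(ctx, peer)
		if err != nil {
			return "", err
		}
		if best == -1 || len(actors) < best {
			best, target = len(actors), peer
		}
	}
	return target, nil
}

// redeploy reassigns the dead node's actor IDs to the chosen target node's
// entry; actually respawning them is left to the actor system.
func redeploy(ctx context.Context, cache Cache, deadNode string, peers []string) error {
	orphans, err := cache.GetActors(ctx, deadNode)
	if err != nil {
		return err
	}
	target, err := pickLeastLoaded(ctx, cache, peers)
	if err != nil {
		return err
	}
	existing, err := cache.GetActors(ctx, target)
	if err != nil {
		return err
	}
	return cache.PutActors(ctx, target, append(existing, orphans...))
}
```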

@Tochemey (Owner)

@qazwsxedckll I hope you are doing well. In case you have any suggestions, please bring them on. I hope I will not have to change this design; however, I may change the design of the cluster engine.

@waeljammal

Great library; it covers many of the things that seem to be missing from protoactor. This specific issue is a bit of a blocker, though. Any idea when it might be resolved?

@Tochemey (Owner)

@waeljammal there is a draft PR, #319. I am still thinking about the best way to implement this, given the underlying cluster engine library I am using.

@waeljammal commented May 22, 2024

@Tochemey have you thought about using Olric locks to have a single instance act as the leader? The leader can catch member-left events and then decide whether it can redistribute the actors, provided another instance that can spawn the same actors is still available. If there is only one instance of a service that spawns a specific type of actor, you simply cannot bring those actors back to life until either the node that went down comes back or other instances of that service that can spawn the actor are alive.

For this you need to track which node can spawn which actors, and have one node act as the rebalancing orchestrator. If Olric locks work fine, you can use them to make sure only one node at any given time is the orchestrator.

I would even say that each group of microservices should have its own leader, since each microservice hosts specific actors. You would need to give developers the ability to set a cluster name and a service id, so you can form the overall cluster out of different types of services and track which service can spawn which actor types.

You would also need to cater for software updates, e.g. a rolling update in Kubernetes where each pod is replaced one at a time. You might want to calculate a quorum based on the previously known instance count so you do not rebalance prematurely. For example, if you had 3 instances and 2 suddenly went down for a software upgrade, you do not want to rebalance until at least 2 instances are alive again; otherwise, if the leader goes down, the first instance that comes back to life would end up with all of the actors restored on it, instead of having them distributed across a few instances. You can use an n-1 formula to decide when it is time to rebalance: with 3 instances, if 1 goes down you can safely rebalance straight away across the remaining 2; if 2 go down, wait a little while, and if the quorum does not recover you can restore to the remaining instance, but if the others come back and you reach the n-1 quorum again, you can rebalance across multiple instances. The trick is not to rely only on a member-left event, but to have another event that fires after a configurable period and flags the member as dead; since members go down because of crashes or new releases, you need to allow a little time for the instances to recover. A rough sketch of this quorum check follows below.

You can also look at how NATS implements only the leader-election part of Raft in their graft library; you could do the same thing using your pub/sub to get more reliable leader election than locks.
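
A back-of-the-envelope sketch of the grace period and n-1 quorum ideas above; the thresholds, names, and types are assumptions for illustration, not an existing API.

```go
// Sketch of the grace period and n-1 quorum ideas; thresholds and names
// are illustrative, not an existing API.
package sketch

import "time"

// Member tracks when a peer was last seen alive.
type Member struct {
	ID       string
	LastSeen time.Time
}

// IsDead flags a member as dead only after a configurable grace period,
// so a crash-and-restart or a rolling update does not trigger an
// immediate rebalance.
func IsDead(m Member, now time.Time, grace time.Duration) bool {
	return now.Sub(m.LastSeen) > grace
}

// ShouldRebalance applies the n-1 rule: with n previously known instances,
// rebalance once at least n-1 of them are alive again; below that, hold
// off and let the grace period decide whether to fall back to whatever
// instances remain.
func ShouldRebalance(previousCount, aliveCount int) bool {
	if previousCount <= 1 {
		return false // nothing to redistribute onto
	}
	return aliveCount >= previousCount-1
}
```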

@Tochemey (Owner)

@waeljammal thanks for the suggestions. I hope to have some bandwidth to tackle this in the short term.😉

@Tochemey (Owner) commented May 29, 2024

@waeljammal and @qazwsxedckll I believe I have found a simpler way to get this out. Hopefully I will push it over the weekend. I am currently working on a solution that I need to test.
