
[feat] redeployment long-lived actors when node left cluster #290

Open · qazwsxedckll opened this issue on Apr 16, 2024 · 20 comments
Assignees: Tochemey
Labels: duplicate (This issue or pull request already exists), feature (New feature or request)

@qazwsxedckll (Collaborator)

No description provided.

@qazwsxedckll changed the title from "[feat-want] redeployment long-lived actors when node left cluster" to "[feat] redeployment long-lived actors when node left cluster" on Apr 16, 2024
@Tochemey (Owner)

@qazwsxedckll, thank you so much for this issue. There is already an issue I created some time back to tackle that; you can find it here. At the moment I need to rethink it properly. However, I am open to any PR in that regard.

@Tochemey added the duplicate (This issue or pull request already exists) and feature (New feature or request) labels on Apr 16, 2024
@qazwsxedckll (Collaborator, Author)

First of all, I am so impressed by your hard work; the documentation is very clear and the code is very clean.
To my understanding, the problem now is that the framework doesn't support spawning actors remotely.
I am contributing to another project, protoactor-go, in which each node announces what kinds of actors it can spawn via service discovery.

P.S. I haven't checked the code entirely yet. I want to know which framework this is most similar to: Akka, Orleans, or Erlang?

@Tochemey (Owner) commented Apr 16, 2024

@qazwsxedckll one can spawn an actor remotely, with some caveats. :)
Kindly check the release notes: https://github.com/Tochemey/goakt/releases/tag/v1.6.0
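
For illustration, here is a minimal sketch of the idea behind registering actor types and spawning by name on a remote node. The type, function, and field names below are assumptions made for this sketch and are not Go-Akt's actual API; the release notes linked above describe the real mechanism and its caveats.

```go
// Sketch only: not Go-Akt's real API. The idea is that every node registers
// the concrete actor types it can construct under a string name, so a peer
// only has to send the (type name, actor name) pair to request a spawn.
package sketch

import "log"

// Actor stands in for the framework's actor interface.
type Actor interface {
	Receive(msg any)
}

// Registry maps registered type names to factory functions.
type Registry map[string]func() Actor

// RemoteSpawnRequest is what one node would send to a peer to ask it to
// create an actor locally.
type RemoteSpawnRequest struct {
	ActorName string // unique actor name, e.g. "client-42"
	ActorType string // registered type name, e.g. "MqttClientActor"
}

// HandleRemoteSpawn is what the receiving node would run: look up the
// factory by type name and instantiate the actor locally.
func HandleRemoteSpawn(reg Registry, req RemoteSpawnRequest) Actor {
	factory, ok := reg[req.ActorType]
	if !ok {
		log.Printf("unknown actor type %q, cannot spawn %q", req.ActorType, req.ActorName)
		return nil
	}
	log.Printf("spawning %q of type %q", req.ActorName, req.ActorType)
	return factory()
}
```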

@Tochemey (Owner)

First of all, I am so impressed by your hard work; the documentation is very clear and the code is very clean. To my understanding, the problem now is that the framework doesn't support spawning actors remotely. I am contributing to another project, protoactor-go, in which each node announces what kinds of actors it can spawn via service discovery.

P.S. I haven't checked the code entirely yet. I want to know which framework this is most similar to: Akka, Orleans, or Erlang?

I have used all these actor frameworks in the past, to be honest, and was contributing to protoactor-go for some time before starting this. So I picked ideas from all of them. :)

@Tochemey (Owner) commented Apr 16, 2024

@qazwsxedckll the challenge Go-Akt is facing is being able to serialize the Actor implementation so that we can pass it over the wire. There are various approaches to that, which I stated in the closed ticket here.

@qazwsxedckll (Collaborator, Author) commented Apr 17, 2024

Sorry, but I don't get it. Why is serialization related to redeployment? Who should send the serialized data, and who should receive it? I saw the RemoteSpawn method; all it needs is the actor type and the actor name. I also saw that the Register method is not automatically called when starting the actor system. How about adding the registered types to Olric? Since nodes are aware of the node-dead event, all the nodes that can spawn the dead node's actors can decide which one should spawn them. Usually all actor types are available on all nodes; a heterogeneous cluster is not common, I think.

Here is my use case: a user can create an MQTT client on the web and subscribe to some topics. I want to create an actor for each client, and when a node goes down I need to recreate its actors and subscribe to the topics again. So recreating the actor whenever it is accessed is not going to work for me.
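
A rough sketch of what such an actor could look like, assuming hypothetical PreStart/PostStop lifecycle hooks and an abstract MQTT client rather than any specific library: because the topics are part of the actor's definition, redeploying and starting it on another node is enough to restore the subscriptions.

```go
// Sketch only: hook names and the MQTT client are placeholders, not the
// actual Go-Akt actor interface or a specific MQTT library.
package sketch

import "context"

// mqttClient abstracts whatever MQTT library the real code would use.
type mqttClient interface {
	Connect(ctx context.Context) error
	Subscribe(topic string) error
	Disconnect() error
}

// MqttClientActor owns one MQTT connection. Its topics are part of its
// definition, so a fresh start on any node re-subscribes automatically.
type MqttClientActor struct {
	client mqttClient
	topics []string
}

// PreStart is the assumed lifecycle hook run when the actor starts, whether
// on the original node or after redeployment on another node.
func (a *MqttClientActor) PreStart(ctx context.Context) error {
	if err := a.client.Connect(ctx); err != nil {
		return err
	}
	for _, topic := range a.topics {
		if err := a.client.Subscribe(topic); err != nil {
			return err
		}
	}
	return nil
}

// PostStop is the assumed hook run when the actor shuts down cleanly.
func (a *MqttClientActor) PostStop(ctx context.Context) error {
	return a.client.Disconnect()
}
```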

@Tochemey (Owner) commented Apr 17, 2024

@qazwsxedckll I will take a deep look into this use case. However, a PR is also welcome in case you think of something you would like to share. That said, I really don't understand this: "So recreating the actor whenever it is accessed is not going to work for me."

Redeployment involves a little more than creating actors remotely, because we need to at least cater for the cluster topology and answer the following question: which node is going to receive the actors of the node that went down?

@qazwsxedckll (Collaborator, Author) commented Apr 18, 2024

When a node leaves the cluster, its actors are re-created whenever they are accessed on other nodes in the cluster. This is feasible because the same actor system runs across the cluster.

You mentioned this in the previous issue. I just want to clarify that in my use case the actor should start automatically, and no other actor actually calls it, since subscribing to topics is an active action. This is the downside of the so-called 'virtual actor', so I am looking for other solutions.

@Tochemey (Owner) commented Apr 18, 2024

@qazwsxedckll This was what I meant:

  • A node dies. In the redeployment case, all the actors on that node will be moved to another node and started. For stateful actors we will need some kind of distributed datastore to recover the previous state. Stateless actors should be fine.
  • Once they are called, they can respond (i.e. they are ready to respond to messages).

Apologies for the misunderstanding.

@Tochemey (Owner)

@qazwsxedckll I will start some draft work on the redeployment feature. However, I cannot promise when it will be ready, because I need to dig into Olric's capabilities a bit and see what can help implement it, or whether I have to add a layer on top of it to achieve it.

@qazwsxedckll (Collaborator, Author) commented Apr 18, 2024

  • A node dies. In the redeployment case, all the actors on that node will be moved to another node and started. For stateful actors we will need some kind of distributed datastore to recover the previous state. Stateless actors should be fine.

Just for your reference, when using actors I always load data from the database on initialization and keep saving data back to the database while processing messages. If I store any state in memory, I assume it can be lost at any time.

If what you mean is something like the Dapr actor state store, I don't find it very useful. For business data, a key-value store is far from enough. For actor state, I have not come up with a good use case yet.

Anyway, I think it is nice to have, but not necessary.

moved to another node and started

Yes, exactly.
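
A small sketch of the pattern described above (load from the database on initialization, persist while processing messages); the Store interface and the hook names are placeholders for illustration, not a specific framework's API.

```go
// Sketch of "load on init, save on every message"; all names are illustrative.
package sketch

import "context"

// Store is whatever durable storage the real system uses.
type Store interface {
	Load(ctx context.Context, actorID string) (map[string]string, error)
	Save(ctx context.Context, actorID string, state map[string]string) error
}

// DurableActor keeps its working state in memory but treats the database
// as the source of truth, so losing the in-memory copy is harmless.
type DurableActor struct {
	id    string
	store Store
	state map[string]string
}

// PreStart (assumed hook) restores the actor's state from the database.
func (a *DurableActor) PreStart(ctx context.Context) error {
	state, err := a.store.Load(ctx, a.id)
	if err != nil {
		return err
	}
	if state == nil {
		state = make(map[string]string)
	}
	a.state = state
	return nil
}

// HandleSet (assumed message handler) updates the state and persists it
// immediately, so a crash never loses more than the in-flight message.
func (a *DurableActor) HandleSet(ctx context.Context, key, value string) error {
	a.state[key] = value
	return a.store.Save(ctx, a.id, a.state)
}
```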

@Tochemey (Owner) commented Apr 18, 2024

I saw the RemoteSpawn method; all it needs is the actor type and the actor name. I also saw that the Register method is not automatically called when starting the actor system. How about adding the registered types to Olric? Since nodes are aware of the node-dead event, all the nodes that can spawn the dead node's actors can decide which one should spawn them. Usually all actor types are available on all nodes; a heterogeneous cluster is not common, I think.

@qazwsxedckll I think I have understood what you meant. That is why I was saying I need to implement some binary encoding mechanism for the Actor implementation, because Olric needs binary encoding for custom type serialization. I hope this makes sense to you.
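
As a hedged illustration of that idea, the sketch below gives a small actor descriptor (registered type name plus actor name) a binary encoding via encoding.BinaryMarshaler/BinaryUnmarshaler, so a cache that requires binary encoding for custom types can store it; the descriptor layout is an assumption, not Go-Akt's actual wire format.

```go
// Sketch only: a minimal binary-encodable actor descriptor, not Go-Akt's
// actual wire format. It satisfies encoding.BinaryMarshaler and
// encoding.BinaryUnmarshaler, which caches commonly accept for custom types.
package sketch

import (
	"bytes"
	"encoding/gob"
)

// ActorDescriptor carries just enough to respawn an actor elsewhere:
// the registered type name and the actor's unique name.
type ActorDescriptor struct {
	TypeName  string
	ActorName string
}

// MarshalBinary encodes the descriptor with gob; any stable binary
// encoding (protobuf, msgpack, ...) would do.
func (d ActorDescriptor) MarshalBinary() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(d); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// UnmarshalBinary decodes a descriptor produced by MarshalBinary.
func (d *ActorDescriptor) UnmarshalBinary(data []byte) error {
	return gob.NewDecoder(bytes.NewReader(data)).Decode(d)
}
```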

@qazwsxedckll (Collaborator, Author) commented Apr 19, 2024

I have not checked Olric yet. Is using a plain string for the actor type not enough?

@Tochemey self-assigned this on Apr 20, 2024
@Tochemey (Owner)

Design considerations

  • start the actor system with all the types of actors the system will be handling
  • each node keeps the list of its peers
  • persist in the cache a map of node id to the list of actor instances created (actor ids)
  • each node should listen to cluster events
  • when a node leaves the cluster, the rest of the nodes should be aware
  • there should be a consensus on which node(s) should receive the dead node's actors
  • the consensus should be based on the load of the other nodes (a rough sketch follows below)

Limitation

  • Olric does not have a mechanism to know the owners of entries, which would help with cluster management.
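
A rough sketch of the bookkeeping described above. The Cache interface stands in for the cluster cache (Olric or otherwise), and the consensus is simplified to "pick the peer hosting the fewest actors"; every name here is illustrative.

```go
// Sketch of the design notes above; the Cache interface stands in for the
// cluster cache that holds the nodeID -> actor IDs map, and the consensus
// is reduced to a least-loaded choice. All names are illustrative.
package sketch

import "context"

// Cache is a minimal stand-in for the distributed cache.
type Cache interface {
	GetActors(ctx context.Context, nodeID string) ([]string, error)
	PutActors(ctx context.Context, nodeID string, actorIDs []string) error
}

// pickLeastLoaded returns the peer currently hosting the fewest actors;
// it would be the target for the dead node's actors.
func pickLeastLoaded(ctx context.Context, cache Cache, peers []string) (string, error) {
	target, best := "", -1
	for _, peer := range peers {
		actors, err := cache.GetActors(ctx, peer)
		if err != nil {
			return "", err
		}
		if best == -1 || len(actors) < best {
			best, target = len(actors), peer
		}
	}
	return target, nil
}

// redeploy reassigns the dead node's actor IDs to the chosen target node's
// entry; actually respawning them is left to the actor system.
func redeploy(ctx context.Context, cache Cache, deadNode string, peers []string) error {
	orphans, err := cache.GetActors(ctx, deadNode)
	if err != nil {
		return err
	}
	target, err := pickLeastLoaded(ctx, cache, peers)
	if err != nil {
		return err
	}
	existing, err := cache.GetActors(ctx, target)
	if err != nil {
		return err
	}
	return cache.PutActors(ctx, target, append(existing, orphans...))
}
```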

@Tochemey (Owner)

@qazwsxedckll I hope you are doing well. In case you have any suggestions, please bring them on. I hope I will not have to change this design; however, I may change the design of the cluster engine.

@waeljammal

Great library; it covers many of the things that seem to be missing from protoactor. This specific issue is a bit of a blocker, though. Any idea when it might be resolved?

@Tochemey (Owner)

@waeljammal there is a draft PR, #319. I am still thinking about the best way to implement this, given the underlying cluster engine library I am using.

@waeljammal commented May 22, 2024

@Tochemey have you thought about using Olric locks to have a single instance act as the leader? The leader can catch member-left events and then decide whether it can redistribute the actors, provided another instance that can spawn the same actors is still available. If there is only one instance of a service that spawns a specific type of actor, you simply cannot bring those actors back to life until either the node that went down comes back or other instances of that service that can spawn the actor are alive.

For this you need to track which node can spawn which actors, and have one node act as the rebalancing orchestrator. If Olric locks work fine, you can use them to make sure only one node at any given time is the orchestrator.

I would even say that each group of microservices should have its own leader, since each microservice hosts specific actors. You would need to give developers the ability to set a cluster name and a service id, so you can form the overall cluster out of different types of services and track which service can spawn which actor types.

You would also need to cater for software updates, e.g. a rolling update in Kubernetes where each pod is replaced one at a time. You might want to calculate a quorum based on the previously known instance count so you do not rebalance prematurely. For example, if you had 3 instances and 2 suddenly went down for a software upgrade, you do not want to rebalance until at least 2 instances are alive again; otherwise, if the leader goes down, the first instance that comes back to life would end up with all of the actors restored on it, instead of having them distributed across a few instances. You can use an n-1 formula to decide when it is time to rebalance: with 3 instances, if 1 goes down you can safely rebalance straight away across the remaining 2; if 2 go down, wait a little while, and if the quorum does not recover you can restore to the remaining instance, but if the others come back and you reach the n-1 quorum again, you can rebalance across multiple instances. The trick is not to rely only on a member-left event, but to have another event that fires after a configurable period and flags the member as dead; since members go down because of crashes or new releases, you need to allow a little time for the instances to recover. A rough sketch of this quorum check follows below.

You can also look at how NATS implements only the leader-election part of Raft in their graft library; you could do the same thing using your pub/sub to get more reliable leader election than locks.
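
A back-of-the-envelope sketch of the grace period and n-1 quorum ideas above; the thresholds, names, and types are assumptions for illustration, not an existing API.

```go
// Sketch of the grace period and n-1 quorum ideas; thresholds and names
// are illustrative, not an existing API.
package sketch

import "time"

// Member tracks when a peer was last seen alive.
type Member struct {
	ID       string
	LastSeen time.Time
}

// IsDead flags a member as dead only after a configurable grace period,
// so a crash-and-restart or a rolling update does not trigger an
// immediate rebalance.
func IsDead(m Member, now time.Time, grace time.Duration) bool {
	return now.Sub(m.LastSeen) > grace
}

// ShouldRebalance applies the n-1 rule: with n previously known instances,
// rebalance once at least n-1 of them are alive again; below that, hold
// off and let the grace period decide whether to fall back to whatever
// instances remain.
func ShouldRebalance(previousCount, aliveCount int) bool {
	if previousCount <= 1 {
		return false // nothing to redistribute onto
	}
	return aliveCount >= previousCount-1
}
```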

@Tochemey (Owner)

@waeljammal thanks for the suggestions. I hope to have some bandwidth to tackle this in the short term.😉

@Tochemey (Owner) commented May 29, 2024

@waeljammal and @qazwsxedckll I believe I have found a simpler way to get this out. Hopefully I will push it over the weekend. I am currently working on a solution that I need to test.
