Developing for Production (Daemon or Instance?) #681
-
Hi everyone! Yes, I'm posting two questions in one day, I hope that's not a problem! :) I'll start with my use case and hopefully you can help me find a solution...

I'm working with time series data coming in from multiple sources, each of which formats its data differently. My current plan is to have a daemon or cron job scouring the data streams and storing them in a time series database in a normalized format, which my River learner can step through at its own pace.

The question is: should the learner algorithm also be a daemon, constantly waiting for new data to come in from the stream, or should it be a cron job which checks for new rows in the database since its last run? In both cases I think I'll need to save the model periodically — in the daemon's case to recover from catastrophic failure or manual restarts, and in the cron job's case to pick up where it left off without losing any learning. Either way, I think the important thing is a persistent and recoverable state for the learner. Do you have any ideas for how this could work? Thanks!
-
This is a very interesting question. Both approaches have pros and cons; I guess it depends on what works better for your application. Stream learning models are designed to process single instances of data — in other words, for online environments. However, they can be used in offline settings if that works for your application.

The traditional approach is to let the model learn on the go as new data appears on the stream. This is particularly relevant if your stream has a constant flow of data at a high rate, and it exploits the adaptive and incremental nature of the model.

It is also possible to read the database at intervals, but then your system depends on when learning is triggered, either at a fixed time or by some defined rules. This could impact the reaction time to changes in your data. In this scenario you could even exploit batch learning techniques; notice that this setup is the most common in production environments based on batch learning. Once again, this will depend on what works for your application.

Similarly, model persistence should be built around your application's requirements.
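For the cron-style option, the pattern described above — restore a saved model, learn on the rows added since the last run, then checkpoint again — might look like the sketch below. It uses only the standard library, with a toy `RunningMean` class standing in for a picklable incremental learner (in practice this would be a River model driven by `learn_one`); the table name, column names, and checkpoint path are all illustrative assumptions.

```python
# Sketch of the cron-style pattern: resume from a checkpoint, learn on rows
# added since the last run, then checkpoint again. RunningMean is a stand-in
# for a River model; any picklable incremental learner fits the same pattern.
import os
import pickle
import sqlite3
import tempfile

class RunningMean:
    """Toy incremental learner standing in for a River model."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

def load_state(path):
    # Recoverable state: the model plus the id of the last row it saw.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"model": RunningMean(), "last_id": 0}

def run_once(db, path):
    state = load_state(path)
    rows = db.execute(
        "SELECT id, value FROM ts WHERE id > ? ORDER BY id",
        (state["last_id"],),
    ).fetchall()
    for row_id, value in rows:
        state["model"].learn_one(value)
        state["last_id"] = row_id
    with open(path, "wb") as f:
        pickle.dump(state, f)  # checkpoint so the next run resumes here
    return state

# Demo: two "cron runs" over a growing table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ts (id INTEGER PRIMARY KEY, value REAL)")
ckpt = os.path.join(tempfile.mkdtemp(), "model.pkl")

db.executemany("INSERT INTO ts (value) VALUES (?)", [(1.0,), (2.0,), (3.0,)])
state = run_once(db, ckpt)
print(state["model"].mean, state["last_id"])  # 2.0 3

db.executemany("INSERT INTO ts (value) VALUES (?)", [(4.0,), (5.0,)])
state = run_once(db, ckpt)
print(state["model"].mean, state["last_id"])  # 3.0 5
```

A daemon variant would wrap the same `learn_one` loop around a blocking stream consumer and dump the checkpoint on a timer or every N rows, rather than once per run — the recoverable state is identical in both setups.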