feature: System database support for AWS S3 #8435
Hey Matt, thanks for filing! Can you say more about how you're using the `runs` table? I'm trying to think about what this would look like. State backends are essentially key-value stores, so it's easy to use object storage for that, but the `runs` table is more transactional. See, for example, how the job/run model is used within a transaction to keep a heartbeat (`meltano/src/meltano/core/job/job.py`, lines 249 to 280 in bcbe3eb)
and to control concurrency (`meltano/src/meltano/core/block/extract_load.py`, lines 477 to 499 in bcbe3eb).
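To illustrate why this matters for object storage: the pattern below is a minimal sketch (not Meltano's actual code, and with a simplified, hypothetical schema) of a transactional heartbeat. The run row is updated atomically inside a transaction so that other processes can detect stale or dead runs, an operation that is trivial against a SQL database but has no direct equivalent on plain S3 objects.

```python
import datetime
import sqlite3

# Hypothetical, simplified stand-in for a runs table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, state TEXT, last_heartbeat_at TEXT)"
)
conn.execute("INSERT INTO runs (state, last_heartbeat_at) VALUES ('RUNNING', NULL)")
conn.commit()


def heartbeat(conn, run_id):
    # An atomic read-modify-write like this is what a transactional
    # system database provides; object storage offers no such primitive.
    with conn:  # opens and commits a transaction
        conn.execute(
            "UPDATE runs SET last_heartbeat_at = ? WHERE id = ?",
            (datetime.datetime.now(datetime.timezone.utc).isoformat(), run_id),
        )


heartbeat(conn, 1)
row = conn.execute("SELECT last_heartbeat_at FROM runs WHERE id = 1").fetchone()
print(row[0])
```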
I'm happy to discuss spec and implementation proposals, and even review PRs, but this is something that we probably won't prioritize ourselves. That said, one option that may be available today is to rely on the default SQLite system db and use something like Litestream to sync the database with S3.
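For concreteness, a Litestream setup along those lines could look roughly like the config below. This is a sketch only: the database path and bucket name are hypothetical, and you'd want to confirm the details against the Litestream docs.

```yaml
# litestream.yml — hypothetical paths and bucket; adjust for your project.
dbs:
  - path: /project/.meltano/meltano.db
    replicas:
      - url: s3://my-bucket/meltano-systemdb
```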
Another idea that just came to mind is to search for, or implement, a SQLAlchemy dialect that's SQLite + S3, so it could be used like the example in #7143 (comment). The individual components seem to be out there.
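As a sketch of the wiring involved: SQLAlchemy lets third-party dialects register a URL scheme, after which Meltano's `database_uri` setting could point at it. The module and class names below are hypothetical (no such dialect is known to exist); only the registration hook itself is real SQLAlchemy API.

```python
from sqlalchemy.dialects import registry

# Hypothetical dialect package: register a "sqlites3://" URL scheme.
# The module is only imported lazily, when an engine first uses the scheme.
registry.register("sqlites3", "sqlalchemy_sqlites3.dialect", "SQLiteS3Dialect")

# After registration, an engine URL like the following would resolve to
# the custom dialect (again, hypothetical):
#   create_engine("sqlites3://my-bucket/path/meltano.db")
```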
We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes. As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" — ideally we'd be able to, e.g., surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.

We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.

The part I'm thinking about specifically as a potential limitation is not having the `payload` field from the Meltano `runs` table available. It seems like it'd be useful to have that explicitly persisted — it seems to provide the value of the replication key for each stream in the run, at the start of the run.
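The kind of reporting query described above might look like the sketch below. The schema here is a simplified, hypothetical stand-in for Meltano's `runs` table (check your system database for the exact column names), and the payload structure shown is only an assumption about where the replication-key bookmark lives.

```python
import json
import sqlite3

# Simplified stand-in for the Meltano `runs` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE runs (
        job_name TEXT, state TEXT, started_at TEXT, ended_at TEXT, payload TEXT
    )"""
)
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
    (
        "tap-postgres-to-target-snowflake",
        "SUCCESS",
        "2024-01-01T00:00:00",
        "2024-01-01T00:05:00",
        # Assumed payload shape; inspect a real row to confirm.
        json.dumps(
            {"singer_state": {"bookmarks": {"public-users": {"replication_key_value": 42}}}}
        ),
    ),
)

# Surface recent runs and their replication-key bookmarks for a dashboard.
for job_name, state, payload in conn.execute(
    "SELECT job_name, state, payload FROM runs ORDER BY started_at DESC"
):
    bookmarks = json.loads(payload).get("singer_state", {}).get("bookmarks", {})
    print(job_name, state, bookmarks)
```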
The other thought I had is that we're moving from Postgres to Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continuing to leverage a persistent state backend without running a Postgres database).
Thanks for adding context! That makes sense.
Yeah, that's been asked in Slack before. Now, I recall that doesn't work because at least one of the migration scripts is not compatible with Snowflake's SQL, so changes would be required there (see #6529 and #6167). Do log an issue for it if Snowflake support for the system db would make this transition easier for you, and of course PRs would be welcome 😄.
After some further consideration, we've decided that it's actually not feasible to retire our Postgres database. We plan to continue to use it for the Meltano system database (and for some other metadata capture, the Argo Workflows archive, etc.). So, from my perspective, this issue could be closed.
Feature scope
Configuration (settings parsing, validation, etc.)
Description
It would be great to have Meltano able to write `runs` data to AWS S3. S3 is a supported "state backend" (so Meltano can write state there) but it's not a supported "system database" (so Meltano cannot write `runs` data there).

We currently run Meltano using a Postgres system database and have become accustomed to having the `runs` table data available. However, we'd like to retire this Postgres database and have been standing up a Meltano project using the S3 state backend. I am now realizing that this means we will no longer have `runs` data available to us.