feature: System database support for AWS S3 #8435

Closed
menzenski opened this issue Mar 6, 2024 · 6 comments

@menzenski
Contributor

Feature scope

Configuration (settings parsing, validation, etc.)

Description

It would be great to have Meltano be able to write runs data to AWS S3. S3 is a supported "state backend" (so Meltano can write state there), but it's not a supported "system database" (so Meltano cannot write runs data there).

We currently run Meltano using a Postgres system database and have become accustomed to having the runs table data available. However, we'd like to retire this Postgres database and have been standing up a Meltano project using the S3 state backend. I am now realizing that this means we will no longer have runs data available to us.

@edgarrmondragon
Collaborator

Hey Matt, thanks for filing!

Can you say more about how you're using the runs table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?


I'm trying to think about what this would look like. State backends are essentially key-value stores, so it's easy to use object storage for that, but the runs table is more transactional.
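
For illustration only (this is not Meltano's actual state-backend code), the state-backend contract reduces to getting/setting a JSON blob per state ID, which maps directly onto object storage. The boto3 calls below are real, but the bucket name and key layout are made up:

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-meltano-state"  # made-up bucket name


def set_state(state_id: str, state: dict) -> None:
    # One JSON object per state ID: a plain key-value write.
    s3.put_object(Bucket=BUCKET, Key=f"state/{state_id}", Body=json.dumps(state))


def get_state(state_id: str) -> dict:
    # Symmetric read; no transactions or row-level updates involved.
    obj = s3.get_object(Bucket=BUCKET, Key=f"state/{state_id}")
    return json.loads(obj["Body"].read())
```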

See, for example, how the job/run model is used within a transaction to keep a heartbeat:

```python
@asynccontextmanager
async def run(self, session):
    """Run wrapped code in context of a job.

    Transitions state to RUNNING and SUCCESS/FAIL as appropriate and
    records heartbeat every second.

    Args:
        session: the session to use for writing to the db

    Raises:
        BaseException: re-raises an exception occurring in the job running
            in this context
    """  # noqa: DAR301
    try:
        self.start()
        self.save(session)

        with self._handling_sigterm(session):
            async with self._heartbeating(session):
                yield

        self.success()
        self.save(session)
    except BaseException as err:  # noqa: WPS424
        if not self.is_running():
            raise

        self.fail(error=self._error_message(err))
        self.save(session)
        raise
```

and to control concurrency:

```python
async def run_with_job(self) -> None:
    """Run the ELT task within the context of a job.

    Raises:
        RunnerError: if failures are encountered during execution or if the
            underlying pipeline/job is already running.
    """
    job = self.context.job
    fail_stale_jobs(self.context.session, job.job_name)

    if not self.context.force and (
        existing := JobFinder(job.job_name).latest_running(
            self.context.session,
        )
    ):
        raise RunnerError(
            f"Another '{job.job_name}' pipeline is already running "
            f"which started at {existing.started_at}. To ignore this "
            "check use the '--force' option.",
        )

    with closing(self.context.session) as session:
        async with job.run(session):
            await self.execute()
```

I'm happy to discuss spec and implementation proposals, and even review PRs, but this is something that we probably won't prioritize ourselves.


That said, one option that may be available today is to rely on the default SQLite system db and use something like Litestream [1] to sync the database with S3.
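
A hedged sketch of what that could look like as a wrapper around a pipeline invocation. The `litestream restore`/`litestream replicate` subcommands are from the Litestream docs; the bucket, replica URL, and plugin names are illustrative:

```python
import subprocess

DB_PATH = ".meltano/meltano.db"            # Meltano's default SQLite system db
REPLICA_URL = "s3://my-bucket/meltano.db"  # illustrative replica location

# Restore the latest replica from S3, if one exists.
subprocess.run(
    ["litestream", "restore", "-if-replica-exists", "-o", DB_PATH, REPLICA_URL],
    check=True,
)

# Stream local changes back to S3 while the pipeline runs.
replicator = subprocess.Popen(["litestream", "replicate", DB_PATH, REPLICA_URL])
try:
    subprocess.run(["meltano", "run", "tap-foo", "target-bar"], check=True)
finally:
    replicator.terminate()
    replicator.wait()
```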

Footnotes

  1. Litestream - Streaming SQLite Replication: https://litestream.io

@edgarrmondragon
Collaborator

Another idea that just came to mind is to search for (or implement) a SQLAlchemy dialect that's SQLite + S3, so it could be used like the example in #7143 (comment).

The individual components seem to be out there.
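
To make the idea concrete, here's what usage could look like if such a dialect existed. The `sqlite+s3://` scheme below is invented for illustration and does not exist today:

```python
from sqlalchemy import create_engine

# Hypothetical URL scheme: the dialect would resolve the bucket/key, route
# reads and writes through an S3-backed VFS, and expose a normal DBAPI
# connection to SQLAlchemy.
engine = create_engine("sqlite+s3://my-bucket/meltano/meltano.db")

# Meltano's database_uri setting could then point at the same URL, since it
# already accepts any SQLAlchemy URL.
```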

@menzenski
Contributor Author

> Can you say more about how you're using the runs table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?

We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.

As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows": ideally, we'd be able to surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.

We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.

The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the Meltano runs table available. It seems like it'd be useful to have that explicitly persisted, since it appears to provide the value of the replication key for each stream in the run, at the start of the run.
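
As a hedged example of the kind of reporting query this enables, something like the following against the system database; table and column names follow Meltano's runs table, while the connection URL is a placeholder:

```python
import json

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://meltano:secret@localhost/meltano")  # placeholder

with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT job_name, state, started_at, payload "
        "FROM runs ORDER BY started_at DESC LIMIT 10",
    ))
    for job_name, state, started_at, payload in rows:
        # payload holds the state captured for the run (replication key
        # values), so it can be surfaced in a dashboard alongside status.
        print(job_name, state, started_at, json.dumps(payload)[:120])
```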

@menzenski
Contributor Author

The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).

@edgarrmondragon
Collaborator

edgarrmondragon commented Mar 8, 2024

> We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.
>
> As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows": ideally, we'd be able to surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.
>
> We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.
>
> The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the Meltano runs table available. It seems like it'd be useful to have that explicitly persisted, since it appears to provide the value of the replication key for each stream in the run, at the start of the run.

Thanks for adding context! That makes sense. The payload field can indeed give some insight into "state evolution" of a tap and its streams, which can be valuable.

> The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).

Yeah, that's been asked in Slack before. The database_uri is a SQLAlchemy URL, so in theory you could point it at a Snowflake instance by setting it to 'snowflake://<user_login_name>:<password>@<account_name>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>' [1].
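
For what it's worth, a quick way to sanity-check such a URL outside of Meltano (assumes the snowflake-sqlalchemy package is installed; every identifier below is a placeholder):

```python
from sqlalchemy import create_engine, text

engine = create_engine(
    "snowflake://my_user:my_password@my_account/meltano_db/meltano_schema"
    "?warehouse=my_wh&role=my_role"
)

with engine.connect() as conn:
    # If this round-trips, the URL and credentials are good; whether Meltano's
    # migrations then run cleanly is the separate issue discussed below.
    print(conn.execute(text("SELECT CURRENT_VERSION()")).scalar())
```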

Now, I recall that doesn't work because at least one of the migration scripts is not compatible with Snowflake's SQL, so changes would be required there (see #6529 and #6167). Do log an issue for it if Snowflake support for the systemdb would make this transition easier for you, and of course, PRs would be welcome 😄.

Footnotes

  1. Snowflake SQLAlchemy connection parameters: https://github.com/snowflakedb/snowflake-sqlalchemy/?tab=readme-ov-file#connection-parameters

@menzenski
Contributor Author

After some further consideration, we've decided that it's actually not feasible to retire our Postgres database. We plan to continue to use it for the Meltano system database (and for some other metadata capture, the Argo Workflows archive, etc.).

So, from my perspective this issue could be closed.

@edgarrmondragon closed this as not planned on May 8, 2024