-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/csv destination #92
base: main
Are you sure you want to change the base?
Conversation
…d add in a couple of error handling and logging messages
self.header = self.error_handler.assert_get_key(self.config, 'header', dtype=bool, required=False, default=True) | ||
self.separator = self.error_handler.assert_get_key(self.config, 'separator', dtype=str, required=False, default=",") | ||
self.limit = self.error_handler.assert_get_key(self.config, 'limit', dtype=int, required=False, default=None) | ||
self.extension = self.error_handler.assert_get_key(self.config, 'extension', dtype=str, required=False, default="csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field is technically required, since it has to be populated to initialize the CSV Destination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used a different approach here, adding another kind
property which is used to select the destination Class. I envision that kind
could have values like
file.jsonl
file.csv
file.tsv
file.parquet
file.xml
...
database.mysql
database.postgres
database.snowflake
...
This does perhaps make extension
superfluous, except that if someone insists on using .ndjson
for JSONL, or even really wants to put TSV data in a file with a .xml
extension, I guess that should be possible. Maybe eventually extension
becomes optional, with a default value of "infer."
Eventually we may add an (optional) location
indicating where to materialize files or execute database.*
SQL, with values like
local # (default)
s3://bucket_name/path/to/dir/
sftp://user:[email protected]/path/to/dir/
postgres://user:[email protected]:123/database_name?currentSchema=schema_name # (a SQLalchemy connection string)
snowflake://username:password@account_id/db_name/schema_name?warehouse=wh_name&role=role_name # (a SQLalchemy connection string)
...
(earthmover could parse the location
and figure out what connector/library to use internally.)
extension
also doesn't really make sense for database.*
destination kind
s, unless location
is a file system in which case it would probably be .sql
.
Eventually we may add an optional mode: overwrite # or append
. For file materialization, this is self-explanatory. For database.*
materialization, this could trigger a TRUNCATE
statement before INSERTS
begin.
One other relatively unrelated comment, if/when we support writing to databases, the order in which we process destinations will become important (if there are primary/foreign key references in the data). Currently there's no way in earthmover to control the order in which destinations are processed, we'd have to figure out how to handle that... maybe (like dbt does) an optional depends_on
property in each destination.
self.data = self.upstream_sources[self.source].data | ||
|
||
# Apply limit to dataframe if specified. | ||
if self.limit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if raising an error is the right choice here. If the user specifies more rows than exist in the dataframe, we should just return all rows.
mode: str = 'csv' # Documents which class was chosen. | ||
allowed_configs: Tuple[str] = ( | ||
'debug', 'expect', 'show_progress', 'repartition', 'source', | ||
'extension', 'header', 'separator', 'limit', 'keep_columns' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to reiterate my opinion that limit
and keep_columns
should not be part of destination configs. These are data transformations which should be done separately. We already have a keep_columns
transformation operation, adding a limit
operation would be simple (and, for performance reasons, should be done as far upstream as possible, not at the final destination).
Drafting a PR to leave review comments.
TODO Keen: Update this description with template.