-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding a new show command, debug operation, and NoOpDestination #89
base: main
Are you sure you want to change the base?
Conversation
@@ -50,7 +49,7 @@ def main(argv=None): | |||
parser.add_argument('command', | |||
nargs="?", | |||
type=str, | |||
help='the command to run: `run`, `compile`, `visualize`' | |||
help='the command to run: `deps`, `compile`, `run`, `show`, `visualize`' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we actually have a visualize
command, maybe that should be removed.
parser.add_argument("--transpose", | ||
action='store_true', | ||
help='transposes the output of `earthmover show`' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These three new CLI flags are only used by earthmover show
.
|
||
# Set empty defaults in case they've not been populated by the user. | ||
parser.set_defaults(**{ | ||
"selector": "*", | ||
"params": "", | ||
"results_file": "", | ||
"function": "head", | ||
"rows": 10, | ||
"transpose": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default values if the flags are omitted. I think these defaults make sense, but happy to discuss alternatives.
active_graph = self.filter_graph_on_selector(self.graph, selector=f"{selector}_destination") | ||
self.execute(active_graph) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explaining what happens here; suppose you select the transformation node my_node
(with earthmover show -s my_node
):
- we create a transformation node
my_node_show
withsource=my_node
which contains just one debug operation (per the CLI inputs) - we connect the new transformation node
my_node_show
to a new NoOpDestination nodemy_node_destination
so earthmover wouldn't prune off our dangling transformation node - we filter down the graph to just the selected
my_node
(and upstream and downstream nodes) - we execute this subgraph
The result is that the debug operation will cause information about my_node
to be output to the console.
elif (type(config)==dict or type(config)==YamlMapping) and 'kind' not in config.keys(): | ||
# default for backward compatibility | ||
return object.__new__(FileDestination) | ||
# else: throw an error? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm adding a new Destination class, I need a way to instantiate it... since type
is already a property of the Node superclass, and class
is a reserved Python keyword, I landed on kind
as the property one can use (in a destination's config
) to pick a specific destination.
In the future, we can extend this to other destination kind
s, such as file.jsonl
, file.csv
, file.tsv
, file.parquet
, perhaps even database.snowflake
or database.postgres
(with additional column typing config
).
The default destination kind
(when unspecified by the user) is the existing FileDestination, for backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've started using extension
as this keyword. Maybe we can set extension to "debug" in this new model, instead of adding a new keyword field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment here: I think overloading extension
is not a good solution. A file's extension
is not one-to-one with it's type.
self.data = ( | ||
self.upstream_sources[self.source].data | ||
.map_partitions(lambda x: x.apply(self.render_row, axis=1), meta=pd.Series('str')) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new destination type does nothing, as the name suggests.
@@ -36,7 +36,7 @@ def __init__(self, name: str, config: 'YamlMapping', *, earthmover: 'Earthmover' | |||
self.error_handler: 'ErrorHandler' = earthmover.error_handler | |||
|
|||
self.error_handler.ctx.update( | |||
file=self.config.__file__, line=self.config.__line__, node=self, operation=None | |||
file=self.config.get("__file__",""), line=self.config.get("__line__",0), node=self, operation=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since earthmover show
injects nodes into the graph which didn't come from a file, I had to modify the context update in a few places like so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To circumvent needing to change these lines, we should just initialize the No-Op config block as a YamlMapping
and set default values for the __file__
and __line__
attributes in the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what would those default values be? (when nodes are added programmatically, not from a YAML configuration file)
Can __file__
and __line__
be None
? That's the only default value I think would make sense, but I don't know how it would come through in the logging messages, or if it would cause errors.
Sort rows by one or more columns. | ||
```yaml | ||
- operation: debug | ||
function: head | tail | describe | columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the default function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't currently one. What do you think? Does head
make sense?
self.earthmover.logger.info(f"debug ({self.func}{rows_str}{transpose_str}) for {transformation_name}:") | ||
|
||
# call function and display debug info | ||
if self.func == 'head': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A warning that .head()
will raise a warning if the number of rows in the dataframe are less than those specified. We should emulate the head()
behavior in the current debug logic above the conditional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this and it doesn't seem to... not sure why, or maybe that's configurable.
I'm converting this PR to a draft pending further discussion. The |
This PR adds two new features:
earthmover show
command, likewise to be used for debuggingas well as a few under-the-hood changes such as a new NoOpDestination (earthmover prunes graph branches with no destination attached, so I had to create this in order to be able to programmatically add a debug operation for
earthmover show
without actually materializing any file to disk).I will go through and do a self-review, adding some comments about various parts of the changed code.