Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding a new show command, debug operation, and NoOpDestination #89

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

tomreitz
Copy link
Collaborator

@tomreitz tomreitz commented May 13, 2024

This PR adds two new features:

  1. debug operation, to be used for debugging an earthmover project
  2. earthmover show command, likewise to be used for debugging

as well as a few under-the-hood changes such as a new NoOpDestination (earthmover prunes graph branches with no destination attached, so I had to create this in order to be able to programmatically add a debug operation for earthmover show without actually materializing any file to disk).

I will go through and do a self-review, adding some comments about various parts of the changed code.

@tomreitz tomreitz marked this pull request as ready for review May 14, 2024 14:46
@@ -50,7 +49,7 @@ def main(argv=None):
parser.add_argument('command',
nargs="?",
type=str,
help='the command to run: `run`, `compile`, `visualize`'
help='the command to run: `deps`, `compile`, `run`, `show`, `visualize`'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we actually have a visualize command, maybe that should be removed.

parser.add_argument("--transpose",
action='store_true',
help='transposes the output of `earthmover show`'
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three new CLI flags are only used by earthmover show.


# Set empty defaults in case they've not been populated by the user.
parser.set_defaults(**{
"selector": "*",
"params": "",
"results_file": "",
"function": "head",
"rows": 10,
"transpose": False,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default values if the flags are omitted. I think these defaults make sense, but happy to discuss alternatives.

earthmover/earthmover.py Outdated Show resolved Hide resolved
active_graph = self.filter_graph_on_selector(self.graph, selector=f"{selector}_destination")
self.execute(active_graph)


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explaining what happens here; suppose you select the transformation node my_node (with earthmover show -s my_node):

  • we create a transformation node my_node_show with source=my_node which contains just one debug operation (per the CLI inputs)
  • we connect the new transformation node my_node_show to a new NoOpDestination node my_node_destination so earthmover wouldn't prune off our dangling transformation node
  • we filter down the graph to just the selected my_node (and upstream and downstream nodes)
  • we execute this subgraph

The result is that the debug operation will cause information about my_node to be output to the console.

elif (type(config)==dict or type(config)==YamlMapping) and 'kind' not in config.keys():
# default for backward compatibility
return object.__new__(FileDestination)
# else: throw an error?
Copy link
Collaborator Author

@tomreitz tomreitz May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm adding a new Destination class, I need a way to instantiate it... since type is already a property of the Node superclass, and class is a reserved Python keyword, I landed on kind as the property one can use (in a destination's config) to pick a specific destination.

In the future, we can extend this to other destination kinds, such as file.jsonl, file.csv, file.tsv, file.parquet, perhaps even database.snowflake or database.postgres (with additional column typing config).

The default destination kind (when unspecified by the user) is the existing FileDestination, for backward compatibility.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've started using extension as this keyword. Maybe we can set extension to "debug" in this new model, instead of adding a new keyword field.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment here: I think overloading extension is not a good solution. A file's extension is not one-to-one with it's type.

self.data = (
self.upstream_sources[self.source].data
.map_partitions(lambda x: x.apply(self.render_row, axis=1), meta=pd.Series('str'))
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new destination type does nothing, as the name suggests.

@@ -36,7 +36,7 @@ def __init__(self, name: str, config: 'YamlMapping', *, earthmover: 'Earthmover'
self.error_handler: 'ErrorHandler' = earthmover.error_handler

self.error_handler.ctx.update(
file=self.config.__file__, line=self.config.__line__, node=self, operation=None
file=self.config.get("__file__",""), line=self.config.get("__line__",0), node=self, operation=None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since earthmover show injects nodes into the graph which didn't come from a file, I had to modify the context update in a few places like so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To circumvent needing to change these lines, we should just initialize the No-Op config block as a YamlMapping and set default values for the __file__ and __line__ attributes in the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what would those default values be? (when nodes are added programmatically, not from a YAML configuration file)

Can __file__ and __line__ be None? That's the only default value I think would make sense, but I don't know how it would come through in the logging messages, or if it would cause errors.

@tomreitz tomreitz requested a review from jayckaiser May 14, 2024 15:37
Sort rows by one or more columns.
```yaml
- operation: debug
function: head | tail | describe | columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the default function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't currently one. What do you think? Does head make sense?

self.earthmover.logger.info(f"debug ({self.func}{rows_str}{transpose_str}) for {transformation_name}:")

# call function and display debug info
if self.func == 'head':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A warning that .head() will raise a warning if the number of rows in the dataframe are less than those specified. We should emulate the head() behavior in the current debug logic above the conditional.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this and it doesn't seem to... not sure why, or maybe that's configurable.

@tomreitz tomreitz marked this pull request as draft June 13, 2024 21:47
@tomreitz
Copy link
Collaborator Author

I'm converting this PR to a draft pending further discussion. The debug operation has been split out to a separate PR (#100).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants