Add methods to create data generation specs from files #310
base: master
@@ -1604,3 +1617,51 @@ def scriptMerge(self, tgtName=None, srcName=None, updateExpr=None, delExpr=None,
        result = HtmlUtils.formatCodeAsHtml(results)

        return result

    @staticmethod
    def fromDict(options):
Make sure to have explicit tests for this, covering the following use cases:
1 - with simple options
2 - with composite (object-valued) options
See the examples on the following page for object-valued options, such as DateRange and Distribution objects
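A minimal sketch of what those two test cases might look like. `GeneratorSketch` and `DateRangeSketch` are hypothetical stand-ins so the example runs without Spark; they are not dbldatagen classes. The `dataRange` option name and the `"kind"` key used to tag an object-valued option are also assumptions, not the format this PR defines.

```python
from dataclasses import dataclass


@dataclass
class DateRangeSketch:
    """Hypothetical stand-in for an object-valued option such as DateRange."""
    begin: str
    end: str
    interval: str


class GeneratorSketch:
    """Hypothetical stand-in for DataGenerator; it only stores its options."""

    def __init__(self, name=None, rows=1000, randomSeed=None, dataRange=None):
        self.name = name
        self.rows = rows
        self.randomSeed = randomSeed
        self.dataRange = dataRange

    @staticmethod
    def fromDict(options):
        options = dict(options)  # copy so the caller's dict is not mutated
        data_range = options.get("dataRange")
        # Assumption: composite options arrive as dicts tagged with "kind".
        if isinstance(data_range, dict) and data_range.get("kind") == "DateRange":
            fields = {k: v for k, v in data_range.items() if k != "kind"}
            options["dataRange"] = DateRangeSketch(**fields)
        return GeneratorSketch(**options)


def test_from_dict_simple_options():
    spec = {"name": "test", "rows": 100, "randomSeed": 42}
    gen = GeneratorSketch.fromDict(spec)
    assert gen.rows == 100
    assert gen.randomSeed == spec.get("randomSeed")


def test_from_dict_composite_options():
    spec = {
        "name": "test",
        "dataRange": {
            "kind": "DateRange",
            "begin": "2024-01-01 00:00:00",
            "end": "2024-12-31 00:00:00",
            "interval": "days=1",
        },
    }
    gen = GeneratorSketch.fromDict(spec)
    expected = DateRangeSketch("2024-01-01 00:00:00", "2024-12-31 00:00:00", "days=1")
    assert gen.dataRange == expected
    # The caller's dictionary still holds the plain dict form.
    assert isinstance(spec["dataRange"], dict)
```

The second test is the one that matters for object-valued options: it checks that the nested dictionary is turned back into an object rather than passed through as a plain dict.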
dbldatagen/data_generator.py
Outdated
        return DataGenerator(**options)

    @staticmethod
    def fromFile(path):
Don't add `fromFile` as a method, as `open`
does not support reading a file from a Databricks workspace or DBFS
dbldatagen/data_generator.py
Outdated
        raise ValueError("File type must be '.json' or '.yml'")

    @staticmethod
    def fromJson(path):
Rather than taking a path, pass a string containing the definition to the method.
The calling code should be responsible for loading the string;
it could come from DBFS, from a database, or from Unity Catalog
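A rough sketch of the split being suggested, assuming a JSON layout with `generator` and `columns` keys (the key names are a guess based on the variable names in the diff). `GeneratorSketch` and `load_definition_from_local_file` are hypothetical names used only for illustration; the point is that the parsing method sees text, while the caller decides where that text comes from.

```python
import json
import tempfile
from pathlib import Path


class GeneratorSketch:
    """Hypothetical stand-in for DataGenerator; it only records what it is given."""

    def __init__(self, **options):
        self.options = options
        self.columns = []

    def withColumns(self, columns):
        self.columns.extend(columns)
        return self

    @staticmethod
    def fromJson(definition):
        # Takes the JSON text itself, never a path.
        spec = json.loads(definition)
        return GeneratorSketch(**spec["generator"]).withColumns(spec.get("columns", []))


def load_definition_from_local_file(path):
    """One possible loader; a DBFS, database, or Unity Catalog reader could replace it."""
    return Path(path).read_text(encoding="utf-8")


definition = """
{
  "generator": {"name": "test", "rows": 100},
  "columns": [{"colName": "id", "colType": "long"}]
}
"""

# The caller owns the I/O; here it is a temporary local file.
with tempfile.TemporaryDirectory() as tmp:
    spec_path = Path(tmp) / "spec.json"
    spec_path.write_text(definition, encoding="utf-8")
    gen = GeneratorSketch.fromJson(load_definition_from_local_file(spec_path))
```

Swapping the storage backend then means swapping only the loader function; `fromJson` itself does not change.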
dbldatagen/data_generator.py
Outdated
        return DataGenerator.fromDict(generator).withColumns(columns)

    @staticmethod
    def fromYaml(path):
Rather than taking a path, pass a string containing the definition to the method.
The calling code should be responsible for loading the string;
it could come from DBFS, from a database, or from Unity Catalog
@@ -182,3 +182,48 @@ This has several implications:
SQL expression.
To enforce the dependency, you must use the `baseColumn` attribute to indicate the dependency.

Creating data generation specs from files
This should be about creating data specs from string-based YAML or JSON.
We should also have the capability to write to JSON and YAML
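To make the write side concrete, here is a minimal round-trip sketch using only the standard library. `GeneratorSketch`, its `toJson` method, and the `generator`/`columns` layout are hypothetical; nothing here is taken from the PR's implementation. A YAML variant would presumably follow the same shape with a YAML library in place of `json`.

```python
import json


class GeneratorSketch:
    """Hypothetical stand-in for DataGenerator with string-based read and write."""

    def __init__(self, name=None, rows=1000, randomSeed=None):
        self.name = name
        self.rows = rows
        self.randomSeed = randomSeed
        self.columns = []

    def withColumn(self, colName, colType="string", **kwargs):
        self.columns.append({"colName": colName, "colType": colType, **kwargs})
        return self

    def toJson(self):
        # Serialize constructor options and column definitions to a JSON string.
        spec = {
            "generator": {"name": self.name, "rows": self.rows, "randomSeed": self.randomSeed},
            "columns": self.columns,
        }
        return json.dumps(spec, indent=2)

    @staticmethod
    def fromJson(definition):
        # Rebuild the generator from the same layout that toJson writes.
        spec = json.loads(definition)
        gen = GeneratorSketch(**spec["generator"])
        for column in spec["columns"]:
            gen.withColumn(**column)
        return gen


original = GeneratorSketch(name="test", rows=100, randomSeed=42).withColumn("id", "long", minValue=1)
restored = GeneratorSketch.fromJson(original.toJson())
```

A round-trip check like `restored.toJson() == original.toJson()` is a cheap way to test that reading and writing stay in sync.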
@ronanstokes-db the code is done. I will update the docs.
tests/test_quick_tests.py
Outdated
        assert gen_from_dict.randomSeed == dg_spec.get("randomSeed")

    def test_generation_from_file(self):
        path = "tests/files/test_generator_spec.json"
If we use string-based APIs, they'll be more general. You can also simply define the definitions as multi-line strings rather than requiring separate data files
Proposed changes
Added several methods to support creating `DataGenerator` and `ColumnGenerationSpec` objects from Python dictionaries and JSON/YAML files.
Types of changes
What types of changes does your code introduce to dbldatagen?
Put an x in the boxes that apply
Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help!
This is simply a reminder of what we are going to look for before merging your code.
Further comments
I added several methods:
`withColumns` adds `ColumnGenerationSpec` objects via a list of dictionaries; it iteratively passes the dictionary values as arguments to `withColumn`
`fromDict` creates a `DataGenerator` from a dictionary by passing the values as arguments to the constructor
`fromJson` allows users to create a `DataGenerator` and add `ColumnGenerationSpecs` from a JSON file
`fromYaml` allows users to create a `DataGenerator` and add `ColumnGenerationSpecs` from a YAML file
`fromFile` wraps both `fromJson` and `fromYaml` into a single API
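For readers unfamiliar with the pattern, the dictionary-to-argument passing described for `withColumns` and `fromDict` can be sketched as below. `GeneratorSketch` is a hypothetical stand-in that only records its arguments, so the example runs without Spark; the parameter names in the sample dictionaries are illustrative and not a statement of what the real constructor accepts.

```python
class GeneratorSketch:
    """Hypothetical stand-in for DataGenerator; it only records its arguments."""

    def __init__(self, name=None, rows=1000, randomSeed=None):
        self.name = name
        self.rows = rows
        self.randomSeed = randomSeed
        self.columns = []

    def withColumn(self, colName, colType="string", **kwargs):
        self.columns.append((colName, colType, kwargs))
        return self

    def withColumns(self, columns):
        # Each dictionary is unpacked into keyword arguments for withColumn.
        for column in columns:
            self.withColumn(**column)
        return self

    @staticmethod
    def fromDict(options):
        # Dictionary values become constructor arguments.
        return GeneratorSketch(**options)


dg_spec = {"name": "test", "rows": 100, "randomSeed": 42}
column_specs = [
    {"colName": "id", "colType": "long", "minValue": 1},
    {"colName": "code", "values": ["a", "b", "c"]},
]
gen_from_dict = GeneratorSketch.fromDict(dg_spec).withColumns(column_specs)
```

In this stand-in, a key the constructor does not accept raises a `TypeError` at the `fromDict` call, which is one side effect of unpacking with `**`.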