This repository has been archived by the owner on Jul 3, 2023. It is now read-only.
Data quality next plans POC #149
Draft
elijahbenizzy wants to merge 24 commits into main from data-quality-next-plans-POC
Conversation
This is the first take at the initial data quality decorator. A few components:
1. The check_output decorator -- this enables us to run a few default validators.
2. The DataValidator base class -- this allows us to have extensible data validators.
3. The DefaultDataValidator base class -- this gives us a few default validators that map to args of check_output.
4. Some basic default data validators.

All is tested so far. Upcoming:
1. Round out the list of default data validators.
2. Add documentation for check_output.
3. Add end-to-end tests.
4. Configure log/warn levels.
5. Add documentation for extending validators.
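To make the pieces concrete, here is a minimal sketch of how the components described above could fit together. The names `check_output`, `DataValidator`, and the `range=` argument come from this PR's description, but the signatures and bodies here are assumptions, not the actual implementation.

```python
from abc import ABC, abstractmethod
from typing import Any


class DataValidator(ABC):
    """Base class for extensible data validators (hypothetical sketch)."""

    @abstractmethod
    def validate(self, data: Any) -> bool:
        """Return True if the data passes the check."""


class DataInRangeValidator(DataValidator):
    """A default validator that would map to a range= arg of check_output."""

    def __init__(self, value_range: tuple):
        self.low, self.high = value_range

    def validate(self, data: Any) -> bool:
        return self.low <= data <= self.high


def check_output(**default_args):
    """Toy decorator that runs default validators over a function's output."""
    validators = []
    if "range" in default_args:
        validators.append(DataInRangeValidator(default_args["range"]))

    def decorator(fn):
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            for v in validators:
                if not v.validate(result):
                    raise ValueError(f"{fn.__name__} failed validation: {result!r}")
            return result
        return wrapper
    return decorator


@check_output(range=(0.0, 1.0))
def unit_value() -> float:
    return 0.5
```

In the real library the decorator operates on DAG nodes rather than wrapping the function directly, but the shape of the mapping from keyword args to default validators is the same idea.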
We now have:
1. DataInRangeValidatorPandas
2. DataInRangeValidatorPrimitives
3. MaxFractionNansValidatorPandas
4. PandasSeriesDataTypeValidator
5. PandasMaxStandardDevValidator
6. PandasMeanInRangeValidator
The naming is suboptimal and will change soon. But we now have two avenues:
1. check_output (using the default validators)
2. check_output_custom (using specified, custom validators)
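A hypothetical sketch of the second avenue; `check_output_custom`'s signature here is an assumption based on the description above (validators passed explicitly rather than derived from keyword args).

```python
def check_output_custom(*validators):
    """Toy decorator taking explicit validator callables (hypothetical)."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            for validate in validators:
                if not validate(result):
                    raise ValueError(f"{fn.__name__} failed a custom validator")
            return result
        return wrapper
    return decorator


@check_output_custom(lambda x: x >= 0, lambda x: x == int(x))
def non_negative_int() -> float:
    return 3.0
```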
Note that this is not perfect -- the issues are:
1. The node names collide.
2. The DAG structure is weird -- ideally we'd be able to combine the DQ decorators into one.

The next commits should address (1) and (2).
This just delegates to the MaxFractionNansValidator (its superclass's) method. This also makes name a classmethod on BaseDefaultValidator.
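The delegation pattern described above could look like the following sketch. The class names mirror the PR, but the method bodies and the `"max_fraction_nans"` name string are assumptions for illustration.

```python
class BaseDefaultValidator:
    """Base for default validators; name() is a classmethod so it can be
    queried without instantiating the validator."""

    @classmethod
    def name(cls) -> str:
        raise NotImplementedError


class MaxFractionNansValidator(BaseDefaultValidator):
    @classmethod
    def name(cls) -> str:
        return "max_fraction_nans"  # hypothetical arg name


class MaxFractionNansValidatorPandas(MaxFractionNansValidator):
    """Delegates name() (and the shared validation logic) to its superclass;
    only the pandas-specific pieces would be overridden."""
```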
Currently these actions are hard-coded, but we might want to make them configurable soon. That'll come later.
For default validators, the name and the arg should be isomorphically related: the different classes are multiple implementations of the same mapping. This tests that property.
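The property under test can be sketched as follows. The two classes here are stand-ins for the real implementations, and the `"range"` name is an assumption; the point is that every implementation of the same default check must report the same name().

```python
class DataInRangeValidatorPandas:
    """Stand-in for the pandas implementation of the range check."""

    @classmethod
    def name(cls) -> str:
        return "range"


class DataInRangeValidatorPrimitives:
    """Stand-in for the primitives implementation of the same check."""

    @classmethod
    def name(cls) -> str:
        return "range"


implementations = [DataInRangeValidatorPandas, DataInRangeValidatorPrimitives]
# Multiple implementations, one check_output arg -> exactly one name:
assert len({impl.name() for impl in implementations}) == 1
```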
This allows one to query for DQ nodes and capture the results.
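One plausible mechanism, sketched below, is tagging DQ nodes so they can be filtered after a run; the tag key and node representation here are hypothetical, not the PR's actual implementation.

```python
# Hypothetical node records; in practice these would be DAG node objects.
nodes = [
    {"name": "foo", "tags": {}},
    {"name": "foo_range_validator", "tags": {"data_quality": "true"}},
]

# Query for DQ nodes by tag, then capture their results by name.
dq_node_names = [n["name"] for n in nodes if "data_quality" in n["tags"]]
```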
This will be useful later, but for now it confuses things.
This is a pretty simple approach, but I think it works nicely. We use the schema= decorator to specify data quality checks that validate a pandera schema. Note that this will only be registered if pandera is installed (it is an optional extra in setup.py). The other option we were thinking of is to compile the default checks to pandera, but I think that's a little overkill, and it can be done later. Furthermore, this proves out the abstraction for data validation quite nicely -- it was straightforward to implement.
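The conditional registration described above can be sketched like this. The function and check names are assumptions; the mechanism shown is just "try the optional import, and only expose the schema check if it succeeds."

```python
def available_default_checks() -> list:
    """Return the default checks, adding schema= only if pandera is present.

    Mirrors the optional setup.py extra described above; names are
    hypothetical, not the library's actual registry.
    """
    checks = ["range", "data_type", "max_fraction_nans"]
    try:
        import pandera  # noqa: F401 -- optional dependency
        checks.append("schema")
    except ImportError:
        pass  # pandera not installed: schema= is simply unavailable
    return checks
```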
We now have a test-integrations section in config.yml. I've decided to group the integration tests together to avoid a proliferation. Should we arrive at conflicting requirements, we can solve that later.
It was causing circular imports otherwise.
elijahbenizzy force-pushed the data-quality-next-plans-POC branch from 6ccc12a to a407ca1 on July 5, 2022 15:20.
elijahbenizzy force-pushed the data-quality-pandera branch 2 times, most recently from c61ad65 to f00b799 on July 5, 2022 17:09.
What this does:
1. Adds a new profiler argument to the data validation decorator.
2. Adds scaffolding for a whylogs class.
3. Messes with DAG validation to allow for a profiler.

Note that if we don't have a profiler, the DAG just runs validators on the original data. If we do, the validators run on the profiler's output, so they have to accept the type the profiler outputs rather than the original data type.

What this is missing:
1. Testing -- it's just a POC with pieces left out. Run at your own risk :)
2. Configuration-wiring -- hopefully that'll come soon.
3. Tag-exposure -- that will be useful for giving metadata to the decorator. The data is there, but it's not available yet.

Note there are other approaches -- I think this one is nice as it's flexible yet opinionated, and we can always go down to the lower API if we have something more complex.
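The profiler flow described above can be sketched in a few lines: with no profiler, validators see the raw data; with one, they see the profiler's output. All function names and the profile shape here are hypothetical.

```python
def run_checks(data, validators, profiler=None):
    """Run validators against either the raw data or the profiler's output."""
    subject = profiler(data) if profiler is not None else data
    return all(validate(subject) for validate in validators)


# A toy profiler that summarizes a list, and a validator that accepts the
# profile type (a dict), not the original list type:
profile = lambda xs: {"mean": sum(xs) / len(xs)}
mean_in_unit_interval = lambda p: 0.0 <= p["mean"] <= 1.0

result = run_checks([0.2, 0.4, 0.6], [mean_in_unit_interval], profiler=profile)
# result is True: the validator ran against the profile, not the raw list
```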
Again, not the final version, but this shows what we can do. Say one has a node called `foo`. We might want the following:
0. Disable all data validation globally.
1. Disable all data validation for foo.
2. Disable a few checks for foo, but not all.

This would translate to config:
0. "data_quality.disable = True"
1. "data_quality.foo.disable = True"
2. "data_quality.foo.disable = ['check_1', 'check_2']"
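The three levels above could resolve as in the sketch below. The key format mirrors the examples, but the resolution function and its precedence rules are assumptions for illustration.

```python
def enabled_checks(config: dict, node: str, checks: list) -> list:
    """Resolve which checks remain enabled for a node, hypothetically."""
    if config.get("data_quality.disable") is True:
        return []                                            # (0) global disable
    node_setting = config.get(f"data_quality.{node}.disable")
    if node_setting is True:
        return []                                            # (1) disable all for node
    if isinstance(node_setting, list):
        return [c for c in checks if c not in node_setting]  # (2) partial disable
    return checks


enabled = enabled_checks(
    {"data_quality.foo.disable": ["check_1"]}, "foo", ["check_1", "check_2"]
)
# enabled == ["check_2"]
```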
This is *very* rough. The idea is that we should be able to choose one of the following modes:
1. Apply a validator to every final node in the subdag of a decorated function.
2. Apply a validator to a specific node within the subdag of a decorated function.

This basically allows (1), which is the default, but also (2) if using the applies_to keyword. Note that this only works if the node is in the final subdag (i.e. a sink), and not in the middle. We should add that, but it'll be a bit of a change. Nothing we can't make backwards compatible; we just might need to crawl back a little further in our layered API -- e.g. use a subdag transformer rather than a node transformer. Either way, this shows that we can do what we want without too many modifications. Note that this is not tested, just a proof of concept.
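The two modes above boil down to a simple selection over the subdag's sink (final) nodes; this sketch uses plain strings for node names and a hypothetical helper, not the PR's actual node machinery.

```python
from typing import Optional


def nodes_to_validate(sink_nodes: list, applies_to: Optional[str] = None) -> list:
    """Select which sink nodes get the validator attached (hypothetical)."""
    if applies_to is None:
        return list(sink_nodes)                         # mode 1: every final node
    return [n for n in sink_nodes if n == applies_to]   # mode 2: one named node


# e.g. a decorated function whose subdag ends in three sink columns:
sinks = ["col_a", "col_b", "col_c"]
```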
elijahbenizzy force-pushed the data-quality-next-plans-POC branch from a407ca1 to 4e6f081 on July 6, 2022 15:33.
OK, so this is a pure proof of concept -- not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:
(1) is useful for integrations with complex stuff -- e.g. an expensive profiling step with lots of validations.
(2) is useful for disabling -- this will probably be the first we release.
(3) is useful for extract_columns -- it now makes it clear what it applies to.

While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions and de-risks the release of data quality enough to make me comfortable.
Look through commits for more explanations.