
Data quality next plans POC #149

Draft · wants to merge 24 commits into main from data-quality-pandera

Conversation

@elijahbenizzy (Collaborator) commented Jul 4, 2022

OK so this is a pure proof of concept. Not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:

  1. That we could build a two-step data quality pass (e.g. with a profiler and a validator). Without this, the whylogs integration would quickly be blocked.
  2. That we can use config to enable/disable checks at run/compile time.
  3. That we can add an applies_to keyword to narrow the focus of data quality checks.

(1) is useful for integrations with complex tooling -- e.g. an expensive profiling step with lots of validations.
(2) is useful for disabling checks -- this will probably be the first capability we release.
(3) is useful for extract_columns -- it makes clear which output a check applies to.
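A hedged sketch of what the kwargs-driven checks and applies_to might look like in use -- the import paths and kwarg names here are assumptions drawn from this POC, not a settled API:

```python
import pandas as pd
from hamilton.function_modifiers import check_output, extract_columns  # paths assumed

@extract_columns("spend", "signups")
@check_output(
    range=(0, 1_000_000),   # hypothetical default-validator kwarg
    applies_to=["spend"],   # (3): narrow the check to one extracted column
)
def raw_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)
```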

While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions and de-risks the data quality release enough to make me comfortable.

Look through commits for more explanations.

Changes

Testing

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7

elijahbenizzy and others added 21 commits July 4, 2022 14:09
This is the first take at the initial data quality decorator.

A few components:

1. check_outputs decorator -- this enables us to run a few default validators
2. the DataValidator base class -- this allows us to have extensible data validators
3. the DefaultDataValidator base class -- this allows us to have a few default validators that map to args of check_outputs
4. some basic default data validators

All is tested so far.

Upcoming is:

1. round out the list of default data validators
2. Add documentation for check_output
3. Add end-to-end tests
4. Configure log/warn levels
5. Add documentation for extending validators
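For concreteness, a minimal sketch of the extensibility surface components (2) and (3) imply -- class and method names here are illustrative, not necessarily what this POC uses:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass
class ValidationResult:
    passes: bool   # did the check pass?
    message: str   # human-readable diagnostics

class DataValidator(ABC):
    """Extensible base: subclass this to add a custom validator."""

    @abstractmethod
    def applies_to(self, datatype: type) -> bool:
        """Whether this validator can handle the given data type."""

    @abstractmethod
    def validate(self, data: Any) -> ValidationResult:
        """Run the check against the data."""

class BaseDefaultValidator(DataValidator, ABC):
    """Default validators additionally expose a name mapping to a check_outputs kwarg."""

    @classmethod
    @abstractmethod
    def name(cls) -> str:
        """The check_outputs argument this validator implements."""
```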
We now have:
1. DataInRangeValidatorPandas
2. DataInRangeValidatorPrimitives
3. MaxFractionNansValidatorPandas
4. PandasSeriesDataTypeValidator
5. PandasMaxStandardDevValidator
6. PandasMeanInRangeValidator
The naming is suboptimal, will change soon. But now we have two avenues:

1. check_output (using the default validators)
2. check_output_custom (using specified, custom validators)
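Illustrating the two avenues with a hedged sketch -- the kwarg names, import paths, and the validator's constructor signature are assumptions for the example:

```python
import pandas as pd
from hamilton.function_modifiers import check_output, check_output_custom  # paths assumed

# (1) default validators, driven by kwargs:
@check_output(range=(0, 100), max_fraction_nans=0.05)
def pct_complete(done: pd.Series, total: pd.Series) -> pd.Series:
    return 100 * done / total

# (2) custom validators, passed in explicitly:
@check_output_custom(DataInRangeValidatorPandas(0, 100))  # hypothetical signature
def bounded(raw: pd.Series) -> pd.Series:
    return raw.clip(0, 100)
```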
Note that this is not perfect -- the issues are:

1. The node names collide
2. The DAG structure is weird -- ideally we'd be able to combine the dq
   decorators into one

Next commits should address (1) and (2)
This just delegates to the method of its superclass,
MaxFractionNansValidator.

This also makes name a classmethod for BaseDefaultValidator
Currently these actions are hard-coded, but we might want to make them
configurable soon. That'll come later.
same name

For default validators, the name and the arg should be isomorphically
related. The different classes are multiple implementations of the same
mapping. This tests that property.
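A sketch of how that property test might look (pytest; the registry and kwarg set below are stand-ins for the real ones):

```python
import pytest

# Stand-ins: the real registry/kwarg set live in the decorator module.
CHECK_OUTPUT_ARGS = {"range", "max_fraction_nans", "data_type"}

class DataInRangeValidatorPandas:
    @classmethod
    def name(cls) -> str:
        return "range"

class DataInRangeValidatorPrimitives:
    @classmethod
    def name(cls) -> str:
        return "range"

DEFAULT_VALIDATORS = [DataInRangeValidatorPandas, DataInRangeValidatorPrimitives]

@pytest.mark.parametrize("validator_cls", DEFAULT_VALIDATORS)
def test_name_maps_to_check_output_arg(validator_cls):
    # Implementations of the same check share a name, and that name is a
    # valid check_output kwarg -- the name<->arg mapping described above.
    assert validator_cls.name() in CHECK_OUTPUT_ARGS
```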
This allows one to query for DQ nodes and capture the results.
This will be useful later but for now it confuses things...
This is a pretty simple approach but I think it works nicely.

We use the schema= argument to specify data quality checks that
validate against a pandera schema. Note that this will only be
registered if pandera is installed (an optional extra in setup.py).

The other option we were thinking of is to compile the default checks to
pandera, but I think that's a little overkill, and can be done later.

Furthermore, this proves out the abstraction for data validation quite
nicely -- this was straightforward to implement.
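A hedged sketch of usage -- the pandera calls below are real API, while the decorator spelling is inferred from this description:

```python
import pandas as pd
import pandera as pa
from hamilton.function_modifiers import check_output  # path assumed

schema = pa.DataFrameSchema({
    "spend": pa.Column(float, pa.Check.ge(0)),
    "signups": pa.Column(int, pa.Check.ge(0)),
})

@check_output(schema=schema)  # registered only when pandera is installed
def joined(spend: pd.Series, signups: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"spend": spend, "signups": signups})
```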
We now have a test-integrations section in config.yml. I've
decided to group the integrations together to avoid proliferation.
Should we arrive at conflicting requirements, we can solve that later.
It was causing circular imports otherwise.
@elijahbenizzy changed the title from "Data quality next plans poc" to "Data quality next plans POC" on Jul 4, 2022
@elijahbenizzy mentioned this pull request on Jul 4, 2022
@elijahbenizzy force-pushed the data-quality-pandera branch 2 times, most recently from c61ad65 to f00b799 on July 5, 2022 17:09
What this does...

1. Adds a new profiler argument to the data validation decorator
2. Adds scaffolding for a whylogs class
3. Messes with DAG validation to allow for a profiler. If we don't have
   a profiler, the DAG just runs validators on the original data. If we
   do, the profiler runs first and the validators run on its output, so
   they have to accept the type the profiler outputs rather than the
   original data type (see the sketch after this list).
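A rough sketch of that flow -- the profiler and validator names below are hypothetical, standing in for the whylogs scaffolding this commit adds:

```python
import pandas as pd
from hamilton.function_modifiers import check_output_custom  # path assumed

@check_output_custom(
    MeanInRangeValidator(0.0, 1.0),  # hypothetical: consumes the profile type
    profiler=WhylogsProfiler(),      # hypothetical: runs once on the raw data
)
def conversion_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    return signups / visits
```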

What this is missing

1. Testing -- it's just a POC with pieces left out. Run at your own risk
   :)
2. Configuration-wiring -- hopefully that'll come soon.
3. Tag-exposure -- that will be useful for giving metadata to the
   decorator. The data is there, but it's not available yet.

Note there are other approaches -- I think this one is nice as it's
flexible yet opinionated, and we can always drop down to the lower-level
API if we have something more complex.
Again, not the final version, but this shows what we can do:

Say one has a node called `foo`. We might want the following:

0. Disable all data validation globally
1. Disable all data validation for foo
2. Disable a few checks for foo but not all

This would translate to config:
0. "data_quality.disable = True"
1. "data_quality.foo.disable = True"
2. "data_quality.foo.disable = ['check_1', 'check_2']"
This is *very* rough.

The idea is that we should be able to choose one of the following modes:

1. Apply a validator to every final node in a subdag of a decorated function
2. Apply a validator to a specific node within the subdag of a decorated
   function

This basically allows (1), which is the default, but also (2) via the
applies_to keyword. Note that this only works if the target is in the
final subdag (i.e. a sink), and not in the middle. We should add that,
but it'll be a little bit of a change. Nothing we can't make backwards
compatible; we just might need to crawl back a little further in our
layered API -- e.g. use a subdag transformer rather than a node
transformer.

Either way, this shows that we can do what we want without too many
modifications.

Note that this is not tested, just a proof of concept.
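A sketch of the two modes over an extract_columns-style subdag -- decorator ordering and kwarg names are illustrative, with import paths assumed:

```python
import pandas as pd
from hamilton.function_modifiers import check_output, extract_columns  # paths assumed

# (1) default: the check runs on every final node of the subdag (a, b, c)
@extract_columns("a", "b", "c")
@check_output(max_fraction_nans=0.1)
def wide(raw: pd.DataFrame) -> pd.DataFrame:
    return raw[["a", "b", "c"]]

# (2) applies_to: the check runs only on the named sink node
@extract_columns("a", "b")
@check_output(range=(0, 1), applies_to=["a"])
def narrow(raw: pd.DataFrame) -> pd.DataFrame:
    return raw[["a", "b"]]
```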