Releases: GoogleCloudPlatform/cloud-data-quality
v0.5.2
What's Changed
- fix issue with missing format field in bigquery metadata response by @thinhha in #135
- Gcs entity int by @AmandeepSinghCS in #133
- added flag for validating dataplex gcs entities using bigquery extern… by @AmandeepSinghCS in #139
- allow cache to update on new fields by @thinhha in #134
- make --gcp_region_id optional by @thinhha in #136
- allow int32 and float32 types by @thinhha in #141
- aggregate entity-level summary tables to avoid bigquery complexity limit by @thinhha in #137
- Metadata partition by @AmandeepSinghCS in #142
- Advanced dq rules by @hejnal in #138
- allow case insensitive ids by @thinhha in #144
- Version0.5.2 by @thinhha in #145
- disable advanced_dq_rules by @thinhha in #146
- fix lint by @thinhha in #147
Full Changelog: v0.5.1...v0.5.2
v0.5.1
What's Changed
- Validation error of CUSTOM_SQL_STATEMENT without custom parameter by @pbalm in #121
- allow parametrization of CUSTOM_SQL_EXPR rules by @thinhha in #124
- update user-agent for attribution by @thinhha in #125
- Fixing an issue with conflicting names: column_id=data with the CTE data alias by @hejnal in #127
- Setting up documentation structure by @pbalm in #120
- bq client supports different regions by @thinhha in #129
- update to v0.5.1 by @thinhha in #130
- Small fix to README and linting by @pbalm in #131
- add throttling for dataplex client by @thinhha in #132
Full Changelog: v0.5.0...v0.5.1
v0.5.0
What's Changed
- added test for dq rules by @AmandeepSinghCS in #115
- better error by @thinhha in #116
- update high watermark logic and return 0 rows by @thinhha in #117
- fix method signature by @thinhha in #118
This release fixes the following bugs and includes breaking changes to the way CloudDQ reports summary statistics to the summary BigQuery table:
- Incremental rule-bindings with 'incremental_time_filter_column_id' would fail if executed for the first time on a BQ dataset where the dq_summary table had not been created yet. In the new behaviour, CloudDQ checks whether the dq_summary table exists and only runs the high-watermark query if it does; otherwise it executes a full-table scan, as it normally would if 'incremental_time_filter_column_id' were not set.
- Rule-bindings with 'incremental_time_filter_column_id' returned 0 records to dq_summary if no new data had arrived since the last high-watermark. In the new behaviour, CloudDQ returns 1 record with 0 rows_validated and all other statistics set to NULL.
- CUSTOM_SQL_STATEMENT returned 0 records to dq_summary if the custom SQL returned 0 records (i.e. everything succeeded). In the new behaviour, CloudDQ returns 1 record with 0 rows_validated and all other statistics set to NULL.
- NOT_NULL rules returned the same count in both failed_count and null_count, causing the sum of success_count + failed_count + null_count to exceed rows_validated. In the new behaviour, if the rule is NOT_NULL, null_count is set to NULL.
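The new incremental behaviour can be sketched as follows. This is an illustrative Python snippet, not the actual CloudDQ implementation; the function and argument names are invented for clarity.

```python
def choose_scan_strategy(incremental_column, dq_summary_exists):
    """Decide how to scan the source table for a rule-binding run.

    incremental_column:  value of 'incremental_time_filter_column_id',
                         or None if not configured.
    dq_summary_exists:   whether the dq_summary table already exists
                         in the target BQ dataset.
    """
    # No incremental column configured: always scan the full table.
    if incremental_column is None:
        return "full_scan"
    # First run: dq_summary does not exist yet, so there is no
    # high watermark to read; fall back to a full-table scan
    # instead of failing.
    if not dq_summary_exists:
        return "full_scan"
    # Subsequent runs: only validate rows newer than the stored
    # high watermark.
    return "high_watermark_query"
```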
Going forward:
- success_count + failed_count + null_count should always be equal to rows_validated
- CloudDQ will always return 1 record for each rule-binding/rule into dq_summary
- null_count will be NULL for NOT_NULL rule type
- CUSTOM_SQL_STATEMENT rules will have NULL values in success_count, failed_count, and null_count. They will instead populate the existing column 'complex_rule_validation_errors_count' with the number of rows returned by the custom SQL, plus a new column 'complex_rule_validation_success_flag', which is TRUE if 'complex_rule_validation_errors_count' is 0, FALSE if 'complex_rule_validation_errors_count' is greater than 0, and NULL if the rule_type is not CUSTOM_SQL_STATEMENT.
- CUSTOM_SQL_STATEMENT is no longer recommended for record-level validation. Users are recommended to use CUSTOM_SQL_EXPR to implement custom record-level requirements.
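The semantics of the new 'complex_rule_validation_success_flag' column can be sketched as below. This is a minimal illustration of the rules stated above, with None standing in for SQL NULL; the function name is invented, not part of the CloudDQ API.

```python
def complex_rule_validation_success_flag(rule_type, errors_count):
    """Compute the value of 'complex_rule_validation_success_flag'
    for one rule-binding/rule record in dq_summary."""
    # The flag only applies to CUSTOM_SQL_STATEMENT rules;
    # for every other rule_type it is NULL.
    if rule_type != "CUSTOM_SQL_STATEMENT":
        return None
    # TRUE when the custom SQL returned no error rows
    # (complex_rule_validation_errors_count == 0), FALSE otherwise.
    return errors_count == 0
```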
This constitutes a breaking change to the dq_summary results; we hope the change reduces confusion about how the different summary statistics are calculated.
The change is illustrated in the following example:
Given the YAML provided in this example on the sample contact_details data, CloudDQ will now generate the following results.
Full Changelog: v0.4.1...v0.5.0
v0.4.1
What's Changed
- split release pipelines by @thinhha in #106
- removed = from f-strings by @AmandeepSinghCS in #107
- Add dimensions to rules by @pbalm in #109
- Metadata test updates by @AmandeepSinghCS in #108
- Patch last mod by @thinhha in #110
- log to cloud logging by @thinhha in #111
- User agent by @thinhha in #112
- escape last mod table by @thinhha in #113
Full Changelog: v0.4.0...v0.4.1
v0.4.1-rc1
What's Changed
- Split release pipelines by @thinhha in #106
- Removed = from f-strings by @AmandeepSinghCS in #107
- Add dimensions to rules by @pbalm in #109
Full Changelog: v0.4.0...v0.4.1-rc1
v0.4.0
What's Changed
- add set_environment_variables.sh script by @thinhha in #65
- Target table by @AmandeepSinghCS in #63
- Dataplex Entity Spec v0.1 by @thinhha in #67
- fixing readme instructions for dependency installation by @ant-laz in #68
- make dataproc pyspark job bubble up run failures by @thinhha in #71
- add missing dev dependencies by @thinhha in #73
- Update pyspark driver by @thinhha in #75
- Adding usage guide and scripts for CloudDQ with Dataproc Workflow using Composer by @jayBana in #77
- update cli test by @thinhha in #79
- Relax YAML spec by @pbalm in #80
- Dataplex task by @thinhha in #76
- make test less flaky by @thinhha in #84
- Adding last_modified col to dq_summary by @pbalm in #85
- Log dq_summary table to stdout by @pbalm in #92
- Patch: Loggers can log the same message multiple times by @pbalm in #95
- fix get_json_logger logic by @thinhha in #99
- allow hyphens in custom sql strings by @thinhha in #87
- add user-agent by @thinhha in #100
- Dataplex metadata by @thinhha in #98
- Add lake zones by @thinhha in #101
- Hide bq flag by @thinhha in #102
- Release v0.4.0 by @thinhha in #103
- edit build by @thinhha in #104
- Gcb by @thinhha in #105
New Contributors
- @ant-laz made their first contribution in #68
- @jayBana made their first contribution in #77
- @pbalm made their first contribution in #80
Full Changelog: v0.3.1...v0.4.0
v0.4.0-rc2
Patch release to avoid duplication of log records.
v0.4.0-rc1
v0.3.2-rc2
What's Changed
- Addition of a column last_modified to the dq_summary table that indicates the last modification date of the data being checked.
Full Changelog: v0.3.1...v0.3.2-rc2
v0.3.2-rc1
What's Changed
- add set_environment_variables.sh script by @thinhha in #65
- Target table by @AmandeepSinghCS in #63
- Dataplex Entity Spec v0.1 by @thinhha in #67
- fixing readme instructions for dependency installation by @ant-laz in #68
- make dataproc pyspark job bubble up run failures by @thinhha in #71
- add missing dev dependencies by @thinhha in #73
- Update pyspark driver by @thinhha in #75
- Adding usage guide and scripts for CloudDQ with Dataproc Workflow using Composer by @jayBana in #77
- update cli test by @thinhha in #79
- Relax YAML spec by @pbalm in #80
New Contributors
- @ant-laz made their first contribution in #68
- @jayBana made their first contribution in #77
- @pbalm made their first contribution in #80
Full Changelog: v0.3.1...v0.3.2-rc1