Releases: Nike-Inc/koheesio
koheesio-v0.10.1a0
Bugfix related to OktaAccessToken
Full Changelog: koheesio-v0.10.0...koheesio-v0.10.1a0
koheesio-v0.10.0
What's Changed
v0.10.0 brings several important features, security improvements, and bug fixes across different modules of Koheesio.
The overall API remains unchanged.
New Features / Refactors
The following new features are included:
- [feature] Core Introduces Koheesio-specific `SecretStr` and `SecretBytes` classes for handling secret strings and bytes with enhanced security by @dannymeijer in #164
  - Introduces Koheesio-specific `SecretStr` and `SecretBytes` along with a `SecretMixin` class (to reduce code duplication across the two)
  - New Secret classes are compatible with Pydantic's `SecretStr` and `SecretBytes` and allow seamless integration with existing code
  - To use: replace `from pydantic import SecretStr, SecretBytes` with `from koheesio.models import SecretStr, SecretBytes` to use the enhanced secret handling
  - These classes expand support to allow usage with an f-string (or `.format`) and `"string" + "other_string"` concatenation while remaining secure
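To make "remaining secure" under f-strings and concatenation more concrete, here is a minimal standard-library sketch of the idea. `MaskedStr` and `_raw` are hypothetical names used only for illustration; this is not Koheesio's `SecretStr`, and the real class may behave differently in its details.

```python
class MaskedStr:
    """Hypothetical stand-in for a secret string type (not Koheesio's class)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        # Explicit opt-in is the only way to read the raw value.
        return self._value

    def __str__(self) -> str:
        return "**********"

    def __repr__(self) -> str:
        return "MaskedStr('**********')"

    def __format__(self, format_spec: str) -> str:
        # f-strings and str.format() see the mask, never the raw value.
        return format(str(self), format_spec)

    def __add__(self, other):
        # secret + something -> still a secret
        return MaskedStr(self._value + _raw(other))

    def __radd__(self, other):
        # something + secret -> still a secret
        return MaskedStr(_raw(other) + self._value)


def _raw(value) -> str:
    """Unwrap a MaskedStr, or coerce any other value to str."""
    return value.get_secret_value() if isinstance(value, MaskedStr) else str(value)


token = MaskedStr("s3cr3t")
header = "Bearer " + token      # concatenation keeps the result masked
message = f"token={token}"      # formatting yields the mask
```

In this sketch the raw value only leaves the object through `get_secret_value()`, so accidental logging of `header` or `message` does not reveal it.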
- [feature] Core > BaseModel Added `partial` classmethod to `BaseModel` for enhanced customization and flexibility by @dannymeijer in #150
  - `partial` allows for creating a new instance of a model with only the specified fields updated (such as overwriting or setting a field's default values)
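One plausible reading of a `partial` classmethod can be sketched with `functools.partial` on a plain dataclass. `ReaderConfig` and its fields are hypothetical, and this is an assumption about the general pattern rather than a copy of how Koheesio's `BaseModel.partial` is implemented.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class ReaderConfig:
    """Hypothetical model used only for this illustration."""

    table: str
    format: str = "delta"
    streaming: bool = False

    @classmethod
    def partial(cls, **defaults):
        """Return a callable that builds instances with `defaults` pre-filled."""
        # `partial` here resolves to functools.partial (module scope),
        # not to this classmethod.
        return partial(cls, **defaults)


# Pre-set one field, then supply the rest at call time.
StreamingReader = ReaderConfig.partial(streaming=True)
reader = StreamingReader(table="orders")
# Values given at call time still win over the pre-set ones.
override = StreamingReader(table="orders", streaming=False)
```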
- [feature] Box Added a buffered version of `BoxFileWriter` by @riccamini in #161 (#87, #148)
  - Added the `BoxBufferFileWriter` class for writing files to Box when physical storage isn't available. Data is instead buffered in memory before being written to Box.
  - Also improves `BoxCsvFileReader` logging output by providing the file name in addition to the file ID.
- [feature] Dev Experience Easier debugging and dev improvements by @dannymeijer in #168
  - To make debugging easier, `pyproject.toml` was updated to make it easier to run `spark connect` in your local dev environment:
    - Added extra dependencies for `pyspark[connect]==3.5.4`.
    - Added environment variables for Spark Connect in the development environment.
    - Changed to verbose mode logging in the pytest output (also visible through GitHub Actions tests run output).
- [refactor] Snowflake Snowflake classes now use `params` over `options` by @dannymeijer in #168
  - Snowflake classes now also inherit from `ExtraParamsMixin`
  - Renamed `options` field to `params` and added alias `options` for backwards compatibility.
  - Introduced `SF_DEFAULT_PARAMS`.
- [feature] Delta Support for Delta table history by @zarembat in #163
  - Enables fetching Delta table history and checking data staleness based on defined intervals and refresh days.
  - Changes to `DeltaTableStep` class:
    - Added `describe_history()` method to `DeltaTableStep` for fetching Delta table history as a Spark DataFrame.
    - Added `is_date_stale()` method to `DeltaTableStep` to check data staleness based on time intervals or specific refresh days.
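The staleness idea (an age limit, optionally tied to a refresh day) can be sketched in plain Python. The function name `is_stale`, its parameters, and the exact refresh-day semantics are assumptions made for illustration; they do not reflect the actual signature or behavior of `is_date_stale()`.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_stale(
    last_update: datetime,
    now: datetime,
    max_age: timedelta,
    refresh_weekday: Optional[int] = None,
) -> bool:
    """Hypothetical helper: True when data is older than `max_age`.

    If `refresh_weekday` (0 = Monday) is given, the age limit is only
    enforced on that weekday; on other days the data is treated as fresh.
    """
    if refresh_weekday is not None and now.weekday() != refresh_weekday:
        return False
    return now - last_update > max_age


now = datetime(2025, 1, 6, 12, 0)          # a Monday
last_update = datetime(2025, 1, 1, 12, 0)  # five days earlier
stale_by_age = is_stale(last_update, now, timedelta(days=2))
stale_on_wednesday_only = is_stale(last_update, now, timedelta(days=2), refresh_weekday=2)
```

In practice the `last_update` timestamp would come from the most recent entry in the table history; here it is hard-coded to keep the sketch self-contained.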
- [feature] Http Added support for authorization headers with proper masking for improved security by @dannymeijer in #158 and #170 (#157)
  - Addresses potential data leaks in authorization headers, ensuring secure handling of sensitive information.
  - Comprehensive unit tests added to prevent regressions and ensure expected behavior.
  - Changes to `HttpStep` class:
    - Added `decode_sensitive_headers` method to decode `SecretStr` values in headers.
    - Modified `get_headers` method to dump headers into JSON without `SecretStr` masking.
    - Added `auth_header` field to handle authorization headers.
    - Implemented masking for bearer tokens to maintain their 'secret' status.
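The general technique of keeping bearer tokens out of logs can be sketched with the standard library alone. `mask_headers` and `MASK` are hypothetical names; this illustrates the masking idea under simple assumptions and is not the `HttpStep` implementation.

```python
import json
from typing import Dict

MASK = "**********"


def mask_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `headers` that is safe to log."""
    masked = {}
    for name, value in headers.items():
        if name.lower() == "authorization":
            # Keep the scheme (e.g. "Bearer") but hide the credentials.
            scheme, _, credentials = value.partition(" ")
            masked[name] = f"{scheme} {MASK}" if credentials else MASK
        else:
            masked[name] = value
    return masked


headers = {"Authorization": "Bearer abc123", "Content-Type": "application/json"}
log_line = json.dumps(mask_headers(headers))  # safe to write to logs
```

The original `headers` dict is left untouched, so the real token is still available for the outgoing request while only the masked copy is logged.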
- [feature] Step & Spark Add transformation to download file from url data through python or spark by @mikita-sakalouski and @dannymeijer in #143 (#75)
  - Allow downloading files from a given URL
  - Added `DownloadFileStep` class in a new module `koheesio.steps.download_file`
  - Added `FileWriteMode` enum with supported write modes: `OVERWRITE`, `APPEND`, `IGNORE`, `EXCLUSIVE`, `BACKUP`
  - Also made available as a spark `Transformation` in `DownloadFileFromUrlTransformation` in a new module `koheesio.spark.transformations.download_files`
  - The spark implementation allows passing urls through a column in a given DataFrame
  - All URLs are then downloaded by the Spark Driver to a given location
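To illustrate what write modes with these names typically mean, here is a small local-file sketch. `WriteMode` and `write_bytes` are hypothetical, and the semantics shown (skip on `IGNORE`, fail on `EXCLUSIVE`, rename the old file on `BACKUP`) are assumptions inferred from the mode names, not taken from `FileWriteMode` itself.

```python
import os
import tempfile
from enum import Enum


class WriteMode(str, Enum):
    """Hypothetical mirror of the listed modes; semantics are assumed."""

    OVERWRITE = "overwrite"
    APPEND = "append"
    IGNORE = "ignore"
    EXCLUSIVE = "exclusive"
    BACKUP = "backup"


def write_bytes(path: str, data: bytes, mode: WriteMode) -> bool:
    """Write `data` to `path`; return False when the write was skipped."""
    exists = os.path.exists(path)
    if exists and mode is WriteMode.IGNORE:
        return False  # keep the existing file untouched
    if exists and mode is WriteMode.EXCLUSIVE:
        raise FileExistsError(path)
    if exists and mode is WriteMode.BACKUP:
        os.replace(path, path + ".bak")  # preserve the old content
    open_flag = "ab" if mode is WriteMode.APPEND else "wb"
    with open(path, open_flag) as handle:
        handle.write(data)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "report.csv")
    write_bytes(target, b"v1\n", WriteMode.OVERWRITE)
    write_bytes(target, b"v2\n", WriteMode.BACKUP)  # v1 moves to report.csv.bak
    with open(target + ".bak", "rb") as handle:
        backup_content = handle.read()
```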
- [refactor] Spark > Reader > JDBC Updated JDBC behavior by @dannymeijer in #168
  - `JDBCReader` class now also inherits from `ExtraParamsMixin`.
  - Renamed `options` field to `params` and added alias `options` for backwards compatibility.
  - `dbtable` and `query` validation is now handled upon initialization rather than at runtime.
  - Behavior now requires either `dbtable` or `query` to be provided to be able to use JDBC.
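The two ideas in this refactor, a backwards-compatible `options` alias for `params` and validation at construction time, can be sketched with a plain dataclass. `JdbcConfig` is a hypothetical stand-in and not Koheesio's `JDBCReader`; how the real class treats the case where both `dbtable` and `query` are given is not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JdbcConfig:
    """Hypothetical config object; not Koheesio's `JDBCReader`."""

    url: str
    dbtable: Optional[str] = None
    query: Optional[str] = None
    params: Dict[str, str] = field(default_factory=dict)
    options: Optional[Dict[str, str]] = None  # legacy alias for `params`

    def __post_init__(self) -> None:
        # Fold the legacy `options` name into `params`; explicit params win.
        if self.options:
            self.params = {**self.options, **self.params}
        self.options = None
        # Fail fast: validate at construction time rather than when reading.
        if not self.dbtable and not self.query:
            raise ValueError("Provide either `dbtable` or `query`.")


# Old-style call sites that still pass `options` keep working.
config = JdbcConfig(
    url="jdbc:postgresql://host/db",
    dbtable="orders",
    options={"fetchsize": "100"},
)
```

Validating in the constructor surfaces a misconfigured reader when the pipeline is assembled instead of midway through a run.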
- [refactor] Spark > Reader > HanaReader Updated `HanaReader` behavior by @dannymeijer in #168
  - `HanaReader` class no longer has an `options` field.
  - Instead uses `params` and the alias `options` for backwards compatibility (see `JDBCReader` changes mentioned above).
- [refactor] Spark > Reader > TeradataReader Updated `TeradataReader` behavior by @dannymeijer in #168
  - `TeradataReader` class no longer has an `options` field.
  - Instead uses `params` and the alias `options` for backwards compatibility (see `JDBCReader` changes mentioned above).
- [feature] Spark > Transformation > CamelToSnake Added more efficient Spark 3.4+ supported operation for `CamelToSnakeTransformation` by @dannymeijer in #142
Bug fixes
The following bug fixes are included:
- [bugfix] Core > Context Fix Context initialization with another Context object and dotted notation by @dannymeijer in #160 (#159)
  - The init method of the Context class incorrectly updated the `kwargs`, making it return `None`. Calls to Context containing another Context object would previously fail.
  - Also fixed an issue with how Context handled get operations for nested keys when using dotted notation
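For readers unfamiliar with dotted-notation lookups, the following sketch shows the general technique on plain nested dictionaries. `get_dotted` is a hypothetical helper written for this illustration; it is not the `Context` class or its `get` method.

```python
from collections.abc import Mapping
from typing import Any


def get_dotted(data: Mapping, key: str, default: Any = None) -> Any:
    """Resolve 'a.b.c' against nested mappings, returning `default` if absent."""
    current: Any = data
    for part in key.split("."):
        # Stop as soon as a level is missing or is not a mapping.
        if not isinstance(current, Mapping) or part not in current:
            return default
        current = current[part]
    return current


config = {"spark": {"session": {"app_name": "demo"}}}
app_name = get_dotted(config, "spark.session.app_name")
missing = get_dotted(config, "spark.missing.key", default="fallback")
```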
- [bugfix] Core > Step Fixed duplicate logging issues in nested Step classes by @dannymeijer in #168
  - We observed log duplication when using specific super call sequences in nested Step classes
  - Several changes were made to the `StepMetaClass` to address duplicate logs when using `super()` in the execute method of a Step class under specific circumstances.
  - Updated `_is_called_through_super` method to traverse the entire method resolution order (MRO) and correctly identify `super()` calls.
  - Ensured `_execute_wrapper` method triggers logging only once per execute call.
  - This change prevents duplicate logs and ensures accurate log entries. The `_is_called_through_super` method was also used for `Output` validation, ensuring it is called only once.
- [bugfix] Delta Improve merge clause handling in `DeltaTableWriter` by @mikita-sakalouski in #155 (#149)
  - Previously, when a delta merge configuration (as dict) was used to provide the merge condition to the merge builder, multiple calls to the merge operation (e.g. once per batch in streaming) would break because the original implementation called `pop` on that dictionary.
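The failure mode described here is a general Python pitfall: `dict.pop` mutates the caller's dictionary, so a configuration that is reused across calls loses its keys after the first one. The sketch below reproduces the pattern with hypothetical function names and clause keys (`action`, `condition`); it is not the `DeltaTableWriter` code.

```python
from typing import Any, Dict, List


def apply_clause_buggy(clause: Dict[str, Any]) -> str:
    """Consumes keys from the caller's dict, so a second call fails."""
    action = clause.pop("action")
    return f"{action}({clause.get('condition')})"


def apply_clause_fixed(clause: Dict[str, Any]) -> str:
    """Works on a shallow copy, leaving the caller's dict reusable."""
    local = dict(clause)
    action = local.pop("action")
    return f"{action}({local.get('condition')})"


merge_clause = {"action": "whenMatchedUpdate", "condition": "s.id = t.id"}

# Simulate three streaming micro-batches reusing the same configuration.
results: List[str] = [apply_clause_fixed(merge_clause) for _batch in range(3)]
```

With `apply_clause_buggy`, the second batch would raise `KeyError` because `"action"` was already removed by the first call.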
- [bugfix] Spark Pyspark Connect support fixes by @nogitting and @dannymeijer in #154 (#153)
  - Connect support check previously wrongly excluded Spark 3.4
  - Fix gets rid of false positives in our spark connect check utility
- [bugfix] Spark > ColumnsTransformation `ColumnConfig` defaults in `ColumnsTransformation` not working correctly by @dannymeijer in #142
  - `run_for_all_data_type` and `limit_data_type` were previously not working correctly
- [bugfix] Spark > Transformation > Hash Fix error handling missing columns in Spark Connect by @dannymeijer in #168
  - Updated `sha2` function call to use named parameters.
  - Changes to `Sha2Hash` class:
    - Added check for missing columns.
    - Improved handling when no columns are provided.
New Contributors
Big shout out to all contributors and a heartfelt welcome to our new contributors:
- @nogitting made their first contribution in #154
- @zarembat made their first contribution in #163
Full Changelog: koheesio-v0.9.1...koheesio-v0.10.0
koheesio-v0.10.0a0
Alpha0 release of Koheesio v0.10
Note: this release is incomplete; further features are still being developed. This release is ALPHA, and should not be used in a production setting until the actual v0.10 release.
What's Changed
- [FEATURE] Add transformation to download file from url by @dannymeijer in #143
- [BUGFIX] Restore Box CSVReader behavior (#144) by @dannymeijer in #147
- [BUGFIX] ColumnConfig defaults in ColumnsTransformation for run_for_all_data_type and limit_data_type were not working correctly by @dannymeijer in #142
- [FEATURE] add partial method to BaseModel by @dannymeijer in #150
- [BUGFIX] Update PySpark connect support check by @dannymeijer in #154
- [BUGFIX]: Improve merge clause handling in DeltaTableWriter by @mikita-sakalouski in #155
Full Changelog: koheesio-v0.9.1...koheesio-v0.10.0a0
koheesio-v0.9.1
What's Changed
- [BUGFIX] #144 Box CSV Reader handling data types incorrectly by @louis-paulvlx in #145
Full Changelog: koheesio-v0.9.0...koheesio-v0.9.1
koheesio-v0.9.0
What's Changed
v0.9 brings many changes to the spark module, allowing support for pyspark connect along with a bunch of bug fixes and some new features. Additionally, the snowflake implementation is significantly reworked, now relying on a pure python implementation for interacting with Snowflake outside of spark.
New features / Refactors
The following new features are included with 0.9:
- [feature] Box - Add overwrite functionality to the BoxFileWriterClass by @ToneVDB in #103
- [feature] Box - allow setting file encoding by @louis-paulvlx in #96
- [refactor] Core - change private attr and step getter by @mikita-sakalouski in #82
- [feature] DataBricks - DataBricksSecret for getting secrets from DataBricks scope by @mikita-sakalouski in #133
- [feature] Delta - Enable adding options to DeltaReader both streaming and batch by @mikita-sakalouski in #111
- [feature] SE - SparkExpectations bump version to 2.2.0 by @dannymeijer in #99
- [feature] Snowflake - Populate account from url if not provided in SnowflakeBaseModel by @mikita-sakalouski in #117
- [feature] Spark - add support for Spark Connect by @mikita-sakalouski in #63
- [feature] Spark - Make Transformations callable by @dannymeijer in #126
- [feature] Tableau - Add support for HyperProcess parameters by @maxim-mityutko in #112
Bug fixes
The following bugfixes are included with 0.9:
- [bugfix] Core - Accidental duplication of logs by @dannymeijer in #105
- [bugfix] Core - Adjust branch fetching logic for forked repo for Github Actions by @mikita-sakalouski in #106 and #109
- [bugfix] Delta - DeltaMergeBuilder instance type didn't check out by @dannymeijer in #100
- [bugfix] Delta - fix merge builder instance check for connect + util fix by @dannymeijer in #130
- [bugfix] Docs - broken import statements and updated hello-world.md by @dannymeijer in #107
- [bugfix] Snowflake - python connector default config dir by @mikita-sakalouski in #125
- [bugfix] Snowflake - Remove duplicated implementation by @mikita-sakalouski in #116
- [bugfix] Spark - unused SparkSession being imported from pyspark.sql in several tests by @dannymeijer in #140
- [bugfix] Spark/Docs - Remove mention of non-existent class type in docs by @dannymeijer in #138
- [bugfix] Tableau - Decimals conversion in HyperFileDataFrameWriter by @maxim-mityutko in #77
- [bugfix] Tableau - small fix for Tableau Server path checking by @dannymeijer in #134
- [bugfix] Snowflake - replace RunQuery with SnowflakeRunQueryPython by @mikita-sakalouski in #121
New Contributors
Big shout out to all contributors and a heartfelt welcome to our new contributors:
- @louis-paulvlx made their first contribution in #96
- @ToneVDB made their first contribution in #103
Migrating from v0.8
For users currently using v0.8, consider the following:
- Spark connect is now fully supported. For this to work we've had to introduce several replacement types for pyspark such as DataFrame (i.e. `pyspark.sql.DataFrame` vs `pyspark.sql.connect.DataFrame`) as well as the SparkSession. If you are using custom Step logic in which you reference spark types, take these types from the `koheesio.spark` module instead. This will allow you to use pyspark connect with your custom code also.
- Snowflake was extensively reworked.
  - To be able to use snowflake, a new `extra` / `feature` was added to the `pyproject.toml` - install this using `koheesio[snowflake]` in order to have access to snowflake python
  - Code for snowflake support was moved to new primary modules:
    - `koheesio.integrations.spark.snowflake` hosts all spark related snowflake code
    - `koheesio.integrations.snowflake` hosts the non-spark / pure-python implementations
  - The original API was kept in place through pass-through imports; no immediate code changes should be needed
Full Changelog: koheesio-v0.8.1...koheesio-v0.9.0
koheesio-v0.9.0rc7
What's Changed
- Release/0.9 - final version bump and docs by @dannymeijer in #132
- [FEATURE] Make Transformations callable by @dannymeijer in #126
- [BUG] small fix for Tableau Server path checking by @dannymeijer in #134
- [FEATURE] DataBricksSecret for getting secrets from DataBricks scope by @mikita-sakalouski in #133
Full Changelog: koheesio-v0.9.0rc6...koheesio-v0.9.0rc7
koheesio-v0.9.0rc6
What's Changed
- refactor: replace RunQuery with SnowflakeRunQueryPython by @mikita-sakalouski in #121
- hotfix: snowflake python connector default config dir by @mikita-sakalouski in #125
- hotfix: delta merge builder instance check for connect + util fix by @dannymeijer in #130
Full Changelog: koheesio-v0.9.0rc5...koheesio-v0.9.0rc6
koheesio-v0.9.0rc5
Adjust logic for getting account from url/sfURL
koheesio-v0.9.0rc4
What's Changed
- fix: test github by @mikita-sakalouski in #109
- [Fix] Add overwrite functionality to the BoxFileWriterClass by @ToneVDB in #103
- [FEATURE] Enable adding options to DeltaReader both streaming and writing by @mikita-sakalouski in #111
- Add support for HyperProcess parameters by @maxim-mityutko in #112
- [HOTFIX] Remove duplicated implementation by @mikita-sakalouski in #116
- [FEATURE] Populate account from url if not provided in SnowflakeBaseModel by @mikita-sakalouski in #117
Full Changelog: koheesio-v0.9.0rc3...koheesio-v0.9.0rc4
koheesio-v0.9.0rc3
What's Changed
- [FIX] Accidental duplication of logs by @dannymeijer in #105
- fix: adjust branch fetching by @mikita-sakalouski in #106
- [FIX] broken import statements and updated hello-world.md by @dannymeijer in #107
Full Changelog: koheesio-v0.9.0rc2...koheesio-v0.9.0rc3