Releases: ipums/hlink
v4.2.1
What's Changed
- Adjust how choose_classifier handles seed parameters by @riley-harper in #222
- Bump the version to 4.2.1 by @riley-harper in #225
Full Changelog: v4.2.0...v4.2.1
v4.2.0
What's Changed
- Update developer documentation by @riley-harper in #204
- Refactor column mapping transforms by @riley-harper in #207
- Document 5 column mapping transforms by @riley-harper in #212
- Add github workflow to build and publish sphinx docs to github pages by @joegrover in #210
- Don't create a history file on startup by @riley-harper in #215
- Update pyproject to switch to new license spec format by @riley-harper in #216
- Support custom column mapping transforms by @riley-harper in #213
- Simplify handling of the deprecated training.param_grid attribute by @riley-harper in #217
- Clean up Sphinx docs workflow by @riley-harper in #218
- Remove notes about XGBoost being unstable by @riley-harper in #219
- Bump the version to 4.2.0 by @riley-harper in #220
Full Changelog: v4.1.0...v4.2.0
v4.1.0
What's Changed
- Require setuptools >= 71 by @riley-harper in #198
- Remove restriction of scikit-learn < 1.6 for xgboost optional feature by @riley-harper in #196
- Fix threshold ratio bug by @riley-harper in #200
- Allow rematching in households by @riley-harper in #201
- Save hh training metadata as step 3 of hh_training by @joegrover in #202
- Updated the project version in pyproject.toml to 4.1.0 by @joegrover in #203
New Contributors
- @joegrover made their first contribution in #202
Full Changelog: v4.0.0...v4.1.0
v4.0.0
Overview
This version of hlink contains a large update to the model exploration task, several bug fixes, and a few breaking changes. For a curated list of changes, check out the changelog at https://hlink.docs.ipums.org/changelog.html.
What's Changed
- Refactor nested cross validation by @ccdavis in #169
- Add Randomized Parameter Search by @riley-harper in #168
- Update linking.core.classifier and linking.core.threshold by @riley-harper in #175
- Model exploration metrics by @ccdavis in #177
- Remove "suspicious data" functionality from model exploration by @riley-harper in #178
- Add the F-measure model metric, restructure for clarity by @riley-harper in #180
- Allow setting the checkpoint directory through SparkConnection by @riley-harper in #182
- Remove deprecated code for version 4 by @riley-harper in #184
- Use tomli instead of the toml package by default by @riley-harper in #185
- Fix a bug where model_metrics.mcc() < -1.0 by @riley-harper in #188
- Create a changelog file by @riley-harper in #189
- Add docs for Model Exploration by @riley-harper in #190
- Update docs for training.param_grid by @riley-harper in #191
- Version 4.0.0 by @riley-harper in #186
- Bump the version to 4.0.0 by @riley-harper in #192
Full Changelog: v3.8.0...v4.0.0
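Two of the items above concern model metrics: the new F-measure and a fix for MCC values below -1.0. The sketch below shows the standard confusion-matrix formulas for both. It is not hlink's implementation: the function names and signatures are hypothetical, and clamping the MCC assumes the out-of-range value came from floating-point rounding, which the note does not say.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    denominator = 2 * tp + fp + fn
    # Undefined when there are no positives at all; report NaN.
    return 2 * tp / denominator if denominator else math.nan


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, clamped to [-1.0, 1.0]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return math.nan
    value = (tp * tn - fp * fn) / denominator
    # Assumption: rounding error is the only way to leave [-1, 1],
    # so clamping is a safe guard.
    return max(-1.0, min(1.0, value))
```

Returning NaN for the undefined cases keeps a degenerate model from being silently scored as 0.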
v4.0.0b1
Overview
This is the first beta release for version 4. We do not expect to be doing any more feature work or breaking changes for version 4 after this release, so if all goes well, the interface should be pretty stable now. Like the alpha release, this is a pre-release, and so pip should not install it unless you specifically request it. Listed below are the changes from 4.0.0a1 to 4.0.0b1, which include a few small breaking changes.
There is now a user-facing changelog for hlink which is more carefully curated than these auto-generated release notes! You can see the changelog, which has a preview of v4.0.0, here.
What's Changed
- Remove deprecated code for version 4 by @riley-harper in #184
- Use tomli instead of the toml package by default by @riley-harper in #185
- Fix a bug where model_metrics.mcc() < -1.0 by @riley-harper in #188
- Create a changelog file by @riley-harper in #189
- Add docs for Model Exploration by @riley-harper in #190
- Update docs for training.param_grid by @riley-harper in #191
Full Changelog: v4.0.0a1...v4.0.0b1
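For readers unfamiliar with tomli, here is a minimal sketch of parsing TOML with it. It does not show how hlink loads its configuration. The fallback import, the helper names, and the config keys are illustrative assumptions.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party package with the same API

# Made-up configuration text, for illustration only.
CONFIG_TEXT = """
[training]
seed = 42
"""


def load_config_text(text: str) -> dict:
    """Parse a TOML document held in a string."""
    return tomllib.loads(text)


def load_config_file(path: str) -> dict:
    """Parse a TOML file; tomli and tomllib require binary mode."""
    with open(path, "rb") as f:
        return tomllib.load(f)


config = load_config_text(CONFIG_TEXT)
```

Unlike the older toml package, tomli only reads TOML and cannot write it.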
v4.0.0a1
Version 4.0.0 Alpha 1
This pre-release has upcoming changes for version 4 of hlink. Since this includes breaking changes and an overhaul of the model exploration task, we'd like to test it out a bit before creating a full release. Part of the work yet to be done is documentation and code cleanup. The documentation for these changes and new features is lacking so far. Here is a preview of the version 4 highlights (so far!):
- Completely overhauled the model exploration task, switching to a nested cross-validation algorithm.
- Added support for a third strategy for generating models to test in model exploration. Along with "explicit" (take exactly what's in training.model_parameters) and grid search, there is now randomized search. Randomized search takes a certain number of samples from a distribution defined in training.model_parameters.
- Added the F-measure metric to the model exploration output, and simplified the output so that it always has the same columns.
- Removed the training.output_suspicious_TD configuration option because it was rarely used and presented code and performance issues. Removing output_suspicious_TD makes the model exploration code more maintainable and helps it run more quickly.
- Disentangled two core modules (classifier and pipeline) from the configuration format by changing the arguments to a couple of functions. This should help separate those concerns more neatly and make changes to the configuration easier if we end up doing that in the future.
- Changed SparkConnection to require a checkpoint_dir argument, which fixes a bug related to Spark configuration.
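The randomized search idea from the list above can be sketched in plain Python. This is a generic illustration, not hlink's code or configuration format. The function name, the search-space encoding (lists for choices, two-element tuples for ranges), and the parameter names are all assumptions.

```python
import random


def sample_parameters(space: dict, num_samples: int, seed=None) -> list:
    """Draw num_samples parameter combinations from a search space."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        params = {}
        for name, spec in space.items():
            if isinstance(spec, list):
                # A list means "pick one of these values".
                params[name] = rng.choice(spec)
            elif isinstance(spec, tuple):
                low, high = spec
                if isinstance(low, int) and isinstance(high, int):
                    params[name] = rng.randint(low, high)  # inclusive range
                else:
                    params[name] = rng.uniform(low, high)
            else:
                # Anything else is a fixed value shared by every sample.
                params[name] = spec
        samples.append(params)
    return samples


# Hypothetical search space for illustration.
space = {
    "type": "random_forest",
    "maxDepth": [3, 5, 7],
    "numTrees": (50, 150),
    "subsamplingRate": (0.5, 1.0),
}
candidates = sample_parameters(space, num_samples=4, seed=0)
```

Grid search evaluates every combination. Randomized search instead caps the number of models at num_samples regardless of how many parameters are varied.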
v3.8.0
What's Changed
- Added optional support for two new gradient boosting ML libraries: XGBoost and LightGBM. You can read more about these libraries and how to install them with their dependencies in the docs here. PR #165
- Added a new hlink.linking.transformers.RenameVectorAttributes transformer which can rename the attributes or "slots" of Spark vector columns. Hlink uses this to support LightGBM, which disallows certain characters in its feature names. PR #165
- Documented comparisons, which are not the same as comparison features. Previously the documentation was misleading and seemed to indicate that these were the same thing. PR #159
- Fixed a bug in the substitution file documentation. The documentation had the meaning of the substitution file columns flip-flopped, which was confusing. PR #166
Developer-Facing Changes
- Updated Sphinx to 8.1.3 and fixed two Sphinx build warnings. PR #159
- Updated CI/CD to automatically run only on PRs and on pushes to main. You can also now manually trigger a CI/CD run from the Actions tab in GitHub. Also removed the custom "quickcheck" pytest marker in favor of using pytest -k and removed flake8 from CI/CD because it kept causing more trouble than it was worth. PR #164
Full Changelog: v3.7.0...v3.8.0
v3.7.0
What's Changed
- Add tests to cover several untested sections of code by @riley-harper in #147
- Refactor core.transforms.generate_transforms() for readability and maintainability; improve documentation and type hints by @riley-harper in #148
- Fix tests for Python 3.12 and clarify Python 3.12 support and dependence on PySpark by @riley-harper in #151
- Improve logging by writing to module-level loggers instead of the root logger by @riley-harper in #152
- Support setting the app name via an optional argument in SparkConnection. The default behavior of setting the app name to "linking" is unchanged. By @riley-harper in #156
- Improve model_exploration step 2 terminal output, logging, and documentation to make the step more understandable by @riley-harper in #155
Full Changelog: v3.6.1...v3.7.0
v3.6.1
What's Changed
- Support blocking sections with multiple exploded columns by @riley-harper in #143. This fixes a bug that caused a crash in Matching step 0 - explode.
Full Changelog: v3.6.0...v3.6.1
v3.6.0
What's Changed
- Support OR conditions in blocking by @riley-harper in #138. This new feature supports connecting some or all blocking conditions together with ORs instead of with ANDs. You can read more documentation about it under the "or_group" bullet point here.
- Unskip several skipped tests by @riley-harper in #139. This is a development change that should not affect users.
Full Changelog: v3.5.5...v3.6.0
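As a rough illustration of the OR-conditions feature in v3.6.0, the sketch below combines blocking conditions into one join condition. It assumes that conditions sharing an or_group value are joined with OR and that everything else is joined with AND. Only the or_group name comes from the note above. The function, the column_name key, and the a/b table aliases are hypothetical, and this is not hlink's implementation.

```python
def build_blocking_condition(blocking: list) -> str:
    """Combine blocking entries into a SQL-like join condition."""
    groups = {}
    for index, entry in enumerate(blocking):
        column = entry["column_name"]
        # Entries without an or_group each get their own private group.
        if "or_group" in entry:
            key = ("named", entry["or_group"])
        else:
            key = ("default", index)
        groups.setdefault(key, []).append(f"a.{column} = b.{column}")

    clauses = []
    for conditions in groups.values():
        if len(conditions) == 1:
            clauses.append(conditions[0])
        else:
            clauses.append("(" + " OR ".join(conditions) + ")")
    return " AND ".join(clauses)


# Hypothetical blocking section for illustration.
blocking = [
    {"column_name": "sex"},
    {"column_name": "bpl", "or_group": "place"},
    {"column_name": "statefip", "or_group": "place"},
]
condition = build_blocking_condition(blocking)
```

With this input, condition is "a.sex = b.sex AND (a.bpl = b.bpl OR a.statefip = b.statefip)".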