Commit

[#158] Rename the "Comparison Types" page to "Comparison Features"
riley-harper committed Oct 30, 2024
1 parent 319c60b commit 63a49d8
Showing 4 changed files with 38 additions and 26 deletions.
@@ -1,30 +1,42 @@
# Comparison types, transform add-ons, aggregate features, and household aggregate features
# Comparison Features

This page has information on the different comparison types available for the `[[comparison_features]]`
section, along with some attributes available to all of the comparison types and some aggregate features
that are not configurable.
During matching, hlink computes comparison features on each record pair which
it considers a potential match. These comparison features can be passed as
features to machine-learning algorithms or used to define
[comparisons](comparisons) which filter the `potential_matches` table.

## Comparison types
Each header below represents a comparison type. Transforms are used in the context of `comparison_features`.
Each comparison feature must have a comparison type, which tells hlink how to
compute the comparison feature. This page has information on the available
comparison types and how to configure them. It also lists some attributes
available to all comparison types and some predefined aggregate features which
do not need to be explicitly configured.

```
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
```
## Comparison Types

Each section below describes a comparison type. Each type represents a
different operation, computation, or transformation that hlink can perform on
one or more input columns. Some comparison types expect their own attributes
for additional configuration. These attributes are listed in each section,
along with an example.

### maximum_jaro_winkler
Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns. For example, given an input of `column_names = ['namefrst', 'namelast']`, it would return the maximum Jaro-Winkler name comparison value among the following four comparisons:

Finds the greatest Jaro-Winkler value among the cartesian product of multiple
columns. For example, given an input of `column_names = ['namefrst',
'namelast']`, it would return the maximum Jaro-Winkler name comparison value
among the following four comparisons:

```
[('namefrst_a', 'namefrst_b'),
('namefrst_a', 'namelast_b'),
('namelast_a', 'namefrst_b'),
('namelast_a', 'namelast_b')]
```
a.namefrst, b.namefrst
a.namefrst, b.namelast
a.namelast, b.namefrst
a.namelast, b.namelast
```
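The Cartesian-product behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`jaro`, `jaro_winkler`, `maximum_jaro_winkler`) and the dict-based records are hypothetical, the Jaro-Winkler routine is a textbook version with the usual prefix scale of 0.1, and hlink's own implementation may differ in its details.

```python
from itertools import product


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (illustrative, not hlink's code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    match_dist = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - match_dist)
        end = min(i + match_dist + 1, len2)
        for j in range(start, end):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if not matched1[i]:
            continue
        while not matched2[k]:
            k += 1
        if s1[i] != s2[k]:
            out_of_order += 1
        k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def maximum_jaro_winkler(record_a: dict, record_b: dict, column_names: list) -> float:
    """Greatest Jaro-Winkler value over every (a.col, b.col) pairing."""
    return max(
        jaro_winkler(record_a[col_a], record_b[col_b])
        for col_a, col_b in product(column_names, repeat=2)
    )


# Hypothetical record pair with first and last names swapped.
record_a = {"namefrst": "john", "namelast": "smith"}
record_b = {"namefrst": "smith", "namelast": "john"}
score = maximum_jaro_winkler(record_a, record_b, ["namefrst", "namelast"])
```

With the swapped names above, the same-column pairings score poorly, but the `a.namefrst, b.namelast` pairing is an exact match, so `score` is 1.0. Catching swapped name fields like this is the kind of case the cross-column maximum handles.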

* Attributes:
* `column_names` -- Type: list of strings. Required. The list of columns used as input for the set of comparisons generated by taking the cartesian product.
* `column_names` -- Type: list of strings. Required. The list of columns
used as input for the set of comparisons, which are generated by taking the
Cartesian product of the set of input columns with itself.

```
[[comparison_features]]
8 changes: 4 additions & 4 deletions sphinx-docs/config.md
@@ -671,18 +671,18 @@ feature_name = "byrdiff"
threshold_expr = "<= 10"
```

## [Comparison Features](comparison_types)
## [Comparison Features](comparison_features)

* Header name: `comparison_features`
* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config. See the [comparison types](comparison_types) section for more information.
* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config. See the [comparison features documentation page](comparison_features) for more information.
* Required: True
* Type: List
* Attributes:
* `alias` -- Type: `string`. Optional. The name of the comparison feature column to be generated. If not specified, the output column will default to `column_name`.
* `column_name` -- Type: `string`. The name of the columns to compare.
* `comparison_type` -- Type: `string`. The name of the comparison type to use. See the [comparison types](comparison_types) section for more information.
* `comparison_type` -- Type: `string`. The name of the comparison type to use.
* `categorical` -- Type: `boolean`. Optional. Whether the output data should be treated as categorical data (important information used during one-hot encoding and vectorizing in the machine learning pipeline stage).
* Other attributes may be included as well depending on `comparison_type`. See the [comparison types](comparison_types) section for details on each comparison type.
* Other attributes may be included as well depending on `comparison_type`. See the [comparison features page](comparison_features) for details on each comparison type.

```
[[comparison_features]]
2 changes: 1 addition & 1 deletion sphinx-docs/index.rst
@@ -25,7 +25,7 @@ Configuration API

Column Mappings <column_mappings.md>
comparisons
Comparison Types <comparison_types.md>
Comparison Features <comparison_features.md>
Feature Selection <feature_selection_transforms.md>
Pipeline Features <pipeline_features.md>
substitutions
2 changes: 1 addition & 1 deletion sphinx-docs/link_tasks.md
@@ -84,7 +84,7 @@ are grouped into the same blocking bucket.
on each record. These features may be passed to a machine learning model through the
[`training`](config.html#training-and-models) section and/or passed to deterministic
rules with the [`comparisons`](config.html#comparisons) section. There are many
different [comparison types](comparison_types) available for use with
different [comparison types](comparison_features) available for use with
`comparison_features`.
* [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are machine learning transformations
useful for reshaping and interacting data before they are fed to the machine learning