diff --git a/sphinx-docs/comparison_types.md b/sphinx-docs/comparison_features.md
similarity index 96%
rename from sphinx-docs/comparison_types.md
rename to sphinx-docs/comparison_features.md
index c2a96ac..99d5803 100644
--- a/sphinx-docs/comparison_types.md
+++ b/sphinx-docs/comparison_features.md
@@ -1,30 +1,42 @@
-# Comparison types, transform add-ons, aggregate features, and household aggregate features
+# Comparison Features
 
-This page has information on the different comparison types available for the `[[comparison_features]]`
-section, along with some attributes available to all of the comparison types and some aggregate features
-that are not configurable.
+During matching, hlink computes comparison features on each record pair which
+it considers a potential match. These comparison features can be passed as
+features to machine-learning algorithms or used to define
+[comparisons](comparisons) which filter the `potential_matches` table.
 
-## Comparison types
-Each header below represents a comparison type. Transforms are used in the context of `comparison_features`.
+Each comparison feature must have a comparison type, which tells hlink how to
+compute the comparison feature. This page has information on the available
+comparison types and how to configure them. It also lists some attributes
+available to all comparison types and some predefined aggregate features which
+do not need to be explicitly configured.
 
-```
-[[comparison_features]]
-alias = "relatematch"
-column_name = "relate_div_100"
-comparison_type = "equals"
-categorical = true
-```
+## Comparison Types
+
+Each section below describes a comparison type. Each type represents a
+different operation, computation, or transformation that hlink can perform on
+one or more input columns. Some comparison types expect their own attributes
+for additional configuration. These attributes are listed in each section,
+along with an example.
 ### maximum_jaro_winkler
-Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns. For example, given an input of `column_names = ['namefrst', 'namelast']`, it would return the maximum Jaro-Winkler name comparison value among the following four comparisons:
+
+Finds the greatest Jaro-Winkler value among the cartesian product of multiple
+columns. For example, given an input of `column_names = ['namefrst',
+'namelast']`, it would return the maximum Jaro-Winkler name comparison value
+among the following four comparisons:
+
 ```
-[('namefrst_a', 'namefrst_b'),
- ('namefrst_a', 'namelast_b'),
- ('namelast_a', 'namefrst_b'),
- ('namelast_a', 'namelast_b')]
- ```
+a.namefrst, b.namefrst
+a.namefrst, b.namelast
+a.namelast, b.namefrst
+a.namelast, b.namelast
+```
+
 * Attributes:
-  * `column_names` -- Type: list of strings. Required. The list of columns used as input for the set of comparisons generated by taking the cartesian product.
+  * `column_names` -- Type: list of strings. Required. The list of columns
+    used as input for the set of comparisons, which are generated by taking the
+    Cartesian product of the set of input columns with itself.
 
 ```
 [[comparison_features]]
diff --git a/sphinx-docs/config.md b/sphinx-docs/config.md
index 29122dc..0ed63a3 100644
--- a/sphinx-docs/config.md
+++ b/sphinx-docs/config.md
@@ -671,18 +671,18 @@ feature_name = "byrdiff"
 threshold_expr = "<= 10"
 ```
 
-## [Comparison Features](comparison_types)
+## [Comparison Features](comparison_features)
 
 * Header name: `comparison_features`
-* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config. See the [comparison types](comparison_types) section for more information.
+* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config. See the [comparison features documentation page](comparison_features) for more information.
 * Required: True
 * Type: List
 * Attributes:
   * `alias` -- Type: `string`. Optional. The name of the comparison feature column to be generated. If not specified, the output column will default to `column_name`.
   * `column_name` -- Type: `string`. The name of the columns to compare.
-  * `comparison_type` -- Type: `string`. The name of the comparison type to use. See the [comparison types](comparison_types) section for more information.
+  * `comparison_type` -- Type: `string`. The name of the comparison type to use.
   * `categorical` -- Type: `boolean`. Optional. Whether the output data should be treated as categorical data (important information used during one-hot encoding and vectorizing in the machine learning pipeline stage).
-  * Other attributes may be included as well depending on `comparison_type`. See the [comparison types](comparison_types) section for details on each comparison type.
+  * Other attributes may be included as well depending on `comparison_type`. See the [comparison features page](comparison_features) for details on each comparison type.
 ```
 [[comparison_features]]
diff --git a/sphinx-docs/index.rst b/sphinx-docs/index.rst
index f2efb53..4793844 100644
--- a/sphinx-docs/index.rst
+++ b/sphinx-docs/index.rst
@@ -25,7 +25,7 @@ Configuration API
 
    Column Mappings
    comparisons
-   Comparison Types
+   Comparison Features
    Feature Selection
    Pipeline Features
    substitutions
diff --git a/sphinx-docs/link_tasks.md b/sphinx-docs/link_tasks.md
index dc201b7..769d589 100644
--- a/sphinx-docs/link_tasks.md
+++ b/sphinx-docs/link_tasks.md
@@ -84,7 +84,7 @@ are grouped into the same blocking bucket.
 on each record. These features may be passed to a machine learning model through the
 [`training`](config.html#training-and-models) section and/or passed to deterministic rules
 with the [`comparisons`](config.html#comparisons) section. There are many
-different [comparison types](comparison_types) available for use with
+different [comparison types](comparison_features) available for use with
 `comparison_features`.
 * [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are machine learning
 transformations useful for reshaping and interacting data before they are fed to the machine learning