[#158] Update sphinx to 8.1.3 and regenerate the docs

ipums · Nov 1, 2024 · 6f7004d
1 parent 63a49d8
commit 6f7004d
Showing 51 changed files with 818 additions and 24,247 deletions.
diff --git a/docs/.buildinfo b/docs/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
-# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: de74adeb0864eb6d8e73600964a3e52d
+# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: a706061ae4b2d0ec765440a2505ca382
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/.buildinfo.bak b/docs/.buildinfo.bak
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: de74adeb0864eb6d8e73600964a3e52d
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/.doctrees/column_mapping_transforms.doctree b/docs/.doctrees/column_mapping_transforms.doctree
diff --git a/docs/.doctrees/comparison_types.doctree b/docs/.doctrees/comparison_types.doctree
diff --git a/docs/.doctrees/config.doctree b/docs/.doctrees/config.doctree
diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle
diff --git a/docs/.doctrees/example_workflow.doctree b/docs/.doctrees/example_workflow.doctree
diff --git a/docs/.doctrees/feature_selection_transforms.doctree b/docs/.doctrees/feature_selection_transforms.doctree
diff --git a/docs/.doctrees/index.doctree b/docs/.doctrees/index.doctree
diff --git a/docs/.doctrees/installation.doctree b/docs/.doctrees/installation.doctree
diff --git a/docs/.doctrees/introduction.doctree b/docs/.doctrees/introduction.doctree
diff --git a/docs/.doctrees/link_tasks.doctree b/docs/.doctrees/link_tasks.doctree
diff --git a/docs/.doctrees/models.doctree b/docs/.doctrees/models.doctree
diff --git a/docs/.doctrees/pipeline_features.doctree b/docs/.doctrees/pipeline_features.doctree
diff --git a/docs/.doctrees/running_the_program.doctree b/docs/.doctrees/running_the_program.doctree
diff --git a/docs/.doctrees/substitutions.doctree b/docs/.doctrees/substitutions.doctree
diff --git a/docs/.doctrees/use_examples.doctree b/docs/.doctrees/use_examples.doctree
diff --git a/docs/_sources/comparison_types.md.txt b/docs/_sources/comparison_features.md.txt
@@ -1,30 +1,42 @@
-# Comparison types, transform add-ons, aggregate features, and household aggregate features
+# Comparison Features
 
-This page has information on the different comparison types available for the `[[comparison_features]]`
-section, along with some attributes available to all of the comparison types and some aggregate features
-that are not configurable.
+During matching, hlink computes comparison features on each record pair which
+it considers a potential match. These comparison features can be passed as
+features to machine-learning algorithms or used to define
+[comparisons](comparisons) which filter the `potential_matches` table.
 
-## Comparison types
-Each header below represents a comparison type.  Transforms are used in the context of `comparison_features`.
+Each comparison feature must have a comparison type, which tells hlink how to
+compute the comparison feature. This page has information on the available
+comparison types and how to configure them. It also lists some attributes
+available to all comparison types and some predefined aggregate features which
+do not need to be explicitly configured.
 
-```
-[[comparison_features]]
-alias = "relatematch"
-column_name = "relate_div_100"
-comparison_type = "equals"
-categorical = true
-```
+## Comparison Types
+
+Each section below describes a comparison type. Each type represents a
+different operation, computation, or transformation that hlink can perform on
+one or more input columns. Some comparison types expect their own attributes
+for additional configuration. These attributes are listed in each section,
+along with an example.
 
 ### maximum_jaro_winkler
-Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns.  For example, given an input of `column_names = ['namefrst', 'namelast']`, it would return the maximum Jaro-Winkler name comparison value among the following four comparisons: 
+
+Finds the greatest Jaro-Winkler value among the cartesian product of multiple
+columns.  For example, given an input of `column_names = ['namefrst',
+'namelast']`, it would return the maximum Jaro-Winkler name comparison value
+among the following four comparisons: 
+
 ```
-[('namefrst_a', 'namefrst_b'),
- ('namefrst_a', 'namelast_b'),
- ('namelast_a', 'namefrst_b'),
- ('namelast_a', 'namelast_b')]
- ```
+a.namefrst, b.namefrst
+a.namefrst, b.namelast
+a.namelast, b.namefrst
+a.namelast, b.namelast
+```
+
 * Attributes:
-  * `column_names` -- Type: list of strings.  Required.  The list of columns used as input for the set of comparisons generated by taking the cartesian product.
+  * `column_names` -- Type: list of strings.  Required.  The list of columns
+    used as input for the set of comparisons, which are generated by taking the
+    Cartesian product of the set of input columns with itself.
 
  ```
 [[comparison_features]]

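The pairing rule described for `maximum_jaro_winkler` can be sketched in Python as follows. This is only an illustration of taking the maximum over the Cartesian product of the input columns, not hlink's implementation: the helper names `similarity` and `max_similarity` are hypothetical, and `difflib.SequenceMatcher` is a stand-in for a real Jaro-Winkler function.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(x: str, y: str) -> float:
    # Stand-in string similarity. This is NOT Jaro-Winkler; it is used here
    # only so that the sketch runs with the standard library alone.
    return SequenceMatcher(None, x, y).ratio()


def max_similarity(record_a: dict, record_b: dict, column_names: list[str]) -> float:
    # product(column_names, repeat=2) yields every (column in a, column in b)
    # pairing, e.g. 4 pairs for 2 input columns.
    pairs = product(column_names, repeat=2)
    return max(similarity(record_a[col_a], record_b[col_b]) for col_a, col_b in pairs)


# Hypothetical records where first and last names were swapped in dataset b.
record_a = {"namefrst": "john", "namelast": "smith"}
record_b = {"namefrst": "smith", "namelast": "jon"}

print(max_similarity(record_a, record_b, ["namefrst", "namelast"]))  # 1.0
```

The maximum is 1.0 because `a.namelast` and `b.namefrst` are identical, which is the kind of swapped-name case that comparing across columns is meant to catch.
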
diff --git a/docs/_sources/comparisons.md.txt b/docs/_sources/comparisons.md.txt
@@ -0,0 +1,125 @@
+# Comparisons
+
+## Overview
+
+The `comparisons` configuration section defines constraints on the matching
+process. Unlike `comparison_features` and `feature_selections`, which define
+features for use with a machine-learning algorithm, `comparisons` define rules
+which directly filter the output `potential_matches` table. These rules often
+depend on some comparison features, and hlink always applies the rules after
+exploding and blocking in the matching task.
+
+As an example, suppose that your `comparisons` configuration section looks like
+the following.
+
+```
+[comparisons]
+comparison_type = "threshold"
+feature_name = "namefrst_jw"
+threshold = 0.79
+```
+
+This comparison defines a rule that depends on the `namefrst_jw` comparison
+feature. During matching, only pairs of records with `namefrst_jw` greater than
+or equal to 0.79 will be added to the potential matches table. Pairs of records
+which do not satisfy the comparison will not be potential matches.
+
+*Note: This page focuses on the `comparisons` section in particular, but the
+household comparisons section `hh_comparisons` has the same structure. It
+defines rules which hlink uses to filter record pairs after household blocking
+in the hh_matching task. These rules are effectively filters on the output
+`hh_potential_matches` table.*
+
+## Comparison Types
+
+Currently the only `comparison_type` supported for the `comparisons` section is
+`"threshold"`. This requires the `threshold` attribute, and by default, it
+restricts a comparison feature to be greater than or equal to the value given
+by `threshold`. The configuration section
+
+```
+[comparisons]
+comparison_type = "threshold"
+feature_name = "namelast_jw"
+threshold = 0.84
+```
+
+adds the condition `namelast_jw >= 0.84` to each record pair considered during
+matching. Only record pairs which satisfy this condition are marked as
+potential matches.
+
+Hlink also supports a `threshold_expr` attribute in `comparisons` for more
+flexibility. This attribute takes SQL syntax and replaces the `threshold`
+attribute described above. For example, to define the condition `flag < 0.5`,
+you could set `threshold_expr` like
+
+```
+[comparisons]
+comparison_type = "threshold"
+feature_name = "flag"
+threshold_expr = "< 0.5"
+```
+
+Note that there is now no need for the `threshold` attribute because the
+`threshold_expr` implicitly defines it.
+
+## Defining Multiple Comparisons
+
+In some cases, you may have multiple comparisons to make between record pairs.
+The `comparisons` section supports this in a flexible but somewhat verbose way.
+Suppose that you would like to combine two of the conditions used in the
+examples above, so that record pairs are potential matches only if `namefrst_jw >= 0.79`
+and `namelast_jw >= 0.84`. You could do this by setting the `operator`
+attribute to `"AND"` and then defining the `comp_a` (comparison A) and `comp_b`
+(comparison B) attributes.
+
+```
+[comparisons]
+operator = "AND"
+
+[comparisons.comp_a]
+comparison_type = "threshold"
+feature_name = "namefrst_jw"
+threshold = 0.79
+
+[comparisons.comp_b]
+comparison_type = "threshold"
+feature_name = "namelast_jw"
+threshold = 0.84
+```
+
+Both `comp_a` and `comp_b` are recursive, so they may have the same structure
+as the `comparisons` section itself. This means that you can add as many
+comparisons as you would like by recursively defining comparisons. `operator`
+may be either `"AND"` or `"OR"` and defines the logic for connecting the two
+sub-comparisons `comp_a` and `comp_b`. Defining more than two comparisons can
+get pretty ugly and verbose, so make sure to use care when defining nested
+comparisons. Here is an example of a section with three comparisons.
+
+```
+# This comparisons section defines 3 rules for potential matches.
+# They are that potential matches must either have
+# 1. flag < 0.5
+# OR
+# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84
+[comparisons]
+operator = "OR"
+
+[comparisons.comp_a]
+comparison_type = "threshold"
+feature_name = "flag"
+threshold_expr = "< 0.5"
+
+[comparisons.comp_b]
+operator = "AND"
+
+[comparisons.comp_b.comp_a]
+comparison_type = "threshold"
+feature_name = "namefrst_jw"
+threshold = 0.79
+
+[comparisons.comp_b.comp_b]
+comparison_type = "threshold"
+feature_name = "namelast_jw"
+threshold = 0.84
+```
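
As a rough illustration of how a nested `comparisons` section could map onto a single filter condition, the Python sketch below walks a dictionary shaped like the parsed TOML and builds a SQL-style expression. The function name `comparison_to_sql` and the exact output format are assumptions made for this example; the sketch does not show how hlink itself builds its filters.

```python
def comparison_to_sql(comp: dict) -> str:
    """Recursively turn a comparisons table into a SQL-style condition."""
    if "operator" in comp:
        # Nested form: combine the two sub-comparisons with AND / OR.
        operator = comp["operator"]
        if operator not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {operator!r}")
        left = comparison_to_sql(comp["comp_a"])
        right = comparison_to_sql(comp["comp_b"])
        return f"({left} {operator} {right})"

    # Single form: exactly one of threshold / threshold_expr must be present.
    has_threshold = "threshold" in comp
    has_expr = "threshold_expr" in comp
    if has_threshold == has_expr:
        raise ValueError("provide exactly one of threshold or threshold_expr")
    if has_expr:
        return f"{comp['feature_name']} {comp['threshold_expr']}"
    # A plain threshold means "greater than or equal to".
    return f"{comp['feature_name']} >= {comp['threshold']}"


# Dictionary equivalent of the three-rule example, written by hand here
# rather than parsed from a TOML file.
config = {
    "operator": "OR",
    "comp_a": {
        "comparison_type": "threshold",
        "feature_name": "flag",
        "threshold_expr": "< 0.5",
    },
    "comp_b": {
        "operator": "AND",
        "comp_a": {
            "comparison_type": "threshold",
            "feature_name": "namefrst_jw",
            "threshold": 0.79,
        },
        "comp_b": {
            "comparison_type": "threshold",
            "feature_name": "namelast_jw",
            "threshold": 0.84,
        },
    },
}

print(comparison_to_sql(config))
# (flag < 0.5 OR (namefrst_jw >= 0.79 AND namelast_jw >= 0.84))
```

Writing the condition out this way can make it easier to check that a deeply nested section expresses the intended logic before running the matching task.
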
diff --git a/docs/_sources/config.md.txt b/docs/_sources/config.md.txt
@@ -594,61 +594,95 @@ expand_length = 3
 explode = true
 ```
 
-## [Comparisons](comparison_types)
+## [Comparisons](comparisons)
 
 * Header name: `comparisons`
-* Description: A list of comparisons to threshold the potential matches on. Only potential matches that pass the thresholds will be created. See [comparison types](comparison_types) for more information.
+* Description: A set of comparisons which filter the potential matches.
+  Only record pairs which satisfy the comparisons qualify as potential matches.
+  See [comparisons](comparisons) for more information.
 * Required: True
 * Type: Object
-* Attributes:
-  * `comparison_type` -- Type: `string`. Required. See [comparison types](comparison_types) for more information.
-  * `feature_name` -- Type: `string`. Required. The `comparison_feature` to use for the comparison threshold.  A `comparison_feature` column by this name must be specified in the `comparison_features` section.
 
-```
-[comparisons]
-operator = "AND"
+There are two different forms that the comparisons table may take. It may either
+be a single comparison definition, or it may be a nested comparison definition
+with multiple sub-comparisons.
+
+### Single Comparison
+
+  * Attributes:
+    * `comparison_type` -- Type: `string`. Required. The type of the comparison.
+    Currently the only supported comparison type is `"threshold"`, which compares
+    a comparison feature to a given value.
+    * `feature_name` -- Type: `string`. Required. The comparison feature to compare
+    against.
+    * `threshold` -- Type: `Any`. Optional. The value to compare against.
+    * `threshold_expr` -- Type: `string`. Optional. A SQL condition which defines
+    the comparison on the comparison feature named by `feature_name`.
+
+  The comparison definition must contain either `threshold` or `threshold_expr`,
+  but not both. Providing `threshold = X` is equivalent to providing
+  `threshold_expr = ">= X"`.
+
+  ```
+  # Only record pairs with namefrst_jw >= 0.79 can be
+  # potential matches.
+  [comparisons]
+  comparison_type = "threshold"
+  feature_name = "namefrst_jw"
+  threshold = 0.79
+  ```
+
+  ```
+  # Only record pairs with flag < 0.5 can be potential matches.
+  [comparisons]
+  comparison_type = "threshold"
+  feature_name = "flag"
+  threshold_expr = "< 0.5"
+  ```
+
+### Multiple Comparisons
 
-[comparisons.comp_a]
-comparison_type = "threshold"
-feature_name = "namefrst_jw"
-threshold = 0.79
+* Attributes:
+  * `operator` -- Type: `string`. Required. The logical operator which connects
+  the two sub-comparisons. May be `"AND"` or `"OR"`.
+  * `comp_a` -- Type: `object`. Required. The first sub-comparison.
+  * `comp_b` -- Type: `object`. Required. The second sub-comparison.
 
-[comparisons.comp_b]
-comparison_type = "threshold"
-feature_name = "namelast_jw"
-threshold = 0.79
-```
+Both `comp_a` and `comp_b` are recursive comparison sections and may contain
+either a single comparison or another set of sub-comparisons. Please see the
+[comparisons documentation](comparisons.html#defining-multiple-comparisons) for
+more details and examples.
 
-## [Household Comparisons](comparison_types)
+## [Household Comparisons](comparisons)
 
 * Header name: `hh_comparisons`
-* Description: A list of comparisons to threshold the household potential matches on. Also referred to as post-blocking filters, as all household potential matches are created, then only potential matches that pass the post-blocking filters will be kept for scoring. See [comparison types](comparison_types) for more information.
-* Required: False
-* Type: Object
-* Attributes:
-  * `comparison_type` -- Type: `string`.  Required. See [comparison types](comparison_types) for more information.
-  * `feature_name` -- Type: `string`. Required. The `comparison_feature` to use for the comparison threshold. A `comparison_feature` column by this name must be specified in the `comparison_features` section.
-
+* Description: A set of comparisons which filter the household potential
+  matches. `hh_comparisons` has the same configuration structure as
+  `comparisons` and works in a similar way, except that it applies during the
+  `hh_matching` task instead of `matching`. You can read more about comparisons
+  [here](comparisons).
+
 ```
+# Only household record pairs with an age difference <= 10 can be
+# household potential matches.
 [hh_comparisons]
-# only keep household potential matches with an age difference less than or equal than ten years
 comparison_type = "threshold"
 feature_name = "byrdiff"
 threshold_expr = "<= 10"
 ```
 
-## [Comparison Features](comparison_types)
+## [Comparison Features](comparison_features)
 
 * Header name: `comparison_features`
-* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config.  See the [comparison types](comparison_types) section for more information.
+* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config.  See the [comparison features documentation page](comparison_features) for more information.
 * Required: True
 * Type: List
 * Attributes:
   * `alias` -- Type: `string`. Optional. The name of the comparison feature column to be generated.  If not specified, the output column will default to `column_name`.
   * `column_name` -- Type: `string`. The name of the columns to compare.
-  * `comparison_type` -- Type: `string`. The name of the comparison type to use. See the [comparison types](comparison_types) section for more information.
+  * `comparison_type` -- Type: `string`. The name of the comparison type to use.
   * `categorical` -- Type: `boolean`.  Optional.  Whether the output data should be treated as categorical data (important information used during one-hot encoding and vectorizing in the machine learning pipeline stage).
-  * Other attributes may be included as well depending on `comparison_type`.  See the [comparison types](comparison_types) section for details on each comparison type.
+  * Other attributes may be included as well depending on `comparison_type`.  See the [comparison features page](comparison_features) for details on each comparison type.
 
 ```
 [[comparison_features]]

diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt
@@ -24,7 +24,8 @@ Configuration API
    :caption: Configuration API
 
    Column Mappings <column_mappings.md>
-   Comparison Types <comparison_types.md>
+   comparisons
+   Comparison Features <comparison_features.md>
    Feature Selection <feature_selection_transforms.md>
    Pipeline Features <pipeline_features.md>
    substitutions

diff --git a/docs/_sources/link_tasks.md.txt b/docs/_sources/link_tasks.md.txt
@@ -84,7 +84,7 @@ are grouped into the same blocking bucket.
 on each record. These features may be passed to a machine learning model through the
 [`training`](config.html#training-and-models) section and/or passed to deterministic
 rules with the [`comparisons`](config.html#comparisons) section. There are many
-different [comparison types](comparison_types) available for use with
+different [comparison types](comparison_features) available for use with
 `comparison_features`.
 * [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are machine learning transformations
 useful for reshaping and interacting data before they are fed to the machine learning