[#158] Update sphinx to 8.1.3 and regenerate the docs
riley-harper committed Nov 1, 2024
1 parent 63a49d8 commit 6f7004d
Showing 51 changed files with 818 additions and 24,247 deletions.
4 changes: 2 additions & 2 deletions docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: de74adeb0864eb6d8e73600964a3e52d
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: a706061ae4b2d0ec765440a2505ca382
tags: 645f666f9bcd5a90fca523b33c5a78b7
4 changes: 4 additions & 0 deletions docs/.buildinfo.bak
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: de74adeb0864eb6d8e73600964a3e52d
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file removed docs/.doctrees/column_mapping_transforms.doctree
Binary file not shown.
Binary file removed docs/.doctrees/comparison_types.doctree
Binary file not shown.
Binary file removed docs/.doctrees/config.doctree
Binary file not shown.
Binary file removed docs/.doctrees/environment.pickle
Binary file not shown.
Binary file removed docs/.doctrees/example_workflow.doctree
Binary file not shown.
Binary file removed docs/.doctrees/feature_selection_transforms.doctree
Binary file not shown.
Binary file removed docs/.doctrees/index.doctree
Binary file not shown.
Binary file removed docs/.doctrees/installation.doctree
Binary file not shown.
Binary file removed docs/.doctrees/introduction.doctree
Binary file not shown.
Binary file removed docs/.doctrees/link_tasks.doctree
Binary file not shown.
Binary file removed docs/.doctrees/models.doctree
Binary file not shown.
Binary file removed docs/.doctrees/pipeline_features.doctree
Binary file not shown.
Binary file removed docs/.doctrees/running_the_program.doctree
Binary file not shown.
Binary file removed docs/.doctrees/substitutions.doctree
Binary file not shown.
Binary file removed docs/.doctrees/use_examples.doctree
Binary file not shown.
@@ -1,30 +1,42 @@
# Comparison types, transform add-ons, aggregate features, and household aggregate features
# Comparison Features

This page has information on the different comparison types available for the `[[comparison_features]]`
section, along with some attributes available to all of the comparison types and some aggregate features
that are not configurable.
During matching, hlink computes comparison features on each record pair which
it considers a potential match. These comparison features can be passed as
features to machine-learning algorithms or used to define
[comparisons](comparisons) which filter the `potential_matches` table.

## Comparison types
Each header below represents a comparison type. Transforms are used in the context of `comparison_features`.
Each comparison feature must have a comparison type, which tells hlink how to
compute the comparison feature. This page has information on the available
comparison types and how to configure them. It also lists some attributes
available to all comparison types and some predefined aggregate features which
do not need to be explicitly configured.

```
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
```
## Comparison Types

Each section below describes a comparison type. Each type represents a
different operation, computation, or transformation that hlink can perform on
one or more input columns. Some comparison types expect their own attributes
for additional configuration. These attributes are listed in each section,
along with an example.

### maximum_jaro_winkler
Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns. For example, given an input of `column_names = ['namefrst', 'namelast']`, it would return the maximum Jaro-Winkler name comparison value among the following four comparisons:

Finds the greatest Jaro-Winkler value among the cartesian product of multiple
columns. For example, given an input of `column_names = ['namefrst',
'namelast']`, it would return the maximum Jaro-Winkler name comparison value
among the following four comparisons:

```
[('namefrst_a', 'namefrst_b'),
('namefrst_a', 'namelast_b'),
('namelast_a', 'namefrst_b'),
('namelast_a', 'namelast_b')]
```
a.namefrst, b.namefrst
a.namefrst, b.namelast
a.namelast, b.namefrst
a.namelast, b.namelast
```

* Attributes:
* `column_names` -- Type: list of strings. Required. The list of columns used as input for the set of comparisons generated by taking the cartesian product.
* `column_names` -- Type: list of strings. Required. The list of columns
used as input for the set of comparisons, which are generated by taking the
Cartesian product of the set of input columns with itself.

```
[[comparison_features]]
125 changes: 125 additions & 0 deletions docs/_sources/comparisons.md.txt
@@ -0,0 +1,125 @@
# Comparisons

## Overview

The `comparisons` configuration section defines constraints on the matching
process. Unlike `comparison_features` and `feature_selections`, which define
features for use with a machine-learning algorithm, `comparisons` define rules
which directly filter the output `potential_matches` table. These rules often
depend on some comparison features, and hlink always applies the rules after
exploding and blocking in the matching task.

As an example, suppose that your `comparisons` configuration section looks like
the following.

```
[comparisons]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
```

This comparison defines a rule that depends on the `namefrst_jw` comparison
feature. During matching, only pairs of records with `namefrst_jw` greater than
or equal to 0.79 will be added to the potential matches table. Pairs of records
which do not satisfy the comparison will not be potential matches.
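
The filtering rule above can be sketched in a few lines of Python. This is only an illustration of the semantics described here, not hlink's implementation: the `pairs` data and the `passes_threshold` helper are hypothetical, and the comparison is written as a plain dictionary mirroring the configuration section.

```python
# Hypothetical record pairs with an already-computed namefrst_jw feature.
pairs = [
    {"id_a": 1, "id_b": 10, "namefrst_jw": 0.95},
    {"id_a": 2, "id_b": 11, "namefrst_jw": 0.79},
    {"id_a": 3, "id_b": 12, "namefrst_jw": 0.50},
]

# Dictionary form of the [comparisons] section shown above.
comparison = {
    "comparison_type": "threshold",
    "feature_name": "namefrst_jw",
    "threshold": 0.79,
}


def passes_threshold(pair, comparison):
    """Return True if the pair's feature is >= the configured threshold."""
    return pair[comparison["feature_name"]] >= comparison["threshold"]


# Only pairs that satisfy the comparison remain potential matches.
potential_matches = [p for p in pairs if passes_threshold(p, comparison)]
print([(p["id_a"], p["id_b"]) for p in potential_matches])
```

The pair with `namefrst_jw` exactly equal to 0.79 is kept, because the default threshold comparison is inclusive.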

*Note: This page focuses on the `comparisons` section in particular, but the
household comparisons section `hh_comparisons` has the same structure. It
defines rules which hlink uses to filter record pairs after household blocking
in the hh_matching task. These rules are effectively filters on the output
`hh_potential_matches` table.*

## Comparison Types

Currently the only `comparison_type` supported for the `comparisons` section is
`"threshold"`. This requires the `threshold` attribute, and by default, it
restricts a comparison feature to be greater than or equal to the value given
by `threshold`. The configuration section

```
[comparisons]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```

adds the condition `namelast_jw >= 0.84` to each record pair considered during
matching. Only record pairs which satisfy this condition are marked as
potential matches.

Hlink also supports a `threshold_expr` attribute in `comparisons` for more
flexibility. This attribute takes SQL syntax and replaces the `threshold`
attribute described above. For example, to define the condition `flag < 0.5`,
you could set `threshold_expr` like

```
[comparisons]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
```

Note that there is now no need for the `threshold` attribute because the
`threshold_expr` implicitly defines it.
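
To make the role of `threshold_expr` concrete, the sketch below joins the feature name and the expression into a SQL condition and applies it as a `WHERE` clause. It assumes that the condition is formed by simple concatenation, and it uses an in-memory SQLite table as a stand-in for the potential matches table. The table name `pairs` and its rows are made up for illustration, and hlink's actual SQL engine and query may differ.

```python
import sqlite3

# Dictionary form of the [comparisons] section shown above.
comparison = {
    "comparison_type": "threshold",
    "feature_name": "flag",
    "threshold_expr": "< 0.5",
}

# Hypothetical table of candidate record pairs with a "flag" feature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (id_a INTEGER, id_b INTEGER, flag REAL)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?, ?)",
    [(1, 10, 0.0), (2, 11, 1.0), (3, 12, 0.25)],
)

# Assumed construction: "<feature_name> <threshold_expr>".
condition = f"{comparison['feature_name']} {comparison['threshold_expr']}"
rows = conn.execute(
    f"SELECT id_a, id_b FROM pairs WHERE {condition} ORDER BY id_a"
).fetchall()
print(condition, rows)
```

Because the expression is inserted into SQL as written, it should come only from a trusted configuration file.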

## Defining Multiple Comparisons

In some cases, you may have multiple comparisons to make between record pairs.
The `comparisons` section supports this in a flexible but somewhat verbose way.
Suppose that you would like to combine two of the conditions used in the
examples above, so that record pairs are potential matches only if `namefrst_jw >= 0.79`
and `namelast_jw >= 0.84`. You could do this by setting the `operator`
attribute to `"AND"` and then defining the `comp_a` (comparison A) and `comp_b`
(comparison B) attributes.

```
[comparisons]
operator = "AND"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```

Both `comp_a` and `comp_b` are recursive, so they may have the same structure
as the `comparisons` section itself. This means that you can add as many
comparisons as you would like by recursively defining comparisons. `operator`
may be either `"AND"` or `"OR"` and defines the logic for connecting the two
sub-comparisons `comp_a` and `comp_b`. Defining more than two comparisons can
get pretty ugly and verbose, so make sure to use care when defining nested
comparisons. Here is an example of a section with three comparisons.

```
# This comparisons section defines 3 rules for potential matches.
# They are that potential matches must either have
# 1. flag < 0.5
# OR
# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84
[comparisons]
operator = "OR"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"

[comparisons.comp_b]
operator = "AND"

[comparisons.comp_b.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79

[comparisons.comp_b.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```
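
One way to reason about nested comparisons is to flatten them into a single boolean condition. The following sketch walks the structure recursively and builds a SQL-style string. The `build_condition` function is hypothetical and only mirrors the rules described on this page (`threshold` means `>=`, `threshold_expr` is used as written); it is not taken from hlink's source.

```python
def build_condition(comparison):
    """Recursively turn a comparisons dictionary into a SQL-style condition."""
    if "operator" in comparison:
        # Nested form: combine the two sub-comparisons with AND / OR.
        left = build_condition(comparison["comp_a"])
        right = build_condition(comparison["comp_b"])
        return f"({left} {comparison['operator']} {right})"
    # Single form: either an explicit expression or the default >= threshold.
    feature = comparison["feature_name"]
    if "threshold_expr" in comparison:
        return f"{feature} {comparison['threshold_expr']}"
    return f"{feature} >= {comparison['threshold']}"


# Dictionary form of the three-comparison example above.
comparisons = {
    "operator": "OR",
    "comp_a": {
        "comparison_type": "threshold",
        "feature_name": "flag",
        "threshold_expr": "< 0.5",
    },
    "comp_b": {
        "operator": "AND",
        "comp_a": {
            "comparison_type": "threshold",
            "feature_name": "namefrst_jw",
            "threshold": 0.79,
        },
        "comp_b": {
            "comparison_type": "threshold",
            "feature_name": "namelast_jw",
            "threshold": 0.84,
        },
    },
}

condition = build_condition(comparisons)
print(condition)
```

Writing out the flattened condition like this can be a quick check that the nesting in a configuration file expresses the intended logic.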
94 changes: 64 additions & 30 deletions docs/_sources/config.md.txt
@@ -594,61 +594,95 @@ expand_length = 3
explode = true
```

## [Comparisons](comparison_types)
## [Comparisons](comparisons)

* Header name: `comparisons`
* Description: A list of comparisons to threshold the potential matches on. Only potential matches that pass the thresholds will be created. See [comparison types](comparison_types) for more information.
* Description: A set of comparisons which filter the potential matches.
Only record pairs which satisfy the comparisons qualify as potential matches.
See [comparisons](comparisons) for some more information.
* Required: True
* Type: Object
* Attributes:
* `comparison_type` -- Type: `string`. Required. See [comparison types](comparison_types) for more information.
* `feature_name` -- Type: `string`. Required. The `comparison_feature` to use for the comparison threshold. A `comparison_feature` column by this name must be specified in the `comparison_features` section.

```
[comparisons]
operator = "AND"
There are two different forms that the comparisons table may take. It may either
be a single comparison definition, or it may be a nested comparison definition
with multiple sub-comparisons.

### Single Comparison

* Attributes:
* `comparison_type` -- Type: `string`. Required. The type of the comparison.
Currently the only supported comparison type is `"threshold"`, which compares
a comparison feature to a given value.
* `feature_name` -- Type: `string`. Required. The comparison feature to compare
against.
* `threshold` -- Type: `Any`. Optional. The value to compare against.
* `threshold_expr` -- Type: `string`. Optional. A SQL condition which defines
the comparison on the comparison feature named by `feature_name`.

The comparison definition must contain either `threshold` or `threshold_expr`,
but not both. Providing `threshold = X` is equivalent to providing
`threshold_expr = ">= X"`.

```
# Only record pairs with namefrst_jw >= 0.79 can be
# potential matches.
[comparisons]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
```

```
# Only record pairs with flag < 0.5 can be potential matches.
[comparisons]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
```
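
The "either `threshold` or `threshold_expr`, but not both" rule can be checked before running a link. The validator below is a hypothetical sketch based only on the attribute descriptions in this section; the function name and error messages are assumptions, and hlink's own configuration validation may behave differently.

```python
def validate_single_comparison(comparison):
    """Check a single comparison definition against the rules listed above."""
    if comparison.get("comparison_type") != "threshold":
        raise ValueError("comparison_type must be 'threshold'")
    if "feature_name" not in comparison:
        raise ValueError("feature_name is required")
    has_threshold = "threshold" in comparison
    has_expr = "threshold_expr" in comparison
    # Exactly one of the two attributes must be present.
    if has_threshold == has_expr:
        raise ValueError("provide exactly one of threshold or threshold_expr")


# The two example definitions above both pass.
validate_single_comparison(
    {"comparison_type": "threshold", "feature_name": "namefrst_jw", "threshold": 0.79}
)
validate_single_comparison(
    {"comparison_type": "threshold", "feature_name": "flag", "threshold_expr": "< 0.5"}
)
```

A definition that sets both attributes, or neither, raises `ValueError` in this sketch.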

### Multiple Comparisons

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
* Attributes:
* `operator` -- Type: `string`. Required. The logical operator which connects
the two sub-comparisons. May be `"AND"` or `"OR"`.
* `comp_a` -- Type: `object`. Required. The first sub-comparison.
* `comp_b` -- Type: `object`. Required. The second sub-comparison.

[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.79
```
Both `comp_a` and `comp_b` are recursive comparison sections and may contain
either a single comparison or another set of sub-comparisons. Please see the
[comparisons documentation](comparisons.html#defining-multiple-comparisons) for
more details and examples.

## [Household Comparisons](comparison_types)
## [Household Comparisons](comparisons)

* Header name: `hh_comparisons`
* Description: A list of comparisons to threshold the household potential matches on. Also referred to as post-blocking filters, as all household potential matches are created, then only potential matches that pass the post-blocking filters will be kept for scoring. See [comparison types](comparison_types) for more information.
* Required: False
* Type: Object
* Attributes:
* `comparison_type` -- Type: `string`. Required. See [comparison types](comparison_types) for more information.
* `feature_name` -- Type: `string`. Required. The `comparison_feature` to use for the comparison threshold. A `comparison_feature` column by this name must be specified in the `comparison_features` section.

* Description: A set of comparisons which filter the household potential
matches. `hh_comparisons` has the same configuration structure as
`comparisons` and works in a similar way, except that it applies during the
`hh_matching` task instead of `matching`. You can read more about comparisons
[here](comparisons).

```
# Only household record pairs with an age difference <= 10 can be
# household potential matches.
[hh_comparisons]
# only keep household potential matches with an age difference less than or equal than ten years
comparison_type = "threshold"
feature_name = "byrdiff"
threshold_expr = "<= 10"
```

## [Comparison Features](comparison_types)
## [Comparison Features](comparison_features)

* Header name: `comparison_features`
* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config. See the [comparison types](comparison_types) section for more information.
* Description: A list of comparison features to create when comparing records. Comparisons for individual and household linking rounds are both represented here -- no need to duplicate comparisons if used in both rounds, simply specify the `column_name` in the appropriate `training` or `hh_training` section of the config. See the [comparison features documentation page](comparison_features) for more information.
* Required: True
* Type: List
* Attributes:
* `alias` -- Type: `string`. Optional. The name of the comparison feature column to be generated. If not specified, the output column will default to `column_name`.
* `column_name` -- Type: `string`. The name of the columns to compare.
* `comparison_type` -- Type: `string`. The name of the comparison type to use. See the [comparison types](comparison_types) section for more information.
* `comparison_type` -- Type: `string`. The name of the comparison type to use.
* `categorical` -- Type: `boolean`. Optional. Whether the output data should be treated as categorical data (important information used during one-hot encoding and vectorizing in the machine learning pipeline stage).
* Other attributes may be included as well depending on `comparison_type`. See the [comparison types](comparison_types) section for details on each comparison type.
* Other attributes may be included as well depending on `comparison_type`. See the [comparison features page](comparison_features) for details on each comparison type.

```
[[comparison_features]]
3 changes: 2 additions & 1 deletion docs/_sources/index.rst.txt
@@ -24,7 +24,8 @@ Configuration API
:caption: Configuration API

Column Mappings <column_mappings.md>
Comparison Types <comparison_types.md>
comparisons
Comparison Features <comparison_features.md>
Feature Selection <feature_selection_transforms.md>
Pipeline Features <pipeline_features.md>
substitutions
2 changes: 1 addition & 1 deletion docs/_sources/link_tasks.md.txt
@@ -84,7 +84,7 @@ are grouped into the same blocking bucket.
on each record. These features may be passed to a machine learning model through the
[`training`](config.html#training-and-models) section and/or passed to deterministic
rules with the [`comparisons`](config.html#comparisons) section. There are many
different [comparison types](comparison_types) available for use with
different [comparison types](comparison_features) available for use with
`comparison_features`.
* [`pipeline_features`](pipeline_features.html#pipeline-generated-features) are machine learning transformations
useful for reshaping and interacting data before they are fed to the machine learning