diff --git a/sphinx-docs/comparisons.md b/sphinx-docs/comparisons.md
new file mode 100644
index 0000000..38e2517
--- /dev/null
+++ b/sphinx-docs/comparisons.md
@@ -0,0 +1,119 @@
+# Comparisons
+
+## Overview
+
+The `comparisons` configuration section defines constraints on the matching
+process. Unlike `comparison_features` and `feature_selections`, which define
+features for use with a machine-learning algorithm, `comparisons` define rules
+which directly filter the output `potential_matches` table. These rules often
+depend on some comparison features, and hlink always applies the rules after
+exploding and blocking in the matching task.
+
+As an example, suppose that your `comparisons` configuration section looks like
+the following.
+
+```
+[comparisons]
+comparison_type = "threshold"
+feature_name = "namefrst_jw"
+threshold = 0.79
+```
+
+This comparison defines a rule that depends on the `namefrst_jw` comparison
+feature. During matching, only pairs of records with `namefrst_jw` greater than
+or equal to 0.79 will be added to the potential matches table. Pairs of records
+which do not satisfy the comparison will not be potential matches.
+
+## Comparison Types
+
+Currently the only `comparison_type` supported for the `comparisons` section is
+`"threshold"`. This requires the `threshold` attribute, and by default, it
+restricts a comparison feature to be greater than or equal to the value given
+by `threshold`. The configuration section
+
+```
+[comparisons]
+comparison_type = "threshold"
+feature_name = "namelast_jw"
+threshold = 0.84
+```
+
+adds the condition `namelast_jw >= 0.84` to each record pair considered during
+matching. Only record pairs which satisfy this condition are marked as
+potential matches.
+
+Hlink also supports a `threshold_expr` attribute in `comparisons` for more
+flexibility. This attribute takes SQL syntax and replaces the `threshold`
+attribute described above. For example, to define the condition `flag < 0.5`,
+you could set `threshold_expr` like
+
+```
+[comparisons]
+comparison_type = "threshold"
+feature_name = "flag"
+threshold_expr = "< 0.5"
+```
+
+Note that there is now no need for the `threshold` attribute because the
+`threshold_expr` implicitly defines it.
+
+## Defining Multiple Comparisons
+
+In some cases, you may have multiple comparisons to make between record pairs.
+The `comparisons` section supports this in a flexible but somewhat verbose way.
+Suppose that you would like to combine two of the conditions used in the
+examples above, so that record pairs are potential matches only if `namefrst_jw >= 0.79`
+and `namelast_jw >= 0.84`. You could do this by setting the `operator`
+attribute to `"AND"` and then defining the `comp_a` (comparison A) and `comp_b`
+(comparison B) attributes.
+
+```
+[comparisons]
+operator = "AND"
+
+[comparisons.comp_a]
+comparison_type = "threshold"
+feature_name = "namefrst_jw"
+threshold = 0.79
+
+[comparisons.comp_b]
+comparison_type = "threshold"
+feature_name = "namelast_jw"
+threshold = 0.84
+```
+
+Both `comp_a` and `comp_b` are recursive, so they may have the same structure
+as the `comparisons` section itself. This means that you can add as many
+comparisons as you would like by recursively defining comparisons. `operator`
+may be either `"AND"` or `"OR"` and defines the logic for connecting the two
+sub-comparisons `comp_a` and `comp_b`. Defining more than two comparisons can
+get pretty ugly and verbose, so make sure to use care when defining nested
+comparisons. Here is an example of a section with three comparisons.
+
+```
+# This comparisons section defines 3 rules for potential matches.
+# They are that potential matches must either have
+# 1. flag < 0.5
+# OR
+# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84
+[comparisons]
+operator = "OR"
+
+[comparisons.comp_a]
+comparison_type = "threshold"
+feature_name = "flag"
+threshold_expr = "< 0.5"
+
+[comparisons.comp_b]
+operator = "AND"
+
+[comparisons.comp_b.comp_a]
+comparison_type = "threshold"
+feature_name = "namefrst_jw"
+threshold = 0.79
+
+[comparisons.comp_b.comp_b]
+comparison_type = "threshold"
+feature_name = "namelast_jw"
+threshold = 0.84
+```
diff --git a/sphinx-docs/index.rst b/sphinx-docs/index.rst
index efdb528..f2efb53 100644
--- a/sphinx-docs/index.rst
+++ b/sphinx-docs/index.rst
@@ -24,6 +24,7 @@ Configuration API
    :caption: Configuration API
 
    Column Mappings
+   comparisons
    Comparison Types
    Feature Selection
    Pipeline Features