[#158] Add a rough draft of a comparisons documentation page
1 parent c855921, commit 0e3c85e
Showing 2 changed files with 120 additions and 0 deletions.
@@ -0,0 +1,119 @@
# Comparisons

## Overview

The `comparisons` configuration section defines constraints on the matching
process. Unlike `comparison_features` and `feature_selections`, which define
features for use with a machine-learning algorithm, `comparisons` defines rules
that directly filter the output `potential_matches` table. These rules often
depend on comparison features, and hlink always applies them after
exploding and blocking in the matching task.

As an example, suppose that your `comparisons` configuration section looks like
the following.

```
[comparisons]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
```

This comparison defines a rule that depends on the `namefrst_jw` comparison
feature. During matching, only pairs of records with `namefrst_jw` greater than
or equal to 0.79 are added to the potential matches table. Pairs of records
that do not satisfy the comparison are not potential matches.

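To make the effect of this rule concrete, here is a minimal Python sketch that applies the same threshold to a few invented record pairs. It is not hlink's implementation, and it says nothing about how hlink evaluates the rule internally. The `record_pairs` data and the `passes_threshold` helper are hypothetical names used only for illustration.

```python
# Illustrative only: hypothetical record pairs with a precomputed
# namefrst_jw comparison feature. The values are made up.
record_pairs = [
    {"id_a": 1, "id_b": 10, "namefrst_jw": 0.95},
    {"id_a": 2, "id_b": 11, "namefrst_jw": 0.79},
    {"id_a": 3, "id_b": 12, "namefrst_jw": 0.60},
]

# The rule from the [comparisons] section above, written as a dictionary.
comparison = {
    "comparison_type": "threshold",
    "feature_name": "namefrst_jw",
    "threshold": 0.79,
}


def passes_threshold(pair, comparison):
    """Return True if the pair's feature value is >= the configured threshold."""
    return pair[comparison["feature_name"]] >= comparison["threshold"]


# Keep only the pairs that satisfy the rule.
potential_matches = [p for p in record_pairs if passes_threshold(p, comparison)]
print([(p["id_a"], p["id_b"]) for p in potential_matches])
```

The pair with `namefrst_jw` equal to 0.79 is kept because the comparison is inclusive, while the pair at 0.60 is dropped.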
## Comparison Types

Currently the only `comparison_type` supported for the `comparisons` section is
`"threshold"`. This requires the `threshold` attribute, and by default, it
restricts a comparison feature to be greater than or equal to the value given
by `threshold`. The configuration section

```
[comparisons]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```

adds the condition `namelast_jw >= 0.84` to each record pair considered during
matching. Only record pairs that satisfy this condition are marked as
potential matches.

hlink also supports a `threshold_expr` attribute in `comparisons` for more
flexibility. This attribute takes SQL syntax and replaces the `threshold`
attribute described above. For example, to define the condition `flag < 0.5`,
you could set `threshold_expr` as follows.

```
[comparisons]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
```

Note that the `threshold` attribute is no longer needed here because
`threshold_expr` supplies both the comparison operator and the value.

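The difference between `threshold` and `threshold_expr` can be sketched in Python as follows. This is a simplified illustration rather than hlink's own code: the `make_check` helper and the `OPERATORS` table are hypothetical, and the sketch assumes that `threshold_expr` has the simple form of one comparison operator followed by a number. Real SQL expressions can be more complex than this.

```python
import operator

# Comparison operators this sketch understands. Longer symbols come first
# so that "<=" is not mistaken for "<".
OPERATORS = [
    ("<=", operator.le),
    (">=", operator.ge),
    ("!=", operator.ne),
    ("<", operator.lt),
    (">", operator.gt),
    ("=", operator.eq),
]


def make_check(comparison):
    """Build a function that tests one feature value against a comparison.

    Assumption: threshold_expr looks like "<operator> <number>".
    """
    if "threshold_expr" in comparison:
        expr = comparison["threshold_expr"].strip()
        for symbol, func in OPERATORS:
            if expr.startswith(symbol):
                value = float(expr[len(symbol):])
                return lambda feature: func(feature, value)
        raise ValueError(f"unsupported threshold_expr: {expr!r}")
    # Default behavior: the feature must be >= threshold.
    threshold = comparison["threshold"]
    return lambda feature: feature >= threshold


flag_check = make_check({"feature_name": "flag", "threshold_expr": "< 0.5"})
name_check = make_check({"feature_name": "namelast_jw", "threshold": 0.84})

print(flag_check(0.0), flag_check(1.0))
print(name_check(0.9), name_check(0.5))
```

Here `flag_check` accepts values below 0.5, and `name_check` falls back to the default greater-than-or-equal test because only `threshold` is set.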
## Defining Multiple Comparisons

In some cases, you may have multiple comparisons to make between record pairs.
The `comparisons` section supports this in a flexible but somewhat verbose way.
Suppose that you would like to combine two of the conditions used in the
examples above, so that record pairs are potential matches only if `namefrst_jw >= 0.79`
and `namelast_jw >= 0.84`. You can do this by setting the `operator`
attribute to `"AND"` and then defining the `comp_a` (comparison A) and `comp_b`
(comparison B) attributes.

```
[comparisons]
operator = "AND"
[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```

Both `comp_a` and `comp_b` are recursive, so they may have the same structure
as the `comparisons` section itself. This means that you can define as many
comparisons as you like by nesting them. `operator`
may be either `"AND"` or `"OR"` and defines the logic for connecting the two
sub-comparisons `comp_a` and `comp_b`. Sections with more than two comparisons
quickly become verbose and hard to read, so take care when defining nested
comparisons. Here is an example of a section with three comparisons.

```
# This comparisons section defines 3 rules for potential matches.
# They are that potential matches must either have
# 1. flag < 0.5
# OR
# 2. namefrst_jw >= 0.79 AND 3. namelast_jw >= 0.84
[comparisons]
operator = "OR"
[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "flag"
threshold_expr = "< 0.5"
[comparisons.comp_b]
operator = "AND"
[comparisons.comp_b.comp_a]
comparison_type = "threshold"
feature_name = "namefrst_jw"
threshold = 0.79
[comparisons.comp_b.comp_b]
comparison_type = "threshold"
feature_name = "namelast_jw"
threshold = 0.84
```
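One way to check how a nested section reads is to flatten it into a single condition. The Python sketch below does this recursively for the three-comparison example, with the configuration written out by hand as a dictionary. The `build_condition` function is a hypothetical helper for illustration and is not part of hlink. The parentheses it adds are a choice of this sketch, made so that the grouping of `AND` and `OR` stays explicit.

```python
def build_condition(comparisons):
    """Recursively turn a comparisons dictionary into one SQL-like condition."""
    if "comp_a" in comparisons:
        # Nested case: combine the two sub-comparisons with the operator.
        left = build_condition(comparisons["comp_a"])
        right = build_condition(comparisons["comp_b"])
        return f"({left} {comparisons['operator']} {right})"
    # Leaf case: a single threshold comparison.
    feature = comparisons["feature_name"]
    if "threshold_expr" in comparisons:
        return f"{feature} {comparisons['threshold_expr']}"
    return f"{feature} >= {comparisons['threshold']}"


# The three-comparison example above, written as a nested dictionary.
comparisons = {
    "operator": "OR",
    "comp_a": {
        "comparison_type": "threshold",
        "feature_name": "flag",
        "threshold_expr": "< 0.5",
    },
    "comp_b": {
        "operator": "AND",
        "comp_a": {
            "comparison_type": "threshold",
            "feature_name": "namefrst_jw",
            "threshold": 0.79,
        },
        "comp_b": {
            "comparison_type": "threshold",
            "feature_name": "namelast_jw",
            "threshold": 0.84,
        },
    },
}

print(build_condition(comparisons))
# prints: (flag < 0.5 OR (namefrst_jw >= 0.79 AND namelast_jw >= 0.84))
```

Writing the flattened condition next to a nested section, as the comment in the example does, makes it easier to confirm that the nesting expresses the intended logic.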