Commit 0370afb

Merge pull request #138 from ipums/or_in_blocking
Support OR conditions in blocking
2 parents bd69a9e + f569256 commit 0370afb

13 files changed, 441 additions and 7 deletions

docs/_sources/config.md.txt

Lines changed: 12 additions & 0 deletions
@@ -568,6 +568,18 @@ expression = "sex == 1"
568568
* `dataset` -- Type: `string`. Optional. Must be `a` or `b` and used in conjunction with `explode`. Will only explode the column from the `a` or `b` dataset when specified.
569569
* `derived_from` -- Type: `string`. Used in conjunction with `explode = true`. Specifies an input column from the existing dataset to be exploded.
570570
* `expand_length` -- Type: `integer`. When `explode` is used on a column that is an integer, this can be specified to create an array with a range of integer values from (`original_value` minus `expand_length`) to (`original_value` plus `expand_length`). For example, if the input column value for birthyr is 1870, explode is true, and the expand_length is 3, the exploded column birthyr_3 value would be the array [1867, 1868, 1869, 1870, 1871, 1872, 1873].
571+
* `or_group` -- Type: `string`. Optional. The "OR group" to which this
572+
blocking table belongs. Blocking tables that belong to the same OR group
573+
are joined by OR in the blocking condition instead of AND. By default each
574+
blocking table belongs to a different OR group. For example, suppose that
575+
your dataset has 3 possible birthplaces BPL1, BPL2, and BPL3 for each
576+
record. If you don't provide OR groups when blocking on each BPL variable,
577+
then you will get a blocking condition like `(a.BPL1 = b.BPL1) AND (a.BPL2
578+
= b.BPL2) AND (a.BPL3 = b.BPL3)`. But if you set `or_group = "BPL"` for
579+
each of the 3 variables, then you will get a blocking condition like this
580+
instead: `(a.BPL1 = b.BPL1 OR a.BPL2 = b.BPL2 OR a.BPL3 = b.BPL3)`. Note
581+
the parentheses around the entire OR group condition. Other OR groups would
582+
be connected to the BPL OR group with an AND condition.
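
The grouping rule described in this bullet can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `group_blocking_columns` and `render_blocking_condition` are hypothetical and are not part of hlink's API, and the extra `SEX` column is added just to show how an ungrouped column is ANDed with the BPL OR group.

```python
from collections import defaultdict
from typing import Any


def group_blocking_columns(blocking: list[dict[str, Any]]) -> list[list[str]]:
    """Group blocking column names by their optional or_group value.

    Hypothetical helper: columns sharing an or_group land in one list,
    and each column without an or_group gets a list of its own.
    Explicit groups come first, then the implicit single-column groups.
    """
    explicit: defaultdict[str, list[str]] = defaultdict(list)
    implicit: list[list[str]] = []
    for table in blocking:
        or_group = table.get("or_group")
        if or_group is None:
            implicit.append([table["column_name"]])
        else:
            explicit[or_group].append(table["column_name"])
    return list(explicit.values()) + implicit


def render_blocking_condition(or_groups: list[list[str]]) -> str:
    """Join columns within a group by OR and the groups themselves by AND."""
    clauses = []
    for group in or_groups:
        ors = " OR ".join(f"a.{column} = b.{column}" for column in group)
        clauses.append(f"({ors})")
    return " AND ".join(clauses)


# Illustrative blocking section: three birthplace columns in one OR group,
# plus one ungrouped column (an assumption made for this example).
blocking = [
    {"column_name": "BPL1", "or_group": "BPL"},
    {"column_name": "BPL2", "or_group": "BPL"},
    {"column_name": "BPL3", "or_group": "BPL"},
    {"column_name": "SEX"},
]

condition = render_blocking_condition(group_blocking_columns(blocking))
print(condition)
# (a.BPL1 = b.BPL1 OR a.BPL2 = b.BPL2 OR a.BPL3 = b.BPL3) AND (a.SEX = b.SEX)
```

Because every column without an `or_group` forms its own single-column group, a config that never sets `or_group` still produces a pure AND condition, as it did before this change.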
571583

572584

573585
```

docs/config.html

Lines changed: 11 additions & 0 deletions
@@ -619,6 +619,17 @@ <h2>Blocking<a class="headerlink" href="#blocking" title="Link to this heading">
619619
<li><p><code class="docutils literal notranslate"><span class="pre">dataset</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Optional. Must be <code class="docutils literal notranslate"><span class="pre">a</span></code> or <code class="docutils literal notranslate"><span class="pre">b</span></code> and used in conjunction with <code class="docutils literal notranslate"><span class="pre">explode</span></code>. Will only explode the column from the <code class="docutils literal notranslate"><span class="pre">a</span></code> or <code class="docutils literal notranslate"><span class="pre">b</span></code> dataset when specified.</p></li>
620620
<li><p><code class="docutils literal notranslate"><span class="pre">derived_from</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Used in conjunction with <code class="docutils literal notranslate"><span class="pre">explode</span> <span class="pre">=</span> <span class="pre">true</span></code>. Specifies an input column from the existing dataset to be exploded.</p></li>
621621
<li><p><code class="docutils literal notranslate"><span class="pre">expand_length</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">integer</span></code>. When <code class="docutils literal notranslate"><span class="pre">explode</span></code> is used on a column that is an integer, this can be specified to create an array with a range of integer values from (<code class="docutils literal notranslate"><span class="pre">original_value</span></code> minus <code class="docutils literal notranslate"><span class="pre">expand_length</span></code>) to (<code class="docutils literal notranslate"><span class="pre">original_value</span></code> plus <code class="docutils literal notranslate"><span class="pre">expand_length</span></code>). For example, if the input column value for birthyr is 1870, explode is true, and the expand_length is 3, the exploded column birthyr_3 value would be the array [1867, 1868, 1869, 1870, 1871, 1872, 1873].</p></li>
622+
<li><p><code class="docutils literal notranslate"><span class="pre">or_group</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Optional. The “OR group” to which this
623+
blocking table belongs. Blocking tables that belong to the same OR group
624+
are joined by OR in the blocking condition instead of AND. By default each
625+
blocking table belongs to a different OR group. For example, suppose that
626+
your dataset has 3 possible birthplaces BPL1, BPL2, and BPL3 for each
627+
record. If you don’t provide OR groups when blocking on each BPL variable,
628+
then you will get a blocking condition like <code class="docutils literal notranslate"><span class="pre">(a.BPL1</span> <span class="pre">=</span> <span class="pre">b.BPL1)</span> <span class="pre">AND</span> <span class="pre">(a.BPL2</span> <span class="pre">=</span> <span class="pre">b.BPL2)</span> <span class="pre">AND</span> <span class="pre">(a.BPL3</span> <span class="pre">=</span> <span class="pre">b.BPL3)</span></code>. But if you set <code class="docutils literal notranslate"><span class="pre">or_group</span> <span class="pre">=</span> <span class="pre">&quot;BPL&quot;</span></code> for
629+
each of the 3 variables, then you will get a blocking condition like this
630+
instead: <code class="docutils literal notranslate"><span class="pre">(a.BPL1</span> <span class="pre">=</span> <span class="pre">b.BPL1</span> <span class="pre">OR</span> <span class="pre">a.BPL2</span> <span class="pre">=</span> <span class="pre">b.BPL2</span> <span class="pre">OR</span> <span class="pre">a.BPL3</span> <span class="pre">=</span> <span class="pre">b.BPL3)</span></code>. Note
631+
the parentheses around the entire OR group condition. Other OR groups would
632+
be connected to the BPL OR group with an AND condition.</p></li>
622633
</ul>
623634
</li>
624635
</ul>

docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

hlink/linking/matching/link_step_explode.py

Lines changed: 13 additions & 3 deletions
@@ -3,11 +3,13 @@
 # in this project's top-level directory, and also on-line at:
 # https://github.com/ipums/hlink

+from typing import Any
+
+from pyspark.sql import Column, DataFrame
 from pyspark.sql.functions import array, explode, col

 import hlink.linking.core.comparison as comparison_core
 from . import _helpers as matching_helpers
-
 from hlink.linking.link_step import LinkStep


@@ -64,7 +66,15 @@ def _run(self):
6466
),
6567
)
6668

67-
def _explode(self, df, comparisons, comparison_features, blocking, id_column, is_a):
69+
def _explode(
70+
self,
71+
df: DataFrame,
72+
comparisons: dict[str, Any],
73+
comparison_features: list[dict[str, Any]],
74+
blocking: list[dict[str, Any]],
75+
id_column: str,
76+
is_a: bool,
77+
) -> DataFrame:
6878
# comp_feature_names, dist_features_to_run, feature_columns = comparison_core.get_feature_specs_from_comp(
6979
# comparisons, comparison_features
7080
# )
@@ -159,7 +169,7 @@ def _explode(self, df, comparisons, comparison_features, blocking, id_column, is
159169
exploded_df = exploded_df.select(explode_selects)
160170
return exploded_df
161171

162-
def _expand(self, column_name, expand_length):
172+
def _expand(self, column_name: str, expand_length: int) -> Column:
163173
return array(
164174
[
165175
col(column_name).cast("int") + i

hlink/linking/matching/link_step_match.py

Lines changed: 47 additions & 1 deletion
@@ -3,7 +3,9 @@
 # in this project's top-level directory, and also on-line at:
 # https://github.com/ipums/hlink

+from collections import defaultdict
 import logging
+from typing import Any

 import hlink.linking.core.comparison_feature as comparison_feature_core
 import hlink.linking.core.dist_table as dist_table_core
@@ -14,6 +16,50 @@
 from hlink.linking.link_step import LinkStep


+def extract_or_groups_from_blocking(blocking: list[dict[str, Any]]) -> list[list[str]]:
+    """
+    Extract a list of "or_groups" from the blocking section of the config. Each
+    blocking table may have an or_group attribute. When two or more tables have
+    the same value for or_group, they belong to the same or_group and will be
+    connected by ORs in the potential_matches SQL query instead of by ANDs.
+    Tables without an explicit or_group belong to their own or_group.
+
+    For example, the blocking section
+
+    ```
+    [[blocking]]
+    column_name = "AGE1"
+    or_group = "AGE"
+
+    [[blocking]]
+    column_name = "AGE2"
+    or_group = "AGE"
+
+    [[blocking]]
+    column_name = "BPL"
+    ```
+
+    Would give the SQL condition
+
+    ```
+    (a.AGE1 = b.AGE1 OR a.AGE2 = b.AGE2) AND (a.BPL = b.BPL)
+    ```
+
+    This function returns a list of or_groups, each of which is a list of
+    column names. It maintains the input order except that the implicit
+    or_groups are all placed after the explicit or_groups.
+    """
+    or_groups: defaultdict[str | None, list[str]] = defaultdict(list)
+
+    for blocking_table in blocking:
+        column_name = blocking_table["column_name"]
+        or_group = blocking_table.get("or_group")
+        or_groups[or_group].append(column_name)
+
+    implicit_or_groups = [[column_name] for column_name in or_groups.pop(None, [])]
+    return list(or_groups.values()) + implicit_or_groups
+
+
 class LinkStepMatch(LinkStep):
     def __init__(self, task):
         super().__init__(
@@ -46,7 +92,7 @@ def _run(self):
4692
config["id_column"],
4793
)
4894

49-
t_ctx["blocking_columns"] = [bc["column_name"] for bc in blocking]
95+
t_ctx["blocking_columns"] = extract_or_groups_from_blocking(blocking)
5096

5197
blocking_exploded_columns = [
5298
bc["column_name"] for bc in blocking if "explode" in bc and bc["explode"]

hlink/linking/matching/templates/potential_matches.sql

Lines changed: 2 additions & 2 deletions
@@ -15,8 +15,8 @@ SELECT DISTINCT
 {% endif %}
 FROM exploded_df_a a
 JOIN exploded_df_b b ON
-{% for col in blocking_columns %}
-a.{{ col }} = b.{{ col }} {{ "AND" if not loop.last }}
+{% for or_group in blocking_columns %}
+({% for col in or_group %}a.{{ col }} = b.{{ col }}{{ " OR " if not loop.last }}{% endfor %}) {{ "AND" if not loop.last }}
 {% endfor %}
 {% if distance_table %}
 {% for d in distance_table %}

hlink/tests/conftest.py

Lines changed: 57 additions & 0 deletions
@@ -1160,6 +1160,63 @@ def blocking_explode_conf(spark, conf):
11601160
return conf
11611161

11621162

1163+
@pytest.fixture(scope="function")
1164+
def blocking_or_groups_conf(spark, conf):
1165+
"""
1166+
For testing the or_groups blocking functionality.
1167+
"""
1168+
conf["column_mappings"] = [
1169+
{"column_name": "namefrst"},
1170+
{"column_name": "namelast"},
1171+
{"column_name": "birthyr"},
1172+
{"column_name": "sex"},
1173+
{"column_name": "bpl1"},
1174+
{"column_name": "bpl2"},
1175+
{"column_name": "bpl3"},
1176+
]
1177+
1178+
conf["blocking"] = [
1179+
{
1180+
"column_name": "birthyr_3",
1181+
"dataset": "a",
1182+
"derived_from": "birthyr",
1183+
"expand_length": 3,
1184+
"explode": True,
1185+
"or_group": "birthyr",
1186+
},
1187+
{"column_name": "sex"},
1188+
{"column_name": "bpl1", "or_group": "bpl"},
1189+
{"column_name": "bpl2", "or_group": "bpl"},
1190+
{"column_name": "bpl3", "or_group": "bpl"},
1191+
]
1192+
conf["comparison_features"] = [
1193+
{
1194+
"alias": "namefrst_jw",
1195+
"column_name": "namefrst",
1196+
"comparison_type": "jaro_winkler",
1197+
},
1198+
{
1199+
"alias": "namelast_jw",
1200+
"column_name": "namelast",
1201+
"comparison_type": "jaro_winkler",
1202+
},
1203+
]
1204+
conf["comparisons"] = {
1205+
"comp_a": {
1206+
"feature_name": "namefrst_jw",
1207+
"threshold": 0.8,
1208+
"comparison_type": "threshold",
1209+
},
1210+
"comp_b": {
1211+
"feature_name": "namelast_jw",
1212+
"threshold": 0.8,
1213+
"comparison_type": "threshold",
1214+
},
1215+
"operator": "AND",
1216+
}
1217+
return conf
1218+
1219+
11631220
@pytest.fixture(scope="function")
11641221
def matching_household_conf(
11651222
spark, conf, datasource_real_households, preprocessing, matching
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
1+
id,namefrst,namelast,birthyr,sex,bpl1,bpl2,bpl3
2+
b5689d06-edd3-498e-8b5b-e04f2fa2f2a9,Catherine,Beebe,1866,2,,,
3+
a7118f06-949d-4d02-be0a-db33a6f8f3a8,Frances E,Bird,1870,2,,,
4+
85d089c0-b907-4d9c-95ab-c5fa4a3dd2bb,J S,Luff,1861,1,,,
5+
cddd9455-48e0-4b48-89a5-9ee315e00087,John,Smith,1884,1,,,
6+
8cb74256-6dfa-4d17-913a-59fa646c388a,Saml H,Russell,1833,1,,,
7+
1f8e1a74-d486-44ad-8d5c-51aedf86208e,Charles,Robertson,1884,1,,,
8+
61a1590f-1d3a-4666-8406-3d4aaf0770b4,John,Dickinson,1868,1,,,
9+
92277f0b-1476-41f5-9dc8-bf83672616d0,Joseph,Shissler,1874,1,,,
10+
322291a1-de91-439d-bba0-45fc2f47a2eb,David,Hall,1839,1,,,
11+
136f7105-ff59-4eac-9d95-44b002cbb448,John,Decame,1858,1,,,
12+
1138ab41-e234-4c72-b812-eaaf0fc5f76c,Nancy,Decame,1857,2,,,
13+
066ea4e1-f340-4231-b505-ec7bb9a07103,Peter N,Decame,1895,1,,,
14+
b7d96336-404e-490c-8c45-61f2287b52ff,Annam,Decame,1897,2,,,
15+
24bdff6a-5590-4494-8e8a-ac4a549c8890,Sarah,Decame,1900,2,,,
16+
c1fedaab-f026-4aa4-9320-e10f2432d539,James,Carney,1888,1,,,
17+
43a6ebe5-752b-4054-818d-6f6f75cc89e7,Alfred,Dell,1883,1,,,
18+
0d693015-2349-4363-9667-45036af7d0db,Chas,Syaex,1870,1,,,
19+
1d586e26-aac1-49df-a2ad-fe0a385a26bf,Sarah,Russell,1897,2,,,
20+
93b7ac89-f9db-49b2-a1f2-c189fecc14ae,Wm H,Hazard,1881,1,,,
21+
e51c36c9-570c-466d-aac1-bf380c9c20f1,Martha,Hazard,1880,2,,,
22+
9250341a-8336-494a-bc84-2b803efe64c6,Willie May,Hazard,1902,2,,,
23+
a70679f0-9313-4ef3-bf87-5dfe81beed5d,Samuel,Hazard,1906,2,,,
24+
4715bbf6-d3e2-4260-9ddd-6aece147e5c1,Samuel,Morgan,1878,1,,,
25+
77378570-5214-4ac5-8258-c5156e8b99b3,J Clauson,Mcfarland,1890,1,,,
26+
6542b541-6e10-411f-9b2a-7c0b93b0aa68,Eugene,Mcfarland,1892,1,,,
27+
396c4077-6a70-4a17-97fb-f8a0c06fdafe,Anna,Preston,1871,2,,,
28+
7e9dde5e-3fad-4b2e-b367-643c0dc8cabb,Rebecca N,Alexander,1861,2,,,
29+
f7d9e25f-c390-4222-ac24-4e93d72daa05,Martha,Ellis,1873,2,,,
30+
24b7afa1-8c49-4833-8292-c545c85d3b89,Otillia,Zeider,1876,2,,,
31+
4b416874-0c5c-4233-81ec-39223bc66f4f,Mary,Doyle,1846,2,,,
32+
a499b0dc-7ac0-4d61-b493-91a3036c712e ,ANNIE ,FAUBLE ,1884,2,1,,
33+
ae7261c3-7d71-4ea1-997f-5d1a68c18777 ,MARY ,REESE ,1875,2,,,
34+
ad6442b5-42bc-4c2e-a517-5a951d989a92 ,MARY ,REESE ,1899,2,1,2,3
35+
b0b6695f-dfa5-4e4d-bc75-798c27195fff ,SALLY ,REESE ,1901,2,,,
36+
9e807937-de09-414c-bfb2-ac821e112929 ,JOHN ,SHIELDS ,1880,1,1,,
37+
426f2cbe-32e1-45eb-9f86-89a2b9116b7e ,ANNE ,FAUBLE ,1884,2,,,
38+
a76697d9-b0c8-4774-bc3e-12a7e403c7e6 ,JOHN ,COLLINS ,1893,1,,,
39+
3575c9ba-1527-4ca2-aff0-d7c2d1efb421 ,MAGGIE ,COLLINS ,1894,2,,,
40+
49e53dbc-fe8e-4e55-8cb9-a1d93c284d98 ,MARY ,COLLINS ,1898,2,,,
41+
50b33ef6-259d-43af-8cdc-56a61f881169 ,WILLIAM H. ,SEWARD ,1856,1,,4,
42+
952754a5-48b4-462a-ac57-e4a059a9ef98 ,ESTHER ,BIERHAHN ,1870,2,,,
43+
ea6d77b3-2e2d-4c59-a0ac-6b297e8898e3 ,CHARLES ,CLEVELAND ,1865,1,,,
44+
60a5052e-6d67-455a-a3aa-bb79560c7d8d ,SUSAN ,WILSON ,1850,2,,,
45+
0d4472ec-6378-4aeb-b6c7-17e1c388bb94 ,ARCHER ,HARVEY ,1890,1,,,
46+
65ccbeb7-2c79-4fb0-b354-c67f150ad80c ,ELIZABETH ,MC LEAN ,1868,2,,,
47+
72cbe5fa-f558-4393-8423-1842fadf7f11 ,MARY A. ,FLEMMING ,1837,2,,,
48+
44693008-fd6f-48fe-9c52-e6c07baff361 ,BESSIE ,CHAMBERS ,1908,2,,,
49+
bcc0988e-2397-4f1b-8e76-4bfe1b05dbc6 ,THOMAS ,GRAHAM ,1846,1,,,
50+
a7b10530-b7c9-44d5-9125-c603f392d6d3 ,EDWARD ,DEKAY ,1875,1,,,
51+
1e635c1c-7faa-4270-acf3-a22635884b90 ,NATHEN ,THORPE ,1836,1,,,
52+
d3217545-3453-4d96-86c0-d6a3e60fb2f8 ,JOB ,FOSTER ,1884,1,,,
53+
2a35bae5-3120-4e2c-87da-694d4419c9ce ,JEZEBEL ,FOSTER ,1888,2,,,
54+
94460fc2-954b-469d-9726-f7126c30e5e2 ,ELIZA ,GOODWIN ,1871,2,,,
55+
620b6ebb-82e6-42db-8aae-300ca2be0c00 ,MARY ,GOODWIN ,1893,2,,,
56+
bfe1080e-2e67-4a8c-a6e1-ed94ea103712 ,JO ,GOODWIN ,1895,1,,6,7
57+
7fb55d25-2a7d-486d-9efa-27b9d7e60c24 ,PHINEAS ,TAYLOR ,1871,1,,5,
58+
a0f33b36-cef7-4949-a031-22b90f1055d4 ,MARY A. ,LORD ,1856,2,,,1
59+
1a76745c-acf8-48a0-9992-7fb10c11710b ,E.B. ,ALLEN ,1889,1,,,
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
id,namefrst,namelast,birthyr,sex,bpl1,bpl2,bpl3
2+
a499b0dc-7ac0-4d61-b493-91a3036c712e ,ANNIE ,FAUBLE ,1884,2,1,,
3+
ae7261c3-7d71-4ea1-997f-5d1a68c18777 ,MARY ,REESE ,1875,2,,,
4+
ad6442b5-42bc-4c2e-a517-5a951d989a92 ,MARY ,REESE ,1902,2,1,2,3
5+
9e807937-de09-414c-bfb2-ac821e112929 ,JOHN ,SHIELDS ,1889,1,1,,
6+
426f2cbe-32e1-45eb-9f86-89a2b9116b7e ,ANNE ,FAUBLE ,1884,2,,,
7+
a76697d9-b0c8-4774-bc3e-12a7e403c7e6 ,JOHN ,COLLINS ,1893,1,,,
8+
3575c9ba-1527-4ca2-aff0-d7c2d1efb421 ,MAGGIE ,COLLINS ,1894,2,,,
9+
49e53dbc-fe8e-4e55-8cb9-a1d93c284d98 ,MARY ,COLLINS ,1898,2,,,
10+
50b33ef6-259d-43af-8cdc-56a61f881169 ,WILLIAM H. ,SEWARD ,1866,1,,4,
11+
952754a5-48b4-462a-ac57-e4a059a9ef98 ,ESTHER ,BIERHAHN ,1870,2,,,
12+
ea6d77b3-2e2d-4c59-a0ac-6b297e8898e3 ,CHARLES ,CLEVELAND ,1865,1,,,
13+
60a5052e-6d67-455a-a3aa-bb79560c7d8d ,SUSAN ,WILSON ,1850,2,,,
14+
0d4472ec-6378-4aeb-b6c7-17e1c388bb94 ,ARCHER ,HARVEY ,1893,1,,,
15+
65ccbeb7-2c79-4fb0-b354-c67f150ad80c ,ELIZABETH ,MC LEAN ,1868,2,,,
16+
72cbe5fa-f558-4393-8423-1842fadf7f11 ,MARY A. ,FLEMMING ,1842,2,,,
17+
bcc0988e-2397-4f1b-8e76-4bfe1b05dbc6 ,THOMAS ,GRAHAM ,1846,1,,,
18+
a7b10530-b7c9-44d5-9125-c603f392d6d3 ,EDWARD ,DEKAY ,1875,1,,,
19+
1e635c1c-7faa-4270-acf3-a22635884b90 ,NATHEN ,THORPE ,1836,1,,,
20+
d3217545-3453-4d96-86c0-d6a3e60fb2f8 ,JOB ,FOSTER ,1884,1,,,
21+
2a35bae5-3120-4e2c-87da-694d4419c9ce ,JEZEBEL ,FOSTER ,1888,2,,,
22+
94460fc2-954b-469d-9726-f7126c30e5e2 ,ELIZA ,GOODWIN ,1871,2,,,
23+
620b6ebb-82e6-42db-8aae-300ca2be0c00 ,MARY ,GOODWIN ,1893,2,,,
24+
bfe1080e-2e67-4a8c-a6e1-ed94ea103712 ,JO ,GOODWIN ,1890,1,,6,7
25+
7fb55d25-2a7d-486d-9efa-27b9d7e60c24 ,PHINEAS ,TAYLOR ,1871,1,,5,
26+
a0f33b36-cef7-4949-a031-22b90f1055d4 ,MARY A. ,LORD ,1856,2,,,1
27+
1a76745c-acf8-48a0-9992-7fb10c11710b ,E.B. ,ALLEN ,1889,1,,,
