Commit

Merge pull request #138 from ipums/or_in_blocking
Support OR conditions in blocking
riley-harper authored Jun 18, 2024
2 parents bd69a9e + f569256 commit 0370afb
Showing 13 changed files with 441 additions and 7 deletions.
12 changes: 12 additions & 0 deletions docs/_sources/config.md.txt
@@ -568,6 +568,18 @@ expression = "sex == 1"
* `dataset` -- Type: `string`. Optional. Must be `a` or `b` and used in conjunction with `explode`. Will only explode the column from the `a` or `b` dataset when specified.
* `derived_from` -- Type: `string`. Used in conjunction with `explode = true`. Specifies an input column from the existing dataset to be exploded.
* `expand_length` -- Type: `integer`. When `explode` is used on a column that is an integer, this can be specified to create an array with a range of integer values from (`original_value` minus `expand_length`) to (`original_value` plus `expand_length`). For example, if the input column value for birthyr is 1870, explode is true, and the expand_length is 3, the exploded column birthyr_3 value would be the array [1867, 1868, 1869, 1870, 1871, 1872, 1873].
* `or_group` -- Type: `string`. Optional. The "OR group" to which this
blocking table belongs. Blocking tables that belong to the same OR group
are joined by OR in the blocking condition instead of AND. By default each
blocking table belongs to a different OR group. For example, suppose that
your dataset has 3 possible birthplaces BPL1, BPL2, and BPL3 for each
record. If you don't provide OR groups when blocking on each BPL variable,
then you will get a blocking condition like `(a.BPL1 = b.BPL1) AND (a.BPL2
= b.BPL2) AND (a.BPL3 = b.BPL3)`. But if you set `or_group = "BPL"` for
each of the 3 variables, then you will get a blocking condition like this
instead: `(a.BPL1 = b.BPL1 OR a.BPL2 = b.BPL2 OR a.BPL3 = b.BPL3)`. Note
the parentheses around the entire OR group condition. Other OR groups would
be connected to the BPL OR group with an AND condition.
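
To make the difference between the two conditions concrete, here is a minimal pure-Python sketch that evaluates a blocking condition on one pair of records. It is not hlink code: the helper name `passes_blocking` and the sample records are hypothetical, and only the column names come from the example above. It also assumes that a missing value never matches, which roughly mirrors how SQL treats NULL in an equality comparison.

```python
from typing import Any

Record = dict[str, Any]


def passes_blocking(a: Record, b: Record, or_groups: list[list[str]]) -> bool:
    """Return True if records a and b satisfy the blocking condition.

    Columns inside one OR group are combined with OR; the groups themselves
    are combined with AND. A missing (None) value never matches.
    """

    def column_matches(col: str) -> bool:
        return a[col] is not None and a[col] == b[col]

    return all(any(column_matches(col) for col in group) for group in or_groups)


# Hypothetical records: only the second birthplace agrees.
rec_a = {"BPL1": 10, "BPL2": 20, "BPL3": None}
rec_b = {"BPL1": 11, "BPL2": 20, "BPL3": None}

# No or_group set: every column is its own group, so all three must match.
without_or_groups = [["BPL1"], ["BPL2"], ["BPL3"]]
# or_group = "BPL" on all three: a single group, so one match is enough.
with_or_group = [["BPL1", "BPL2", "BPL3"]]

print(passes_blocking(rec_a, rec_b, without_or_groups))  # False
print(passes_blocking(rec_a, rec_b, with_or_group))  # True
```

With separate groups the pair is rejected because BPL1 differs; with a shared group the agreement on BPL2 is enough for the pair to become a potential match.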


```
11 changes: 11 additions & 0 deletions docs/config.html
@@ -619,6 +619,17 @@ <h2>Blocking<a class="headerlink" href="#blocking" title="Link to this heading">
<li><p><code class="docutils literal notranslate"><span class="pre">dataset</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Optional. Must be <code class="docutils literal notranslate"><span class="pre">a</span></code> or <code class="docutils literal notranslate"><span class="pre">b</span></code> and used in conjunction with <code class="docutils literal notranslate"><span class="pre">explode</span></code>. Will only explode the column from the <code class="docutils literal notranslate"><span class="pre">a</span></code> or <code class="docutils literal notranslate"><span class="pre">b</span></code> dataset when specified.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">derived_from</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Used in conjunction with <code class="docutils literal notranslate"><span class="pre">explode</span> <span class="pre">=</span> <span class="pre">true</span></code>. Specifies an input column from the existing dataset to be exploded.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">expand_length</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">integer</span></code>. When <code class="docutils literal notranslate"><span class="pre">explode</span></code> is used on a column that is an integer, this can be specified to create an array with a range of integer values from (<code class="docutils literal notranslate"><span class="pre">original_value</span></code> minus <code class="docutils literal notranslate"><span class="pre">expand_length</span></code>) to (<code class="docutils literal notranslate"><span class="pre">original_value</span></code> plus <code class="docutils literal notranslate"><span class="pre">expand_length</span></code>). For example, if the input column value for birthyr is 1870, explode is true, and the expand_length is 3, the exploded column birthyr_3 value would be the array [1867, 1868, 1869, 1870, 1871, 1872, 1873].</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">or_group</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Optional. The “OR group” to which this
blocking table belongs. Blocking tables that belong to the same OR group
are joined by OR in the blocking condition instead of AND. By default each
blocking table belongs to a different OR group. For example, suppose that
your dataset has 3 possible birthplaces BPL1, BPL2, and BPL3 for each
record. If you don’t provide OR groups when blocking on each BPL variable,
then you will get a blocking condition like <code class="docutils literal notranslate"><span class="pre">(a.BPL1</span> <span class="pre">=</span> <span class="pre">b.BPL1)</span> <span class="pre">AND</span> <span class="pre">(a.BPL2</span> <span class="pre">=</span> <span class="pre">b.BPL2)</span> <span class="pre">AND</span> <span class="pre">(a.BPL3</span> <span class="pre">=</span> <span class="pre">b.BPL3)</span></code>. But if you set <code class="docutils literal notranslate"><span class="pre">or_group</span> <span class="pre">=</span> <span class="pre">&quot;BPL&quot;</span></code> for
each of the 3 variables, then you will get a blocking condition like this
instead: <code class="docutils literal notranslate"><span class="pre">(a.BPL1</span> <span class="pre">=</span> <span class="pre">b.BPL1</span> <span class="pre">OR</span> <span class="pre">a.BPL2</span> <span class="pre">=</span> <span class="pre">b.BPL2</span> <span class="pre">OR</span> <span class="pre">a.BPL3</span> <span class="pre">=</span> <span class="pre">b.BPL3)</span></code>. Note
the parentheses around the entire OR group condition. Other OR groups would
be connected to the BPL OR group with an AND condition.</p></li>
</ul>
</li>
</ul>
2 changes: 1 addition & 1 deletion docs/searchindex.js

Large diffs are not rendered by default.

16 changes: 13 additions & 3 deletions hlink/linking/matching/link_step_explode.py
@@ -3,11 +3,13 @@
# in this project's top-level directory, and also on-line at:
# https://github.com/ipums/hlink

from typing import Any

from pyspark.sql import Column, DataFrame
from pyspark.sql.functions import array, explode, col

import hlink.linking.core.comparison as comparison_core
from . import _helpers as matching_helpers

from hlink.linking.link_step import LinkStep


@@ -64,7 +66,15 @@ def _run(self):
),
)

-    def _explode(self, df, comparisons, comparison_features, blocking, id_column, is_a):
+    def _explode(
+        self,
+        df: DataFrame,
+        comparisons: dict[str, Any],
+        comparison_features: list[dict[str, Any]],
+        blocking: list[dict[str, Any]],
+        id_column: str,
+        is_a: bool,
+    ) -> DataFrame:
# comp_feature_names, dist_features_to_run, feature_columns = comparison_core.get_feature_specs_from_comp(
# comparisons, comparison_features
# )
@@ -159,7 +169,7 @@ def _explode(self, df, comparisons, comparison_features, blocking, id_column, is
exploded_df = exploded_df.select(explode_selects)
return exploded_df

-    def _expand(self, column_name, expand_length):
+    def _expand(self, column_name: str, expand_length: int) -> Column:
return array(
[
col(column_name).cast("int") + i
48 changes: 47 additions & 1 deletion hlink/linking/matching/link_step_match.py
@@ -3,7 +3,9 @@
# in this project's top-level directory, and also on-line at:
# https://github.com/ipums/hlink

from collections import defaultdict
import logging
from typing import Any

import hlink.linking.core.comparison_feature as comparison_feature_core
import hlink.linking.core.dist_table as dist_table_core
@@ -14,6 +16,50 @@
from hlink.linking.link_step import LinkStep


def extract_or_groups_from_blocking(blocking: list[dict[str, Any]]) -> list[list[str]]:
    """
    Extract a list of "or_groups" from the blocking section of the config. Each
    blocking table may have an or_group attribute. When two or more tables have
    the same value for or_group, they belong to the same or_group and will be
    connected by ORs in the potential_matches SQL query instead of by ANDs.
    Tables without an explicit or_group belong to their own or_group.

    For example, the blocking section

    ```
    [[blocking]]
    column_name = "AGE1"
    or_group = "AGE"

    [[blocking]]
    column_name = "AGE2"
    or_group = "AGE"

    [[blocking]]
    column_name = "BPL"
    ```

    Would give the SQL condition

    ```
    (a.AGE1 = b.AGE1 OR a.AGE2 = b.AGE2) AND (a.BPL = b.BPL)
    ```

    This function returns a list of or_groups, each of which is a list of
    column names. It maintains the input order except that the implicit
    or_groups are all placed after the explicit or_groups.
    """
    or_groups: defaultdict[str | None, list[str]] = defaultdict(list)

    for blocking_table in blocking:
        column_name = blocking_table["column_name"]
        or_group = blocking_table.get("or_group")
        or_groups[or_group].append(column_name)

    implicit_or_groups = [[column_name] for column_name in or_groups.pop(None, [])]
    return list(or_groups.values()) + implicit_or_groups
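
A quick way to see the ordering behavior described in the docstring is to call the function on plain dicts. The sketch below repeats the function's logic from this diff, slightly condensed, so that it runs on its own. The two sample blocking lists are illustrative; the first mirrors the docstring example.

```python
from collections import defaultdict
from typing import Any


def extract_or_groups_from_blocking(blocking: list[dict[str, Any]]) -> list[list[str]]:
    # Same logic as the function added in this diff, repeated here so the
    # example is self-contained.
    or_groups = defaultdict(list)
    for blocking_table in blocking:
        or_groups[blocking_table.get("or_group")].append(blocking_table["column_name"])
    # Tables without an or_group were collected under the key None; each one
    # becomes its own single-column group.
    implicit_or_groups = [[column_name] for column_name in or_groups.pop(None, [])]
    return list(or_groups.values()) + implicit_or_groups


# The example from the docstring, written as Python dicts.
docstring_example = [
    {"column_name": "AGE1", "or_group": "AGE"},
    {"column_name": "AGE2", "or_group": "AGE"},
    {"column_name": "BPL"},
]
print(extract_or_groups_from_blocking(docstring_example))
# [['AGE1', 'AGE2'], ['BPL']]

# An implicit group listed first still ends up after the explicit ones.
reordered = [
    {"column_name": "sex"},
    {"column_name": "bpl1", "or_group": "bpl"},
    {"column_name": "bpl2", "or_group": "bpl"},
]
print(extract_or_groups_from_blocking(reordered))
# [['bpl1', 'bpl2'], ['sex']]
```

Because AND is commutative, moving the implicit groups to the end changes only the order of the generated condition, not its meaning.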


class LinkStepMatch(LinkStep):
def __init__(self, task):
super().__init__(
@@ -46,7 +92,7 @@ def _run(self):
config["id_column"],
)

-        t_ctx["blocking_columns"] = [bc["column_name"] for bc in blocking]
+        t_ctx["blocking_columns"] = extract_or_groups_from_blocking(blocking)

blocking_exploded_columns = [
bc["column_name"] for bc in blocking if "explode" in bc and bc["explode"]
4 changes: 2 additions & 2 deletions hlink/linking/matching/templates/potential_matches.sql
@@ -15,8 +15,8 @@ SELECT DISTINCT
{% endif %}
FROM exploded_df_a a
JOIN exploded_df_b b ON
-{% for col in blocking_columns %}
-a.{{ col }} = b.{{ col }} {{ "AND" if not loop.last }}
+{% for or_group in blocking_columns %}
+({% for col in or_group %}a.{{ col }} = b.{{ col }}{{ " OR " if not loop.last }}{% endfor %}) {{ "AND" if not loop.last }}
{% endfor %}
{% if distance_table %}
{% for d in distance_table %}
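
The nested loop in the updated template can be approximated in plain Python. This is only a sketch of the string the blocking part of the `ON` clause should contain. It does not use Jinja, the helper name `render_blocking_condition` is made up for illustration, and the whitespace differs from what the real template emits.

```python
def render_blocking_condition(or_groups: list[list[str]]) -> str:
    """Build the blocking part of the JOIN ... ON clause as a string."""
    rendered_groups = []
    for or_group in or_groups:
        # Inner loop of the template: columns within a group are joined by OR.
        comparisons = [f"a.{col} = b.{col}" for col in or_group]
        rendered_groups.append("(" + " OR ".join(comparisons) + ")")
    # Outer loop of the template: the parenthesized groups are joined by AND.
    return " AND ".join(rendered_groups)


print(render_blocking_condition([["AGE1", "AGE2"], ["BPL"]]))
# (a.AGE1 = b.AGE1 OR a.AGE2 = b.AGE2) AND (a.BPL = b.BPL)
```

The parentheses around each group matter because AND binds more tightly than OR in SQL; without them the condition would be grouped differently.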
57 changes: 57 additions & 0 deletions hlink/tests/conftest.py
@@ -1160,6 +1160,63 @@ def blocking_explode_conf(spark, conf):
return conf


@pytest.fixture(scope="function")
def blocking_or_groups_conf(spark, conf):
    """
    For testing the or_groups blocking functionality.
    """
    conf["column_mappings"] = [
        {"column_name": "namefrst"},
        {"column_name": "namelast"},
        {"column_name": "birthyr"},
        {"column_name": "sex"},
        {"column_name": "bpl1"},
        {"column_name": "bpl2"},
        {"column_name": "bpl3"},
    ]

    conf["blocking"] = [
        {
            "column_name": "birthyr_3",
            "dataset": "a",
            "derived_from": "birthyr",
            "expand_length": 3,
            "explode": True,
            "or_group": "birthyr",
        },
        {"column_name": "sex"},
        {"column_name": "bpl1", "or_group": "bpl"},
        {"column_name": "bpl2", "or_group": "bpl"},
        {"column_name": "bpl3", "or_group": "bpl"},
    ]
    conf["comparison_features"] = [
        {
            "alias": "namefrst_jw",
            "column_name": "namefrst",
            "comparison_type": "jaro_winkler",
        },
        {
            "alias": "namelast_jw",
            "column_name": "namelast",
            "comparison_type": "jaro_winkler",
        },
    ]
    conf["comparisons"] = {
        "comp_a": {
            "feature_name": "namefrst_jw",
            "threshold": 0.8,
            "comparison_type": "threshold",
        },
        "comp_b": {
            "feature_name": "namelast_jw",
            "threshold": 0.8,
            "comparison_type": "threshold",
        },
        "operator": "AND",
    }
    return conf
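
The `birthyr_3` entry in this fixture relies on `expand_length = 3`. As a rough pure-Python illustration of the range that the docs in this diff give for `expand_length` (1870 with an `expand_length` of 3), consider the sketch below. It is not the Spark code in `_expand`, and the helper name `expand` is hypothetical. It also assumes that, with `dataset = "a"`, a record from dataset b blocks with a record from dataset a when b's birth year appears in a's expanded array. The two birth-year pairs are taken from the test CSVs added in this commit.

```python
def expand(value: int, expand_length: int) -> list[int]:
    """Return every integer within expand_length of value, inclusive."""
    return [value + i for i in range(-expand_length, expand_length + 1)]


print(expand(1870, 3))
# [1867, 1868, 1869, 1870, 1871, 1872, 1873]

# MARY REESE: birthyr 1899 in dataset a, 1902 in dataset b.
print(1902 in expand(1899, 3))  # True: within 3 years
# JOHN SHIELDS: birthyr 1880 in dataset a, 1889 in dataset b.
print(1889 in expand(1880, 3))  # False: 9 years apart
```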


@pytest.fixture(scope="function")
def matching_household_conf(
spark, conf, datasource_real_households, preprocessing, matching
59 changes: 59 additions & 0 deletions hlink/tests/input_data/matching_or_group_test_a.csv
@@ -0,0 +1,59 @@
id,namefrst,namelast,birthyr,sex,bpl1,bpl2,bpl3
b5689d06-edd3-498e-8b5b-e04f2fa2f2a9,Catherine,Beebe,1866,2,,,
a7118f06-949d-4d02-be0a-db33a6f8f3a8,Frances E,Bird,1870,2,,,
85d089c0-b907-4d9c-95ab-c5fa4a3dd2bb,J S,Luff,1861,1,,,
cddd9455-48e0-4b48-89a5-9ee315e00087,John,Smith,1884,1,,,
8cb74256-6dfa-4d17-913a-59fa646c388a,Saml H,Russell,1833,1,,,
1f8e1a74-d486-44ad-8d5c-51aedf86208e,Charles,Robertson,1884,1,,,
61a1590f-1d3a-4666-8406-3d4aaf0770b4,John,Dickinson,1868,1,,,
92277f0b-1476-41f5-9dc8-bf83672616d0,Joseph,Shissler,1874,1,,,
322291a1-de91-439d-bba0-45fc2f47a2eb,David,Hall,1839,1,,,
136f7105-ff59-4eac-9d95-44b002cbb448,John,Decame,1858,1,,,
1138ab41-e234-4c72-b812-eaaf0fc5f76c,Nancy,Decame,1857,2,,,
066ea4e1-f340-4231-b505-ec7bb9a07103,Peter N,Decame,1895,1,,,
b7d96336-404e-490c-8c45-61f2287b52ff,Annam,Decame,1897,2,,,
24bdff6a-5590-4494-8e8a-ac4a549c8890,Sarah,Decame,1900,2,,,
c1fedaab-f026-4aa4-9320-e10f2432d539,James,Carney,1888,1,,,
43a6ebe5-752b-4054-818d-6f6f75cc89e7,Alfred,Dell,1883,1,,,
0d693015-2349-4363-9667-45036af7d0db,Chas,Syaex,1870,1,,,
1d586e26-aac1-49df-a2ad-fe0a385a26bf,Sarah,Russell,1897,2,,,
93b7ac89-f9db-49b2-a1f2-c189fecc14ae,Wm H,Hazard,1881,1,,,
e51c36c9-570c-466d-aac1-bf380c9c20f1,Martha,Hazard,1880,2,,,
9250341a-8336-494a-bc84-2b803efe64c6,Willie May,Hazard,1902,2,,,
a70679f0-9313-4ef3-bf87-5dfe81beed5d,Samuel,Hazard,1906,2,,,
4715bbf6-d3e2-4260-9ddd-6aece147e5c1,Samuel,Morgan,1878,1,,,
77378570-5214-4ac5-8258-c5156e8b99b3,J Clauson,Mcfarland,1890,1,,,
6542b541-6e10-411f-9b2a-7c0b93b0aa68,Eugene,Mcfarland,1892,1,,,
396c4077-6a70-4a17-97fb-f8a0c06fdafe,Anna,Preston,1871,2,,,
7e9dde5e-3fad-4b2e-b367-643c0dc8cabb,Rebecca N,Alexander,1861,2,,,
f7d9e25f-c390-4222-ac24-4e93d72daa05,Martha,Ellis,1873,2,,,
24b7afa1-8c49-4833-8292-c545c85d3b89,Otillia,Zeider,1876,2,,,
4b416874-0c5c-4233-81ec-39223bc66f4f,Mary,Doyle,1846,2,,,
a499b0dc-7ac0-4d61-b493-91a3036c712e ,ANNIE ,FAUBLE ,1884,2,1,,
ae7261c3-7d71-4ea1-997f-5d1a68c18777 ,MARY ,REESE ,1875,2,,,
ad6442b5-42bc-4c2e-a517-5a951d989a92 ,MARY ,REESE ,1899,2,1,2,3
b0b6695f-dfa5-4e4d-bc75-798c27195fff ,SALLY ,REESE ,1901,2,,,
9e807937-de09-414c-bfb2-ac821e112929 ,JOHN ,SHIELDS ,1880,1,1,,
426f2cbe-32e1-45eb-9f86-89a2b9116b7e ,ANNE ,FAUBLE ,1884,2,,,
a76697d9-b0c8-4774-bc3e-12a7e403c7e6 ,JOHN ,COLLINS ,1893,1,,,
3575c9ba-1527-4ca2-aff0-d7c2d1efb421 ,MAGGIE ,COLLINS ,1894,2,,,
49e53dbc-fe8e-4e55-8cb9-a1d93c284d98 ,MARY ,COLLINS ,1898,2,,,
50b33ef6-259d-43af-8cdc-56a61f881169 ,WILLIAM H. ,SEWARD ,1856,1,,4,
952754a5-48b4-462a-ac57-e4a059a9ef98 ,ESTHER ,BIERHAHN ,1870,2,,,
ea6d77b3-2e2d-4c59-a0ac-6b297e8898e3 ,CHARLES ,CLEVELAND ,1865,1,,,
60a5052e-6d67-455a-a3aa-bb79560c7d8d ,SUSAN ,WILSON ,1850,2,,,
0d4472ec-6378-4aeb-b6c7-17e1c388bb94 ,ARCHER ,HARVEY ,1890,1,,,
65ccbeb7-2c79-4fb0-b354-c67f150ad80c ,ELIZABETH ,MC LEAN ,1868,2,,,
72cbe5fa-f558-4393-8423-1842fadf7f11 ,MARY A. ,FLEMMING ,1837,2,,,
44693008-fd6f-48fe-9c52-e6c07baff361 ,BESSIE ,CHAMBERS ,1908,2,,,
bcc0988e-2397-4f1b-8e76-4bfe1b05dbc6 ,THOMAS ,GRAHAM ,1846,1,,,
a7b10530-b7c9-44d5-9125-c603f392d6d3 ,EDWARD ,DEKAY ,1875,1,,,
1e635c1c-7faa-4270-acf3-a22635884b90 ,NATHEN ,THORPE ,1836,1,,,
d3217545-3453-4d96-86c0-d6a3e60fb2f8 ,JOB ,FOSTER ,1884,1,,,
2a35bae5-3120-4e2c-87da-694d4419c9ce ,JEZEBEL ,FOSTER ,1888,2,,,
94460fc2-954b-469d-9726-f7126c30e5e2 ,ELIZA ,GOODWIN ,1871,2,,,
620b6ebb-82e6-42db-8aae-300ca2be0c00 ,MARY ,GOODWIN ,1893,2,,,
bfe1080e-2e67-4a8c-a6e1-ed94ea103712 ,JO ,GOODWIN ,1895,1,,6,7
7fb55d25-2a7d-486d-9efa-27b9d7e60c24 ,PHINEAS ,TAYLOR ,1871,1,,5,
a0f33b36-cef7-4949-a031-22b90f1055d4 ,MARY A. ,LORD ,1856,2,,,1
1a76745c-acf8-48a0-9992-7fb10c11710b ,E.B. ,ALLEN ,1889,1,,,
27 changes: 27 additions & 0 deletions hlink/tests/input_data/matching_or_group_test_b.csv
@@ -0,0 +1,27 @@
id,namefrst,namelast,birthyr,sex,bpl1,bpl2,bpl3
a499b0dc-7ac0-4d61-b493-91a3036c712e ,ANNIE ,FAUBLE ,1884,2,1,,
ae7261c3-7d71-4ea1-997f-5d1a68c18777 ,MARY ,REESE ,1875,2,,,
ad6442b5-42bc-4c2e-a517-5a951d989a92 ,MARY ,REESE ,1902,2,1,2,3
9e807937-de09-414c-bfb2-ac821e112929 ,JOHN ,SHIELDS ,1889,1,1,,
426f2cbe-32e1-45eb-9f86-89a2b9116b7e ,ANNE ,FAUBLE ,1884,2,,,
a76697d9-b0c8-4774-bc3e-12a7e403c7e6 ,JOHN ,COLLINS ,1893,1,,,
3575c9ba-1527-4ca2-aff0-d7c2d1efb421 ,MAGGIE ,COLLINS ,1894,2,,,
49e53dbc-fe8e-4e55-8cb9-a1d93c284d98 ,MARY ,COLLINS ,1898,2,,,
50b33ef6-259d-43af-8cdc-56a61f881169 ,WILLIAM H. ,SEWARD ,1866,1,,4,
952754a5-48b4-462a-ac57-e4a059a9ef98 ,ESTHER ,BIERHAHN ,1870,2,,,
ea6d77b3-2e2d-4c59-a0ac-6b297e8898e3 ,CHARLES ,CLEVELAND ,1865,1,,,
60a5052e-6d67-455a-a3aa-bb79560c7d8d ,SUSAN ,WILSON ,1850,2,,,
0d4472ec-6378-4aeb-b6c7-17e1c388bb94 ,ARCHER ,HARVEY ,1893,1,,,
65ccbeb7-2c79-4fb0-b354-c67f150ad80c ,ELIZABETH ,MC LEAN ,1868,2,,,
72cbe5fa-f558-4393-8423-1842fadf7f11 ,MARY A. ,FLEMMING ,1842,2,,,
bcc0988e-2397-4f1b-8e76-4bfe1b05dbc6 ,THOMAS ,GRAHAM ,1846,1,,,
a7b10530-b7c9-44d5-9125-c603f392d6d3 ,EDWARD ,DEKAY ,1875,1,,,
1e635c1c-7faa-4270-acf3-a22635884b90 ,NATHEN ,THORPE ,1836,1,,,
d3217545-3453-4d96-86c0-d6a3e60fb2f8 ,JOB ,FOSTER ,1884,1,,,
2a35bae5-3120-4e2c-87da-694d4419c9ce ,JEZEBEL ,FOSTER ,1888,2,,,
94460fc2-954b-469d-9726-f7126c30e5e2 ,ELIZA ,GOODWIN ,1871,2,,,
620b6ebb-82e6-42db-8aae-300ca2be0c00 ,MARY ,GOODWIN ,1893,2,,,
bfe1080e-2e67-4a8c-a6e1-ed94ea103712 ,JO ,GOODWIN ,1890,1,,6,7
7fb55d25-2a7d-486d-9efa-27b9d7e60c24 ,PHINEAS ,TAYLOR ,1871,1,,5,
a0f33b36-cef7-4949-a031-22b90f1055d4 ,MARY A. ,LORD ,1856,2,,,1
1a76745c-acf8-48a0-9992-7fb10c11710b ,E.B. ,ALLEN ,1889,1,,,