Commit f10b822

Merge pull request #147 from ipums/add_tests
Add tests to cover several untested sections of code
2 parents 54d4820 + e4c9941 commit f10b822

9 files changed: +405 -47 lines changed

docs/_sources/feature_selection_transforms.md.txt

Lines changed: 23 additions & 13 deletions
@@ -1,16 +1,26 @@
-# Feature Selection transforms
-
-Each header below represents a feature selection transform. These transforms are used in the context of `feature_selections`.
-
-```
-[[feature_selections]]
-input_column = "clean_birthyr"
-output_column = "replaced_birthyr"
-condition = "case when clean_birthyr is null or clean_birthyr == '' then year - age else clean_birthyr end"
-transform = "sql_condition"
-```
-
-There are some additional attributes available for all transforms: `checkpoint`, `override_column_a`, `override_column_b`, `set_value_column_a`, `set_value_column_b`.
+# Feature Selection Transforms
+
+Each feature selection in the `[[feature_selections]]` list must have a
+`transform` attribute which tells hlink which transform it uses. The available
+feature selection transforms are listed below. The attributes of the feature
+selection often vary with the feature selection transform. However, there are a
+few utility attributes which are available for all transforms:
+
+- `override_column_a` - Type: `string`. Optional. Given the name of a column in
+dataset A, copy that column into the output column instead of computing the
+feature selection for dataset A. This does not affect dataset B.
+- `override_column_b` - Type: `string`. Optional. Given the name of a column in
+dataset B, copy that column into the output column instead of computing the
+feature selection for dataset B. This does not affect dataset A.
+- `set_value_column_a` - Type: any. Optional. Instead of computing the feature
+selection for dataset A, use the given value for every row in the output
+column. This does not affect dataset B.
+- `set_value_column_b` - Type: any. Optional. Instead of computing the feature
+selection for dataset B, use the given value for every row in the output
+column. This does not affect dataset A.
+- `checkpoint` - Type: `boolean`. Optional. If set to true, checkpoint the
+dataset in Spark before computing the feature selection. This can reduce some
+resource usage for very complex workflows, but should not be necessary.
 
 ## bigrams
 

docs/feature_selection_transforms.html

Lines changed: 24 additions & 11 deletions
@@ -5,7 +5,7 @@
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
 
-<title>Feature Selection transforms &#8212; hlink 3.6.1 documentation</title>
+<title>Feature Selection Transforms &#8212; hlink 3.6.1 documentation</title>
 <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d1102ebc" />
 <link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=12dfc556" />
 <script src="_static/documentation_options.js?v=f731707b"></script>
@@ -33,16 +33,29 @@
 <div class="body" role="main">
 
 <section id="feature-selection-transforms">
-<h1>Feature Selection transforms<a class="headerlink" href="#feature-selection-transforms" title="Link to this heading"></a></h1>
-<p>Each header below represents a feature selection transform. These transforms are used in the context of <code class="docutils literal notranslate"><span class="pre">feature_selections</span></code>.</p>
-<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[[</span><span class="n">feature_selections</span><span class="p">]]</span>
-<span class="n">input_column</span> <span class="o">=</span> <span class="s2">&quot;clean_birthyr&quot;</span>
-<span class="n">output_column</span> <span class="o">=</span> <span class="s2">&quot;replaced_birthyr&quot;</span>
-<span class="n">condition</span> <span class="o">=</span> <span class="s2">&quot;case when clean_birthyr is null or clean_birthyr == &#39;&#39; then year - age else clean_birthyr end&quot;</span>
-<span class="n">transform</span> <span class="o">=</span> <span class="s2">&quot;sql_condition&quot;</span>
-</pre></div>
-</div>
-<p>There are some additional attributes available for all transforms: <code class="docutils literal notranslate"><span class="pre">checkpoint</span></code>, <code class="docutils literal notranslate"><span class="pre">override_column_a</span></code>, <code class="docutils literal notranslate"><span class="pre">override_column_b</span></code>, <code class="docutils literal notranslate"><span class="pre">set_value_column_a</span></code>, <code class="docutils literal notranslate"><span class="pre">set_value_column_b</span></code>.</p>
+<h1>Feature Selection Transforms<a class="headerlink" href="#feature-selection-transforms" title="Link to this heading"></a></h1>
+<p>Each feature selection in the <code class="docutils literal notranslate"><span class="pre">[[feature_selections]]</span></code> list must have a
+<code class="docutils literal notranslate"><span class="pre">transform</span></code> attribute which tells hlink which transform it uses. The available
+feature selection transforms are listed below. The attributes of the feature
+selection often vary with the feature selection transform. However, there are a
+few utility attributes which are available for all transforms:</p>
+<ul class="simple">
+<li><p><code class="docutils literal notranslate"><span class="pre">override_column_a</span></code> - Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Optional. Given the name of a column in
+dataset A, copy that column into the output column instead of computing the
+feature selection for dataset A. This does not affect dataset B.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">override_column_b</span></code> - Type: <code class="docutils literal notranslate"><span class="pre">string</span></code>. Optional. Given the name of a column in
+dataset B, copy that column into the output column instead of computing the
+feature selection for dataset B. This does not affect dataset A.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">set_value_column_a</span></code> - Type: any. Optional. Instead of computing the feature
+selection for dataset A, use the given value for every row in the output
+column. This does not affect dataset B.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">set_value_column_b</span></code> - Type: any. Optional. Instead of computing the feature
+selection for dataset B, use the given value for every row in the output
+column. This does not affect dataset A.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">checkpoint</span></code> - Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional. If set to true, checkpoint the
+dataset in Spark before computing the feature selection. This can reduce some
+resource usage for very complex workflows, but should not be necessary.</p></li>
+</ul>
 <section id="bigrams">
 <h2>bigrams<a class="headerlink" href="#bigrams" title="Link to this heading"></a></h2>
 <p>Split the given string column into <a class="reference external" href="https://en.wikipedia.org/wiki/Bigram">bigrams</a>.</p>

docs/objects.inv

4 Bytes
Binary file not shown.

docs/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

hlink/linking/core/transforms.py

Lines changed: 13 additions & 8 deletions
@@ -26,7 +26,7 @@
 )
 from pyspark.sql.types import ArrayType, LongType, StringType
 from pyspark.ml import Pipeline
-from pyspark.sql import DataFrame, SparkSession, Window
+from pyspark.sql import Column, DataFrame, SparkSession, Window
 from pyspark.ml.feature import NGram, RegexTokenizer, CountVectorizer, MinHashLSH
 
 
@@ -402,13 +402,18 @@ def get_transforms(name: str, is_a: bool) -> list[dict[str, Any]]:
 
 
 # These apply to the column mappings in the current config
-def apply_transform(column_select, transform, is_a):
-    """Given a dataframe select string return a new string having applied the given transform.
-    column_select: A PySpark column type
-    transform: The transform info from the current config
-    is_a: Is running on dataset 'a' or 'b ?
-
-    See the json_schema config file in config_schemas/config.json for definitions on each transform type.
+def apply_transform(
+    column_select: Column, transform: dict[str, Any], is_a: bool
+) -> Column:
+    """Return a new column that is the result of applying the given transform
+    to the given input column (column_select). The is_a parameter controls the
+    behavior of the transforms like "add_to_a" which act differently on
+    datasets A and B.
+
+    Args:
+        column_select: a PySpark Column
+        transform: the transform to apply
+        is_a: whether this is dataset A (True) or B (False)
     """
     transform_type = transform["type"]
     if transform_type == "add_to_a":
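
The `is_a` dispatch described in the new docstring can be pictured with a small stand-in that works on plain Python integers rather than PySpark Columns. Only the `"add_to_a"` type name and the `transform["type"]` lookup come from the diff; the `value` key, the arithmetic, and the function name `apply_transform_sketch` are assumptions made for illustration, not hlink's implementation.

```python
from typing import Any


def apply_transform_sketch(value: int, transform: dict[str, Any], is_a: bool) -> int:
    """Stand-in for apply_transform that uses ints in place of Columns."""
    transform_type = transform["type"]
    if transform_type == "add_to_a":
        # Assumed behavior: only dataset A is shifted; dataset B passes through.
        return value + transform["value"] if is_a else value
    raise ValueError(f"Unknown transform type: {transform_type}")


# Hypothetical transform: add 10 to the column, but only for dataset A.
transform = {"type": "add_to_a", "value": 10}
print(apply_transform_sketch(30, transform, is_a=True))
print(apply_transform_sketch(30, transform, is_a=False))
```

Because PySpark Columns overload `+`, the same branching shape would carry over to a real `Column` input, which is why the type hints in the diff take and return `Column`.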

hlink/tests/core/comparison_feature_test.py

Lines changed: 80 additions & 0 deletions
@@ -2,10 +2,12 @@
 # For copyright and licensing information, see the NOTICE and LICENSE files
 # in this project's top-level directory, and also on-line at:
 # https://github.com/ipums/hlink
+import pytest
 
 import hlink.linking.core.comparison_feature as comparison_feature_core
 import hlink.linking.core.pipeline as pipeline_core
 from pyspark.ml import Pipeline
+from pyspark.sql import Row
 
 
 def test_rel_jaro_winkler_comparison(spark, conf, datasource_rel_jw_input):
@@ -374,3 +376,81 @@ def test_multi_jaro_winkler_search_column_templating():
     assert "static_column" in sql_expr
     assert "static_column1" not in sql_expr
     assert "static_colum1" not in sql_expr
+
+
+def test_b_minus_a_comparison(spark) -> None:
+    comparison_feature = {
+        "alias": "agediff",
+        "column_name": "age",
+        "comparison_type": "b_minus_a",
+    }
+
+    df_a = spark.createDataFrame([[0, 15], [1, 77]], "id:integer, age:integer")
+    df_b = spark.createDataFrame([[100, 15], [101, 70]], "id:integer, age:integer")
+    df_a.write.saveAsTable("table_a")
+    df_b.write.saveAsTable("table_b")
+
+    sql_expr = comparison_feature_core.generate_comparison_feature(
+        comparison_feature, "id", include_as=True
+    )
+
+    result = (
+        spark.sql(
+            f"SELECT a.id AS id_a, b.id AS id_b, {sql_expr} FROM table_a a CROSS JOIN table_b b"
+        )
+        .sort("id_a", "id_b")
+        .collect()
+    )
+
+    assert result == [
+        Row(id_a=0, id_b=100, agediff=0),
+        Row(id_a=0, id_b=101, agediff=55),
+        Row(id_a=1, id_b=100, agediff=-62),
+        Row(id_a=1, id_b=101, agediff=-7),
+    ]
+
+
+def test_b_minus_a_comparison_with_not_equals(spark) -> None:
+    comparison_feature = {
+        "alias": "agediff",
+        "column_name": "age",
+        "comparison_type": "b_minus_a",
+        "not_equals": 99,
+    }
+    df_a = spark.createDataFrame([[0, 15], [1, 77], [2, 99]], "id:integer, age:integer")
+    df_b = spark.createDataFrame(
+        [[100, 15], [101, 70], [102, 99]], "id:integer, age:integer"
+    )
+
+    df_a.write.saveAsTable("table_a")
+    df_b.write.saveAsTable("table_b")
+
+    sql_expr = comparison_feature_core.generate_comparison_feature(
+        comparison_feature, "id", include_as=True
+    )
+
+    result = (
+        spark.sql(
+            f"SELECT a.id AS id_a, b.id AS id_b, {sql_expr} FROM table_a a CROSS JOIN table_b b"
+        )
+        .sort("id_a", "id_b")
+        .collect()
+    )
+
+    assert result == [
+        Row(id_a=0, id_b=100, agediff=0),
+        Row(id_a=0, id_b=101, agediff=55),
+        Row(id_a=0, id_b=102, agediff=-1),
+        Row(id_a=1, id_b=100, agediff=-62),
+        Row(id_a=1, id_b=101, agediff=-7),
+        Row(id_a=1, id_b=102, agediff=-1),
+        Row(id_a=2, id_b=100, agediff=-1),
+        Row(id_a=2, id_b=101, agediff=-1),
+        Row(id_a=2, id_b=102, agediff=-1),
+    ]
+
+
+def test_generate_comparison_feature_error_on_unknown_comparison_type() -> None:
+    comparison_feature = {"comparison_type": "not_supported"}
+    with pytest.raises(ValueError, match="No comparison type"):
+        comparison_feature_core.generate_comparison_feature(comparison_feature, "id")
