Merge pull request #129 from ipums/document_column_mapping_overrides
Document column mapping overrides
riley-harper authored Dec 1, 2023
2 parents 93d33ec + 03a6c0e commit e8db991
Showing 27 changed files with 720 additions and 258 deletions.
@@ -1,36 +1,81 @@
# Column mapping transforms
# Column Mappings

Each header below represents a column mapping transform type. Transforms are used in the context of the `column_mappings` list.
Each transform operates on a single input column and outputs a single output column.
## Basic Usage

Some transforms are suffixed by "a" or "b". These suffixes mean that the transforms apply
to columns from only one of the two datasets to be linked (dataset A or dataset B).
Most transforms operate on both dataset A and dataset B independently.
Each column mapping reads a column from the input datasets into hlink. It has a
`column_name` attribute which specifies the name of the input column to read in
from both datasets. Optionally, it may have an `alias` attribute which gives a
new name to use for the column in hlink.

More than one transform can be applied to a column. Transforms apply in the order that
they're listed in the configuration file, so the output of one transform may be the input of another.
Input and output column types are listed in the format "Maps input column type → output column type".
The letters T and U represent arbitrary column types.
Column mappings support some *transforms* which make changes to the data as they
are read in. These changes support data cleaning and harmonization. The available
column mapping transforms are listed below in the [transforms](#transforms) section.

Each column mapping applies to the column specified by the `column_name` attribute in
the configuration file under its `[[column_mappings]]` section. The output column
name is specified by the `alias` attribute, and the `transforms` attribute lists the transforms to apply. Along
with `type`, which must be one of the names listed below, there may be additional attributes used by a transform.
These vary by type, and additional information is given for each type of transform in its section
below. Often attributes are just named `value` or `values` if there is only one attribute expected.
## Advanced Usage

By default, the input column must have the same name in both input datasets.
With the `override_column_a` and `override_column_b` attributes, you can
specify a different name for either dataset A or dataset B. When you do this,
the `transforms` attribute applies only to the non-override dataset. You can also
provide an `override_transforms` attribute which applies only to the override
dataset.
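
As a rough sketch of the override attributes described above, the following
mapping reads a hypothetical `bpl` column from dataset B and a hypothetical
`birth_place` column from dataset A. The column names and the choice of
transforms are assumptions made for illustration; only the attribute names
come from this page.

```
[[column_mappings]]
alias = "birthplace"
# Hypothetical column name, read from dataset B (the non-override dataset)
column_name = "bpl"
# Hypothetical column name, read from dataset A instead of "bpl"
override_column_a = "birth_place"
# Applied only to dataset B, the non-override dataset
transforms = [
{type = "lowercase_strip"}
]
# Applied only to dataset A, the override dataset
override_transforms = [
{type = "lowercase_strip"},
{type = "remove_punctuation"}
]
```

Listing `lowercase_strip` in both lists is deliberate here: because `transforms`
does not apply to the override dataset, any cleaning that both datasets need
has to be repeated in `override_transforms`.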

## Transforms

Each section below describes a column mapping transform type. Each transform
operates on a single input column and outputs a single output column. More than
one transform may be applied to a column. Transforms apply in the order that
they are listed in the `transforms` list, so the output of one transform may
be the input of another. Input and output column types are listed in the format
"Maps input column type → output column type". The letters T and U represent
arbitrary column types.

Each transform requires a `type` attribute, which must be one of the names
listed below. Some transforms may use additional attributes. These vary by
type, and additional information appears for each type of transform in its
section below.
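
The ordering and the `type` attribute described above can be sketched with a
short chain of transforms. The column names `namefrst` and `namefrst_split` are
placeholders chosen for this sketch, and the particular sequence of transforms
is an illustrative assumption rather than a recommended cleaning recipe.

```
[[column_mappings]]
# Placeholder input and output column names
column_name = "namefrst"
alias = "namefrst_split"
# Transforms run top to bottom; each one receives the previous one's output
transforms = [
{type = "lowercase_strip"},
{type = "remove_qmark_hyphen"},
# Some transforms take additional attributes, such as `values` here
{type = "remove_prefixes", values = ["ah"]},
{type = "condense_strip_whitespace"},
{type = "split"}
]
```

Because `split` changes the column from a string into an array, it comes last
in this sketch; the string-cleaning transforms before it all expect string
input.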

Some transforms are suffixed by "a" or "b". These suffixes mean that the
transforms apply to columns from only one of the two datasets to be linked
(dataset A or dataset B). Most transforms operate on both dataset A and dataset
B independently.

For example, if you have two datasets taken 10 years apart, you may want to
standardize the `age` variable so that it is comparable between the two
datasets. To do this, you could create a new `age_at_dataset_b` variable by
reading in the `age` variable from each dataset and then adding 10 to the
variable from dataset A with the `add_to_a` transform.

```
# An example column mappings section
[[column_mappings]]
# Name of the output column
alias = "namefrst_split"
# Name of the input column
column_name = "namefrst_clean"
# List of transforms to apply
transforms = [{type = "split"}]
alias = "age_at_dataset_b"
column_name = "age"
transforms = [
{type = "add_to_a", value = 10}
]
```

## add_to_a
As another example, suppose that both datasets record each person's first name
as a string. In dataset A the variable is called `namefrst` and is entirely
lowercase, but in dataset B it is called `first_name` and is entirely uppercase.
You could read these two columns into a `namefrst` column in hlink and apply
a lowercase transform to only dataset B with the following configuration section.

```
[[column_mappings]]
alias = "namefrst"
column_name = "namefrst"
# Read from column first_name in dataset B
override_column_b = "first_name"
# Apply these transforms only to dataset B
override_transforms = [
{type = "lowercase_strip"}
]
```



### add_to_a

Add the given `value` to a column from dataset A.

@@ -40,7 +85,7 @@ Maps numerical → numerical.
transforms = [{type = "add_to_a", value = 11}]
```

## concat_to_a
### concat_to_a

Concatenate the string `value` to the end of a column in dataset A.

@@ -51,7 +96,7 @@ transforms = [{type = "concat_to_a", value = " "}]
```


## concat_to_b
### concat_to_b

Concatenate the string `value` to the end of a column in dataset B.

@@ -61,7 +106,7 @@ Maps string → string.
transforms = [{type = "concat_to_b", value = " "}]
```

## concat_two_cols
### concat_two_cols

Concatenate the values from two columns together as strings. This transform takes
a `column_to_append` attribute, which specifies the name of the column to concatenate
@@ -81,7 +126,7 @@ transforms = [
]
```

## lowercase_strip
### lowercase_strip

Used in name cleaning. Convert alphabetical characters to lower-case and strip white
space characters from the start and end of the strings in the column.
@@ -92,7 +137,7 @@ Maps string → string.
transforms = [{type = "lowercase_strip"}]
```

## rationalize_name_words
### rationalize_name_words

Used in name cleaning. Replace the characters `?`, `*`, and `-` with spaces. Since
people's names in raw census data can contain these characters, replacing these characters
@@ -105,7 +150,7 @@ transforms = [{type = "rationalize_name_words"}]
```


## remove_qmark_hyphen
### remove_qmark_hyphen

Used in name cleaning. Remove the characters `?` and `-` from strings in the column.

@@ -115,7 +160,7 @@ Maps string → string.
transforms = [{type = "remove_qmark_hyphen"}]
```

## remove_punctuation
### remove_punctuation

Remove most punctuation from strings in the column. This transform removes these characters:
`? - \ / " ' : , . [ ] { }`.
@@ -126,7 +171,7 @@ Maps string → string.
transforms = [{type = "remove_punctuation"}]
```

## replace_apostrophe
### replace_apostrophe

Used in name cleaning. Replace each apostrophe `'` with a space.

@@ -137,7 +182,7 @@ transforms = [{type = "replace_apostrophe"}]

```

## remove_alternate_names
### remove_alternate_names

Used in name cleaning. If a string in the column contains the string ` or ` ("or" surrounded by spaces),
then remove the ` or ` and all following characters.
@@ -148,7 +193,7 @@ Maps string → string.
transforms = [{type = "remove_alternate_names"}]
```

## remove_suffixes
### remove_suffixes

Used in name cleaning. Given a list of suffixes, remove them from the strings in the column.

@@ -163,7 +208,7 @@ transforms = [
]
```

## remove_stop_words
### remove_stop_words

Used in name cleaning. Remove last words from names such as street names.

@@ -178,7 +223,7 @@ transforms = [
]
```

## remove_prefixes
### remove_prefixes

Used in name cleaning. Remove prefixes like "Ms.", "Mr.", or "Mrs." from names.

@@ -189,7 +234,7 @@ Maps string → string.
transforms = [{type = "remove_prefixes", values = ["ah"]}]
```

## condense_strip_whitespace
### condense_strip_whitespace

Used in name cleaning. Replace each run of white space, which may be more than one
character long or contain white space characters other than spaces, with a single space.
@@ -200,7 +245,7 @@ Maps string → string.
transforms = [{type = "condense_strip_whitespace"}]
```

## remove_one_letter_names
### remove_one_letter_names

Used in name cleaning. If a name is a single character, remove it and leave the white space behind.

@@ -210,7 +255,7 @@ Maps string → string.
transforms = [{type = "remove_one_letter_names"}]
```

## split
### split

Split the column value on space characters.

@@ -223,7 +268,7 @@ column_name = "namefrst_clean"
transforms = [{type = "split"}]
```

## array_index
### array_index

If the column contains an array, select the element at the given position.

@@ -241,7 +286,7 @@ transforms = [
]
```

## mapping
### mapping

Map single or multiple values to a single output value, otherwise known as a "recoding."

@@ -262,7 +307,7 @@ transforms = [
]
```

## substring
### substring

Replace a column with a substring of the data in the column.

@@ -274,7 +319,7 @@ transforms = [
]
```

## divide_by_int
### divide_by_int

Divide data in a column by an integer value. It may leave a non-integer result.

@@ -296,7 +341,7 @@ transforms = [
```


## when_value
### when_value

Conditionally replace values in a column. This works like an `if()` function or a `case` expression in a SQL `select` clause.
When the value of a column is `value`, replace it with `if_value`. Otherwise, replace it with `else_value`.
@@ -313,7 +358,7 @@ transforms = [
```


## get_floor
### get_floor

Round down to the nearest whole number.

50 changes: 34 additions & 16 deletions docs/_sources/config.md.txt
@@ -1,20 +1,20 @@
# Configuration
1. [Basic Example Config File](#basic-config-file)
2. [Advanced Example Config File](#advanced-config-file)
3. [Top level configs](#top-level-configs)
4. [Data sources](#data-sources)
3. [Top-Level Configs](#top-level-configs)
4. [Data Sources](#data-sources)
5. [Filter](#filter)
6. [Column mappings](#column-mappings)
7. [Substitution columns](#substitution-columns)
8. [Feature selections](#feature-selections)
9. [Potential matches universe](#potential-matches-universe)
6. [Column Mappings](#column-mappings)
7. [Substitution Columns](#substitution-columns)
8. [Feature Selections](#feature-selections)
9. [Potential Matches Universe](#potential-matches-universe)
10. [Blocking](#blocking)
11. [Comparisons](#comparisons)
12. [Household comparisons](#household-comparisons)
13. [Comparison features](#comparison-features)
14. [Pipeline-generated features](#pipeline-generated-features)
15. [Training and models](#training-and-models)
16. [Household training and models](#household-training-and-models)
12. [Household Comparisons](#household-comparisons)
13. [Comparison Features](#comparison-features)
14. [Pipeline-Generated Features](#pipeline-generated-features)
15. [Training and Models](#training-and-models)
16. [Household Training and Models](#household-training-and-models)

## Basic Config File

@@ -438,16 +438,34 @@ datasource = "b"
```


## [Column Mappings](column_mapping_transforms)
## [Column Mappings](column_mappings)

* Header name: `column_mappings`
* Description: Base column mappings and transformations to extract from your input datasets.
* Description: Base column mappings and transformations to extract from your
input datasets. Each column mapping requires a `column_name` which tells it
which input column to read from. Optionally you may provide an `alias` for
the column and `transforms` to modify it as it is read in. There are some additional
attributes listed below that are meant for advanced usage. These are described
in more detail on the [column mappings](column_mappings) page.
* Required: True
* Type: List
* Attributes:
* `alias` -- Type: `string`. Optional; if not specified the new column name defaults to `column_name`. New name of column.
* `column_name` -- Type: `string`. Name of column in input data. Used as the name of the output column if `alias` is not specified.
* `transforms` -- Type: `List`. Optional. A list of transforms to apply, in order, to the input data. See the [column mapping transforms](column_mapping_transforms) section for more information.
* `column_name` -- Type: `string`. The name of the column in the input data.
* `alias` -- Type: `string`. Optional. The new name of the column to use
in hlink. By default, this is the same as `column_name`.
* `transforms` -- Type: `List`. Optional. A list of transforms to apply, in
order, to the input data. See the [column mapping transforms](column_mappings.html#transforms)
section for more information.
* `set_value_column_a` -- Type: `Any`. Optional. Set all records for dataset
A to the given literal value.
* `set_value_column_b` -- Type: `Any`. Optional. Set all records for dataset
B to the given literal value.
* `override_column_a` -- Type: `string`. Optional. Read from this column in dataset A
instead of the column specified with `column_name`.
* `override_column_b` -- Type: `string`. Optional. Read from this column in dataset B
instead of the column specified with `column_name`.
* `override_transforms` -- Type: `List`. Optional. Transforms to apply to the override
column specified with `override_column_a` or `override_column_b`.

```
[[column_mappings]]
2 changes: 1 addition & 1 deletion docs/_sources/index.rst.txt
@@ -23,7 +23,7 @@ Configuration API
:maxdepth: 2
:caption: Configuration API

Column Mapping <column_mapping_transforms.md>
Column Mappings <column_mappings.md>
Comparison Types <comparison_types.md>
Feature Selection <feature_selection_transforms.md>
Pipeline Features <pipeline_features.md>
2 changes: 1 addition & 1 deletion docs/_sources/link_tasks.md.txt
@@ -17,7 +17,7 @@ datasets.
### Related Configuration Sections

* The [`datasource_a` and `datasource_b`](config.html#data-sources) sections specify where to find the input data.
* [```column_mappings```](column_mapping_transforms.html#column-mapping-transforms),
* [`column_mappings`](column_mappings.html#column-mappings),
[`feature_selections`](feature_selection_transforms.html#feature-selection-transforms),
and [`substitution_columns`](substitutions.html#substitutions) may all be used to define transformations on the input data.
* The [`filter`](config.html#filter) section may be used to filter some records out of the input data