Merge pull request #129 from ipums/document_column_mapping_overrides

Document column mapping overrides
ipums · Dec 1, 2023 · e8db991
2 parents 93d33ec + 03a6c0e
Showing 27 changed files with 720 additions and 258 deletions.
diff --git a/..._sources/column_mapping_transforms.md.txt → docs/_sources/column_mappings.md.txt b/..._sources/column_mapping_transforms.md.txt → docs/_sources/column_mappings.md.txt
@@ -1,36 +1,81 @@
-# Column mapping transforms
+# Column Mappings
 
-Each header below represents a column mapping transform type. Transforms are used in the context of the `column_mappings` list.
-Each transform operates on a single input column and outputs a single output column.
+## Basic Usage
 
-Some transforms are suffixed by "a" or "b". These suffixes mean that the transforms apply
-to columns from only one of the two datasets to be linked (dataset A or dataset B).
-Most transforms operate on both dataset A and dataset B independently.
+Each column mapping reads a column from the input datasets into hlink. It has a
+`column_name` attribute which specifies the name of the input column to read in
+from both datasets. Optionally, it may have an `alias` attribute which gives a
+new name to use for the column in hlink.
 
-More than one transform can be applied to a column. Transforms apply in the order that
-they're listed in the configuration file, so the output of one transform may be the input of another.
-Input and output column types are listed in the format "Maps input column type → output column type".
-The letters T and U represent arbitrary column types.
+Column mappings support some *transforms* which make changes to the data as they
+are read in. These changes support data cleaning and harmonization. The available
+column mapping transforms are listed below in the [transforms](#transforms) section.
 
-Each column mapping applies to the column specified by the `column_name` attribute in
-the configuration file under its `[[column_mappings]]` section. The output column
-name is specified by the `alias` attribute, and the `transforms` attribute lists the transforms to apply. Along
-with `type`, which must be one of the names listed below, there may be additional attributes used by a transform.
-These vary by type, and additional information is given for each type of transform in its section
-below. Often attributes are just named `value` or `values` if there is only one attribute expected.
+## Advanced Usage
+
+By default, the input column must have the same name in both input datasets.
+With the `override_column_a` and `override_column_b` attributes, you can
+specify a different name for either dataset A or dataset B. When you do this,
+the `transforms` attribute applies only to the non-override dataset. You can also
+provide an `override_transforms` attribute which applies only to the override
+dataset.
+
+## Transforms
+
+Each section below describes a column mapping transform type. Each transform
+operates on a single input column and outputs a single output column. More than
+one transform may be applied to a column. Transforms apply in the order that
+they are listed in the `transforms` list, so the output of one transform may
+be the input of another. Input and output column types are listed in the format
+"Maps input column type → output column type". The letters T and U represent
+arbitrary column types.
+
+Each transform requires a `type` attribute, which must be one of the names
+listed below. Some transforms may use additional attributes. These vary by
+type, and additional information appears for each type of transform in its
+section below.
+
+Some transforms are suffixed by "a" or "b". These suffixes mean that the
+transforms apply to columns from only one of the two datasets to be linked
+(dataset A or dataset B). Most transforms operate on both dataset A and dataset
+B independently.
+
+For example, if you have two datasets taken 10 years apart, you may want to
+standardize the `age` variable so that it is comparable between the two
+datasets. To do this, you could create a new `age_at_dataset_b` variable by
+reading in the `age` variable from each dataset and then adding 10 to the
+variable from dataset A with the `add_to_a` transform.
 
 ```
-# An example column mappings section
 [[column_mappings]]
-# Name of the output column
-alias = "namefrst_split"
-# Name of the input column
-column_name = "namefrst_clean"
-# List of transforms to apply
-transforms = [{type = "split"}]
+alias = "age_at_dataset_b"
+column_name = "age"
+transforms = [
+    {type = "add_to_a", value = 10}
+]
 ```
 
-## add_to_a
+As another example, suppose that both datasets record each person's first name
+as a string. In dataset A the variable is called `namefrst` and is entirely
+lowercase, but in dataset B it is called `first_name` and is entirely uppercase.
+You could read these two columns into a `namefrst` column in hlink and apply
+the `lowercase_strip` transform to only dataset B with the following configuration section.
+
+```
+[[column_mappings]]
+alias = "namefrst"
+column_name = "namefrst"
+# Read from column first_name in dataset B
+override_column_b = "first_name"
+# Apply these transforms only to dataset B
+override_transforms = [
+    {type = "lowercase_strip"}
+]
+```
+
+
+
+### add_to_a
 
 Add the given `value` to a column from dataset A.
 
@@ -40,7 +85,7 @@ Maps numerical → numerical.
 transforms = [{type = "add_to_a", value = 11}]
 ```
 
-## concat_to_a
+### concat_to_a
 
 Concatenate the string `value` to the end of a column in dataset A.
 
@@ -51,7 +96,7 @@ transforms = [{type = "concat_to_a", value = " "}]
 ```
 
 
-## concat_to_b
+### concat_to_b
 
 Concatenate the string `value` to the end of a column in dataset B.
 
@@ -61,7 +106,7 @@ Maps string → string.
 transforms = [{type = "concat_to_b", value = " "}]
 ```
 
-## concat_two_cols
+### concat_two_cols
 
 Concatenate the values from two columns together as strings. This transform takes
 a `column_to_append` attribute, which specifies the name of the column to concatenate
@@ -81,7 +126,7 @@ transforms = [
 ]
 ```
 
-## lowercase_strip
+### lowercase_strip
 
 Used in name cleaning. Convert alphabetical characters to lower-case and strip white
 space characters from the start and end of the strings in the column.
@@ -92,7 +137,7 @@ Maps string → string.
 transforms = [{type = "lowercase_strip"}]
 ```
 
-## rationalize_name_words
+### rationalize_name_words
 
 Used in name cleaning. Replace the characters `?`, `*`, and `-` with spaces. Since
 people's names in raw census data can contain these characters, replacing these characters
@@ -105,7 +150,7 @@ transforms = [{type = "rationalize_name_words"}]
 ```
 
 
-## remove_qmark_hyphen
+### remove_qmark_hyphen
 
 Used in name cleaning. Remove the characters `?` and `-` from strings in the column.
 
@@ -115,7 +160,7 @@ Maps string → string.
 transforms = [{type = "remove_qmark_hyphen"}]
 ```
 
-## remove_punctuation
+### remove_punctuation
 
 Remove most punctuation from strings in the column. This transform removes these characters:
 `? - \ / " ' : , . [ ] { }`.
@@ -126,7 +171,7 @@ Maps string → string.
 transforms = [{type = "remove_punctuation"}]
 ```
 
-## replace_apostrophe
+### replace_apostrophe
 
 Used in name cleaning. Replace each apostrophe `'` with a space.
 
@@ -137,7 +182,7 @@ transforms = [{type = "replace_apostrophe"}]
 
 ```
 
-## remove_alternate_names
+### remove_alternate_names
 
 Used in name cleaning. If a string in the column contains the string ` or ` ("or" surrounded by spaces),
 then remove the ` or ` and all following characters.
@@ -148,7 +193,7 @@ Maps string → string.
 transforms = [{type = "remove_alternate_names"}]
 ```
 
-## remove_suffixes
+### remove_suffixes
 
 Used in name cleaning. Given a list of suffixes, remove them from the strings in the column.
 
@@ -163,7 +208,7 @@ transforms = [
 ]
 ```
 
-## remove_stop_words
+### remove_stop_words
 
 Used in name cleaning. Remove last words from names such as street names.
 
@@ -178,7 +223,7 @@ transforms = [
 ]
 ```
 
-## remove_prefixes
+### remove_prefixes
 
 Used in name cleaning. Remove prefixes like "Ms.", "Mr.", or "Mrs." from names.
 
@@ -189,7 +234,7 @@ Maps string → string.
 transforms = [{type = "remove_prefixes", values = ["ah"]}]
 ```
 
-## condense_strip_whitespace
+### condense_strip_whitespace
 
 Used in name cleaning. Take white space that may be more than one character or contain
 non-space characters and replace it with a single space.
@@ -200,7 +245,7 @@ Maps string → string.
 transforms = [{type = "condense_strip_whitespace"}]
 ```
 
-## remove_one_letter_names
+### remove_one_letter_names
 
 Used in name cleaning. If a name is a single character, remove it and leave the white space behind.
 
@@ -210,7 +255,7 @@ Maps string → string.
 transforms = [{type = "remove_one_letter_names"}]
 ```
 
-## split
+### split
 
 Split the column value on space characters.
 
@@ -223,7 +268,7 @@ column_name = "namefrst_clean"
 transforms = [{type = "split"}]
 ```
 
-## array_index
+### array_index
 
 If the column contains an array, select the element at the given position.
 
@@ -241,7 +286,7 @@ transforms = [
 ]
 ```
 
-## mapping
+### mapping
 
 Map single or multiple values to a single output value, otherwise known as a "recoding."
 
@@ -262,7 +307,7 @@ transforms = [
 ]
 ```
 
-## substring
+### substring
 
 Replace a column with a substring of the data in the column.
 
@@ -274,7 +319,7 @@ transforms = [
 ]
  ```
 
-## divide_by_int
+### divide_by_int
 
 Divide data in a column by an integer value. It may leave a non-integer result.
 
@@ -296,7 +341,7 @@ transforms = [
 ```
 
 
-## when_value
+### when_value
 
 Apply conditional logic to replacement of values in a column. Works like the SQL `if()` or `case()` expressions in the SQL `select` clause.
 When the value of a column is `value` replace it with `if_value`. Otherwise replace it with `else_value`.
@@ -313,7 +358,7 @@ transforms = [
 ```
 
 
-## get_floor
+### get_floor
 
 Round down to the nearest whole number.
 

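Review note: the override rules documented in this file can be modeled in a few lines of Python. This is a minimal sketch for illustration, not hlink's implementation. The `read_column` and `apply_transforms` helpers and the plain-dict rows are assumptions made for the example, and only two of the documented transforms are imitated.

```python
# Illustrative model only: NOT hlink's implementation.
# It imitates two documented transforms on plain Python strings.

def lowercase_strip(value: str) -> str:
    """Lowercase the string and strip leading/trailing whitespace."""
    return value.lower().strip()

def remove_qmark_hyphen(value: str) -> str:
    """Remove the characters '?' and '-'."""
    return value.replace("?", "").replace("-", "")

TRANSFORMS = {
    "lowercase_strip": lowercase_strip,
    "remove_qmark_hyphen": remove_qmark_hyphen,
}

def apply_transforms(value, transforms):
    """Apply transforms in list order; each output feeds the next transform."""
    for transform in transforms:
        value = TRANSFORMS[transform["type"]](value)
    return value

def read_column(mapping, dataset, row):
    """Hypothetical helper: pick the source column and transform list for dataset 'a' or 'b'."""
    override = mapping.get(f"override_column_{dataset}")
    if override is not None:
        # Override dataset: read the override column, use override_transforms.
        source, transforms = override, mapping.get("override_transforms", [])
    else:
        # Non-override dataset: read column_name, use transforms.
        source, transforms = mapping["column_name"], mapping.get("transforms", [])
    return apply_transforms(row[source], transforms)

# The first-name example from the docs, written as a dict.
mapping = {
    "alias": "namefrst",
    "column_name": "namefrst",
    "override_column_b": "first_name",
    "override_transforms": [{"type": "lowercase_strip"}],
}

value_a = read_column(mapping, "a", {"namefrst": "mary"})
value_b = read_column(mapping, "b", {"first_name": "  MARY "})
print(value_a, value_b)
```

In this model, dataset A is read from `namefrst` with no transforms and dataset B is read from `first_name` with `lowercase_strip`, so both values come out comparable.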
diff --git a/docs/_sources/config.md.txt b/docs/_sources/config.md.txt
@@ -1,20 +1,20 @@
 # Configuration
 1. [Basic Example Config File](#basic-config-file)
 2. [Advanced Example Config File](#advanced-config-file)
-3. [Top level configs](#top-level-configs)
-4. [Data sources](#data-sources)
+3. [Top-Level Configs](#top-level-configs)
+4. [Data Sources](#data-sources)
 5. [Filter](#filter)
-6. [Column mappings](#column-mappings)
-7. [Substitution columns](#substitution-columns)
-8. [Feature selections](#feature-selections)
-9. [Potential matches universe](#potential-matches-universe)
+6. [Column Mappings](#column-mappings)
+7. [Substitution Columns](#substitution-columns)
+8. [Feature Selections](#feature-selections)
+9. [Potential Matches Universe](#potential-matches-universe)
 10. [Blocking](#blocking)
 11. [Comparisons](#comparisons)
-12. [Household comparisons](#household-comparisons)
-13. [Comparison features](#comparison-features)
-14. [Pipeline-generated features](#pipeline-generated-features)
-15. [Training and models](#training-and-models)
-16. [Household training and models](#household-training-and-models)
+12. [Household Comparisons](#household-comparisons)
+13. [Comparison Features](#comparison-features)
+14. [Pipeline-Generated Features](#pipeline-generated-features)
+15. [Training and Models](#training-and-models)
+16. [Household Training and Models](#household-training-and-models)
 
 ## Basic Config File
 
@@ -438,16 +438,34 @@ datasource = "b"
 ```
 
 
-## [Column Mappings](column_mapping_transforms)
+## [Column Mappings](column_mappings)
 
 * Header name: `column_mappings`
-* Description: Base column mappings and transformations to extract from your input datasets.
+* Description: Base column mappings and transformations to extract from your
+  input datasets. Each column mapping requires a `column_name` which tells it
+  which input column to read from. Optionally, you may provide an `alias` for
+  the column and `transforms` to modify it as it is read in. There are some additional
+  attributes listed below that are meant for advanced usage. These are described
+  in more detail on the [column mappings](column_mappings) page.
 * Required: True
 * Type: List
 * Attributes:
-  * `alias` -- Type: `string`. Optional; if not specified the new column name defaults to `column_name`. New name of column.
-  * `column_name` -- Type: `string`. Name of column in input data. Used as the name of the output column if `alias` is not specified.
-  * `transforms` -- Type: `List`. Optional. A list of transforms to apply, in order, to the input data. See the [column mapping transforms](column_mapping_transforms) section for more information.
+  * `column_name` -- Type: `string`. The name of the column in the input data.
+  * `alias` -- Type: `string`. Optional. The new name of the column to use
+    in hlink. By default, this is the same as `column_name`.
+  * `transforms` -- Type: `List`. Optional. A list of transforms to apply, in
+    order, to the input data. See the [column mapping transforms](column_mappings.html#transforms)
+    section for more information.
+  * `set_value_column_a` -- Type: `Any`. Optional. Set all records for dataset
+    A to the given literal value.
+  * `set_value_column_b` -- Type: `Any`. Optional. Set all records for dataset
+    B to the given literal value.
+  * `override_column_a` -- Type: `string`. Optional. Read from this column in
+    dataset A instead of the column specified with `column_name`.
+  * `override_column_b` -- Type: `string`. Optional. Read from this column in
+    dataset B instead of the column specified with `column_name`.
+  * `override_transforms` -- Type: `List`. Optional. Transforms to apply to the
+    override column specified with `override_column_a` or `override_column_b`.
 
 ```
 [[column_mappings]]

diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt
@@ -23,7 +23,7 @@ Configuration API
    :maxdepth: 2
    :caption: Configuration API
 
-   Column Mapping <column_mapping_transforms.md>
+   Column Mappings <column_mappings.md>
    Comparison Types <comparison_types.md>
    Feature Selection <feature_selection_transforms.md>
    Pipeline Features <pipeline_features.md>

diff --git a/docs/_sources/link_tasks.md.txt b/docs/_sources/link_tasks.md.txt
@@ -17,7 +17,7 @@ datasets.
 ### Related Configuration Sections
 
 * The [`datasource_a` and `datasource_b`](config.html#data-sources) sections specify where to find the input data.
-* [```column_mappings```](column_mapping_transforms.html#column-mapping-transforms),
+* [`column_mappings`](column_mappings.html#column-mappings),
 [`feature_selections`](feature_selection_transforms.html#feature-selection-transforms),
 and [`substitution_columns`](substitutions.html#substitutions) may all be used to define transformations on the input data.
 * The [`filter`](config.html#filter) section may be used to filter some records out of the input data