[#176] Remove output_suspicious_TD and "suspicious traininig data" support
ipums · Dec 10, 2024 · b7f821c
1 parent bde173d
commit b7f821c
Showing 9 changed files with 19 additions and 247 deletions.
diff --git a/docs/_sources/config.md.txt b/docs/_sources/config.md.txt
@@ -334,7 +334,6 @@ split_by_id_a = true
 decision = "drop_duplicate_with_threshold_ratio"
 
 n_training_iterations = 2
-output_suspicious_TD = true
 param_grid = true
 model_parameters = [ 
     { type = "random_forest", maxDepth = [7], numTrees = [100], threshold = [0.05, 0.005], threshold_ratio = [1.2, 1.3] },
@@ -361,7 +360,6 @@ split_by_id_a = true
 decision = "drop_duplicate_with_threshold_ratio"
 
 n_training_iterations = 10
-output_suspicious_TD = true
 param_grid = false
 model_parameters = [
     { type = "random_forest", maxDepth = 6, numTrees = 50, threshold = 0.5, threshold_ratio = 1.0 },
@@ -750,7 +748,6 @@ splits = [-1,0,6,11,9999]
   * `n_training_iterations` -- Type: `integer`. Optional; default value is 10. The number of training iterations to use during the `model_exploration` task.
   * `scale_data` -- Type: `boolean`.  Optional. Whether to scale the data as part of the machine learning pipeline.
   * `use_training_data_features` -- Type: `boolean`. Optional. If the identifiers in the training data set are not present in your raw input data, you will need to set this to `true`, or training features will not be able to be generated, giving null column errors.  For example, if the training data set you are using has individuals from 1900 and 1910, but you are about to train a model to score the 1930-1940 potential matches, you need this to be set to `true` or it will fail, since the individual IDs are not present in the 1930 and 1940 raw input data.  If you were about to train a model to score the 1900-1910 potential matches with this same training set, it would be best to set this to `false`, so you can be sure the training features are created from scratch to match your exact current configuration settings, although if you know the features haven't changed, you could set it to `true` to save a small amount of processing time.
-  * `output_suspicious_TD` -- Type: `boolean`.  Optional.  Used in the `model_exploration` link task.  Outputs tables of potential matches that the model repeatedly scores differently than the match value given by the training data.  Helps to identify false positives/false negatives in the training data, as well as areas that need additional training feature coverage in the model, or need increased representation in the training data set.
   * `split_by_id_a` -- Type: `boolean`.  Optional.  Used in the `model_exploration` link task.  When set to true, ensures that all potential matches for a given individual with ID_a are grouped together in the same train-test-split group. For example, if individual histid_a "A304BT" has three potential matches in the training data, one each to histid_b "B200", "C201", and "D425", all of those potential matches would either end up in the "train" split or the "test" split when evaluating the model performance.
   * `feature_importances` -- Type: `boolean`. Optional.  Whether to record
     feature importances or coefficients for the training features when training
@@ -764,7 +761,6 @@ scale_data = false
 dataset = "/path/to/1900_1910_training_data_20191023.csv"
 dependent_var = "match"
 use_training_data_features = false
-output_suspicious_TD = true
 split_by_id_a = true
 
 score_with_model = true
@@ -804,7 +800,6 @@ scale_data = false
 dataset = "/path/to/hh_training_data_1900_1910.csv"
 dependent_var = "match"
 use_training_data_features = false
-output_suspicious_TD = true
 split_by_id_a = true
 score_with_model = true
 feature_importances = true
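
Note on existing configs: the hunks above drop `output_suspicious_TD` from every documented example, so config files written against the older docs may still carry the key. The diff shown here does not say whether hlink now rejects or silently ignores a leftover key, so the following is only a minimal cleanup sketch under that uncertainty. The helper name `strip_removed_key` is hypothetical and not part of hlink. It works line by line and assumes the key always sits on its own line, as it does in the documented examples.

```python
import re

# Matches a TOML assignment of the removed key, allowing leading indentation.
# Assumption: the key appears on its own line, never inside a multi-line string.
_REMOVED_KEY = re.compile(r"^\s*output_suspicious_TD\s*=")


def strip_removed_key(toml_text: str) -> str:
    """Return toml_text with every `output_suspicious_TD = ...` line dropped.

    Hypothetical helper for illustration; it is not part of hlink.
    """
    kept = [
        line
        for line in toml_text.splitlines(keepends=True)
        if not _REMOVED_KEY.match(line)
    ]
    return "".join(kept)


if __name__ == "__main__":
    sample = (
        "[training]\n"
        "n_training_iterations = 2\n"
        "output_suspicious_TD = true\n"
        "param_grid = true\n"
    )
    print(strip_removed_key(sample), end="")
```

Comment lines that only mention the flag, such as the `# VERY IMPORTANT if you want to output FPs/FNs` line removed from `use_examples.md.txt`, are left alone by this sketch and would need to be cleaned up by hand.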

diff --git a/docs/_sources/use_examples.md.txt b/docs/_sources/use_examples.md.txt
@@ -1,6 +1,5 @@
 # Advanced Workflow Examples 
 
-
 ## Export training data after generating features to reuse in different linking years
 
 It is common to have a single training data set that spans two linked years, which is then used to train a model that is applied to a different set of linked years.  For example, we have a training data set that spans linked individuals from the 1900 census to the 1910 census.  We use this training data to predict links in the full count 1900-1910 linking run, but we also use this training data to link year pairs 1910-1920, 1920-1930, and 1930-1940.  
@@ -66,12 +65,9 @@ However, when this training data set is used for other years, the program does n
 
 8) Launch the hlink program using your new config for the new year pair you want to link. Run your link tasks and export relevant data.
 
-## ML model exploration and export of lists of potential false positives/negatives in training data
-`hlink` accepts a matrix of ML models and hyper-parameters to run train/test splits for you, and outputs data you can use to select and tune your models.  You can see example `training` and `hh_training` configuration sections that implement this in the [training](config.html#training-and-models) and [household training](config.html#household-training-and-models) sections of the configuration documentation.
-
-The model exploration link task also allows you to export lists of potential false positives (FPs) and false negatives (FNs) in your training data.  This is calculated when running the train/test splits in the regular model exploration tasks if the `output_suspicious_TD` flag is true.
+## An Example Model Exploration Workflow
 
-### Example model exploration and FP/FN export workflow
+`hlink` accepts a matrix of ML models and hyper-parameters to run train/test splits for you, and outputs data you can use to select and tune your models.  You can see example `training` and `hh_training` configuration sections that implement this in the [training](config.html#training-and-models) and [household training](config.html#household-training-and-models) sections of the configuration documentation.
 
 1) Create a config file that has a `training` and/or `hh_training` section with model parameters to explore. For example:
 
@@ -88,9 +84,6 @@ The model exploration link task also allows you to export lists of potential fal
     # source data years weren't identical to the linked years of your training data.
     use_training_data_features = false
 
-    # VERY IMPORTANT if you want to output FPs/FNs
-    output_suspicious_TD = true
-
     split_by_id_a = true
     score_with_model = true
     feature_importances = false
@@ -127,11 +120,4 @@ The model exploration link task also allows you to export lists of potential fal
     hlink $ csv training_results /my/output/1900_1910_training_results.csv
     ```
 
-5) Export the potential FPs and FNs to csv.  For `training` params, the results will be in the `repeat_FPs` and `repeat_FNs` tables, and for `hh_training` in the `hh_repeat_FPs` and `hh_repeat_FNs` tables.
-
-    ```
-    hlink $ csv repeat_FPs /my/output/1900_1910_potential_FPs.csv
-    hlink $ csv repeat_FNs /my/output/1900_1910_potential_FNs.csv
-    ```
-
-6) Use your preferred methods to analyze the data you've just exported.  Update the `chosen_model` in your configuration, and/or create new versions of your training data following your findings and update the path to the new training data in your configs.
+5) Use your preferred methods to analyze the data you've just exported.  Update the `chosen_model` in your configuration, and/or create new versions of your training data following your findings and update the path to the new training data in your configs.
diff --git a/docs/config.html b/docs/config.html
@@ -367,7 +367,6 @@ <h2>Advanced Config File<a class="headerlink" href="#advanced-config-file" title
 <span class="n">decision</span> <span class="o">=</span> <span class="s2">&quot;drop_duplicate_with_threshold_ratio&quot;</span>
 
 <span class="n">n_training_iterations</span> <span class="o">=</span> <span class="mi">2</span>
-<span class="n">output_suspicious_TD</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">param_grid</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">model_parameters</span> <span class="o">=</span> <span class="p">[</span> 
     <span class="p">{</span> <span class="nb">type</span> <span class="o">=</span> <span class="s2">&quot;random_forest&quot;</span><span class="p">,</span> <span class="n">maxDepth</span> <span class="o">=</span> <span class="p">[</span><span class="mi">7</span><span class="p">],</span> <span class="n">numTrees</span> <span class="o">=</span> <span class="p">[</span><span class="mi">100</span><span class="p">],</span> <span class="n">threshold</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.005</span><span class="p">],</span> <span class="n">threshold_ratio</span> <span class="o">=</span> <span class="p">[</span><span class="mf">1.2</span><span class="p">,</span> <span class="mf">1.3</span><span class="p">]</span> <span class="p">},</span>
@@ -394,7 +393,6 @@ <h2>Advanced Config File<a class="headerlink" href="#advanced-config-file" title
 <span class="n">decision</span> <span class="o">=</span> <span class="s2">&quot;drop_duplicate_with_threshold_ratio&quot;</span>
 
 <span class="n">n_training_iterations</span> <span class="o">=</span> <span class="mi">10</span>
-<span class="n">output_suspicious_TD</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">param_grid</span> <span class="o">=</span> <span class="n">false</span>
 <span class="n">model_parameters</span> <span class="o">=</span> <span class="p">[</span>
     <span class="p">{</span> <span class="nb">type</span> <span class="o">=</span> <span class="s2">&quot;random_forest&quot;</span><span class="p">,</span> <span class="n">maxDepth</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span> <span class="n">numTrees</span> <span class="o">=</span> <span class="mi">50</span><span class="p">,</span> <span class="n">threshold</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">threshold_ratio</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="p">},</span>
@@ -820,7 +818,6 @@ <h2>Training and <a class="reference internal" href="models.html"><span class="d
 <li><p><code class="docutils literal notranslate"><span class="pre">n_training_iterations</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">integer</span></code>. Optional; default value is 10. The number of training iterations to use during the <code class="docutils literal notranslate"><span class="pre">model_exploration</span></code> task.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">scale_data</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>.  Optional. Whether to scale the data as part of the machine learning pipeline.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">use_training_data_features</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional. If the identifiers in the training data set are not present in your raw input data, you will need to set this to <code class="docutils literal notranslate"><span class="pre">true</span></code>, or training features will not be able to be generated, giving null column errors.  For example, if the training data set you are using has individuals from 1900 and 1910, but you are about to train a model to score the 1930-1940 potential matches, you need this to be set to <code class="docutils literal notranslate"><span class="pre">true</span></code> or it will fail, since the individual IDs are not present in the 1930 and 1940 raw input data.  If you were about to train a model to score the 1900-1910 potential matches with this same training set, it would be best to set this to <code class="docutils literal notranslate"><span class="pre">false</span></code>, so you can be sure the training features are created from scratch to match your exact current configuration settings, although if you know the features haven’t changed, you could set it to <code class="docutils literal notranslate"><span class="pre">true</span></code> to save a small amount of processing time.</p></li>
-<li><p><code class="docutils literal notranslate"><span class="pre">output_suspicious_TD</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>.  Optional.  Used in the <code class="docutils literal notranslate"><span class="pre">model_exploration</span></code> link task.  Outputs tables of potential matches that the model repeatedly scores differently than the match value given by the training data.  Helps to identify false positives/false negatives in the training data, as well as areas that need additional training feature coverage in the model, or need increased representation in the training data set.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">split_by_id_a</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>.  Optional.  Used in the <code class="docutils literal notranslate"><span class="pre">model_exploration</span></code> link task.  When set to true, ensures that all potential matches for a given individual with ID_a are grouped together in the same train-test-split group. For example, if individual histid_a “A304BT” has three potential matches in the training data, one each to histid_b “B200”, “C201”, and “D425”, all of those potential matches would either end up in the “train” split or the “test” split when evaluating the model performance.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">feature_importances</span></code> – Type: <code class="docutils literal notranslate"><span class="pre">boolean</span></code>. Optional.  Whether to record
 feature importances or coefficients for the training features when training
@@ -834,7 +831,6 @@ <h2>Training and <a class="reference internal" href="models.html"><span class="d
 <span class="n">dataset</span> <span class="o">=</span> <span class="s2">&quot;/path/to/1900_1910_training_data_20191023.csv&quot;</span>
 <span class="n">dependent_var</span> <span class="o">=</span> <span class="s2">&quot;match&quot;</span>
 <span class="n">use_training_data_features</span> <span class="o">=</span> <span class="n">false</span>
-<span class="n">output_suspicious_TD</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">split_by_id_a</span> <span class="o">=</span> <span class="n">true</span>
 
 <span class="n">score_with_model</span> <span class="o">=</span> <span class="n">true</span>
@@ -878,7 +874,6 @@ <h2>Household training and models<a class="headerlink" href="#household-training
 <span class="n">dataset</span> <span class="o">=</span> <span class="s2">&quot;/path/to/hh_training_data_1900_1910.csv&quot;</span>
 <span class="n">dependent_var</span> <span class="o">=</span> <span class="s2">&quot;match&quot;</span>
 <span class="n">use_training_data_features</span> <span class="o">=</span> <span class="n">false</span>
-<span class="n">output_suspicious_TD</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">split_by_id_a</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">score_with_model</span> <span class="o">=</span> <span class="n">true</span>
 <span class="n">feature_importances</span> <span class="o">=</span> <span class="n">true</span>

diff --git a/docs/index.html b/docs/index.html
@@ -62,7 +62,7 @@ <h1>Welcome to hlink’s documentation!<a class="headerlink" href="#welcome-to-h
 </li>
 <li class="toctree-l1"><a class="reference internal" href="use_examples.html">Advanced Workflows</a><ul>
 <li class="toctree-l2"><a class="reference internal" href="use_examples.html#export-training-data-after-generating-features-to-reuse-in-different-linking-years">Export training data after generating features to reuse in different linking years</a></li>
-<li class="toctree-l2"><a class="reference internal" href="use_examples.html#ml-model-exploration-and-export-of-lists-of-potential-false-positives-negatives-in-training-data">ML model exploration and export of lists of potential false positives/negatives in training data</a></li>
+<li class="toctree-l2"><a class="reference internal" href="use_examples.html#an-example-model-exploration-workflow">An Example Model Exploration Workflow</a></li>
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="config.html">Configuration</a><ul>

diff --git a/docs/searchindex.js b/docs/searchindex.js