27 commits
88a79bd
Add exit code to raise when require_rows fails.
jayckaiser Mar 20, 2025
76e1008
Add best match output in student ID bundle.
jayckaiser Mar 21, 2025
7e88c61
Merge branch 'feature/human_readable_student_id_match_file' of https:…
jayckaiser Mar 21, 2025
82338e9
Make student_id_best_match a mandatory file.
jayckaiser Mar 21, 2025
2418218
Split best_id_match into two nodes and two destinations in attempt to…
jayckaiser Mar 21, 2025
254b246
Move require_rows filter step into last node.
jayckaiser Mar 21, 2025
787c4fa
Rename files and nodes for consistency.
jayckaiser Mar 21, 2025
7ae908c
Move and extend best student ID match information into a header to en…
jayckaiser Mar 24, 2025
13d9c5e
Revert best-match logic to its original location, but keep require_ro…
jayckaiser Mar 24, 2025
d5ad700
Bugfix: revert rename of match_rate column to its original value.
jayckaiser Mar 24, 2025
1f31039
Minor clean-up to header of student best ID match file.
jayckaiser Mar 24, 2025
ea17e56
Minor updates made to best match txt file.
jayckaiser Mar 24, 2025
4a5404b
Attempt to incorporate a no-match column into the source data to forc…
jayckaiser Mar 24, 2025
7b60e57
Bugfix.
jayckaiser Mar 24, 2025
980d5e9
Reconstitute the best ID match into a single file, now that at least …
jayckaiser Mar 24, 2025
a0a2f18
Force the no-match column to an empty string in input base to ensure …
jayckaiser Mar 24, 2025
732edbf
Drop matching metadata columns in the no match file.
jayckaiser Mar 24, 2025
e43cc9b
Default to a full failed match if no match rate supercedes the requir…
jayckaiser Mar 24, 2025
0e983c9
Update best match rate message to include the required match rate.
jayckaiser Mar 24, 2025
0045ebe
Rearrange some of the bundle to re-align with its original formatting.
jayckaiser Mar 24, 2025
4254d7d
More cleanup to align with original bundle.
jayckaiser Mar 24, 2025
068e166
Reorder logic to ensure sorting works as expected.
jayckaiser Mar 25, 2025
de8a687
Keep only a subset of columns from match rate table pulled from Snowf…
jayckaiser Mar 26, 2025
e72d351
Move num_matches type coercion logic to best_id_match operation.
jayckaiser Mar 26, 2025
254dd34
Make the no-match value dynamic; update the best match template to lo…
jayckaiser Mar 28, 2025
3c801af
Update best_id_match.txtt
jayckaiser Mar 31, 2025
c0ce3a4
Update best_id_match.txtt
jayckaiser Jun 10, 2025
33 changes: 33 additions & 0 deletions packages/student_ids/best_id_match.txtt
@@ -0,0 +1,33 @@
Each ID in the source file is compared to the ID types populated in Ed-Fi, and the combination with the highest match rate is selected.
Note that the selected combination can differ by assessment and school year!

Source file ID columns checked: ${POSSIBLE_STUDENT_ID_COLUMNS}
Ed-Fi ID types compared against: ${EDFI_STUDENT_ID_TYPES}


{% if __source_column_name == '${NO_MATCH_VALUE}' -%}

No ID-combination met the required match rate of ${REQUIRED_ID_MATCH_RATE}!
Any ID column in the source file can be updated for the next attempt, but only the column with the highest match will be selected.

{%- else -%}


The CSV file outputted alongside this one contains raw records for students whose IDs could not be matched in this process.
Please use the information below to correct the student IDs in the CSV file before attempting reprocessing.

Best match column in CSV file: {{ __source_column_name }}
Matched ID type in Ed-Fi: {{ __edfi_column_name }}


This information can also be found in Stadium! Run the following query to view the best ID-match for any attempted run:

SELECT *
FROM raw.data_integration.student_id_match_rates
WHERE tenant_code = '${SNOWFLAKE_TENANT_CODE}'
AND api_year = ${SNOWFLAKE_API_YEAR}
AND assessment_name = '${ASSESSMENT_BUNDLE}'
ORDER BY match_rate desc, edfi_column_name desc, source_column_name desc
LIMIT 1;

{%- endif -%}
54 changes: 39 additions & 15 deletions packages/student_ids/earthmover.yaml
@@ -45,6 +45,7 @@ config:
-- and api_year=${SNOWFLAKE_API_YEAR}
-- and assessment='${ASSESSMENT_BUNDLE}'
)
NO_MATCH_VALUE: __no_match__
# and SNOWFLAKE_CONNECTION, SNOWFLAKE_TENANT_CODE, and SNOWFLAKE_API_YEAR (as above)
# ----------------------------------------------------

@@ -108,6 +109,11 @@ sources:
query: >
${MATCH_RATES_SNOWFLAKE_QUERY}
{% endif %}

# Set up a no-match CSV line to add to match-rates table.
no_match:
file: no_match.csv
header_rows: 1


{% set edfi_student_id_types = "${EDFI_STUDENT_ID_TYPES},studentUniqueId".split(",") %}
@@ -226,10 +232,6 @@ transformations:
- operation: modify_columns
columns:
num_matches: "{%raw%}{{value|string}}{%endraw%}"
- operation: sort_rows
columns:
- num_matches
descending: True
- operation: add_columns
columns:
__join_id: "1"
@@ -242,9 +244,6 @@ transformations:
- operation: add_columns
columns:
match_rate: "{%raw%}{{num_matches|float / num_rows|float}}{%endraw%}"
- operation: modify_columns
columns:
num_matches: "{%raw%}{{value|int}}{%endraw%}"
- operation: drop_columns
columns:
- __join_id
@@ -254,11 +253,20 @@ transformations:
student_id_match_rates:
{% if compute_match_rates %}
source: $transformations.id_match_rates
operations: []
{% else %}
source: $sources.student_id_match_rates
operations:
- operation: keep_columns
columns:
- source_column_name
- edfi_column_name
- num_matches
- num_rows
- match_rate
{% endif %}
operations: []

# Filter the match rates to either the best above the threshold, or to none.
best_id_match:
source: $transformations.student_id_match_rates
operations:
@@ -271,6 +279,20 @@ transformations:
- operation: drop_columns
columns:
- meets_filter_criteria
# Union the zero-match-rate values to ensure this table is always populated.
- operation: union
sources:
- $sources.no_match
# Ensure the row selected is the highest match.
- operation: modify_columns
columns:
num_matches: "{%raw%}{{value|int}}{%endraw%}"
- operation: sort_rows
columns:
- num_matches
descending: True
- operation: limit_rows
count: 1
# this should (hopefully) result in zero or one rows
- operation: rename_columns
columns:
@@ -284,13 +306,6 @@ transformations:
- operation: add_columns
columns:
__join_id: "1"
# ensure there's not more than 1 row:
- operation: limit_rows
count: 1
expect:
- __match_rate | float >= ${REQUIRED_ID_MATCH_RATE}
# ensure there's not 0 rows:
require_rows: True

edfi_roster:
source: $transformations.unpacked_edfi_roster
@@ -305,6 +320,7 @@ transformations:
- operation: add_columns
columns:
__join_id: "1"
__no_match__: "${NO_MATCH_VALUE}"
- operation: join
sources:
- $transformations.best_id_match
@@ -334,6 +350,7 @@ transformations:
- operation: add_columns
columns:
__join_id: "1"
__no_match__: ""
- operation: join
sources:
- $transformations.best_id_match
@@ -366,6 +383,7 @@ transformations:
- __num_matches
- __num_rows
- __match_rate
- __no_match__

input_no_student_id_match:
source: $transformations.input_base
@@ -405,6 +423,12 @@ destinations:
extension: csv
linearize: True

student_best_id_match:
source: $transformations.best_id_match
template: ./best_id_match.txtt
extension: txt
linearize: False

{% if compute_match_rates %}
student_id_match_rates:
source: $transformations.id_match_rates
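Note on the reordered steps in best_id_match: the diff moves the int coercion of num_matches so that it runs immediately before sort_rows. A minimal Python sketch of why that ordering could matter, assuming (this is not confirmed by the diff) that string-typed values would otherwise be compared lexicographically. The sample values are invented for illustration only.

```python
# Hypothetical num_matches values, held as strings as they would be
# after the "{{value|string}}" modify_columns step in the diff.
num_matches = ["9", "10", "100"]

# Sorting the strings directly compares them character by character,
# so "9" ranks above "100".
as_strings = sorted(num_matches, reverse=True)

# Coercing to int for the comparison gives the numeric order that
# "highest match" implies.
as_ints = sorted(num_matches, key=int, reverse=True)

print(as_strings)  # ['9', '100', '10']
print(as_ints)     # ['100', '10', '9']
```

If the sort behaves this way, taking the first row after a string sort would select the wrong combination whenever the counts differ in digit length.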
2 changes: 2 additions & 0 deletions packages/student_ids/no_match.csv
@@ -0,0 +1,2 @@
source_column_name,edfi_column_name,num_matches,num_rows,match_rate
__no_match__,__no_match__,0,0,0.0
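
Taken together, the changes replace the old hard failure (expect plus require_rows) with a sentinel row: best_id_match is filtered, unioned with no_match.csv, sorted by num_matches, and limited to one row. The following is a rough Python sketch of that flow, not the earthmover implementation. The function name, the sample column names, and the exact threshold filter (`match_rate >= required_match_rate`) are assumptions, since the filter step itself is collapsed in the diff.

```python
NO_MATCH_VALUE = "__no_match__"  # mirrors the NO_MATCH_VALUE config default

# Same columns and values as the single data row in no_match.csv.
NO_MATCH_ROW = {
    "source_column_name": NO_MATCH_VALUE,
    "edfi_column_name": NO_MATCH_VALUE,
    "num_matches": "0",
    "num_rows": "0",
    "match_rate": "0.0",
}


def best_id_match(match_rates, required_match_rate):
    """Return the best row that meets the threshold, else the sentinel row."""
    # Assumed filter: keep only combinations at or above the required rate.
    candidates = [
        row for row in match_rates
        if float(row["match_rate"]) >= required_match_rate
    ]
    # Union the sentinel so the result is never empty.
    candidates.append(dict(NO_MATCH_ROW))
    # Coerce num_matches to int for the comparison, sort descending,
    # and keep only the first row (the limit_rows step).
    candidates.sort(key=lambda row: int(row["num_matches"]), reverse=True)
    return candidates[0]


# Hypothetical match-rate rows for illustration only.
rates = [
    {"source_column_name": "student_id", "edfi_column_name": "studentUniqueId",
     "num_matches": "95", "num_rows": "100", "match_rate": "0.95"},
    {"source_column_name": "local_id", "edfi_column_name": "Local",
     "num_matches": "9", "num_rows": "100", "match_rate": "0.09"},
]

print(best_id_match(rates, 0.5)["source_column_name"])   # student_id
print(best_id_match(rates, 0.99)["source_column_name"])  # __no_match__
```

Because the sentinel row always survives, the downstream template can branch on `__source_column_name == NO_MATCH_VALUE` instead of the run failing with no output.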