Feature: Allow a `colspec_file` config with column info for `fixedwidth` inputs #139

johncmerfeld · 2024-11-26T19:35:41Z

For a bundle like STAAR Summative that takes multiple versions of fixed-width inputs with hundreds of columns each, specifying columns and colspecs within earthmover.yaml gets unwieldy. @jalvord1 suggested reading directly from the colspec files we already have. You can see a usage example here -- the alternative is 3,000+ lines in that file, which itself is just one of four STAAR Summative formats 😅

Happy to restructure the code to be more in line with what's preferred.

Update - I have removed support for manually specifying colspecs. I agree with the principle that we should support only one way of configuring FWFs. FWF support in Earthmover is still in its infancy, and because the project is still in 0.x, this would not violate any versioning assumptions. The removal is not critical to the rest of the PR, so we should continue discussing.

1/9 update - I have restored support for manually specifying columns/colspecs but it is not part of the official documentation around fixed-width files. If the user fails to provide a colspec_file or columns in their earthmover config, there is an error message that informs them of their options.
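To make the fallback behavior in the 1/9 update concrete, here is a minimal sketch of the check it describes. The function name `choose_fwf_mode`, the returned labels, and the error text are hypothetical; the actual earthmover message and control flow may differ.

```python
def choose_fwf_mode(source_config):
    """Decide how column info is supplied for a fixedwidth source.

    Hypothetical helper: prefers `colspec_file`, falls back to inline
    `columns`, and raises when neither is present.
    """
    if source_config.get("colspec_file"):
        return "colspec_file"
    if source_config.get("columns"):
        return "columns"
    # Illustrative wording only, not the real earthmover error message.
    raise ValueError(
        "A fixedwidth source needs either a `colspec_file` or inline `columns`."
    )


mode_from_file = choose_fwf_mode({"colspec_file": "colspecs/example.csv"})
mode_inline = choose_fwf_mode({"columns": ["a", "b"], "colspecs": [[0, 4], [4, 6]]})
```

Checking `colspec_file` first reflects that it is the documented path, with inline columns as the undocumented fallback.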

tomreitz · 2024-11-26T22:01:58Z

I'd like to propose an alternate solution which allows earthmover to remain agnostic to the formatting of the fixed-width file column definitions:

We add a readlines() global here, which reads an input file path and returns the lines as a list.
The first Jinja parse of earthmover.yml parses the lines as needed to construct valid colspecs, for example:

sources:
  input:
    file: ${INPUT_FILE}
    header_rows: 0
    type: fixedwidth

{%- set fwf_colspec_file = "../../fwf_to_csv_xwalks/staar_summative_fwf_xwalk_${API_YEAR}.csv" %}
    columns:
{%- for line in readlines(fwf_colspec_file ) %}
  {%- if not loop.first %}
  {%- set start_index, end_index, field_length, field_name = line.split(",") %}
      - {{field_name}}
  {%- endif %}
{%- endfor %}

    colspecs:
{%- for line in readlines(fwf_colspec_file ) %}
  {%- if not loop.first %}
  {%- set start_index, end_index, field_length, field_name = line.split(",") %}
      - [{{start_index}}, {{end_index}}]
  {%- endif %}
{%- endfor %}

Curious for @jayckaiser 's thoughts here too.
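For reference, a minimal plain-Python sketch of what the proposed `readlines()` helper and the template's parsing amount to. The helper body, the `parse_colspec_lines` name, and the sample crosswalk rows are assumptions made for illustration, not earthmover code; only the four-field row layout is taken from the template above.

```python
import os
import tempfile


def readlines(path):
    """Return the lines of a text file as a list, without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def parse_colspec_lines(lines):
    """Split crosswalk lines into column names and [start, end] pairs.

    Assumes a header row followed by rows of
    start_index,end_index,field_length,field_name.
    """
    columns, colspecs = [], []
    for line in lines[1:]:  # skip the header row, like `if not loop.first`
        if not line.strip():
            continue  # tolerate blank lines
        start_index, end_index, _field_length, field_name = line.split(",")
        columns.append(field_name.strip())
        colspecs.append([int(start_index), int(end_index)])
    return columns, colspecs


# Hypothetical crosswalk contents, written to a temporary file for the demo.
sample = (
    "start_index,end_index,field_length,field_name\n"
    "0,4,4,administration_date\n"
    "4,6,2,grade_level_tested\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

columns, colspecs = parse_colspec_lines(readlines(sample_path))
os.remove(sample_path)
```

Stripping the line endings in the helper matters here: if the lines kept their newline, the last field of each row (`field_name`) would carry it into the rendered YAML.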

johncmerfeld · 2024-11-27T12:58:37Z

That's an interesting construction. I see the benefit in not committing to a CSV colspec_file and refraining from adding further syntactic sugar.

However, I think those benefits are outweighed by the additional complexity of the config file, which I would argue is already one of the key pain points of using Earthmover. If I had a vote, I'd want to reduce the amount of Jinja the user needs to read and write in order to use a basic feature.

earthmover/nodes/source.py

tomreitz · 2024-12-03T22:02:44Z

Another idea: this use case seems like a good candidate for Jinja includes, for example:

sources:
  input:
    file: ${INPUT_FILE}
    header_rows: 0
    type: fixedwidth
{% include "./colspecs/staar_summative_${API_YEAR}.yaml" %}

where the file ./colspecs/staar_summative_${API_YEAR}.yaml looks like:

    columns:
      - administration_date
      - grade_level_tested
      - esc_region_number
      ...
    colspecs:
      - [0,4]
      - [4,6]
      - [6,8]
      ...

(I tested this and it seems to work nicely.)

jayckaiser · 2024-12-12T15:51:56Z

earthmover/nodes/source.py

@@ -266,13 +265,29 @@ def __get_skiprows(config: 'YamlMapping'):
            _header_rows = config.get('header_rows', 1)
            return int(_header_rows) - 1  # If header_rows = 1, skip none.

+        def __read_fwf(file: str, config: 'YamlMapping'):


We should define any helper methods like these outside the __get_read_lambda() helper.

I understand the purpose of what we're doing here, but it does not spark joy. We are defining our own filespec for documenting FWF headers. There are a couple of improvements I might suggest:

Make the columns of the file name-agnostic. (However, how do we handle that optional column?)

Very clearly and forcefully define the filespec in the README, and tell users explicitly how to use it.

jayckaiser · 2024-12-12T15:53:25Z

earthmover/nodes/source.py

+                    )
+                colnames = file_format.field_name
+                colspecs = list(zip(file_format.start_index, file_format.end_index))
+                return dd.read_fwf(file, colspecs=colspecs, header=config.get('header_rows', "infer"), names=colnames, converters={c:str for c in colnames})


Add a couple of variables here to clean up these read lines. We should technically be using error_handler.assert_get_key() when retrieving variables from the YAML config blocks.

johncmerfeld · 2024-12-16

Could you be more specific about what kind of cleanup you want to see?

I'm happy to use assert_get_key although I notice that none of the other read lambdas use it, and it will make this code more verbose
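One possible reading of the requested cleanup is to compute each argument once and pass them by name. The sketch below builds the keyword arguments for the `read_fwf` call shown in the diff without calling dask, so it runs with the standard library alone. The `build_fwf_kwargs` name and the sample rows are hypothetical; the `field_name`, `start_index`, and `end_index` columns and the `header_rows` default come from the diff.

```python
import csv
import io


def build_fwf_kwargs(spec_rows, config):
    """Collect the keyword arguments passed to read_fwf in the diff above.

    `spec_rows` is a list of dicts with field_name, start_index, end_index.
    """
    names = [row["field_name"] for row in spec_rows]
    colspecs = [(int(row["start_index"]), int(row["end_index"])) for row in spec_rows]
    header = config.get("header_rows", "infer")
    converters = {name: str for name in names}  # read every column as text
    return {
        "colspecs": colspecs,
        "header": header,
        "names": names,
        "converters": converters,
    }


# Hypothetical colspec file contents.
SAMPLE = (
    "start_index,end_index,field_name\n"
    "0,4,administration_date\n"
    "4,6,grade_level_tested\n"
)
spec_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
fwf_kwargs = build_fwf_kwargs(spec_rows, {"header_rows": 0})
```

With this shape the read itself reduces to a single short call along the lines of `dd.read_fwf(file, **fwf_kwargs)`.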

jayckaiser · 2024-12-12T15:54:50Z

earthmover/nodes/source.py

+            colspec_file = config.get('colspec_file')
+            if colspec_file:
+                try:
+                    file_format = pd.read_csv(os.path.join(os.path.dirname(self.config.__file__), colspec_file))


Is this approach for ascertaining filepath directories consistent with the rest of the project? If so, maybe we should move this logic to a helper.

johncmerfeld · 2024-12-16

I'm not sure I can say whether it's consistent as such; it's needed in order to properly find the colspec file when using project composition. We do use the same construction once elsewhere, but that's in a separate class. I've added a comment explaining this, but I think it's too rare a usage to justify a separate function as of now.
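If the path logic were ever moved to a helper, as suggested, it might look like the sketch below. The name `resolve_relative_to_config` and the example paths are hypothetical; the core `os.path.join(os.path.dirname(...), ...)` construction is the one from the diff, with `abspath` and `normpath` added here as an assumption for tidier results.

```python
import os


def resolve_relative_to_config(config_path, relative_path):
    """Resolve `relative_path` against the directory holding the config file.

    An absolute `relative_path` is returned (normalized) unchanged, because
    os.path.join discards earlier components when a later one is absolute.
    """
    config_dir = os.path.dirname(os.path.abspath(config_path))
    return os.path.normpath(os.path.join(config_dir, relative_path))


# Hypothetical layout: a composed project whose config lives in a subfolder.
config_file = os.path.join("packages", "staar", "earthmover.yaml")
resolved = resolve_relative_to_config(config_file, os.path.join("colspecs", "spec.csv"))
```

Resolving against the config file's directory rather than the working directory is what lets a composed project find its own colspec file regardless of where earthmover is invoked.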

tomreitz · 2025-01-21

This looks great, I'm approving. Appreciate your patience with the back-and-forth on this one, @johncmerfeld!

update changelog

98af167

johncmerfeld requested review from tomreitz and jalvord1 November 26, 2024 19:35

johncmerfeld self-assigned this Nov 26, 2024

add note

2c4240e

johncmerfeld commented Dec 3, 2024

make colspec_file read safer

7076b53

jayckaiser reviewed Dec 12, 2024

johncmerfeld added 4 commits December 16, 2024 17:25

add documentation, remove support for user-supplied colspecs

f4bec28

add comment

25136b6

resolve conflict

413826a

fix link to doc

6724ed2

johncmerfeld requested a review from jayckaiser December 16, 2024 23:33

johncmerfeld added 5 commits January 9, 2025 09:44

restore optional columns/colspec functionality

d907ad1

tweak changelog language

7429ef9

fix colspecs

d714b88

fix colspecs

32fb7e3

change language

f2e332e

tomreitz approved these changes Jan 21, 2025

Merge branch 'main' into feature/fwf-colspec-file

b104a2a

tomreitz merged commit a9097ea into main Jan 23, 2025

tomreitz deleted the feature/fwf-colspec-file branch January 23, 2025 18:08
