Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Allow a colspec_file config with column info for fixedwidth inputs #139

Merged
merged 13 commits into from
Jan 23, 2025

Conversation

johncmerfeld
Copy link
Contributor

@johncmerfeld johncmerfeld commented Nov 26, 2024

For a bundle like STAAR Summative that takes multiple versions of fixed-width inputs with hundreds of columns each, specifying columns and colspecs within earthmover.yaml gets unwieldy. @jalvord1 suggested reading directly from the colspec files we already have. You can see a usage example here -- the alternative is 3,000+ lines in that file, which itself is just one of four STAAR Summative formats 😅

Happy to restructure the code to be more in line with what's preferred.

Update - I have removed support for manually specifying colspecs. I agree with the principle that we should only support a single usage of FWFs. Support for FWFs in Earthmover is still in its infancy, and because this project is still in 0.x, this would not violate any versioning assumptions. This removal is not critical to the rest of the PR so we should continue discussing

1/9 update - I have restored support for manually specifying columns/colspecs but it is not part of the official documentation around fixed-width files. If the user fails to provide a colspec_file or columns in their earthmover config, there is an error message that informs them of their options.

@johncmerfeld johncmerfeld self-assigned this Nov 26, 2024
@tomreitz
Copy link
Collaborator

I'd like to propose an alternate solution which allows earthmover to remain agnostic to the formatting of the fixed-width file column definitions:

  1. We add a readlines() global here which reads an input file path and returns the lines as a list
  2. The first Jinja parse of earthmover.yml parses the lines as needed to construct valid colspecs, for example:
sources:
  input:
    file: ${INPUT_FILE}
    header_rows: 0
    type: fixedwidth

{%- set fwf_colspec_file = "../../fwf_to_csv_xwalks/staar_summative_fwf_xwalk_${API_YEAR}.csv" %}
    columns:
{%- for line in readlines(fwf_colspec_file ) %}
  {%- if not loop.first %}
  {%- set start_index, end_index, field_length, field_name = line.split(",") %}
      - {{field_name}}
  {%- endif %}
{%- endfor %}

    colspecs:
{%- for line in readlines(fwf_colspec_file ) %}
  {%- if not loop.first %}
  {%- set start_index, end_index, field_length, field_name = line.split(",") %}
      - [{{start_index}}, {{end_index}}]
  {%- endif %}
{%- endfor %}

Curious for @jayckaiser 's thoughts here too.

@johncmerfeld
Copy link
Contributor Author

That's an interesting construction. I see the benefit in not committing to a CSV colspec_file and refraining from adding further syntactic sugar.

However, I think those benefits are outweighed by the additional complexity of the config file, which I would argue is already one of the key pain points of using Earthmover. If I had a vote, I'd want to reduce the amount of Jinja the user needs to read and write in order to use a basic feature.

@tomreitz
Copy link
Collaborator

tomreitz commented Dec 3, 2024

Another idea; this use-case seems like a good candidate for Jinja includes, for example:

sources:
  input:
    file: ${INPUT_FILE}
    header_rows: 0
    type: fixedwidth
{% include "./colspecs/staar_summative_${API_YEAR}.yaml" %}

where the file ./colspecs/staar_summative_${API_YEAR}.yaml is like

    columns:
      - administration_date
      - grade_level_tested
      - esc_region_number
      ...
    colspecs:
      - [0,4]
      - [4,6]
      - [6,8]
      ...

(I tested this and it seems to work nicely.)

@@ -266,13 +265,29 @@ def __get_skiprows(config: 'YamlMapping'):
_header_rows = config.get('header_rows', 1)
return int(_header_rows) - 1 # If header_rows = 1, skip none.

def __read_fwf(file: str, config: 'YamlMapping'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should define any helper methods like these outside the __get_read_lambda() helper.

I understand the purpose of what we're doing here, but it does not spark joy. We are defining our own filespec for documenting FWF headers. There are a couple of improvements I might suggest:

  • Make the columns of the file name-agnostic. (However, how do we handle that optional column`?)
  • Very clearly and forcefully define the filespec in the README, and tell users explicitly how to use it.

)
colnames = file_format.field_name
colspecs = list(zip(file_format.start_index, file_format.end_index))
return dd.read_fwf(file, colspecs=colspecs, header=config.get('header_rows', "infer"), names=colnames, converters={c:str for c in colnames})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a couple of variables here to clean up these read lines. We should technically be using error_handler.assert_get_key() when retrieving variables from the YAML config blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you be more specific about what kind of cleanup you want to see?

I'm happy to use assert_get_key although I notice that none of the other read lambdas use it, and it will make this code more verbose

colspec_file = config.get('colspec_file')
if colspec_file:
try:
file_format = pd.read_csv(os.path.join(os.path.dirname(self.config.__file__), colspec_file))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this approach for ascertaining filepath directories consistent with the rest of the project? If so, maybe we should move this logic to a helper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I can say whether it's consistent as such; it's needed in order to properly find the colspec file when using project composition. We do use the same construction once elsewhere but that's in a separate class. I've added a comment explaining this but I think it's too rare a usage to justify a separate function as of now

Copy link
Collaborator

@tomreitz tomreitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, I'm approving. Appreciate your patience with the back-and-forth on this one, @johncmerfeld!

@tomreitz tomreitz merged commit a9097ea into main Jan 23, 2025
@tomreitz tomreitz deleted the feature/fwf-colspec-file branch January 23, 2025 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants