-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Allow a colspec_file
config with column info for fixedwidth
inputs
#139
Conversation
I'd like to propose an alternate solution which allows earthmover to remain agnostic to the formatting of the fixed-width file column definitions:
sources:
input:
file: ${INPUT_FILE}
header_rows: 0
type: fixedwidth
{%- set fwf_colspec_file = "../../fwf_to_csv_xwalks/staar_summative_fwf_xwalk_${API_YEAR}.csv" %}
columns:
{%- for line in readlines(fwf_colspec_file ) %}
{%- if not loop.first %}
{%- set start_index, end_index, field_length, field_name = line.split(",") %}
- {{field_name}}
{%- endif %}
{%- endfor %}
colspecs:
{%- for line in readlines(fwf_colspec_file ) %}
{%- if not loop.first %}
{%- set start_index, end_index, field_length, field_name = line.split(",") %}
- [{{start_index}}, {{end_index}}]
{%- endif %}
{%- endfor %} Curious for @jayckaiser 's thoughts here too. |
That's an interesting construction. I see the benefit in not committing to a CSV However, I think those benefits are outweighed by the additional complexity of the config file, which I would argue is already one of the key pain points of using Earthmover. If I had a vote, I'd want to reduce the amount of Jinja the user needs to read and write in order to use a basic feature. |
Another idea; this use-case seems like a good candidate for Jinja includes, for example: sources:
input:
file: ${INPUT_FILE}
header_rows: 0
type: fixedwidth
{% include "./colspecs/staar_summative_${API_YEAR}.yaml" %} where the file columns:
- administration_date
- grade_level_tested
- esc_region_number
...
colspecs:
- [0,4]
- [4,6]
- [6,8]
... (I tested this and it seems to work nicely.) |
earthmover/nodes/source.py
Outdated
@@ -266,13 +265,29 @@ def __get_skiprows(config: 'YamlMapping'): | |||
_header_rows = config.get('header_rows', 1) | |||
return int(_header_rows) - 1 # If header_rows = 1, skip none. | |||
|
|||
def __read_fwf(file: str, config: 'YamlMapping'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should define any helper methods like these outside the __get_read_lambda()
helper.
I understand the purpose of what we're doing here, but it does not spark joy. We are defining our own filespec for documenting FWF headers. There are a couple of improvements I might suggest:
- Make the columns of the file name-agnostic. (However, how do we handle that optional column`?)
- Very clearly and forcefully define the filespec in the README, and tell users explicitly how to use it.
earthmover/nodes/source.py
Outdated
) | ||
colnames = file_format.field_name | ||
colspecs = list(zip(file_format.start_index, file_format.end_index)) | ||
return dd.read_fwf(file, colspecs=colspecs, header=config.get('header_rows', "infer"), names=colnames, converters={c:str for c in colnames}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a couple of variables here to clean up these read lines. We should technically be using error_handler.assert_get_key()
when retrieving variables from the YAML config blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you be more specific about what kind of cleanup you want to see?
I'm happy to use assert_get_key
although I notice that none of the other read lambdas use it, and it will make this code more verbose
earthmover/nodes/source.py
Outdated
colspec_file = config.get('colspec_file') | ||
if colspec_file: | ||
try: | ||
file_format = pd.read_csv(os.path.join(os.path.dirname(self.config.__file__), colspec_file)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this approach for ascertaining filepath directories consistent with the rest of the project? If so, maybe we should move this logic to a helper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I can say whether it's consistent as such; it's needed in order to properly find the colspec file when using project composition. We do use the same construction once elsewhere but that's in a separate class. I've added a comment explaining this but I think it's too rare a usage to justify a separate function as of now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, I'm approving. Appreciate your patience with the back-and-forth on this one, @johncmerfeld!
For a bundle like STAAR Summative that takes multiple versions of fixed-width inputs with hundreds of columns each, specifying
columns
andcolspecs
within earthmover.yaml gets unwieldy. @jalvord1 suggested reading directly from the colspec files we already have. You can see a usage example here -- the alternative is 3,000+ lines in that file, which itself is just one of four STAAR Summative formats 😅Happy to restructure the code to be more in line with what's preferred.
Update - I have removed support for manually specifying colspecs. I agree with the principle that we should only support a single usage of FWFs. Support for FWFs in Earthmover is still in its infancy, and because this project is still in 0.x, this would not violate any versioning assumptions. This removal is not critical to the rest of the PR so we should continue discussing
1/9 update - I have restored support for manually specifying columns/colspecs but it is not part of the official documentation around fixed-width files. If the user fails to provide a
colspec_file
orcolumns
in their earthmover config, there is an error message that informs them of their options.